xmllint and DTD caching

03.12.2010 20:10

Back in 2008 W3C published a blog post complaining about the hundred million requests their servers get per day for HTML Document Type Definitions and other files you usually find referenced at the top of valid web pages. The main point of the post was that these files should be cached, not continuously retrieved by various validation tools from their site, since they never change. I'm sure these two years have added another order of magnitude to their web statistics.

Anyway, some days ago I wanted to validate a large number of XHTML+RDFa documents and I wrote a simple shell script that ran xmllint on each of them. It ran much more slowly than I anticipated and in the end I found out that xmllint requests a fresh copy of 38 DTD files for each document it validates. It does that even if you specify multiple documents on the command line. If the most authoritative XML validation tool I know operates with such blatant disregard for HTTP caching standards, I can see how W3C manages to use hundreds of Mb/s of bandwidth.
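For the record, the XHTML+RDFa documents in question start with a DOCTYPE declaration along these lines; the system identifier at the end is what xmllint fetches, and that DTD in turn pulls in the rest of the entity files:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
    "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">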

The bright side is that there is a standard for specifying local copies of external XML entities. It's called XML Catalog and xmllint honors it.

On Debian systems there's a w3c-dtd-xhtml package that will install a local copy of W3C DTDs. However, it only covers XHTML 1.0 and 1.1, it's on the brink of being orphaned, and its XML cataloging system is so complicated that after an hour I still hadn't found a way to supplement it with the XHTML+RDFa DTDs.

In the end I found a simpler way:

First, make a local copy of the required DTDs. Running xmllint on a sample file while using tcpdump and some grepping to capture GET requests to W3C's servers got me the list of required files. A recursive wget did the rest. In the case of XHTML+RDFa that meant mirroring the contents of http://www.w3.org/MarkUp and http://www.w3.org/TR/ruby (don't ask) to a local directory.
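Something along these lines does the trick, assuming the sample document is called sample.xml (the tcpdump filter, the wget switches and the exact subdirectories here are only one way to go about it; the grep output tells you what actually needs mirroring):

$ sudo tcpdump -A -s 0 'dst host www.w3.org and tcp dst port 80' > dump.txt &
$ xmllint --valid --noout sample.xml
$ grep -o 'GET [^ ]*' dump.txt | sort -u
$ wget --recursive --no-parent --no-host-directories http://www.w3.org/MarkUp/DTD/
$ wget --recursive --no-parent --no-host-directories http://www.w3.org/TR/ruby/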

Then create a catalog.xml with the following:

<?xml version="1.0"?>
<!DOCTYPE catalog PUBLIC "-//OASIS//DTD Entity Resolution XML Catalog V1.0//EN" "http://www.oasis-open.org/committees/entity/release/1.0/catalog.dtd">
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
	<rewriteSystem systemIdStartString="http://www.w3.org/MarkUp" rewritePrefix="file:///path_to_MarkUp/" />
	<rewriteSystem systemIdStartString="http://www.w3.org/TR/ruby" rewritePrefix="file:///path_to_TR/ruby/" />
	<nextCatalog catalog="/etc/xml/catalog" />
</catalog>

This catalog redirects requests for resources under the given URL prefixes to the mirror on the local filesystem. Any requests it doesn't know about are passed on to the system's XML catalog.
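Whether the resolution actually works can be checked with the xmlcatalog tool that ships with libxml2. Asked to resolve one of the system identifiers against the catalog above, it should print a file:// URI pointing into the local mirror instead of the original http:// address:

$ xmlcatalog catalog.xml "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd"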

With this setup in place, xmllint can be run with the following command line:

$ XML_CATALOG_FILES=catalog.xml xmllint --valid --noout --nonet file_to_validate.xml

The --nonet flag is there so that xmllint throws an error if it still needs to fetch something over the internet.
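For the whole batch of documents this boils down to something like the following (a minimal sketch, assuming the files to check all end in .xml and sit under the current directory; xargs hands many files to each xmllint invocation):

$ find . -name '*.xml' -print0 | XML_CATALOG_FILES=catalog.xml xargs -0 xmllint --valid --noout --nonet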

Needless to say, this manual procedure is unreasonably time consuming when all you want is to make sure a bunch of files are valid. It starts to make sense when a local DTD copy reduces the run time from days to minutes. Plus you get the warm feeling that you're not breaking the internet. However, I think there's no excuse that such a broadly used tool as xmllint doesn't have some kind of cache built in, even if it only lasted for the duration of a single xmllint run.

Posted by Tomaž | Categories: Code
