Ampersand Ay Em Pee Semicolon

25.07.2010 14:10

Most people don't seem to get content encodings. The most popular way of working with them seems to be randomly throwing encode() and decode() functions at the problem until it compiles/runs/displays in the browser.

I came across the most recent example of this practice the other day at work. The ampersand is a pretty common character in URLs, since it separates parameters in GET requests. One common place to find URLs is in RSS and Atom feeds. Both formats are dialects of XML, where the ampersand is a special character that starts a character or entity reference. Hence it must be escaped as &amp;. See the problem?

It turns out a surprising number of RSS and Atom feeds on the web have URLs that contain obviously incorrectly escaped ampersands. The most common mistake looks like this:

...
<link>http://www.example.com/?a=1&amp;amp;b=2</link>
...

This is valid XML, but the URL is:

http://www.example.com/?a=1&amp;b=2

when it obviously should have been:

http://www.example.com/?a=1&b=2
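My guess is that this happens when a URL gets escaped twice on its way into the feed. Here is a minimal sketch of that in Python, using the standard library's escape() and ElementTree. The link_text() helper is made up for illustration, and I'm only assuming this is how the broken feeds are produced:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

url = "http://www.example.com/?a=1&b=2"

once = escape(url)    # correct: & becomes &amp;
twice = escape(once)  # the bug: the & inside &amp; gets escaped again


def link_text(escaped_url):
    """Parse a <link> element and return its text, as a feed reader would."""
    return ET.fromstring("<link>" + escaped_url + "</link>").text


print(link_text(once))   # the original URL
print(link_text(twice))  # the URL with a literal &amp; left in it
```

Both documents parse without complaint, which is exactly why the mistake goes unnoticed.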

But since the GET parameters that get garbled are in most cases used for tracking people around, nobody ever notices, because the pages still display fine in the browser. The corollary is that nobody ever looks at the results of that tracking, or somebody would surely notice the missing traffic coming from RSS feeds.

So, what is the correct way of dealing with these URLs? In theory, a valid URL could actually contain the sequence of characters "&amp;". So there is no way to know for sure if this is an error on the part of the XML author or not.

In practice, the solution is to just call decode() on the URL until it eats up all the &amp;s. It seems to work every time.
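A rough sketch of that loop in Python might look like the following. The function name and the max_rounds limit are my own inventions, and instead of a general entity decoder it replaces only the literal "&amp;", so that nothing else in the URL gets touched:

```python
def strip_extra_amp_escapes(url, max_rounds=10):
    """Undo repeated &amp; escaping until the URL stops changing.

    max_rounds is an arbitrary safety limit, not something the
    problem itself requires.
    """
    for _ in range(max_rounds):
        decoded = url.replace("&amp;", "&")
        if decoded == url:
            break
        url = decoded
    return url


print(strip_extra_amp_escapes("http://www.example.com/?a=1&amp;amp;b=2"))
```

Note that this would also mangle a URL that legitimately contains "&amp;", which is the ambiguity mentioned above.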

But are there sites out there that legitimately use such URLs? If there are, they seem to be ignored by Google. An "inurl:&amp;" query returns only sites that have an ampersand look-alike character in their URLs (U+FF06, "fullwidth ampersand"). So the query obviously runs against Google's index, but pages with such URLs are either missing or filtered out.

As a side note, the URL http://en.wikipedia.org/wiki/%26amp%3B actually always resolves to a server error page.
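In case the path in that URL looks cryptic: it is just the percent-encoded form of "&amp;". A quick check with Python's urllib.parse (the variable names here are only for illustration):

```python
from urllib.parse import quote, unquote

encoded_title = "%26amp%3B"

# Percent-decoding the path segment gives back the five characters &amp;
print(unquote(encoded_title))

# Going the other way, with no characters treated as safe
print(quote("&amp;", safe=""))
```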

Posted by Tomaž | Categories: Code
