Python's disappearing character trick

08.12.2014 19:42

While working on some improvements to the way Unidecode behaves on "narrow" Unicode Python builds, I stumbled upon an interesting bug in Python 2.7. I noticed it because some unit tests passed only once after saving the .py file, but always failed afterwards. Since the unit tests did not save any state, that behavior appeared very suspicious.

I managed to distill the cause to the following example:

$ python -V
Python 2.7.3
$ cat a.py
from b import foo
foo()
$ cat b.py
def foo():
	s_sp_1 = u'\ud835\udce3'
	print len(s_sp_1)
$ python a.py
2
$ python a.py
1
$ python a.py
1

Notice how the program prints out "2" on the first invocation but "1" on all subsequent calls. You can make it print 2 again by deleting b.pyc.

What is happening here? The Unicode string u'\ud835\udce3' actually contains only one character. However, this character is written as a pair of surrogate characters. The string is equivalent to u'\U0001d4e3'.
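To see why the two spellings name the same character, here is a minimal sketch of the standard UTF-16 surrogate arithmetic. It is written for Python 3 rather than the 2.7 interpreter used above, and combine_surrogates is a hypothetical helper name of my own, not something from Python or Unidecode.

```python
def combine_surrogates(high, low):
    """Return the code point encoded by a UTF-16 surrogate pair.

    Hypothetical helper for illustration only.
    """
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    # Each surrogate carries 10 bits of the code point, offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


code_point = combine_surrogates(0xD835, 0xDCE3)
print(hex(code_point))                     # 0x1d4e3
print(chr(code_point) == '\U0001d4e3')     # True
```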

Surrogate characters are a hack in Unicode that allows implementations that only support 16-bit code units to handle more than 65536 characters. Basically, it introduces a variable-length encoding with 16-bit code units (similar to how UTF-8 encodes a character in a variable number of 8-bit code units). The problem is that for a while, systems with 16-bit code units were assumed to always have one code unit per code point. To muddle things up further, even systems that have 32-bit code units (like most Python builds these days) are supposed to support surrogate pair encoding. This leads to yet another case where very different sequences of code points can produce equivalent Unicode strings (there are quite a few others as well).
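The variable-length nature of both encodings is easy to check by counting code units. This is a small sketch for Python 3; utf16_units and utf8_units are names I made up for the example.

```python
def utf16_units(text):
    """Number of 16-bit code units needed to encode text as UTF-16."""
    return len(text.encode('utf-16-le')) // 2


def utf8_units(text):
    """Number of 8-bit code units needed to encode text as UTF-8."""
    return len(text.encode('utf-8'))


# One ASCII letter, one character from the Basic Multilingual Plane,
# and the character from the example above (outside the BMP).
for ch in ['a', '\u017e', '\U0001d4e3']:
    print(hex(ord(ch)), utf16_units(ch), utf8_units(ch))
```

Only the last character needs two UTF-16 code units, which is exactly the surrogate pair from the example.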

What appears to happen in the code above is that Python re-encodes the Unicode string when writing out the compiled version of the b.py module to b.pyc. When re-encoding, it doesn't use the surrogate pair (which makes sense, since my Python build can handle 32-bit code points just fine). However, that means that when reading the b.pyc file on the second invocation, the length of the string is different.
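The effect can be imitated in Python 3 (3.4 or later, for the surrogatepass error handler on UTF-16). This is only an analogy: it is not the code path Python 2.7 takes when writing b.pyc. It just shows how a trip through an encoding can fold a surrogate pair into a single code point.

```python
# Two separate surrogate code points, as written in b.py.
pair = '\ud835\udce3'

# Encode to UTF-16 bytes, letting the lone surrogates pass through as-is ...
raw = pair.encode('utf-16-le', 'surrogatepass')

# ... and decode again. The decoder sees a valid surrogate pair and
# returns the single code point it stands for.
round_tripped = raw.decode('utf-16-le')

print(len(pair), len(round_tripped))        # 2 1
print(round_tripped == '\U0001d4e3')        # True
```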

This is actually a bug in Python that has been known since 2008. It's fixed in Python 3.2 and above, but seems to remain unfixed in all Python 2 versions I tried. I tend to agree with the Python developers though. It's unusual to have this kind of string written directly in Python source code and even more unusual to notice the change in encoding between runs. However, I might be influenced by the fact that I didn't waste more than a couple of minutes because of this bug.

Posted by Tomaž | Categories: Code
