Amazing adventures of Py_UNICODE in Pythonland

05.10.2009 20:08

Today I ported some Python modules written in C from Python 2.4 to 2.5 and got stuck when compilation failed on the translation between Py_UNICODE and standard C wide strings. By now it is probably obvious that I won't let a piece of software out of my hands until it handles Unicode properly, even when I have to dig out dirty implementation details. So here are all the skeletons I dug out of Python's back yard.

Regarding Unicode support, there are actually two versions of Python: one stores strings encoded in UCS-2 (a string is serialized into a sequence of 16-bit words, one word per character) while the other uses UCS-4 (a sequence of 32-bit words). Since these two internal formats are incompatible with each other, modules compiled for one version of Python will not work with the other. The first one is more memory efficient, while the second one allows the use of characters outside of the Unicode Basic Multilingual Plane (i.e. code points above 0xFFFF). I'm not sure why the interpreter keeps supporting both variants, but at least on Linux most distributions seem to stick with the UCS-4 variant. By the way, this is controlled by the --enable-unicode=ucs4 argument to the configure script when compiling Python and is apparent from the value of the Py_UNICODE_SIZE constant.
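A C module that cares about the difference can check this at compile time. A minimal sketch (Py_UNICODE_SIZE is part of the Python C API, the comments are just mine):

#include <Python.h>

#if Py_UNICODE_SIZE == 4
/* UCS-4 build: one 32-bit word per character */
#elif Py_UNICODE_SIZE == 2
/* UCS-2 build: one 16-bit word per character */
#else
#error "unexpected Py_UNICODE_SIZE"
#endif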

This settles the question of the lowest-level representation - the mapping from abstract code points in the Unicode code space to concrete sequences of ones and zeros in the computer's memory. Apart from this UCS-2/UCS-4 distinction, all Python interpreters use the same lowest-level representation of strings, regardless of what is considered the "native" form on the system.

Now comes the question of how this string representation is presented to the C compiler. This is important because it affects the way algorithms are expressed in C (Python and its standard library contain many functions that have to deal with encoded Unicode strings). For example, if a compiler sees a UCS-2 representation as an array of unsigned 16-bit values, C code that deals with characters above 0x8000 will be different than if an array of signed values is used: in the first case these characters will be interpreted as positive values and in the second as negative - which affects comparison operators.
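To see why, here's a small demonstration of my own (not Python code): the same 16-bit pattern lands on opposite sides of a comparison depending on signedness.

#include <stdio.h>

int main(void)
{
    unsigned short u = 0x9000;        /* interpreted as 36864 */
    short s = (short)0x9000;          /* interpreted as -28672 */

    /* After the usual integer promotions, the same bit pattern
       compares differently: */
    printf("%d\n", u > 0x8000);       /* prints 1 */
    printf("%d\n", s > 0x8000);       /* prints 0 */
    return 0;
}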

Pythonistas decided on using unsigned values, which is a sensible decision, since Unicode code points start at 0 and negative values don't make much sense. So, at this point we know that Python uses the UCS-2 or UCS-4 encoding, presented at the C level as an array of unsigned 16-bit or 32-bit values. And this is exactly what the various arrays of type Py_UNICODE* in the Python API refer to.
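For example, here is how a C extension might walk such an array. PyUnicode_AS_UNICODE and PyUnicode_GET_SIZE are real Python 2 API macros; the counting function itself is just a hypothetical illustration:

#include <Python.h>

/* Count characters outside of ASCII in a Python unicode object.
   Because Py_UNICODE is unsigned, the comparison below behaves
   the same on UCS-2 and UCS-4 builds. */
static Py_ssize_t count_non_ascii(PyObject *str)
{
    Py_UNICODE *p = PyUnicode_AS_UNICODE(str);
    Py_ssize_t n = PyUnicode_GET_SIZE(str);
    Py_ssize_t count = 0;

    while (n-- > 0)
        if (*p++ > 0x7f)
            count++;
    return count;
}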

Things get complicated when you want to plug strings from the Python API into code that uses the standard C library. C has its own equivalent of Py_UNICODE* called wchar_t*, which is what various string-handling functions accept and return. The problem is that wchar_t is defined as platform dependent, which translates as "go away - we don't want to deal with all this mess". It can be either signed or unsigned, and the only thing you know about it is that it's at least 8 bits wide.

For example, on this Debian x86 system it's defined as a 32-bit signed value and contains UCS-4 (which also depends on the POSIX locale settings). This appears to be the most common setup on Linux machines I've seen, while I hear Windows uses a 16-bit wchar_t. This ambiguity in C is also the reason why you have to dodge all those gchars, xmlChars and similar types all the time, as authors of other libraries try to work around the problem.
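You can check what your own platform does with a trivial test program (my own sketch):

#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* Converting -1 to an unsigned type wraps around to a large
       positive value, so the comparison below distinguishes the
       two cases. */
    printf("wchar_t is %u bits, %s\n",
           (unsigned)(sizeof(wchar_t) * 8),
           ((wchar_t)-1 < 0) ? "signed" : "unsigned");
    return 0;
}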

So apparently, in this common case, the standard C library and Python share the same basic UCS-4 representation for strings, except that their views of it are slightly different: the C library interprets the values as signed numbers and Python as unsigned. But both write the same sequence of ones and zeros to RAM for the same abstract Unicode string. Obviously, the translation between the two representations is then nothing more complicated than a C pointer type cast.
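In other words, in this lucky case something like the following sketch is enough - but note that it is only valid when the sizes really do match, which is exactly the assumption that breaks down below:

#include <Python.h>
#include <wchar.h>

/* Only correct when sizeof(wchar_t) == Py_UNICODE_SIZE and both
   sides use UCS-4; the signedness difference doesn't change the
   bits in memory. */
wchar_t *as_wide(PyObject *str)
{
    return (wchar_t *)PyUnicode_AS_UNICODE(str);
}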

Python in fact checks whether wchar_t matches its world view and in that case sets Py_UNICODE to be an alias for wchar_t (to "enhance native platform compatibility", as the manual says). It just so happened that Python 2.4 had a slightly broken check that didn't take into account the signedness of the type, so on Python 2.4 Py_UNICODE could be set to wchar_t even when wchar_t was signed.
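Paraphrased - this is the idea behind the headers, not the literal CPython source - the type ends up being chosen roughly like this:

#ifdef HAVE_USABLE_WCHAR_T
/* wchar_t has the right size and (since 2.5) is unsigned */
typedef wchar_t Py_UNICODE;
#elif Py_UNICODE_SIZE == 4
typedef unsigned int Py_UNICODE;
#else
typedef unsigned short Py_UNICODE;
#endif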

However, nothing says that this match of word size and encoding isn't just a lucky coincidence. When C's and Python's ideas do differ, the Python API actually has routines that at first glance appear to lend you a helping hand: PyUnicode_FromWideChar and PyUnicode_AsWideChar. But they are really just waiting to push you off the cliff. Under the hood they basically do this:

Py_UNICODE *u;      /* destination: Python's internal buffer */
wchar_t *w;         /* source: the C wide string */
Py_ssize_t i, size; /* number of elements to copy */

/* Element-wise copy with an implicit integer conversion and
   no transcoding whatsoever: */
for (i = size; i > 0; i--)
    *u++ = *w++;

Given the large range of sizes and encodings that wide C strings can hold, this simple loop can break character encodings in a wonderful number of ways.
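For instance (my own example), copying from a 32-bit wchar_t into a 16-bit Py_UNICODE on a UCS-2 build silently truncates anything outside the BMP:

/* What *u++ = *w++ does on a UCS-2 Python with 32-bit wchar_t: */
unsigned int w = 0x1D11E;             /* MUSICAL SYMBOL G CLEF */
unsigned short u = (unsigned short)w; /* 0xD11E - an unrelated
                                         Hangul syllable */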

Update: Here's the correct way to convert wide strings to a known encoding using iconv().
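A minimal sketch of that approach, assuming glibc, whose iconv() understands the special "WCHAR_T" encoding name for the platform's own wide-character format:

#include <stdio.h>
#include <string.h>
#include <wchar.h>
#include <iconv.h>

int main(void)
{
    wchar_t src[] = L"\u010dne\u017e";  /* arbitrary example text */
    char dst[64];

    /* Convert from the platform's wchar_t format to UTF-8. */
    iconv_t cd = iconv_open("UTF-8", "WCHAR_T");
    if (cd == (iconv_t)-1) {
        perror("iconv_open");
        return 1;
    }

    char *in = (char *)src;
    size_t inleft = wcslen(src) * sizeof(wchar_t);
    char *out = dst;
    size_t outleft = sizeof(dst) - 1;

    if (iconv(cd, &in, &inleft, &out, &outleft) == (size_t)-1) {
        perror("iconv");
        iconv_close(cd);
        return 1;
    }
    *out = '\0';

    printf("%s\n", dst);
    iconv_close(cd);
    return 0;
}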

Posted by Tomaž | Categories: Code
