Unicode transliteration in Python
It's surprising how much software today still doesn't support international characters out of the box. Even in Slovenia it's common to come across a brand new web site where one form or another will not allow Slovenian characters (č, ž and š) to be entered. Recently, when I complained that my name was misspelled in their correspondence, one Slovenian company told me that their "computer" only supports English letters.
While Unicode has been established as a standard way to represent most of the world's writing for more than 15 years now, it still often gets supported as an afterthought, and even then it's usually more of an ugly hack.
However, even if you're writing software from the bottom up with Unicode in mind, you often have to communicate with legacy systems that will only accept 7-bit ASCII. In that case, you want to choose the path of least surprise for the user (no, replacing all characters outside of ASCII with "?" is not it).
transliteration: the act or product of transliterating, or of representing letters or words in the characters of another alphabet or script.
Or, in this particular case, finding the sequence of ASCII characters that is the closest approximation to your Unicode string.
For example, the closest ASCII approximation of the string "Tomaž" is "Tomaz". Some information is lost in this transformation, of course, since several Unicode strings can be transformed into the same ASCII representation, so this is strictly a one-way transformation. However, a human reader will probably still be able to guess from the context which original string was meant.
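To make the difference between the "?" replacement and a closest-match approximation concrete, here is a minimal sketch that uses only the standard library. It assumes Python 3, and strip_accents is a hypothetical helper, not part of any library mentioned here. It only removes combining marks, which is a much cruder technique than table-based transliteration: characters without a Unicode decomposition (Chinese characters, for example) pass through unchanged.

```python
import unicodedata

def strip_accents(text):
    # Hypothetical helper: decompose characters such as "ž" into
    # "z" + combining caron (NFKD), then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

name = "Toma\u017e"  # "Tomaž"

# The surprising option: everything outside ASCII becomes "?".
naive = name.encode("ascii", "replace").decode("ascii")

# The closest-match option for this simple case.
folded = strip_accents(name)

print(naive)   # Toma?
print(folded)  # Tomaz

# The transformation is one-way: two different inputs, same output.
print(strip_accents("Tomaz") == strip_accents(name))  # True
```

The last line shows why the original string cannot be recovered: both "Tomaž" and "Tomaz" end up as the same ASCII string.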
The same transformation can be useful when trying to match a proper Unicode string stored in some database with a string entered by a user with a US keyboard who is too lazy or ignorant to properly spell a foreign word (another one of my pet peeves, by the way).
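The matching idea can be sketched roughly like this, again assuming Python 3 and only the standard library. The names ascii_key, stored_names and find are made up for illustration, and the list stands in for whatever database you actually have. The trick is to apply the same one-way transformation to both the stored strings and the user's input, and compare the results.

```python
import unicodedata

def ascii_key(text):
    # Hypothetical helper: decompose accented letters, throw away
    # everything that is not ASCII, and ignore case.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii").lower()

# Made-up sample data standing in for names stored in a database.
stored_names = ["Toma\u017e", "\u010crt", "Luka"]  # Tomaž, Črt, Luka

# Index the stored names by their ASCII key.
index = {}
for name in stored_names:
    index.setdefault(ascii_key(name), []).append(name)

def find(query):
    # Transform the query the same way before looking it up.
    return index.get(ascii_key(query), [])

matches = find("tomaz")
print(matches == ["Toma\u017e"])  # True
```

Since several stored names can share one key, the index maps each key to a list and leaves it to the caller to pick among the candidates.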
Sean M. Burke wrote Text::Unidecode, a wonderful Perl module that performs the Unicode transliteration I described. However, since I mostly write Python code at work, I kept missing an equivalent library in Python.
So, here's my port of Sean's module to Python. I took the most convenient path and did a pretty much literal translation, with most of the work done by a simple script that converted the character tables from Perl to Python syntax.
Usage is pretty straightforward, and it works even in more complicated cases:
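To give a rough idea of what a table-driven transliterator looks like, here is a toy sketch (Python 3 assumed). It is not the module's actual code: TABLES and transliterate are names invented for this illustration, the handful of entries is hand-picked, and splitting the code point into a block number and an offset within the block is an assumption about the general approach, not a description of the real data files.

```python
# Toy tables for illustration only: block number (code point >> 8)
# mapped to {offset within block: ASCII replacement}.
TABLES = {
    0x01: {0x0C: "C", 0x0D: "c", 0x60: "S", 0x61: "s", 0x7D: "Z", 0x7E: "z"},
    0x4E: {0xB0: "Jing "},
    0x53: {0x17: "Bei "},
}

def transliterate(text):
    out = []
    for ch in text:
        code = ord(ch)
        if code < 0x80:
            # Plain ASCII passes through untouched.
            out.append(ch)
            continue
        block = TABLES.get(code >> 8, {})
        # In this sketch, characters missing from the tables are dropped.
        out.append(block.get(code & 0xFF, ""))
    return "".join(out)

print(transliterate("Toma\u017e"))  # Tomaz
```

All the real work is in the tables. The lookup itself stays trivial, which is what makes a literal, mostly automated conversion of the data practical.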
from unidecode import unidecode
print(unidecode(u"\u5317\u4EB0"))
# That prints: Bei Jing
You can get the source directly from a git repository:
$ git clone http://code.zemanta.com/tsolc/git/unidecode
To install, use the classic Python trick:
$ python setup.py install
$ python setup.py test
Update: The repository above is no longer available. Please see the Unidecode page in the Python Package Index for updated installation instructions and a link to the new git repository.
If you're going to use this for anything mission-critical, the original documentation is still a good read, especially the section about design goals and constraints.
I'd like to look at your code, but not if I have to install Git first. Even if I had Git, I wouldn't want your full repository. Just publish the code itself. How about a web-accessible archive (surely there's a Git equivalent of ViewVC/ViewCVS/WebSVN)? Or a tarball?