Unicode transliteration in Python

25.01.2009 19:45

It's surprising how much software today still doesn't support international characters out of the box. Even in Slovenia it's common to come across a brand new web site where one form or another will not allow Slovenian characters (č, ž and š) to be entered. Recently, when I complained that my name was misspelled on their correspondence, one Slovenian company told me that their "computer" only supports English letters.

While Unicode has been established as a standard way to represent most of the world's writing for more than 15 years now, it still often gets supported as an afterthought and even then it's usually more of an ugly hack.

However, even if you're writing software from the ground up with Unicode in mind, you often have to communicate with legacy systems that will only accept 7-bit ASCII. In that case, you want to choose the path of least surprise for the user. (No, replacing all characters outside of ASCII with "?" is not it.)
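To make the "?" problem concrete, here is a minimal sketch (in Python 3 syntax) of what the error handlers built into str.encode() do with a name containing a single non-ASCII letter. The variable names are only for illustration.

```python
# A name with one non-ASCII letter (U+017E, LATIN SMALL LETTER Z WITH CARON).
name = "Toma\u017e"  # "Tomaž"

# "replace" substitutes "?" for every character that ASCII cannot represent.
replaced = name.encode("ascii", "replace")

# "ignore" silently drops those characters instead.
ignored = name.encode("ascii", "ignore")

print(replaced)
print(ignored)
```

Neither result is something the owner of the name would recognize as their own, which is why a closest-match replacement is the less surprising choice.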

transliteration: the act or product of transliterating, or of representing letters or words in the characters of another alphabet or script.

Or, in this particular case, finding the sequence of ASCII characters that is the closest approximation to your Unicode string.

For example, the closest ASCII approximation of the string "Tomaž" is "Tomaz". Some information is lost in this transformation, of course, since several Unicode strings can be transformed into the same ASCII representation. So this is a strictly one-way transformation. However, a human reader will probably still be able to guess from the context what original string was meant.
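The one-way nature of the transformation can be shown with a toy lookup table. This is only a sketch: the TABLE dictionary and the to_ascii() function are hypothetical names made up for this example, and the table covers just the Slovenian letters mentioned above, nothing like a complete character table.

```python
# Hypothetical miniature table covering only the Slovenian letters.
TABLE = {
    "č": "c", "š": "s", "ž": "z",
    "Č": "C", "Š": "S", "Ž": "Z",
}

def to_ascii(text):
    """Keep ASCII characters; map the rest through TABLE, dropping unknown ones."""
    return "".join(ch if ord(ch) < 128 else TABLE.get(ch, "") for ch in text)

print(to_ascii("Tomaž"))

# Two different inputs collapse into the same output, so the original
# spelling cannot be recovered from the result alone.
print(to_ascii("Tomaž") == to_ascii("Tomaz"))
```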

The same transformation can be useful when trying to match a proper Unicode string stored in some database with a string entered by a user with a US keyboard who is too lazy or ignorant to properly spell a foreign word (another one of my pet peeves, by the way).
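A minimal sketch of that kind of matching follows, in Python 3 syntax. The fold() and find_matches() functions and the sample names are made up for illustration. To keep the example self-contained, fold() uses the standard unicodedata module as a stand-in; it only strips accents from Latin letters, so in practice you would swap in a real transliteration function.

```python
import unicodedata

def fold(text):
    """Stand-in ASCII folding: strips accents and lower-cases, nothing more."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii").lower()

def find_matches(query, stored_names):
    """Return the stored names whose folded form equals the folded query."""
    key = fold(query)
    return [name for name in stored_names if fold(name) == key]

# Pretend these came from a database; the query comes from a US keyboard.
names = ["Tomaž", "Tomas", "Čedomir"]
print(find_matches("tomaz", names))
```

Both sides of the comparison go through the same folding step, so the stored data keeps its proper spelling and only the comparison key is approximated.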

Sean M. Burke wrote Text::Unidecode, a wonderful Perl module that performs the Unicode transliteration I described. However, since I mostly write Python code at work, I kept missing an equivalent library in Python.

So, here's my port of Sean's module to Python. I took the most convenient path and did a pretty much literal translation, with most of the work done by a simple script that converted the character tables from Perl to Python syntax.

Usage is pretty straightforward, and it works even on more complicated cases:

from unidecode import unidecode
print(unidecode(u"\u5317\u4EB0"))

# That prints: Bei Jing

You can get the source directly from a git repository:

$ git clone http://code.zemanta.com/tsolc/git/unidecode

To install, use the classic Python trick:

$ python setup.py install
$ python setup.py test

If you're going to use this for anything mission-critical, the original documentation is still a good read, especially the section about design goals and constraints.

Posted by Tomaž | Categories: Code
Comments

I'd like to look at your code, but not if I have to install Git first. Even if I had Git, I wouldn't want your full repository. Just publish the code itself. How about a web-accessible archive (surely there's a Git equivalent of ViewVC/ViewCVS/WebSVN)? Or a tarball?

Posted on 21 February 2009 by David Goodger

Hi David

I've made a tarball with the latest version. See

http://code.zemanta.com/tsolc/unidecode

Posted on 1 March 2009 by Tomaž

Posted on 25 May 2009 by Tiberiu Ichim

From the documentation I don't see how you could use unicodedata to do the same thing as unidecode. The "normalize" method will produce a normalized string representation, but that is still unicode, not 7-bit ASCII. Maybe I'm missing something.

The second link suggests simply ignoring non-ASCII characters or replacing them with "?", which is exactly the opposite of what I'm trying to do here.

Posted on 25 May 2009 by Tomaž

How about using setuptools and putting it on cheeseshop so I can easy_install it? :)

I'm actually rather in a hurry to use your package, so I'm putting it up already, if you're on cheeseshop, let me know your username there and I'll make you package owner.

Cheers (and thanks for porting it!!) - Ben

Posted on 17 June 2009 by Ben Bangert

Hey, this is already done in the standard lib! See codecs table at http://docs.python.org/library/codecs.html. And then: s.encode('unicode_encode') or s.decode('unicode_encode')

Posted on 5 July 2009 by Artur

Coming back to this post, based on its recent reddit highlight. The link I gave you in the above comment does exactly what you're trying to achieve: replace characters that are not in the ascii set with their closest match (not the ? character).

Posted on 5 July 2009 by Tiberiu Ichim

Tiberiu, have you actually tried the code in the link you posted?

That code only works for removing some accent marks. Unicode contains much more than accented characters. Unidecode will for example replace Greek and Cyrillic letters with Latin equivalents, while your method simply strips them out.
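To illustrate the difference, here is a minimal sketch in Python 3 syntax. It assumes the linked recipe boils down to NFKD normalization followed by dropping whatever is not ASCII; that is an assumption, since the link is not reproduced here. The strip_accents() name is made up for this example.

```python
import unicodedata

def strip_accents(text):
    """Decompose characters, then throw away everything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

# An accented Latin letter decomposes into a base letter plus a combining
# mark, so the base letter survives.
latin = strip_accents("Tomaž")

# Cyrillic and Greek letters have no ASCII base letter to fall back on,
# so nothing is left of these words.
cyrillic = strip_accents("Москва")
greek = strip_accents("Αθήνα")

print(repr(latin), repr(cyrillic), repr(greek))
```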

Posted on 7 July 2009 by Tomaž

@Artur: I can't seem to find it in the standard libs. Where exactly did you find it?

Posted on 26 August 2009 by exhuma

Hi Tomaž!

Thanks for the module, it's quite a useful one. However, from my point of view, it sometimes does not handle Russian Cyrillic the right way (though others do not work with Cyrillic at all :). Can I help you improve it?

With best regards, Sergei.

P.S. You can drop me a line via Jabber: xmpp://saabeilin@jabber.org

Posted on 30 November 2009 by Sergei Beilin

Sergei, I'm glad you found it useful. Additions and corrections are always welcome, of course.

Posted on 1 December 2009 by Tomaž

Here is a patch with corrected spelling of Cyrillic: http://dl.dropbox.com/u/502270/0001-corrected-spelling-of-cyrillic.patch

Posted 3 weeks ago by Ruslan Grokhovetskiy

Thanks Ruslan! I've committed your patch to the git repository and uploaded a new release.

Posted 3 weeks ago by Tomaž