Unicode transliteration in Python

25.01.2009 19:45

It's surprising how much software today still doesn't support international characters out of the box. Even in Slovenia it's common to get to a brand new web site where one form or the other will not allow Slovenian characters (č, ž and š) to be entered. Recently I was told by one Slovenian company that their "computer" only supports English letters when I complained that my name was misspelled on their correspondence.

While Unicode has been established as a standard way to represent most of the world's writing for more than 15 years now, it still often gets supported as an afterthought and even then it's usually more of an ugly hack.

However, even if you're writing software from the bottom up with Unicode in mind you often have to communicate with other legacy systems that will only accept 7-bit ASCII. In that case, you want to choose the path of least surprise for the user (No, replacing all characters outside of ASCII with "?" is not it).
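To make the "?" problem concrete, here is a minimal sketch using only the standard `str.encode` error handlers. It is written for Python 3, which is an assumption on my part since the rest of this post uses Python 2 syntax.

```python
name = "Toma\u017e"  # "Tomaž"

# "replace" substitutes a question mark for each unencodable character.
replaced = name.encode("ascii", "replace")

# "ignore" silently drops the character instead.
ignored = name.encode("ascii", "ignore")

print(replaced)  # b'Toma?'
print(ignored)   # b'Toma'
```

Neither result is something a reader would recognize as the original name, which is the surprise transliteration is meant to avoid.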

transliteration: the act or product of transliterating, or of representing letters or words in the characters of another alphabet or script.

Or, in this particular case, finding the sequence of ASCII characters that is the closest approximation to your Unicode string.

For example, the closest to the string "Tomaž" in ASCII is "Tomaz". Some information is lost in this transformation, of course, since several Unicode strings can be transformed into the same ASCII representation. So this is a strictly one-way transformation. However, a human reader will probably still be able to guess what original string was meant from the context.
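The one-way nature of the mapping can be sketched with a tiny hand-written table. The three-entry `TABLE` and the `to_ascii` helper are hypothetical and only cover the Slovenian letters mentioned above; they are not the tables the module actually uses. The sketch assumes Python 3.

```python
# Hypothetical three-entry table for illustration only; a real
# transliteration library needs far larger tables than this.
TABLE = str.maketrans({"\u010d": "c", "\u0161": "s", "\u017e": "z"})

def to_ascii(text):
    """Replace lowercase č, š and ž with their unaccented ASCII letters."""
    return text.translate(TABLE)

# Two different inputs collapse to the same output, so the original
# string cannot be recovered from the result.
print(to_ascii("Toma\u017e"))  # Tomaz
print(to_ascii("Tomaz"))       # Tomaz
```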

The same transformation can be useful when trying to match a proper Unicode string stored in some database with a string that was entered by a user with a US keyboard who is too lazy or ignorant to properly spell a foreign word (another one of my pet peeves, by the way).
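A rough sketch of that matching idea follows, assuming Python 3. The `fold` helper is a standard-library stand-in that only strips accents and lowercases; it is not the module described in this post, and the `records` list and `lookup` function are made up for illustration. The point is that both the stored string and the user's input pass through the same function before comparison.

```python
import unicodedata

def fold(text):
    """Stand-in for a transliteration function: strip accents, lowercase."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.lower()

# Hypothetical stored records, indexed by their folded form.
records = ["Toma\u017e", "H\u00e4ndel"]  # "Tomaž", "Händel"
index = {fold(name): name for name in records}

def lookup(user_input):
    """Return the stored spelling matching the user's input, or None."""
    return index.get(fold(user_input))

print(lookup("tomaz"))   # Tomaž
print(lookup("Handel"))  # Händel
```

Because the mapping is many-to-one, two distinct records can fold to the same key; a real index would have to keep a list of candidates per key rather than a single value.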

Sean M. Burke wrote Text::Unidecode, a wonderful Perl module that performs the Unicode transliteration I described. However, since I mostly write Python code at work, I kept missing an equivalent library in Python.

So, here's my port of Sean's module to Python. I took the most convenient path and did a pretty much literal translation, with most of the work done by a simple script that converted character tables from Perl to Python syntax.

The use is pretty straightforward and it works even on more complicated cases:

from unidecode import unidecode
print unidecode(u"\u5317\u4EB0")

# That prints: Bei Jing

You can get the source directly from a git repository:

$ git clone http://code.zemanta.com/tsolc/git/unidecode

To install, use the classic Python trick:

$ python setup.py install
$ python setup.py test

If you're going to use this for anything mission-critical, the original documentation is still a good read, especially the section about design goals and constraints.

Posted by Tomaž | Categories: Code

Comments

I'd like to look at your code, but not if I have to install Git first. Even if I had Git, I wouldn't want your full repository. Just publish the code itself. How about a web-accessible archive (surely there's a Git equivalent of ViewVC/ViewCVS/WebSVN)? Or a tarball?

Posted on 21 February 2009 by David Goodger

Hi David

I've made a tarball with the latest version. See

http://code.zemanta.com/tsolc/unidecode

Posted on 1 March 2009 by Tomaž
Posted on 25 May 2009 by Tiberiu Ichim

From the documentation I don't see how you could use unicodedata to do the same thing as unidecode. The "normalize" method will produce a normalized string representation, but that is still unicode, not 7-bit ASCII. Maybe I'm missing something.

The second link suggests simply ignoring or replacing non-ASCII characters with "?", which is exactly the opposite to what I'm trying to do here.

Posted on 25 May 2009 by Tomaž

How about using setuptools and putting it on cheeseshop so I can easy_install it? :)

I'm actually rather in a hurry to use your package, so I'm putting it up already, if you're on cheeseshop, let me know your username there and I'll make you package owner.

Cheers (and thanks for porting it!!)
- Ben

Posted on 17 June 2009 by Ben Bangert

Hey, this is already done in the standard lib! See codecs table at http://docs.python.org/library/codecs.html. And then:
s.encode('unicode_encode')
or
s.decode('unicode_encode')

Posted on 5 July 2009 by Artur

Coming back to this post, based on its recent reddit highlight. The link I gave you in the above comment does exactly what you're trying to achieve: replace characters that are not in the ascii set with their closest match (not the ? character).

Posted on 5 July 2009 by Tiberiu Ichim

Tiberiu, have you actually tried the code in the link you posted?

That code only works for removing some accent marks. Unicode contains much more than accented characters. Unidecode will for example replace Greek and Cyrillic letters with Latin equivalents, while your method simply strips them out.
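The linked recipe is not reproduced here, so the following is only a sketch of the usual accent-stripping idiom, on the assumption that the link does something similar. It is written for Python 3 and the function name is made up. It shows the limitation: accented Latin letters survive, while Greek letters have no ASCII part to fall back on and disappear.

```python
import unicodedata

def strip_accents_to_ascii(text):
    """Decompose, then drop everything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(strip_accents_to_ascii("Toma\u017e"))  # Tomaz

# Greek alpha, beta, gamma: nothing ASCII is left after decomposition.
print(repr(strip_accents_to_ascii("\u03b1\u03b2\u03b3")))  # ''
```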

Posted on 7 July 2009 by Tomaž

@Artur: I can't seem to find it in the standard libs. Where exactly did you find it?

Posted on 26 August 2009 by exhuma

Hi Tomaž!

Thanks for the module, it's quite a useful one.
But, from my point of view, it does not handle Russian Cyrillic the right way sometimes (but others do not work with Cyrillic at all :) -- can I help you to improve it?

With best regards,
Sergei.

P.S. You can drop me a line via Jabber: xmpp://saabeilin@jabber.org

Posted on 30 November 2009 by Sergei Beilin

Sergei, I'm glad you found it useful. Additions and corrections are always welcome, of course.

Posted on 1 December 2009 by Tomaž

Here is a patch with corrected spelling of cyrillic http://dl.dropbox.com/u/502270/0001-corrected-spelling-of-cyrillic.patch

Posted on 12 January 2010 by Ruslan Grokhovetskiy

Thanks Ruslan! I've committed your patch to the git repository and uploaded a new release.

Posted on 12 January 2010 by Tomaž

from unidecode import unidecode
s = ' von Änderbarkeit'
print unidecode(s)
von Anderbarkeit
# good
t = 'Efficient'
print unidecode(t)
Efi!cient
# bad

Posted on 26 February 2010 by ligature

echo 'Efficient' | iconv -t ASCII//translit

Efficient

Posted on 26 February 2010 by ligature

Ligature, you should pass unicode objects to unidecode, not UTF8 encoded strings.

Try using u"" around your strings.
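In Python 3 terms (an assumption, since this thread is about Python 2), the same distinction is the one between `str` and `bytes`. A minimal sketch of why the two are not interchangeable:

```python
text = "\u00c4nderbarkeit"   # "Änderbarkeit" as a text string
raw = text.encode("utf-8")   # the same word as UTF-8 encoded bytes

# The umlaut is one character but two bytes, so code that expects
# characters sees different input when it is handed the raw bytes.
print(len(text))  # 12
print(len(raw))   # 13

# Decode first, then pass the resulting text to the transliteration call.
decoded = raw.decode("utf-8")
print(decoded == text)  # True
```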

Posted on 26 February 2010 by Tomaž

Nice, thank u!!!

Posted on 26 February 2010 by ligature

Hi Tomaž,

thanks for your nice work / porting work! I've looked at your git repo and noticed that it doesn't contain tags matching your tar-ball releases. Would you mind adding the missing tags?

Cheers, Fabian

Posted on 10 March 2010 by Fabian Knittel

Ah, didn't notice "git push" needs a "--tags" parameter to also push tags to the public repo. Thanks for the note Fabian, release tags should be there now.

Posted on 11 March 2010 by Tomaž

Actually, "Änderbarkeit" -> "Anderbarkeit" _is_ bad. It should be transliterated to "Aenderbarkeit".

Posted on 15 March 2010 by hop

Hop, are you sure that "ae" is the correct transliteration for "ä" in all languages? I know it is in German, but I'm sure many other languages also use this symbol. Transliteration is always a compromise.

Posted on 16 March 2010 by Tomaž

1) Don't get me wrong: anyone who is aware of a difference between ripping-out-things-i-don't-understand-cause-they-can't-be-importent and transliteration, I like already. Kudos to you for putting up code!

2) "Änderbarkeit" is a German word, so it should be transliterated by German rules, no?

3) It's hard to compromise in this case, because one alternative leads to incorrect (really wrong, not just unaesthetical) results in and vice versa.

So, on which solution do you decide? Do you count speakers of languages that use the diaeresis only (and rarely) for splitting digraphs instead of essential daily-use characters? If not, German "beats" the other umlaut-using languages 3.5:1…

3) Luckily, that's what locales are for. iconv can do it (for only a few cases):

$ echo täßt | LC_CTYPE=de_AT.UTF-8 iconv -t ascii//translit

taesst

$ echo täßt | LC_CTYPE=en_US.UTF-8 iconv -t ascii//translit

tasst

4) Sometimes it's even harder: neither a nor ae as a transliteration for ä from Finnish leaves you with something pronounceable (to a Finn, anyway).

Posted on 16 March 2010 by hop

I agree with everything you said.

Unidecode puts simplicity and robustness over accuracy. That is stated quite well in the original Perl module's documentation.

It is good for producing file names or identifiers in systems that can't support Unicode, where something remotely readable is way better than a numeric ID.

However, if you need language or context specific transliteration, you need something else. I would even argue that in that case you can't afford not to support true Unicode.

Posted on 16 March 2010 by Tomaž

If you want to apply language-specific transliteration rules in a situation where you know them, you can always transliterate yourself first, using a simple replacement dictionary, and THEN you can use unidecode to transliterate the rest of your text. For example, if you know that your text contains German mixed with some foreign words that may use, say, Polish or Czech accents, or the Cyrillic alphabet, you can apply the ä -> ae, ö -> oe, ü -> ue replacements, and then use unidecode to convert the output into another output using the generalized unidecode transliteration.
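The two-step idea can be sketched like this, assuming Python 3. The `GERMAN` table and the function names are made up for illustration, and the uppercase and ß entries are additions beyond the three replacements listed above. The generic pass defaults to a standard-library accent stripper so the sketch runs on its own; with the module installed, `unidecode.unidecode` can be passed as `generic` instead.

```python
import unicodedata

# German-specific replacements applied before the generic pass.
# Assumes the input uses precomposed characters (NFC form).
GERMAN = {"\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue",
          "\u00c4": "Ae", "\u00d6": "Oe", "\u00dc": "Ue", "\u00df": "ss"}

def strip_accents(text):
    """Stdlib stand-in for the generic pass; it only removes accents."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def transliterate_german(text, generic=strip_accents):
    """Apply the German table first, then the generic fallback."""
    for char, replacement in GERMAN.items():
        text = text.replace(char, replacement)
    return generic(text)

print(transliterate_german("\u00c4nderbarkeit"))               # Aenderbarkeit
print(transliterate_german("H\u00e4ndel and Dvo\u0159\u00e1k"))  # Haendel and Dvorak
```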

I agree with Tomaz that unidecode should be a generalized approach, while specialized approaches can be used in addition.

Locale-dependent transliterations would be a nice extension of unidecode, but I find the module useful even without them.

> "Änderbarkeit" is a german word, so it should be transliterated by german rules, no?

It's hard to draw a hard line. For example the last name of Georg Friedrich Händel is simplified into "Haendel" by the Germans and into "Handel" by the English (the latter is the spelling he used himself when he moved to London). Similarly, for transliterating Chinese or Cyrillic or many other non-Latin alphabets, dozens of conventions exist. Currently, unidecode's purpose is not to become a complex system that supports them all. It's a simple solution.

However, I should add that it might be practical if the module was rewritten as a Python text codec, so that the standard Python methods string.encode() and string.decode() could be used.
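A rough sketch of what such a codec could look like in Python 3, using `codecs.register` from the standard library. The codec name `translit_sketch` and all helper names are made up, and the folding step is a standard-library accent stripper standing in for unidecode, so characters it cannot handle still end up as "?".

```python
import codecs
import unicodedata

def _fold(text):
    """Stand-in transliteration: strip combining accents only."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def _encode(text, errors="strict"):
    # Codec encoders return (bytes, number of input characters consumed).
    return _fold(text).encode("ascii", "replace"), len(text)

def _decode(data, errors="strict"):
    # Decoding is plain ASCII; the transliteration cannot be reversed.
    return bytes(data).decode("ascii", errors), len(data)

def _search(name):
    """Codec search function: answer only for the hypothetical name."""
    if name == "translit_sketch":
        return codecs.CodecInfo(name="translit_sketch",
                                encode=_encode, decode=_decode)
    return None

codecs.register(_search)

print("Toma\u017e".encode("translit_sketch"))  # b'Tomaz'
```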

Adam

Posted on 20 March 2010 by Adam Twardoch

The standard library does all of this already, right?
Try this:
import unicodedata
def strip_accents(text):
    nkfd_form = unicodedata.normalize('NFKD', unicode(text))
    return u"".join([c for c in nkfd_form if not unicodedata.combining(c)])

cheers
Max

Posted on 25 May 2010 by Max

OK, I understand now: this module is translating various characters to longer words. The Greek letters are translated to words like "alpha", "beta", etc. I think the standard library is not doing this. So the module might still be useful for some folks...

Posted on 25 May 2010 by Max
