Unicode transliteration in Python

25.01.2009 19:45

It's surprising how much software today still doesn't support international characters out of the box. Even in Slovenia it's common to come across a brand new web site where one form or another will not allow Slovenian characters (č, ž and š) to be entered. Recently, when I complained that my name was misspelled on their correspondence, one Slovenian company told me that their "computer" only supports English letters.

While Unicode has been established as a standard way to represent most of the world's writing for more than 15 years now, it still often gets supported as an afterthought, and even then it's usually more of an ugly hack.

However, even if you're writing software from the bottom up with Unicode in mind you often have to communicate with other legacy systems that will only accept 7-bit ASCII. In that case, you want to choose the path of least surprise for the user (No, replacing all characters outside of ASCII with "?" is not it).
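To make the "?" complaint concrete, here is a small sketch of what Python's built-in ASCII error handlers do to the string "Tomaž". It is written in Python 3 syntax, which is an assumption on my part; it only uses the standard `str.encode` method.

```python
# Python 3 sketch: what the built-in ASCII error handlers do to non-ASCII text.
name = "Toma\u017e"  # "Tomaž"

replaced = name.encode("ascii", "replace")  # each non-ASCII character becomes "?"
ignored = name.encode("ascii", "ignore")    # each non-ASCII character is dropped

print(replaced)  # b'Toma?'
print(ignored)   # b'Toma'
```

Neither result is what a reader would expect to see in place of the original name.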

transliteration: the act or product of transliterating, or of representing letters or words in the characters of another alphabet or script.

Or, in this particular case, finding the sequence of ASCII characters that is the closest approximation to your Unicode string.

For example, the closest ASCII approximation of the string "Tomaž" is "Tomaz". Some information is lost in this transformation, of course, since several Unicode strings can be transformed into the same ASCII representation. So this is a strictly one-way transformation. However, a human reader will probably still be able to guess what original string was meant from the context.

The same transformation can be useful when trying to match a proper Unicode string stored in some database with a string entered by a user with a US keyboard who is too lazy or ignorant to properly spell a foreign word (another one of my pet peeves, by the way).
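The matching idea can be sketched roughly like this. The helper names `fold_to_ascii` and `find_matches` are hypothetical, and the folding function is only a standard-library stand-in that strips accents; a real transliterator would cover far more scripts.

```python
import unicodedata

def fold_to_ascii(text):
    """Hypothetical stand-in for a real transliterator: strips accents only."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

def find_matches(query, stored_names):
    """Return stored names whose folded, lowercased form equals the query's."""
    key = fold_to_ascii(query).lower()
    return [name for name in stored_names if fold_to_ascii(name).lower() == key]

stored = ["Toma\u017e", "Tomas", "H\u00e4ndel"]  # "Tomaž", "Tomas", "Händel"
print(find_matches("tomaz", stored))   # matches "Tomaž" only
print(find_matches("Handel", stored))  # matches "Händel" only
```

Both sides of the comparison go through the same folding, so the stored Unicode string stays untouched in the database.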

Sean M. Burke wrote Text::Unidecode, a wonderful Perl module that performs the Unicode transliteration I described. However, since I mostly write Python code at work, I kept missing an equivalent library in Python.

So, here's my port of Sean's module to Python. I took the most convenient path and did a pretty much literal translation, with most of the work done by a simple script that converted the character tables from Perl to Python syntax.

Usage is pretty straightforward, and it works even in more complicated cases:

from unidecode import unidecode
print unidecode(u"\u5317\u4EB0")

# That prints: Bei Jing

You can get the source directly from a git repository:

$ git clone http://code.zemanta.com/tsolc/git/unidecode

To install, use the classic Python trick:

$ python setup.py install
$ python setup.py test

If you're going to use this for anything mission-critical, the original documentation is still a good read, especially the section about design goals and constraints.

Posted by Tomaž | Categories: Code

Comments

I'd like to look at your code, but not if I have to install Git first. Even if I had Git, I wouldn't want your full repository. Just publish the code itself. How about a web-accessible archive (surely there's a Git equivalent of ViewVC/ViewCVS/WebSVN)? Or a tarball?

Hi David

I've made a tarball with the latest version. See

http://code.zemanta.com/tsolc/unidecode

Posted by Tomaž

From the documentation I don't see how you could use unicodedata to do the same thing as unidecode. The "normalize" method will produce a normalized string representation, but that is still unicode, not 7-bit ASCII. Maybe I'm missing something.

The second link suggests simply ignoring or replacing non-ASCII characters with "?", which is exactly the opposite to what I'm trying to do here.

Posted by Tomaž

How about using setuptools and putting it on cheeseshop so I can easy_install it? :)

I'm actually rather in a hurry to use your package, so I'm putting it up already, if you're on cheeseshop, let me know your username there and I'll make you package owner.

Cheers (and thanks for porting it!!)
- Ben

Hey, this is already done in the standard lib! See codecs table at http://docs.python.org/library/codecs.html. And then:
s.encode('unicode_encode')
or
s.decode('unicode_encode')

Posted by Artur

Coming back to this post, based on its recent reddit highlight. The link I gave you in the above comment does exactly what you're trying to achieve: replace characters that are not in the ascii set with their closest match (not the ? character).

Tiberiu, have you actually tried the code in the link you posted?

That code only works for removing some accent marks. Unicode contains much more than accented characters. Unidecode will for example replace Greek and Cyrillic letters with Latin equivalents, while your method simply strips them out.
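A quick sketch of the difference, assuming the linked recipe is the common "normalize, then drop whatever is not ASCII" approach (the function name here is hypothetical, and this is Python 3 syntax):

```python
import unicodedata

def strip_accents_to_ascii(text):
    """Decompose characters, then silently drop everything outside ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(repr(strip_accents_to_ascii("Toma\u017e")))  # 'Tomaz' (accent removed)
# Greek "Ελλάδα" and Cyrillic "Москва" have no ASCII base letters to fall back on:
print(repr(strip_accents_to_ascii("\u0395\u03bb\u03bb\u03ac\u03b4\u03b1")))  # ''
print(repr(strip_accents_to_ascii("\u041c\u043e\u0441\u043a\u0432\u0430")))  # ''
```

Accented Latin letters survive because they decompose into an ASCII letter plus a combining mark; letters from other scripts simply vanish.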

Posted by Tomaž

@Artur: I can't seem to find it in the standard libs. Where exactly did you find it?

Posted by exhuma

Hi Tomaž!

Thanks for the module, it's quite a useful one.
But, from my point of view, it sometimes does not handle Russian Cyrillic the right way (though others do not work with Cyrillic at all :) -- can I help you to improve it?

With best regards,
Sergei.

P.S. You can drop me a line via Jabber: xmpp://saabeilin@jabber.org

Posted by Sergei Beilin

Sergei, I'm glad you found it useful. Additions and corrections are always welcome, of course.

Posted by Tomaž

Here is a patch with corrected spelling of cyrillic http://dl.dropbox.com/u/502270/0001-corrected-spelling-of-cyrillic.patch

Thanks Ruslan! I've committed your patch to the git repository and uploaded a new release.

Posted by Tomaž

from unidecode import unidecode
s = ' von Änderbarkeit'
print unidecode(s)
von Anderbarkeit
# good
t = 'Efficient'
print unidecode(t)
Efi!cient
# bad

Posted by ligature

echo 'Efficient' | iconv -t ASCII//translit

Efficient

Posted by ligature

Ligature, you should pass unicode objects to unidecode, not UTF8 encoded strings.

Try using u"" around your strings.
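The distinction between text and its UTF-8 encoded bytes can be illustrated with a short sketch. This uses Python 3 names (`str` and `bytes`), which is an assumption; in Python 2 the corresponding types are `unicode` and `str`.

```python
text = "\u00c4nderbarkeit"   # "Änderbarkeit" as a Unicode string
raw = text.encode("utf-8")   # the same text as UTF-8 encoded bytes

# Treating every byte as its own character mangles the two-byte "Ä".
misread = raw.decode("latin-1")
# Decoding with the right encoding recovers the original string.
decoded = raw.decode("utf-8")

print(len(text), len(raw))  # 12 13
print(decoded == text)      # True
print(misread == text)      # False
```

A transliterator that is handed the bytes sees the mangled form, not the letter "Ä".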

Posted by Tomaž

Nice, thank u!!!

Posted by ligature

Hi Tomaž,

thanks for your nice work / porting work! I've looked at your git repo and noticed that it doesn't contain tags matching your tar-ball releases. Would you mind adding the missing tags?

Cheers, Fabian

Posted by Fabian Knittel

Ah, didn't notice "git push" needs a "--tags" parameter to also push tags to the public repo. Thanks for the note Fabian, release tags should be there now.

Posted by Tomaž

Actually, "Änderbarkeit" -> "Anderbarkeit" _is_ bad. It should be transliterated to "Aenderbarkeit".

Posted by hop

Hop, are you sure that "ae" is the correct transliteration for "ä" in all languages? I know it is in German, but I'm sure many other languages also use this symbol. Transliteration is always a compromise.

Posted by Tomaž

1) Don't get me wrong: anyone who is aware of the difference between ripping-out-things-i-don't-understand-cause-they-can't-be-important and transliteration, I like already. Kudos to you for putting up code!

2) "Änderbarkeit" is a German word, so it should be transliterated by German rules, no?

3) It's hard to compromise in this case, because one alternative leads to incorrect (really wrong, not just unaesthetic) results in one language, and vice versa.

So, on which solution do you decide? Do you count speakers of languages that use the diaeresis only (and rarely) for splitting digraphs instead of essential daily-use characters? If not, German "beats" the other umlaut-using languages 3.5:1…

3) Luckily, that's what locales are for. iconv can do it (for only a few cases):

$ echo täßt | LC_CTYPE=de_AT.UTF-8 iconv -t ascii//translit

taesst

$ echo täßt | LC_CTYPE=en_US.UTF-8 iconv -t ascii//translit

tasst

4) Sometimes it's even harder: neither a nor ae as a transliteration for ä from Finnish leaves you with something pronounceable (to a Finn, anyway).

Posted by hop

I agree with everything you said.

Unidecode puts simplicity and robustness over accuracy. That is stated quite well in the original Perl module's documentation.

It is good for producing file names or identifiers in systems that can't support Unicode, where something remotely readable is way better than a numeric ID.
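As a rough illustration of the identifier use case, here is a sketch of a slug generator in Python 3. The names `slugify` and `ascii_fallback` are hypothetical, and the default fallback only strips accents using the standard library; the idea is that a real transliteration function could be passed in as `to_ascii` instead.

```python
import re
import unicodedata

def ascii_fallback(text):
    """Stand-in transliterator: strips accents, drops anything else non-ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

def slugify(text, to_ascii=ascii_fallback):
    """Turn arbitrary text into a lowercase, hyphen-separated ASCII identifier."""
    ascii_text = to_ascii(text).lower()
    # Collapse every run of non-alphanumeric characters into a single hyphen.
    return re.sub(r"[^a-z0-9]+", "-", ascii_text).strip("-")

print(slugify("M\u00e4\u00e4tt\u00e4 & H\u00e4ndel"))  # maatta-handel
```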

However, if you need language or context specific transliteration, you need something else. I would even argue that in that case you can't afford not to support true Unicode.

Posted by Tomaž

If you want to apply language-specific transliteration rules in a situation where you know them, you can always transliterate yourself first, using a simple replacement dictionary, and THEN you can use unidecode to transliterate the rest of your text. For example, if you know that your text contains German mixed with some foreign words that may use, say, Polish or Czech accents, or the Cyrillic alphabet, you can apply the ä -> ae, ö -> oe, ü -> ue replacements, and then use unidecode to convert the output into another output using the generalized unidecode transliteration.
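A minimal sketch of that two-step approach, in Python 3. The table and function names are hypothetical, and the generic step here is only a standard-library accent stripper standing in for unidecode; any general-purpose transliteration function could be passed as `fallback`.

```python
import unicodedata

# Language-specific rules applied first (German umlauts, as described above).
GERMAN_RULES = str.maketrans({
    "\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue",
    "\u00c4": "Ae", "\u00d6": "Oe", "\u00dc": "Ue",
})

def generic_fallback(text):
    """Stand-in for a general transliterator: strips accents only."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

def transliterate_german(text, fallback=generic_fallback):
    # Compose first so "a" + combining diaeresis is seen as a single "ä".
    text = unicodedata.normalize("NFC", text)
    return fallback(text.translate(GERMAN_RULES))

print(transliterate_german("\u00c4nderbarkeit"))  # Aenderbarkeit
print(transliterate_german("H\u00e4ndel und Dvo\u0159\u00e1k"))  # Haendel und Dvorak
```

The German rules win where they apply, and everything else (the Czech accents in the second example) falls through to the generic step.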

I agree with Tomaz that unidecode should be a generalized approach, while specialized approaches can be used in addition.

Locale-dependent transliterations would be a nice extension of unidecode, but I find the module useful even without them.

> "Änderbarkeit" is a german word, so it should be transliterated by german rules, no?

It's hard to draw a hard line. For example the last name of Georg Friedrich Händel is simplified into "Haendel" by the Germans and into "Handel" by the English (the latter is the spelling he used himself when he moved to London). Similarly, for transliterating Chinese or Cyrillic or many other non-Latin alphabets, dozens of conventions exist. Currently, unidecode's purpose is not to become a complex system that supports them all. It's a simple solution.

However, I should add that it might be practical if the module was rewritten as a Python text codec, so that the standard Python methods string.encode() and string.decode() could be used.
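To show what that could look like, here is a sketch that registers a custom codec with the standard `codecs` module (Python 3). The codec name `translit_sketch` is made up for this example, and the encoder is only an accent stripper standing in for a real transliteration table.

```python
import codecs
import unicodedata

CODEC_NAME = "translit_sketch"  # hypothetical name, not a built-in codec

def _encode(text, errors="strict"):
    # Stand-in transliteration: decompose, then drop what is not ASCII.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore"), len(text)

def _decode(data, errors="strict"):
    # The encoded form is plain ASCII, so decoding is ordinary ASCII decoding.
    return bytes(data).decode("ascii", errors), len(data)

def _search(name):
    # Called by the codec registry with the normalized encoding name.
    if name == CODEC_NAME:
        return codecs.CodecInfo(encode=_encode, decode=_decode, name=CODEC_NAME)
    return None

codecs.register(_search)

print("Toma\u017e".encode(CODEC_NAME))  # b'Tomaz'
print(b"Tomaz".decode(CODEC_NAME))      # Tomaz
```

Note that decoding cannot restore the original string, since the transformation is one-way.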

Adam

The standard library does all of this already, right?
Try this:
import unicodedata
def strip_accents(s):
    nkfd_form = unicodedata.normalize('NFKD', unicode(s))
    return u"".join([c for c in nkfd_form if not unicodedata.combining(c)])

cheers
Max

Posted by Max

OK, I understand now: this module translates various characters to longer words. The Greek letters are translated to words like "alpha", "beta", etc. I think the standard library is not doing this. So the module might still be useful for some folks...

Posted by Max

This is a very cool library, thanks for porting/improving/maintaining it! In most cases, I need to handle these things myself according to more sophisticated conventions, but this is a fantastic catch-all solution, especially for situations where I don't know the proper rules myself!

Just one small note about terminology. In many cases, "transliteration" is the wrong term (as you can see if you read the definition you've posted closely). A more accurate term, philologically speaking, would be "transcription". For example, I was surprised and impressed to find that this will work with CJK ideographs (not well enough for any real purpose, but pretty cool nonetheless!) - these cannot be transliterated, and can only be transcribed. Also, whilst it is possible to transliterate the Devanāgarī देवनागरी script, used for Hindi and sometimes other Indian languages, what this library does is transcribe it - more useful, in fact, for many (but not all) purposes.

Posted by simon

@Max - this does *tonnes* of stuff that aren't done by the function you quote, and indeed many things that simply can't be done with the unicodedata library alone. The example you give of the Greek, however, is another strong example of "transcription" as opposed to "transliteration" :)

Posted by wilo

Thanks for a great library, Tomaž!

I've found a small bug in the transliteration of Czech characters: \xdd should translate to Y instead of U.
There's a fix for that and a pangram-based test committed in my cloned repository at github (linked as my website below). Pull the changes if you like. Cheers

Transliterating "Änderbarkeit" with ae is fine. However, this does not work out very well for Finns, who would usually prefer that you simply drop the umlauts. The classic problem is:

Määttä, which becomes Maeaettae. That unfortunately breaks the word completely, because aeae does not read the same as ää, even if ae works quite well for a single ä. The bigger problem, of course, is the double consonant, which most languages can't really decipher even if it is left as it is.

Now in Finnish it is not a big problem to throw out the umlauts, because the language has something called vowel harmony, which can mostly fix the problem. However, the problem is totally different when you turn to Estonian.

So no, even if you really try, the transliteration will NEVER end up 100% correct, but at least I can live with some errors in transliteration. Makes some sense is better than makes no sense whatsoever.

Posted by Finn

Excellent module! Thanks for your work on it.

Now my slugs in Django are so beautifully international. :)

Posted by Greg

Thanks for the module!
I was wondering if it would improve performance on a web server to replace
Cache = {}
with
global Cache = {}

Thanks
Kris

Posted by Kris

Hi, great module!

It does, however, have some 'strange' behaviour: the return value can be a string or a unicode string depending on the input:

unidecode(u"\u5317\u4EB0") -> "Bei Jing " (an ascii encoded string)
unidecode(u"\u5317\u4EB0 test") -> u"Bei Jing test" (A unicode string)

I fixed this for my usage:
#### __init__.py, line 30
if codepoint < 0x80: # Basic ASCII
    #retval.append(char) #CUSTOM FIX: This was the original code
    retval.append(char.encode('ascii')) #CUSTOM FIX: The encode makes sure that an ascii string is returned
    continue
####

Maybe you could include this in a future version, although i realise that this might break existing applications somehow relying on this behaviour.. (it seems hard to come up with code that would break with this fix though..)

Cheers!

Wesley, thanks for the bug report. Your patch breaks on Python 3, but I agree that this behavior is a bug. It will be fixed in the next release.

Posted by Tomaž
