Python Unidecode release 0.04.14

21.09.2013 14:40

Yesterday I released Unidecode 0.04.14, a new version of my Python port of Sean Burke's Text::Unidecode Perl module for transliterating Unicode strings to 7-bit ASCII.

Together with a few other minor changes, this release reverts one quite controversial change in the previous version. In the new version, Latin characters with diaeresis (ä, ö, ü) are again simply stripped of accents instead of using German transliterations (ae, oe, ue).

Using German transliterations instead of simple language-neutral accent stripping was the most often requested change to Unidecode. However, after that change was released, the most often reported bug was again concerning the transliteration of these characters. This reaction was interesting and several lessons have been learned from it.

Before diving into the problem of transliteration at all, it was obvious that people wrote code under the assumption that Unidecode will not change the transliterations over time. Apparently the 0.04.13 release broke many websites because of this, since the most popular use of Unidecode is to generate ASCII-only URLs from article titles. This is interesting because each and every release of Python Unidecode contained changes to the mappings between Unicode and ASCII characters, as it was clearly stated in the ChangeLog. This bug only became apparent once transliteration of an often used character, like ä, was changed.

So, the lesson here is that if you are using Unidecode for generating article slugs, you should only do transliteration once and store it in the database.

Now, the issue with automatic transliteration is that it's hard to do it right. In fact, it's something that requires understanding of natural language, itself a strong AI problem. If you want to get a rough idea what it involves and what kind of trade-offs Unidecode does, I suggest reading Sean Burke's article about it. Here's a relevant quote from it:

The grand lesson here is tht if y lv lttrs ot, ppl cn stll make sense of it, but ifa yaou gao araounada inasaeratainaga laetataerasa, the result is pretty confusing.

Above explains nicely why having German transliterations in Unidecode doesn't work. While it might be that German is the most common use of these characters, it makes transliteration for other languages using the same characters much worse. As Sean points out, it is much easier for a German speaker to recognize words with missing es than it is for instance for a Finnish reader to deal with extra es (compare Hyvää päivää with Hyvaeae paeivaeae or Wörterbuch with Worterbuch). It was my error that I did not remember this argument before accepting the German transliteration patch.

However, this is such a popular issue that even Wordpress implements the German transliteration as the only language-specific exception in their own transliteration tables. It should also be pointed out though that this simple fix does not actually make German transliteration perfect. There is an issue with capitalization (without understanding the context you can't know whether to replace upper-case Ä with Ae or AE).

The solution here, which I will suggest in the future to anyone that will report this as a bug, is that you should use a language-specific transliteration step before using Unidecode. You can find several of them on the web. Some, like Unihandecode, have been built on top of Unidecode.

One thing you should be aware though, is that these language-specific solutions can give a false sense of correctness. You are now relying on responsible people setting the string language correctly, when many don't even get the string encodings right. Also, can you be sure that all of the input will be in the language that was specified? An occasional foreign visitor might be much more upset with a wrong language-specific transliteration of her name than a somewhat language-neutral one provided by Unidecode.

In any case, you should be aware that using automatic transliteration to produce strings that are visible to general public will lead to such problems. This is something that developers of Plone and Launchpad got to experience first hand (although I believe the latter was not due to Unidecode).

In conclusion, I will now be much more careful accepting patches to Unidecode that deal with language-specific characters. In contrast to Sean I don't have a Master's degree in linguistics and only have a working knowledge of three languages. This makes me mostly unqualified to judge whether proposed changes make sense. Even if submitters put forth good arguments, they probably don't have a complete picture of which other languages their change might affect. Even though they didn't raise as much dust as this most recent one, I'm now actually afraid of how much damage was caused by those other few changes to Sean's original transliteration tables I accepted.

Posted by Tomaž | Categories: Code

Add a new comment

(No HTML tags allowed. Separate paragraphs with a blank line.)