Unicode transliteration in Python

25.01.2009 19:45

It's surprising how much software today still doesn't support international characters out of the box. Even in Slovenia it's common to come across a brand new web site where one form or another will not allow Slovenian characters (č, ž and š) to be entered. Recently, when I complained that my name was misspelled on their correspondence, one Slovenian company told me that their "computer" only supports English letters.

While Unicode has been established as a standard way to represent most of the world's writing for more than 15 years now, it still often gets supported as an afterthought, and even then it's usually more of an ugly hack.

However, even if you're writing software from the bottom up with Unicode in mind, you often have to communicate with legacy systems that will only accept 7-bit ASCII. In that case, you want to choose the path of least surprise for the user (no, replacing all characters outside of ASCII with "?" is not it).

transliteration: the act or product of transliterating, or of representing letters or words in the characters of another alphabet or script.

Or, in this particular case, finding the sequence of ASCII characters that is the closest approximation to your Unicode string.

For example, the closest ASCII string to "Tomaž" is "Tomaz". Some information is lost in this transformation, of course, since several Unicode strings can be transformed into the same ASCII representation. So this is strictly a one-way transformation. However, a human reader will probably still be able to guess from the context what original string was meant.

The same transformation can be useful when trying to match a proper Unicode string stored in some database with a string that was entered by a user with a US keyboard who is too lazy or ignorant to properly spell a foreign word (another one of my pet peeves, by the way).
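To make the idea concrete, here is a minimal sketch of both the lossy transformation and the matching use case. Note that it uses a different, much cruder technique than Unidecode: it only decomposes characters with the standard library's unicodedata module and drops the accents, so it is an illustration under that assumption and not a replacement. The names ascii_fold and find_matches are made up for this example.

```python
import unicodedata


def ascii_fold(text):
    """Crude ASCII approximation: decompose, drop accents, drop the rest."""
    # NFKD splits "ž" into a plain "z" followed by a combining caron.
    decomposed = unicodedata.normalize("NFKD", text)
    # Remove the combining marks that the decomposition produced.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Anything still outside ASCII has no decomposition and is discarded.
    return stripped.encode("ascii", "ignore").decode("ascii")


def find_matches(query, names):
    """Return the stored names whose folded, lower-cased form equals the query's."""
    key = ascii_fold(query).lower()
    return [name for name in names if ascii_fold(name).lower() == key]


# Several different Unicode strings end up with the same ASCII form,
# which is why the transformation cannot be reversed.
folded = [ascii_fold(name) for name in ["Tomaž", "Tomaz"]]

# A user typing without the caron still finds the properly spelled record.
matches = find_matches("tomaz", ["Tomaž", "Tomas", "Tomaz"])
```

Characters that have no decomposition, such as Chinese ideographs, are simply dropped by this sketch. That is exactly the kind of case where a table-driven approach is needed.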

Sean M. Burke wrote Text::Unidecode, a wonderful Perl module that performs the Unicode transliteration I described. However, since I mostly write Python code at work, I kept missing an equivalent library in Python.

So, here's my port of Sean's module to Python. I took the most convenient path and did a pretty much literal translation, with most of the work done by a simple script that converted the character tables from Perl to Python syntax.

The use is pretty straightforward and it works even in more complicated cases:

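from unidecode import unidecode
# The parenthesized form works with both the Python 2 print statement
# and the Python 3 print function.
print(unidecode(u"\u5317\u4EB0"))

# That prints: Bei Jing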

You can get the source directly from a git repository:

$ git clone http://code.zemanta.com/tsolc/git/unidecode

To install, use the classic Python trick:

$ python setup.py install
$ python setup.py test

Update: The repository above is no longer available. Please see the Unidecode page in the Python Package Index for updated installation instructions and a link to the new git repository.

If you're going to use this for anything mission-critical, the original documentation is still a good read, especially the section about design goals and constraints.

Posted by Tomaž | Categories: Code | Comments »

High resolution graphics on Galaksija

22.01.2009 18:04

It has been almost a year and a half since I defended my diploma thesis about Galaksija, the now almost forgotten Yugoslavian microcomputer from the 1980s.

My replica has been mostly gathering dust all this time, as I haven't had much time to play with it since I got employed by Zemanta. Recent work by Stefano Bodratto on the Galaksija port of z88dk, however, reminded me again of the good old 8-bit days.


Then, during the 25C3 in Berlin, I listened to Michael Steil's Ultimate Commodore 64 Talk. His detailed description of Commodore's graphics modes and the hacks to get more pixels and colors pushed through the VIC made me remember some ideas I had on how to get higher resolution graphics out of the unmodified hardware of the original Galaksija. Just to be clear, I'm not talking about Galaksija Plus here. Plus was the later model that supported high resolution graphics out of the box thanks to a modified character generator circuit.

I opened up my electronics projects folder and dug out the schematics and documentation I had written. And in about two weeks of on and off evening work I got it to work. So, 25 years after Galaksija was introduced to the public, here is the first-time demonstration of this computer displaying more than just the ordinary low-resolution 64 by 48 character graphics.

How did I do this?

Video generation in Galaksija involves a carefully tuned combination of hardware and software. In short, the Z80 CPU is used as a video address generator. Since it's too slow to bit-bang pixels directly on the screen, its dynamic memory refresh counter is used to sequentially read the contents of the frame buffer out to the screen. The bytes of data from the buffer get transformed into the actual pixels by a character ROM. The refresh counter runs continuously and can't be directly controlled by the software. Its period depends a little bit on the instructions that the CPU is executing, and it can only be reset to a known value by a relatively expensive load instruction.

Controlling Galaksija's video hardware is a matter of high precision programming. The code must have perfect timing. Each instruction must be accounted for, down to each cycle of the clock, and except for the horizontal blanking period between two consecutive scan lines, there are severe restrictions on which instructions can be executed to keep the refresh counter running deterministically.

The original video driver manages to perform this job in a wonderful way. As if timing restrictions weren't enough, it also saves a bit of ROM space by having a part of the code double as the ASCII string "BREAK" and a 1.0 floating point constant.

Without being restricted by the lack of space in ROM the video driver can be done in a better way. The original video driver re-reads a single 32-byte line from the frame buffer for 13 scanlines, to construct one line of characters on the screen. With 512 bytes of RAM reserved as a framebuffer, this gives 32 x 16 characters. Since there are 2 x 3 graphics symbols available in the character set, the full resolution is 64 x 48.

However, apart from the lack of CPU time, there is no theoretical reason why it can't read a different frame buffer line for each scan line. You're still limited by the pixel combinations in the character ROM on the horizontal, of course, but on the vertical you can do the full PAL resolution.

High-resolution framebuffer memorymap

The first reality check is the amount of RAM available for the video buffer. The original Galaksija has 6 kB, so you get to choose either more pixels on the screen or more space for code to actually do something with those pixels.

I opted for a trick here: since I'm already limited on the horizontal by the character ROM, I decided to stay with Galaksija's character graphics, but I only read three scan lines from each character. This compresses the character set vertically and removes the empty scanlines between rows. I decided to go with a 2 kB framebuffer, giving an effective resolution of 64 x 192 at 6 pixels per byte.
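The arithmetic behind these numbers can be sketched in a few lines. This assumes a plain linear framebuffer with 32 bytes per row and one 2 x 3 graphics cell per byte. The bit order inside a cell is a hypothetical choice made for the example, and the real memory map may well order the rows differently.

```python
BYTES_PER_ROW = 32         # bytes read from the framebuffer per scan line
CELL_W, CELL_H = 2, 3      # pixels covered by one graphics character


def resolution(framebuffer_bytes):
    """Pixel resolution (width, height) for a framebuffer of the given size."""
    rows = framebuffer_bytes // BYTES_PER_ROW
    return BYTES_PER_ROW * CELL_W, rows * CELL_H


def locate_pixel(x, y):
    """Map a pixel to (byte index, bit number), assuming a linear layout."""
    byte_index = (y // CELL_H) * BYTES_PER_ROW + (x // CELL_W)
    # Hypothetical bit order: left to right, then top to bottom in the cell.
    bit = (y % CELL_H) * CELL_W + (x % CELL_W)
    return byte_index, bit


original_mode = resolution(512)     # the stock 512-byte framebuffer
high_res_mode = resolution(2048)    # the 2 kB framebuffer
last_pixel = locate_pixel(63, 191)  # bottom right corner in high resolution
```

Both modes use the same 6 pixels per byte. The extra vertical resolution comes purely from having four times as many framebuffer rows.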

The second problem is how to control the video interrupt routine, since the Z80's hardwired interrupt vector resides inside the ROM. Luckily, this problem has already been solved on the ZX Spectrum and a similar solution is applicable on Galaksija. This, however, takes an additional 260 bytes for the vector trampoline area, which I decided to put in place of the original framebuffer.
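For readers who don't know the Spectrum trick: the approach commonly used there relies on the Z80's interrupt mode 2, where the CPU builds a pointer from the I register and whatever byte happens to be on the data bus, and then jumps to the 16-bit address stored at that pointer. If the bus value is unpredictable, a 257-byte table filled with one repeated byte makes every possible pointer yield the same target. The small simulation below only illustrates that idea. The page and fill values are hypothetical and the actual Galaksija driver may differ in the details.

```python
def build_vector_table(memory, table_page, fill_byte):
    """Fill 257 bytes starting at table_page * 256 with the same value."""
    base = table_page << 8
    # 257 entries, not 256: a pointer ending in 0xFF also reads the next byte.
    for offset in range(257):
        memory[base + offset] = fill_byte


def im2_target(memory, i_register, bus_value):
    """Address the CPU jumps to in interrupt mode 2 (little-endian read)."""
    pointer = (i_register << 8) | bus_value
    return memory[pointer] | (memory[pointer + 1] << 8)


memory = bytearray(65536)
TABLE_PAGE = 0x80   # hypothetical: table occupies 0x8000..0x8100
FILL_BYTE = 0x81    # hypothetical: every vector then reads as 0x8181
build_vector_table(memory, TABLE_PAGE, FILL_BYTE)

# Whatever the data bus holds, the interrupt always lands on the same address,
# where a jump to the real handler in RAM can be placed.
targets = {im2_target(memory, TABLE_PAGE, bus) for bus in range(256)}
```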

I won't go into much detail about how the high-res video driver is written as I'll be releasing the assembly source code in a few days. I only managed to get the timing right after unrolling the scanline loop a little, but the code still takes a little less than 170 bytes of RAM. So even with a conservative estimate you still have around 3.5 kB RAM free for any software that uses the new graphics mode.

One interesting point here is that I don't use the A7 clamp in the new video driver. The original software uses a tricky little piece of hardware to clamp CPU address line A7 to 0, which effectively remaps the upper halves of all 256 byte blocks of RAM to their lower halves. I always found this technique interesting because it made the original driver even harder to design - you also had to make sure that no code path ventured outside of the lower blocks when remapping was active. The rationale was that this is a work-around for the Z80's feature of not incrementing bit 7 of the memory refresh counter. However, my work here shows that a hardware solution is not necessary, as this can also be worked around in software by resetting the R register with correct values at the correct moments.
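The refresh counter behaviour in question is easy to model. On the Z80 only the low 7 bits of R count, while bit 7 keeps whatever value software last loaded into it. The sketch below is a simplified model that advances R once per opcode fetch and ignores all timing, so it only shows why a free-running counter covers half of a 256 byte block and how reloading R selects the other half. It says nothing about how the actual driver schedules those reloads.

```python
def next_r(r):
    """Advance R by one opcode fetch: low 7 bits count, bit 7 is preserved."""
    return (r & 0x80) | ((r + 1) & 0x7F)


def refresh_sequence(start, fetches):
    """Values of R seen over a number of fetches, starting from a loaded value."""
    values = []
    r = start
    for _ in range(fetches):
        values.append(r)
        r = next_r(r)
    return values


# Left alone with bit 7 clear, R wraps from 0x7F back to 0x00 and never
# reaches the upper half of the block.
lower_half = refresh_sequence(0x00, 256)

# Loading R with bit 7 set makes the same counter walk 0x80..0xFF instead.
upper_half = refresh_sequence(0x80, 256)
```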

Galaksija high-resolution demo

So, I invite any owners of the original hardware to test out this demo themselves. It has currently been tested on my CMOS replica, but I'm confident that its circuits exactly reproduce the function of the original, so it should work correctly on vintage hardware as well.

There is a GTP and a WAV file available for download. The WAV file should be suitable for loading directly into Galaksija through its audio port. The GTP file first has to be converted to audio using my gtp2wav utility.

No Galaksija emulators are currently able to run this demo since they do not emulate the video hardware in enough detail.

Finally, I would also like to invite anyone who has any evidence that this has been done before to drop me a mail. I've heard numerous rumors of Galaksija software that used some kind of high resolution graphics. But in my studies I never found any reliable evidence that any ever existed, and most rumors turned out to be inconsistent with the capabilities of the hardware.

Posted by Tomaž | Categories: Digital | Comments »

Paper towel DRM

20.01.2009 16:15

Just when I thought that the practice of intentionally breaking product functionality (i.e. DRM) stops at audio and video recording equipment and shoes, I stumble upon this EULA-like sticker hidden inside the paper towel dispenser at Zemanta headquarters:

American paper towel DRM

Not surprisingly, the dispenser will not dispense your ordinary roll of towels, as the roll center needs to be shaped in a special way to fit in the holder. And of course, it could also be easily modified to fix this shortcoming (that's where Georgia Pacific's legal threat comes in).

I'm sure there's a lively dispenser modding community out there distributing pirated towel rolls for free.

Seriously, must every product nowadays really bear some silly, unenforceable restrictions on its user?

Posted by Tomaž | Categories: Life | Comments »

2008 statistics

05.01.2009 19:32

I'm looking back at my web server's logs that have been accumulating for the past year (like everyone else it seems).

Here are my 10 most popular posts of 2008:

  1. Linux mmap weirdness
  2. Python readline() surprise
  3. Mutual inductance problem
  4. EeePC's hotkeys
  5. Eee power utility
  6. Capacitor charging for dummies
  7. Electronics quiz
  8. Slow to boot
  9. C++ streams suck
  10. EeePC ATA hickups

And the top user agents of readers like you:

User agents for 2008

Posted by Tomaž | Categories: Life | Comments »

25C3, final thoughts

03.01.2009 20:47

I just returned home from Berlin. While flying back I had some final thoughts about this year's Chaos Communication Congress.

BCC by thosch66

Photo by thosch66 CC BY-NC-ND 2.0

If I had to describe the congress in one word, crowded would be it.

According to the numbers given at the closing ceremony, there were 4230 visitors, which broke the record from two years ago. It showed. Sometimes you could hardly move in the halls between the talks, not to mention the lines in front of the toilets. The lecture halls were full as well. Talks were often delayed because there were too many people in the room and the organizers could not allow the emergency exits to be blocked by the crowd.

The tickets were sold out, I think, on the second day, and in some cases you had to be in the room two talks earlier if you wanted to participate in a particularly popular discussion. Especially on the last day, the largest room had its doors basically sealed from the start - nobody could get in and if you got out you weren't getting back.

In contrast to the meatspace infrastructure, the network was coping surprisingly well. The wireless LAN mostly worked and was quite usable for a casual web page or mail check. There were also more than enough wired gigabit sockets, so I could get one whenever I wanted (that usually meant sitting on the floor, but I'm used to that from the previous congresses). The switch I brought was mostly dead weight in my backpack.

Another thing I noticed was that a lot (well, more than I remember from previous years) of attention was given to possible legal ramifications of actions the speakers were demonstrating in front of the audience. The Stormbot guys repeated several times that all computers they tested their attack on were their property. The MD5 cracking team gave out non-disclosure agreements to any CA and browser vendor they told about their work before public disclosure at 25C3. Several people (particularly those running private GSM networks and hacking iPhones) requested that cameras be turned off during live demonstrations.

Interesting. The CCC also didn't notice any problematic activity originating from their network this year.

If nothing else, this certainly confirmed that the Nothing to hide motto was well chosen.

Posted by Tomaž | Categories: Life | Comments »

25C3, day 4

01.01.2009 21:00

It looks like the organizers saved the best stuff for the end of the congress.

Conference DVDs

For starters, there was Luciano Bello and Maximiliano Bertacchini with their detailed report on the Debian SSL vulnerability. They explained what the vulnerability was, the chain of events that led to it and what it meant for users and administrators around the world (even those not using Debian - hint: when you're answering a question on a mailing list, your answer may have serious consequences someday).

Although they didn't explain how the problem was discovered in the first place, the point they made is that Linus' law doesn't always hold (you know, the idea that if a lot of eyes are looking at the code, all bugs will eventually be found). In this particular case, the problem lay undiscovered for two years and 3% of the HTTPS servers around the world now use vulnerable certificates as a consequence.

I certainly recommend listening to their talk (once the CCC video team puts it online). I didn't know, for example, that you can compromise your otherwise secure private key merely by making a DSA signature on a machine that uses a vulnerable OpenSSL library.
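The reason a single signature is enough comes from the DSA signing equation s = k^-1 (H(m) + x r) mod q. If the per-signature nonce k comes from a predictable random generator, an attacker can try the candidate values and solve the equation for the private key x. The sketch below demonstrates this with toy parameters that are far too small for real use. It is not OpenSSL's code, and it needs Python 3.8 or later for the modular inverse via pow().

```python
import hashlib

# Toy DSA group: Q is prime and divides P - 1, G generates the subgroup of order Q.
P, Q, G = 607, 101, 64


def message_hash(message):
    """Hash of the message reduced into the range of the toy group."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % Q


def sign(message, private_key, nonce):
    """Textbook DSA signature (r, s) made with an explicitly given nonce."""
    r = pow(G, nonce, P) % Q
    s = (pow(nonce, -1, Q) * (message_hash(message) + private_key * r)) % Q
    return r, s


def recover_private_key(message, signature, public_key, candidate_nonces):
    """Try every nonce the weak generator could have produced."""
    r, s = signature
    h = message_hash(message)
    for k in candidate_nonces:
        if pow(G, k, P) % Q != r:
            continue  # this nonce would not have produced the observed r
        # Solve s = k^-1 (h + x r) mod Q for x.
        x = ((s * k - h) * pow(r, -1, Q)) % Q
        if pow(G, x, P) == public_key:
            return x
    return None


private_key = 57                      # generated on a perfectly good machine
public_key = pow(G, private_key, P)
# One signature made where the nonce is guessable gives the key away.
signature = sign(b"hello", private_key, nonce=23)
recovered = recover_private_key(b"hello", signature, public_key, range(1, Q))
```

In this toy group the whole nonce space is tiny. On a real system the search is only feasible because the broken generator could produce a small number of different values.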

Meet Wall-E on FTP!

What followed was probably the highlight of the whole conference. Jacob Appelbaum and Alexander Sotirov, with a bunch of other researchers, demonstrated how they managed to obtain a rogue SSL CA certificate that was implicitly trusted by all major web browsers. They used a well-known MD5 chosen-prefix collision attack and a sloppy implementation of site certificate signing on the part of a particular trusted certificate authority. The idea they managed to implement was to change an MD5-signed SSL site certificate into an SSL CA certificate without changing the validity of the signature.
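Why a hash collision carries the signature over can be shown with a toy model. A signature only covers the digest of the data, so two different inputs with the same digest share every signature. The sketch below cheats heavily to stay fast: it truncates MD5 to 16 bits so that a plain brute-force search finds a collision, and it uses an HMAC as a stand-in for the CA's signature. None of this resembles the researchers' actual collision-finding code. All names and the certificate-like strings are made up.

```python
import hashlib
import hmac


def weak_digest(data):
    """First 2 bytes of MD5: deliberately weak so a collision is cheap to find."""
    return hashlib.md5(data).digest()[:2]


def ca_sign(data, ca_secret):
    """Stand-in for a CA signature: it covers the digest, not the data itself."""
    return hmac.new(ca_secret, weak_digest(data), hashlib.sha256).digest()


def ca_verify(data, signature, ca_secret):
    return hmac.compare_digest(ca_sign(data, ca_secret), signature)


def chosen_prefix_collision(prefix_a, prefix_b, table_size=1024):
    """Find suffixes so that both prefixed messages have the same weak digest."""
    table = {}
    for i in range(table_size):
        candidate = prefix_a + str(i).encode()
        table[weak_digest(candidate)] = candidate
    for j in range(1 << 20):
        candidate = prefix_b + str(j).encode()
        match = table.get(weak_digest(candidate))
        if match is not None:
            return match, candidate
    raise RuntimeError("no collision found within the search limit")


ca_secret = b"hypothetical CA signing key"
site_cert, rogue_cert = chosen_prefix_collision(b"site;CA=FALSE;", b"rogue;CA=TRUE;")

# The CA only ever sees and signs the harmless looking site certificate ...
signature = ca_sign(site_cert, ca_secret)
# ... but the very same signature also verifies for the colliding one.
rogue_is_accepted = ca_verify(rogue_cert, signature, ca_secret)
```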

They only needed 200 PlayStation 3 consoles for finding the MD5 collision and around $600 for a number of SSL site certificate registrations. Small change for someone wanting, for example, to impersonate some major banking site.

They have made a publicly accessible secure demo page that is signed with a certificate that is in turn signed by their CA. A complete description of the work they did is also available on their site, although without the CA keys and their PlayStation-optimized MD5 collision finding code. The key would be useless anyway, since they have set the certificate to expire a couple of years ago, so that there would be absolutely no chance of malicious use.

It's interesting that it has been known in theory for at least two years that the MD5 digest is vulnerable to such an attack. However, no one took it seriously until they did it in practice and started silently sending around links to their demo page and their paper. At that point everyone was up in arms, of course.

Posted by Tomaž | Categories: Life | Comments »