Avian’s Blog

Electronics and Free Software

Internationalization in Tablix

01.02.2006 17:00

I'm easily annoyed by software that doesn't work or somehow mangles Slovenian characters č, ž and š. So I really want Tablix to work correctly with all languages.

Tablix already handles different languages quite well, but according to the LibXML documentation this is mere coincidence, since a lot of my code is not written The Right Way. I want to correct that.

However, the more I look into this subject, the more complicated the task seems. The main problem in internationalization is that for each string stored in memory you must know exactly what its encoding is. This is not as trivial as it sounds, since strings in Tablix come from many different sources:

  • Strings obtained from LibXML are the least problematic, since they are all encoded in UTF-8. They are also nicely contained in the xmlChar type.
  • Constant strings internationalized with GetText (the _() macro) are in the encoding of the current locale, which can be anything.
  • Simple constant strings in C. I guess these are in whatever encoding the C source file was, but I'm far from sure after some googling.
  • Strings returned by sprintf(). I think these should be in the encoding of the current locale, but again I'm not sure.
  • Command line arguments obtained from argv. Current locale I guess.
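One way the locale-encoded sources above (argv, GetText output) could be brought to UTF-8 before they reach LibXML is sketched below. It assumes POSIX iconv() and nl_langinfo() are available; the helper names to_utf8 and locale_to_utf8 are made up for illustration and are not part of Tablix or LibXML.

```c
#include <assert.h>
#include <iconv.h>
#include <langinfo.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of Tablix or LibXML): convert a
 * NUL-terminated string from from_codeset to UTF-8. Returns a malloc()ed
 * string that the caller must free(), or NULL on any error. */
char *to_utf8(const char *input, const char *from_codeset)
{
    assert(input != NULL && from_codeset != NULL);

    iconv_t cd = iconv_open("UTF-8", from_codeset);
    if (cd == (iconv_t)-1)
        return NULL;                    /* unknown or unsupported codeset */

    size_t in_left = strlen(input);
    /* Generous bound: allow 4 output bytes for every input byte. */
    size_t out_left = in_left * 4;
    char *output = malloc(out_left + 1);
    if (output == NULL) {
        iconv_close(cd);
        return NULL;
    }

    char *in_ptr = (char *)input;       /* iconv() takes a non-const pointer */
    char *out_ptr = output;
    size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    iconv_close(cd);

    if (rc == (size_t)-1) {             /* conversion failed, e.g. invalid input */
        free(output);
        return NULL;
    }
    *out_ptr = '\0';
    return output;
}

/* Assumes the program called setlocale(LC_ALL, "") at startup, so that
 * nl_langinfo(CODESET) names the encoding of the user's locale. */
char *locale_to_utf8(const char *input)
{
    return to_utf8(input, nl_langinfo(CODESET));
}
```

For example, to_utf8("\xE8", "ISO-8859-2") should yield the two bytes 0xC4 0x8D, which is the UTF-8 form of č. The important design point is that the source encoding is always passed explicitly or looked up from the locale, never guessed.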

A similar problem appears when these strings are written somewhere. But here at least the situation is a bit less foggy.

  • All strings passed to LibXML must be UTF-8.
  • Strings written to the console must be in the current locale encoding.
  • Strings written to HTML files must be in the encoding of that file.
  • File names probably have to be in the current locale encoding, but I'm not sure.
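The output side of the list above could look roughly like this: a UTF-8 string (for instance one obtained from LibXML) is converted to the locale encoding just before it is written to the console. This is only a sketch, assuming POSIX iconv() and nl_langinfo(); from_utf8 and print_utf8 are hypothetical names, not existing Tablix functions.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <langinfo.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: convert a NUL-terminated UTF-8 string to
 * to_codeset. Returns a malloc()ed string the caller must free(), or NULL
 * on error. Meant for byte-oriented, stateless target encodings; a
 * stateful one would need an extra flushing iconv() call at the end. */
char *from_utf8(const char *utf8, const char *to_codeset)
{
    assert(utf8 != NULL && to_codeset != NULL);

    iconv_t cd = iconv_open(to_codeset, "UTF-8");
    if (cd == (iconv_t)-1)
        return NULL;                     /* unknown or unsupported codeset */

    char *in_ptr = (char *)utf8;         /* iconv() takes a non-const pointer */
    size_t in_left = strlen(utf8);
    size_t cap = in_left + 16;           /* initial guess, grown on demand */
    size_t used = 0;
    char *out = malloc(cap);

    while (out != NULL) {
        char *out_ptr = out + used;
        size_t out_left = cap - used - 1;    /* keep one byte for the NUL */
        size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
        used = (size_t)(out_ptr - out);
        if (rc != (size_t)-1)
            break;                       /* all input consumed */
        if (errno != E2BIG) {            /* e.g. invalid UTF-8 in the input */
            free(out);
            out = NULL;
            break;
        }
        cap *= 2;                        /* output buffer was too small */
        char *bigger = realloc(out, cap);
        if (bigger == NULL) {
            free(out);
            out = NULL;
            break;
        }
        out = bigger;
    }

    iconv_close(cd);
    if (out != NULL)
        out[used] = '\0';
    return out;
}

/* Write a UTF-8 string to stream in the encoding of the current locale.
 * Assumes setlocale(LC_ALL, "") was called at startup.
 * Returns 0 on success and -1 on failure. */
int print_utf8(FILE *stream, const char *utf8)
{
    char *local = from_utf8(utf8, nl_langinfo(CODESET));
    if (local == NULL)
        return -1;
    int rc = fputs(local, stream);
    free(local);
    return rc < 0 ? -1 : 0;
}
```

Unlike the conversion to UTF-8, this direction can lose information: the locale encoding may not be able to represent every character, and how iconv() reports that (an error, or a substitute character) differs between implementations, so the caller should be prepared for a NULL result.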

I'll buy a beer for anyone who can clarify the points I'm not sure about.

Posted by Tomaž | Categories: Code