Python readline() surprise
05.08.2008 21:01
If you think Python's readline() function works like C's fgets(), you're in for a nasty surprise the first time you use it to read line-oriented files containing Unicode text (FreeBase data dumps, for example).
Check this example:
import codecs
one_line = u"Hello\u2028world!\n"
print one_line
f = codecs.open("tmp", "w", "utf-8")
f.write(one_line)
f.close()
f = codecs.open("tmp", "r", "utf-8")
for line in f:
    # How many lines are read here?
    print line
f.close()
The loop at the end prints out two lines: one with "Hello", and the other with "world!". Of course, once you look into it, the reason for this behavior becomes obvious. Unicode character U+2028 is a line separator, and the readline() behind codecs.open() seems to be smart enough to know this.
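The split can be reproduced without touching the filesystem. The sketch below assumes the codecs reader follows the same line-boundary rules as the splitlines() method of Unicode strings, which is what the output above suggests. The helper name count_lines is made up for illustration, and the code is meant to run on both Python 2.6+ and Python 3.3+.

```python
import codecs
import io

one_line = u"Hello\u2028world!\n"

# Unicode-aware splitting treats U+2028 (LINE SEPARATOR) as a line boundary,
# even though the string contains only a single "\n".
unicode_lines = one_line.splitlines(True)
newline_count = one_line.count(u"\n")


def count_lines(text):
    """Hypothetical helper: count the lines a codecs UTF-8 reader yields."""
    # Wrap an in-memory byte buffer in the same kind of reader
    # that codecs.open() uses for decoding.
    reader = codecs.getreader("utf-8")(io.BytesIO(text.encode("utf-8")))
    return len(list(reader))
```

Here unicode_lines holds two entries while newline_count is 1, and count_lines(one_line) reports the same two "lines" the loop printed.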
This is the first time I have encountered a function that reads a line from a file and has intimate knowledge of Unicode. Since most other such functions only know "\n" (or "\r" or "\r\n" on line-ending-challenged systems), it would be nice if this oddity were at least mentioned in the Python documentation.
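For files where only "\n" should end a line, one possible workaround is to open them through the io module instead of codecs. This is a sketch under assumptions: it needs Python 2.6 or later (where io.open() exists), the helper name read_newline_lines is hypothetical, and it relies on the newline argument of io.open() to restrict line endings.

```python
import io
import os
import tempfile


def read_newline_lines(path, encoding="utf-8"):
    """Hypothetical helper: return the lines of a file, split on LF only."""
    # newline="\n" makes "\n" the only line terminator; "\r" and
    # U+2028 are left in the data untouched.
    with io.open(path, "r", encoding=encoding, newline="\n") as f:
        return list(f)


# Usage: write the problematic string to a temporary file and read it back.
fd, path = tempfile.mkstemp()
os.close(fd)
try:
    with io.open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write(u"Hello\u2028world!\n")
    lines = read_newline_lines(path)
finally:
    os.remove(path)
```

With this reader the file comes back as a single line, with the U+2028 character preserved inside it.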
On the other hand, maybe it's time to put the Tab Separated Values file format to rest and bravely step into the bright Unicode-encoded, XML-formatted future?
