24.03.2010 17:59

Python has a handy module called gzip that can be used to read gzip-compressed files as if they were uncompressed. However, it has one nasty flaw: it requires the file objects it reads to support the seek() method, which isn't available in, for example, network streams.

Since the gzip command-line utility can decompress from a pipe, Python's requirement for seekable streams obviously isn't a requirement of the file format.
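
To illustrate that point, here is a minimal sketch of decompressing a gzip stream using nothing but read(). It relies on the standard zlib module, which understands the gzip header and trailer when wbits is set to 16 + zlib.MAX_WBITS. The function name gunzip_stream and the chunk size are my own choices for illustration and are not taken from either patch. It also assumes the stream holds a single gzip member, so concatenated gzip files are not handled.

```python
import zlib


def gunzip_stream(fileobj, chunk_size=16 * 1024):
    """Yield decompressed chunks from a gzip stream using only read().

    Hypothetical helper: fileobj needs a read(n) method and nothing else,
    so sockets wrapped in a file-like object or pipes work as well.
    """
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer
    # instead of a raw zlib stream.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        data = decompressor.decompress(chunk)
        if data:
            yield data
    # Emit whatever is still buffered inside the decompressor.
    tail = decompressor.flush()
    if tail:
        yield tail
```

Joining the yielded chunks gives back the original data, and at no point is seek() or tell() called on the input.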

I hit this limitation when trying to implement a map reader function for compressed files for the Disco map-reduce framework I'm experimenting with at work. It appears I'm not alone in having this problem.

I had nearly implemented a fix myself when I found not one, but two patches gathering dust in the official Python bug tracker that solve this issue: 914340 and 1675951.

I tried the second one and it works beautifully. The description also claims that it improves read performance by 20%, although I can't confirm that.

This was submitted back in 2007 and, as far as I can see, fits the description of a patch any free software maintainer should be most happy to accept. I can only guess why, 3 years later, this is still not in the official Python builds.

Guess what... it's 2012 and it's still not fixed in the 2.7 tree. It's claimed to have been fixed in 3.2, but I didn't check.

-- Arik

Arik

fixed in 3.3.0. I just backported to 2.x, works great


fruitbat

