Streaming gzip decompression in Python

24.03.2010 17:59

Python has a handy module called gzip that lets you read gzip-compressed files as if they were uncompressed. However, it has one nasty flaw: the file objects it reads from must support the seek() method, which isn't available on, for example, network streams.

Since the gzip command-line utility can decompress from a pipe, Python's requirement for seekable streams obviously isn't a requirement of the file format.
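To illustrate that the format can be decoded in a single forward pass, here is a minimal sketch built on the standard zlib module. It is not either of the patches mentioned in this post. The names iter_gunzip and NoSeek are made up for this example, and the sketch assumes the input holds a single gzip member (concatenated members are not handled).

```python
import gzip
import io
import zlib


def iter_gunzip(stream, chunk_size=64 * 1024):
    """Yield decompressed chunks from a file-like object using only read()."""
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(16 + zlib.MAX_WBITS)
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        data = decompressor.decompress(chunk)
        if data:
            yield data
    # Emit anything still buffered inside the decompressor.
    tail = decompressor.flush()
    if tail:
        yield tail


class NoSeek:
    """Hypothetical stand-in for a network stream: read() only, no seek()."""

    def __init__(self, data):
        self._buf = io.BytesIO(data)

    def read(self, n=-1):
        return self._buf.read(n)


payload = b"hello, streaming gzip\n" * 1000
compressed = gzip.compress(payload)
result = b"".join(iter_gunzip(NoSeek(compressed), chunk_size=128))
assert result == payload
```

The small chunk_size in the usage lines is only there to force many read() calls. A real reader would probably keep the default.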

I hit this limitation when trying to implement a map reader function for compressed files for the Disco map-reduce framework I'm experimenting with at work. It appears I'm not alone in having this problem.

I had nearly implemented a fix myself when I found not one but two patches gathering dust in the official Python bug tracker that solve this issue: 914340 and 1675951.

I tried the second one and it works beautifully. The description also claims that it improves read performance by 20%, although I can't confirm that.

This was submitted back in 2007 and, as far as I can see, fits the description of a patch any free software maintainer should be most happy to accept. I can only guess why, 3 years later, it is still not in the official Python builds.

Posted by Tomaž | Categories: Code

Comments

Guess what... 2012 and it's still not fixed in the 2.7 tree. Claims to have been fixed in 3.2 but I didn't check.

-- Arik

Posted by Arik

fixed in 3.3.0. I just backported to 2.x, works great

Fruitbat

Posted by fruitbat
