A new kind of mess

29.04.2010 15:29

One of the features of Zemanta API is image suggestion. For example, if you are writing an article about bowls of petunias, Zemanta will suggest nice photographs of that particular kind of plant life to go with your post.

A part of those suggested images comes from English Wikipedia and Wikimedia Commons. And since computer vision just isn't there yet, Zemanta's back-end can only learn about the content of the images from text and various other machine-readable data that is related to that image.

The problem is that while usable data is relatively abundant, it is scattered around English and Commons wikis. It's present directly or in various more or less complicated templates that appear on image description pages. Articles that include an image and captions they use also hold clues to the image content. Sometimes it's even necessary to go through 5 of more jumps between you can connect a useful piece of metadata, like for example between an article including a "Wikimedia Commons has more media related to" box and the actual picture.

Previously, this data extraction was performed in a traditional way: a series of Python scripts read the dumps provided by Wikimedia and stored the information in various MySQL tables. In its last incarnation, this system took around 30 hours to process both wikis. This might not appear much, but in every new dump something inevitably breaks. Perhaps it's a critical template that has been renamed or a minor markup change exposed an odd bug in a Python script. This meant that often two or three sessions were required and a new dump quickly consumed a week worth of work.

Old image processing system using Python scripts and MySQL

Old image processing system using Python scripts and MySQL tables.

Two weeks ago I replaced this monster with a new one built upon the map-reduce paradigm that's all the rage these days. It performs the job in a little over 2 hours and uses Disco framework. This basically trades indexed MySQL table access (lots of expensive hard-disk seeks) for multiple passes over sorted flat files. These use sequential reads and are thus much faster. In practice however, there's still a lot of disk seeking going on, because Disco will sort these huge files on-disk (actually using GNU sort behind the curtains). But obviously the performance improvement is still significant.

I should also mention that Disco took some significant hacking before it became useful, so I can't really say it's a mature solution.

New image processing system using map-reduce

New image processing system using map-reduce.

As far as complexity of Zemanta's system goes, it hasn't gone down either. The part of the code that was directly affected by this change went from around 560 lines of Python to a bit over 1100. On the other hand the boxes in the graph above (representing individual map-reduce jobs) are now much more separated from each other. I guess only time will tell if this will be easier of harder to maintain. One thing is certain: the development cycle has become much faster.

Finally, the time improvement of an order of magnitude starts to look way less impressive when you take into account that the old system used one CPU on one machine while the new one takes two machines with 12 CPUs each. But I guess that doesn't matter if you have processors sitting idle in the rack.

Posted by Tomaž | Categories: Code

Add a new comment

(No HTML tags allowed. Separate paragraphs with a blank line.)