BeautifulSoup tips
20.05.2008 22:26
BeautifulSoup is a wonderful tool for parsing XML sludge gathered on the bottom of the internet. However you pay for versatility and resistance to broken markup with a big performance hit compared to proper XML parsers. The documentation acknowledges that and suggests that you only parse a part of the document - it doesn't offer any tips on what to do if you really must walk through the whole tree.
Today I found the following trick with which I managed to speed some code that walks entire BeautifulSoup document trees by almost four times:
NavigableString and Tag objects offer a very useful string member which can be used to access the Unicode string contained in them.
This is very convenient, since you only access the string member no matter what kind of node you are processing. However, there is a catch: If the Tag object doesn't contain a single NavigableString, the string member evaluates to None (as per documentation). What happens behind the scenes is that the Tag object doesn't have the string member in __dict__ and the Tag's __getattr__() method gets called. This does an expensive search through all of Tag's attributes (because BeautifulSoup Tag's XML attributes can also be accessed as Python members) before returning None. So hitting the string member when there isn't any will do a lot of processing for nothing.
The second catch is that while the string member is implemented in __dict__ in Tag objects, it's implemented in a (fairly efficient) __getattr__() method in NavigableString.
So the following code:
s = tag.string
can be replaced with this equivalent, which was approximately 4 times faster in my case:
if isinstance(tag, NavigableString):
s = tag.string
else:
s = tag.__dict__.get('string', None)
Of course, using hacks like this means that your application now depends on library internals, so big fat warnings are probably in order around such code. The code above was tested with BeautifulSoup 3.0.5 - it's quite possible that newer versions will have things done differently.
