A case for cjson

21.07.2011 20:53

These days JSON is quite a popular data serialization format. Although it comes from Javascript, it is well supported in a lot of languages. CPython, for instance, has shipped a JSON encoder and decoder in its standard library since version 2.6.

Zemanta uses JSON extensively in its backend. For instance, it is the format of choice for storing denormalized views in MySQL or key-value stores, for various logs and for passing parameters through various RPC mechanisms. We are storing a significant quantity of data in JSON - our largest production table containing compressed JSON-encoded structures is approaching 500 GB.

Naturally, at those scales encode and decode performance starts to matter. In Python code we have mostly been using python-cjson, in hindsight mainly because of its higher performance compared to the proverbially slow JSON implementations in Python's standard library.

While moderately popular (the current Debian stable distribution lists 16 packages that depend on it), python-cjson has its share of problems. The official distribution hasn't been updated since 2007. In fact, most distributions ship a patched version that fixes a buffer overflow vulnerability in its encode function. I've also recently discovered that encoding strings containing non-BMP Unicode characters results in non-standard JSON that will break some parsers (most notably Java's Jackson). However, my patch to fix this error was just as unsuccessful in getting applied to the official distribution as the patches by the Ubuntu security team.
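For comparison, here is roughly what standards-compliant output looks like for a character outside the Basic Multilingual Plane. The example character is only an illustration, not the exact input that trips up cjson; the standard library encoder escapes it as a UTF-16 surrogate pair, which is what parsers such as Jackson expect:

import json

# U+1D11E (musical symbol G clef) lies outside the BMP. A compliant encoder
# escapes it as a UTF-16 surrogate pair.
print(json.dumps(u"\U0001D11E"))   # prints "\ud834\udd1e"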

How far ahead of the standard library implementation is cjson actually in terms of performance? Is it enough to justify dealing with the issues above? To find out, I took 55 MB worth of JSON from one of Zemanta's databases and did some benchmarking.

The test data consisted of 10000 separate documents. Each of them encodes a top-level dictionary, mostly with strings for values and the occasional list or second-level dictionary thrown in. I used the time utility to measure the CPU time consumed by the interpreter running a simple loop that decoded the files one by one:

import glob
import json

for path in glob.glob("data/*")[:10000]:
	f = open(path)
	data = f.read()
	f.close()
	struct = json.loads(data)

I read each file into a string first because cjson doesn't support decoding from a file object and because that is the typical usage at Zemanta. The explicit close() is there to keep PyPy from using up all the file handles.
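For reference, the cjson variant of that last line looks roughly like this (a sketch only; python-cjson exposes decode() and encode() instead of loads() and dumps()):

import cjson

# python-cjson can only decode from a string, hence the read() above
struct = cjson.decode(data)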

I tested several pairs of interpreters and JSON libraries, as is evident from the chart below. For each pair I did 5 runs of the script with the json.loads() line (adapted as necessary for different module and function names), giving t_benchmark, and 5 runs without it, giving t_overhead. The final result was calculated as:

t = \min_j t_{\text{benchmark},j} - \min_i t_{\text{overhead},i}

This takes the fastest observed run for each variant and subtracts the overhead of reading the files.
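In code, the calculation amounts to something like this (a minimal sketch; the timing values below are made-up placeholders, not actual measurements):

# Five CPU-time measurements per variant, in seconds (placeholder values).
t_benchmark = [13.1, 12.9, 13.0, 12.9, 13.2]  # loop including json.loads()
t_overhead = [0.31, 0.30, 0.30, 0.32, 0.31]   # same loop without the decode

# Fastest observed run, minus the overhead of just reading the files.
t = min(t_benchmark) - min(t_overhead)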

The test machine was an Intel Core 2 Duo at 2.20 GHz running Debian Squeeze amd64, with the stock Python 2.5, 2.6 and 3.1 packages and a Python 2.7 package installed from the Wheezy repository. PyPy was installed from a binary build (pypy-c-jit-43780-b590cf6de419-linux64).

Performance of different Python JSON parsers

As you can see, some of the results are quite surprising. Among the stock implementations, the ancient simplejson used with Python 2.5 leads. Python 2.6 shows a staggering performance regression. I found this so surprising that I ran the test on other machines, with similar results. This thread suggests that it is due to some optimizations being left out of the 2.6 release. The problem seems to have been fixed in later Python versions.

The ratio of CPU time consumed by cjson to that of the stock Python 2.7 json module is around 1 to 1.6. I would say that is just on the edge of being negligible and I will probably not use cjson for new projects. However, for code that is still migrating from Python 2.5 to newer versions, cjson has the valuable advantage of performing consistently well across interpreter versions.

PyPy obviously still needs to do some catching-up in the field of JSON parsing.

Another curiosity was hiding in the overhead timings (that is, how long it took the interpreter just to read 55 MB from 10000 files into strings without decoding). All runs were pretty consistent at around 300 ms (the dataset was small enough to fit into the cache after the first read), except for Python 3.1, which at 870 ms took almost three times as long. But finding the cause of that can wait for another blog post.

Posted by Tomaž | Categories: Code

Comments

Great post. If I remember correctly, Python 3.1 had some serious issues with IO performance, which were fixed in 3.2.
