If you make elevators like this

23.07.2011 16:50

Last week I attended Barcamp Ljubljana. It was hosted at the IBM innovation center in Ljubljana's newest and tallest skyscraper. The event took place on the 3rd floor, which meant you had to use one of the six elevators to get there. This proved to be somewhat of a challenge for an ignorant visitor like me.

Upon entering the hall I saw a couple of elevators and no obvious means to call them. There were a few terminals like this on the wall:

Miconic 10 elevator terminal

Now, in my experience, elevators come in two varieties: those for public use, where you get one or two buttons to call them, and those for private use, which require some sort of authentication, either with a key, a card or a PIN code. The complicated keyboard in this particular example hinted strongly at the latter case.

I should mention that I'm not usually the type who skips instructions. But somehow I completely missed the A4-sized manual above the keyboard at the time (I only noticed it after the lunch break, when I also took this photograph).

After seeking help I was told that no password is required and that I only need to type in the floor I want to go to. I did so, and before the elevator arrived, a group of people joined me in the hall. They punched a few numbers on the keypad and looked reasonably familiar with the system.

When the elevator came, it was obviously too small for all of us, so some stayed behind in the hall to wait for another one. This confused the system somewhat, because the people who had punched floors 1 and 2 didn't actually get in. We stopped at both floors, and the knowledgeable co-passengers knew they had to wave their hands between the doors to trick the elevator into closing them and continuing.

After the lunch break I read through the manual and noticed that every passenger is supposed to type in their destination separately (which the group obviously didn't), with at least 2.5 seconds between different passengers, and that the doors will not close until the elevator has seen people enter or leave the cabin (the fact that there is no button in the cabin to override this is listed as a feature).

So for whoever designed this Miconic 10 system, here's a free user-testing conclusion: whatever theoretical gain this system can produce in travel time by using the additional information it has over the traditional up-down call button is completely offset by the practical limitations of lowly humans, who change their minds and don't want to stick to overly complicated procedures.

Posted by Tomaž | Categories: Life | Comments »

A case for cjson

21.07.2011 20:53

These days JSON is a quite popular data serialization format. Although it comes from JavaScript, it is well supported in a lot of languages. CPython, for instance, has shipped a JSON encoder and decoder in its standard library since version 2.6.
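As a quick illustration (a minimal sketch using the standard library json module; the document contents are made up), encoding and decoding is a simple round trip:

```python
import json

# Encode a Python structure to a JSON string and decode it back.
doc = {"title": "example", "tags": ["json", "python"], "id": 42}
encoded = json.dumps(doc)
decoded = json.loads(encoded)
print(decoded == doc)  # True
```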

Zemanta is using JSON extensively in its backend. For instance it's the format of choice for storing denormalized views in MySQL or key-value stores, for various logs and for passing parameters through various RPC mechanisms. We are storing a significant quantity of data in JSON - our largest production table containing compressed JSON-encoded structures is approaching 500 GB.

Naturally, at those scales encoding and decoding performance starts to matter. In Python code we have mostly been using python-cjson, in hindsight mostly because of its higher performance over the proverbially slow JSON implementations in Python's standard library.

While moderately popular (the current Debian stable distribution lists 16 packages that depend on it), python-cjson has its share of problems. The official distribution hasn't been updated since 2007. In fact, most distributions ship a patched version that fixes a buffer overflow vulnerability in its encode function. I've also recently discovered that encoding strings that contain non-BMP Unicode characters results in non-standard JSON that will break some parsers (most notably Java's Jackson). However, my patch to fix this error was just as unsuccessful in getting applied to the official distribution as were the patches by the Ubuntu security team.
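For reference, a standards-compliant encoder escapes a character outside the Basic Multilingual Plane as a UTF-16 surrogate pair. Here is a sketch using the standard library encoder (cjson itself is not shown, since the bug is precisely that it deviates from this output):

```python
import json

# U+1F600 lies outside the BMP, so an escaping encoder must emit it
# as a \uXXXX\uXXXX surrogate pair to produce standard JSON.
s = "\U0001F600"
print(json.dumps(s))  # "\ud83d\ude00"
```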

How far ahead of the standard library implementation is cjson actually in terms of performance? Is it enough to justify dealing with the issues above? To find out, I took 55 MB worth of JSON from one of Zemanta's databases and did some benchmarking.

The test data consisted of 10000 separate documents. Each of them encodes a top-level dictionary, mostly with strings for values and an occasional list or second-level dictionary thrown in. I used the time utility to measure the CPU time consumed by the interpreter running a simple loop that decoded the files one by one:

import glob
import json

for path in glob.glob("data/*")[:10000]:
	f = open(path)
	data = f.read()
	f.close()
	struct = json.loads(data)

I read the file into a string first because cjson doesn't support decoding from a file and because that is the typical use at Zemanta. The explicit close() is there to keep PyPy from using up all the file handles.

I tested several pairs of interpreters and JSON libraries, as is evident from the chart below. For each pair I did 5 runs of the script with the json.loads() line (adapted as necessary for different module and function names), giving t_benchmark, and 5 runs without it, giving t_overhead. The final result was calculated as:

t = \min_j t_{\mathrm{benchmark},j} - \min_i t_{\mathrm{overhead},i}

This takes the fastest observed run and subtracts the file-loading overhead.
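In plain Python, with made-up timings, the calculation looks like this:

```python
# Hypothetical timings in seconds from 5 runs of each variant.
t_benchmark = [12.3, 12.1, 12.4, 12.2, 12.5]  # loop with json.loads()
t_overhead = [0.31, 0.30, 0.32, 0.30, 0.31]   # loop without it

# Fastest observed benchmark run minus fastest observed overhead run.
t = min(t_benchmark) - min(t_overhead)
print(round(t, 2))  # 11.8
```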

The test machine was an Intel Core2 Duo at 2.20 GHz running Debian Squeeze amd64 with stock Python 2.5, 2.6 and 3.1 packages and a Python 2.7 package installed from the Wheezy repository. PyPy was installed from a binary build (pypy-c-jit-43780-b590cf6de419-linux64).

Performance of different Python JSON parsers

As you can see, some of the results are quite surprising. Among the CPython results, the ancient simplejson running on Python 2.5 leads. Python 2.6 had a staggering performance regression. I found this so surprising that I ran the test on other machines, with similar results. This thread suggests that it is due to some optimizations being left out of the 2.6 release. The performance problem seems to have been fixed in later Python versions.

The ratio of CPU time consumed by cjson to that of stock Python 2.7 is around 1 to 1.6. I would say that is just on the edge of being negligible, and I will probably not use cjson for new projects. However, for code that is still migrating from Python 2.5 to newer versions, cjson has the valuable advantage of being consistently well-performing.

PyPy obviously still needs to do some catching-up in the field of JSON parsing.

Another curiosity was hiding in the overhead timings (that is, how long it took the interpreter just to read 55 MB from 10000 files into strings, without decoding). All runs were pretty consistent at around 300 ms (the dataset was small enough to fit into the cache after the first read), except for Python 3.1, which at 870 ms took almost three times as much. But finding the cause of that can wait for another blog post.

Posted by Tomaž | Categories: Code | Comments »

Django admin with large tables

05.07.2011 22:33

It is a pretty well-known fact that an unconstrained SELECT COUNT(*) on MySQL InnoDB tables is slow. It is also known that the Django admin likes to do counts on tables in its change list view so that it can draw a pretty pagination widget at the bottom. These two facts make for a miserable admin experience when you have more than a handful of rows in your database. One visitor to the admin site will at best have to wait a while before the page loads; several will most likely bring the database server to its knees.

Most work-arounds for this problem I've seen, like for instance the patch attached to ticket #8408, focus on removing the need for the row count from the pagination logic and require patching Django.

Yesterday I came up with this solution that only requires a somewhat dirty monkey patch to the QuerySet object and seems to run fine on stock Django (version 1.2.5 in this case):

from django.db import connections
from django.db.models.query import QuerySet

class ApproxCountQuerySet(QuerySet):
	"""Counting all rows is very expensive on large InnoDB tables. This
	is a replacement for QuerySet that returns an approximation if count()
	is called with no additional constraints. In all other cases it should
	behave exactly as QuerySet.
	"""
	def count(self):
		# Code from django/db/models/query.py
		if self._result_cache is not None and not self._iter:
			return len(self._result_cache)

		query = self.query
		if (not query.where) and \
				query.high_mark is None and \
				query.low_mark == 0 and \
				(not query.select) and \
				(not query.group_by) and \
				(not query.having) and \
				(not query.distinct):
			# If the query has no constraints, we would simply be doing
			# "SELECT COUNT(*) FROM foo". Monkey patch so that we
			# get an approximation instead.
			cursor = connections[self.db].cursor()
			cursor.execute("SHOW TABLE STATUS LIKE %s",
					(self.model._meta.db_table,))
			return cursor.fetchall()[0][4]
		else:
			return self.query.get_count(using=self.db)

As you can see, this substitutes the SELECT COUNT(*) with an approximate count from the row count column of the SHOW TABLE STATUS output. While this query returns instantaneously on InnoDB tables, it only gives an approximate count (in fact, executing the query twice on an otherwise idle server can give you two similar, but different, counts).
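The magic index 4 in the snippet corresponds to the Rows column of SHOW TABLE STATUS output, whose leading columns are Name, Engine, Version, Row_format and Rows. A small simulation, using a made-up result row for a hypothetical table, of what the fetch picks out:

```python
# Leading columns of MySQL's SHOW TABLE STATUS result set.
status_columns = ["Name", "Engine", "Version", "Row_format", "Rows"]

# A simulated result row for a hypothetical InnoDB table.
row = ("foo_bar", "InnoDB", 10, "Compact", 12345678)

# fetchall()[0][4] selects the estimated row count.
approx_count = row[status_columns.index("Rows")]
print(approx_count)  # 12345678
```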

Of course, even ignoring for the moment that this is highly MySQL-specific, blindly using this can be dangerous if your code expects the count to actually be accurate. However, the Django admin page seems to be pretty happy with it. You can plug it into your admin.py like this:

class FooAdmin(admin.ModelAdmin): 
	def queryset(self, request): 
		qs = super(FooAdmin, self).queryset(request) 
		return qs._clone(klass=ApproxCountQuerySet) 

Of course, curious people clicking on page number 100000 can cause just as much trouble for the server as those count queries we just removed. Removing those links from the page templates is somewhat easier and left as an exercise for the reader.

Posted by Tomaž | Categories: Code | Comments »