Large data structures in Python

20.10.2007 20:57

A lot of my work at Zemanta has to do with storing large amounts of data (like titles of lots and lots of Wikipedia pages) in memory. Since the main problem here is running out of swap space, I've done a couple of simple experiments with different data structures in Python and measured how much memory each of them uses versus the number of stored objects.

How I got these results: I used Python 2.4.4 (as packaged for current Debian Testing). The test machine is running Linux kernel 2.6.21 (again from Debian Testing) and has 1 GB of RAM. The metric is virtual memory size, as reported by the kernel in /proc/*/status (I used this piece of code).
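For reference, here is a minimal sketch of what such a helper can look like (not necessarily identical to the linked snippet): it reads the VmSize line from /proc/self/status, which the Linux kernel reports in kB.

import os

def memory():
	"""Virtual memory size of this process in bytes, taken from the
	VmSize line in /proc/self/status (Linux reports the value in kB)."""
	f = open('/proc/%d/status' % os.getpid())
	try:
		for line in f:
			if line.startswith('VmSize:'):
				return int(line.split()[1]) * 1024
	finally:
		f.close()
	return 0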

First I tested dictionaries (or hashes in Perl-speak). The test code simply adds entries to the dictionary one by one and writes out the virtual memory size every 1000 iterations. I made four different tests here: simple integer to integer mapping, string to integer, string to list and string to tuple.

import sys

# memory() is the helper that reads the process's virtual memory size
# from /proc/*/status (see above).

bighash={}
count=0
step=1000

while True:

	for n in xrange(step):
		# bighash[count]=1		#1: int -> int
		# bighash[str(count)]=1		#2: str -> int
		# bighash[str(count)]=[1]	#3: str -> list
		# bighash[str(count)]=(1,)	#4: str -> tuple

		count+=1

	if count>=10000000:
		sys.exit(0)

	sys.stdout.write("%d\t%f\n"%(count, memory()/1024.0/1024.0))

Results:

The most interesting point here is how inefficient lists are compared to tuples: there is no significant difference in memory use between storing a plain integer in the dictionary and storing a tuple containing a single integer, while single-element lists are noticeably more expensive.
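As an aside, the per-object overhead can also be seen directly with sys.getsizeof. This is only an illustration: getsizeof was added in Python 2.6, so it is not available on the 2.4 setup used above, and the exact numbers vary with platform and Python version.

import sys

# Size in bytes of the object itself; exact values depend on the
# platform and Python version.
print sys.getsizeof(1)     # a plain integer
print sys.getsizeof((1,))  # a one-element tuple: items are stored inline
print sys.getsizeof([1])   # a one-element list: extra bookkeeping plus a
                           # separately allocated, resizable item array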

In the second part I compared tuples and lists. For the first test I used the append method to add an element to the list on each iteration. For the second test I repeated that with tuples. However, because tuples are immutable, I created a new tuple with one additional element on each iteration. The third test is identical to the second, except this time it was done with lists (i.e. I treated lists as if they were immutable, like tuples).

import sys

# memory() is the same helper as in the dictionary test.

# biglist=[]	#1 and #3
# bigtuple=()	#2
count=0
step=1000

while True:

	for n in xrange(step):
		# biglist.append(1)		#1: grow the list in place
		# bigtuple=bigtuple+(1,)	#2: build a new tuple each time
		# biglist=biglist+[1]		#3: build a new list each time

		count+=1

	if count>=100000:
		sys.exit(0)

	sys.stdout.write("%d\t%f\n"%(count, memory()/1024.0/1024.0))

The second and third tests ran so slowly that I only tested sizes from 1 to 100000.

The interesting result here is that tuples do not seem to be significantly more memory-efficient than lists when storing a lot of items (see the right end of the lower graph). However, adding another element by creating a new tuple is very inefficient and takes a lot of memory and CPU time, as expected for an immutable data structure.
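To get a feel for the CPU cost, here is a rough timeit comparison (illustrative only; the absolute numbers depend on the machine and Python version):

import timeit

# Growing a list in place: each append is amortized constant time.
grow_list = timeit.Timer("biglist.append(1)", "biglist = []")

# Rebuilding a tuple on each step copies all existing elements, so
# building an n-element tuple this way does O(n**2) work in total.
grow_tuple = timeit.Timer("bigtuple = bigtuple + (1,)", "bigtuple = ()")

n = 20000  # even at this modest size the quadratic behaviour is obvious
print "list.append:   %.2f s" % grow_list.timeit(n)
print "tuple rebuild: %.2f s" % grow_tuple.timeit(n)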

Posted by Tomaž | Categories: Code
