Lengths of article titles

30.10.2007 14:34

How long is the average title of an article on the web? Jure needed this information yesterday when he was designing some part of Zemanta's web interface.

The nice thing about having half of the web [1] cleaned up and stored in your database is that you can get answers to questions like this with a simple SQL query. A gnuplot one-liner later, we came up with this:

Article title length histogram

It's interesting how the histogram has a sharp spike at around 30 characters in an otherwise smooth bell-like curve.

[1] OK, maybe half of the web that's not porn.
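The original numbers came from an SQL query and gnuplot, neither of which is reproduced here. As a rough stand-in, the same kind of histogram could be computed in Python along these lines; the function names and the input list are made up for illustration:

```python
from collections import Counter


def title_length_histogram(titles):
    """Map each title length (in characters) to the number of titles of that length."""
    return dict(Counter(len(title) for title in titles))


def print_histogram(histogram):
    """Write 'length<TAB>count' lines, a layout that plotting tools can read directly."""
    for length in sorted(histogram):
        print("%d\t%d" % (length, histogram[length]))


if __name__ == "__main__":
    # Hypothetical sample data; the real input would come from the database.
    sample_titles = ["Quantum top", "PCB coasters", "Science museum"]
    print_histogram(title_length_histogram(sample_titles))
```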

Posted by Tomaž | Categories: Life | Comments »

SI units the English way

28.10.2007 10:19
Posted by Tomaž | Categories: Life | Comments »

Quantum top

27.10.2007 10:00

Our development server sometimes shows weird things in top (it also tends to crash quite a bit):

It's not limited to 9999% CPU - I've seen all sorts of strange readings, larger than 100%.

Maybe we had a quantum computer in our living room all this time without knowing it.

Posted by Tomaž | Categories: Code | Comments »

Dorkbot London

26.10.2007 20:00

Yesterday Boštjan and I went to see Dorkbot London. The place (called "01" in Soho) reminded us of Kiberpipa and the event was surprisingly like a couple of POT talks in a row. There were somewhere around 50 or 60 people in the audience, more than there were chairs available.

Introduction

The event consisted of three talks. The first was by James Larsson, who presented a scary modification of the original Pong video game: he replaced the two joystick controllers with a pair of pressure-sensitive leather boots on a table. The players controlled their pads by squeezing the boots, and a motorized whip hit the unfortunate loser.

Modified Pong console

The part of his contraption I found the most interesting was how he controlled the whip. The AY-3-8500 chip on which the Pong game runs doesn't have any digital outputs that would indicate which player lost. So in order for his machine to know which player to punish, he made a circuit that figured out the last position of the ball from the analogue video signal produced by the chip. This seemed very impressive to me at first (especially since only a couple of simple logic chips seemed to be enough - see picture above). However, if you read the description of the chip, you see that it produces separate video signals for each object - ball, pads, background, etc. This makes the feat much more credible.

Matthew Garrett on OLPC

The second talk was by Matthew Garrett about the OLPC project. Nothing new here, I only got the impression that maybe they set the goals of the project a bit too optimistically. It's been 2 years since the announcement of the project and according to the presentation they still have a lot of problems with software.

The final talk was by Tim Hunkin, creator of some very interesting arcade machines. Judging by the videos of his machines that he showed us, his creations are incredibly low tech (he said they are controlled by nothing more complicated than some industrial PLCs) and incredibly funny / interesting. For example, the Mobility Masterclass game uses a camera moving on a robotic arm through a model of a street to produce the video that the player sees on her screen. There's also Rent-a-dog, for which he recorded the video on a scale replica of the nearby street that he constructed out of photographs glued to cardboard.

His machines are great examples of how games can be immersive even if the technical background is simple and the display isn't pixel-perfect. I would love to go see his arcade (most machines are on display in a pier pavilion over the sea), but as far as I know it's not very easy to get there by public transport.

Posted by Tomaž | Categories: Life | Comments »

Large data structures in Python

20.10.2007 20:57

A lot of my work at Zemanta has to do with storing large amounts of data (like titles of lots and lots of Wikipedia pages) in memory. Since the main problem here is running out of swap space, I've done a couple of simple experiments with different data structures in Python and measured how much memory each of them used versus the number of stored objects.

How I got these results: I used Python 2.4.4 (as packaged for current Debian Testing). The test machine is running Linux kernel 2.6.21 (again from Debian Testing) and has 1 GB of RAM. The metric is virtual memory size, as reported by the kernel in /proc/*/status (I used this piece of code).
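The linked helper is not reproduced on this page. A minimal sketch of what a memory() function like the one called in the tests below might look like is shown here, assuming Linux and the VmSize field of /proc/<pid>/status. The name parse_vm_size and the choice to return bytes are assumptions for illustration, not the original code:

```python
def parse_vm_size(status_text):
    """Return the VmSize value found in /proc/<pid>/status text, in bytes.

    The kernel reports the value as e.g. 'VmSize:     123456 kB'.
    Returns 0 if no VmSize line is present.
    """
    for line in status_text.splitlines():
        if line.startswith("VmSize:"):
            fields = line.split()  # ['VmSize:', '123456', 'kB']
            return int(fields[1]) * 1024
    return 0


def memory(pid="self"):
    """Virtual memory size of a process in bytes (Linux only)."""
    with open("/proc/%s/status" % pid) as status_file:
        return parse_vm_size(status_file.read())
```

Returning bytes fits the division by 1024.0 twice in the test scripts, which would then print megabytes.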

First I tested dictionaries (or hashes in Perl-speak). The test code simply adds entries to the dictionary one by one and writes out the virtual memory size every 1000 iterations. I've made four different tests here: simple integer to integer mapping, string to integer, string to list and string to tuple.

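import sys

# memory() is the helper linked above; it reports the virtual memory size.

bighash={}
count=0
step=1000

while True:

	# Uncomment exactly one of the four assignments below to select a test.
	for n in xrange(step):
		# bighash[count]=1		#1: integer to integer
		# bighash[str(count)]=1		#2: string to integer
		# bighash[str(count)]=[1]	#3: string to list
		# bighash[str(count)]=(1)	#4: string to tuple; note that (1) is the int 1, a one-element tuple is (1,)

		count+=1


	if count>=10000000:
		sys.exit(0)

	# Print the entry count and the memory size scaled down by 1024*1024.
	sys.stdout.write("%d\t%f\n"%(count, memory()/1024.0/1024.0))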

Results:

The most interesting point here is how inefficient lists are compared to tuples - there is no significant difference between storing an integer in the dictionary and storing a tuple containing a single integer.
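To see what each test actually stores as the dictionary value, the values can be inspected directly. This is a rough sketch and not part of the original measurements: it assumes a Python that provides sys.getsizeof (2.6 or later, so newer than the 2.4.4 used above), and describe() is a name made up for illustration. Note that Python evaluates (1) to the plain integer 1 and that a one-element tuple is written (1,), so a test that stores (1) probably stores the same kind of value as the string to integer test.

```python
import sys


def describe(value):
    """Return the type name and the shallow size in bytes of a value."""
    return type(value).__name__, sys.getsizeof(value)


# Labels are the source text, values are what Python actually builds from it.
candidates = [("1", 1), ("(1)", (1)), ("(1,)", (1,)), ("[1]", [1])]

for label, value in candidates:
    print("%-5s %s" % (label, describe(value)))
```

sys.getsizeof is shallow: for the tuple and the list it counts the container only, not the integer inside it, so the numbers are a lower bound on what each dictionary entry costs.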

In the second part I compared tuples and lists. For the first test I used the append method to add an element to the list on each iteration. For the second test I repeated that with tuples. However, because they are immutable, I created a new tuple with one additional element on each iteration. The third test is identical to the second except that this time it was done with lists (i.e. I treated lists as if they were immutable like tuples).

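import sys

# memory() is the helper linked above; it reports the virtual memory size.

# Uncomment the initialization that matches the selected test.
# biglist=[]	#1 and #3
# bigtuple=()	#2
count=0
step=1000

while True:

	# Uncomment exactly one of the three statements below to select a test.
	for n in xrange(step):
		# biglist.append(1)		#1: append in place
		# bigtuple=bigtuple+(1,)	#2: build a new tuple each time
		# biglist=biglist+[1]		#3: build a new list each time

		count+=1


	if count>=100000:
		sys.exit(0)

	sys.stdout.write("%d\t%f\n"%(count, memory()/1024.0/1024.0))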

The second and third tests ran so slowly that I only tested sizes from 1 to 100000.

The interesting result here is that tuples do not seem to be significantly more efficient than lists when storing a lot of items (see the right end of the lower graph). However, adding another element by creating a new tuple is very inefficient and takes a lot of memory and CPU time (as expected for an immutable data structure).
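Why the second and third tests are so slow can be illustrated by counting how many elements get written into new containers. The sketch below is only an illustration of the copying cost, not a measurement, and the function names are invented for it. It assumes that appending to a list writes one new element per step (ignoring occasional reallocation), while t + (1,) builds a complete new tuple each time.

```python
def elements_written_by_append(n):
    """Appending in place writes one new element per step."""
    items = []
    written = 0
    for _ in range(n):
        items.append(1)
        written += 1
    return written


def elements_written_by_concatenation(n):
    """t + (1,) creates a new tuple holding every old element plus the new one."""
    items = ()
    written = 0
    for _ in range(n):
        items = items + (1,)
        written += len(items)  # the whole new tuple had to be filled in
    return written


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print("%d\t%d\t%d" % (n, elements_written_by_append(n),
                              elements_written_by_concatenation(n)))
```

The second count grows as n(n+1)/2, which is why going from 100000 to 10000000 elements was not practical for those two tests.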

Posted by Tomaž | Categories: Code | Comments »

Wikipedia is broken

14.10.2007 18:24

When the World Wide Web and HTML were designed, a decision was made to try to make web page authoring as easy as possible. That meant that web browsers gracefully accepted all documents, even those that did not strictly conform to the HTML syntax, and tried their best to present them on the screen in the way document authors intended. This was probably one of the key factors of why WWW became so popular - everyone with a text editor and some patience could come up with some tag soup document that would be silently rendered by his web browser without displaying a single error message. However this also became a major problem of the web, because no one wrote standards-compliant HTML and browsers were forced to become more and more complex to cope with all the mistyped garbage that was floating around.

Code

Wikipedia was founded a good 10 years after the World Wide Web and its current engine, MediaWiki, a year later. At that time the tag soup problem of the web was already well-known. You would think that the founders of Wikipedia would learn from history and would know that giving your users too much freedom in regard to markup syntax will only lead to problems. In reality it seems that exactly the opposite is true.

The syntax behind Wikipedia pages today is so diverse and so filled with hacks and workarounds for errors and typos made by page editors that the only thing capable of properly rendering a page from Wikipedia is MediaWiki itself. It's wonderfully difficult to use Wikipedia dumps from any other software and for any purpose other than displaying them in the browser. For example, it takes a 2000-line Perl script, Wikipedia Preprocessor (Wikiprep), to make sense of most of the garbage and make the information even remotely machine-readable.

Consider for example this comment from Wikiprep:

# The correct form to create a redirect is #REDIRECT [[ link ]],
# and function 'Parse::MediaWikiDump::page->redirect' only supports this form.
# However, it seems that Wikipedia can also tolerate a variety of other forms, such as
# REDIRECT|REDIRECTS|REDIRECTED|REDIRECTION, then an optional ":", optional "to" or optional "=".
# Therefore, we use our own function to handle these cases as well.
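To give an idea of what tolerating these forms means in practice, here is a small Python sketch of a redirect parser that accepts the variants listed in the comment. It is not Wikiprep's actual code (which is Perl). The regular expression, the function name, the case-insensitive matching and the handling of a "|" inside the link are all assumptions made for illustration:

```python
import re

# '#REDIRECT' with an optional S/ED/ION suffix, then an optional ':',
# an optional 'to' or '=', and finally the [[link]] itself.
REDIRECT_RE = re.compile(
    r"\s*#REDIRECT(?:S|ED|ION)?\s*:?\s*(?:to|=)?\s*"
    r"\[\[\s*([^\]|]+?)\s*(?:\|[^\]]*)?\]\]",
    re.IGNORECASE,
)


def redirect_target(wikitext):
    """Return the redirect target of a page, or None if it is not a redirect."""
    match = REDIRECT_RE.match(wikitext)
    if match is None:
        return None
    return match.group(1)


if __name__ == "__main__":
    for text in ("#REDIRECT [[ link ]]", "#REDIRECTS to [[Foo bar]]", "Plain text"):
        print("%r -> %r" % (text, redirect_target(text)))
```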

What possible reason could there be to allow this kind of flexibility in the markup syntax? The only one I can think of is that some administrator noticed a broken page that, for example, had a "REDIRECTS" keyword instead of "REDIRECT" and, instead of fixing that page, fixed MediaWiki to support this typo. There are a lot of other cases like this. For example, disambiguation pages can be marked with {{disambiguation}}, {{disambig}} or {{dab}} for the benefit of those who can't remember the name. Then there is this strange policy of ignoring the case of the first letter in a page title and distinguishing the case of subsequent letters. I can't imagine a good reason for that.
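Any software that reads the dumps has to reproduce these rules. A minimal Python sketch of the two mentioned above follows. The function names are invented, the case-insensitive template match is an assumption, and other parts of title handling (whitespace, underscores, namespaces) are deliberately ignored:

```python
import re

# The three template names mentioned above; matching them
# case-insensitively is an assumption.
DISAMBIGUATION_RE = re.compile(
    r"\{\{\s*(?:disambiguation|disambig|dab)\s*\}\}", re.IGNORECASE
)


def normalize_title(title):
    """Upper-case the first letter only; the rest of the title stays case-sensitive."""
    return title[:1].upper() + title[1:]


def same_page(title_a, title_b):
    """True if two titles refer to the same page under the first-letter rule."""
    return normalize_title(title_a) == normalize_title(title_b)


def is_disambiguation(wikitext):
    """True if the page source carries one of the disambiguation templates."""
    return DISAMBIGUATION_RE.search(wikitext) is not None
```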

In the end I have a feeling the syntax itself is starting to bite back. With time it got more and more complex. Take for example the source of this Wikipedia template:

<div class="notice metadata" id="disambig">
{|style="background:none"
|style="vertical-align:middle;"|[[Image:Disambig gray.svg|30px]]
|style="vertical-align:middle;"|''This
[[Wikipedia:Disambiguation|disambiguation]] page lists articles about distinct
geographical locations with the same name. If <!-- you are viewing this
online as opposed to as a [[hard copy]] and -->an
[[Special:Whatlinkshere/{{FULLPAGENAME}}|internal link]] led you here, you may
wish to change the link to point directly to the intended article.''
|}</div>
</div><includeonly>[[Category:Ambiguous place
names]]</includeonly><noinclude>[[Category:Disambiguation and
redirection templates|Geodis]]</noinclude>

This is neither human nor machine readable, and the only thing that can make sense out of it is MediaWiki itself, with its 100000 lines of PHP code dedicated to interpreting a mess like this. Just figuring out what gets included from a template page is complex, full of special cases and exceptions:

# We're storing template text for future inclusion, therefore,
# remove all <noinclude> text and keep all <includeonly> text
# (but eliminate <includeonly> tags per se).
# However, if <onlyinclude> ... </onlyinclude> parts are present,
# then only keep them and discard the rest of the template body.
# This is because using <onlyinclude> on a text fragment is
# equivalent to enclosing it in <includeonly> tags **AND**
# enclosing all the rest of the template body in <noinclude> tags.
# These definitions can easily span several lines, hence the "/s" modifiers.
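The rules in this comment can be sketched in a few lines of Python. This is a simplified illustration and not the Wikiprep code: the function name is made up, nested or unbalanced tags are not handled, and re.DOTALL plays the role of the "/s" modifier mentioned in the comment.

```python
import re

NOINCLUDE_RE = re.compile(r"<noinclude>.*?</noinclude>", re.DOTALL)
INCLUDEONLY_TAG_RE = re.compile(r"</?includeonly>")
ONLYINCLUDE_RE = re.compile(r"<onlyinclude>(.*?)</onlyinclude>", re.DOTALL)


def text_for_inclusion(template_source):
    """Return the part of a template's source that gets included in other pages."""
    # If <onlyinclude> parts are present, keep only them and discard the rest.
    only_parts = ONLYINCLUDE_RE.findall(template_source)
    if only_parts:
        return "".join(only_parts)
    # Otherwise drop <noinclude> text and keep <includeonly> text without its tags.
    text = NOINCLUDE_RE.sub("", template_source)
    return INCLUDEONLY_TAG_RE.sub("", text)
```

Applied to a source such as "A<noinclude>B</noinclude><includeonly>C</includeonly>", only "A" and "C" survive.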

The very Wiki markup that made Wikipedia accessible to many is now making it hard for ordinary people to contribute. If I want to make a new page on Wikipedia today and mess up the markup, there is a good chance it will get deleted. It isn't realistic to expect that people will read through long, boring pages describing the markup.

How exactly would one solve this problem? I don't know, but I'm sure it won't be easy - most of the pages on the Web still aren't standards-compliant. The difference with Wikipedia is that it is all under the control of the Wikimedia Foundation, so in theory it would be possible to try to automatically convert all pages to some saner, stricter markup and manually fix those that failed to convert. However, it would require an enormous effort and it would probably turn away a lot of current editors, so I don't think it will happen any time soon.

Posted by Tomaž | Categories: Code | Comments »

Article about nVidia

13.10.2007 19:43

There's an interesting article with some pictures about facilities at nVidia headquarters at FiringSquad.

It's surprising what extensive equipment they have even though they do not manufacture chips themselves. Granted they are one of the leading specialized integrated circuit design companies but I didn't know that companies that outsource their chip fabrication do chip testing at the level that is claimed by this article - chemical composition analysis, checking transistor level failures, etc.

They also say they are doing some things that I didn't even know were possible, such as changing on-chip connections with a gallium ion beam to diagnose a chip failure. Considering that a completely manufactured chip is probably impossible to get out of its package undamaged, I guess they are only doing this for diagnosing and fixing problems with prototypes they get on a bare die from the fab. However, even this is impressive. Does this mean that transistor level simulation tools aren't accurate enough to model some failures on their chips?

I also wonder where they get the failed chips they analyze. I doubt they do this kind of in-depth checking on every failed card they get in the mail. My guess would be only from trusted sources like other graphics card manufacturers that use their chips.

Posted by Tomaž | Categories: Life | Comments »

PCB coasters

11.10.2007 20:00

PCB coaster

I bought a couple of coasters like this yesterday at the Science Museum. They claim that they are made from recycled printed circuit boards.

From a close look they appear to be made out of a double-sided 6-layer PCB. I doubt that it is recycled though. You can see the gold colored metallization on SMD pads that would be covered if the pads ever had any solder on them. It is more likely that the material came from some stock of obsolete boards that were never assembled.

Posted by Tomaž | Categories: Life | Comments »

Science museum

10.10.2007 22:04

I took a break today from natural language processing and visited London's Science Museum.

I visited the Science Museum several years ago and the thing I remember the most from that visit is the big running steam engine you see right beyond the entrance. Well, there are still some steam engines in the first hall (right after you go through the mandatory backpack search), but I got the feeling that they're there simply because they're too large to move away. The focus of the museum now seems to be on more recent technology.

Right after the first hall you go through the space flight exhibition.

That's a full size replica of the Apollo LEM and the authentic Apollo 10 command module (that was the last lunar orbit mission before the first landing). What amazed me was the size of this thing. From the pictures I never got the impression just how much larger the landing module is compared to a human. The complete Saturn V stack must really have looked incredible.

On to the computing and electronics section. I didn't know that Ferranti was a known name in electronics well before integrated circuits. Judging from the Google results you get today, I had the feeling that they were mostly known for the innovative ASIC technology that made the Sinclair Spectrum's ULA possible.

This is a mechanical analogue computer that was used to research and predict economic changes. It uses water as an analogue for monetary value.

One of the first experiments with artificial intelligence. According to the looks and age of this device it probably uses some analogue electronic circuit to model human reactions.

The replica of Babbage's Difference Engine is one of the highlights of the Science Museum's collection. They are building another replica for display in an American museum.

There were also some art installations on display. This particular one caught everybody's eye because of the big "DO NOT TOUCH" sign. Of course, who can resist touching a shiny unusual object, especially if there are no obvious obstacles? In the end it turns out that it will only give you a slight electric shock and emit a loud "Bzzzt" sound.

Ok, according to some screams maybe it's not so slight.

Posted by Tomaž | Categories: Life | Comments »

Weird priorities

09.10.2007 12:49

The English have some weird priorities regarding household safety.

On one side they seem absolutely paranoid about everything dealing with electricity. I've seen this last year in Lancaster as well as now in London. Every wall plug has a dedicated switch. Larger electrical appliances, like our electric oven for example, have an additional big switch on the wall with a red warning light. Everything, from extension cords to simple continental-to-UK plug adapters has its own fuse. At Lancaster University everything that had even a remote connection with electricity, from computers, toasters to extension cords and cables, had to be periodically checked, sealed and signed by an authorized electrician.

On the other side, the gas stove in our house in London doesn't have safety valves that automatically shut off the gas if the flame goes out, to prevent gas from building up in the room. I also don't see a clearly marked gas shut-off valve anywhere (the kind you usually see in Slovenia where houses are connected to city gas lines). Quite unbelievable. You definitely don't want to get electrocuted, but a gas explosion can take down the house.

I have this feeling that electricity is still regarded as something new, unknown and dangerous while domestic gas has been used for centuries and is a well known, tamed beast.

Posted by Tomaž | Categories: Life | Comments »

Gimping Galaksija

07.10.2007 13:24

Some time ago (probably while I was waiting to get my thesis approved) I didn't have anything better to do, so I played with the printed circuit board masks for the Galaksija motherboard in GIMP. I just found these two images again today while doing some hard disk clean-up.

This is what a professionally made Galaksija PCB would look like, with green solder mask, white silk layer print and gold plated contacts. I got the idea for doing this from a post on the gEDA mailing list. Maybe I'll eventually hack up a GIMP script that will do this automatically from a PCB file. A picture like this is useful for a last sanity check on a board before sending it to the manufacturer.

This one is a bit weirder and shows what the motherboard would look like on an X-ray machine.

Posted by Tomaž | Categories: Ideas | Comments »

Business card

05.10.2007 22:50

Guess what I got today...

My Zemanta Business card

Posted by Tomaž | Categories: Life | Comments »

FOWA 2007

04.10.2007 22:20

I visited the Future of Web Applications conference today. Zemanta has a little booth there right behind the registration desk. It's funny to think that it's in as exposed a place as the booth next door, which belongs to Adobe, one of the main sponsors of the conference.

Entrance to the conference center

Zemanta booth

The rest

After listening to talks about such and such planned social networking sites I have mixed feelings. I myself would just not be comfortable with giving away that much personal information (or even ability to track my location at any moment!) to some business.

It's interesting that people get really angry when some government wants to introduce some technology that would in theory enable tracking of people, but on the other hand they happily volunteer to be tracked by some commercial web site.

One notable project that caught my eye was wakoopa.com. It's a site that tracks what software you use and what software people you know use, and then recommends software that may be useful to you. Again you're sending scary personal information somewhere on the net, but seeing how many useful little applications I found in these 3 days of working with the other guys from Zemanta, I see why it could be useful.

Posted by Tomaž | Categories: Life | Comments »

Secret Zemanta headquarters

02.10.2007 17:37

For the next three months I'll be working for Zemanta on some advanced natural language processing algorithms. Not exactly my profession, but it's always interesting to try something new.

Outside

These are Zemanta's famous secret London headquarters where I'm staying. It's a typical English house with two floors and four little bedrooms like this:

Room

Internet access is unfortunately quite problematic here. Currently we have a Vodafone UMTS modem connection that is shared with all our computers through wireless LAN. Certainly not an ideal solution, but it works (sometimes). On the other hand it's nice to know just how much better Slovenian ISPs are compared to this one. This is the first time I've seen an ISP transparently replacing JPEG images with lower quality ones, inserting JavaScript into HTML pages and even blocking some domains completely.

Posted by Tomaž | Categories: Life | Comments »