Computers want to learn too

28.11.2008 21:01

Wikipedia is a wonderful learning resource. It provides a wealth of easily browsable articles on just about every topic. An article on English Wikipedia is a great starting point when you're either merely curious about a specific topic or you're just beginning a more serious study of a subject. Indeed, the ease of access to that much knowledge even poses a problem for some.

XKCD: The problem with Wikipedia

Image by Randall Munroe CC BY-NC 2.5

This is all the realization of the dreams the original creators had of Wikipedia becoming a mainstream, freely accessible and editable encyclopedia. However, what they probably didn't envisage is that their site would also become an invaluable resource for computers to learn about the world. English Wikipedia, as one of the largest freely accessible corpora, has also become an important resource in machine learning research. A lot of research in natural language processing, search algorithms, text classification and similar fields is based on data gathered from Wikipedia. Results of this research are now being used by a number of companies and non-profit projects - some directly, like Wikia, Powerset, Tagaroo, FreeBase, DBPedia and last but not least Zemanta. Many more are using them indirectly, maybe even unknowingly, by employing methods and algorithms developed from research that was based on, or evaluated against, data from Wikipedia.

What makes Wikipedia inviting for research is that it's the best real-life approximation of a very large repository of structured information. Why is this structure important? After all the promises of the past decades, artificial intelligence research has failed to come up with a system that could understand natural language to a degree comparable with an average human. With the hopes that a computer could ever learn directly from plain text dashed, it was realized that in order to make computer systems smarter, people must help them understand important pieces of text. This means that concepts in the text must be clearly marked as having some specific meaning. Only then can the current state-of-the-art algorithms start learning from it, giving rise to intelligent systems that know how to suggest what book you might want to read next, or that can directly answer your questions instead of just pointing you to a semi-relevant webpage and leaving the tough part of extracting the information from its text to you.

While Wikipedia isn't properly semantically tagged, it is a good approximation. What makes this possible is its use of templates - an editing tool originally designed to ease input of data and standardize the layout of specific classes of topics. Since text is entered into templates through a standardized set of parameters, the template gives that text a structure that can be used for more than just page layout. For example, text that is entered for the parameter birthdate in the Infobox People template suddenly becomes a piece of information with a certain meaning: the person described by the article was born on the date given by that piece of text. Even the mere presence of Infobox People on a page classifies that page as a biographical page.
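To make the idea concrete, here is a minimal sketch of how a program might pull meaning out of a template. The template and parameter names follow the example above; the sample article text is made up, and the parser is deliberately naive (it assumes no nested templates and no pipes inside values), so treat it as an illustration rather than a real wikitext parser.

```python
import re

def parse_infobox(wikitext, template_name):
    """Return the parameters of the first `template_name` template as a dict.

    Simplification: assumes no nested templates or pipes inside values.
    Returns None when the template does not appear on the page.
    """
    pattern = r"\{\{\s*" + re.escape(template_name) + r"\s*\|(.*?)\}\}"
    match = re.search(pattern, wikitext, re.DOTALL | re.IGNORECASE)
    if match is None:
        return None
    params = {}
    for field in match.group(1).split("|"):
        if "=" in field:
            key, value = field.split("=", 1)
            params[key.strip()] = value.strip()
    return params

# Made-up article text, for illustration only.
article = """{{Infobox People
| name = Ada Lovelace
| birthdate = 10 December 1815
}}
'''Ada Lovelace''' was an English mathematician."""

info = parse_infobox(article, "Infobox People")
is_biography = info is not None      # presence of the template classifies the page
print(is_biography, info["birthdate"])
```

The point is that no language understanding is needed: the template name classifies the page and the parameter name gives the value its meaning.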

DBpedia links between databases

Image by DBpedia

However, not all templates are created equal. Wikipedia, as a collaboratively edited project, has a curious property: a technical feature (like templates) will only be used properly when misuse of that feature is blatantly obvious to ordinary (human) visitors of Wikipedia.

Take for example the category hierarchy. The MediaWiki software that powers Wikipedia supports assigning articles to a hierarchy of categories. By themselves these categories seem like a more natural way of classifying articles than checking which page uses which Infobox template. A closer look, however, reveals that the category system is wonderfully abused: a lot of pages are put in completely wrong categories, and the hierarchy is full of cycles and nonsensical relationships. The reason is that only a minority of Wikipedia visitors know that a category system exists. Even fewer actually use it to find pages. On the other hand, a Botanical Infobox on a biographical page is so striking to most users that sooner or later somebody will replace it with a more fitting Infobox.

Interlocking by Paul Goyette

Image by Paul Goyette CC BY-SA 2.0

Recently a movement seems to have arisen in the Wikipedia community that is against adding more specific fields to Infobox templates, voting instead for smaller, more specialized templates dispersed throughout the page. Take for example the decision to move external links to IMDB out of the Infobox Film and into smaller templates that specialize in linking to IMDB, or the refusal to add official home page fields to several other templates.

While in theory smaller templates give as much structure to text as larger Infoboxes, in practice they are much more easily abused. An IMDB field in the Infobox can only be used to point to the Internet Movie Database entry for the movie that is the subject of the article the Infobox appears on. If it points anywhere else, it will be very noticeable to anyone who follows that link and there is a good chance that it will be fixed soon. On the other hand, smaller templates can be (and are) used to link to IMDB entries that have only some weak relationship with the subject - for example, a page about an actor can have multiple smaller templates providing links to movies she has acted in. It will not be obvious to the average user that a template that should only point to the IMDB entry equivalent to the Wikipedia page it appears on has been misused. Since a computer cannot understand the text surrounding the links like a human reader does, it will learn that the concept of the actor is equivalent to the concepts of her movies. Suddenly, a pretty reliable way to link Wikipedia entries to another large database has become a lot noisier.

I understand there are (some) good reasons these decisions have been made in the Wikipedia community. Pages with large Infoboxes do become less convenient for human readers and can be time-consuming to keep up-to-date. However Wikipedia editors should acknowledge that Wikipedia has also become an important resource outside their original human audience.

The two goals, an easily readable and editable encyclopedia and a good quality machine learning resource, are not necessarily incompatible. There are many minor changes that could be made to enhance Wikipedia for machine learners without sacrificing human usability. If some piece of information really cannot be put inside an Infobox, then at least the specialized templates should be made in a way that makes them hard to abuse. For example, the current recommended way to link to an IMDB entry is a template that looks like this to a visitor of a page:

TITLE at the Internet Movie Database

Here TITLE is a movie title chosen by the editor who inserted the template. A better, more robust way to build that template would be, for example, to make TITLE always show the name of the current page. This approach, which is well within MediaWiki's current capabilities, would make it immediately obvious that a template has been misused.
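Until something like that happens, a program consuming these links can approximate the same check on its own side. The sketch below assumes a simplified template syntax of the form {{imdb title|id=NNNNNNN|title=TITLE}} (the real template may take its arguments differently) and uses made-up page names and IDs. It simply refuses to trust any link whose TITLE differs from the name of the page it appears on.

```python
import re

# Assumed, simplified syntax: {{imdb title|id=NNNNNNN|title=TITLE}}
IMDB_TEMPLATE = re.compile(
    r"\{\{\s*imdb title\s*\|\s*id\s*=\s*(\d+)\s*\|\s*title\s*=\s*([^}|]+?)\s*\}\}",
    re.IGNORECASE,
)

def suspicious_imdb_links(page_title, wikitext):
    """Return (id, title) pairs whose TITLE differs from the page title.

    Such links probably point to a related entry, not an equivalent one.
    """
    return [
        (imdb_id, title)
        for imdb_id, title in IMDB_TEMPLATE.findall(wikitext)
        if title.casefold() != page_title.casefold()
    ]

# Hypothetical page about an actor that links to two of her movies.
actor_page = (
    "* {{imdb title|id=0000001|title=Example Movie}}\n"
    "* {{imdb title|id=0000002|title=Another Movie}}\n"
)
print(suspicious_imdb_links("Example Actress", actor_page))

# Hypothetical page about the movie itself: the link is accepted.
movie_page = "* {{imdb title|id=0000001|title=Example Movie}}\n"
print(suspicious_imdb_links("Example Movie", movie_page))
```

This is only a heuristic: it throws away legitimate links whenever the editor chose a slightly different TITLE, which is exactly why fixing the template itself would be the better solution.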

This is a pretty minor change, but it would probably go a long way towards making Wikipedia easier to reliably connect to other databases. At the latest, this is a problem Wikipedia will have to face itself when it makes the transition to Semantic MediaWiki, as distant as that seems right now. It's clear that such a change to the IMDB template is no longer possible now that thousands of pages use it. However, I do hope that more thought will be given to this problem when interfaces of new templates are debated.

Posted by Tomaž | Categories: Ideas | Comments »

Your very own HSDPA base station

26.11.2008 20:29

A couple of days ago I got the opportunity to play with this Samsung SWS-23UC HSDPA femtocell. It's a base station for the third generation of cellular telephony that communicates with the rest of the UMTS system via a public internet connection. It can provide connectivity for UMTS mobile phones in places where there is no coverage from larger base stations and where setting up a microcell with a dedicated wired connection wouldn't be economical.

Samsung SWS-23UC

It's surprisingly easy to set up - you just plug it into the nearest ethernet socket and it uses DHCP to get an IP and internet gateway information. It then sets up an encrypted connection to the service provider and mobile phones in the vicinity can start connecting to it. It's using UDP-encapsulated IPsec, so there's a good chance it can get through various NATs and firewalls - at least it had no problems going through my NAT setup. The box is otherwise sealed. It has some kind of a proprietary connector on one side, presumably for configuration and uploading of authentication keys for connecting with the service provider's base station controllers. Port scans revealed no open ports or any clues about what kind of software is running in there.

It has a built-in antenna, which can be replaced with an external one. The case strongly reminds me of something designed by Apple (replace blue LEDs with white ones and you could call it an iStation).

Bandwidth requirements aren't high. Some experiments showed it uses approximately 0.5 kB/s of uplink/downlink bandwidth when idle. When a call is in progress, that jumps to an average of 15 kB/s in both directions. It's using a variable bit rate encoding for audio, since the bandwidth clearly depends on what is happening on the line. With silence you only get 5 kB/s and white noise forces it to 40 kB/s. Since that is pretty much all I can get from my ADSL line, I suspect it's also adjusting the bit rate according to the capabilities of the internet link.
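To put these numbers into more familiar units, here is a small back-of-the-envelope calculation. It assumes 1 kB means 1000 bytes and treats the measured rates as constant, which they clearly are not, so the results are only rough estimates.

```python
def kbit_per_s(kbyte_per_s):
    """Convert a rate in kB/s to kbit/s (8 bits per byte)."""
    return kbyte_per_s * 8

def megabytes_transferred(kbyte_per_s, seconds):
    """Data moved in one direction at a constant rate, in MB (1 MB = 1000 kB)."""
    return kbyte_per_s * seconds / 1000

# Rates measured above, in kB/s per direction.
rates = {"idle": 0.5, "silence": 5, "average call": 15, "white noise": 40}
for name, rate in rates.items():
    print(f"{name}: {kbit_per_s(rate):g} kbit/s")

# A ten-minute call at the average rate, and a full day of idling.
call_mb = megabytes_transferred(15, 10 * 60)
idle_day_mb = megabytes_transferred(0.5, 24 * 3600)
print(call_mb, idle_day_mb)
```

So an average call needs about 120 kbit/s each way and the white noise case about 320 kbit/s, while a box that just sits idle all day moves a few tens of megabytes in each direction.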

Of course, your phone's connection reliability now depends on your internet connection. So if somebody trips over the LAN cable or saturates the uplink with BitTorrent, your call will get dropped. There is still a good reason real telecommunication systems use dedicated lines and protocols other than IP.

In conclusion, it's a pretty simple way to get your cell phone to work in that basement. Bandwidth requirements are pretty bearable, even on an uplink-challenged ADSL line. With something like a wireless bridge you can even put it in a place that only has a wireless ethernet connection. I couldn't test the range of its signal, since I couldn't be sure when the phone was connected to it and not to the cell tower outside. I guess it's not much more than a neighboring room in a building. Given that this makes a mockery of careful cell frequency planning, that's probably for the best.

Posted by Tomaž | Categories: Digital | Comments »

The sound of hot tea

23.11.2008 16:25

One thing that I have been wondering about for some time is why, when I pour boiling water from a kettle into a cup, the bubbling sound it makes seems different than when I fill it with cold water. I found it curious, but I never gave it much thought. I always guessed that if I wasn't just imagining this difference, it was more likely the effect of the container from which I was pouring water, not the temperature of the water itself. I never tried filling a cup with cold water from a kettle to check this assumption.

Teakettle by Mr. T in DC

Photo by Mr. T in DC CC BY-ND 2.0

That is, until the coffee machine in the office broke down. That forced me to use the office water cooler to make tea. You see, this particular water cooler has two identical faucets: one for chilled and one for hot water. And this time it occurred to me that the sound is still different, even when the two faucets are identical in shape. So I went on and ran an experiment in controlled circumstances to get to the bottom of this.

I made a simple replica of the important parts of the water cooler: a funnel with an empty ceramic cup below it, so that when the funnel was quickly filled up, the water trickled down into the cup over the period of around 20 seconds (diameter of the opening was 3 mm, volume of water was 200 ml). I recorded the sound with a microphone placed over the top of the cup. The funnel was high enough that the flow became turbulent.

I did 10 measurements, 5 with water at room temperature and 5 with freshly boiled water just below 100°C. For each measurement I cut out a 10 s long part starting 3 s into the recording, to ignore any transient effects of filling the funnel. I then computed a discrete Fourier transform of each cut-out.
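The processing step can be sketched roughly like this. It is not the code used for the measurements: it uses a naive O(N^2) DFT in pure Python so that it runs without extra libraries, and the demo signal is a synthetic 10 Hz tone sampled at a very low, made-up rate to keep the run time short. For real recordings you would use an FFT routine from a numerical library instead.

```python
import cmath
import math

def cut_segment(samples, rate, start_s=3.0, length_s=10.0):
    """Return the part of the recording from start_s to start_s + length_s."""
    start = int(start_s * rate)
    return samples[start:start + int(length_s * rate)]

def dft_magnitudes(segment):
    """Naive DFT; returns normalized magnitudes for bins 0 .. N/2 - 1."""
    n = len(segment)
    magnitudes = []
    for k in range(n // 2):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(segment))
        magnitudes.append(abs(acc) / n)
    return magnitudes

def peak_frequency(segment, rate):
    """Frequency in Hz of the strongest non-DC bin."""
    magnitudes = dft_magnitudes(segment)
    k = max(range(1, len(magnitudes)), key=magnitudes.__getitem__)
    return k * rate / len(segment)

# Synthetic stand-in for a recording: 13 s of a 10 Hz tone at 50 samples/s.
rate = 50
samples = [math.sin(2 * math.pi * 10 * i / rate) for i in range(13 * rate)]

segment = cut_segment(samples, rate)   # 10 s starting 3 s into the recording
print(len(segment), peak_frequency(segment, rate))
```

With a 10 s segment the frequency resolution is 0.1 Hz, which is far finer than the differences of several hundred hertz discussed below.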

Sound spectrum of a cup being filled with water

This figure shows all 10 measured spectra superimposed. Measurements with hot water are red, while those with cold water are blue.

The most obvious difference is the nicely defined peak at 3000 Hz: it rises in frequency by almost 500 Hz with hot water. Also noticeable is that hot water spectra are on average weaker than cold water spectra between 6000 and 12000 Hz.

So it looks like there is a noticeable difference. The question remains what mechanism causes it.

One factor that contributes to the sound is the ringing of the ceramic cup, excited by the falling water. To get the resonant frequencies of an empty cup I did an impulse response test without any water in it (i.e. I hit the cup and recorded the spectrum of the 'ding' sound):

Impulse response spectrum of a ceramic cup

As you can see, this particular cup design resonates most strongly at around 2500 Hz, so I'm confident that the similar peak in the spectra in the previous figure has the same cause - the resonant frequency is probably higher when the cup is partly filled with water. I'm not sure why the peak moves with temperature though. Mechanical resonant frequencies of solid objects do change with temperature, but the rate observed here seems a bit excessive. It's also possible that the difference in water viscosity caused the hot cups to fill up faster and so resonate at higher frequencies during the measurement interval I used. Some more measurements of the response of an empty cup at different temperatures may clear this up.

The change in higher frequencies is a bit trickier to explain. After a bit of browsing it turned out that I'm not the only one asking such silly questions. For example, the change can be attributed to tiny water droplets of condensed steam in the air according to this post at Yahoo Answers. It seems a plausible explanation to me, although I can't think of a simple way to test it.

Posted by Tomaž | Categories: Ideas | Comments »

DWL-G810 wireless bridge

20.11.2008 21:49

I bought this D-link DWL-G810 wireless bridge a couple of years ago to connect a network segment in the basement of our house to the rest of the computers and the internet, since connecting them by real UTP cables proved pretty much impossible.

D-link DWL-G810

On the outside it's a simple box that has an ethernet socket on one end and an 802.11g connection on the other. Whatever device you connect to it behaves as if it were connected by a wired network to all the computers on the local network, including those that are actually on the wireless LAN.

Well, that's the theory at least. In practice, I never could get it to work properly. The connection kept dropping, DHCP requests wouldn't get through and so on. Very annoying, since even once a machine managed to get an IP, only every other web page would open in a browser. So it has been gathering dust in a corner ever since.

Now I've noticed that D-link released some new firmware updates (which is pretty surprising for a device that will soon be 3 years old). It's quite confusing though, since you've got U.K., German and Australian departments of D-link, each claiming a different version of the firmware is the latest. And apparently this product has been discontinued in the U.S.

I've tried 3.16b77 (from the German site), 3.15b70 (from the U.K. site) and 3.15b73 (from the Australian site).

I've had the most success with 3.15b73. So far, I haven't seen any connection drops with it and it also has the added bonus that the admin page can be accessed from the wireless side, which seems impossible with other versions. Interestingly, the other two versions didn't bring any obvious improvements over the old 2.2 firmware, although they appear to be newer.

The device still has some interesting properties though. For example, inbound traffic that you send to its IP via the wireless interface can be seen on the wired side, which is weird. So if you ping it from Wi-Fi, you see your ping packets on the wire with tcpdump, but not the answers. Same for HTTP traffic to the admin page (including the plain-text password for HTTP basic authentication).

It's also constantly searching for multicast-capable clients on the wireless side, and I have no idea why:

21:41:28.394980 xx:xx:xx:xx:xx:xx (oui Unknown) > Broadcast, ethertype ARP (0x0806), length 60: arp who-has ALL-SYSTEMS.MCAST.NET tell 0.0.0.0

For the record, this is hardware version C1 (H/W Ver: C1, product number ending with BEUC1) and I'm using it with a WPA-PSK protected network.

Posted by Tomaž | Categories: Digital | Comments »

Blog search pinger for Nanoblogger

15.11.2008 12:42

I'm sure the rest of the Nanoblogger-using crowd (yes, I'm thinking about both of you) will find this useful. It's a plug-in that pings Google's Pinging API, so that your blog entries appear on their blog search.

It's simple to use: just download the archive and uncompress it into the plugins/ directory of your Nanoblogger installation. It will then make a call to the big brother each time you update your web page.

I'm using it with Nanoblogger 3.3, but I guess it should also work fine on newer versions. Any feedback is welcome as always.

Posted by Tomaž | Categories: Code | Comments »

Best photoshopped pilot ever

13.11.2008 17:13

Recently a video has begun circulating of an airplane that loses a wing during a snap roll. Despite this problem the pilot miraculously manages to save himself and what is left of the airplane. Even a major Slovenian news site picked up the story, attributing the maneuver to James Andresson.

As many have noted, the video is undoubtedly fake. While the basic aerodynamics of the flying appear to be correct, there are glitches, like the direction of the initial uncontrolled spin of the aircraft and the unrealistically hard landing that does not even bend the landing gear. A more careful look at the video itself also reveals lots of other clues that support the theory that it has been constructed from several different sources.

my father's x-free

However, if you forget for a minute that it's fake, the video actually shows a really good trick. I have some experience piloting model aircraft and I know that aerobatic airplanes are capable of flying with wings in the vertical position ("knife edge" flight). In this position the body of the airplane provides the lift instead of the wings. So, in theory a controlled flight with a missing wing is possible, provided you manage to pull out of the initial spin.

Now there are always people wanting to show off their skills at RC meetings and competitions. Why doesn't somebody try to replicate this, MythBusters style? The wing could be constructed so that one half would come off in mid-flight by remote control. And the model airplane could be one of those modern, lightweight Depron foam types so that it would survive more than one failed attempt.

It would take one hell of a pilot to do it, but I'm sure that with a trick like this you would be the star of the event. Anyone up for the challenge?

Posted by Tomaž | Categories: Ideas | Comments »

m in malloc stands for molasses

11.11.2008 21:34

For the past few days I've been trying to fix a bug in an application I've made for internal use at Zemanta. This particular C++ program processes large amounts of data gathered from all sorts of dark places on the Internet and compiles binary datasets that can be used by Wikitag. It uses the GNU Standard Template Library and at one point allocates (and frees) several gigabytes worth of STL data structures. The problem is that when the number of objects stored in containers reaches some threshold, the malloc() calls become painfully slow (i.e. minutes to hours per single call on a 2 GHz Opteron). The code starts spending all its time in the _int_malloc() function, one of the helpers of glibc's malloc().

Problems of this kind with malloc() are usually related to some kind of memory corruption, but I'm skeptical in this case. I'm leaving memory management to STL almost exclusively and tools like Valgrind's memcheck don't report anything strange. Additionally, most of the code is pretty old, well tested and worked well before with smaller amounts of data. I think it's more probable that it's related to a particular pattern of memory fragmentation that breaks the built-in memory management heuristics of Debian's glibc.

Anyway, I haven't yet found a satisfactory solution to this problem, but I did find one interesting memory debugging tool for glibc that I previously didn't know existed.

I stumbled upon libmemusage.so while studying glibc's malloc() implementation. It's a dynamic library that contains a diagnostic version of malloc() and friends and prints some useful statistics about their use to stderr on program exit, like the number of calls to each memory handling function. It's part of the base Debian libc6 package, so it's probably there on every Debian-based system. You can use it without recompiling your code like this:

orion$ LD_PRELOAD=/lib/libmemusage.so ./example

Memory usage summary: heap total: 10, heap peak: 10, stack peak: 64
         total calls   total memory   failed calls
 malloc|          1             10              0
realloc|          0              0              0  (nomove:0, dec:0, free:0)
 calloc|          0              0              0
   free|          0              0
Histogram for block sizes:
    0-15              1 100% ==================================================
orion$ 

The printed information is pretty much self-explanatory. It's just staggering how many tiny malloc()s and free()s an average STL-using C++ program can make if you're not careful.
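If you want to track these numbers across many runs (for example to see how call counts grow with input size), the summary is regular enough to parse with a few lines of script. The following is only a sketch based on the layout of the output shown above; the function name is made up and other glibc versions may format the table differently.

```python
import re

# Matches rows such as " malloc|          1             10              0".
ROW = re.compile(r"\s*(malloc|realloc|calloc|free)\|\s*(\d+)\s+(\d+)(?:\s+(\d+))?")

def parse_memusage(report):
    """Extract per-function statistics from a libmemusage summary."""
    stats = {}
    for line in report.splitlines():
        m = ROW.match(line)
        if m:
            name, calls, memory, failed = m.groups()
            stats[name] = {
                "calls": int(calls),
                "memory": int(memory),
                # The free row has no "failed calls" column.
                "failed": int(failed) if failed is not None else None,
            }
    return stats

# The table from the example run above.
report = """\
         total calls   total memory   failed calls
 malloc|          1             10              0
realloc|          0              0              0  (nomove:0, dec:0, free:0)
 calloc|          0              0              0
   free|          0              0
"""

stats = parse_memusage(report)
print(stats["malloc"]["calls"], stats["malloc"]["memory"])
```

Since the library writes to stderr, you would capture that stream from the LD_PRELOAD run and feed it to a function like this.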

Another thing worth looking into if you have similar memory management problems is the /usr/include/malloc.h header. It contains some functions the man pages do not mention, like malloc_stats(), which also prints some helpful things to standard error, and mallinfo(), which returns a struct with some internal malloc() housekeeping data. But to understand those you will need to read through the comments in malloc.c in the glibc source code (it's surprisingly well documented).

Posted by Tomaž | Categories: Code | Comments »

On Pleo touch sensors

08.11.2008 20:26

I've been asked a couple of times how Pleo knows when you're petting it. Here's what I learned about his sense of touch while I was taking Zemanta's Piki apart.

Ugobe explains Pleo's features

Image by Ugobe

The official story Ugobe tells is that Pleo is equipped with several sensors under its skin that respond to human touch. They tell you approximate locations, like under the chin, on the back and legs, but how they work is left to your imagination.

Pleo's shoulder capacitive sensor

Pleo's shoulder touch sensor

While Pleo is in all respects an impressive technical achievement, the first look at these sensors reveals that he's not exactly Commander Data. As you can see, they are composed simply of adhesive aluminum foil that is fixed on strategic spots on Pleo's plastic skeleton. The curious fact that reveals that there is nothing magical about this patch is that there is only one wire going to it from the circuit board inside. Even the simplest switch would need at least two wires. I guess bionic skin will stay the domain of science fiction for a little while longer.

So how does Pleo sense your hand with a simple piece of aluminum? Well, Ugobe's description isn't entirely accurate. These sensors merely detect the proximity of objects, not touch or pressure. The exceptions are, of course, the four sensors on Pleo's feet, but those are ordinary mechanical switches and require no explanation.

Pleo's capacitive proximity sensors

A patch of foil forms a sensor electrode and a control circuit is connected between it and the circuit's ground plane. This circuit continuously measures the capacitance of the electrode: it brings the electrode to a known potential relative to ground and measures the amount of charge that has accumulated on it.

Since you are relatively conductive, you significantly alter the electric field in your vicinity. When you move your hand close to the electrode, you form a shorter, better conducting path to the ground plane (compared to the air that was previously occupying your place). The floor can also help form this path, if it's conductive enough. Since the path is shorter, it takes more accumulated charge on the sensor plate to bring it to the reference potential. The control circuit can detect this change. With some careful signal processing it can also determine when your hand is close enough to be touching the skin and that a trigger signal should be sent to Pleo's microcontroller.

The typical electrode capacitance here is on the order of 10^-12 farads, so even an object with a resistance on the order of gigaohms will appear conductive to a sensor that is doing hundreds of measurements per second.
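A quick calculation shows why. The time constant of a resistance R charging a capacitance C is R times C, so a gigaohm path into a picofarad-scale electrode settles in about a millisecond. The 200 measurements per second used below is an assumed example value standing in for "hundreds", not a figure measured on Pleo.

```python
import math

def time_constant(resistance_ohm, capacitance_f):
    """RC time constant in seconds."""
    return resistance_ohm * capacitance_f

def settled_fraction(period_s, tau_s):
    """Fraction of the final charge reached after period_s of RC charging."""
    return 1.0 - math.exp(-period_s / tau_s)

tau = time_constant(1e9, 1e-12)   # 1 gigaohm, 1 picofarad
period = 1 / 200                  # assumed: 200 measurements per second

print(f"tau = {tau * 1e3:.1f} ms")
print(f"settled within one measurement period: {settled_fraction(period, tau):.1%}")
```

Under these assumptions one measurement period is about five time constants long, so the charge has almost completely redistributed through the gigaohm path before the next measurement.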

On the other hand, Pleo's rubber skin is a good dielectric and the electric field passes through it without much attenuation. In fact, this kind of sensor works through most insulating materials, like glass, plastic or ceramic.

If you think about it another way, this kind of sense is in some ways superior to touch. Pleo in fact senses the geometry and electric properties of the space around it, by probing it with an electric field at a distance. It's doubtful that this analogue information about capacitance from different sensors gets to the microcontrollers though, which probably receive only digital on/off signals from the sensor controller ICs.

The question that remains in the end is how exactly Pleo measures electrode capacitance. I haven't had the opportunity to put an oscilloscope probe on the sensor electrode of a working Pleo (Piki is still in very bad shape and I don't dare to power him up), so I'm left guessing.

It most probably uses the charge transfer method, where the unknown capacitance of the electrode is compared to a known capacitance. There are cheap and reliable ICs that can do that, like the QT110. Another possibility is a simple and less accurate R-C circuit and measuring its time constant, which can be done with the microcontroller itself and very little extra hardware.
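To illustrate the charge transfer idea, here is a small idealized simulation. It is a simplified model with made-up component values, not a description of what the QT110 or Pleo actually does: the electrode capacitance is repeatedly charged to a reference voltage and then connected to a much larger known sampling capacitor, and the number of cycles needed for the sampling capacitor to reach a threshold serves as the measurement.

```python
def transfer_cycles(c_x, c_s, v_ref=1.0, v_threshold=0.5, max_cycles=1_000_000):
    """Count charge-transfer cycles until the sampling capacitor reaches v_threshold.

    c_x: unknown electrode capacitance, c_s: known sampling capacitance (farads).
    Returns None if the threshold is not reached within max_cycles.
    """
    v_s = 0.0
    for cycle in range(1, max_cycles + 1):
        # c_x is charged to v_ref, then connected to c_s: the charge is shared.
        v_s = (c_x * v_ref + c_s * v_s) / (c_x + c_s)
        if v_s >= v_threshold:
            return cycle
    return None

C_S = 10e-9                                 # assumed 10 nF sampling capacitor
baseline = transfer_cycles(10e-12, C_S)     # assumed 10 pF electrode, no hand
with_hand = transfer_cycles(15e-12, C_S)    # hand nearby adds an assumed 5 pF

print(baseline, with_hand, with_hand < baseline)
```

A larger electrode capacitance moves more charge per cycle, so the threshold is reached in fewer cycles. A controller then only needs to compare the cycle count against a slowly adapting baseline to produce an on/off signal.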

Posted by Tomaž | Categories: Analog | Comments »

This must be the place

02.11.2008 11:38

I was traveling around Sicily and the Aeolian Islands last week. I won't bore you with all the ancient engineering achievements I've seen there, but the two dormant volcanoes I visited are worth a brief mention.

One of the smaller craters of Etna I've been to didn't show any signs of activity. If it wasn't for the out-of-this-world landscape, I wouldn't have thought I was standing in a place that was covered with molten rock in the 2003 eruption.

Etna

Vulcano on the other hand did show some gas and sulfur coming out of the ground. No signs of lava though, so I guess the place can also be scratched from the list of suitable places for an evil overlord's lair.

Vulcano
Posted by Tomaž | Categories: Life | Comments »