Computers want to learn too

28.11.2008 21:01

Wikipedia is a wonderful learning resource. It provides a wealth of easily browsable articles on just about every topic. An article on English Wikipedia is a great starting point, whether you're merely curious about a specific topic or beginning a more serious study of a subject. Indeed, the ease of access to that much knowledge even poses a problem for some.

XKCD: The problem with Wikipedia

Image by Randall Munroe CC BY-NC 2.5

This is all the realization of the original creators' dream of Wikipedia becoming a mainstream, freely accessible and editable encyclopedia. What they probably didn't envisage, however, is that their site would also become an invaluable resource for computers to learn about the world. English Wikipedia, as one of the largest freely accessible corpora, has become an important resource in machine learning research. A lot of work in natural language processing, search algorithms, text classification and similar fields is based on data gathered from Wikipedia. Results of this research are now being used by a number of companies and non-profit projects - some directly, like Wikia, Powerset, Tagaroo, FreeBase, DBPedia and, last but not least, Zemanta. Many more use them indirectly, perhaps even unknowingly, by employing methods and algorithms developed from research that was based on or evaluated against data from Wikipedia.

What makes Wikipedia inviting for research is that it's the best real-life approximation of a very large repository of structured information. Why is this structure important? After all the promises of the past decades, artificial intelligence research has failed to come up with a system that can understand natural language to a degree comparable with an average human. With the hope that a computer could learn directly from plain text dashed, it was realized that in order to make computer systems smarter, people must help them understand important pieces of text. This means that concepts in the text must be clearly marked as having some specific meaning. Only then can current state-of-the-art algorithms start learning from it, giving rise to intelligent systems that know how to suggest what book you might want to read next, or can directly answer your questions instead of just pointing you to a semi-relevant webpage and leaving the tough part of extracting the information from its text to you.

While Wikipedia isn't properly semantically tagged, it is a good approximation. What makes this possible is its use of templates - an editing tool originally designed to ease input of data and standardize the layout of specific classes of topics. Since text is entered into templates through a standardized set of parameters, the template gives that text structure that can be used for more than just page layout. For example, text entered for the birthdate parameter of the Infobox People template suddenly becomes a piece of information with a definite meaning: the person described by the article was born on the date described by that piece of text. Even the mere presence of Infobox People on a page classifies it as a biographical page.
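To make this concrete, here is a minimal sketch in Python of turning such a template parameter into a machine-readable fact. The parse_infobox helper and the page text are made up for illustration, and the regex is deliberately naive - real wikitext allows nested templates and needs a proper parser:

```python
import re

def parse_infobox(wikitext, template):
    """Extract key=value parameters from a flat {{Template|...}} call.

    A naive sketch: it splits on every '|' and so breaks on nested
    templates or links, but it shows the principle."""
    match = re.search(r"\{\{\s*%s(.*?)\}\}" % re.escape(template),
                      wikitext, re.DOTALL)
    if match is None:
        return None
    params = {}
    for part in match.group(1).split("|")[1:]:
        key, sep, value = part.partition("=")
        if sep:
            params[key.strip()] = value.strip()
    return params

page = """{{Infobox People
| name      = Ada Lovelace
| birthdate = 10 December 1815
}}"""

info = parse_infobox(page, "Infobox People")
# info["birthdate"] is now a fact with a definite meaning, and the mere
# presence of the template (info is not None) classifies the page.
```

The structure does all the work here: the extractor never has to understand the article's prose, only the template's parameter names.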

DBpedia links between databases

Image by DBpedia

However, not all templates are created equal. Wikipedia, as a collaboratively edited project, has a curious property: a technical feature (like templates) will only be used properly when its misuse is blatantly obvious to ordinary (human) visitors of Wikipedia.

Take for example the category hierarchy. The MediaWiki software that powers Wikipedia supports assigning articles to a hierarchy of categories. By themselves these categories seem like a more natural way of classifying articles than checking which page uses which Infobox template. A closer look, however, reveals that the category system is wonderfully abused: a lot of pages are put in completely wrong categories, and the hierarchy is full of cycles and nonsensical relationships. The reason is that only a minority of Wikipedia visitors know that the category system exists, and even fewer actually use it to find pages. On the other hand, a botanical Infobox on a biographical page is so striking to most users that sooner or later somebody will replace it with a more fitting Infobox.
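The claim that the hierarchy contains cycles is easy to check mechanically. A small Python sketch of detecting one in a child-to-parents category graph (the category names are made up for illustration):

```python
def find_cycle(parents):
    """Depth-first search for a cycle in a child -> parents mapping.

    A sane category hierarchy is a DAG, so any cycle is an anomaly."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}

    def visit(node, path):
        color[node] = GRAY
        for parent in parents.get(node, []):
            state = color.get(parent, WHITE)
            if state == GRAY:                 # back edge: cycle found
                return path + [parent]
            if state == WHITE:
                found = visit(parent, path + [parent])
                if found:
                    return found
        color[node] = BLACK
        return None

    for node in list(parents):
        if color.get(node, WHITE) == WHITE:
            found = visit(node, [node])
            if found:
                return found
    return None

# A toy fragment of a category graph with a circular relationship:
categories = {
    "Ada Lovelace": ["Mathematicians"],
    "Mathematicians": ["Scientists"],
    "Scientists": ["People"],
    "People": ["Scientists"],   # People -> Scientists -> People
}
cycle = find_cycle(categories)  # path ends where the circle closes
```

Run over the real category dump, a traversal like this surfaces exactly the nonsensical relationships described above.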

Interlocking by Paul Goyette

Image by Paul Goyette CC BY-SA 2.0

Recently a movement seems to have arisen in the Wikipedia community against adding more specific fields to Infobox templates, favoring instead smaller, more specialized templates dispersed throughout the page. Take for example the decision to move external links to IMDB out of the Infobox Film template and into smaller templates specialized for linking to IMDB. Or the refusal to add official home page fields to several other templates.

While in theory smaller templates give as much structure to text as larger Infoboxes, in practice they are much more easily abused. An IMDB field in the Infobox can only be used to point to the Internet Movie Database entry for the movie that is the subject of the article the Infobox appears on. If it's not, the mistake will be very noticeable to anyone who follows that link, and chances are good that it will be fixed soon. On the other hand, smaller templates can be (and are) used to link to IMDB entries that have only a weak relationship with the subject - for example, a page about an actor can have multiple smaller templates providing links to movies she has acted in. It will not be obvious to the average user that a template that should only point to the IMDB entry equivalent to the Wikipedia page it appears on has been misused. Since a computer cannot understand the text surrounding the links the way a human reader does, it will learn that the concept of the actor is equivalent to the concepts of her movies. Suddenly, a pretty reliable way to link Wikipedia entries to another large database has become a lot noisier.
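The difference matters because an extraction tool sees only the template calls, not the surrounding prose. A small Python sketch of why per-page uniqueness is lost (the {{IMDb title}} template name and its parameter layout are assumptions for illustration):

```python
import re

def imdb_ids(wikitext):
    """Collect every IMDb id referenced through a free-floating
    {{IMDb title|<id>|<title>}} template call on a page."""
    return re.findall(r"\{\{\s*IMDb title\s*\|\s*(\d+)", wikitext)

# A movie page carries exactly one such link - a clean one-to-one mapping:
movie_page = "{{IMDb title|0062622|2001: A Space Odyssey}}"

# An actor page may legitimately carry several, so a naive
# "Wikipedia page <-> IMDb entry" extractor now sees an ambiguous mapping:
actor_page = ("{{IMDb title|0066921|A Clockwork Orange}} and "
              "{{IMDb title|0081505|The Shining}}")
```

With the Infobox field, a page yields at most one id and the mapping is unambiguous; with dispersed templates, the extractor has no way to tell which (if any) of the ids is equivalent to the page itself.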

I understand there are (some) good reasons these decisions have been made in the Wikipedia community. Pages with large Infoboxes do become less convenient for human readers and can be time-consuming to keep up to date. However, Wikipedia editors should acknowledge that Wikipedia has also become an important resource beyond its original human audience.

The two goals, an easily readable and editable encyclopedia and a good-quality machine learning resource, are not necessarily incompatible. There are many minor changes that could be made to enhance Wikipedia for machine learners without sacrificing human usability. If some piece of information really cannot be put inside an Infobox, then at least the specialized templates should be made in a way that makes them hard to abuse. For example, the current recommended way to link to an IMDB entry is a template that looks like this to a visitor of a page:

TITLE at the Internet Movie Database

Where TITLE is a movie title chosen by the editor who inserted the template. A better, more robust way to build that template would be, for example, to make TITLE always show the name of the current page. This approach, which is well within MediaWiki's current capabilities, would make any misuse of the template immediately obvious.
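Concretely, MediaWiki's built-in {{PAGENAME}} magic word already allows this. A hypothetical sketch of such a template body (the template name and the id parameter are assumptions, not the actual template in use):

```wikitext
<!-- Hypothetical {{IMDb title}} template body: the visible link text comes
     from the page the template appears on, not from an editor-supplied
     parameter, so a misplaced template immediately reads wrong. -->
[http://www.imdb.com/title/tt{{{id}}}/ {{PAGENAME}} at the Internet Movie Database]
```

On a misused template, the link would then read "Jack Nicholson at the Internet Movie Database" while pointing at a movie's entry, a mismatch any reader would notice.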

This is a pretty minor change, but it would probably go a long way toward making Wikipedia easier to connect reliably to other databases. If not sooner, this is a problem Wikipedia will have to face when it makes the transition to Semantic MediaWiki, as distant as that seems right now. It's clear that such a change to the IMDB template is no longer possible now that thousands of pages use it, but I do hope that more thought will be given to this problem when the interfaces of new templates are debated.

Posted by Tomaž | Categories: Ideas


What a wonderful article,

however I think you should also point out the great things a "machine readable" Wikipedia could mean for Wikipedia itself: automatically checking the sanity of many small pieces of data, easier discovery of vandalism, automatically adding missing data or informing Wikipedia editors what is missing in articles, discovery of contradictory information, etc.

I think the semi-proper templating system that Wikipedia used until now was a balanced compromise... It's a shame it is being dismantled.

