Wikipedia is broken

14.10.2007 18:24

When the World Wide Web and HTML were designed, a decision was made to make web page authoring as easy as possible. That meant that web browsers gracefully accepted all documents, even those that did not strictly conform to the HTML syntax, and tried their best to present them on the screen the way the document authors intended. This was probably one of the key reasons the WWW became so popular: anyone with a text editor and some patience could come up with a tag soup document that a web browser would silently render without displaying a single error message. However, this also became a major problem of the web, because hardly anyone wrote standards-compliant HTML and browsers were forced to become more and more complex to cope with all the mistyped garbage that was floating around.


Wikipedia was founded a good 10 years after the World Wide Web, and its current engine, MediaWiki, followed a year later. At that time the tag soup problem of the web was already well known. You would think that the founders of Wikipedia would have learned from history and would know that giving your users too much freedom with markup syntax only leads to problems. In reality it seems that exactly the opposite is true.

The syntax behind Wikipedia pages today is so diverse, and so filled with hacks and workarounds for the errors and typos page editors have made, that the only thing capable of properly rendering a page from Wikipedia is MediaWiki itself. It's wonderfully difficult to use Wikipedia dumps with any other software or for any purpose other than displaying them in the browser. It takes, for example, a 2000-line Perl script, the Wikipedia Preprocessor (Wikiprep), to make sense of most of the garbage and make the information even remotely machine-readable.

Consider for example this comment from Wikiprep:

# The correct form to create a redirect is #REDIRECT [[ link ]],
# and function 'Parse::MediaWikiDump::page->redirect' only supports this form.
# However, it seems that Wikipedia can also tolerate a variety of other forms, such as
# REDIRECT|REDIRECTS|REDIRECTED|REDIRECTION, then an optional ":", optional "to" or optional "=".
# Therefore, we use our own function to handle these cases as well.
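To get a feeling for what "tolerating" all of these forms means in practice, here is a minimal sketch in Python. It is not Wikiprep's actual code (which is Perl) and not what MediaWiki does internally; the pattern and the function name parse_redirect are my own, built only from the forms listed in the comment above, and I assume the leading "#" is always present.

```python
import re

# Hypothetical pattern modeled on the Wikiprep comment: a "#", one of the
# keyword variants REDIRECT/REDIRECTS/REDIRECTED/REDIRECTION, then an
# optional ":", "to" or "=", and finally the [[ link ]] target.
REDIRECT_RE = re.compile(
    r"^\s*#\s*REDIRECT(?:S|ED|ION)?\s*(?::|\bto\b|=)?\s*\[\[\s*([^\]|#]+)",
    re.IGNORECASE,
)


def parse_redirect(text):
    """Return the redirect target of a page, or None if it is not a redirect."""
    match = REDIRECT_RE.match(text)
    if match is None:
        return None
    # Drop whitespace around the title; anything after "|" or "#" was
    # already excluded by the character class in the pattern.
    return match.group(1).strip()


if __name__ == "__main__":
    for sample in ("#REDIRECT [[Foo]]", "#REDIRECTS to [[Foo bar]]", "Just text"):
        print(repr(sample), "->", parse_redirect(sample))
```

Even this toy version needs three optional groups just to recognize a keyword that has exactly one documented spelling.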

What possible reason could there be to allow this kind of flexibility in the markup syntax? The only one I can think of is that some administrator noticed a broken page that, for example, had a "REDIRECTS" keyword instead of "REDIRECT" and, instead of fixing that page, fixed MediaWiki to support the typo. There are a lot of other cases like this. For example, disambiguation pages can be marked with {{disambiguation}}, {{disambig}} or {{dab}} for the benefit of those who can't remember the name. Then there is this strange policy of ignoring the case of the first letter in a page title while distinguishing the case of subsequent letters. I can't imagine a good reason for that.
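Any software that reads a dump has to reimplement these quirks itself. A rough sketch in Python of the two mentioned above follows. The names normalize_title, is_disambiguation_marker and DISAMBIG_TEMPLATES are hypothetical, the alias list contains only the three names I mentioned, and I am assuming that template names follow the same first-letter rule as page titles.

```python
# Aliases taken from the text above; the real list may well be longer.
DISAMBIG_TEMPLATES = {"disambiguation", "disambig", "dab"}


def normalize_title(title):
    """Uppercase the first letter only; the case of the rest is significant."""
    title = title.strip()
    return title[:1].upper() + title[1:]


def is_disambiguation_marker(template_name):
    """Check a template name against the known aliases, first letter aside."""
    known = {normalize_title(name) for name in DISAMBIG_TEMPLATES}
    return normalize_title(template_name) in known


if __name__ == "__main__":
    # "wikipedia" and "Wikipedia" name the same page, "WikiPedia" does not.
    print(normalize_title("wikipedia") == normalize_title("Wikipedia"))
    print(normalize_title("WikiPedia") == normalize_title("Wikipedia"))
    print(is_disambiguation_marker("Dab"))
```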

In the end I have a feeling the syntax itself is starting to bite back. With time it got more and more complex. Take for example the source of this Wikipedia template:

<div class="notice metadata" id="disambig">
|style="vertical-align:middle;"|[[Image:Disambig gray.svg|30px]]
[[Wikipedia:Disambiguation|disambiguation]] page lists articles about distinct
geographical locations with the same name. If <!-- you are viewing this
online as opposed to as a [[hard copy]] and -->an
[[Special:Whatlinkshere/{{FULLPAGENAME}}|internal link]] led you here, you may
wish to change the link to point directly to the intended article.''
</div><includeonly>[[Category:Ambiguous place
names]]</includeonly><noinclude>[[Category:Disambiguation and
redirection templates|Geodis]]</noinclude>

This is neither human- nor machine-readable, and the only thing that can make sense of it is MediaWiki, with its 100000 lines of PHP code dedicated to interpreting messes like this. Just figuring out what gets included from a template page is complex and full of special cases and exceptions:

# We're storing template text for future inclusion, therefore,
# remove all <noinclude> text and keep all <includeonly> text
# (but eliminate <includeonly> tags per se).
# However, if <onlyinclude> ... </onlyinclude> parts are present,
# then only keep them and discard the rest of the template body.
# This is because using <onlyinclude> on a text fragment is
# equivalent to enclosing it in <includeonly> tags **AND**
# enclosing all the rest of the template body in <noinclude> tags.
# These definitions can easily span several lines, hence the "/s" modifiers.
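The rules in that comment can be sketched in a few lines of Python. Again, this is my own illustration and not the Wikiprep code: the function name template_text_for_inclusion is made up, and I am assuming that the remaining tag rules are still applied to whatever <onlyinclude> keeps. It also ignores nesting and unclosed tags, which a real parser has to deal with.

```python
import re

# re.DOTALL plays the role of Perl's "/s" modifier: "." also matches newlines,
# so a tagged region may span several lines.
ONLYINCLUDE_RE = re.compile(r"<onlyinclude>(.*?)</onlyinclude>", re.DOTALL)
NOINCLUDE_RE = re.compile(r"<noinclude>.*?</noinclude>", re.DOTALL)
INCLUDEONLY_TAG_RE = re.compile(r"</?includeonly>")


def template_text_for_inclusion(body):
    """Return the part of a template body that ends up in the including page."""
    kept = ONLYINCLUDE_RE.findall(body)
    if kept:
        # <onlyinclude> present: keep only those parts, discard the rest.
        body = "".join(kept)
    # Remove <noinclude> regions together with their content.
    body = NOINCLUDE_RE.sub("", body)
    # Keep the content of <includeonly> regions, but drop the tags themselves.
    return INCLUDEONLY_TAG_RE.sub("", body)


if __name__ == "__main__":
    sample = "A<noinclude>docs</noinclude>B<includeonly>[[Category:X]]</includeonly>"
    print(template_text_for_inclusion(sample))
```

Three different tags with overlapping meaning, just to decide which half of a page counts.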

The very wiki markup that made Wikipedia accessible to many is now making it hard for ordinary people to contribute. If I want to create a new page on Wikipedia today and mess up the markup, there is a good chance it will get deleted. It isn't realistic to expect that people will read through long, boring pages describing the markup.

How exactly would one solve this problem? I don't know, but I'm sure it won't be easy: most of the pages on the Web still aren't standards-compliant. The difference with Wikipedia is that it is all under the control of the Wikimedia Foundation, so in theory it would be possible to automatically convert all pages to some saner, stricter markup and manually fix those that fail to convert. However, that would require an enormous effort and would probably turn away a lot of current editors, so I don't think it will happen any time soon.

Posted by Tomaž | Categories: Code
