When the World Wide Web and
HTML were
designed, a decision was made to try to make web page authoring as easy as
possible. That meant that web browsers gracefully accepted all documents,
even those that did not strictly conform to the HTML syntax, and tried
their best to present them on the screen in the way document authors
intended. This was
probably one of the key factors of why WWW became so popular - everyone
with a text editor and some patience could come up with some tag soup document that
would be silently rendered by his web browser without displaying a single
error message. However this also became a major problem of the web, because
no one wrote standards-compliant HTML and browsers were forced to become
more and more complex to cope with all the mistyped garbage that was
floating around.
Wikipedia was
founded good 10 years after the World Wide Web and it's current engine
MediaWiki a year later.
At that time the tag soup problem of the web was already well-known. You
would think that the founders of Wikipedia would learn from history and
would know that giving your users too much freedom in regard to markup
syntax will only lead to problems. In reality it seems that exactly the
opposite is true.
The syntax behind Wikipedia pages today is so diverse, filled with hacks
and workarounds for errors and typos page editors made that the only thing
capable of properly rendering a page from Wikipedia is MediaWiki itself.
It's wonderfully difficult to use Wikipedia dumps from any other software
and for any other purpose than displaying them in the browser. It takes for
example a 2000 line Perl script
Wikipedia
Preprocessor to make sense of the most of the garbage and make
information even remotely machine-readable.
Consider for example this comment from Wikiprep:
# The correct form to create a redirect is #REDIRECT [[ link ]],
# and function 'Parse::MediaWikiDump::page->redirect' only supports this form.
# However, it seems that Wikipedia can also tolerate a variety of other forms, such as
# REDIRECT|REDIRECTS|REDIRECTED|REDIRECTION, then an optional ":", optional "to" or optional "=".
# Therefore, we use our own function to handle these cases as well.
What possible reason could there be to allow this kind of flexibility in
the markup syntax? The only one I can think of is that some administrator
noticed a broken page that for example had a "REDIRECTS" keyword instead
instead of "REDIRECT" and instead of fixing that page fixed MediaWiki to
support this typo. There are a lot of other cases like this. For example
disambiguation
pages can be marked with {{disambiguation}}, {{disambig}} or {{dab}}
because of those who can't remember the name. Then there is this
strange policy of ignoring the case of the first letter in a page title and
distinguishing the case of subsequent letters. I can't imagine a good
reason for that.
In the end I have a feeling the syntax itself is starting to bite back.
With time it got more and more complex. Take for example the source of this
Wikipedia template:
<div class="notice metadata" id="disambig">
{|style="background:none"
|style="vertical-align:middle;"|[[Image:Disambig gray.svg|30px]]
|style="vertical-align:middle;"|''This
[[Wikipedia:Disambiguation|disambiguation]] page lists articles about distinct
geographical locations with the same name. If <!-- you are viewing this
online as opposed to as a [[hard copy]] and -->an
[[Special:Whatlinkshere/{{FULLPAGENAME}}|internal link]] led you here, you may
wish to change the link to point directly to the intended article.''
|}</div>
</div><includeonly>[[Category:Ambiguous place
names]]</includeonly><noinclude>[[Category:Disambiguation and
redirection templates|Geodis]]</noinclude>
This neither human nor machine readable and the only thing that can make
sense out of it is the MediaWiki with its 100000 lines of PHP code
dedicated to interpreting mess like this. Just figuring out what gets
included from a template page is complex, full of special cases and
exceptions:
# We're storing template text for future inclusion, therefore,
# remove all <noinclude> text and keep all <includeonly> text
# (but eliminate <includeonly> tags per se).
# However, if <onlyinclude> ... </onlyinclude> parts are present,
# then only keep them and discard the rest of the template body.
# This is because using <onlyinclude> on a text fragment is
# equivalent to enclosing it in <includeonly> tags **AND**
# enclosing all the rest of the template body in <noinclude> tags.
# These definitions can easily span several lines, hence the "/s" modifiers.
The very Wiki markup that made Wikipedia accessible to many is now
making hard for common people to contribute. If I want to make a new page
on Wikipedia today and mess up the markup there is a good chance it will
get deleted. It isn't realistic to expect people will read through long, boring pages
describing the markup.
How exactly would one solve this problem? I don't know, but I'm sure it
won't be easy - most of the pages on the Web still aren't
standards-compliant. The difference with Wikipedia is that it is all under
the control of WikiMedia Foundation, so in theory it would be possible to
try to automatically convert all pages to some saner, more strict markup
and manually fix those that failed to convert. However it would require
some enormous effort and it would probably turn away a lot of current
editors so I don't think it will happen any time soon.