Treacherous waters

09.02.2011 21:44

These days (or rather months) my daily work at Zemanta often takes me to parts of the web I would not normally visit. Its shadier parts, so to speak. And it turns out that those are surprisingly crowded.

I'm sure you've been there. Probably when you were searching for some useful piece of information and such sites cluttered the top of the result list, and you had to sort through piles and piles of fluff before you found what you were looking for. Or maybe one was recommended by a friend through one of the many channels such recommendations travel in the age of the social web. Perhaps you even had to deal with a bug report because a piece of your web-facing software, while compliant with all relevant standards, didn't perform to some user's satisfaction when dealing with such a web site.

Dark tunnel that is HTML 5

Imagine for a second the stereotypical web site of this class: fixed-width design, unreadably small gray type on a white background. Top-left logo in saturated colors and gray gradients, courtesy of web-two-point-oh. Probably the definitive destination for some wildly popular topic right off the first page of your typical yellow press (celebrities, health, cars, shopping) or for some emerging interest (say Android development). At most two paragraphs of actual content and the rest filled with ads, user comments and everything in between. And of course at least 10 ways to share this page on all the social networks you know about, plus 10 more that you don't.

Considering that serving those abominations of the web is the only thing the companies behind them do, they are surprisingly incompetent at it. Pages won't validate or will throw a hundred JavaScript errors from tens of different scripts that load behind the curtains. The little content there is was scraped from Wikipedia, or it looks like someone from a less fortunate country was hired to copy-and-paste a few statements on a prescribed topic from all around the Internet. Everything under a CC license is considered free-for-all (but don't dare break their lengthy personal-use-only terms of use!). Nobody cares that there are encodings other than ASCII, or about the subtleties of XML parsing, or for that matter even that the description of a software product and an image that shows a porn star of the same name do not refer to the same thing. As long as half a dopamine-starved human brain is able to decode it, it's good enough.

What's puzzling at first is that such sites seem to be getting a shocking amount of traffic (and probably revenue as well). Of course, opinions about their quality differ. Even among my colleagues some consider such sites valuable destinations. They have comment buttons and you can share them on Twitter! They're way more fun to visit than some tired old Wikipedia that doesn't even have Facebook integration. Never mind that any user-contributed discussion is as devoid of actual content as the site itself.

What I see in such pages is an evolution of link farming. Social farming, if you will. Search engines have gotten better at detecting content that has just been blatantly and automatically copied from somewhere. So an up-to-date spammer, er, vertical influencer has switched from a website-copying bot to a few mechanical turks producing syntactically unique but semantically carbon-copied content. The network effect of modern social networks brings more and more people to the site, and they produce worthless comments that in turn give it the appearance of a respectable site. At this point they are trying to trick an algorithm by introducing living people into the content-copying process.

That is why you can hear a lot about how traditional search engines will in time be completely replaced by your social network, introducing wetware on the other side as well. The idea is that natural language processing and information retrieval won't be able to distinguish between what you would consider a reputable site and a link-farmed site that approximately copied content from that reputable site. But your friends in a social network will. First, because they are (hopefully) human and can understand things AI can't, and second, because you share their interests and trust what they trust. They can therefore, in theory, push more useful information in your direction than some algorithm, even one that is intimately familiar with your click and search history.

However, I think the sites I described above are a perfect example of why this scheme won't reduce the fluff you need to go through before getting to the information you want. At this point you don't even have to fool search engines any more, because once you've got people clicking those little "share" buttons they will bring more visitors to your site and push your coupon codes and endorsements and whatnot regardless. In the end it's just as easy to subvert a human social network into passing around unsolicited advertisements as it is to subvert a software algorithm. You just need a different kind of engineering.

Posted by Tomaž | Categories: Ideas
