United colors of the blogosphere

10.06.2011 21:57

Several months ago we had a discussion in the office about the icons that Zemanta automatically adds to the footer of blog posts that contain suggested content. The conversation mostly revolved about how aesthetically pleasing they are combined with various web site designs out there.

What bothered me is that most of the arguments there were based on guesses and anecdotal evidence. It made me curios about what are the actual prevailing colors used on web sites out there. So I dumped the list of blogs Zemanta knows about, threw together a bunch of really simple shell scripts and let a machine crawl the blogs around the world. Of course it wasn't that simple and it wasted a week making screen shots of a Firefox error window before I noticed and fixed the bug. The whole machinery grew up to be pretty complex towards the end, mostly because it turns out that modern desktop software just isn't up to such a task (and I refused to go through the process of embedding a HTML rendering engine into some custom software). When you are visiting tens of thousands of pages a browser instance is good for at best one page load and the X server instance survives maybe thousand browser restarts.

Collage of screen shots of a few blogs.

After around two months and a bit over 150.000 visited blogs I ended up with 50 GB of screen shots, which hopefully make a representative sample of the world's blogger population.

So far I extracted two numbers from each of those files: the average color (the mean red, green and blue values for each page) and the dominant color (the red, green and blue value for the color that is present in the most pixels on the page). The idea is that the dominant color should generally be equal to the background color (except for pages that use a patterned background), while the average color is also affected by the content of the page.

Here are how histograms of those values look like, when converted to the HSV color model. Let's start with the dominant colors:

Histogram of dominant color hue used in blog themes.

You can see pretty well defined peaks around orange, blue and a curious sharp peak around green. Note that this graph only shows hue, so that orange peak also includes pages with, for instance, light brown background.

I excluded pages where the dominant color had zero saturation (meaning shades of gray from black to white) and as such had an undefined hue.

Histogram of dominant color saturation used in blog themes.

The saturation histogram is weighted heavily towards unsaturated colors (note that the peak at zero is much higher and is cut off in this picture). This is pretty reasonable. Saturated backgrounds are a bad choice for blogs, which mainly publish written content and should focus on the legibility of the text.

Histogram of dominant color value used in blog themes.

Again this result is pretty much what I expected. Peaks at very light colors and very dark ones. Backgrounds in the middle of the scale don't leave much space for text contrast.

Moving on to histograms of average colors:

Histogram of average hues used in blog themes.

Average color hues are pretty much equivalent to dominant color hues, which increases my confidence in these distributions. Still we have high peaks around orange and blue, although they are a bit more spread out. That is expected, since average colors are affected by content on the site and different blogs using the same theme but publishing different content will have a slightly different average color.

Histogram of average color saturation used in blog themes.

Again, weighted strongly towards unsaturated colors.

Histogram of average color value used in blog themes.

Now this is interesting. The peak around black has disappeared completely! This suggests that the black peak in dominant colors was an artifact, probably due to the black color of the text being dominant over any single background color (say in a patterned background). The white peak is again very spread out, probably due to light background colors mixing with dark text in the foreground.

Conclusions at this point would be that light backgrounds are in majority over dark backgrounds, most popular colors are based on orange and blue and most bloggers have the common sense to use desaturated colors in their designs.

I'm sure there are loads of other interesting metrics that can be extracted from this dataset, so any suggestions and comments are welcome as always. I also spent this Zemanta Hack Day working on a fancy interactive visualization, which will be a subject of a future blog post.

Posted by Tomaž | Categories: Ideas

Comments

A very nice read, indeed.

Since you used a browser instance on top of X to harvest the image collection, I was wondering if you took fixed height screenshots or did you somehow capture the whole page from top to bottom?

I took fixed size screenshots. (1024x768 to be exact, with scroll bars and browser chrome hidden or cropped away). So anything below the fold isn't accounted for in these statistics.

I didn't find any simple, scriptable ways of capturing screenshots of whole pages.

Posted by Tomaž

FYI: just found this neat CLI tool that's written in python and uses webkit to grab screenshots of the whole webpage http://www.paulhammond.org/webkit2png/

Thanks, this looks very useful.

Posted by Tomaž

Possibly even more useful is the https://github.com/AdamN/python-webkit2png/ that runs nicely on Linux, does proper screenshots and has some more params with which to control when the screenshot is made.

Posted by fry

Add a new comment


(No HTML tags allowed. Separate paragraphs with a blank line.)