Several months ago we had a discussion in the office about the icons that Zemanta automatically adds to the footer of blog posts that contain suggested content. The conversation mostly revolved around how aesthetically pleasing they are when combined with the various web site designs out there.
What bothered me is that most of the arguments there were based on guesses and anecdotal evidence. It made me curious about which colors actually prevail on web sites out there. So I dumped the list of blogs Zemanta knows about, threw together a bunch of really simple shell scripts and let a machine crawl the blogs around the world. Of course it wasn't that simple: the crawler wasted a week making screen shots of a Firefox error window before I noticed and fixed the bug. The whole machinery grew to be pretty complex towards the end, mostly because it turns out that modern desktop software just isn't up to such a task (and I refused to go through the process of embedding an HTML rendering engine into some custom software). When you are visiting tens of thousands of pages, a browser instance is good for at best one page load, and the X server instance survives maybe a thousand browser restarts.
After around two months and a bit over 150,000 visited blogs I ended up with 50 GB of screen shots, which hopefully make up a representative sample of the world's blogger population.
So far I extracted two numbers from each of those files: the average color (the mean red, green and blue values for each page) and the dominant color (the red, green and blue value for the color that is present in the most pixels on the page). The idea is that the dominant color should generally be equal to the background color (except for pages that use a patterned background), while the average color is also affected by the content of the page.
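To make the two metrics concrete, here is a minimal sketch in Python. It is not the code used for the crawl; the function names and the toy list of pixels are made up for illustration, and a real screen shot would first have to be decoded into a list of (red, green, blue) tuples.

```python
from collections import Counter

def average_color(pixels):
    """Mean red, green and blue values over all pixels."""
    n = len(pixels)
    return tuple(sum(p[c] for p in pixels) / n for c in range(3))

def dominant_color(pixels):
    """The exact RGB value that is present in the most pixels."""
    return Counter(pixels).most_common(1)[0][0]

# A toy 10-pixel "page": white background with a little black text.
page = [(255, 255, 255)] * 8 + [(0, 0, 0)] * 2

print(dominant_color(page))  # (255, 255, 255)
print(average_color(page))   # (204.0, 204.0, 204.0)
```

The dominant color ignores the text entirely, while the average is pulled towards gray by it, which is the difference described above.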
Here is what histograms of those values look like when converted to the HSV color model. Let's start with the dominant colors:
You can see pretty well defined peaks around orange and blue, and a curious sharp peak around green. Note that this graph only shows hue, so that orange peak also includes pages with, for instance, a light brown background.
I excluded pages where the dominant color had zero saturation (meaning shades of gray from black to white) and as such had an undefined hue.
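The conversion and the exclusion of grays could look roughly like the sketch below. It uses Python's standard colorsys module; the function name, the choice of 12 hue bins and the sample colors are assumptions made for this example, not details of the original analysis.

```python
import colorsys
from collections import Counter

def hue_histogram(colors, bins=12):
    """Count colors per hue bin, skipping grays whose hue is undefined."""
    counts = Counter()
    for r, g, b in colors:
        # colorsys expects channel values in the range 0..1
        h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
        if s == 0:
            continue  # black, white and grays have no meaningful hue
        counts[int(h * bins) % bins] += 1
    return counts

# Hypothetical dominant colors of five pages:
# white, black, orange, light brown and blue.
colors = [(255, 255, 255), (0, 0, 0), (255, 165, 0),
          (210, 180, 140), (30, 100, 200)]
hist = hue_histogram(colors)
print(dict(sorted(hist.items())))  # {1: 2, 7: 1}
```

With 12 bins each bin covers 30 degrees of hue. Orange and light brown both land in bin 1 (30 to 60 degrees), which shows how a hue-only graph lumps them together, and the white and black pages are dropped.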
The saturation histogram is weighted heavily towards unsaturated colors (note that the peak at zero is much higher and is cut off in this picture). This is pretty reasonable. Saturated backgrounds are a bad choice for blogs, which mainly publish written content and should focus on the legibility of the text.
Again, this result for the value (brightness) component is pretty much what I expected: peaks at very light colors and very dark ones. Backgrounds in the middle of the scale don't leave much room for text contrast.
Moving on to histograms of average colors:
Average color hues are pretty much equivalent to dominant color hues, which increases my confidence in these distributions. We still have high peaks around orange and blue, although they are a bit more spread out. That is expected, since average colors are affected by the content on the site, and different blogs using the same theme but publishing different content will have slightly different average colors.
Again, weighted strongly towards unsaturated colors.
Now this is interesting. The peak around black has disappeared completely! This suggests that the black peak in dominant colors was an artifact, probably due to the black color of the text being dominant over any single background color (say in a patterned background). The white peak is again very spread out, probably due to light background colors mixing with dark text in the foreground.
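A toy example shows how this artifact can arise. The pixel counts below are invented for illustration: a patterned background made of ten slightly different light shades, plus black text covering a fifth of the page.

```python
from collections import Counter

# Hypothetical page: ten light shades (8 pixels each) form the patterned
# background, and black text takes up the remaining 20 pixels.
background = [(246 + i, 246 + i, 246 + i) for i in range(10) for _ in range(8)]
text = [(0, 0, 0)] * 20
page = background + text

# Most frequent single color versus the per-channel mean.
dominant = Counter(page).most_common(1)[0][0]
average = tuple(sum(p[c] for p in page) / len(page) for c in range(3))

print(dominant)  # (0, 0, 0)
print(average)   # every channel is close to 200, a light gray
```

Black wins the dominant color vote with only 20% of the pixels, because no single background shade covers more than 8%, yet the average color stays light.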
My conclusions at this point are that light backgrounds outnumber dark ones, that the most popular colors are based on orange and blue, and that most bloggers have the common sense to use desaturated colors in their designs.
I'm sure there are loads of other interesting metrics that can be extracted from this dataset, so any suggestions and comments are welcome as always. I also spent this Zemanta Hack Day working on a fancy interactive visualization, which will be the subject of a future blog post.