Google index coverage

08.02.2019 11:57

It recently came to my attention that Google has a new Search Console where you can see the status of your web site in Google's search index. I checked out what it says for this blog and I was a bit surprised.

Some things I expected, like the number of pages I've blocked in the robots.txt file to prevent crawling (although I didn't know that blocking a URL there means it can still appear in search results). Other things were weirder, like this old post being flagged as a soft 404, as if the server had returned a Not Found response. My web server is properly configured and quite capable of sending correct HTTP response codes, so ignoring the standard in that regard is just craziness on Google's part. But the thing that caught my eye the most was the number of Excluded pages on the Index Coverage pane:

Screenshot of the Index Coverage pane.

Considering that I have fewer than a thousand published blog posts, this number seemed high. Diving into the details, it turned out that most of the excluded pages were redirects to canonical URLs and Atom feeds for post comments. However, at least 160 URLs were permalink addresses of actual blog posts (there may be more, because the CSV export only contains the first 1000 URLs).
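For reference, the kind of filtering I did on the export can be sketched in a few lines of Python. The file name, column header and URL pattern below are placeholders rather than the exact ones from the Search Console export or from this site:

    import csv

    # Placeholder file name for the CSV exported from Search Console.
    EXPORT = "excluded-urls.csv"

    def looks_like_post(url):
        # Placeholder pattern for a blog post permalink. The real check
        # depends on the site's URL scheme and should also skip the
        # per-post comment feeds and the redirects to canonical URLs.
        return "/blog/" in url and not url.endswith("/atom")

    with open(EXPORT, newline="") as f:
        # Assuming the export has a column named "URL".
        urls = [row["URL"] for row in csv.DictReader(f)]

    posts = [u for u in urls if looks_like_post(u)]
    print(len(urls), "excluded URLs in the export,",
          len(posts), "look like post permalinks")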

Index coverage of blog posts versus year of publication.

All of these were in the "Crawled - currently not indexed" category. In their usual hand-waving way, Google describes this as:

The page was crawled by Google, but not indexed. It may or may not be indexed in the future; no need to resubmit this URL for crawling.

I read this as "we know this page exists, there's no technical problem, but we don't consider it useful to show in search results". The older the blog post, the more likely it was to be excluded. Google's index apparently contains only around 60% of my content from 2006, but 100% of what was published in the last couple of years. I've tried searching for some of these excluded blog posts and indeed they don't show up in the results.

I have no intention of complaining about my early writings not being shown to Google's users. As long as my web site complies with generally accepted technical standards, I'm happy. I write about things that I find personally interesting and that I earnestly believe might be useful information in general. I don't feel entitled to be shown in Google's search results, and what they include in their index or not is their own business.

That said, it did make me think. I use Google Search almost exclusively to find information on the web. I suspected that they heavily prioritize new content over old, but I had never seriously considered that Google might be intentionally excluding parts of the web from their index altogether. I often hear the sentiment that the old web is disappearing, that the long tail of small websites is as good as gone. Some old one-person web sites may indeed be gone for good, but as this anecdote shows, some such content might just not be discoverable through Google.

All this made me switch my default search engine in Firefox to DuckDuckGo. Granted, I don't know what they include in or exclude from their index either. I have yet to see how well it works, but maybe it isn't such a bad idea to go back to the time when trying several search engines for a query was standard practice.

Posted by Tomaž | Categories: Life

Comments

How do DuckDuckGo's search results fare with your old content?

Posted by Tyler G

Google posted about using algorithms to manage canonical tags. The bot can now group similar pages together even when the tags are wrong. Solving complex canonical tags is better served by automation, as I've seen groups fail to be as nuanced as the algorithm. Before that, there were some old examples of the Google index slowly forgetting pages and reindexing them. I wonder if rereading your older posts, deleting inaccurate ones and updating others would help. Also, I've noticed small sites being shown in search more, but only those that are actively updated.

Are you saying Google is actually hurting itself by being transparent about what is and isn't indexed? After all, you don't seem to have evidence that DuckDuckGo is any better, but since they don't tell you one way or the other you seem to prefer them.

Posted by Jake

Tyler,

DuckDuckGo is a meta search engine that uses Bing for its results.

Posted by Mikkom

I used DuckDuckGo as my default search engine for months and finally switched back to Google (with JS disabled, a VPN, and href rewriting via GreaseMonkey).

Especially for semi-obscure or technical topics, DuckDuckGo's results can't hold a candle to Google's. They're awful by comparison. We don't know how good we have it with Google. Try some comparisons for yourself.

Recently I have experimented with another one called Qwant that (so far) seems significantly better than DDG: https://www.qwant.com/

Posted by Wes

DuckDuckGo is okay for most searches. And if it fails, you can always append !g and switch to Google.

DuckDuckGo often has mediocre search results, so I usually use StartPage.com, which basically proxies searches to Google through its own servers. That way I get Google-quality results without the spying, and without the filter bubble, since Google doesn't know I'm the one doing the search!

Thanks for sharing. To me this comes as a complete surprise, as I am using Google most of the time. Could you share some URLs of the non-indexed pages/posts, so that we could try other search engines on them?

Posted by MarcusLindemann
