Google Are Bullies (403 Means 403) 2016-05-14
Today I was going through the logs on my web server (a hobby I probably take more seriously than I should) when I noticed the following log entry:
14/May/2016:18:39:45 -0400 184.108.40.206 [archive.snork.ca] "GET /wp-content/uploads/2014/02/Snorkified_Homer.exe HTTP/1.1" 200 "http://archive.snork.ca/?m=201402" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.110 Safari/537.36"
Well of course I was immediately interested in knowing who was downloading my nifty Snorkified_Homer file, because frankly it isn't something that gets many hits around here. So I did a lookup on the IP address 220.127.116.11, and it turns out it is owned by Google. Does that mean someone at Google was actually downloading it? And if so, why were they not looking at the web page that links to it? Then I wondered if maybe it was their web crawler secretly scraping my site with a fake user-agent. I decided to search for a tool or list that would tell me whether the IP address is in fact a scraper address, and that is when I came across Google's official webmaster FAQ, which says:
I return 403 "Forbidden" for all URLs, including the robots.txt file. Why is the site still being crawled?
The HTTP result code 403—as all other 4xx HTTP result codes—is seen as a sign that the robots.txt file does not exist. Because of this, crawlers will generally assume that they can crawl all URLs of the website. In order to block crawling of the website, the robots.txt must be returned normally (with a 200 "OK" HTTP result code) with an appropriate "disallow" in it.
But that is not true. A 403 status does not mean that the file does not exist. It means that whoever is trying to access it is forbidden from seeing it. The list of HTTP Status Codes is quite clear about this. A 403 is defined as:
A web server may return a 403 Forbidden HTTP status code in response to a request from a client for a web page or resource to indicate that the server can be reached and understood the request, but refuses to take any further action. Status code 403 responses are the result of the web server being configured to deny access, for some reason, to the requested resource by the client.
Denying access to a resource is very different from not finding it. Unfortunately, Google's response to a 403 is to give themselves permission to continue to crawl for the same resource if they see fit. What is even worse is that frankly there is no other crawler that even comes close to Google's abilities. Other search engines are frequently either significantly out of date or have serious problems understanding site layouts (even really common ones like WordPress sites). It is really very disappointing to think that a search engine is probably the most important and frequently used web-based tool, and yet we are stuck choosing between a monopoly run by bullies and entirely inferior results.
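To make the disagreement concrete, here is a minimal Python sketch of the two ways a crawler could react to the status code it gets back when it asks for robots.txt. The first function follows the FAQ answer quoted above; the second is the behaviour I would expect. The names RobotsDecision, lenient_policy and strict_policy are made up for illustration, and the handling of anything other than 200 and 4xx is my own assumption, since the FAQ answer does not cover it. This is not Google's actual code.

```python
from enum import Enum


class RobotsDecision(Enum):
    """What a crawler decides after requesting robots.txt (hypothetical names)."""
    PARSE_RULES = "parse the file and obey its rules"
    ALLOW_ALL = "crawl everything, as if no robots.txt existed"
    DISALLOW_ALL = "crawl nothing"
    RETRY_LATER = "outcome unclear, try again later"


def lenient_policy(status: int) -> RobotsDecision:
    """Policy as described in the quoted FAQ: every 4xx means 'no robots.txt'."""
    if status == 200:
        return RobotsDecision.PARSE_RULES
    if 400 <= status < 500:
        return RobotsDecision.ALLOW_ALL
    # Assumption: the FAQ answer says nothing about other status codes.
    return RobotsDecision.RETRY_LATER


def strict_policy(status: int) -> RobotsDecision:
    """Alternative reading: 401 and 403 mean 'keep out', not 'file missing'."""
    if status == 200:
        return RobotsDecision.PARSE_RULES
    if status in (401, 403):
        return RobotsDecision.DISALLOW_ALL
    if 400 <= status < 500:
        return RobotsDecision.ALLOW_ALL
    # Assumption: same fallback as above for codes outside 200 and 4xx.
    return RobotsDecision.RETRY_LATER


if __name__ == "__main__":
    # Compare the two policies for a few common status codes.
    for code in (200, 403, 404):
        print(code, lenient_policy(code).name, strict_policy(code).name)
```

Under the first policy a 403 on robots.txt opens up the whole site; under the second it closes it. For what it is worth, Python's own urllib.robotparser module sides with the second reading: its read() method treats 401 and 403 as disallow-all and other 4xx codes as allow-all.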