Blocking Bots The Nginx Way 2024-05-02
Recently I saw a post on minutestomidnight.co.uk about blocking certain bots (specifically targeting AI scrapers) from a website. I have been doing this for some time now, but with Nginx rather than Apache. Now, some folks may question why anyone would want to block scrapers from their site. Well, in my case I am not interested in being a source of AI knowledge, and I also block garbage crawlers like Bing, which I suspect has generated very close to zero legitimate views of my site over the years. Does this potentially affect my SEO? Sure... but if you're asking that question, then clearly you are not familiar with the rest of this site, which generally slags the SEO industry as a bunch of whores. Besides, you are reading this, so clearly my SEO is just fine and I don't really need help from the wankers using outlook.com email addresses to offer me quotes on getting to the first page of Google (not Bing) results.
To accomplish this, I use a map in Nginx that matches the user-agent strings I wish to block. The map lives in its own file, pulled in from the main config. Note that map is only valid at the http level, so the include has to sit inside the http block.

/etc/nginx/nginx.conf (inside the http block):

    include /etc/nginx/block_bots.conf;

/etc/nginx/block_bots.conf:
    map $http_user_agent $badagent {
        ~BotPoke 1;
        ~GPT 1;
        ~bingbot 1;
        ~AhrefsBot 1;
        ...
        default 0;
    }
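As a side note, those regex matches are case-sensitive. Nginx's map also accepts ~* for case-insensitive regexes, which is a bit more forgiving of bots that fiddle with their casing. A rough sketch of that variant, using the same names from my list plus GPTBot (which is what OpenAI's crawler actually calls itself):

    map $http_user_agent $badagent {
        default        0;
        ~*botpoke      1;
        ~*gptbot       1;   # OpenAI's crawler identifies itself as GPTBot
        ~*bingbot      1;
        ~*ahrefsbot    1;
    }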
Then use an if statement to dump connections with those user-agents. Now, before you go telling me that if is evil, you should get your shit straight. The article you are referring to is this one, and the title is not "If Is Evil". The title is "If is Evil... when used in location context", and you do not need to use this in a location context. My location context has no if statements, so shut the fuck up. Having said that, you can use:
    if ($badagent) { return 444; }
to reject the agents you do not want. I guess you could be kinder and give them a redirect instead, but it's not like they were being kind to your site in the first place, right?
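For anyone who wants to see the whole thing wired together, here is a rough sketch of a server block with the if sitting at the server level rather than inside a location. The server name and paths are placeholders, not my actual config:

    server {
        listen 80;
        server_name example.com;              # placeholder

        # $badagent comes from the map in block_bots.conf.
        # 444 is Nginx-specific: close the connection without sending a response.
        if ($badagent) {
            return 444;
        }

        location / {
            root /var/www/example.com;        # placeholder
            index index.html;
        }
    }

A quick way to check it is doing what you expect is something along the lines of curl -A "bingbot" -v http://example.com/, which should come back with an empty reply from the server, while your normal browser user-agent still gets the page.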