Due to my very kind posting of the overnet gui clc source code, I have had many a search bot drive by to index my web site. Unfortunately there are parts which I don’t want indexed (like my wiki, which is boring/empty and slow to use, and the cgi-bin directories, etc). So I added a robots.txt to exclude these.
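For illustration, here is a minimal sketch of what such a robots.txt might contain, checked with Python's standard `urllib.robotparser`. The `/wiki/` and `/cgi-bin/` paths and the `ExampleBot` name are assumptions; the post only says the wiki and the cgi-bin directories are excluded, not where they live.

```python
from urllib.robotparser import RobotFileParser

# Assumed rules: the exact paths on the real site are not given in the post.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wiki/
Disallow: /cgi-bin/
"""


def is_allowed(path, user_agent="ExampleBot"):
    """Return True if a well-behaved bot may fetch `path` under ROBOTS_TXT."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    for path in ("/index.html", "/wiki/FrontPage", "/cgi-bin/search.cgi"):
        print(path, is_allowed(path))
```

This only models polite crawlers, of course. A bot that ignores robots.txt never asks the question in the first place.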

Now I’m all happy for Google, Ask Jeeves, etc to index everything else, but where I draw the line is spambots and other dark harvesters (and particularly Web Content International) that blatantly ignore robots.txt. Ideas to block them include mod_rewrite and deny from env, much as Mark did in “block spambots, ban spybots and tell unwanted robots to go to hell”.

My immediate solution was to add an iptables reject line for 65.102.*.*. It worked beautifully, with 10k packets blocked so far. Next I would most likely set up a honey pot (a page that’s linked to but excluded in robots.txt) and automatically add servers that request it to the iptables reject list. I’m just a bit worried about the security requirement (root) to add the iptables line *grin*
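A rough sketch of how the honey pot idea could work, under some assumptions of mine: the trap lives at a hypothetical `/trap/` path, the access log is in Apache common/combined format, and the script only builds the iptables argument lists rather than running them. Keeping the unprivileged scanning step apart from the part that needs root is one possible answer to the security worry.

```python
import ipaddress
import re

# Hypothetical trap path: linked from a page but disallowed in robots.txt.
TRAP_PATH = "/trap/"

# Start of an Apache common/combined log line:
# client, two fields, [timestamp], "METHOD path ...
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "(?:GET|HEAD|POST) (\S+)')


def trapped_addresses(log_lines, trap_path=TRAP_PATH):
    """Return the sorted IPv4 addresses that requested the trap path."""
    hits = set()
    for line in log_lines:
        match = LOG_LINE.match(line)
        if not match:
            continue
        client, path = match.groups()
        if not path.startswith(trap_path):
            continue
        try:
            # Validate strictly: this value ends up in a root command.
            hits.add(ipaddress.IPv4Address(client))
        except ValueError:
            continue  # hostname or junk in the client field
    return sorted(hits)


def reject_command(address):
    """Build (but do not run) the iptables argument list for one address."""
    return ["iptables", "-A", "INPUT", "-s", str(address), "-j", "REJECT"]


if __name__ == "__main__":
    sample = ['192.0.2.7 - - [05/Dec/2003:08:33:21 -0800] "GET /trap/ HTTP/1.0" 200 10']
    for addr in trapped_addresses(sample):
        print(" ".join(reject_command(addr)))
```

The web server itself never needs root this way. A small root-owned job (cron, say) could read the generated list and apply it, and because every address has been parsed as a real IPv4 address, nothing from the log line reaches the command unchecked.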

  1. Amen.

    These Web Content International a-holes need to be stopped. To be crawling on the scale that they do, they *really* should play by the rules (like identifying their UA properly and respecting robots.txt).

    I recently blocked them (after they hammered my site), giving them a 403 page. Look at this morning’s example from my logs below. 5 hits in 1 second using 5 different UserAgents. Lies. All Web Content International IP addresses.

    Web Content International are scum.

    - - [05/Dec/2003:08:33:21 -0800] "GET /games HTTP/1.0" 403 203 "-" "Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en-us) AppleWebKit/103u (KHTML, like Gecko) Safari/100"

    - - [05/Dec/2003:08:33:21 -0800] "GET /games HTTP/1.0" 403 203 "-" "Mozilla/4.0 (compatible; MSIE 4.0; Windows 95)"

    - - [05/Dec/2003:08:33:21 -0800] "GET /games HTTP/1.0" 403 203 "-" "Mozilla/4.77 [en] (X11; U; Linux 2.2.19 i686)"

    - - [05/Dec/2003:08:33:21 -0800] "GET /games HTTP/1.0" 403 203 "-" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90)"

    - - [05/Dec/2003:08:33:21 -0800] "GET /games HTTP/1.0" 403 203 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"

  2. 2 years later and they are still hammering away. Thanks for leaving this post up here.


