Posted to user@hive.apache.org by Cam Bazz <ca...@gmail.com> on 2011/02/09 04:57:53 UTC

filtering out crawlers

Hello,

Is there a practical way to filter out the log entries left by crawlers like Google?

They usually have user-agent strings like:

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)

Is there a database for these?

Best Regards,
-C.B.

Re: filtering out crawlers

Posted by Wil - <wi...@yahoo.com>.
Hi,

There are quite a few online databases of known robots; 
http://www.robotstxt.org/db.html 
and http://www.botsvsbrowsers.com/category/1/index.html come to mind. The 
hardest part is detecting the suspect robots that do not identify 
themselves.
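
For the robots that do identify themselves, a simple pattern match on the user-agent field may be enough. Below is a minimal sketch in Python, not a definitive implementation: the token list in BOT_PATTERN and the names is_crawler and filter_crawlers are made up for illustration, and the list would need to be extended from a robots database such as the ones above.

```python
import re

# Hypothetical token list (an assumption for illustration); extend it
# from a database of known robots.
BOT_PATTERN = re.compile(r"googlebot|bingbot|slurp|crawler|spider", re.IGNORECASE)


def is_crawler(user_agent):
    """Return True if the user-agent string contains a known crawler token."""
    # Treat a missing user agent (None) as an empty string.
    return bool(BOT_PATTERN.search(user_agent or ""))


def filter_crawlers(user_agents):
    """Keep only the user-agent strings that do not look like crawlers."""
    return [ua for ua in user_agents if not is_crawler(ua)]
```

In Hive itself the same kind of pattern could probably go into a WHERE clause using RLIKE on the user-agent column, with the exact table and column names depending on how the logs are loaded. This only catches crawlers that announce themselves; the ones that fake a browser user agent need other signals.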
