Posted to users@httpd.apache.org by Doug McNutt <do...@macnauchtan.com> on 2010/04/13 19:23:47 UTC

[users@httpd] Scrubbing log files

It's a shared Apache 2 server that's set up to put daily log files in my home directory, and I can't muck with the config files. What I'm trying to do is remove the entries from spiders, robots, and other requests that don't matter to me. My Perl script currently looks for the IP addresses used for /robots.txt requests and removes the other entries from those IP addresses. But that doesn't catch entries from the likes of Yahoo, which informs me that "slurp" is the word to look for in the browser identification (User-Agent) field. That works.
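(For reference, that two-pass scheme fits in a few lines of Perl. This is a minimal sketch, assuming the Apache common/combined log format with the client IP as the first field; "access_log" is a placeholder filename:)

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $file = shift || 'access_log';    # placeholder name

    # Pass 1: remember every IP that fetched /robots.txt.
    my %crawler_ip;
    open my $log, '<', $file or die "$file: $!";
    while (<$log>) {
        $crawler_ip{$1} = 1 if m{^(\S+).*"(?:GET|HEAD) /robots\.txt};
    }

    # Pass 2: re-read the file and keep only lines from other IPs.
    seek $log, 0, 0 or die "seek: $!";
    while (<$log>) {
        print unless /^(\S+)/ && $crawler_ip{$1};
    }
    close $log;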

But I'm also matching on "bot", "spider", and some others that, I'm afraid, will pull out entries I would rather keep because of accidental matches.

Are there any lists of common robots on the net?  Are there some regular expressions or searches that would help? Are there known IP addresses that are safe to discard?
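One thing that cuts down on accidental matches: test only the User-Agent field (the last quoted field in the combined log format) rather than the whole line, and match a curated list of crawler names instead of bare substrings. A sketch -- the name list here is only illustrative:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Illustrative names only -- extend the list from your own logs.
    # Note that \b word boundaries around "bot" would miss UAs like
    # "Googlebot", so known names are listed outright instead.
    my $bot_re = qr/googlebot|bingbot|msnbot|slurp|baiduspider|yandex|teoma|crawler|spider/i;

    while (<>) {
        my ($ua) = /"([^"]*)"\s*$/;    # last quoted field = User-Agent
        print unless defined $ua && $ua =~ $bot_re;
    }

Run it as, e.g., "perl scrub.pl access_log > clean_log" (hypothetical filenames).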

-- 

--> A fair tax is one that you pay but I don't <--



RE: [users@httpd] Scrubbing log files

Posted by Geoff Millikan <gm...@t1shopper.com>.
> Are there any lists of common robots on the net?  Are there 
> some regular expressions or searches that would help? Are 
> there known IP addresses that are safe to discard?

I believe your question is off topic for this forum; however, I'll share
our joy with you.

Some are known by hostname:
http://googlewebmastercentral.blogspot.com/2006/09/how-to-verify-googlebot.html

Others by IP:
http://www.cuil.com/info/webmaster_info/

We whitelist certain bots; others get banned if they crawl too fast and
don't obey robots.txt. Maintaining this is a lot of ongoing work,
especially when a bot operator identifies itself by plain IP addresses
instead of forward-confirmed reverse DNS
(http://en.wikipedia.org/wiki/Forward-confirmed_reverse_DNS), which
Google, MSN, Yahoo, etc. use and which is much more flexible.
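For what it's worth, an FCrDNS check is only a few lines of Perl with the
core Socket module. A sketch (the domain pattern and sample IP are
illustrative, per Google's verification page above):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Socket;

    # Forward-confirmed reverse DNS: resolve IP -> hostname, then
    # hostname -> IPs, and require the original IP among the results.
    sub fcrdns_ok {
        my ($ip, $domain_re) = @_;
        my $host = gethostbyaddr(inet_aton($ip), AF_INET);
        return 0 unless defined $host && $host =~ $domain_re;
        my @ent = gethostbyname($host);   # (name, aliases, type, len, addrs...)
        return grep { inet_ntoa($_) eq $ip } @ent[4 .. $#ent];
    }

    # Example: does this client really resolve back to googlebot.com?
    print fcrdns_ok('66.249.66.1', qr/\.googlebot\.com$/) ? "verified\n" : "spoofed\n";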

Some code & thoughts to keep you busy:
http://www.webmasterworld.com/google/3092423.htm
http://www.webmasterworld.com/php/3606836.htm

Thanks,

http://www.t1shopper.com/