Posted to agent@nutch.apache.org by Nathan Zipfel <nz...@atlanticbb.net> on 2007/08/30 02:51:24 UTC
Nutch not obeying robots.txt
Hi,
I was looking at my log files. I blocked the Nutch spider many months
ago, but it doesn't appear to be obeying the block.
http://www.pa-roots.org/robots.txt is my robots file
Here are examples from my log files:
132.178.248.36 - - [29/Aug/2007:06:11:21 -0400] "GET
/~randolph/history/history.html HTTP/1.0" 200 5340 "-" "Kishore Sajja
Research Crawler/Nutch-0.9 (Masters Project Research Crawler; Masters
Project; kishoresajja@mail.boisestate.edu)"
132.178.248.36 - - [29/Aug/2007:06:36:55 -0400] "GET
/~randolph/civilwar/index.html HTTP/1.0" 200 1551 "-" "Kishore Sajja
Research Crawler/Nutch-0.9 (Masters Project Research Crawler; Masters
Project; kishoresajja@mail.boisestate.edu)"
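
For anyone who wants to reproduce this kind of check, here is a minimal Python sketch. It parses an access-log entry in the format shown above (with the wrapped lines rejoined into one) and asks the standard library's urllib.robotparser whether a given robots.txt would allow the request. The robots.txt body below is an assumption made up for illustration, since the real file is not quoted in this message, and the helper names parse_log_line and is_allowed are hypothetical. Python's matching rules are also not necessarily the ones Nutch applies, so treat the result as a rough cross-check only.

```python
import re
from urllib.robotparser import RobotFileParser

# ASSUMPTION: illustrative robots.txt body only. The real file at
# http://www.pa-roots.org/robots.txt is not reproduced in this message.
ASSUMED_ROBOTS_TXT = """\
User-agent: Nutch
Disallow: /
"""

# Combined log format:
# host ident user [time] "method path protocol" status bytes "referer" "agent"
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"$'
)


def parse_log_line(line):
    """Return the fields of one log entry as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    return match.groupdict() if match else None


def is_allowed(robots_txt, agent_token, path):
    """Check a path against robots.txt text for the given user-agent token."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent_token, path)


if __name__ == "__main__":
    # First log entry from this message, rejoined onto a single line.
    sample = (
        '132.178.248.36 - - [29/Aug/2007:06:11:21 -0400] '
        '"GET /~randolph/history/history.html HTTP/1.0" 200 5340 "-" '
        '"Kishore Sajja Research Crawler/Nutch-0.9 (Masters Project Research '
        'Crawler; Masters Project; kishoresajja@mail.boisestate.edu)"'
    )
    record = parse_log_line(sample)
    print(record["agent"])
    print(is_allowed(ASSUMED_ROBOTS_TXT, "Nutch", record["path"]))
```

The token passed to is_allowed matters: a rule is only applied when the crawler's name matches the User-agent line, and which name a crawler checks for depends on how it was configured.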
Nathan Zipfel
PA-Roots.com
http://www.pa-roots.com/