Posted to agent@nutch.apache.org by Brian Ziman <br...@swisspig.net> on 2006/06/12 01:39:27 UTC

decomposing URLs issue

Dear Nutch Project Gurus,

I'm the webmaster of http://swisspig.net/, and I have noticed periodic
access by the Nutch crawler at U Washington.  Today's access was
strange, however: the crawler requested a *portion* of a URL, i.e. a
truncated path that is not itself linked from anywhere.  This might be a
bug in the crawler, or a bug in a modification made by the UW folks.  The
relevant log snippets are:

128.208.6.200 - - [11/Jun/2006:18:27:27 -0400] "GET /robots.txt HTTP/1.0" 200 262 "" "NutchCVS/0.8-dev (Nutch running at UW; http://www.nutch.org/docs/en/bot.html; sycrawl@cs.washington.edu)"
128.208.6.200 - - [11/Jun/2006:18:27:28 -0400] "GET /post.php HTTP/1.0" 200 25000 "" "NutchCVS/0.8-dev (Nutch running at UW; http://www.nutch.org/docs/en/bot.html; sycrawl@cs.washington.edu)"
128.208.6.200 - - [11/Jun/2006:18:27:33 -0400] "GET / HTTP/1.0" 200 25000 "" "NutchCVS/0.8-dev (Nutch running at UW; http://www.nutch.org/docs/en/bot.html; sycrawl@cs.washington.edu)"
128.208.6.200 - - [11/Jun/2006:18:27:38 -0400] "GET /r/post/ HTTP/1.0" 200 25000 "" "NutchCVS/0.8-dev (Nutch running at UW; http://www.nutch.org/docs/en/bot.html; sycrawl@cs.washington.edu)"
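
In case it helps anyone reproduce this from their own logs, here is a rough sketch of how the crawler's requested paths could be pulled out of an access log. It assumes the Apache combined log format with each entry on a single line (not wrapped as mail clients tend to do); the names LOG_RE and requested_paths are my own and purely illustrative.

```python
import re

# Combined log format:
#   host ident user [time] "METHOD path PROTO" status bytes "referer" "agent"
LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def requested_paths(lines, agent_substring="Nutch"):
    """Return the request paths of entries whose user agent contains agent_substring."""
    paths = []
    for line in lines:
        match = LOG_RE.match(line)
        # Skip lines that do not parse (e.g. wrapped or truncated entries).
        if match and agent_substring in match.group("agent"):
            paths.append(match.group("path"))
    return paths
```

Feeding it the entries above (unwrapped) should list /robots.txt, /post.php, / and /r/post/ in request order.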

Please note that http://swisspig.net/post.php and 
http://swisspig.net/r/post/ are scripts (the same script actually -- I 
recently migrated from the format "/post.php?id=foo" to "/r/post/foo") 
that are not meant to be accessed directly.  There are of course no 
links from http://swisspig.net/ to these URLs.
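
To make the suspicion concrete: the sketch below lists the shorter URLs you get by dropping the query string and then chopping trailing path segments off a full URL. This is only my guess at the kind of decomposition that could produce such requests. It is not taken from the Nutch source, and the function name truncated_candidates is made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def truncated_candidates(url):
    """List shorter URLs formed by dropping the query, then trailing path segments.

    Hypothetical illustration only; not Nutch's actual behavior.
    """
    parts = urlsplit(url)
    candidates = []
    if parts.query:
        # Same path with the query string (and fragment) removed.
        candidates.append(urlunsplit((parts.scheme, parts.netloc, parts.path, "", "")))
    path = parts.path
    while path not in ("", "/"):
        # Drop the last segment, keeping the trailing slash of its parent.
        path = path.rstrip("/")
        path = path[: path.rfind("/") + 1]
        candidates.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return candidates


print(truncated_candidates("http://swisspig.net/post.php?id=foo"))
print(truncated_candidates("http://swisspig.net/r/post/foo"))
```

For the old-style URL this yields /post.php and /; for the new-style one it yields /r/post/, /r/ and /.  The three paths in my log are all among these candidates, although /r/ does not appear in the excerpt above, so the match is suggestive rather than conclusive.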


Regards,
Brian Ziman
webmaster, swisspig.net