Posted to dev@nutch.apache.org by Alex diNorcia <al...@dinorcia.net> on 2012/10/25 17:59:07 UTC

misbehaving crawler

http://alex.dinorcia.net/robots.txt has been in place and unchanged
since Aug 24, 2004

i'd also point out that it's crawling poorly to boot. the original link
it got into the directory with was
http://alex.dinorcia.net/stuff_i_got_in_emails/?C=M;O=D
it appears to add the descending order part of the get variables to each
file and gets a 404 error.
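
For comparison, here is a minimal sketch of how standard relative URL resolution would treat that link. It uses Python's `urllib.parse.urljoin`, not Nutch's own code, and the file name is simply one of those seen in the logged requests. Under that resolution the query string of the directory listing (`C=M;O=D`) should not carry over to the file URL.

```python
from urllib.parse import urljoin

# Base URL taken from the message above; the relative link is one of the
# file names that appears in the logged requests.
base = "http://alex.dinorcia.net/stuff_i_got_in_emails/?C=M;O=D"
link = "LeafBlower.jpg"

# Standard relative resolution discards the base query string, so the
# sort parameters are not appended to the resolved file URL.
resolved = urljoin(base, link)
print(resolved)
```

A crawler that instead requests `LeafBlower.jpg;O=D` is carrying part of the listing's sort parameters over to each file, which would explain the 404 responses.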

here are some of the 14516 log entries that are not obeying the rules :
119.139.27.64 - - [25/Oct/2012:04:22:08 -0400] "GET /stuff_i_got_in_emails/Japanese%20Engrish%204.jpg;O=D HTTP/1.0" 404 246 "-" "HD nutch agent/Nutch-1.1 (Think)"
119.139.27.64 - - [25/Oct/2012:05:20:50 -0400] "GET /stuff_i_got_in_emails/LeafBlower.jpg;O=D HTTP/1.0" 404 238 "-" "HD nutch agent/Nutch-1.1 (Think)"
119.139.27.64 - - [25/Oct/2012:06:26:43 -0400] "GET /stuff_i_got_in_emails/snowmen3.gif;O=D HTTP/1.0" 404 236 "-" "HD nutch agent/Nutch-1.1 (Think)"
119.139.27.64 - - [25/Oct/2012:07:01:49 -0400] "GET /stuff_i_got_in_emails/Everything.About.The.Doctor.jpg;O=D HTTP/1.0" 404 255 "-" "HD nutch agent/Nutch-1.1 (Think)"
119.139.27.64 - - [25/Oct/2012:08:12:06 -0400] "GET /stuff_i_got_in_emails/fucked.jpg;O=D HTTP/1.0" 404 234 "-" "HD nutch agent/Nutch-1.1 (Think)"
119.139.27.64 - - [25/Oct/2012:08:18:54 -0400] "GET /stuff_i_got_in_emails/H28.gif;O=D HTTP/1.0" 404 231 "-" "HD nutch agent/Nutch-1.1 (Think)"
119.139.27.64 - - [25/Oct/2012:08:26:50 -0400] "GET /stuff_i_got_in_emails/Oprahs-Bees.gif;O=D HTTP/1.0" 404 239 "-" "HD nutch agent/Nutch-1.1 (Think)"
119.139.27.64 - - [25/Oct/2012:08:50:31 -0400] "GET /stuff_i_got_in_emails/Reindeer_Mural.jpg;O=D HTTP/1.0" 404 242 "-" "HD nutch agent/Nutch-1.1 (Think)"
119.139.27.64 - - [25/Oct/2012:09:02:52 -0400] "GET /stuff_i_got_in_emails/snowmen4.gif;O=D HTTP/1.0" 404 236 "-" "HD nutch agent/Nutch-1.1 (Think)"
119.139.27.64 - - [25/Oct/2012:09:04:52 -0400] "GET /stuff_i_got_in_emails/ATT00173.jpg;O=D HTTP/1.0" 404 236 "-" "HD nutch agent/Nutch-1.1 (Think)"
119.139.27.64 - - [25/Oct/2012:09:22:19 -0400] "GET /stuff_i_got_in_emails/?C=S;O=A HTTP/1.0" 200 159957 "-" "HD nutch agent/Nutch-1.1 (Think)"
119.139.27.64 - - [25/Oct/2012:10:55:09 -0400] "GET /stuff_i_got_in_emails/outofthecloset%20(5).jpg;O=D HTTP/1.0" 404 246 "-" "HD nutch agent/Nutch-1.1 (Think)"
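
A rough sketch of how entries like these could be tallied by status code for one crawler. It assumes the Apache "combined" log format with each entry on a single line (email wrapping removed). The function name `summarize` and the `agent_substring` parameter are made up for illustration, and the two sample lines are copied from the entries above.

```python
import re
from collections import Counter

# Pattern for the Apache "combined" log format seen in the entries above.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def summarize(lines, agent_substring="Nutch"):
    """Count responses per HTTP status for requests from a matching agent."""
    counts = Counter()
    for line in lines:
        m = LOG_RE.match(line)
        # Lines that do not parse, or come from other agents, are skipped.
        if m and agent_substring in m.group("agent"):
            counts[m.group("status")] += 1
    return counts

sample = [
    '119.139.27.64 - - [25/Oct/2012:05:20:50 -0400] "GET /stuff_i_got_in_emails/LeafBlower.jpg;O=D HTTP/1.0" 404 238 "-" "HD nutch agent/Nutch-1.1 (Think)"',
    '119.139.27.64 - - [25/Oct/2012:09:22:19 -0400] "GET /stuff_i_got_in_emails/?C=S;O=A HTTP/1.0" 200 159957 "-" "HD nutch agent/Nutch-1.1 (Think)"',
]
print(summarize(sample))
```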

RE: misbehaving crawler

Posted by Markus Jelsma <ma...@openindex.io>.
This looks like the embedded-params problem, which was fixed some versions ago. Even so, Nutch should not crawl anything in that directory regardless of that bug. Perhaps the bot operator removed the robots.txt checking.
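
To make the robots point concrete, here is a minimal sketch of the check a well-behaved crawler would perform before fetching. It uses Python's standard `urllib.robotparser`, not Nutch's implementation. The contents of the site's actual robots.txt are not quoted in this thread, so the `Disallow` rule below is a hypothetical stand-in.

```python
from urllib.robotparser import RobotFileParser

# ASSUMPTION: the real robots.txt is not shown in the thread; this rule is
# a hypothetical stand-in that disallows the directory in question.
robots_lines = [
    "User-agent: *",
    "Disallow: /stuff_i_got_in_emails/",
]

parser = RobotFileParser()
parser.parse(robots_lines)

# Agent name taken from the logged requests.
agent = "HD nutch agent"
url = "http://alex.dinorcia.net/stuff_i_got_in_emails/LeafBlower.jpg;O=D"

# With a rule like the one above, the fetch is refused whether or not
# the URL carries the stray ";O=D" suffix.
allowed = parser.can_fetch(agent, url)
print(allowed)
```

If the operator disabled or bypassed this kind of check, the URL-handling bug and the robots violation would be two separate problems.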

Cheers 
 
-----Original message-----
> From:Lewis John Mcgibbney <le...@gmail.com>
> Sent: Fri 26-Oct-2012 15:42
> To: dev@nutch.apache.org
> Subject: Re: misbehaving crawler
> 
> What exactly is the issue here?
> 
> Lewis
> 

Re: misbehaving crawler

Posted by Lewis John Mcgibbney <le...@gmail.com>.
What exactly is the issue here?

Lewis

-- 
Lewis