Posted to user@nutch.apache.org by Sjaiful Bahri <sb...@rocketmail.com> on 2010/02/03 11:21:14 UTC

A well-behaved crawler

"A well-behaved crawler needs to follow a set of loosely-defined behaviors to be 'polite' - don't crawl a site too fast, don't crawl any single IP address too fast, don't pull too much bandwidth from small sites by e.g. downloading tons of full-res media that will never be indexed, meticulously obey robots.txt, identify itself with a user-agent string that points to a detailed web page explaining the purpose of the bot, etc."

But my crawler is still banned by several sites... :(
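
As a rough illustration of two of the rules quoted above (obeying robots.txt and
sending a descriptive user-agent), here is a minimal Python sketch using the
standard library's urllib.robotparser. The bot name, the bot-info URL, and the
robots.txt content are all made up for the example; a real crawler would fetch
robots.txt from each site instead of inlining it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical bot identity; the URL should point at a page describing the bot.
USER_AGENT = "ExampleBot/0.1 (+http://example.com/bot.html)"

# Example robots.txt content, inlined so the sketch runs without network access.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /private/
Crawl-delay: 10

User-agent: *
Disallow: /
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set that can be queried per URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# Check each URL before fetching it, and honor the requested delay if present.
allowed = rules.can_fetch(USER_AGENT, "http://example.com/index.html")
blocked = rules.can_fetch(USER_AGENT, "http://example.com/private/data.html")
delay = rules.crawl_delay(USER_AGENT)  # None when no Crawl-delay is given

print(allowed, blocked, delay)
```

A site that wants to exclude one crawler can name it in a User-agent group with
"Disallow: /", so it is worth re-reading robots.txt regularly rather than
caching it forever.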

cheers
iful
 

http://zipclue.com

RE: A well-behaved crawler

Posted by Fuad Efendi <fu...@efendi.ca>.

In my past experience, I was explicitly banned by about 60 sites (out of
10,000 in my "vertical" list!) via explicit instructions in their robots.txt
files.

After detailed analysis I found that about 50 of those sites were hosted on
the same IP address; I had used fetch-per-TLD instead of fetch-per-IP, so
those sites were crawled as if they were on separate servers.
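
To make the per-IP point concrete, here is a small Python sketch that buckets
URLs by resolved IP address and spaces out the fetches within each bucket. It
is only an illustration, not how any particular crawler implements its queues:
the host names, the lookup table standing in for DNS, and the 5-second delay
are all assumptions.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List
from urllib.parse import urlsplit


def group_urls_by_ip(
    urls: Iterable[str], resolve: Callable[[str], str]
) -> Dict[str, List[str]]:
    """Bucket URLs by the IP address their host resolves to."""
    queues: Dict[str, List[str]] = defaultdict(list)
    for url in urls:
        host = urlsplit(url).hostname
        queues[resolve(host)].append(url)
    return dict(queues)


def schedule(queues: Dict[str, List[str]], delay_seconds: float) -> Dict[str, float]:
    """Give each URL a start offset so one IP is hit at most once per delay."""
    plan: Dict[str, float] = {}
    for ip_urls in queues.values():
        for i, url in enumerate(ip_urls):
            plan[url] = i * delay_seconds
    return plan


# Hypothetical lookup table standing in for a real resolver such as
# socket.gethostbyname; two of the three sites share one server.
FAKE_DNS = {
    "shop-a.example": "192.0.2.10",
    "shop-b.example": "192.0.2.10",
    "blog.example": "192.0.2.77",
}

urls = ["http://shop-a.example/", "http://shop-b.example/", "http://blog.example/"]
queues = group_urls_by_ip(urls, FAKE_DNS.__getitem__)
plan = schedule(queues, delay_seconds=5.0)

for url, offset in plan.items():
    print(f"{offset:5.1f}s  {url}")
```

Grouping by host name alone would put shop-a.example and shop-b.example in
separate queues, and the shared server would see both requests at once.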

The remaining sites were simply not willing to appear in my search results
list - it's their right!


-Fuad
Tokenizer




Re: A well-behaved crawler

Posted by Ken Krugler <kk...@transpac.com>.

When you say "banned by several sites", do you mean that you get back  
non-200 responses for pages that you know exist? Or something else?

Also, there's another constraint that many sites impose, which is the  
total number of page fetches/day. Unfortunately you don't know if  
you've hit this until you run into problems. A good rule of thumb is  
no more than 5K requests/day for a major site.
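
One way to enforce a daily cap like this is a per-host counter that is keyed by
date, sketched below in Python. The class and method names are hypothetical,
and the 5000 default simply mirrors the rule of thumb above; the right number
for any given site is not something the crawler can know in advance.

```python
from collections import Counter
from datetime import date


class DailyBudget:
    """Track how many requests each host has received on each calendar day."""

    def __init__(self, max_per_day: int = 5000):
        self.max_per_day = max_per_day
        self._counts: Counter = Counter()

    def try_acquire(self, host: str, today: date) -> bool:
        """Return True and count the request if the host is still under its cap."""
        key = (host, today)
        if self._counts[key] >= self.max_per_day:
            return False
        self._counts[key] += 1
        return True


# A tiny cap keeps the demonstration short.
budget = DailyBudget(max_per_day=3)
day = date(2010, 2, 3)
results = [budget.try_acquire("example.com", day) for _ in range(4)]
next_day_ok = budget.try_acquire("example.com", date(2010, 2, 4))

print(results, next_day_ok)
```

The fetcher would call try_acquire before each request and defer the URL to the
next day when it returns False.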

-- Ken

PS - You're not running in EC2 by any chance, are you?

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g