Posted to dev@nutch.apache.org by Doug Cutting <cu...@apache.org> on 2007/06/04 22:25:03 UTC

[Fwd: Nutch 0.9 and Crawl-Delay]

Does the 0.9 crawl-delay implementation actually permit multiple threads 
to access a site simultaneously?

Doug

-------- Original Message --------
Subject: Nutch 0.9 and Crawl-Delay
Date: Sun, 3 Jun 2007 10:50:24 +0200
From: Lutz Zetzsche <Lu...@sea-rescue.de>
Reply-To: nutch-agent@lucene.apache.org
To: agent@nutch.org

Dear Nutch developers,

I have had problems with a Nutch-based robot during the last 12 hours,
which I have now solved by banning this particular bot from my server
(not Nutch completely, for the moment). The ilial bot, which created
considerable load on my server, was using the latest Nutch version -
v0.9 - which now also supports the Crawl-Delay directive in
robots.txt.

The bot seems to have obeyed the directive - Crawl-Delay: 10 - as it
visited my website every 15 seconds, which would have been OK, BUT it
then submitted FIVE requests at once (see the example log extract
below)! 5 requests at once every 15 seconds is not acceptable on my
server, which principally serves dynamic content and is often visited
by up to 10 search engines at the same time, altogether surely
creating 99.9% of the server traffic.
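
For reference, the directive is a single line in robots.txt; the value
is the number of seconds a crawler is asked to wait between successive
requests:

    User-agent: *
    Crawl-delay: 10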

So my suggestion is that Nutch should submit only one request at a
time when it detects a Crawl-Delay directive in the robots.txt. This
is the behaviour MSNbot shows, for example: MSNbot also liked to
submit several requests at once every few seconds, until I added the
Crawl-Delay directive to my robots.txt.
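
To make the suggestion concrete, here is a minimal sketch of such a
per-host gate (hypothetical code, not Nutch source; the class and
method names are invented for illustration). It lets only one request
through to a host at a time and waits out the Crawl-Delay between the
end of one fetch and the start of the next:

// Hypothetical sketch, not Nutch code: serialize requests per host and
// wait out the robots.txt Crawl-Delay between consecutive fetches.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class PerHostGate {

    private static final class HostState {
        long nextAllowedTime = 0L; // earliest start time for the next request
    }

    private final Map<String, HostState> states = new ConcurrentHashMap<>();

    /** Blocks until this host may be fetched again, then reserves the slot. */
    public void acquire(String host) throws InterruptedException {
        HostState s = states.computeIfAbsent(host, h -> new HostState());
        synchronized (s) {
            long now = System.currentTimeMillis();
            while (now < s.nextAllowedTime) {
                s.wait(s.nextAllowedTime - now);
                now = System.currentTimeMillis();
            }
            s.nextAllowedTime = Long.MAX_VALUE; // block others until release()
        }
    }

    /** Called after the fetch has finished; starts the Crawl-Delay countdown. */
    public void release(String host, long crawlDelayMs) {
        HostState s = states.get(host);
        synchronized (s) {
            s.nextAllowedTime = System.currentTimeMillis() + crawlDelayMs;
            s.notifyAll();
        }
    }
}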


Best wishes

Lutz Zetzsche
http://www.sea-rescue.de/



72.44.58.191 - - [03/Jun/2007:04:40:53 +0200] "GET /english/Photos+%26+Videos/PV/ HTTP/1.0" 200 13661 "-" "ilial/Nutch-0.9 (Ilial, Inc. is a Los Angeles based Internet startup company. For more information please visit http://www.ilial.com/crawler; http://www.ilial.com/crawler; crawl@ilial.com)"
72.44.58.191 - - [03/Jun/2007:04:40:53 +0200] "GET /english/Links/WRGL/Countries/ HTTP/1.0" 200 15048 "-" "ilial/Nutch-0.9 (Ilial, Inc. is a Los Angeles based Internet startup company. For more information please visit http://www.ilial.com/crawler; http://www.ilial.com/crawler; crawl@ilial.com)"
72.44.58.191 - - [03/Jun/2007:04:40:53 +0200] "GET /islenska/Hlekkir/Brede-ger%C3%B0%20%2F%2033%20fet/ HTTP/1.0" 200 60041 "-" "ilial/Nutch-0.9 (Ilial, Inc. is a Los Angeles based Internet startup company. For more information please visit http://www.ilial.com/crawler; http://www.ilial.com/crawler; crawl@ilial.com)"
66.249.72.244 - - [03/Jun/2007:04:40:55 +0200] "GET /francais/Liens/Philip+Vaux/Brede%20%2F%2033%20pieds/ HTTP/1.1" 200 17568 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.231.189.119 - - [03/Jun/2007:04:40:55 +0200] "GET /english/Links/Martijn%20Koenraad%20Hof/Netherlands%20Antilles/Sint%20Maarten/ HTTP/1.0" 200 17193 "-" "Gigabot/2.0 (http://www.gigablast.com/spider.html)"
74.6.86.105 - - [03/Jun/2007:04:40:56 +0200] "GET /dansk/Links/Hermann+Apelt/ HTTP/1.0" 200 30496 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
72.44.58.191 - - [03/Jun/2007:04:40:53 +0200] "GET /italiano/Links/Giamaica/MRCCs+%26+Stazioni+radio+costiera/ HTTP/1.0" 200 16658 "-" "ilial/Nutch-0.9 (Ilial, Inc. is a Los Angeles based Internet startup company. For more information please visit http://www.ilial.com/crawler; http://www.ilial.com/crawler; crawl@ilial.com)"
72.44.58.191 - - [03/Jun/2007:04:40:53 +0200] "GET /english/Links/Mauritius/Countries/Organisations/ HTTP/1.0" 200 15624 "-" "ilial/Nutch-0.9 (Ilial, Inc. is a Los Angeles based Internet startup company. For more information please visit http://www.ilial.com/crawler; http://www.ilial.com/crawler; crawl@ilial.com)"

Re: [Fwd: Nutch 0.9 and Crawl-Delay]

Posted by Doğacan Güney <do...@gmail.com>.
Hi,

On 6/4/07, Doug Cutting <cu...@apache.org> wrote:
> Does the 0.9 crawl-delay implementation actually permit multiple threads
> to access a site simultaneously?

AFAIK, yes. The fetcher.threads.per.host option should be greater
than 1 _only_ when you are crawling a site under your own control. So
all of Nutch's politeness policies are pretty much ignored when
fetcher.threads.per.host is greater than 1.
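
For polite crawling the per-host limit should therefore stay at 1. A
minimal snippet for conf/nutch-site.xml, assuming the stock property
name from nutch-default.xml:

    <property>
      <name>fetcher.threads.per.host</name>
      <value>1</value>
    </property>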

Fetcher2 completely ignores Nutch's server delay and the site's
crawl-delay value if maxThreads > 1, and instead uses a separate
min.crawl.delay value when accessing the site.
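
Roughly, the delay selection I describe above looks something like
this sketch (illustrative Java only, with invented names; it is not
the actual Fetcher2 source):

// Illustrative only -- not Fetcher2 source. Chooses the per-host delay
// the way the paragraph above describes it.
public final class DelayPolicy {

    static long chooseDelayMs(int maxThreadsPerHost, long serverDelayMs,
                              long robotsCrawlDelayMs, long minCrawlDelayMs) {
        if (maxThreadsPerHost > 1) {
            // Several threads per host: the server delay and the site's
            // Crawl-Delay are ignored; only the configured minimum applies.
            return minCrawlDelayMs;
        }
        // One thread per host: honour Crawl-Delay if given, else the default.
        return robotsCrawlDelayMs > 0 ? robotsCrawlDelayMs : serverDelayMs;
    }
}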

I am not sure about Fetcher, but I think it will allow up to
maxThreads fetcher threads to access the site simultaneously and then
block the next one.

There may be a better explanation in this post to nutch-dev:
"Fetcher2's delay between successive requests".



-- 
Doğacan Güney