Posted to agent@nutch.apache.org by Lutz Zetzsche <Lu...@sea-rescue.de> on 2007/06/03 10:50:24 UTC

Nutch 0.9 and Crawl-Delay

Dear Nutch developers,

I have had problems with a Nutch-based robot during the last 12 hours,
which I have now solved by banning this particular bot from my server
(not Nutch completely, for the moment). The ilial bot, which created
considerable load on my server, was using the latest Nutch version,
v0.9, which now also supports the crawl-delay directive in robots.txt.

The bot seems to have obeyed the directive (crawl-delay: 10), as it
visited my website every 15 seconds, which would have been OK, BUT it
then submitted FIVE requests at once (see example log extract below)!
Five requests at once every 15 seconds is not acceptable on my server,
which mainly serves dynamic content and is often visited by up to 10
search engines at the same time, altogether surely creating 99.9% of
the server traffic.
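
For anyone who wants to check their own logs for the same pattern, here is a minimal sketch in Python that counts requests per client and second. It assumes each log entry sits on a single line in the common/combined log format; the function name count_bursts is made up for illustration, and the sample lines are abbreviated from the extract in this message (referrer and user agent dropped).

```python
import re
from collections import Counter

# Captures the client address and the bracketed timestamp at the start
# of a common/combined-format access log line.
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\]')

def count_bursts(lines):
    """Count requests per (client address, timestamp) pair."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts

# Entries abbreviated from the log extract in this message.
sample = [
    '72.44.58.191 - - [03/Jun/2007:04:40:53 +0200] "GET /english/Photos+%26+Videos/PV/ HTTP/1.0" 200 13661',
    '72.44.58.191 - - [03/Jun/2007:04:40:53 +0200] "GET /english/Links/WRGL/Countries/ HTTP/1.0" 200 15048',
    '72.44.58.191 - - [03/Jun/2007:04:40:53 +0200] "GET /islenska/Hlekkir/Brede-ger%C3%B0%20%2F%2033%20fet/ HTTP/1.0" 200 60041',
    '66.249.72.244 - - [03/Jun/2007:04:40:55 +0200] "GET /francais/Liens/Philip+Vaux/Brede%20%2F%2033%20pieds/ HTTP/1.1" 200 17568',
    '72.44.58.191 - - [03/Jun/2007:04:40:53 +0200] "GET /italiano/Links/Giamaica/MRCCs+%26+Stazioni+radio+costiera/ HTTP/1.0" 200 16658',
    '72.44.58.191 - - [03/Jun/2007:04:40:53 +0200] "GET /english/Links/Mauritius/Countries/Organisations/ HTTP/1.0" 200 15624',
]

bursts = count_bursts(sample)
for (client, second), hits in bursts.most_common():
    print(client, second, hits)
```

Any client that shows more than one hit within the same second is sending requests in parallel rather than one at a time.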

So my suggestion is that Nutch submit only one request at a time when
it detects a crawl-delay directive in robots.txt. This is the
behaviour that MSNbot shows, for example. MSNbot also liked to submit
several requests at once every few seconds, until I added the
crawl-delay directive to my robots.txt.
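
To illustrate the directive itself, here is a small sketch of how a crawler could read it, using Python's standard urllib.robotparser module (crawl_delay is available from Python 3.6). The two-line robots.txt below is an assumed example, not a copy of the real file; only the value 10 comes from this message.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content; the real file is not shown in this thread.
robots_txt = """\
User-agent: *
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# The "*" entry applies to any agent without a more specific entry.
delay = parser.crawl_delay("ilial")
print(delay)
```

A crawler that honours the value would wait at least that many seconds between two consecutive requests to the same host.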


Best wishes

Lutz Zetzsche
http://www.sea-rescue.de/



72.44.58.191 - - [03/Jun/2007:04:40:53 +0200] "GET /english/Photos+%26+Videos/PV/ HTTP/1.0" 200 13661 "-" "ilial/Nutch-0.9 (Ilial, Inc. is a Los Angeles based Internet startup company. For more information please visit http://www.ilial.com/crawler; http://www.ilial.com/crawler; crawl@ilial.com)"
72.44.58.191 - - [03/Jun/2007:04:40:53 +0200] "GET /english/Links/WRGL/Countries/ HTTP/1.0" 200 15048 "-" "ilial/Nutch-0.9 (Ilial, Inc. is a Los Angeles based Internet startup company. For more information please visit http://www.ilial.com/crawler; http://www.ilial.com/crawler; crawl@ilial.com)"
72.44.58.191 - - [03/Jun/2007:04:40:53 +0200] "GET /islenska/Hlekkir/Brede-ger%C3%B0%20%2F%2033%20fet/ HTTP/1.0" 200 60041 "-" "ilial/Nutch-0.9 (Ilial, Inc. is a Los Angeles based Internet startup company. For more information please visit http://www.ilial.com/crawler; http://www.ilial.com/crawler; crawl@ilial.com)"
66.249.72.244 - - [03/Jun/2007:04:40:55 +0200] "GET /francais/Liens/Philip+Vaux/Brede%20%2F%2033%20pieds/ HTTP/1.1" 200 17568 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.231.189.119 - - [03/Jun/2007:04:40:55 +0200] "GET /english/Links/Martijn%20Koenraad%20Hof/Netherlands%20Antilles/Sint%20Maarten/ HTTP/1.0" 200 17193 "-" "Gigabot/2.0 (http://www.gigablast.com/spider.html)"
74.6.86.105 - - [03/Jun/2007:04:40:56 +0200] "GET /dansk/Links/Hermann+Apelt/ HTTP/1.0" 200 30496 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
72.44.58.191 - - [03/Jun/2007:04:40:53 +0200] "GET /italiano/Links/Giamaica/MRCCs+%26+Stazioni+radio+costiera/ HTTP/1.0" 200 16658 "-" "ilial/Nutch-0.9 (Ilial, Inc. is a Los Angeles based Internet startup company. For more information please visit http://www.ilial.com/crawler; http://www.ilial.com/crawler; crawl@ilial.com)"
72.44.58.191 - - [03/Jun/2007:04:40:53 +0200] "GET /english/Links/Mauritius/Countries/Organisations/ HTTP/1.0" 200 15624 "-" "ilial/Nutch-0.9 (Ilial, Inc. is a Los Angeles based Internet startup company. For more information please visit http://www.ilial.com/crawler; http://www.ilial.com/crawler; crawl@ilial.com)"

Re: Nutch 0.9 and Crawl-Delay

Posted by Ken Krugler <kk...@transpac.com>.
Hi Lutz,

>I have had problems with a Nutch-based robot during the last 12 hours,
>which I have now solved by banning this particular bot from my server
>(not Nutch completely, for the moment). The ilial bot, which created
>considerable load on my server, was using the latest Nutch version,
>v0.9, which now also supports the crawl-delay directive in robots.txt.
>
>The bot seems to have obeyed the directive (crawl-delay: 10), as it
>visited my website every 15 seconds, which would have been OK, BUT it
>then submitted FIVE requests at once (see example log extract below)!
>Five requests at once every 15 seconds is not acceptable on my server,
>which mainly serves dynamic content and is often visited by up to 10
>search engines at the same time, altogether surely creating 99.9% of
>the server traffic.
>
>So my suggestion is that Nutch submit only one request at a time when
>it detects a crawl-delay directive in robots.txt. This is the
>behaviour that MSNbot shows, for example. MSNbot also liked to submit
>several requests at once every few seconds, until I added the
>crawl-delay directive to my robots.txt.

I believe Nutch should only be submitting one request at a time to a
given host, unless the person who configured the crawler specified a
fetcher.threads.per.host value greater than 1.

If this is the case, then they are crawling impolitely. I've cc'd 
them on this email, since their user agent string (politely) includes 
this contact information.

If they are using only one thread per host, then this would be a bug 
in the Nutch code.
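
To make the difference concrete, here is a toy model in Python of how the per-host thread count could turn a delay into bursts. It is only a sketch under stated assumptions, not Nutch's actual scheduling code: it assumes every thread starts at the same moment and waits the full delay between its own requests. The names request_times and largest_burst are made up for illustration, and the 15-second delay is the interval Lutz observed.

```python
from collections import Counter

def request_times(num_urls, threads_per_host, delay_seconds):
    """Toy model: threads start together at t=0 and each one waits
    delay_seconds between its own requests to the host."""
    return [(i // threads_per_host) * delay_seconds for i in range(num_urls)]

def largest_burst(times):
    """Largest number of requests that share the same start time."""
    return max(Counter(times).values())

# One thread per host: requests are spaced out one by one.
polite = request_times(10, threads_per_host=1, delay_seconds=15)
# Five threads per host: five requests arrive together every 15 seconds.
bursty = request_times(10, threads_per_host=5, delay_seconds=15)

print(polite, largest_burst(polite))
print(bursty, largest_burst(bursty))
```

Under this model a setting of 5 would produce the pattern in the log extract: the interval between bursts respects the delay, but each burst still hits the server five times at once.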

-- Ken





-- 
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"