Posted to user@nutch.apache.org by Kai_testing Middleton <ka...@yahoo.com> on 2007/08/07 17:58:45 UTC

nutch stuck crawling mostly one site

I started a crawl on July 30 like this:

nohup time nutch crawl /usr/tmp/urls.txt -dir /usr/tmp/85sites -threads 20 -depth 10 -topN 103103

I included 85 different sites in the seed URLs.  However, my crawl has slowed
down to about one site every  seconds--the value I have for
fetcher.server.delay.  The console output will look something like this:

fetching http://www.topix.net/soccer-fifa/world-cup/2007/04/news-fixtures-results-match-reports-stats
fetching http://www.topix.net/city/tarpon-springs-fl/2007/04/city/tarpon-springs-fl
fetching http://sportsillustrated.cnn.com/si_online/scorecard/news/2002/07/02/sc/
fetching http://www.topix.net/classifieds/rockford-mi/WVZ850H4sIAAAAAAAAA2NhYzE0MjZhYWHhYOZmYGBgZQISpYmszEAqP7OAlQVIl6QWl4AFSjNTwPIlxWBuDpDLCqQLkvNTUrkEkjNLKvWL8pOz0/KLUnRzM0Fa88AyzqbBzkZOQcGBwQaebq5mFiCZgrLMFC5Z/eScxOLizLTM1JRiZL36BaklxWAri3K4BJEVARWU5rIBZRKTSzLz8wCKjZwMwAAAAA__
fetching http://www.topix.net/forum/football-players/steve-mcnair/TOAGFF3LREQQ0C40H/p4
fetching http://www.bbc.co.uk/wales/southeast/webguide/pages/books.shtml
fetching http://www.topix.net/football-players/dede-dorsey/2007/07/ready-to-roll
fetching http://www.bbc.co.uk/wales/raiseyourgame/preparation/
fetching http://www.topix.net/forum/news/terri-schiavo/TGNF3ITGCIOGCQG74/post14
fetching http://www.topix.net/forum/world/canada/TLDSECEPRGQCAOSJI/post11

Often when it hits topix.net it's slow to respond.  I purposely started my seed
with a large number of sites so that it wouldn't get stuck at this kind of
low speed.  I anticipated that nutch would initiate many downloads during its
three-second window per site.  However, the horizon seems to be fixated on
references to topix.net, so it's really just slower than ooze.

Should I have set the topN lower?  Depth higher?  Why is it behaving this way?
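
To put a rough number on the slowdown described above, here is a minimal sketch, not Nutch code. It assumes the fetcher contacts any single host at most once per politeness delay while different hosts proceed in parallel, and it ignores download time. The function name, the page URLs, and the 900/50/50 split are made up for illustration; the 3-second delay is taken from the "three second window" mentioned above.

```python
from collections import Counter
from urllib.parse import urlparse


def busiest_host_floor(urls, delay_seconds):
    """Return (host, seconds): the host with the most URLs and a lower
    bound on how long that host alone keeps the fetch running."""
    per_host = Counter(urlparse(u).netloc for u in urls)
    host, count = per_host.most_common(1)[0]
    # One request per delay to the same host: the first request is
    # immediate, every later one waits a full delay.
    return host, (count - 1) * delay_seconds


# Hypothetical fetch list dominated by a single host.
urls = (
    ["http://www.topix.net/forum/page%d" % i for i in range(900)]
    + ["http://www.bbc.co.uk/wales/page%d" % i for i in range(50)]
    + ["http://sportsillustrated.cnn.com/page%d" % i for i in range(50)]
)

host, seconds = busiest_host_floor(urls, delay_seconds=3)
print(host, seconds)
```

Under these assumptions the other two hosts finish in a few minutes, and after that the crawl proceeds at one topix.net page per delay no matter how many threads are configured.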



Re: nutch stuck crawling mostly one site

Posted by Renaud Richardet <re...@apache.org>.
hi Kai,

nothing to do with Nutch, but do you really need to crawl all the topix 
forums? Because ignoring them would certainly speed up the crawl...
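
A minimal sketch of what ignoring the forums could look like, written as standalone Python rather than as Nutch configuration. The "+"/"-" rule list with first-match-wins semantics is an assumption modelled on regex-style URL filters, and the rule list and function name are hypothetical. The sample URLs come from the console output in the original message.

```python
import re

# Hypothetical rules: "-" rejects, "+" accepts, the first matching rule decides.
RULES = [
    ("-", re.compile(r"^http://www\.topix\.net/forum/")),  # skip topix forums
    ("+", re.compile(r".")),                               # accept everything else
]


def accept(url, rules=RULES):
    """Return True if the first rule matching the URL is a '+' rule."""
    for sign, pattern in rules:
        if pattern.search(url):
            return sign == "+"
    return False  # no rule matched: reject


urls = [
    "http://www.topix.net/forum/news/terri-schiavo/TGNF3ITGCIOGCQG74/post14",
    "http://www.topix.net/football-players/dede-dorsey/2007/07/ready-to-roll",
    "http://www.bbc.co.uk/wales/raiseyourgame/preparation/",
    "http://www.topix.net/forum/football-players/steve-mcnair/TOAGFF3LREQQ0C40H/p4",
]

kept = [u for u in urls if accept(u)]
for u in kept:
    print(u)
```

Only the forum pages are dropped here; other topix.net pages still pass, so the host can still dominate the fetch list if it has many non-forum links.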

also, you might want to check the excellent post from Sami Siren on the 
fetching sorting: http://blog.foofactory.fi/2007/01/sorted-out.html

HTH,
Renaud

