Posted to user@nutch.apache.org by george123 <da...@gmail.com> on 2012/08/28 13:58:39 UTC

Nutch crawl commands and efficiency

Hi,
As per the thread below, I am running Nutch 1.2 in what I think is local mode.
http://lucene.472066.n3.nabble.com/nutch-stops-when-my-ssh-connection-drops-out-td4001938.html

I have a large crawl: eventually about 5000 sites that I am using Nutch to
scrape. Right now I have a list of 20 sites, and the total URLs within
those sites will eventually amount to about 200,000 crawled/scraped pages.

I have to seed these sites with a range of URLs for them to be crawled.
Some are very simple (like domain.com/results.php); others are very
difficult, and some sites need between 1500 and 15,000 seed URLs just to
make sure they are crawled properly.

So my seeds.txt has about 20,000 seed URLs in it (but only 20 domains; one
domain alone has 15,000 seed URLs).

I SSH in, navigate to the Nutch install, and run:

    bin/nutch crawl urls -dir crawl -depth 1000 -topN 100 -threads 500
Now it's very, very slow: it has been running for 2 hours and not much is
happening. It seems to take about 10 minutes to generate a fetch list, then
about 30 seconds to fetch it.
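
This pattern is consistent with the -topN 100 setting above: the generate
step scans the entire CrawlDB each cycle regardless of topN, so every
~10-minute generate pass yields only ~100 URLs to fetch. A minimal sketch
of a different balance, with purely illustrative values (not tuned
recommendations), would be fewer but much larger cycles:

    # fewer, bigger cycles: more URLs fetched per generate pass
    bin/nutch crawl urls -dir crawl -depth 10 -topN 50000 -threads 50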

The average website has anywhere between 10 and 200 results per page, each
of which requires a further fetch to get the listing. There are anywhere
between 10 and 5000 pages of results, so there is a fair bit of crawling
involved.

I still think it's slow; it's certainly not coming anywhere close to the
server's resources.

So I think either there is a Nutch politeness delay, because it's trying to
crawl everything in seeds.txt first, including the 15,000 URLs from one
site (so it is not making many requests to kick things off; see the
back-of-the-envelope numbers after this list).
Or it's just that generating a list long enough for the 500-threads setting
takes 10+ minutes.
Or trying to go 1000 deep is slowing it down, but I don't think so, because
my crawl filter is pretty tightly controlled.
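
On the politeness hypothesis, some back-of-the-envelope numbers, assuming
the Nutch 1.x defaults of fetcher.server.delay = 5.0 seconds and
fetcher.threads.per.host = 1: with only 20 distinct hosts, the fetcher can
issue at most 20 requests every 5 seconds, i.e. about 4 pages/second, no
matter how many of the 500 threads are alive. At that rate a 100-URL fetch
list takes roughly 25-30 seconds, which matches the fetch phase described
above. A hedged nutch-site.xml sketch of the relevant knobs (the property
names are from the Nutch 1.x default config; the values are illustrative,
and lowering the delay trades politeness for speed):

    <configuration>
      <property>
        <name>fetcher.server.delay</name>
        <!-- default 5.0: seconds between requests to the same host -->
        <value>2.0</value>
      </property>
      <property>
        <name>fetcher.threads.per.host</name>
        <!-- default 1: concurrent connections allowed per host -->
        <value>2</value>
      </property>
      <property>
        <name>generate.max.per.host</name>
        <!-- default -1 (unlimited): cap per-host URLs in each fetch list
             so the 15,000-seed domain cannot crowd out the other 19 -->
        <value>100</value>
      </property>
    </configuration>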

Any ideas on how to speed this up? Do I just need to wait for Nutch to
work through the original 20,000-URL seed list, after which it speeds up?

What are some other things I can look at? 





RE: Nutch crawl commands and efficiency

Posted by Ar...@csiro.au.
Hi,

I can't see from your description what exactly is slow, but I'd suggest making sure that Nutch is using the Hadoop native libraries. They make a huge difference for some operations.
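
A quick way to check, assuming the standard Nutch 1.x layout where
local-mode output goes to logs/hadoop.log (Hadoop logs a well-known warning
when it falls back to its pure-Java implementations):

    # look for the native-library warning in the local-mode log
    grep -i "native" logs/hadoop.log
    # a line like
    #   WARN  util.NativeCodeLoader - Unable to load native-hadoop library
    #   for your platform... using builtin-java classes where applicable
    # means the native libraries are NOT being loaded

If the warning shows up, the usual fix is to place the platform's Hadoop
native libraries where the bin/nutch script can add them to
java.library.path (typically lib/native/<platform>, e.g. Linux-amd64-64,
though the exact directory probed is version-dependent; check your
bin/nutch).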

Regards,

Arkadi
