Posted to user@nutch.apache.org by Ar...@csiro.au on 2012/09/04 01:44:59 UTC

RE: Nutch crawl commands and efficiency

Hi,

I can't see from your description what exactly is slow, but I'd suggest making sure that Nutch is using the Hadoop native libraries. They make a huge difference for some operations.

Regards,

Arkadi
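[Editor's note] A quick way to check Arkadi's suggestion is to look for Hadoop's fallback warning in the logs. A minimal sketch, assuming the default Nutch 1.x log location (logs/hadoop.log - adjust LOG to your install):

```shell
# Sketch: check whether Hadoop's native libraries were loaded.
# Assumption: the Hadoop log lives at logs/hadoop.log under the Nutch
# install root; change LOG if yours is elsewhere.
LOG=logs/hadoop.log
if grep -q "Unable to load native-hadoop library" "$LOG" 2>/dev/null; then
  echo "native libraries NOT loaded - Hadoop fell back to pure-Java code"
else
  echo "no fallback warning found in $LOG"
fi
```

If the warning appears, make the native libraries visible to the JVM (typically by adding the platform-specific directory from your Hadoop distribution to java.library.path) and restart the crawl.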

> -----Original Message-----
> From: george123 [mailto:daniel.tarasenko@gmail.com]
> Sent: Tuesday, 28 August 2012 9:59 PM
> To: user@nutch.apache.org
> Subject: Nutch crawl commands and efficiency
> 
> Hi
> As per the thread below, I am running Nutch 1.2 in what I think is local mode:
> http://lucene.472066.n3.nabble.com/nutch-stops-when-my-ssh-connection-drops-out-td4001938.html
> 
> I have a large crawl: eventually about 5000 sites that I am using Nutch
> to scrape. Right now I have a list of 20 sites, and the URLs within
> those sites will eventually amount to about 200 000 crawled/scraped
> pages.
> 
> I have to seed these sites with a range of URLs for them to crawl. Some
> are very simple (like domain.com/results.php); others are very
> difficult - some sites need between 1500 and 15000 seed URLs just to
> make sure they are crawled properly.
> 
> So my seeds.txt has about 20 000 seed URLs in it (but only 20 domains -
> one domain alone has 15000 seed URLs).
> 
> I SSH in, navigate to the Nutch install, and run:
> 
> bin/nutch crawl urls -dir crawl -depth 1000 -topN 100 -threads 500
> Now, it's very slow: it has been running for 2 hours and not much is
> happening. It seems to take about 10 minutes to generate a crawl list,
> then about 30 seconds to crawl it.
> 
> The average website will have a list of anywhere between 10 and 200
> results on each page that require crawling further to get the listing.
> There are anywhere between 10 and 5000 pages of results, so there is a
> fair bit of crawling involved.
> 
> I still think it's slow; it's certainly not coming close to using the
> server's resources.
> 
> So I think there is either a Nutch politeness delay kicking in, because
> it's trying to crawl everything in seeds.txt first - including the
> 15000 URLs from one site - so it isn't making many requests in
> parallel.
> 
> Or it's just that generating a list long enough for the 500-thread
> setting takes 10+ minutes. Or trying to go 1000 deep is slowing it
> down, but I don't think so, because my crawl filter is pretty tightly
> controlled.
> 
> Any ideas on how to speed this up? Do I just need to wait for Nutch to
> work through the original 20 000-URL list before it speeds up?
> 
> What are some other things I can look at?
> 
> 
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Nutch-
> crawl-commands-and-efficiency-tp4003690.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
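
[Editor's note] The politeness delay suspected above is real and configurable. In Nutch 1.x, per-host delay and concurrency are controlled by a few properties that can be overridden in conf/nutch-site.xml; the values below are illustrative, not recommendations:

```xml
<!-- Illustrative overrides for conf/nutch-site.xml (Nutch 1.x). -->
<property>
  <name>fetcher.server.delay</name>
  <value>1.0</value>
  <description>Seconds to wait between successive requests to the same
  host (the politeness delay; the default is 5.0).</description>
</property>
<property>
  <name>fetcher.threads.per.host</name>
  <value>2</value>
  <description>Maximum concurrent fetches against a single host. With
  only 20 domains and the default of 1, at most 20 of the 500 fetcher
  threads can ever be busy at once.</description>
</property>
<property>
  <name>generate.max.per.host</name>
  <value>1000</value>
  <description>Cap on URLs per host in each generated fetch list, so
  one site's 15000 seeds cannot dominate a segment (-1, the default,
  means no limit).</description>
</property>
```

This matches the symptoms described: with the defaults, 15000 URLs from a single host are fetched serially at 5 seconds per request, which is roughly 20 hours for that host alone, no matter how many threads are configured.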