Posted to user@nutch.apache.org by Brian Tingle <Br...@ucop.edu> on 2009/07/23 04:21:38 UTC

nutch -threads in hadoop

Hey,


I'm playing around with Nutch on Hadoop; when I run

hadoop jar nutch-1.0.job org.apache.nutch.crawl.Crawl -threads ...

is that threads per node or total threads for all nodes?
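
In case it helps, the full command I'm asking about looks something like
this (the seed dir and the depth/topN numbers are just placeholders):

  hadoop jar nutch-1.0.job org.apache.nutch.crawl.Crawl urls \
      -dir crawl -depth 3 -topN 1000 -threads 10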


Thanks -- Brian


Re: nutch -threads in hadoop

Posted by Andrzej Bialecki <ab...@getopt.org>.
Brian Tingle wrote:
> Thanks, I eventually found the job trackers in the :50030 web page of
> the Cloudera setup, and saw it reporting "10 threads" for each crawler
> in the little status box showing how far along each crawl was.  I have
> to say, this whole thing (Nutch/Hadoop) is pretty flipping awesome.
> Great work.
> 
> I'm running on AWS EC2 us-east and spidering sites that should be
> hosted on the CENIC network in California. Do you have any suggestions
> for a good number of threads per crawler in that situation (I'm
> guessing it might be hard to saturate the bandwidth)?  I'm thinking
> I'll bump it up to at least 25.

You need to be careful when running large crawls on someone else's 
infrastructure. While the raw bandwidth may be enough, the DNS 
infrastructure may not be - both on the side of the target domains and 
on the side of your local resolver. I strongly recommend setting up a 
local caching DNS.
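
For example, a minimal sketch on a Debian-style node, using dnsmasq as
the cache (the upstream resolver address is just a placeholder for
whatever your network actually hands out):

  # install a small caching DNS forwarder
  sudo apt-get install -y dnsmasq
  # send cache misses to the upstream resolver (placeholder address)
  echo "server=172.16.0.23" | sudo tee /etc/dnsmasq.d/upstream.conf
  sudo /etc/init.d/dnsmasq restart
  # make the node ask the local cache first
  echo "nameserver 127.0.0.1" | sudo tee /etc/resolv.conf

Every fetcher thread does DNS lookups for the URLs it fetches, so a warm
local cache takes a lot of pressure off the resolver.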


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


RE: nutch -threads in hadoop

Posted by Brian Tingle <Br...@ucop.edu>.
Thanks, I eventually found the job trackers in the :50030 web page of
the Cloudera setup, and saw it reporting "10 threads" for each crawler
in the little status box showing how far along each crawl was.  I have
to say, this whole thing (Nutch/Hadoop) is pretty flipping awesome.
Great work.

I'm running on AWS EC2 us-east and spidering sites that should be hosted
on the CENIC network in California. Do you have any suggestions for a
good number of threads per crawler in that situation (I'm guessing it
might be hard to saturate the bandwidth)?  I'm thinking I'll bump it up
to at least 25.
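
Back-of-the-envelope, with every number made up:

  # 25 threads x 2 map tasks x ~1 page/sec/thread x ~20 KB/page
  echo $(( 25 * 2 * 1 * 20 ))   # ~1000 KB/s per node, nowhere near the pipe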

Thanks again,

-- Brian

|-----Original Message-----
|From: Andrzej Bialecki [mailto:ab@getopt.org]
|Sent: Thursday, July 23, 2009 1:01 AM
|To: nutch-user@lucene.apache.org
|Subject: Re: nutch -threads in hadoop
|
|Brian Tingle wrote:
|> Hey,
|>
|>
|>
|> I'm playing around with Nutch on Hadoop; when I run
|>
|> hadoop jar nutch-1.0.job org.apache.nutch.crawl.Crawl -threads ...
|>
|> is that threads per node or total threads for all nodes?
|
|Threads per map task - if you run multiple map tasks per node then you
|will get numThreads * numMapTasks per node.
|
|So be careful to set it to a number that doesn't overwhelm your network ;)
|
|--
|Best regards,
|Andrzej Bialecki     <><
|  ___. ___ ___ ___ _ _   __________________________________
|[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
|___|||__||  \|  ||  |  Embedded Unix, System Integration
|http://www.sigram.com  Contact: info at sigram dot com


Re: nutch -threads in hadoop

Posted by Andrzej Bialecki <ab...@getopt.org>.
Brian Tingle wrote:
> Hey,
> 
>  
> 
> I'm playing around with Nutch on Hadoop; when I run
> 
> hadoop jar nutch-1.0.job org.apache.nutch.crawl.Crawl -threads ...
> 
> is that threads per node or total threads for all nodes?

Threads per map task - if you run multiple map tasks per node then you 
will get numThreads * numMapTasks per node.

So be careful to set it to a number that doesn't overwhelm your network ;)
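
As a quick sketch (the slot count here is whatever
mapred.tasktracker.map.tasks.maximum is set to on your tasktrackers;
both numbers below are examples):

  THREADS=10    # the value passed via -threads
  MAP_SLOTS=2   # map tasks allowed to run at once on one node
  echo $(( THREADS * MAP_SLOTS ))   # => 20 concurrent fetch threads per node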

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com