Posted to user@nutch.apache.org by webdev1977 <we...@gmail.com> on 2011/10/05 14:26:01 UTC

Nutch 1.3 Fetching where does this happen?

Hello All!  

When using Nutch 1.3 in fully distributed mode, where does the fetching
occur? Does each node get a list of URLs to fetch? Which property in
Hadoop/MapReduce, etc. decides how many URLs a node gets to fetch? I am
worried about memory on my nodes. Some of the files in our enterprise are
very, very large, such as 800 MB PDF files.

I am able to run inject on my cluster, but then the generate step fails and
I always lose one node from the cluster.

--
View this message in context: http://lucene.472066.n3.nabble.com/Nutch-1-3-Fetching-where-does-this-happen-tp3396326p3396326.html
Sent from the Nutch - User mailing list archive at Nabble.com.

Re: Nutch 1.3 Fetching where does this happen?

Posted by Markus Jelsma <ma...@openindex.io>.

On Wednesday 05 October 2011 14:26:01 webdev1977 wrote:
> Hello All!
> 
> When using Nutch 1.3 in fully distributed mode, where does the fetching
> occur? Does each node get a list of URLs to fetch? Which property in
> Hadoop/MapReduce, etc. decides how many URLs a node gets to fetch?

Check the numFetchers parameter of the generator. If you set it equal to the
number of nodes, the entire fetch list is split into that many parts, so each
node can fetch one part.
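
To illustrate the idea of splitting one fetch list into numFetchers parts,
here is a minimal Python sketch. It is not Nutch's code: the function name
split_fetch_list, the CRC32 hash, and the example URLs are all made up for
illustration, and it assumes the list is partitioned by host so that all URLs
of one host end up in the same part.

```python
from collections import defaultdict
from urllib.parse import urlparse
from zlib import crc32


def split_fetch_list(urls, num_fetchers):
    """Split urls into num_fetchers parts, keeping each host in one part.

    Illustrative only: a stable hash of the host name picks the part.
    """
    if num_fetchers < 1:
        raise ValueError("num_fetchers must be at least 1")
    parts = defaultdict(list)
    for url in urls:
        host = urlparse(url).netloc
        index = crc32(host.encode("utf-8")) % num_fetchers
        parts[index].append(url)
    # Return exactly num_fetchers lists; some may be empty.
    return [parts[i] for i in range(num_fetchers)]


if __name__ == "__main__":
    # Hypothetical URLs, three nodes.
    example_urls = [
        "http://intranet.example.com/a.pdf",
        "http://intranet.example.com/b.pdf",
        "http://docs.example.com/c.pdf",
        "http://files.example.com/d.pdf",
    ]
    for i, part in enumerate(split_fetch_list(example_urls, 3)):
        print(i, len(part))
```

With few hosts the parts can be very uneven, so one node may still end up
with most of the large files.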

> I am
> worried about memory on my nodes. Some of the files in our enterprise are
> very, very large, such as 800 MB PDF files.

I would be worried about that too, especially if multiple files are downloaded
at the same time on the same node. Limit the number of fetcher threads and
check the memory settings of your tasks.
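
As a rough back-of-the-envelope check, the worst case can be estimated as
threads times the largest file. The Python sketch below assumes each thread
holds one complete document in memory at a time; the function names, the 10
threads, and the 1000 MB heap are illustrative values, not settings taken
from this thread. Only the 800 MB file size comes from the question.

```python
def worst_case_fetch_memory_mb(threads, max_doc_mb):
    """Memory needed if every thread buffers a largest-size document at once."""
    return threads * max_doc_mb


def max_threads_for_heap(heap_mb, max_doc_mb):
    """How many largest-size documents fit in the heap at the same time.

    Returns 0 when even a single document does not fit.
    """
    return heap_mb // max_doc_mb


if __name__ == "__main__":
    # 10 threads and a 1000 MB heap are assumed example values.
    print(worst_case_fetch_memory_mb(threads=10, max_doc_mb=800))
    print(max_threads_for_heap(heap_mb=1000, max_doc_mb=800))
```

This ignores all other memory use in the task, so the real limit is lower.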

> 
> I am able to run inject on my cluster, but then the generate step fails and
> I always lose one node from the cluster.

More details?

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350