Posted to user@nutch.apache.org by webdev1977 <we...@gmail.com> on 2011/10/05 14:26:01 UTC
Nutch 1.3 Fetching where does this happen?
Hello All!
When using Nutch 1.3 in fully distributed mode, where does the fetching
occur? Does each node get a list of URLs to fetch? What property in
Hadoop/MapReduce, etc. decides how many URLs a node gets to fetch? I am
worried about memory on my nodes. Some of the files in our enterprise are
very, very large, such as 800 MB PDF files.
I am able to run inject on my cluster, but then the generate step fails and
I always lose one node from the cluster.
Re: Nutch 1.3 Fetching where does this happen?
Posted by Markus Jelsma <ma...@openindex.io>.
On Wednesday 05 October 2011 14:26:01 webdev1977 wrote:
> Hello All!
>
> When using Nutch 1.3 in fully distributed mode, where does the fetching
> occur? Does each node get a list of URLs to fetch? What property in
> Hadoop/MapReduce, etc. decides how many URLs a node gets to fetch?
Check the numFetchers parameter of the generator. If you set it equal to the
number of nodes, the entire fetch list is split into that many parts.
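To illustrate the idea of splitting a fetch list into numFetchers parts, here is a minimal sketch. It is not Nutch's actual partitioning code: the function name split_fetch_list and the choice of a CRC32 hash of the host name are assumptions made only for this example. It simply shows how URLs could be grouped so that all URLs of one host land in the same part.

```python
import zlib
from urllib.parse import urlsplit


def split_fetch_list(urls, num_fetchers):
    """Split urls into num_fetchers parts, keeping each host in one part.

    Hypothetical helper for illustration only; not taken from Nutch.
    """
    parts = [[] for _ in range(num_fetchers)]
    for url in urls:
        # Use the host name as the partition key (assumption for this sketch).
        host = urlsplit(url).hostname or ""
        # CRC32 gives a stable, non-negative hash across runs.
        index = zlib.crc32(host.encode("utf-8")) % num_fetchers
        parts[index].append(url)
    return parts


if __name__ == "__main__":
    example_urls = [
        "http://a.example.com/1.pdf",
        "http://a.example.com/2.pdf",
        "http://b.example.com/index.html",
        "http://c.example.com/report.pdf",
    ]
    for i, part in enumerate(split_fetch_list(example_urls, 3)):
        print(i, part)
```

With a hash-based split like this the parts are not guaranteed to be the same size, so one node may still receive more URLs than another.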
> I am
> worried about memory on my nodes. Some of the files in our enterprise are
> very, very large, such as 800 MB PDF files.
I would be worried about that too, especially if multiple files are downloaded
at the same time on the same node. Limit the number of fetcher threads and
check the memory settings.
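A rough way to reason about this is a worst-case estimate. The sketch below assumes that every fetcher thread holds one complete file in memory at the same time, which is a simplification for illustration and not a statement about how Nutch buffers content. The function names and the example figures (10 threads, a 1000 MB heap) are hypothetical.

```python
def worst_case_fetch_memory_mb(threads, max_file_mb):
    """Upper bound in MB if every thread holds one full file at once."""
    return threads * max_file_mb


def fits_in_heap(threads, max_file_mb, heap_mb):
    """True if the worst-case estimate fits in the given heap size."""
    return worst_case_fetch_memory_mb(threads, max_file_mb) <= heap_mb


if __name__ == "__main__":
    # Example figures only: 10 threads, 800 MB files, 1000 MB heap.
    print(worst_case_fetch_memory_mb(10, 800))
    print(fits_in_heap(10, 800, 1000))
    print(fits_in_heap(1, 800, 1000))
```

Under these assumptions, fewer threads per node or a lower limit on the size of fetched content are the two levers that reduce the estimate.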
>
> I am able to run inject on my cluster, but then the generate step fails and
> I always lose one node from the cluster.
More details?
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350