Posted to user@nutch.apache.org by Shawn Gervais <pr...@project10.net> on 2006/04/10 10:13:14 UTC

When Nutch fetches using mapred ...

Greetings list,

When Nutch (trunk) fetches using mapred, it seems to assign one fetch 
task (one slave node) to process previously-errored pages. Is this by 
design, or do I have some kind of malfunction?

When I perform a crawl large enough to observe the fetch process for 
an extended period of time (1M pages over 16 nodes, in this case), I 
notice there is one map task that performs _very_ poorly compared to 
the others:

4905 pages, 33094 errors, 3.5 pages/s, 432 kb/s,
	versus
46639 pages, 13227 errors, 43.9 pages/s, 4547 kb/s,

It is deficient in terms of raw pages/sec, execution time (it is the 
last map task to complete), and the number of errors encountered.

As I said, there always seems to be exactly one map task like this. 
Different fetch executions will have this task assigned to different 
machines -- there doesn't seem to be any pattern.

What the heck is going on here?

-Shawn

Re: When Nutch fetches using mapred ...

Posted by Shawn Gervais <pr...@project10.net>.
Doug Cutting wrote:
> Shawn Gervais wrote:
>> When I perform a crawl large enough to observe the fetch process for
>> an extended period of time (1M pages over 16 nodes, in this case), I
>> notice there is one map task that performs _very_ poorly compared to
>> the others:
>
> My suspicion is that you're trying to fetch a large number of pages from 
> a single site.  Fetch tasks are partitioned by host name.  All urls with 
> a given host are fetched in a single fetcher map task.  Grep the errors 
> from the log on the slow node: I'll bet most are from a single host name.
> 
> To fix this, try setting generate.max.per.host.
> 
> A good value might be something like 
> topN/(mapred.map.tasks*fetcher.threads.fetch).  So if you're setting 
> -topN to 10M and running with 10 fetch tasks and using 100 threads, then 
> each fetch task will fetch around 1M urls, 10,000 per thread.  Fetching 
> a single host is single-threaded, so any host with more than 10,000 urls 
> will slow the overall fetch.

Doug,

Thanks for the tip! You were indeed correct: the errant map task was 
fetching pages from a handful of domains (cnn and geocities).

Setting generate.max.per.host has yielded more consistent performance 
across all my fetcher tasks.

Now to figure out why a lone reduce task always dies on large fetches :/

-Shawn

Re: When Nutch fetches using mapred ...

Posted by Doug Cutting <cu...@apache.org>.
Shawn Gervais wrote:
> When I perform a crawl large enough to observe the fetch process for
> an extended period of time (1M pages over 16 nodes, in this case), I
> notice there is one map task that performs _very_ poorly compared to
> the others:
> 
> 4905 pages, 33094 errors, 3.5 pages/s, 432 kb/s,
>     versus
> 46639 pages, 13227 errors, 43.9 pages/s, 4547 kb/s,
> 
> It is deficient in terms of raw pages/sec, execution time (it is the 
> last map task to complete), and the number of errors encountered.
> 
> As I said, there always seems to be exactly one map task like this.
> Different fetch executions will have this task assigned to different
> machines -- there doesn't seem to be any pattern.
> 
> What the heck is going on here?

My suspicion is that you're trying to fetch a large number of pages from 
a single site.  Fetch tasks are partitioned by host name.  All urls with 
a given host are fetched in a single fetcher map task.  Grep the errors 
from the log on the slow node: I'll bet most are from a single host name.
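
The partitioning works roughly like this -- a simplified Java sketch of 
the idea, not the actual Nutch partitioner class, with a made-up 
16-task count:

  import java.net.MalformedURLException;
  import java.net.URL;

  public class HostPartitionSketch {
      // All urls with the same host hash to the same fetch map task.
      static int partition(String url, int numTasks)
              throws MalformedURLException {
          String host = new URL(url).getHost();
          // Mask the sign bit so the modulo result is non-negative.
          return (host.hashCode() & Integer.MAX_VALUE) % numTasks;
      }

      public static void main(String[] args) throws MalformedURLException {
          // Every cnn.com url lands in the same task, so one huge or
          // flaky host slows down exactly one map task.
          System.out.println(partition("http://www.cnn.com/a.html", 16));
          System.out.println(partition("http://www.cnn.com/b.html", 16));
      }
  }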

To fix this, try setting generate.max.per.host.

A good value might be something like 
topN/(mapred.map.tasks*fetcher.threads.fetch).  So if you're setting 
-topN to 10M and running with 10 fetch tasks and using 100 threads, then 
each fetch task will fetch around 1M urls, 10,000 per thread.  Fetching 
a single host is single-threaded, so any host with more than 10,000 urls 
will slow the overall fetch.
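
In code, the arithmetic looks like this (a sketch using the assumed 
numbers from the example above; the result is the value you'd put in 
generate.max.per.host in conf/nutch-site.xml):

  public class MaxPerHostSketch {
      public static void main(String[] args) {
          long topN = 10000000L;      // -topN passed to generate
          long mapTasks = 10;         // mapred.map.tasks
          long threadsPerTask = 100;  // fetcher.threads.fetch
          // topN / (mapred.map.tasks * fetcher.threads.fetch)
          long maxPerHost = topN / (mapTasks * threadsPerTask);
          System.out.println(maxPerHost);  // 10000 urls per host
      }
  }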

Here's another way to think about it: if you're fetching one page per 
second per host (fetcher.server.delay) and your fetch tasks are 
averaging around an hour (3600 seconds), then any host with more than 
3600 pages will cause its fetch task to run slower than the others 
and/or to have a high error rate.
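
The same ceiling, worked out with the assumed numbers from the 
paragraph above:

  public class PerHostCeilingSketch {
      public static void main(String[] args) {
          int serverDelaySecs = 1;      // fetcher.server.delay: 1 page/s per host
          int taskDurationSecs = 3600;  // assumed average fetch task runtime
          // A single host is fetched serially, so this bounds its page count.
          System.out.println(taskDurationSecs / serverDelaySecs);  // 3600
      }
  }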

Doug