Posted to user@nutch.apache.org by Michael Coffey <mc...@yahoo.com.INVALID> on 2017/05/12 19:38:54 UTC

tuning for speed

I am looking for a methodology for making the crawler cycle go faster. I had expected the run-time to be dominated by fetcher performance but, instead, the greater bulk of the time is taken by linkdb-merge + indexer + crawldb-update + generate-select.
Can anyone provide an outline of such a methodology, or a link to one already published?
Also, more tactically speaking, I read in "Hadoop, the Definitive Guide" that the numbers of mappers and reducers are the first things to check. I know how to set the number of reducers, but it's not obvious how to control the number of mappers. In my situation (1.7e8+ urls in crawldb), I get orders of magnitude more mappers than there are disks (or cpu cores) in my cluster. Are there things I should do to bring it down to something less than 10x the number of disks or 4x the number of cores, or something like that?


Re: tuning for speed

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Michael,

operations on a large CrawlDb of 200 million URLs become slow; that's a well-known limitation of
Nutch 1.x :(  The CrawlDb is a large Hadoop map file and has to be rewritten in full for every
update (even a small one).

If your workflow allows it, you could process multiple segments in one cycle (a rough sketch follows below):
- generate N segments in one turn
- fetch them sequentially
- (you may start fetching the next one while the previous one is in its reduce phase)
- do the update, linkdb, and indexing steps in one turn for all N segments
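
For illustration, such a multi-segment cycle could look roughly like the small driver below, which
just shells out to the usual bin/nutch commands. The paths, topN, and segment count are
placeholders, and you should check the exact options (e.g. generate's -maxNumSegments flag, and
whether you parse inside the fetcher) against your Nutch 1.x version:

    #!/usr/bin/env python3
    # Hypothetical driver for one multi-segment Nutch 1.x cycle (sketch only).
    import glob
    import subprocess

    NUTCH = "bin/nutch"          # adjust to your installation
    CRAWLDB = "crawl/crawldb"    # placeholder paths
    SEGMENTS = "crawl/segments"
    LINKDB = "crawl/linkdb"
    N = 4                        # segments per cycle
    TOPN = "50000"               # URLs per segment

    def nutch(*args):
        subprocess.check_call([NUTCH] + list(args))

    # 1. generate N segments in one turn
    nutch("generate", CRAWLDB, SEGMENTS, "-topN", TOPN, "-maxNumSegments", str(N))

    # 2. fetch (and parse) the newly generated segments sequentially
    segs = sorted(glob.glob(SEGMENTS + "/*"))[-N:]
    for seg in segs:
        nutch("fetch", seg)
        nutch("parse", seg)      # only needed if fetcher.parse is false

    # 3. one updatedb / invertlinks / index run over all N segments
    nutch("updatedb", CRAWLDB, *segs)
    nutch("invertlinks", LINKDB, *segs)
    nutch("index", CRAWLDB, "-linkdb", LINKDB, *segs)

The same commands work from a shell script, of course; the point is only to run
updatedb/invertlinks/index once per N segments instead of once per segment.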

Regarding mappers and reducers: use a multiple of what you can run in parallel on your cluster,
so that the cluster is not left underutilized while a job waits for its last few tasks to finish.
The number of mappers is determined first by the number of input partitions (i.e. the number of
reducers that wrote the CrawlDb or LinkDb). If the partitions are small or splittable, there are a
couple of Hadoop configuration properties to tune the amount of data processed by a single map task
(see the sketch below).
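
For splittable input such as the CrawlDb's SequenceFile-based part files, the relevant knobs are the
split-size properties. The names below are the Hadoop 2.x ones, so double-check them against your
cluster version, and the sizing logic is only a rough sketch of what FileInputFormat does:

    # Hadoop 2.x properties that control split size (and thus the mapper count)
    # for splittable input; mapreduce.job.reduces sets the number of reducers:
    #   mapreduce.input.fileinputformat.split.minsize
    #   mapreduce.input.fileinputformat.split.maxsize
    #   dfs.blocksize
    #   mapreduce.job.reduces

    def compute_split_size(block_size, min_size, max_size):
        # roughly FileInputFormat.computeSplitSize(): max(minSize, min(maxSize, blockSize))
        return max(min_size, min(max_size, block_size))

    def estimate_map_tasks(total_input_bytes, block_size=128 << 20,
                           min_size=1, max_size=2**63 - 1):
        split = compute_split_size(block_size, min_size, max_size)
        # about one map task per split; in reality splits do not cross
        # part-file boundaries, so the real count is a bit higher
        return max(1, total_input_bytes // split)

    # e.g. raising split.minsize from a 128 MB default block size to 512 MB
    # cuts the mapper count roughly by a factor of four:
    print(estimate_map_tasks(40 << 30))                      # ~320 map tasks
    print(estimate_map_tasks(40 << 30, min_size=512 << 20))  # ~80 map tasks

If the input is not splittable, the number of part files (i.e. the reducer count of the job that
wrote them) is the only lever.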

I would look at it from the data-size side: if a mapper processes only a few MB of the CrawlDb, the
split is too small and the per-task overhead dominates; if it is multiple GB (compressed), the
reducers will run too long (and the mappers as well, if the input is not splittable). The details
depend on your cluster hardware, though.
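
As a back-of-envelope illustration (the bytes-per-record figure below is a pure assumption; measure
the real size with 'hadoop fs -du -s -h crawl/crawldb' instead):

    # Made-up sizing example -- substitute your measured CrawlDb size.
    urls = 190000000               # ~1.9e8 records in the CrawlDb
    bytes_per_record = 100         # assumed average on-disk size per record
    crawldb_bytes = urls * bytes_per_record        # ~19 GB

    target_per_mapper = 512 << 20  # aim for a few hundred MB per map task
    mappers = crawldb_bytes // target_per_mapper   # ~35 map tasks
    print(crawldb_bytes / 2**30, mappers)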

Best,
Sebastian


On 05/16/2017 08:04 PM, Michael Coffey wrote:
> I am looking for a methodology for making the crawler cycle go faster. I had expected the run-time to be dominated by fetcher performance but, instead, the greater bulk of the time is taken by linkdb-merge + indexer + crawldb-update + generate-select.
> 
> 
> Can anyone provide an outline of such a methodology, or a link to one already published?
> Also, more tactically speaking, I read in "Hadoop, the Definitive Guide" that the numbers of mappers and reducers are the first things to check. I know how to set the number of reducers, but it's not obvious how to control the number of mappers. In my situation (1.9e8+ urls in crawldb), I get orders of magnitude more mappers than there are disks (or cpu cores) in my cluster. Are there things I should do to bring it down to something less than 10x the number of disks or 4x the number of cores, or something like that?
>    
> 


tuning for speed

Posted by Michael Coffey <mc...@yahoo.com.INVALID>.
I am looking for a methodology for making the crawler cycle go faster. I had expected the run-time to be dominated by fetcher performance but, instead, the greater bulk of the time is taken by linkdb-merge + indexer + crawldb-update + generate-select.


Can anyone provide an outline of such a methodology, or a link to one already published?
Also, more tactically speaking, I read in "Hadoop, the Definitive Guide" that the numbers of mappers and reducers are the first things to check. I know how to set the number of reducers, but it's not obvious how to control the number of mappers. In my situation (1.9e8+ urls in crawldb), I get orders of magnitude more mappers than there are disks (or cpu cores) in my cluster. Are there things I should do to bring it down to something less than 10x the number of disks or 4x the number of cores, or something like that?