Posted to user@nutch.apache.org by Michael Coffey <mc...@yahoo.com.INVALID> on 2018/04/13 00:52:53 UTC

spilled records from reducer

Greetings Nutchlings,
I would like to make my generate jobs go faster, and I see that the reducer spills a lot of records.
Here are the numbers for a typical long-running reduce task of the generate-select job: 100 million spilled records, 255K input records, 90K output records, 13G file bytes written, and only 3G committed heap usage. mapreduce.reduce.java.opts is 8000M and mapreduce.reduce.memory.mb is 12000.
Do I have to increase mapreduce.reduce.java.opts and mapreduce.reduce.memory.mb? If so, how can I work out how big they should be? Also, are there other settings I need to change?
My actual command line is apache-nutch-1.12/runtime/deploy/bin/nutch generate -D mapreduce.job.reduces=16 -D mapreduce.input.fileinputformat.split.minsize=536870912 -D mapreduce.reduce.memory.mb=12000 -D mapreduce.reduce.java.opts=-Xmx8000m -D db.fetch.interval.default=5184000 -D db.fetch.schedule.adaptive.min_interval=3888000 -D generate.update.crawldb=true -D generate.max.count=25 /crawls/popular/data/crawldb /crawls/popular/data/segments/ -topN 60000 -numFetchers 2 -noFilter -maxNumSegments 24
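
For reference, the aggregate figure behind these counters can also be pulled from the command line, either while the job runs or afterwards via the job history server. The job id below is a placeholder, and the group and counter names assume the standard Hadoop 2 TaskCounter:

  # total spilled records for the whole job, map and reduce tasks combined
  mapred job -counter job_1523577173265_0042 org.apache.hadoop.mapreduce.TaskCounter SPILLED_RECORDS

The per-task breakdown is easiest to see in the job history server UI.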




Re: spilled records from reducer

Posted by Michael Coffey <mc...@yahoo.com.INVALID>.
Hi Sebastian, thanks for the response.

The numbers I gave were for a single reduce task, not a whole job. I'll try to give a better picture.

crawldb/current has 161.4 GB of data covering about 1.6 billion URLs. I don't know how many hosts or domains that is, but I assume it is many millions.

The cluster currently has 6 worker nodes, each with 32 GB RAM, a 1 TB SSD, and a 4 TB HDD. The crawldb, linkdb, and OS are on the SSD; the other HDFS directories are on the HDD.

The generate jobs are given 16 reduces and a split size of 512 MB, which yields about 330 map tasks.
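
That map count lines up with the split size; as a back-of-the-envelope check (assuming the job's input is essentially the 161.4 GB crawldb/current):

  # CrawlDb size / split size ~= number of map tasks
  awk 'BEGIN { printf "%.0f\n", (161.4 * 1024) / 512 }'   # ~323, close to the observed ~330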


The per-reduce-task spilled-record count is about 100 million. For the whole job, there are about 3.1 billion spilled records, half attributed to maps and half to reduces. While the job totals about 1.6 billion map input records and map output records, there are just 6.5 million reduce input records.
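
As a rough cross-check (assuming the usual Hadoop accounting, where every map output record is spilled to local disk at least once, and is counted again if the reduce side has to merge its shuffled input on disk), those totals are consistent with each other:

  # map-side spills ~= map output records, i.e. each record spilled about once (the minimum)
  awk 'BEGIN { printf "%.2f map-side spills per map output record\n", 1.55e9 / 1.6e9 }'
  # reduce-side spills ~= the same records spread over 16 reducers,
  # which matches the ~100 million spilled records seen per reduce task
  awk 'BEGIN { printf "%.0f million records shuffled per reducer\n", 1.6e9 / 16 / 1e6 }'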

Does the number of spilled records make sense for a job this size, or is it something I should try to decrease? The reason I ask is that my dedicated Nutch cluster is disk-bound much of the time, and I'm wondering whether there are ways to do more in memory and save on disk I/O.





On Friday, April 13, 2018, 1:19:53 AM PDT, Sebastian Nagel wrote:





Hi Michael,

> reducer spills a lot of records

The job counter "Spilled Records" is not for the reducers alone; it counts records spilled by the map tasks as well.

> 255K input records

Does your CrawlDb only contain 250,000 entries?

Also, how many hosts (or domains/IPs, depending on partition.url.mode)
are in the CrawlDb? Note: the counts per host/domain/IP are kept in a
HashMap, which does not scale up to 100 million hosts.
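
If you are unsure how many hosts the CrawlDb covers, the domainstats tool in recent 1.x releases can count them; the output path and reducer count below are just examples, and the mode should match partition.url.mode (byHost in the default configuration):

  # counts URLs per host in the CrawlDb; use 'domain' or 'tld' for coarser grouping
  bin/nutch domainstats /crawls/popular/data/crawldb/current /crawls/popular/tmp/hoststats host 16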

> 100 million spilled records
> 13G file bytes written

With these numbers, my estimate would be in the tens or hundreds of millions of CrawlDb items.
Something is wrong if the CrawlDb is really that small.

> -D generate.update.crawldb=true

That's expensive if your CrawlDb is large.

> Do I have to increase  mapreduce.reduce.java.opts and mapreduce.reduce.memory.mb?

If the problem is a large number of hosts, this might help.
But you could also try to make sure that all data (HDFS and temporary) is on SSDs,
and try different compression settings (CrawlDb and temporary data); see the
properties below (an example command line follows the list):
  mapreduce.output.fileoutputformat.compress.codec
  mapreduce.map.output.compress
  mapreduce.map.output.compress.codec
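
Applied to the generate command above, that could look like the following; Snappy is just one common choice and assumes the native libraries are installed on the workers, and mapreduce.output.fileoutputformat.compress has to be enabled for the output codec to take effect:

  bin/nutch generate \
      -D mapreduce.map.output.compress=true \
      -D mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
      -D mapreduce.output.fileoutputformat.compress=true \
      -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
      ... (remaining options as before)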

Best,
Sebastian




Re: spilled records from reducer

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Michael,

> reducer spills a lot of records

The job counter "Spilled Records" is not for the reducers alone; it counts records spilled by the map tasks as well.

> 255K input records

Does your CrawlDb only contain 250,000 entries?

Also, how many hosts (or domains/IPs, depending on partition.url.mode)
are in the CrawlDb? Note: the counts per host/domain/IP are kept in a
HashMap, which does not scale up to 100 million hosts.

> 100 million spilled records
> 13G file bytes written

With these numbers, my estimate would be in the tens or hundreds of millions of CrawlDb items.
Something is wrong if the CrawlDb is really that small.

> -D generate.update.crawldb=true

That's expensive if your CrawlDb is large.

> Do I have to increase  mapreduce.reduce.java.opts and mapreduce.reduce.memory.mb?

If the problem is a large number of hosts, this might help.
But you could also try to make sure that all data (HDFS and temporary) is on SSDs,
and try different compression settings (CrawlDb and temporary data); see
  mapreduce.output.fileoutputformat.compress.codec
  mapreduce.map.output.compress
  mapreduce.map.output.compress.codec

Best,
Sebastian

