Posted to user@nutch.apache.org by MoD <w...@ant.com> on 2009/08/16 18:27:54 UTC

Nutch updatedb Crash

Hi,

During the CrawlDb map-reduce job,
the reduce workers fail one by one with:

java.lang.OutOfMemoryError: GC overhead limit exceeded
	at java.util.concurrent.ConcurrentHashMap$HashEntry.newArray(ConcurrentHashMap.java:205)
	at java.util.concurrent.ConcurrentHashMap$Segment.<init>(ConcurrentHashMap.java:291)
	at java.util.concurrent.ConcurrentHashMap.<init>(ConcurrentHashMap.java:613)
	at java.util.concurrent.ConcurrentHashMap.<init>(ConcurrentHashMap.java:652)
	at org.apache.hadoop.io.AbstractMapWritable.<init>(AbstractMapWritable.java:49)
	at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:42)
	at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:52)
	at org.apache.nutch.crawl.CrawlDatum.set(CrawlDatum.java:321)
	at org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:96)
	at org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:35)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
	at org.apache.hadoop.mapred.Child.main(Child.java:158)


I have the default 1 GB per JVM.

/opt/java/jre/bin/java -Xmx1000m


Running out of memory in a Java process is somewhat surprising.
Does this job need more than 1 GB of RAM per node?

By the way, I don't have swap; the system has 8 GB and doesn't seem
to be short of RAM.

My command line :

nutch@titaniumpelican search $ ./bin/nutch  updatedb
hdfs://titaniumpelican:9000/user/nutch/crawl/crawldb -dir
hdfs://titaniumpelican:9000/user/nutch/crawl/segments
CrawlDb update: starting
CrawlDb update: db: hdfs://titaniumpelican:9000/user/nutch/crawl/crawldb
CrawlDb update: segments:
[hdfs://titaniumpelican:9000/user/nutch/crawl/segments/20090814122219]
CrawlDb update: additions allowed: false
CrawlDb update: URL normalizing: false
CrawlDb update: URL filtering: false
CrawlDb update: Merging segment data into db.
java.lang.OutOfMemoryError: Java heap space


Question: why does this job split the work into 140 map tasks?

Regards,
Louis

Re: Nutch updatedb Crash

Posted by MoD <w...@ant.com>.
Fixed, thanks.


On Sun, Aug 16, 2009 at 8:38 PM, Andrzej Bialecki<ab...@getopt.org> wrote:
> MoD wrote:
>>
>> Julien,
>>
>> I tried with 2048 MB per task child;
>> no luck, I still have two reduces that don't go through.
>>
>> Is it related to the number of reduces?
>> On this cluster I have 4 servers:
>> - dual Xeon dual core (8 cores)
>> - 8 GB RAM
>> - 4 disks
>>
>> I set mapred.reduce.tasks and mapred.map.tasks to 16,
>> because: 4 servers with 4 disks each. (What do you think?)
>>
>> If this job is too big for my cluster, would adding more reduce tasks
>> subdivide the problem into smaller reduces?
>> I think not, since I guess the input keys for the same domain end up in the same reduce?
>>
>> So are my last two reduce tasks the biggest domains in my DB?
>
> This is likely caused by a large number of inlinks for certain urls - the
> updatedb reduce collects this list in memory, and this sometimes leads to
> memory exhaustion. Please try limiting the max. number of inlinks per url
> (see nutch-default.xml for details).
>
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>

Re: Nutch updatedb Crash

Posted by Andrzej Bialecki <ab...@getopt.org>.
MoD wrote:
> Julien,
> 
> I tried with 2048 MB per task child;
> no luck, I still have two reduces that don't go through.
> 
> Is it related to the number of reduces?
> On this cluster I have 4 servers:
> - dual Xeon dual core (8 cores)
> - 8 GB RAM
> - 4 disks
> 
> I set mapred.reduce.tasks and mapred.map.tasks to 16,
> because: 4 servers with 4 disks each. (What do you think?)
> 
> If this job is too big for my cluster, would adding more reduce tasks
> subdivide the problem into smaller reduces?
> I think not, since I guess the input keys for the same domain end up in the same reduce?
> 
> So are my last two reduce tasks the biggest domains in my DB?

This is likely caused by a large number of inlinks for certain urls - 
the updatedb reduce collects this list in memory, and this sometimes 
leads to memory exhaustion. Please try limiting the max. number of 
inlinks per url (see nutch-default.xml for details).
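For example, something along these lines in conf/nutch-site.xml (this assumes
the db.update.max.inlinks property documented in nutch-default.xml; check your
version for the exact name and default, and pick a limit that fits your heap):

<property>
  <!-- assumed property name; verify against your nutch-default.xml -->
  <name>db.update.max.inlinks</name>
  <value>5000</value>
</property>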


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Nutch updatedb Crash

Posted by MoD <w...@ant.com>.
Julien,

I tried with 2048 MB per task child;
no luck, I still have two reduces that don't go through.

Is it related to the number of reduces?
On this cluster I have 4 servers:
- dual Xeon dual core (8 cores)
- 8 GB RAM
- 4 disks

I set mapred.reduce.tasks and mapred.map.tasks to 16,
because: 4 servers with 4 disks each. (What do you think?)
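(For reference, a sketch of what that looks like in conf/hadoop-site.xml:)

<property>
  <name>mapred.map.tasks</name>
  <value>16</value>  <!-- as noted above: 4 servers x 4 disks -->
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>16</value>
</property>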

If this job is too big for my cluster, would adding more reduce tasks
subdivide the problem into smaller reduces?
I think not, since I guess the input keys for the same domain end up in the same reduce?

So are my last two reduce tasks the biggest domains in my DB?

L.


On Sun, Aug 16, 2009 at 6:39 PM, Julien
Nioche<li...@gmail.com> wrote:
> Hi,
>
> The reducing step of the updatedb requires quite a lot of memory indeed. See
> https://issues.apache.org/jira/browse/NUTCH-702 for a discussion on this
> subject.
> BTW you'll have to specify the parameter mapred.child.java.opts in your
> conf/hadoop-site.xml so that the value is sent to the hadoop slaves. Another
> way to do that is to specify it on the command line with: -D
> mapred.child.java.opts=-Xmx2000m
>
> Julien
> --
> DigitalPebble Ltd
> http://www.digitalpebble.com
>
> 2009/8/16 MoD <w...@ant.com>
>
>> Hi,
>>
>> During the CrawlDb map-reduce job,
>> the reduce workers fail one by one with:
>>
>> java.lang.OutOfMemoryError: GC overhead limit exceeded
>>        at
>> java.util.concurrent.ConcurrentHashMap$HashEntry.newArray(ConcurrentHashMap.java:205)
>>        at
>> java.util.concurrent.ConcurrentHashMap$Segment.<init>(ConcurrentHashMap.java:291)
>>        at
>> java.util.concurrent.ConcurrentHashMap.<init>(ConcurrentHashMap.java:613)
>>        at
>> java.util.concurrent.ConcurrentHashMap.<init>(ConcurrentHashMap.java:652)
>>        at
>> org.apache.hadoop.io.AbstractMapWritable.<init>(AbstractMapWritable.java:49)
>>        at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:42)
>>        at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:52)
>>        at org.apache.nutch.crawl.CrawlDatum.set(CrawlDatum.java:321)
>>        at
>> org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:96)
>>        at
>> org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:35)
>>        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
>>        at org.apache.hadoop.mapred.Child.main(Child.java:158)
>>
>>
>> I have the default 1 GB per JVM.
>>
>> /opt/java/jre/bin/java -Xmx1000m
>>
>>
>> Running out of memory in a Java process is somewhat surprising.
>> Does this job need more than 1 GB of RAM per node?
>>
>> By the way, I don't have swap; the system has 8 GB and doesn't seem
>> to be short of RAM.
>>
>> My command line :
>>
>> nutch@titaniumpelican search $ ./bin/nutch  updatedb
>> hdfs://titaniumpelican:9000/user/nutch/crawl/crawldb -dir
>> hdfs://titaniumpelican:9000/user/nutch/crawl/segments
>> CrawlDb update: starting
>> CrawlDb update: db: hdfs://titaniumpelican:9000/user/nutch/crawl/crawldb
>> CrawlDb update: segments:
>> [hdfs://titaniumpelican:9000/user/nutch/crawl/segments/20090814122219]
>> CrawlDb update: additions allowed: false
>> CrawlDb update: URL normalizing: false
>> CrawlDb update: URL filtering: false
>> CrawlDb update: Merging segment data into db.
>> java.lang.OutOfMemoryError: Java heap space
>>
>>
>> Question: why does this job split the work into 140 map tasks?
>>
>> Regards,
>> Louis
>>
>
>
>
> --
> DigitalPebble Ltd
> http://www.digitalpebble.com
>

Re: Nutch updatedb Crash

Posted by Julien Nioche <li...@gmail.com>.
Hi,

The reducing step of the updatedb requires quite a lot of memory indeed. See
https://issues.apache.org/jira/browse/NUTCH-702 for a discussion on this
subject.
BTW you'll have to specify the parameter mapred.child.java.opts in your
conf/hadoop-site.xml so that the value is sent to the hadoop slaves. Another
way to do that is to specify it on the command line with: -D
mapred.child.java.opts=-Xmx2000m
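For example, a minimal sketch of that entry in conf/hadoop-site.xml (using the
-Xmx2000m value from above; size it to the memory actually available on your
slaves):

<property>
  <name>mapred.child.java.opts</name>
  <!-- example value only; adjust to your task tracker nodes -->
  <value>-Xmx2000m</value>
</property>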

Julien
-- 
DigitalPebble Ltd
http://www.digitalpebble.com

2009/8/16 MoD <w...@ant.com>

> Hi,
>
> During the CrawlDb map-reduce job,
> the reduce workers fail one by one with:
>
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>        at
> java.util.concurrent.ConcurrentHashMap$HashEntry.newArray(ConcurrentHashMap.java:205)
>        at
> java.util.concurrent.ConcurrentHashMap$Segment.<init>(ConcurrentHashMap.java:291)
>        at
> java.util.concurrent.ConcurrentHashMap.<init>(ConcurrentHashMap.java:613)
>        at
> java.util.concurrent.ConcurrentHashMap.<init>(ConcurrentHashMap.java:652)
>        at
> org.apache.hadoop.io.AbstractMapWritable.<init>(AbstractMapWritable.java:49)
>        at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:42)
>        at org.apache.hadoop.io.MapWritable.<init>(MapWritable.java:52)
>        at org.apache.nutch.crawl.CrawlDatum.set(CrawlDatum.java:321)
>        at
> org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:96)
>        at
> org.apache.nutch.crawl.CrawlDbReducer.reduce(CrawlDbReducer.java:35)
>        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:436)
>        at org.apache.hadoop.mapred.Child.main(Child.java:158)
>
>
> I have the default 1 GB per JVM.
>
> /opt/java/jre/bin/java -Xmx1000m
>
>
> Running out of memory in a Java process is somewhat surprising.
> Does this job need more than 1 GB of RAM per node?
>
> By the way, I don't have swap; the system has 8 GB and doesn't seem
> to be short of RAM.
>
> My command line :
>
> nutch@titaniumpelican search $ ./bin/nutch  updatedb
> hdfs://titaniumpelican:9000/user/nutch/crawl/crawldb -dir
> hdfs://titaniumpelican:9000/user/nutch/crawl/segments
> CrawlDb update: starting
> CrawlDb update: db: hdfs://titaniumpelican:9000/user/nutch/crawl/crawldb
> CrawlDb update: segments:
> [hdfs://titaniumpelican:9000/user/nutch/crawl/segments/20090814122219]
> CrawlDb update: additions allowed: false
> CrawlDb update: URL normalizing: false
> CrawlDb update: URL filtering: false
> CrawlDb update: Merging segment data into db.
> java.lang.OutOfMemoryError: Java heap space
>
>
> Question: why does this job split the work into 140 map tasks?
>
> Regards,
> Louis
>



-- 
DigitalPebble Ltd
http://www.digitalpebble.com