Posted to dev@nutch.apache.org by fa...@butterflycluster.net on 2009/11/05 02:29:14 UTC

MergeSegments - map reduce thread death

Hi there,

It seems I have some serious problems with Hadoop during the map-reduce
phase of MergeSegments.

I am out of ideas on this; any suggestions would be quite welcome.

Here is my set up:

RAM: 4G
JVM HEAP: 2G
mapred.child.java.opts = 1024M
hadoop-0.19.1-core.jar
nutch-1.0
Xen VPS.
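
For reference, the `mapred.child.java.opts` value above normally lives in Hadoop's site configuration. A minimal sketch, written to a throwaway file in the current directory for illustration (the real file in a Nutch 1.0 / Hadoop 0.19 setup is conf/hadoop-site.xml):

```shell
# Sketch: how mapred.child.java.opts is typically set (illustrative copy,
# not the live conf/hadoop-site.xml).
cat > hadoop-site.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1024m</value>
  </property>
</configuration>
EOF
grep 'Xmx1024m' hadoop-site.xml && echo "child heap set to 1024M"
```

Note that with a 2G parent JVM heap on a 4G box, each local child task gets at most the 1024M configured here, which matters once a single segment grows to hundreds of megabytes.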

After running a recrawl a few times, I end up with one segment that is
much larger than the newly generated ones. Here is my segments
structure when things blow up, after the fifth recrawl:

segment1 = 674 MB (after several recrawls)
segment2 = 580 KB (last recrawl)
segment3 = 568 KB (last recrawl)
segment4 = 584 KB (last recrawl)
..
segment8 = 560 KB (last recrawl)

When I run mergeSegments everything goes well until the map-reduce job
reaches about 90%, and then we get a thread death. Here is the stack trace:

2009-11-05 10:54:16,874 INFO  [org.apache.hadoop.mapred.LocalJobRunner] reduce > reduce
2009-11-05 10:54:29,794 INFO  [org.apache.hadoop.mapred.LocalJobRunner] reduce > reduce
2009-11-05 10:54:55,194 INFO  [org.apache.hadoop.mapred.LocalJobRunner] reduce > reduce
2009-11-05 10:57:25,844 WARN  [org.apache.hadoop.mapred.LocalJobRunner] job_local_0001
java.lang.ThreadDeath
        at java.lang.Thread.stop(Thread.java:715)
        at org.apache.hadoop.mapred.LocalJobRunner.killJob(LocalJobRunner.java:310)
        at org.apache.hadoop.mapred.JobClient$NetworkedJob.killJob(JobClient.java:315)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1239)
        at org.apache.nutch.segment.SegmentMerger.merge(SegmentMerger.java:620)
        at org.apache.nutch.segment.SegmentMerger.main(SegmentMerger.java:665)

Any suggestions please!

thanks.




Re: MergeSegments - map reduce thread death

Posted by fa...@butterflycluster.net.
Hi there,

We tried a few things around this. One suggestion was to run it on a
local machine, so I pulled one of our decent servers and got to work...
but surprisingly we got the same error there too.

So the hardware (VPS vs. local) wasn't the culprit; it's probably the
data, or the code.

We then decided to discard the db and generate a new one. Things seem to
be working normally so far, but let's see when the db becomes larger.

Having said that, there were a few things we found out that need
clarification as to whether or not they caused the problems.

Here is the scenario, in sequence of execution:

step 1 setup.

* first crawl was done using "bin/nutch crawl.."
- urls = 1500
- depth = 10
- topN = 500
(so it should have fetched everything by round 3, right? what happens in rounds 4 to 10?)
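
If I understand the generator correctly, once every due URL has been fetched, the remaining rounds simply produce empty fetchlists and do nothing, so rounds 4 to 10 would be no-ops here. A minimal sketch of the step-1 command (directory names are illustrative; the script is only written out and syntax-checked, since running it needs a live Nutch install):

```shell
# Sketch of the step-1 one-shot crawl: 1500 seeds, depth 10, topN 500.
cat > crawl.sh <<'EOF'
#!/bin/sh
# With ~1500 seeds and topN 500, the first 3 rounds cover the seed list;
# later rounds only fetch newly discovered outlinks that are due.
bin/nutch crawl urls -dir crawl -depth 10 -topN 500
EOF
bash -n crawl.sh && echo "crawl.sh parses"
```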

step 2 to 5 setup.

* recrawl (repeat)
- topN = 10000
- depth = 10
- db.default.fetch.interval = 30 (doesn't seem to do anything)
- generate.update.crawldb = false (the same fetchlist was being generated)
- injected the seed urls again (bad! we didn't realise this was happening,
but what is the effect of doing this?)
- fetch
- update db
(the steps above were an effort to get an incremental crawl.. )
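
The steps 2-5 loop above can be sketched as a script; note it deliberately does NOT re-inject the seeds each round, since that re-injection was flagged above as unintended. Paths are illustrative, and the script is only syntax-checked here because it needs a live Nutch install:

```shell
# Sketch of the incremental recrawl loop from steps 2-5.
cat > recrawl.sh <<'EOF'
#!/bin/sh
CRAWL=crawl
DEPTH=10
TOPN=10000
i=1
while [ "$i" -le "$DEPTH" ]; do
  # generate a fetchlist, fetch the newest segment, fold results into crawldb
  bin/nutch generate $CRAWL/crawldb $CRAWL/segments -topN $TOPN
  SEG=$(ls -d $CRAWL/segments/* | tail -1)   # newest segment dir
  bin/nutch fetch "$SEG"
  bin/nutch updatedb $CRAWL/crawldb "$SEG"
  i=$((i + 1))
done
EOF
bash -n recrawl.sh && echo "recrawl.sh parses"
```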

step 6
* merge segments, invertlinks, indexes...
- at this stage map-reduce just died during MergeSegments, with an
out-of-heap-memory exception.

The assumption was that, with a seed list of 1500 urls, Nutch would
generate more NEW urls from the crawldb based on the outlinks it found -
is this true? Because it did not seem to be the case.

Also, what is the effect of running a recrawl with a topN larger than
what Nutch can actually generate?




Re: MergeSegments - map reduce thread death

Posted by fa...@butterflycluster.net.
I tried this once, but before I knew it my log file was approaching a
gig within an hour or so!





Re: MergeSegments - map reduce thread death

Posted by Kalaimathan Mahenthiran <ma...@gmail.com>.
I suggest turning on the debug logs for Hadoop before you do the next
crawl. You can do this by editing log4j.properties and changing the
rootLogger from INFO to DEBUG.
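
The change can be sketched as below; this works on a throwaway one-line copy of the relevant setting, since the stock conf/log4j.properties root-logger line looks roughly like this (the DRFA appender name is the usual Nutch default, shown here as an assumption):

```shell
# Sketch: flip the root logger from INFO to DEBUG in a copy of the file.
printf 'log4j.rootLogger=INFO,DRFA\n' > log4j.properties
sed 's/^log4j.rootLogger=INFO/log4j.rootLogger=DEBUG/' log4j.properties \
  > log4j.properties.tmp && mv log4j.properties.tmp log4j.properties
grep '^log4j.rootLogger' log4j.properties
```

Remember to flip it back to INFO afterwards, given how fast DEBUG output grows.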


Re: MergeSegments - map reduce thread death

Posted by Andrzej Bialecki <ab...@getopt.org>.

This is a high-level exception that doesn't indicate the nature of the 
original problem. Is there any other information in hadoop.log or in 
the task logs (logs/userlogs)?

In my experience this sort of thing happens rarely for a dataset as 
relatively small as yours, so you are lucky ;) It could be related to a 
number of issues: running under Xen, which imposes some limits and 
slowdowns; a low file descriptor limit (ulimit -n); faulty RAM; an 
overheated CPU ...
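
The file descriptor limit is the easiest of these to check; a merge over many segment part files can exhaust a low per-process limit. The 4096 threshold below is only an illustrative rule of thumb:

```shell
# Check the per-process open-file limit mentioned above.
FDLIMIT=$(ulimit -n)
echo "open file limit: $FDLIMIT"
if [ "$FDLIMIT" != "unlimited" ] && [ "$FDLIMIT" -lt 4096 ]; then
  echo "low fd limit - consider 'ulimit -n 4096' before running the merge"
fi
```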

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com