Posted to user@nutch.apache.org by MilleBii <mi...@gmail.com> on 2009/12/05 09:50:34 UTC

Fetch failing ?

My fetch cycle failed with the following initial error:

java.io.IOException: Task process exit with nonzero status of 65.
	at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:425)

Then it makes a second attempt, and after 3 hours I hit this error
(although I had doubled HADOOP_HEAPSIZE):

java.lang.OutOfMemoryError: GC overhead limit exceeded


Any idea what the initial error is or could be?
For the second one, I'm going to reduce the number of threads... but I'm
wondering if there could be a memory leak? And I don't know how to trace that.
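
Would something along these lines be the way to trace it? Just a sketch, and
I'm assuming mapred.child.java.opts is the right knob for the task JVMs and
that they are Sun JDK 6-style JVMs:

    <!-- conf/mapred-site.xml (or hadoop-site.xml on older releases) -->
    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx512m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp</value>
    </property>

and then open the resulting .hprof file with jhat or a profiler after the
next failure.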

-- 
-MilleBii-

Re: Fetch failing ?

Posted by MilleBii <mi...@gmail.com>.
Still failing on a 300k fetch run (about 4 hours).

I first get a long series of OutOfMemoryErrors, but it keeps fetching somehow
and then it ends with:
attempt_200912070739_0011_m_000000_0: Exception in thread "Thread for
syncLogs" java.lang.OutOfMemoryError: Java heap space

But the job never ends, not even with an error... so I have to shut it down
(kill and restart Hadoop).
I increased NUTCH_HEAPSIZE, with no luck.

Any idea what to do next? I'd rather not reduce the run size.
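
By the way, to get rid of the hung job I currently kill and restart the whole
of Hadoop; I suppose the cleaner way (assuming the stock 0.20-style command
line) would be something like:

    bin/hadoop job -list
    bin/hadoop job -kill <job_id>

though of course that doesn't solve the OOM itself.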

2009/12/6 MilleBii <mi...@gmail.com>

> New and longer run... I get plenty of these: failed with:
> java.lang.OutOfMemoryError: Java heap space
> Fetching still goes on; not sure if this is the expected behavior.
>
>
> 2009/12/6 MilleBii <mi...@gmail.com>
>
>> Works fine now, and my memory problem turned out to be that I had too many
>> threads...
>>
>> 2009/12/5 MilleBii <mi...@gmail.com>
>>
>>> Thx again Julien,
>>>
>>> Yes, I'm going to buy myself the Hadoop book; I thought I could do without
>>> it, but I realize that I need to make good use of Hadoop.
>>>
>>> Didn't know you could split fetching & parsing: so I suppose you just issue
>>> nutch fetch <segment> -noParsing, followed by nutch parse <segment>. I will
>>> try it on my next run.
>>>
>>>
>>>
>>> 2009/12/5 Julien Nioche <li...@gmail.com>
>>>
>>>> HADOOP_HEAPSIZE specifies the memory to be used by the Hadoop daemons and
>>>> does NOT affect the memory used for the map/reduce tasks. Maybe you should
>>>> invest a bit of time reading about Hadoop first?
>>>>
>>>> As for your memory problem, it could be due to the parsing and not the
>>>> fetching. If you don't do so already, I suggest that you separate the
>>>> fetching from the parsing. First, that will tell you which part fails; and
>>>> if it does fail in the parsing, you would not need to refetch the content.
>>>>
>>>> J.
>>>>
>>>> 2009/12/5 MilleBii <mi...@gmail.com>
>>>>
>>>> > My fetch cycle failed with the following initial error:
>>>> >
>>>> > java.io.IOException: Task process exit with nonzero status of 65.
>>>> >        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:425)
>>>> >
>>>> > Then it makes a second attempt, and after 3 hours I hit this error
>>>> > (although I had doubled HADOOP_HEAPSIZE):
>>>> >
>>>> > java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>> >
>>>> >
>>>> > Any idea what the initial error is or could be?
>>>> > For the second one, I'm going to reduce the number of threads... but I'm
>>>> > wondering if there could be a memory leak? And I don't know how to trace
>>>> > that.
>>>> >
>>>> > --
>>>> > -MilleBii-
>>>> >
>>>>
>>>>
>>>>
>>>> --
>>>> DigitalPebble Ltd
>>>> http://www.digitalpebble.com
>>>>
>>>
>>>
>>>
>>> --
>>> -MilleBii-
>>>
>>
>>
>>
>> --
>> -MilleBii-
>>
>
>
>
> --
> -MilleBii-
>



-- 
-MilleBii-

Re: Fetch failing ?

Posted by MilleBii <mi...@gmail.com>.
New and longer run... I get plenty of these: failed with:
java.lang.OutOfMemoryError: Java heap space
Fetching still goes on; not sure if this is the expected behavior.


2009/12/6 MilleBii <mi...@gmail.com>

> Works fine now, and my memory problem turned out to be that I had too many
> threads...
>
> 2009/12/5 MilleBii <mi...@gmail.com>
>
>> Thx again Julien,
>>
>> Yes, I'm going to buy myself the Hadoop book; I thought I could do without
>> it, but I realize that I need to make good use of Hadoop.
>>
>> Didn't know you could split fetching & parsing: so I suppose you just issue
>> nutch fetch <segment> -noParsing, followed by nutch parse <segment>. I will
>> try it on my next run.
>>
>>
>>
>> 2009/12/5 Julien Nioche <li...@gmail.com>
>>
>>> HADOOP_HEAPSIZE specifies the memory to be used by the Hadoop daemons and
>>> does NOT affect the memory used for the map/reduce tasks. Maybe you should
>>> invest a bit of time reading about Hadoop first?
>>>
>>> As for your memory problem, it could be due to the parsing and not the
>>> fetching. If you don't do so already, I suggest that you separate the
>>> fetching from the parsing. First, that will tell you which part fails; and
>>> if it does fail in the parsing, you would not need to refetch the content.
>>>
>>> J.
>>>
>>> 2009/12/5 MilleBii <mi...@gmail.com>
>>>
>>> > My fetch cycle failed with the following initial error:
>>> >
>>> > java.io.IOException: Task process exit with nonzero status of 65.
>>> >        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:425)
>>> >
>>> > Then it makes a second attempt, and after 3 hours I hit this error
>>> > (although I had doubled HADOOP_HEAPSIZE):
>>> >
>>> > java.lang.OutOfMemoryError: GC overhead limit exceeded
>>> >
>>> >
>>> > Any idea what the initial error is or could be?
>>> > For the second one, I'm going to reduce the number of threads... but I'm
>>> > wondering if there could be a memory leak? And I don't know how to trace
>>> > that.
>>> >
>>> > --
>>> > -MilleBii-
>>> >
>>>
>>>
>>>
>>> --
>>> DigitalPebble Ltd
>>> http://www.digitalpebble.com
>>>
>>
>>
>>
>> --
>> -MilleBii-
>>
>
>
>
> --
> -MilleBii-
>



-- 
-MilleBii-

Re: Fetch failing ?

Posted by MilleBii <mi...@gmail.com>.
Works fine now, and my memory problem turned out to be that I had too many
threads...
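
For the record, lowering the thread count can be done either on the command
line or in the config (assuming the standard fetcher properties apply here):

    # fewer fetcher threads for this run
    bin/nutch fetch <segment> -threads 10

    <!-- or, in conf/nutch-site.xml -->
    <property>
      <name>fetcher.threads.fetch</name>
      <value>10</value>
    </property>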

2009/12/5 MilleBii <mi...@gmail.com>

> Thx again Julien,
>
> Yes, I'm going to buy myself the Hadoop book; I thought I could do without
> it, but I realize that I need to make good use of Hadoop.
>
> Didn't know you could split fetching & parsing: so I suppose you just issue
> nutch fetch <segment> -noParsing, followed by nutch parse <segment>. I will
> try it on my next run.
>
>
>
> 2009/12/5 Julien Nioche <li...@gmail.com>
>
>> HADOOP_HEAPSIZE specifies the memory to be used by the Hadoop daemons and
>> does NOT affect the memory used for the map/reduce tasks. Maybe you should
>> invest a bit of time reading about Hadoop first?
>>
>> As for your memory problem, it could be due to the parsing and not the
>> fetching. If you don't do so already, I suggest that you separate the
>> fetching from the parsing. First, that will tell you which part fails; and
>> if it does fail in the parsing, you would not need to refetch the content.
>>
>> J.
>>
>> 2009/12/5 MilleBii <mi...@gmail.com>
>>
>> > My fetch cycle failed with the following initial error:
>> >
>> > java.io.IOException: Task process exit with nonzero status of 65.
>> >        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:425)
>> >
>> > Then it makes a second attempt, and after 3 hours I hit this error
>> > (although I had doubled HADOOP_HEAPSIZE):
>> >
>> > java.lang.OutOfMemoryError: GC overhead limit exceeded
>> >
>> >
>> > Any idea what the initial error is or could be?
>> > For the second one, I'm going to reduce the number of threads... but I'm
>> > wondering if there could be a memory leak? And I don't know how to trace
>> > that.
>> >
>> > --
>> > -MilleBii-
>> >
>>
>>
>>
>> --
>> DigitalPebble Ltd
>> http://www.digitalpebble.com
>>
>
>
>
> --
> -MilleBii-
>



-- 
-MilleBii-

Re: Fetch failing ?

Posted by MilleBii <mi...@gmail.com>.
Thx again Julien,

Yes, I'm going to buy myself the Hadoop book; I thought I could do without it,
but I realize that I need to make good use of Hadoop.

Didn't know you could split fetching & parsing: so I suppose you just issue
nutch fetch <segment> -noParsing, followed by nutch parse <segment>. I will
try it on my next run.
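
Concretely, something like this is what I have in mind for the next run (just
a sketch; I'm assuming the -noParsing flag and the fetcher.parse property in
nutch-default.xml do what I think they do):

    # fetch without parsing, then parse the same segment separately
    bin/nutch fetch <segment> -noParsing
    bin/nutch parse <segment>

    <!-- or turn parsing off once and for all in conf/nutch-site.xml -->
    <property>
      <name>fetcher.parse</name>
      <value>false</value>
    </property>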



2009/12/5 Julien Nioche <li...@gmail.com>

> HADOOP_HEAPSIZE specifies the memory to be used by the Hadoop daemons and
> does NOT affect the memory used for the map/reduce tasks. Maybe you should
> invest a bit of time reading about Hadoop first?
>
> As for your memory problem, it could be due to the parsing and not the
> fetching. If you don't do so already, I suggest that you separate the
> fetching from the parsing. First, that will tell you which part fails; and
> if it does fail in the parsing, you would not need to refetch the content.
>
> J.
>
> 2009/12/5 MilleBii <mi...@gmail.com>
>
> > My fetch cycle failed with the following initial error:
> >
> > java.io.IOException: Task process exit with nonzero status of 65.
> >        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:425)
> >
> > Then it makes a second attempt, and after 3 hours I hit this error
> > (although I had doubled HADOOP_HEAPSIZE):
> >
> > java.lang.OutOfMemoryError: GC overhead limit exceeded
> >
> >
> > Any idea what the initial error is or could be?
> > For the second one, I'm going to reduce the number of threads... but I'm
> > wondering if there could be a memory leak? And I don't know how to trace
> > that.
> >
> > --
> > -MilleBii-
> >
>
>
>
> --
> DigitalPebble Ltd
> http://www.digitalpebble.com
>



-- 
-MilleBii-

Re: Fetch failing ?

Posted by Julien Nioche <li...@gmail.com>.
HADOOP_HEAPSIZE specifies the memory to be used by the Hadoop daemons and
does NOT affect the memory used for the map/reduce tasks. Maybe you should
invest a bit of time reading about Hadoop first?
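
The per-task heap is controlled through the job configuration instead, roughly
along these lines (depending on your Hadoop version; treat it as a sketch):

    <!-- conf/mapred-site.xml (hadoop-site.xml on older releases) -->
    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx1000m</value>
    </property>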

As for your memory problem, it could be due to the parsing and not the
fetching. If you don't do so already, I suggest that you separate the
fetching from the parsing. First, that will tell you which part fails; and if
it does fail in the parsing, you would not need to refetch the content.

J.

2009/12/5 MilleBii <mi...@gmail.com>

> My fetch cycle failed with the following initial error:
>
> java.io.IOException: Task process exit with nonzero status of 65.
>        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:425)
>
> Then it makes a second attempt, and after 3 hours I hit this error
> (although I had doubled HADOOP_HEAPSIZE):
>
> java.lang.OutOfMemoryError: GC overhead limit exceeded
>
>
> Any idea what the initial error is or could be?
> For the second one, I'm going to reduce the number of threads... but I'm
> wondering if there could be a memory leak? And I don't know how to trace that.
>
> --
> -MilleBii-
>



-- 
DigitalPebble Ltd
http://www.digitalpebble.com