Posted to user@nutch.apache.org by Sybille Peters <pe...@rrzn.uni-hannover.de> on 2013/10/18 15:32:20 UTC

Nutch 1.7 / Parser / java.lang.OutOfMemoryError: unable to create new native thread

Hello,

Using the default crawl script (runtime/local/bin/crawl), the parser
crashes trying to create a new thread after parsing slightly more than
5000 documents.

This only happens if the number of documents to crawl (generate -topN) 
is set to > 5000.

Monitoring the number of threads created by the Nutch Java process shows
that it increases to about 5700 before the crash occurs.
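
(For anyone reproducing this: a quick way to watch the thread count on
Linux is to poll it with ps. <nutch-jvm-pid> is a placeholder for the
process id of the local Nutch/Hadoop JVM:

watch -n 1 'ps -o nlwp= -p <nutch-jvm-pid>'

where nlwp reports the number of threads in the process.)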

I thought that the parser would not create that many threads in the 
first place. Is this a bug or a misconfiguration? Is there any way to 
limit the number of threads explicitly for parsing?

I found this thread, in which it is recommended to decrease the number 
of URLs (topN): 
http://lucene.472066.n3.nabble.com/Nutch-1-6-java-lang-OutOfMemoryError-unable-to-create-new-native-thread-td4044231.html

Is this the only possible solution? Older Nutch versions did not have 
this problem.

Parameters:
---------------
numSlaves=1
numTasks=`expr $numSlaves \* 2`
commonOptions="-D mapred.reduce.tasks=$numTasks -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true"
skipRecordsOptions="-D mapred.skip.attempts.to.start.skipping=2 -D mapred.skip.map.max.skip.records=1"

$bin/nutch parse $commonOptions $skipRecordsOptions $CRAWL_PATH/segments/$SEGMENT

hadoop.log
----------------

2013-10-18 14:57:28,294 INFO  parse.ParseSegment - Parsed (0ms): http://www....
2013-10-18 14:57:28,301 WARN  mapred.LocalJobRunner - job_local613646134_0001
java.lang.Exception: java.lang.OutOfMemoryError: unable to create new native thread
     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
Caused by: java.lang.OutOfMemoryError: unable to create new native thread
     at java.lang.Thread.start0(Native Method)
     at java.lang.Thread.start(Thread.java:640)
     at java.util.concurrent.ThreadPoolExecutor.addIfUnderMaximumPoolSize(ThreadPoolExecutor.java:727)
     at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:657)
     at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:92)
     at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:159)
     at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:93)
     at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:97)
     at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:44)
     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
     at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
     at java.util.concurrent.FutureTask.run(FutureTask.java:138)
     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
     at java.lang.Thread.run(Thread.java:662)
-----------------

Any help (especially information) is appreciated.

Sybille



Re: Nutch 1.7 / Parser / java.lang.OutOfMemoryError: unable to create new native thread

Posted by feng lu <am...@gmail.com>.
Hi Sybille

this issue may be caused by the executor service using a cached thread
pool: when the parse method is called many times, a new thread can be
created for each call, so a lot of threads pile up.

Check the code:

executorService = Executors.newCachedThreadPool(new ThreadFactoryBuilder()
      .setNameFormat("parse-%d").setDaemon(true).build());

One solution is to use a fixed thread pool:

int threadPoolSize = 10;

executorService = Executors.newFixedThreadPool(threadPoolSize, new ThreadFactoryBuilder()
      .setNameFormat("parse-%d").setDaemon(true).build());
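
For reference, a self-contained version of that suggestion might look like
this (a sketch, not an actual Nutch patch: the class name, the pool size of
10, and the static factory are illustrative choices; ThreadFactoryBuilder
is Guava's, which Nutch already bundles):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import com.google.common.util.concurrent.ThreadFactoryBuilder;

public class FixedParsePool {
    // Illustrative upper bound on concurrent parse threads; could be made
    // configurable, e.g. via a nutch-site.xml property.
    private static final int THREAD_POOL_SIZE = 10;

    public static ExecutorService newParsePool() {
        // A fixed pool never grows past THREAD_POOL_SIZE threads, unlike
        // newCachedThreadPool(), which creates a new thread whenever all
        // existing ones are busy (e.g. hung in a timed-out parse).
        return Executors.newFixedThreadPool(THREAD_POOL_SIZE,
                new ThreadFactoryBuilder()
                        .setNameFormat("parse-%d")
                        .setDaemon(true)
                        .build());
    }
}

Note that a fixed pool bounds the thread count but queues further parse
tasks behind any hung workers, so it would trade the OutOfMemoryError for
possible stalls.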

thanks.




On Fri, Oct 18, 2013 at 9:32 PM, Sybille Peters <peters@rrzn.uni-hannover.de> wrote:

> [...]


-- 
Don't Grow Old, Grow Up... :-)

Re: Nutch 1.7 / Parser / java.lang.OutOfMemoryError: unable to create new native thread

Posted by Julien Nioche <li...@gmail.com>.
Hi Sybille

The segment did not make it to the JIRA issue; could you zip it and make it
public somewhere else (Dropbox, Google Docs)?

Thanks

Julien


On 22 October 2013 16:26, Julien Nioche <li...@gmail.com> wrote:

> [...]


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: Nutch 1.7 / Parser / java.lang.OutOfMemoryError: unable to create new native thread

Posted by Julien Nioche <li...@gmail.com>.
Great! Could you share the segment that caused it, though? I had not been
able to reproduce the problem and would like to get to the bottom of it.
Thanks

On Tuesday, 22 October 2013, Sybille Peters <pe...@rrzn.uni-hannover.de> wrote:

> [...]

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: Nutch 1.7 / Parser / java.lang.OutOfMemoryError: unable to create new native thread

Posted by Sybille Peters <pe...@rrzn.uni-hannover.de>.
Hi,

I reproduced the parser issue with the current Nutch 1.7 branch.

Applying the patch suggested by Julien fixed the problem 
(https://issues.apache.org/jira/browse/NUTCH-1640).


Thanks!

Sybille



Re: Nutch 1.7 / Parser / java.lang.OutOfMemoryError: unable to create new native thread

Posted by Julien Nioche <li...@gmail.com>.
Hi Sybille,

> thanks for the hints. I have a reproducible testcase that fails every time.
> Applying the ParseSegment patch did not help, unfortunately. The
> parser.timeout is set to the default of 30 seconds. I reduced this value,
> but it does not really help.
>
> The threads are created very fast (parsing output shows a parse time of 0ms
> for most). The thread count of over 5000 is reached in about 50 seconds. It
> seems the threads are not closed down at all.
>
> I already commented out most custom and extra plugins.
>
> nutch-site.xml:
> <name>plugin.includes</name>
> <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr||query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>
> Even if there is some bug in a parse filter (infinite loop), shouldn't the
> parsing stop instead of creating threads like crazy?

The purpose of using threads was actually to prevent the parsing of an
entire segment from failing (and possibly taking a long time to reparse
after skipping the culprit). The catch is that a thread which reaches its
timeout is not reclaimed (threads can't be forcibly stopped) but is left
hanging. That's fine in most cases, as there would be just a few timeouts
even on a large segment. If you are getting many threads created, it
probably means that something is wrong with your parsing and that all your
documents are triggering a timeout.
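
To make that failure mode concrete, here is a minimal, self-contained
sketch (not Nutch's actual code; Java 8 syntax for brevity; the 1-second
timeout and the deliberately uninterruptible "parse" are illustrative
stand-ins). The caller times out and moves on, but the worker thread keeps
spinning, so a cached pool has to create a fresh thread for every
subsequent document:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class HungParseSketch {
    public static void main(String[] args) throws Exception {
        // Daemon threads, as Nutch configures them, so the JVM can still exit.
        ExecutorService pool = Executors.newCachedThreadPool(r -> {
            Thread t = new Thread(r);
            t.setDaemon(true);
            return t;
        });

        for (int doc = 0; doc < 3; doc++) {
            // Stand-in for a parse stuck in a loop that never checks interruption.
            Runnable stuckParse = () -> { while (true) { } };
            Future<?> parse = pool.submit(stuckParse);
            try {
                parse.get(1, TimeUnit.SECONDS); // wait, as with parser.timeout
            } catch (TimeoutException e) {
                parse.cancel(true); // only interrupts; the busy loop ignores it
                System.out.println("doc " + doc + " timed out; its thread is left hanging");
            }
        }
        // One extra live thread remains per timed-out parse; scale this up to
        // thousands of documents and native thread creation eventually fails.
        System.out.println("live threads: " + Thread.activeCount());
    }
}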

One of the reasons why we marked the old crawl command as deprecated is
that the crawl cycles were running in the same JVM, so parse failures
could accumulate over the lifetime of the crawl. This is not the case
when using the crawl script.


>
> I cannot completely rule out some misconfiguration or error on my end.
> Might be interesting to try to reproduce this with a fresh, unmodified
> version of nutch 1.7.
>

Judging by your nutch-site.xml, you must be using a very old version of
Nutch. Could you try to parse the segment which is giving you trouble with
the current trunk? If it happens there, please open a JIRA and attach a
zip of the segment so that we can reproduce the issue.

Thanks

Julien




-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: Nutch 1.7 / Parser / java.lang.OutOfMemoryError: unable to create new native thread

Posted by Sybille Peters <pe...@rrzn.uni-hannover.de>.
Hi Julien,

thanks for the hints. I have a reproducible testcase that fails every 
time. Applying the ParseSegment patch did not help, unfortunately. The 
parser.timeout is set to the default of 30 seconds. I reduced this 
value, but it does not really help. The threads are created very fast 
(parsing output shows a parse time of 0ms for most). The thread count of 
over 5000 is reached in about 50 seconds. It seems the threads are not 
closed down at all.

I already commented out most custom and extra plugins.

nutch-site.xml:
  <name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr||query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>

Even if there is some bug in a parse filter (infinite loop), shouldn't 
the parsing stop instead of creating threads like crazy?

I cannot completely rule out some misconfiguration or error on my end. 
Might be interesting to try to reproduce this with a fresh, unmodified 
version of nutch 1.7.

Sybille



On 18.10.2013 15:50, Julien Nioche wrote:
> [...]


-- 
Diplom-Informatikerin (FH) Sybille Peters
Leibniz Universität IT Services (ehemals RRZN)
Schloßwender Straße 5, 30159 Hannover
Tel.: +49 511 762 793280
Email: peters@rrzn.uni-hannover.de
http://www.rrzn.uni-hannover.de

TYPO3@RRZN
TYPO3-Team Leibniz Universität IT Services (ehemals RRZN)
Email: typo3@rrzn.uni-hannover.de
http://www.t3luh.rrzn.uni-hannover.de
   


Re: Nutch 1.7 / Parser / java.lang.OutOfMemoryError: unable to create new native thread

Posted by Julien Nioche <li...@gmail.com>.
Hi Sybille

The threads spawned by the parser should be reclaimed once a page has been
parsed. The parsing itself is not multi-threaded, so this would mean that
something is preventing the threads from being reclaimed, or maybe, as the
error suggests, you really are running out of memory.

Do you specify parser.timeout in nutch-site.xml? Are you using any custom
HTMLParsingFilter?
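
For reference, the property can be set explicitly in conf/nutch-site.xml.
A sketch, using the 30-second default value mentioned later in this thread:

<property>
  <name>parser.timeout</name>
  <value>30</value>
  <description>Maximum time, in seconds, allowed for parsing a single
  document; set to -1 to disable the timeout.</description>
</property>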

The number of docs should not affect the memory: the parser runs on one
document after the other, so that would indicate a leak. There was a
related issue not very long ago:
https://issues.apache.org/jira/browse/NUTCH-1640. Can you patch your code
accordingly or use the trunk? I never got to the bottom of it, but I am
wondering whether this would fix the issue.

Thanks

Julien


On 18 October 2013 14:32, Sybille Peters <pe...@rrzn.uni-hannover.de> wrote:

> [...]


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble