Posted to user@nutch.apache.org by Sybille Peters <pe...@rrzn.uni-hannover.de> on 2013/10/18 15:32:20 UTC
Nutch 1.7 / Parser / java.lang.OutOfMemoryError: unable to create
new native thread
Hello,
using the default crawl script (runtime/local/bin/crawl) the parser will
crash trying to create a new thread after parsing slightly more than
5000 documents.
This only happens if the number of documents to crawl (generate -topN)
is set to > 5000.
Monitoring the number of threads created by the nutch java process: it
increases to about 5700 before the crash occurs.
I thought that the parser would not create that many threads in the
first place. Is this a bug or a misconfiguration? Is there any way to
explicitly limit the number of threads used for parsing?
I found this thread, in which it is recommended to decrease the number
of URLs (topN):
http://lucene.472066.n3.nabble.com/Nutch-1-6-java-lang-OutOfMemoryError-unable-to-create-new-native-thread-td4044231.html
Is this the only possible solution? Older nutch versions did not have
this problem.
Parameters:
---------------
numSlaves=1
numTasks=`expr $numSlaves \* 2`
commonOptions="-D mapred.reduce.tasks=$numTasks -D
mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true"
skipRecordsOptions="-D mapred.skip.attempts.to.start.skipping=2 -D
mapred.skip.map.max.skip.records=1"
$bin/nutch parse $commonOptions $skipRecordsOptions
$CRAWL_PATH/segments/$SEGMENT
hadoop.log
----------------
2013-10-18 14:57:28,294 INFO parse.ParseSegment - Parsed
(0ms):http://www....
2013-10-18 14:57:28,301 WARN mapred.LocalJobRunner -
job_local613646134_0001
java.lang.Exception: java.lang.OutOfMemoryError: unable to create new
native thread
at
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
Caused by: java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:640)
at
java.util.concurrent.ThreadPoolExecutor.addIfUnderMaximumPoolSize(ThreadPoolExecutor.java:727)
at
java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:657)
at
java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:92)
at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:159)
at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:93)
at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:97)
at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:44)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
at
org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
-----------------
Any help (especially information) is appreciated.
Sybille
Re: Nutch 1.7 / Parser / java.lang.OutOfMemoryError: unable to create
new native thread
Posted by feng lu <am...@gmail.com>.
Hi Sybille
This issue may be caused by the executor service using a cached thread
pool: when parse is called many times in a short period, a cached pool
keeps creating new threads.
Check the code:
executorService = Executors.newCachedThreadPool(new ThreadFactoryBuilder()
    .setNameFormat("parse-%d").setDaemon(true).build());
One solution is to use a fixed thread pool instead:
int threadPoolSize = 10;
executorService = Executors.newFixedThreadPool(threadPoolSize, new ThreadFactoryBuilder()
    .setNameFormat("parse-%d").setDaemon(true).build());
thanks.
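To make the difference concrete, here is a small standalone sketch (plain java.util.concurrent, not Nutch code) showing that a fixed pool caps the number of worker threads no matter how many tasks are submitted:

```java
import java.util.concurrent.*;

public class PoolSizeDemo {
    // Submit 100 overlapping tasks to a fixed pool of 10 and report the
    // largest number of worker threads the pool ever created.
    static int largestPoolSize() throws InterruptedException {
        int threadPoolSize = 10; // the cap, as in the suggested fix
        ThreadPoolExecutor pool =
            (ThreadPoolExecutor) Executors.newFixedThreadPool(threadPoolSize);
        CountDownLatch done = new CountDownLatch(100);
        for (int i = 0; i < 100; i++) {
            pool.submit(() -> {
                try { Thread.sleep(20); } catch (InterruptedException ignored) { }
                done.countDown();
            });
        }
        done.await();
        int largest = pool.getLargestPoolSize();
        pool.shutdown();
        return largest; // never exceeds threadPoolSize
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("largest pool size: " + largestPoolSize());
    }
}
```

With Executors.newCachedThreadPool() the same 100 overlapping tasks could create up to 100 threads, since a cached pool spawns a new thread whenever no idle one is available.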
--
Don't Grow Old, Grow Up... :-)
Re: Nutch 1.7 / Parser / java.lang.OutOfMemoryError: unable to create
new native thread
Posted by Julien Nioche <li...@gmail.com>.
Hi Sybille
The segment did not make it to the JIRA issue. Could you zip it and make
it available somewhere else (Dropbox, Google Docs)?
Thanks
Julien
--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
Re: Nutch 1.7 / Parser / java.lang.OutOfMemoryError: unable to create
new native thread
Posted by Julien Nioche <li...@gmail.com>.
Great! Could you share the segment that caused it, though? I had not been
able to reproduce the problem and would like to get to the bottom of it.
thanks
--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
Re: Nutch 1.7 / Parser / java.lang.OutOfMemoryError: unable to create
new native thread
Posted by Sybille Peters <pe...@rrzn.uni-hannover.de>.
Hi,
I reproduced the parser issue with the current nutch 1.7 branch.
Applying the patch suggested by Julien fixed the problem
(https://issues.apache.org/jira/browse/NUTCH-1640).
Thanks!
Sybille
Re: Nutch 1.7 / Parser / java.lang.OutOfMemoryError: unable to create
new native thread
Posted by Julien Nioche <li...@gmail.com>.
Hi Sybille,
> thanks for the hints. I have a reproducible testcase that fails every time.
> Applying the ParserSegment patch did not help, unfortunately. The
> parser.timeout is set to the default of 30 seconds. I reduced this value,
> but it does not really help.
> The threads are created very fast (parsing output shows a parse time of 0ms
> for most). The thread count of over 5000 is reached in about 50 seconds. It
> seems the threads are not closed down at all.
> I already commented out most custom and extra plugins.
>
> nutch-site.xml:
> <name>plugin.includes</name>
> <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr||query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>
> Even if there is some bug in a parse filter (infinite loop), shouldn't the
> parsing stop instead of creating threads like crazy?
>
The purpose of using threads was actually to prevent the parsing of an
entire segment from failing (and possibly taking a long time to reparse
by skipping the culprit). What happens is that a thread is not actually
reclaimed when it reaches a timeout (threads can't be stopped) but is
left hanging. That's fine in most cases, as there would be just a few
timeouts even on a large segment.
If many threads are being created, then it probably means that there is
something wrong with your parsing and that all your documents are
triggering a timeout.
One of the reasons why we marked the old crawl command as deprecated is
that the crawl cycles were running in the same JVM, so parse failures
could accumulate over the lifetime of the crawl. This is not the case
when using the crawl script.
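The "left hanging" behaviour can be sketched with plain java.util.concurrent (this is illustrative, not Nutch's actual code): the caller times out waiting on the Future, but the worker thread keeps running and the pool still counts it as busy.

```java
import java.util.concurrent.*;

public class HangingThreadDemo {
    // Simulate one document whose parse never finishes (e.g. an infinite
    // loop in a parse filter) and show that after the caller's timeout the
    // worker thread is still occupied.
    static int activeAfterTimeout() throws Exception {
        // Daemon worker threads, mirroring Nutch's setDaemon(true),
        // so a hung worker cannot keep the JVM alive.
        ThreadPoolExecutor pool =
            (ThreadPoolExecutor) Executors.newCachedThreadPool(r -> {
                Thread t = new Thread(r, "parse-demo");
                t.setDaemon(true);
                return t;
            });
        // A "parse" that never finishes and ignores interruption.
        Runnable stuckParse = () -> { while (true) { } };
        Future<?> result = pool.submit(stuckParse);
        try {
            result.get(100, TimeUnit.MILLISECONDS); // the caller's timeout
        } catch (TimeoutException e) {
            // The caller gives up on this document here...
        }
        // ...but the worker thread is still spinning and will never be
        // reused, so the next submit() must create yet another thread.
        return pool.getActiveCount();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("active workers after timeout: " + activeAfterTimeout());
    }
}
```

Each timed-out document leaves one more such thread behind; with a cached pool there is no upper bound, which matches the observed growth to ~5700 threads.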
>
> I cannot completely rule out some misconfiguration or error on my end.
> Might be interesting to try to reproduce this with a fresh, unmodified
> version of nutch 1.7.
>
Judging by your nutch-site.xml, you must be using a very old version of
Nutch. Could you try to parse the segment which is giving you trouble
with the current trunk? If it happens there, please open a JIRA and
attach a zip of the segment so that we can reproduce the issue.
Thanks
Julien
>
> Sybille
>
>
>
>
--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
Re: Nutch 1.7 / Parser / java.lang.OutOfMemoryError: unable to create
new native thread
Posted by Sybille Peters <pe...@rrzn.uni-hannover.de>.
Hi Julien,
thanks for the hints. I have a reproducible testcase that fails every
time. Applying the ParserSegment patch did not help, unfortunately. The
parser.timeout is set to the default of 30 seconds. I reduced this
value, but it does not really help. The threads are created very fast
(parsing output shows a parse time of 0ms for most). The thread count of
over 5000 is reached in about 50 seconds. It seems the threads are not
closed down at all.
I already commented out most custom and extra plugins.
nutch-site.xml:
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr||query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
Even if there is some bug in a parse filter (infinite loop), shouldn't
the parsing stop instead of creating threads like crazy?
I cannot completely rule out some misconfiguration or error on my end.
Might be interesting to try to reproduce this with a fresh, unmodified
version of nutch 1.7.
Sybille
--
Diplom-Informatikerin (FH) Sybille Peters
Leibniz Universität IT Services (ehemals RRZN)
Schloßwender Straße 5, 30159 Hannover
Tel.: +49 511 762 793280
Email: peters@rrzn.uni-hannover.de
http://www.rrzn.uni-hannover.de
TYPO3@RRZN
TYPO3-Team Leibniz Universität IT Services (ehemals RRZN)
Email: typo3@rrzn.uni-hannover.de
http://www.t3luh.rrzn.uni-hannover.de
Re: Nutch 1.7 / Parser / java.lang.OutOfMemoryError: unable to create
new native thread
Posted by Julien Nioche <li...@gmail.com>.
Hi Sybille
The threads spawned by the parser should be reclaimed once a page has been
parsed. The parsing itself is not multi-threaded, so it would mean that
something is preventing the threads from being deleted, or maybe, as the
error suggests, you are running out of memory.
Do you specify parser.timeout in nutch-site.xml? Are you using any custom
HTMLParsingFilter?
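For anyone following along: parser.timeout can be overridden in nutch-site.xml like any other property. A sketch (the value is in seconds, with 30 the default in Nutch 1.x; to my understanding -1 disables the timeout):

```xml
<!-- nutch-site.xml: per-document parse timeout in seconds.
     30 is the default; -1 disables the timeout entirely. -->
<property>
  <name>parser.timeout</name>
  <value>30</value>
</property>
```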
The number of docs should not affect the memory. The parser runs on one
document after the other so that would indicate a leak. There was a related
issue not very long ago https://issues.apache.org/jira/browse/NUTCH-1640.
Can you patch your code accordingly or use the trunk? I never got to the
bottom of it but I am wondering whether this would fix the issue.
Thanks
Julien
--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble