Posted to user@nutch.apache.org by kiran chitturi <ch...@gmail.com> on 2013/03/02 20:12:08 UTC

Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread

Hi!

I am running Nutch 1.6 on a Mac OS desktop with a 2.8 GHz Core i5 and 4 GB of RAM.

Last night I started a crawl in local mode for 5 seeds with the config
given below. If the crawl goes well, it should fetch a total of 400
documents. The crawling is done on a single host that we own.

Config
---------------------

fetcher.threads.per.queue - 2
fetcher.server.delay - 1
fetcher.throughput.threshold.pages - -1

crawl script settings
----------------------------
timeLimitFetch- 30
numThreads - 5
topN - 10000
mapred.child.java.opts=-Xmx1000m
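
(For reference: the fetcher.* properties above are plain entries in
conf/nutch-site.xml. The snippet below just illustrates the form, with my
values filled in.)

    <property>
      <name>fetcher.threads.per.queue</name>
      <value>2</value>
    </property>
    <property>
      <name>fetcher.server.delay</name>
      <value>1</value>
    </property>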


I noticed today that the crawl has stopped due to an error, and I found
the error below in the logs.

> 2013-03-01 21:45:03,767 INFO  parse.ParseSegment - Parsed (0ms):
> http://scholar.lib.vt.edu/ejournals/JARS/v33n3/v33n3-letcher.htm
> 2013-03-01 21:45:03,790 WARN  mapred.LocalJobRunner - job_local_0001
> java.lang.OutOfMemoryError: unable to create new native thread
>         at java.lang.Thread.start0(Native Method)
>         at java.lang.Thread.start(Thread.java:658)
>         at
> java.util.concurrent.ThreadPoolExecutor.addThread(ThreadPoolExecutor.java:681)
>         at
> java.util.concurrent.ThreadPoolExecutor.addIfUnderMaximumPoolSize(ThreadPoolExecutor.java:727)
>         at
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:655)
>         at
> java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:92)
>         at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:159)
>         at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:93)
>         at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:97)
>         at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:44)
>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>         at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
>         at
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)



Did anyone run into the same issue? I am not sure why the new native
thread cannot be created. The link [0] says that it might be due to the
limit on the number of processes in my OS. Will increasing it solve the
issue?


[0] - http://ww2.cs.fsu.edu/~czhang/errors.html
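
(For anyone hitting this later: the per-user process and thread limits on
Mac OS can be checked with something like the commands below. I believe
these keys exist on OS X, but the exact names and defaults may vary by
version.)

    ulimit -u                                # max user processes (current shell)
    ulimit -a                                # all limits, incl. stack size
    sysctl kern.maxproc kern.maxprocperuid   # system-wide process limits
    launchctl limit maxproc                  # limits as seen by launchd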

Thanks!

-- 
Kiran Chitturi

Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread

Posted by neeraj <ne...@yahoo.com>.
Kiran,

  Were you able to resolve this issue? I am getting the same error when
fetching a huge number of URLs.

-Neeraj.




Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread

Posted by kiran chitturi <ch...@gmail.com>.
Tejas,

I have a total of 364k files fetched in my last crawl; I used a topN of
2000 and 2 threads per queue. The gap I have noticed is between 5 and 8
minutes. I had a total of 180 rounds in my crawl (I had some big crawls at
the beginning with a topN of 10k, but after it crashed I changed topN to 2k).


Due to my hardware limitations and local mode, I think using smaller
rounds saved me quite some time. The downside might be having a lot more
segments to go through, but I am writing scripts to automate the index
and reparse tasks.
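
(A rough sketch of what those scripts look like, from memory of the 1.x
command line; the crawl directory and Solr URL below are placeholders for
my setup.)

    #!/bin/bash
    CRAWL=crawl                       # placeholder crawl directory
    SOLR=http://localhost:8983/solr   # placeholder Solr URL

    for seg in "$CRAWL"/segments/*; do
      # redo the parse for a segment, then push it to Solr
      bin/nutch parse "$seg"
      bin/nutch solrindex "$SOLR" "$CRAWL"/crawldb -linkdb "$CRAWL"/linkdb "$seg"
    done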







-- 
Kiran Chitturi

Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread

Posted by Tejas Patil <te...@gmail.com>.
Hi Kiran,

Is the 6 min gap consistent across those 5 rounds? With 10k files it takes
~60 minutes for writing segments.
With 2k files, there was a 6 min gap. You will need 5 such small rounds to
get a total of 10k, so the total gap time would be (5 * 6) = 30 mins. That's
half of the time taken for the crawl with 10k!! So in a way, you saved 30
mins by running small crawls. Something doesn't seem right with the math here.

Thanks,
Tejas Patil


Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread

Posted by kiran chitturi <ch...@gmail.com>.
Thanks Sebastian for the details. This was the bottleneck I had when
fetching 10k files. Now I have switched to 2k and the gap is about 6 mins.
It took me some time to find the right configuration in local mode.





-- 
Kiran Chitturi

Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread

Posted by Sebastian Nagel <wa...@googlemail.com>.
After all documents are fetched (and, if enabled, parsed) the segment has to
be written: the data is sorted and then copied from the local temp dir
(hadoop.tmp.dir) to the segment directory. If IO is a bottleneck this may
take a while. It also looks like you have a lot of content!
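
(If the temp dir is the bottleneck, hadoop.tmp.dir can be pointed at a
faster disk in conf/nutch-site.xml; the path below is only an example.)

    <property>
      <name>hadoop.tmp.dir</name>
      <value>/fast-disk/hadoop-tmp</value>
    </property>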



Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread

Posted by kiran chitturi <ch...@gmail.com>.
Thanks for your suggestions, guys! The big crawl is fetching a large number
of big PDF files.

For something like the log below, the fetcher took a lot of time to finish
up even though the files were already fetched. It shows more than an hour
of extra time.

>
> 2013-03-01 19:45:43,217 INFO  fetcher.Fetcher - -activeThreads=0,
> spinWaiting=0, fetchQueues.totalSize=0
> 2013-03-01 19:45:43,217 INFO  fetcher.Fetcher - -activeThreads=0
> 2013-03-01 20:57:55,288 INFO  fetcher.Fetcher - Fetcher: finished at
> 2013-03-01 20:57:55, elapsed: 01:34:09


Does fetching a lot of files cause this issue? Should I stick to one thread
in local mode, or use pseudo-distributed mode to improve performance?

How long should the fetcher take to finish up after the files are fetched?
What exactly happens in this step?

Thanks again!
Kiran.






-- 
Kiran Chitturi

RE: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread

Posted by Markus Jelsma <ma...@openindex.io>.
The default heap size of 1G is just enough for a parsing fetcher with 10
threads. The only problem that may arise is too large and complicated PDF
files, or very large HTML files. If you generate fetch lists of a
reasonable size there won't be a problem most of the time. And if you want
to crawl a lot, then just generate more small segments.

If there is a bug, it's most likely the parser eating memory and not
releasing it.
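
(For the archives: if I recall the 1.x scripts correctly, the heap is
raised either through mapred.child.java.opts or through the NUTCH_HEAPSIZE
variable that bin/nutch reads in local mode; the sizes below are only
examples.)

    # in the crawl script / Hadoop job config (example value)
    mapred.child.java.opts=-Xmx2000m

    # for bin/nutch in local mode (size in MB, example value)
    export NUTCH_HEAPSIZE=2000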
 

Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread

Posted by Tejas Patil <te...@gmail.com>.
I agree with Sebastian. It was a crawl in local mode and not over a
cluster. The intended crawl volume is huge, and if we don't override the
default heap size to some decent value, there is a high possibility of
facing an OOM.



Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread

Posted by kiran chitturi <ch...@gmail.com>.
> If you find the time you should trace the process.
> Seems to be either a misconfiguration or even a bug.

I will try to track this down soon with the previous configuration. Right
now, I am just trying to get data crawled by Monday.

Kiran.




-- 
Kiran Chitturi

Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread

Posted by Sebastian Nagel <wa...@googlemail.com>.
> using low value for topN(2000) than 10000
That would mean: you need 200 rounds and also 200 segments for 400k documents.
That's a work-around, not a solution!

If you find the time you should trace the process.
Seems to be either a misconfiguration or even a bug.
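
(A sketch of what that tracing could look like while the parse job runs;
jps and jstack ship with the JDK, and on Mac OS "ps -M" should list
threads, though the exact flags may differ by version.)

    ulimit -a              # current limits (stack size, max processes)
    jps -l                 # find the pid of the Nutch / LocalJobRunner JVM
    jstack <pid>           # dump all Java thread stacks
    ps -M <pid> | wc -l    # rough count of native threads in the process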

Sebastian



Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread

Posted by kiran chitturi <ch...@gmail.com>.
Thanks, Sebastian, for the suggestions. I got around this by using a lower
value for topN (2000 instead of 10000). I decided to use a lower topN with
more rounds.
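
In case it helps anyone else, one round of my crawl now looks roughly like
this (a sketch with example paths; crawl/crawldb and crawl/segments are
whatever your crawl script uses):

  bin/nutch generate crawl/crawldb crawl/segments -topN 2000
  SEGMENT=`ls -d crawl/segments/2* | tail -1`    # the segment just generated
  bin/nutch fetch $SEGMENT -threads 5
  bin/nutch parse $SEGMENT
  bin/nutch updatedb crawl/crawldb $SEGMENT

repeated for as many rounds as needed (about 200 for 400k documents at a
topN of 2000).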


On Sun, Mar 3, 2013 at 3:41 PM, Sebastian Nagel
<wa...@googlemail.com>wrote:

> Hi Kiran,
>
> [...]


-- 
Kiran Chitturi

Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Kiran,

there are many possible reasons for the problem. Besides the limits on the
number of processes, check the stack size in the Java VM and of the system
(see java -Xss and ulimit -s).
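
For example (a sketch; the numbers are only illustrations, not recommendations):

  ulimit -u          # max processes/threads for your user
  ulimit -s          # stack size per thread, in KB
  ulimit -u 2048     # raise the process limit for the current shell

A smaller per-thread Java stack also lets more threads fit into the same
address space, e.g. in the crawl script:

  mapred.child.java.opts=-Xmx1000m -Xss512k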

I think in local mode there should be only one mapper and consequently only
one thread spent for parsing. So the number of processes/threads is hardly
the problem, provided that you don't run any other number-crunching tasks in
parallel on your desktop.

Luckily, you should be able to retry via "bin/nutch parse ..."
Then trace the system and the Java process to catch the reason.
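
For example (the segment path is only an example; use the segment that
failed in your crawl directory):

  SEGMENT=`ls -d crawl/segments/2* | tail -1`
  bin/nutch parse $SEGMENT

and watch the JVM with jstack or ps -M while it runs.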

Sebastian

On 03/02/2013 08:13 PM, kiran chitturi wrote:
> Sorry, I am looking to crawl 400k documents with this crawl, not the 400 I
> said in my last message.
> 
> 
> On Sat, Mar 2, 2013 at 2:12 PM, kiran chitturi <ch...@gmail.com>wrote:
> 
>> [...]


Re: Nutch 1.6 : java.lang.OutOfMemoryError: unable to create new native thread

Posted by kiran chitturi <ch...@gmail.com>.
Sorry, I am looking to crawl 400k documents with this crawl, not the 400 I
said in my last message.


On Sat, Mar 2, 2013 at 2:12 PM, kiran chitturi <ch...@gmail.com>wrote:

> [...]



-- 
Kiran Chitturi