Posted to user@nutch.apache.org by Kristopher Kane <kk...@gmail.com> on 2013/03/10 05:22:22 UTC

Session failed during parsing: IOException because of OOM

I had a long-running session going and would like to pick up where
it left off if possible.  In the terminal, Nutch was at a parsing stage
when it hit an OOM.  Is there any way to restart near where it left off?

-Kris

Re: Session failed during parsing: IOException because of OOM

Posted by kiran chitturi <ch...@gmail.com>.
Great! I just thought I would point it out in case you missed it. :)


-- 
Kiran Chitturi

Re: Session failed during parsing: IOException because of OOM

Posted by Kristopher Kane <kk...@gmail.com>.
Right, I'm running the script this time around based on your first reply.

-Kris

Re: Session failed during parsing: IOException because of OOM

Posted by kiran chitturi <ch...@gmail.com>.
Hi Kris,

It has been discussed several times on this list that the crawl command is
deprecated and that the crawl script in the bin directory (./bin/crawl)
should be used instead. [0]

Unlike the crawl command, the crawl script runs the crawl step by step, so
it is the recommended way to crawl.

[0] - https://issues.apache.org/jira/browse/NUTCH-1087
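
As a rough example (the argument order here follows the script's usage
message and may differ between versions, so run ./bin/crawl without
arguments to check), a crawl like yours would look something like:

    # seed dir, crawl dir, Solr URL, number of rounds
    ./bin/crawl urls crawl http://localhost:8983/solr/ 5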


-- 
Kiran Chitturi

Re: Session failed during parsing: IOException because of OOM

Posted by Kristopher Kane <kk...@gmail.com>.
Thanks for the reply.  I'm using Nutch 1.6 on CentOS 6.3 with Oracle Java 6
and the built-in (local mode) Hadoop capability.  I haven't learned how to
run it on my 'real' Hadoop cluster yet...

Invocation:  bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth
5 -topN 5000

Hadoop trace:

2013-03-09 23:07:07,662 WARN  mapred.LocalJobRunner - job_local_0016
java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor.addThread(Unknown Source)
        at
java.util.concurrent.ThreadPoolExecutor.addIfUnderMaximumPoolSize(Unknown
Source)
        at java.util.concurrent.ThreadPoolExecutor.execute(Unknown Source)
        at java.util.concurrent.AbstractExecutorService.submit(Unknown
Source)
        at org.apache.nutch.parse.ParseUtil.runParser(ParseUtil.java:159)
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:93)
        at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:97)
        at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:44)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:436)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:372)
        at
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)

I was running it in a small VM with 2 GB of memory. After I posted, I ran
the crawler again with 6 GB of memory.
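
In case it helps anyone who finds this later: if I read bin/nutch right,
the local runner's heap can be raised through the NUTCH_HEAPSIZE
environment variable (in MB), and extra JVM flags go in NUTCH_OPTS. Since
"unable to create new native thread" is usually an OS thread limit rather
than a heap limit, a smaller per-thread stack or a higher process limit may
matter more than heap; the numbers below are just illustrative:

    # raise the JVM heap used by bin/nutch (value in MB)
    export NUTCH_HEAPSIZE=4096

    # smaller per-thread stacks leave room for more native threads
    export NUTCH_OPTS="-Xss256k"

    # raise the per-user process/thread limit for this shell (illustrative)
    ulimit -u 4096

    bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 5 -topN 5000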

I'll try what you suggested and bypass the inject.

Thanks,

-Kris

Re: Session failed during parsing: IOException because of OOM

Posted by kiran chitturi <ch...@gmail.com>.
Hi Kris,

Which version are you using?

At which step did the exception happen? Was it after the fetch stage or
the parse stage?

Are you using the crawl script (./bin/crawl) or the crawl command
(./bin/nutch crawl) to do the crawl?

You can resume with the crawl script (./bin/crawl) by removing the inject
step, since you would not need to inject the seeds again (see the sketch
below for a manual alternative).
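
Since the crash happened during parsing, you could also resume manually.
A rough sketch only; the segment path below is illustrative, so substitute
the segment your crawl actually created, and delete any partial parse
output (crawl_parse, parse_data, parse_text) left in it by the failed run:

    # re-run the parse that failed, against the already-fetched segment
    bin/nutch parse crawl/segments/20130309230707

    # fold the parse results back into the crawldb
    bin/nutch updatedb crawl/crawldb crawl/segments/20130309230707

    # then continue with the normal generate/fetch/parse/updatedb rounds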

Please let us know if you have any more questions.

-- 
Kiran Chitturi