Posted to user@nutch.apache.org by Sherjeel Niazi <sh...@softmatics.com> on 2009/04/23 17:02:42 UTC
How to resume crawler after crash
Hi,
I am using Nutch 0.9.
I am crawling a series of URLs from a website, but after some time the crawler
crashes with the following error:
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
    at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:97)
    at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:62)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:128)
How can I resume the crawl from where it stopped?
Sherjeel
Re: How to resume crawler after crash
Posted by Dennis Kubes <ku...@apache.org>.
You can't. Crawls are self-contained. You can restart the fetch for a
segment by removing all folders under the segments/xxxx/ directory except
crawl_generate and then re-executing the fetch job. But there isn't a
way to restart a crawl job from a mid-run checkpoint.
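
The cleanup step described above can be sketched in Java as follows. This is only an illustration under stated assumptions: the segment lives on the local filesystem (a crawl stored on HDFS would need the Hadoop FileSystem API instead), a Java 7 or later JDK is available for java.nio.file, and the class and method names (SegmentReset, resetSegment) are made up for this example and are not part of Nutch.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch: clears a segment so it can be re-fetched.
class SegmentReset {

    /** Deletes everything directly under segmentDir except crawl_generate; returns the removed names. */
    static List<String> resetSegment(Path segmentDir) throws IOException {
        // Collect the entries first so nothing is deleted while the directory is being listed.
        List<Path> entries = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(segmentDir)) {
            for (Path entry : stream) {
                entries.add(entry);
            }
        }
        List<String> removed = new ArrayList<>();
        for (Path entry : entries) {
            String name = entry.getFileName().toString();
            if (name.equals("crawl_generate")) {
                continue; // the generated fetch list must survive
            }
            deleteRecursively(entry);
            removed.add(name);
        }
        return removed;
    }

    /** Deletes a file or a whole directory tree, children before parents. */
    static void deleteRecursively(Path root) throws IOException {
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
                Files.delete(file);
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult postVisitDirectory(Path dir, IOException exc) throws IOException {
                if (exc != null) {
                    throw exc;
                }
                Files.delete(dir);
                return FileVisitResult.CONTINUE;
            }
        });
    }
}
```

After a call such as SegmentReset.resetSegment(Paths.get("crawl/segments/xxxx")), where the path is a placeholder for the real segment directory, the fetch job would be run again for that segment, typically with bin/nutch fetch followed by the segment path.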
Dennis
RE: Using nutchBean
Posted by "Lukas, Ray" <Ra...@idearc.com>.
Oh, it works great now. Thanks, guys, and thanks to Andrzej Bialecki. I will
look into how this can be submitted for everyone to have.
RE: Using nutchBean
Posted by "Lukas, Ray" <Ra...@idearc.com>.
I am looking into that now. But this thread lives past the running of
my code. If it was a normal thread, i.e. not a daemon, then it should
die when the parent (me) dies. Now, I don't use daemon threads much,
but my understanding is that they are configured this way by setting
a flag. They have to have a flag set to true for them to become a
daemon. I have to look this up, but that is what I understood. So,
wouldn't there be a flag somewhere in the configs that we could
set instead of me bashing around in the Nutch code? I am just asking.
I am thinking that I should hunt around for that, or maybe someone
already knows where that lives?
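
For reference, in plain Java the daemon flag is a property of the Thread object, set in code with Thread.setDaemon(true) before start(). A non-daemon thread keeps the JVM alive until it finishes, while a daemon thread does not prevent the JVM from exiting. The sketch below only illustrates that flag; the class name DaemonFlagDemo and the thread name are made up for this example, and it says nothing about whether Nutch exposes such a setting in its configuration.

```java
// Hypothetical demo class, not Nutch code: shows where the daemon flag is set.
class DaemonFlagDemo {

    /** Builds (but does not start) a background thread that sleeps in an endless loop. */
    static Thread newBackgroundThread(boolean daemon) {
        Thread t = new Thread(() -> {
            while (true) {
                try {
                    Thread.sleep(1000L);
                } catch (InterruptedException e) {
                    return; // leave the loop when interrupted
                }
            }
        }, "background-updater");
        t.setDaemon(daemon); // must be called before start()
        return t;
    }

    public static void main(String[] args) {
        Thread t = newBackgroundThread(true);
        t.start();
        System.out.println("daemon = " + t.isDaemon());
        // main returns here; with daemon = true the JVM can exit even though the loop never ends.
        // With daemon = false the JVM would keep running because of this thread.
    }
}
```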
Re: Using nutchBean
Posted by Andrzej Bialecki <ab...@getopt.org>.
Lukas, Ray wrote:
> I started going through the source code debugging this problem.. The
> extra thread comes from FetchedSegments(Configuration conf, Path
> segmentsDir) constructor. This constructor creates a segmentUpdater and
> then starts it.. This is the thread that I am talking about..
>
> Does anyone know how to cleanly shut down this segment updater?
Looking at the code, it's not possible right now. You need to modify
FetchedSegments.close() so that it also shuts down this thread.
Since the thread already loops in a while (true) loop, it's enough to
replace the constant with a boolean flag and set this flag to false in
the close() method.
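
A minimal sketch of that flag-based shutdown, written as a standalone class rather than as a patch to Nutch. The class name StoppableUpdater, the field name running, and the sleep interval are assumptions made for illustration; the real SegmentUpdater body is not reproduced here. The interrupt() call in close() is an addition beyond the suggestion above, so that a sleeping thread notices the flag immediately.

```java
// Hypothetical stand-in for the segment updater thread, not the actual Nutch class.
class StoppableUpdater extends Thread {

    // volatile so that a write from the closing thread is visible to the loop
    private volatile boolean running = true;
    private final long intervalMillis;

    StoppableUpdater(long intervalMillis) {
        super("segment-updater-sketch");
        this.intervalMillis = intervalMillis;
    }

    @Override
    public void run() {
        while (running) { // was: while (true)
            // the periodic segment refresh would go here
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                // woken up by close(); the loop condition is checked again
            }
        }
    }

    /** Asks the loop to stop and wakes the thread if it is sleeping. */
    void close() {
        running = false;
        interrupt();
    }
}
```

The owning object's close() method would then call close() on the updater, optionally followed by join() if it needs to wait for the thread to finish.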
If you come up with a more or less clean patch please submit this to
JIRA as an improvement.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
RE: Using nutchBean
Posted by "Lukas, Ray" <Ra...@idearc.com>.
I started going through the source code to debug this problem. The
extra thread comes from the FetchedSegments(Configuration conf, Path
segmentsDir) constructor. This constructor creates a segmentUpdater and
then starts it. This is the thread that I am talking about.
Does anyone know how to cleanly shut down this segment updater?
From the FetchedSegments constructor:

    this.segUpdater = new SegmentUpdater();
    if (segmentDirs != null) {
      for (final Path segmentDir : segmentDirs) {
        segments.put(segmentDir.getName(),
            new Segment(this.fs, segmentDir, this.conf));
      }
    }
    this.segUpdater.start();  // <-- this is the line I am talking about

Any ideas? Has anyone run into this?
Using nutchBean
Posted by "Lukas, Ray" <Ra...@idearc.com>.
Is this correct?

    NativeCrawler nativeCrawler = null;
    NutchBean nutchBean = null;
    Query nutchQuery = null;
    Hits nutchHits = null;
    for (int index = 0; index < 10; index++) {
        nativeCrawler = new NativeCrawler("www.ajpm.com", "ajpm-index", 2, 5);
        int maxHits = 1000;
        nutchBean = new NutchBean(                    // (*)
                nativeCrawler.getConfig(),
                nativeCrawler.getIndexPath());
        nutchQuery = Query.parse("gold", nativeCrawler.getConfig());
        nutchHits = nutchBean.search(nutchQuery, maxHits);
        nutchBean.close();
        System.out.println("gold nutchHits: " + nutchHits.getLength());
    }
    nutchQuery = null;
    nutchBean = null;
    System.out.println("credit nutchHits: " + nutchHits.getLength());

Every time I execute the (*) line, a new thread is started and never
ends. At the end of this I have ten threads. In real life this loop
might execute 5000 times.
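
One way to confirm a leak like this from inside the program is to count live threads by name before and after the loop. The sketch below uses only the standard Thread.getAllStackTraces() call; the class name ThreadLeakCheck and the idea of matching on a name prefix are assumptions made for illustration, and the actual name of the updater thread would have to be read from a thread dump first.

```java
// Hypothetical diagnostic helper, not part of Nutch.
class ThreadLeakCheck {

    /** Counts live threads whose name starts with the given prefix. */
    static int countLiveThreads(String namePrefix) {
        int count = 0;
        // getAllStackTraces() returns a snapshot keyed by every live thread in the JVM
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            if (t.isAlive() && t.getName().startsWith(namePrefix)) {
                count++;
            }
        }
        return count;
    }
}
```

Calling countLiveThreads with the updater thread's name prefix before the loop and again after it would show whether the count grows by one per iteration.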
How does a person close off or shut down a NutchBean object? I call
close() as soon as I am done with it.
NativeCrawler is my code that basically points me to the Nutch index
directory.
Thank you for the help
ray