
Posted to user@nutch.apache.org by Sherjeel Niazi <sh...@softmatics.com> on 2009/04/23 17:02:42 UTC

How to resume crawler after crash

Hi,

I am using Nutch 0.9.
I am crawling a series of URLs from a website, but after some time the
crawler crashes with the following error:

Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
    at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:97)
    at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:62)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:128)

How can I resume the crawl from where it stopped?


Sherjeel

Re: How to resume crawler after crash

Posted by Dennis Kubes <ku...@apache.org>.

You can't.  Crawls are self-contained.  You can restart the fetch by
removing all folders under the segments/xxxx/ directory except
crawl_generate and then re-executing the fetch job.  But there isn't a
way to resume a crawl job from a checkpoint partway through.
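
The cleanup described above could be sketched roughly as follows. This assumes the segment lives on the local filesystem (a crawl stored on HDFS would need Hadoop's FileSystem API instead). The class name SegmentReset and the method resetSegment are made up for this illustration and are not part of Nutch.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Nutch: strips a segment back to its
// crawl_generate directory so that the fetch step can be run again.
public class SegmentReset {

    // Deletes every entry directly under the segment except crawl_generate.
    static void resetSegment(Path segment) throws IOException {
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(segment)) {
            for (Path entry : entries) {
                if (!entry.getFileName().toString().equals("crawl_generate")) {
                    deleteRecursively(entry);
                }
            }
        }
    }

    // Removes a file or a whole directory tree, children before parents.
    static void deleteRecursively(Path root) throws IOException {
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(root)) {
            // Reverse order puts the deepest paths first.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: SegmentReset <segment-dir>");
            return;
        }
        // args[0] is assumed to be the path of one segment directory
        resetSegment(Paths.get(args[0]));
    }
}
```

After such a reset, the fetch would be run again for that segment, for example with the stock bin/nutch fetch command, assuming the standard command-line script is in use.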

Dennis


RE: Using nutchBean

Posted by "Lukas, Ray" <Ra...@idearc.com>.

Oh, it works great now. Thanks, guys, and thanks to Andrzej Bialecki. I
will look into how this can be submitted for everyone to have.

-----Original Message-----
From: Lukas, Ray [mailto:Ray.Lukas@idearc.com] 
Sent: Thursday, April 23, 2009 5:45 PM
To: nutch-user@lucene.apache.org
Subject: RE: Using nutchBean

I am looking into that now. But this thread lives past the running of
my code. If it were a normal thread, i.e. not a daemon, it should die
when the parent (me) dies. I don't use daemon threads much, but my
understanding is that a thread only becomes a daemon when a flag is
explicitly set to true. I have to look this up, but that is what I
understood. So wouldn't there be a flag somewhere in the configs that we
could set, instead of me bashing around in the Nutch code? I am just
asking. I am thinking that I should hunt around for it, unless someone
already knows where it lives.


RE: Using nutchBean

Posted by "Lukas, Ray" <Ra...@idearc.com>.

I am looking into that now. But this thread lives past the running of
my code. If it were a normal thread, i.e. not a daemon, it should die
when the parent (me) dies. I don't use daemon threads much, but my
understanding is that a thread only becomes a daemon when a flag is
explicitly set to true. I have to look this up, but that is what I
understood. So wouldn't there be a flag somewhere in the configs that we
could set, instead of me bashing around in the Nutch code? I am just
asking. I am thinking that I should hunt around for it, unless someone
already knows where it lives.
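
For reference, here is a minimal standalone sketch of how the daemon flag works in plain Java. Nothing in it is Nutch code, and the class name DaemonFlagDemo is made up. Note that in the JVM it is the non-daemon threads that keep a process alive after main() returns. A thread marked as a daemon before start() does not hold the JVM open.

```java
public class DaemonFlagDemo {

    // Builds a thread that sleeps in a loop, much like a periodic updater would.
    static Thread newBackgroundThread(boolean daemon) {
        Thread t = new Thread(() -> {
            try {
                while (true) {
                    Thread.sleep(1000);
                }
            } catch (InterruptedException e) {
                // interrupted: fall through and let the thread end
            }
        });
        // setDaemon must be called before start().
        t.setDaemon(daemon);
        return t;
    }

    public static void main(String[] args) {
        Thread t = newBackgroundThread(true);
        t.start();
        System.out.println("daemon = " + t.isDaemon());
        // main() returns here. Because t is a daemon, the JVM can exit;
        // with newBackgroundThread(false) the process would keep running.
    }
}
```

Whether Nutch exposes a configuration setting for this is not something the sketch can answer. It only shows that the flag is set in code, on the Thread object itself.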

-----Original Message-----
From: Andrzej Bialecki [mailto:ab@getopt.org] 
Sent: Thursday, April 23, 2009 5:32 PM
To: nutch-user@lucene.apache.org
Subject: Re: Using nutchBean

Lukas, Ray wrote:
> I started going through the source code debugging this problem.. The
> extra thread comes from FetchedSegments(Configuration conf, Path
> segmentsDir) constructor. This constructor creates a segmentUpdater and
> then starts it.. This is the thread that I am talking about.. 
> 
> Does anyone know how to cleanly shut down this segment updater?

Looking at the code, it's not possible right now. You need to modify 
FetchedSegments.close() so that it also shuts down this thread. 
Since the thread already runs in a while (true) loop, it's enough to 
replace the constant with a boolean flag and set this flag to false in 
the close() method.

If you come up with a more or less clean patch please submit this to 
JIRA as an improvement.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Re: Using nutchBean

Posted by Andrzej Bialecki <ab...@getopt.org>.

Lukas, Ray wrote:
> I started going through the source code debugging this problem.. The
> extra thread comes from FetchedSegments(Configuration conf, Path
> segmentsDir) constructor. This constructor creates a segmentUpdater and
> then starts it.. This is the thread that I am talking about.. 
> 
> Does anyone know how to cleanly shut down this segment updater?

Looking at the code, it's not possible right now. You need to modify 
FetchedSegments.close() so that it also shuts down this thread. 
Since the thread already runs in a while (true) loop, it's enough to 
replace the constant with a boolean flag and set this flag to false in 
the close() method.
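
A rough sketch of that change, written as a standalone class rather than as a patch against FetchedSegments: the names StoppableUpdater, running and refresh() are invented for the illustration, and the real SegmentUpdater loop body is not reproduced here. The flag is declared volatile so the worker thread sees the update, and close() also interrupts the thread so that it does not have to wait out its sleep.

```java
import java.io.Closeable;

// Hypothetical stand-in for a periodic updater thread; not the Nutch class.
public class StoppableUpdater extends Thread implements Closeable {

    // Replaces the constant in while (true); volatile for cross-thread visibility.
    private volatile boolean running = true;
    private final long intervalMillis;

    public StoppableUpdater(long intervalMillis) {
        super("stoppable-updater");
        this.intervalMillis = intervalMillis;
    }

    @Override
    public void run() {
        while (running) {
            refresh();
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                // close() interrupts the sleep; the loop condition decides whether to exit
            }
        }
    }

    // Placeholder for the periodic work (an assumption, left empty here).
    private void refresh() {
    }

    @Override
    public void close() {
        running = false; // make the loop condition fail
        interrupt();     // wake the thread if it is sleeping
    }

    public static void main(String[] args) throws InterruptedException {
        StoppableUpdater updater = new StoppableUpdater(60_000);
        updater.start();
        updater.close();
        updater.join(5_000);
        System.out.println("still alive: " + updater.isAlive());
    }
}
```

In an actual patch, the owning object's close() would call the updater's close() alongside whatever cleanup it already performs.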

If you come up with a more or less clean patch please submit this to 
JIRA as an improvement.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

RE: Using nutchBean

Posted by "Lukas, Ray" <Ra...@idearc.com>.

I started going through the source code debugging this problem.. The
extra thread comes from FetchedSegments(Configuration conf, Path
segmentsDir) constructor. This constructor creates a segmentUpdater and
then starts it.. This is the thread that I am talking about.. 

Does anyone know how to cleanly shut down this segment updater?
From the FetchedSegments constructor:
    this.segUpdater = new SegmentUpdater();

    if (segmentDirs != null) {
      for (final Path segmentDir : segmentDirs) {
        segments.put(segmentDir.getName(),
          new Segment(this.fs, segmentDir, this.conf));
      }
    }
    this.segUpdater.start();  // <-- this is the line I am talking about

Any ideas? Has anyone run into this?

-----Original Message-----
From: Lukas, Ray [mailto:Ray.Lukas@idearc.com] 
Sent: Thursday, April 23, 2009 4:36 PM
To: nutch-user@lucene.apache.org
Subject: Using nutchBean

Is this correct?

		NativeCrawler nativeCrawler = null;
		NutchBean nutchBean = null;
		Query nutchQuery = null;
		Hits nutchHits = null;
		for (int index = 0; index < 10; index++) {
			nativeCrawler = new NativeCrawler("www.ajpm.com", "ajpm-index", 2, 5);
			int maxHits = 1000;
			nutchBean = new NutchBean(nativeCrawler.getConfig(), nativeCrawler.getIndexPath()); // (*)
			nutchQuery = Query.parse("gold", nativeCrawler.getConfig());
			nutchHits = nutchBean.search(nutchQuery, maxHits);
			nutchBean.close();
			System.out.println("gold nutchHits: " + nutchHits.getLength());
		}
		nutchQuery = null;
		nutchBean = null;
		System.out.println("credit nutchHits: " + nutchHits.getLength());

Every time I execute the line marked (*), a new thread is started and
never ends. At the end of this I have ten threads. In real life this
loop might execute 5000 times.
How does one shut down a NutchBean object? I call close() as soon as I
am done with it.

NativeCrawler is my own code that basically points me to the Nutch index
directory.

Thank you for the help
ray

Using nutchBean

Posted by "Lukas, Ray" <Ra...@idearc.com>.

Is this correct?

		NativeCrawler nativeCrawler = null;
		NutchBean nutchBean = null;
		Query nutchQuery = null;
		Hits nutchHits = null;
		for (int index = 0; index < 10; index++) {
			nativeCrawler = new NativeCrawler("www.ajpm.com", "ajpm-index", 2, 5);
			int maxHits = 1000;
			nutchBean = new NutchBean(nativeCrawler.getConfig(), nativeCrawler.getIndexPath()); // (*)
			nutchQuery = Query.parse("gold", nativeCrawler.getConfig());
			nutchHits = nutchBean.search(nutchQuery, maxHits);
			nutchBean.close();
			System.out.println("gold nutchHits: " + nutchHits.getLength());
		}
		nutchQuery = null;
		nutchBean = null;
		System.out.println("credit nutchHits: " + nutchHits.getLength());

Every time I execute the line marked (*), a new thread is started and
never ends. At the end of this I have ten threads. In real life this
loop might execute 5000 times.
How does one shut down a NutchBean object? I call close() as soon as I
am done with it.

NativeCrawler is my own code that basically points me to the Nutch index
directory.
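
One way to see why the thread count grows with the loop is to model the behaviour with a stand-in class. The sketch below does not use Nutch at all: FakeBean is a made-up class whose constructor starts one background thread and whose close() leaves it running, mirroring what is described above. It compares creating a bean per iteration with creating one bean up front and reusing it, which may be an option when the configuration and index path do not change between iterations.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class BeanReuseDemo {

    // Stand-in for NutchBean (NOT the real class): every instance starts one
    // background thread, and close() leaves that thread running.
    static class FakeBean {
        static final AtomicInteger threadsStarted = new AtomicInteger();

        FakeBean() {
            Thread updater = new Thread(() -> {
                try {
                    while (true) {
                        Thread.sleep(1000);
                    }
                } catch (InterruptedException e) {
                    // never interrupted in this demo
                }
            });
            updater.setDaemon(true); // so the demo JVM can still exit
            updater.start();
            threadsStarted.incrementAndGet();
        }

        int search(String term) {
            return term.length(); // dummy result
        }

        void close() {
            // intentionally does not stop the updater thread
        }
    }

    // One bean per iteration, as in the loop above.
    static int threadsWithBeanPerIteration(int iterations) {
        int before = FakeBean.threadsStarted.get();
        for (int i = 0; i < iterations; i++) {
            FakeBean bean = new FakeBean();
            bean.search("gold");
            bean.close();
        }
        return FakeBean.threadsStarted.get() - before;
    }

    // One bean created up front and shared by every iteration.
    static int threadsWithSharedBean(int iterations) {
        int before = FakeBean.threadsStarted.get();
        FakeBean bean = new FakeBean();
        try {
            for (int i = 0; i < iterations; i++) {
                bean.search("gold");
            }
        } finally {
            bean.close();
        }
        return FakeBean.threadsStarted.get() - before;
    }

    public static void main(String[] args) {
        System.out.println("per iteration: " + threadsWithBeanPerIteration(10));
        System.out.println("shared bean:   " + threadsWithSharedBean(10));
    }
}
```

With ten iterations the first variant starts ten threads and the second starts one.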

Thank you for the help
ray