Posted to dev@nutch.apache.org by Richard Braman <rb...@bramantax.com> on 2006/03/01 00:13:58 UTC

RE: Index aborted crawl.

Jerome and Jeff,
Thanks for the help :)


I found the answer in the wiki FAQ entry on recovering an aborted fetch,
which was insightful.
It also mentions that you can index what was already crawled:

"You should be able to index the part of the segment for crawling which
is already fetched."
I tried the command I put in my last email:

bin/nutch index indexes crawled/linkdb crawled/segments/*
But it failed.

How can I recover an aborted fetch process?
Well, you cannot. However, you have two choices to proceed:

1) Recover the pages already fetched and then restart the fetcher.

You'll need to create a file fetcher.done in the segment directory and
then run updatedb, generate and fetch. Assuming your index is at /index:


% touch /index/segments/2005somesegment/fetcher.done 

% bin/nutch updatedb /index/db/ /index/segments/2005somesegment/

% bin/nutch generate /index/db/ /index/segments/2005somesegment/

% bin/nutch fetch /index/segments/2005somesegment
All the pages that were not crawled will be re-generated for fetch. If
you fetched lots of pages and don't want to have to re-fetch them,
this is the best way.

2) Discard the aborted output. 

Delete all folders from the segment folder except the fetchlist folder
and restart the fetcher. 
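
For what it's worth, option 2 could be sketched roughly as below. This is only an illustration, assuming a POSIX shell; the function name discard_aborted_output is hypothetical and not a Nutch command, and it assumes the segment's subfolders have ordinary (non-hidden) names. The segment path and the bin/nutch fetch call in the comment are the ones used in the FAQ text above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): remove every subfolder of a
# segment folder except "fetchlist", as described in option 2 above.
discard_aborted_output() {
    segment_dir="$1"

    # Refuse to run unless the fetchlist folder is really there, so a
    # mistyped path does not wipe some unrelated directory.
    if [ ! -d "$segment_dir/fetchlist" ]; then
        echo "no fetchlist folder in $segment_dir, nothing done" >&2
        return 1
    fi

    # Walk the immediate subfolders only and delete all but fetchlist.
    for entry in "$segment_dir"/*/; do
        name=$(basename "$entry")
        if [ "$name" != "fetchlist" ]; then
            rm -rf "${entry%/}"
        fi
    done
}

# Example, followed by restarting the fetcher as the FAQ suggests:
#   discard_aborted_output /index/segments/2005somesegment
#   bin/nutch fetch /index/segments/2005somesegment
```

The check for fetchlist is a deliberate safety net: without it, a typo in the path would either do nothing useful or delete the wrong folders.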


Richard Braman
mailto:rbraman@taxcodesoftware.org
561.748.4002 (voice) 

http://www.taxcodesoftware.org
Free Open Source Tax Software


-----Original Message-----
From: Richard Braman [mailto:rbraman@bramantax.com] 
Sent: Tuesday, February 28, 2006 5:02 PM
To: nutch-dev@lucene.apache.org
Subject: FW: Index aborted crawl.


I had to abort a crawl mid-crawl (after 2 days of crawling) because I
realized I had an error in my filter. I know at least 6 segments were
fetched.

I tried the command
bin/nutch index indexes crawled/linkdb crawled/segments/*
but it failed. I would like to review the results of the crawl, but if
it's impossible, it's impossible.