Posted to user@nutch.apache.org by Stephen Ensor <st...@gmail.com> on 2006/03/16 10:47:19 UTC

hanging crawler/fetcher fix

I have been trying to get this crawler working for some time now, but it
always hangs, which makes the whole of Nutch pretty unusable right now,
unless I'm missing something.

This is the problem:

Assume the following scenario: a user runs the CrawlTool to crawl a single
site. Fetchlists generated by the CrawlTool will contain only URLs from that
site (which is to say, from the same IP). The logic in Http.blockAddr() then
interacts badly with the Fetcher and effectively deadlocks the
FetcherThreads: by default the Fetcher starts 10 threads, and each of these
threads tries to access the same IP address, but the default value of
fetcher.threads.per.host is just 1. This means that only the first thread is
allowed to run, while the other 9 threads spin, waiting for the first thread
to finish. Eventually, some of these waiting threads exceed the maximum wait
time and throw an exception.
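
For reference, the two properties involved are fetcher.threads.fetch
(default 10) and fetcher.threads.per.host (default 1). One stop-gap when
crawling a single site is to override them in conf/nutch-site.xml so that
the thread count matches the per-host limit; this avoids the contention but
obviously does not fix the underlying bug. The value below is just an
example:

<property>
  <name>fetcher.threads.fetch</name>
  <value>1</value>
</property>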

A simple timeout would solve this problem, but somehow it is being
overlooked right now.

Another fix would be a simple bash script that checks the log every 20
seconds (i.e. tail crawl.log) to see whether it has changed since the last
check; if it has not changed, we know the crawl has hung and the script can
kill it (see the sketch below). That part is pretty simple. The part I am
having difficulty with is restarting the crawl so that it continues where
it left off; I tried the recipe quoted below (after the sketch), to no
avail.
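
Roughly, the watchdog I have in mind looks like this (the log name, the
20-second interval and the md5sum comparison are just my own choices, and
the crawl is assumed to have been started in the background with its output
redirected to that log):

#!/bin/bash
# Rough watchdog sketch: kill the crawl if its log stops changing.
# Usage: watchdog.sh <crawl-pid> [logfile]
CRAWL_PID=$1
LOG=${2:-crawl.log}
LAST=""
while kill -0 "$CRAWL_PID" 2>/dev/null; do
  sleep 20
  # hash the tail of the log; if it is identical to last time, nothing
  # has been written for 20 seconds and we assume the fetcher is hung
  CUR=$(tail -n 50 "$LOG" | md5sum)
  if [ "$CUR" = "$LAST" ]; then
    echo "log unchanged for 20s, assuming a hung fetcher; killing $CRAWL_PID"
    kill "$CRAWL_PID"
    break
  fi
  LAST="$CUR"
done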

How can I recover an aborted fetch process?
Well, you can not. However, you have two choices to proceed:

1) Recover the pages already fetched and then restart the fetcher.

You'll need to create a file fetcher.done in the segment directory and then
run updatedb, generate and fetch. Assuming your index is at /index:


% touch /index/segments/2005somesegment/fetcher.done

% bin/nutch updatedb /index/db/ /index/segments/2005somesegment/

% bin/nutch generate /index/db/ /index/segments/2005somesegment/

% bin/nutch fetch /index/segments/2005somesegment

All the pages that were not crawled will be re-generated for fetching. If
you have already fetched lots of pages and don't want to re-fetch them, this
is the best way.

This does not work. If anyone has any ideas on how to continue a terminated
crawl, pray do tell!!