Posted to dev@nutch.apache.org by Lukas Vlcek <lu...@gmail.com> on 2006/01/24 10:11:07 UTC

Nutch merge problem after fetch is aborted with hung threads.

Re-posting to dev list after no response in user list.
Lukas
---------- Forwarded message ----------
From: Lukas Vlcek <lu...@gmail.com>
Date: Jan 19, 2006 8:42 AM
Subject: Nutch merge problem after fetch is aborted with hung threads.
To: nutch-user@lucene.apache.org


Hi,

I am facing an interesting problem. I am crawling in iterative cycles
and it works fine until one of the fetch cycles is prematurely
terminated due to a timeout, which results in this message being
written to the log file: [Aborting with 3 hung threads.] (I am using 3
threads). Let's say that this fetch retrieved only 101 pages (out of
500) before it was terminated.
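
In case it is relevant, this is roughly how I could capture the fetch
output and spot the abort afterwards. Just a sketch: the fetch.log
file name is my own choice, the grep pattern is the message quoted
above, and I am assuming the message also shows up in the console
output.

--- start ---
#!/bin/bash

s=`ls -d crawl.test/segments/2* | tail -1`

# keep the console output of the fetch so the abort message can be found
bin/nutch fetch $s 2>&1 | tee fetch.log

# the "Aborting with N hung threads" line is the message quoted above
if grep -q "Aborting with" fetch.log; then
    echo "fetch was aborted with hung threads" >&2
fi
--- end ---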

The problem is that I then see only 101 pages in the merged index, no
matter how many pages were fetched in previous cycles. It seems to me
that it is not possible to build a healthy merged index if one of the
fetches times out.

If I open the index with Luke, it also shows that the total number of
documents is only 101.
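
For completeness, a minimal sketch of how I compare the numbers,
assuming the readdb command with a -stats option is available in this
build (I have not checked every version):

--- start ---
#!/bin/bash

d=crawl.test

# how many pages the crawldb considers fetched across all cycles
bin/nutch readdb $d/crawldb -stats

# how many fetch cycles (segments) exist so far
ls -d $d/segments/2*
--- end ---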

Here are details:

My script looks like the following example:
--- start ---
#!/bin/bash

d=crawl.test

# generate a fetch list of up to 500 URLs and pick the newest segment
bin/nutch generate $d/crawldb $d/segments -topN 500
s=`ls -d $d/segments/2* | tail -1`

# fetch the segment and update the crawldb with the results
bin/nutch fetch $s
bin/nutch updatedb $d/crawldb $s

# build the link database, index the fetched segment and remove duplicates
bin/nutch invertlinks $d/linkdb $d/segments
bin/nutch index $d/indexes $d/crawldb $d/linkdb $s
bin/nutch dedup $d/indexes

# merge everything under $d/indexes into a single index
bin/nutch merge $d/index $d/indexes
--- end ---

So once the fetch operation is terminated, the rest of the tasks are
executed anyway (updatedb, indexing, ...). It also seems to me that in
this case it does not matter whether I execute the merge at the end of
every cycle or just once after the desired crawl depth is reached.
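
A sketch of the guard I am considering, assuming the fetch command
exits with a non-zero status when it aborts (which I have not
verified):

--- start ---
#!/bin/bash

d=crawl.test
bin/nutch generate $d/crawldb $d/segments -topN 500
s=`ls -d $d/segments/2* | tail -1`

# skip updatedb/index/merge for this cycle if the fetch itself failed
if ! bin/nutch fetch $s; then
    echo "fetch aborted, stopping this cycle" >&2
    exit 1
fi

bin/nutch updatedb $d/crawldb $s
--- end ---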

Can anybody explain what I am doing wrong?

Thanks,
Lukas