Posted to user@nutch.apache.org by Gabriele Kahlout <ga...@mysimpatico.com> on 2011/06/04 11:43:12 UTC

Re: How to get the crawl database free of links to recrawl only from seed URL?

Ismail, I've been having the same issue, i.e. I want to fetch all seeds first,
and only then maybe other urls. In my script I therefore fetch with depth 1 and
set the db.update.additions.allowed property to false.
Setting the property helps if you plan on restarting the script, since when
it restarts and calls generate from the crawldb it will not find urls added
from depth 1 during the previous crawl. However, you will not have collected
any urls other than the seeds.
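
The override I mean would look something like this in conf/nutch-site.xml
(just a sketch of the property block; the default lives in nutch-default.xml):

        <property>
          <name>db.update.additions.allowed</name>
          <value>false</value>
          <description>Don't let updatedb add newly discovered urls to the
          crawldb, so generate only ever sees the injected seeds.</description>
        </property>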

Solutions to the issue:


1. Fetch your seeds all in one go with depth 1 (invoking bin/nutch generate
only once). Once that is done you can crawl with greater depth.
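
Something like the following, assuming the same $nutch/$hdfs variables as in
the script further down, urls/ holding all the seeds and $crawl the crawl dir
(only a sketch):

        # single depth-1 cycle over all the seeds
        $nutch inject $crawl/crawldb urls
        $nutch generate $crawl/crawldb $crawl/segments
        # pick up the segment generate just created
        seg=`$hdfs -ls $crawl/segments | awk '{print $NF}' | grep / | tail -1`
        $nutch fetch $seg
        $nutch parse $seg
        # with db.update.additions.allowed=false this only updates the seeds
        $nutch updatedb $crawl/crawldb $seg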

If you cannot tolerate having to fetch all seeds at once:
2. Partition the seeds into chunks you can afford to fetch at once. Then apply
1. to each chunk in a dedicated chunk crawldb (so that the chunk crawldb
has only the chunk's urls injected). At the end merge the chunk crawldbs into
a global crawldb.
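
A sketch of 2., assuming the chunks live in dirs seeds/chunk-* (the names are
just for illustration; on hdfs you'd list the chunk crawldbs explicitly
instead of globbing) and with db.update.additions.allowed=false as above so
each chunk crawldb stays seed-only; mergedb then builds the global db:

        # depth-1 crawl per chunk, each into its own crawl dir
        for chunk in seeds/chunk-*; do
            name=`basename $chunk`
            $nutch crawl $chunk -dir crawls/$name -depth 1
        done
        # merge the per-chunk crawldbs into the global crawldb
        $nutch mergedb $crawl/crawldb crawls/*/crawldb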

3. Set db.update.additions.allowed to false in your seed crawl. Then, before
you discard an indexed segment, read the urls discovered at that depth into a
discovered-urls-list. In your 2nd crawl you inject the discovered-urls-list
to crawl the urls discovered in the previous crawl (and possibly new ones, if
you don't set db.update.additions.allowed to false). Your script code would be
something like:
 
        # dump only the crawl_parse part of the segment (it holds the discovered urls)
        $nutch readseg -dump $it_segs crawl_parse_dump -nocontent -nofetch \
            -nogenerate -noparsedata -noparsetext
        tstamp=`date -u +%s`
        # keep just the urls, one per line
        $hdfs -cat crawl_parse_dump/dump | grep 'http://' | sed 's/URL:: //' \
            > $tstamp
        # store the list where a later crawl can inject it from
        $hdfs -put $tstamp $crawl/discovered-urls/
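
The 2nd crawl would then pick the list up with something like this (assuming
$crawl2 is the new crawl's dir; the name is only for illustration):

        # seed the 2nd crawl with everything discovered in the 1st one
        $nutch inject $crawl2/crawldb $crawl/discovered-urls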

The simplest, but expensive:
4. Crawl with db.update.additions.allowed set to false, and then crawl again
with it set to true and db.fetch.interval.default set to 0 (or just
wipe out the crawldb; I'm not sure what information in there remains useful).
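
Roughly, and only as a sketch (whether bin/nutch crawl will reuse an existing
-dir depends on the version, so the 2nd pass may have to be driven with plain
generate/fetch/updatedb steps instead; urls/, $crawl and -depth 3 are just
example values):

        # pass 1: db.update.additions.allowed=false, so only the seeds get fetched
        $nutch crawl urls -dir $crawl -depth 1
        # flip db.update.additions.allowed back to true and set
        # db.fetch.interval.default=0 in nutch-site.xml so the seeds are due again
        $nutch crawl urls -dir $crawl -depth 3
        # (or, instead of touching db.fetch.interval.default, wipe the crawldb:
        #  $hdfs -rmr $crawl/crawldb)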

I prefer 3 and 2, but I couldn't test 3., since readseg -dump expects a
single segment path while I have a whole directory of them, and I have an
issue with that (NUTCH-1001: https://issues.apache.org/jira/browse/NUTCH-1001).
I intended to patch SegmentReader to accept a dir of segments and output a
single dump file, but since hadoop doesn't support append I'm not sure how to
do it without much complication (PATCH WELCOME). Simpler could be to output
multiple dumps, but then one must process each dump in turn too, and having
the files on hdfs makes manipulating them more complicated.
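
For the multiple-dumps route, an untested sketch (same $nutch/$hdfs/$it_segs
variables as above) could be to loop over the segments and concatenate the
dumps afterwards:

        # dump each segment on its own, then glue the dumps together
        for seg in `$hdfs -ls $it_segs | awk '{print $NF}' | grep /`; do
            name=`basename $seg`
            $nutch readseg -dump $seg dump-$name -nocontent -nofetch \
                -nogenerate -noparsedata -noparsetext
        done
        tstamp=`date -u +%s`
        $hdfs -cat 'dump-*/dump' | grep 'http://' | sed 's/URL:: //' > $tstamp
        $hdfs -put $tstamp $crawl/discovered-urls/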

