Posted to user@nutch.apache.org by Michael Coffey <mc...@yahoo.com.INVALID> on 2017/05/23 00:03:49 UTC

generating and updating segments

In search of more effective parallelism, I have been experimenting with different schemes for organizing the Nutch jobs. I would like to know whether the Generator can work in a way that supports what I'm trying to do.
Here is a pseudocode description of one approach. I use variables named curSegs and prevSegs to refer to lists of segments. segsPerWave is typically 4 or more.

prevSegs = generate( segsPerWave )
in a "background" process (on other machines):
    fetch and parse prevSegs
repeat indefinitely
    curSegs = generate( segsPerWave )
    in a "background" process (on other machines):
        fetch and parse curSegs
    wait for prevSegs to be fetched and parsed
    update, linkdb, and merge prevSegs
    prevSegs = curSegs
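
For concreteness, the scheme above can be sketched in Python roughly as follows. This is only an illustration of the control flow, not a working crawl script: generate, fetch_and_parse, and update_linkdb_merge are hypothetical stand-ins for launching the corresponding Nutch jobs, the thread pool stands in for the "background" machines, and the loop runs a fixed number of waves instead of indefinitely.

```python
from concurrent.futures import ThreadPoolExecutor

log = []       # records the order in which steps run
_next_id = 0   # counter used to invent segment names


def generate(n):
    """Hypothetical stand-in for a Generator run that creates n segments."""
    global _next_id
    segs = tuple(f"seg{_next_id + i}" for i in range(n))
    _next_id += n
    log.append(("generate", segs))
    return segs


def fetch_and_parse(segs):
    """Hypothetical stand-in for fetching and parsing a list of segments."""
    log.append(("fetch+parse", segs))
    return segs


def update_linkdb_merge(segs):
    """Hypothetical stand-in for the update, linkdb, and merge steps."""
    log.append(("update", segs))


def run_waves(segs_per_wave, waves):
    # Two workers, so the fetch of the previous wave and the fetch of the
    # current wave can overlap.
    with ThreadPoolExecutor(max_workers=2) as pool:
        prev_segs = generate(segs_per_wave)
        prev_job = pool.submit(fetch_and_parse, prev_segs)
        for _ in range(waves):
            # Generate the next wave while the previous one is still fetching.
            cur_segs = generate(segs_per_wave)
            cur_job = pool.submit(fetch_and_parse, cur_segs)
            prev_job.result()               # wait for prevSegs
            update_linkdb_merge(prev_segs)  # crawldb sees results only now
            prev_segs, prev_job = cur_segs, cur_job
        # Drain the last wave (the original loop never terminates).
        prev_job.result()
        update_linkdb_merge(prev_segs)
```

Calling run_waves(4, 3) and inspecting log shows the point of concern: each generate for a new wave happens before the update for the previous wave, so the crawldb has not yet seen the previous wave's fetch results when the next segments are chosen.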
As I understand it, this will not work correctly unless I set generate.update.crawldb = true. Without it, my subsequent calls to generate would produce duplicated (or partially duplicated) segments.
If I do set generate.update.crawldb = true, should it work correctly? What exactly does generate.update.crawldb = true do? I assume it changes something in the crawldb, but I don't know what.
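
To make the duplication concern concrete, here is a toy model in Python. It is only my assumption about the general idea, not Nutch's actual implementation: the crawldb is a plain dict, toy_generate is a hypothetical function, and the boolean "marked" flag is an invented placeholder for whatever the Generator really records.

```python
def toy_generate(crawldb, top_n, update_crawldb):
    """Pick up to top_n URLs that are not marked as already generated.

    crawldb maps url -> marked flag (a made-up stand-in for real state).
    """
    picked = [url for url, marked in crawldb.items() if not marked][:top_n]
    if update_crawldb:
        # Assumed behaviour: remember that these URLs are already
        # in a pending segment so the next call skips them.
        for url in picked:
            crawldb[url] = True
    return picked


# Without marking, two consecutive calls select the same URLs.
db = {u: False for u in ["a", "b", "c", "d"]}
first = toy_generate(db, 2, update_crawldb=False)
second = toy_generate(db, 2, update_crawldb=False)
print(first == second)  # True: the second segment duplicates the first

# With marking, the second call moves on to different URLs.
db = {u: False for u in ["a", "b", "c", "d"]}
first = toy_generate(db, 2, update_crawldb=True)
second = toy_generate(db, 2, update_crawldb=True)
print(set(first) & set(second))  # set(): no overlap
```

In this model, generating a second time before any update step has run only avoids duplicates if the generate call itself writes something back to the crawldb.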