Posted to dev@nutch.apache.org by Andrzej Bialecki <ab...@getopt.org> on 2006/01/13 14:31:16 UTC

Generating multiple fetchlists between updates

Hi,

In the 0.7 branch, whenever a segment was generated, the WebDB was 
modified so that the entries that ended up in the fetchlist wouldn't be 
immediately available to the next segment generation, if that happened 
before the WebDB was updated with the data from that first segment. This 
was achieved by adding 1 week to the next fetchTime on a Page.
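
Roughly, the generate step did something like this (a sketch from 
memory, with illustrative names -- not the exact 0.7 API):

  // Sketch only: when a page is selected into the fetchlist, its next
  // fetch time is pushed a week into the future, so that a second
  // generate run before the next WebDB update will not select it again.
  final long ONE_WEEK = 7L * 24 * 60 * 60 * 1000;

  for (Page page : selected) {         // pages chosen for the fetchlist
    fetchlist.add(page);
    page.setNextFetchTime(page.getNextFetchTime() + ONE_WEEK);
    webdb.update(page);                // this is where the WebDB changes
  }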

I can't see that we do this in the trunk. This means that we cannot 
generate more than one fetchlist between CrawlDB updates, because each 
fetchlist would be identical to the previous one... Should we worry 
about this? There is a cost to modifying the CrawlDB, but there is also 
a cost to not being able to generate multiple different fetchlists and 
fetch them in parallel...
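
If we wanted this back, one option would be for the generator to put a 
temporary mark on the selected entries and skip marked entries in later 
runs, e.g. (again just a sketch -- the key name and the metadata API 
are made up, not existing trunk code):

  // Sketch only: mark every CrawlDatum selected for a fetchlist and
  // have the selection step skip entries that are already marked;
  // updatedb would clear the mark when the segment is merged back in.
  Text GENERATED = new Text("_generated_");

  if (datum.getMetaData().containsKey(GENERATED))
    return;                            // already in an open fetchlist
  datum.getMetaData().put(GENERATED,
      new LongWritable(System.currentTimeMillis()));
  // ... and write the modified entries back to the CrawlDB as part of
  // the generate job -- which is exactly the cost mentioned above.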

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Generating multiple fetchlists between updates

Posted by Doug Cutting <cu...@nutch.org>.
Andrzej Bialecki wrote:
> In the 0.7 branch, whenever a segment was generated, the WebDB was 
> modified so that the entries that ended up in the fetchlist wouldn't be 
> immediately available to the next segment generation, if that happened 
> before the WebDB was updated with the data from that first segment. This 
> was achieved by adding 1 week to the next fetchTime on a Page.
> 
> I can't see that we do this in the trunk. This means that we cannot 
> generate more than one fetchlist between CrawlDB updates, because each 
> fetchlist would be identical to the previous one... Should we worry 
> about this? There is a cost to modifying the CrawlDB, but there is also 
> a cost to not being able to generate multiple different fetchlists and 
> fetch them in parallel...

I think this would be a useful feature to resurrect.  I'd vote for 
making it optional, at least at first.

Ideally one could run the crawldb update and generate jobs in parallel 
with the fetch job, so that as soon as one fetch completes the next can 
start.
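
Something along these lines (pseudocode only; generate(), fetch() and 
update() stand for the corresponding Nutch jobs, wrapped in 
hypothetical helpers):

  // Sketch only: overlap fetching of segment N with updating the
  // crawldb from segment N-1 and generating segment N+1.
  String next = generate(crawldb);     // fetchlist for the first segment
  String fetched = null;
  while (moreToCrawl()) {
    final String segment = next;
    Thread fetcher = new Thread(new Runnable() {
      public void run() { fetch(segment); }
    });
    fetcher.start();
    if (fetched != null)
      update(crawldb, fetched);        // fold the previous segment in
    next = generate(crawldb);          // needs the "generated" mark, or
                                       // this list is identical to the last
    try { fetcher.join(); } catch (InterruptedException e) { return; }
    fetched = segment;
  }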

Doug