Posted to dev@nutch.apache.org by Stefan Groschupf <sg...@media-style.com> on 2006/01/09 13:59:08 UTC

why index not in segment anymore

Hi Doug,
in Nutch 0.8 the index is no longer in the segment folder.
What was the reason for that? In the context of a web GUI it might be
better to have the index in the segment folder as well, since the
segment folder would then be the single item whose life cycle has to be managed.
Thanks for an explanation.

Stefan 

Re: why index not in segment anymore

Posted by Doug Cutting <cu...@nutch.org>.
Stefan Groschupf wrote:
> in Nutch 0.8 the index is no longer in the segment folder.
> What was the reason for that? In the context of a web GUI it might be
> better to have the index in the segment folder as well, since the
> segment folder would then be the single item whose life cycle has to be managed.

The current indexer command line is optimized for one-shot batch 
crawling.  In this case it is best to index everything at the end, in 
order to have the most up-to-date page scores from the crawl db.  So it 
indexes everything in a single MapReduce pass, which produces a set of 
indexes that are not aligned with segments.

It would be easy to modify Indexer.index() to index just a segment at a 
time, but each would need to process the entire crawl and link dbs as 
inputs, and would thus be less efficient than indexing all segments at once.
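
To make the efficiency argument concrete, here is a rough cost sketch. 
It only counts input records read. The class name, method names and 
record counts are all hypothetical and are not taken from Nutch or from 
any measurement.

```java
// Hypothetical cost model: counts input records read by the indexing job(s).
// The numbers in main() are made-up illustrations, not Nutch measurements.
final class IndexingCost {

    /** One job: read the crawl and link dbs once, plus every segment. */
    static long batch(long dbRecords, long[] segmentRecords) {
        long total = dbRecords;
        for (long s : segmentRecords) {
            total += s;
        }
        return total;
    }

    /** One job per segment: each job re-reads the full crawl and link dbs. */
    static long perSegment(long dbRecords, long[] segmentRecords) {
        long total = 0;
        for (long s : segmentRecords) {
            total += dbRecords + s;
        }
        return total;
    }

    public static void main(String[] args) {
        long db = 1_000_000L;                        // crawl db + link db records (assumed)
        long[] segments = {100_000L, 100_000L, 100_000L};
        System.out.println("batch:       " + batch(db, segments));
        System.out.println("per segment: " + perSegment(db, segments));
    }
}
```

With n segments the db inputs are read n times instead of once, so the 
gap grows with the number of segments and with the size of the dbs.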

So both modes may be useful.  We could add an Indexer.index() method 
that takes just a single segment name and indexes it, storing the index 
in the segment, and modify Indexer.main() to be able to invoke it.  Then 
we'd also need to modify NutchBean to find these indexes, and 
IndexMerger, etc.
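
As a rough illustration of what "find these indexes" could mean if both 
modes existed, the sketch below lists the directories a searcher would 
have to check. The class, the method names and the directory names 
"indexes" and "index" are assumptions made for this example, not the 
actual NutchBean or Indexer code.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; not part of Nutch. Directory names are assumed.
final class IndexLocations {

    /** Where a single batch pass over all segments would write (assumed name). */
    static Path batchIndexDir(Path crawlDir) {
        return crawlDir.resolve("indexes");
    }

    /** Where a per-segment pass would write: inside the segment itself (assumed name). */
    static Path segmentIndexDir(Path segment) {
        return segment.resolve("index");
    }

    /** Every location a searcher or merger would need to look at if both modes are allowed. */
    static List<Path> candidates(Path crawlDir, List<Path> segments) {
        List<Path> out = new ArrayList<>();
        out.add(batchIndexDir(crawlDir));
        for (Path s : segments) {
            out.add(segmentIndexDir(s));
        }
        return out;
    }

    public static void main(String[] args) {
        Path crawl = Paths.get("crawl");
        List<Path> segments = Arrays.asList(
            crawl.resolve("segments").resolve("seg-a"),   // made-up segment names
            crawl.resolve("segments").resolve("seg-b"));
        for (Path p : candidates(crawl, segments)) {
            System.out.println(p);
        }
    }
}
```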

Doug