Posted to user@nutch.apache.org by Brian Whitman <br...@variogr.am> on 2006/12/07 22:32:32 UTC

locks on merging indexes?

Using the Nutch nightly/0.9 for 'whole-internet' crawl type applications,
we've got a process running a long (12 hours or so) generate, fetch,
update, mergesegs, invert, index, merge loop. This is all working
fine.

I want to add another nutch crawl on the same machine from a small  
set of high update-rate pages. Some of these pages may be in the  
larger crawl as well. The smaller crawl will happen a few times a day  
and will only take a few minutes to finish fetching.

I would like to merge the index from the smaller crawl into the main
larger index each time the smaller crawl completes, so that results
from the small crawls are pushed out to the searcher immediately.
However, I'm concerned that doing so might corrupt the larger index
if the larger crawl happens to be in its indexing or merging step at
the same time as the smaller crawl's merge.

Are there protections against this? If this isn't advised, is there a
better way to have two separate crawls running at once that feed the
same index?
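
To make the concern concrete, here is a minimal sketch of the kind of
protection in question, done outside of Nutch: both crawl scripts take
an exclusive advisory lock on a shared file before running any step
that writes to the main index. It assumes both crawls run on the same
Unix machine and are driven by scripts. The lock path and the helper
names (index_lock, is_locked, run_exclusive) are hypothetical and not
part of Nutch, and this says nothing about what locking Nutch or
Lucene already do internally.

```python
import fcntl
import os
import subprocess
from contextlib import contextmanager


@contextmanager
def index_lock(lock_path):
    """Hold an exclusive advisory lock on lock_path for the with-block.

    Blocks until any other holder (e.g. the other crawl script) releases it.
    """
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)
        yield
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)


def is_locked(lock_path):
    """Return True if another open file description holds the lock."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        try:
            # Non-blocking attempt: fails at once if the lock is held.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return True
        fcntl.flock(fd, fcntl.LOCK_UN)
        return False
    finally:
        os.close(fd)


def run_exclusive(lock_path, cmd):
    """Run cmd (an argv list) while holding the index lock.

    cmd is whatever index-writing step the caller's script uses; it is
    passed in by the caller and not assumed here.
    """
    with index_lock(lock_path):
        return subprocess.run(cmd, check=True).returncode
```

Each script would wrap its index and merge steps in run_exclusive with
the same lock path, so the small crawl's merge waits if the large crawl
is mid-merge, and vice versa. The lock is advisory only: it protects
nothing unless every process that writes to the index uses it.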

-Brian