Posted to user@nutch.apache.org by Brian Whitman <br...@variogr.am> on 2006/12/07 22:32:32 UTC
locks on merging indexes?
Using Nutch nightly/0.9 for 'whole-internet' crawl type applications,
we've got a process running that does a long (12hr or so) generate,
fetch, update, mergesegs, invert, index, merge loop. This is all
working fine.
I want to add another Nutch crawl on the same machine, covering a small
set of pages that update frequently. Some of these pages may be in the
larger crawl as well. The smaller crawl will run a few times a day
and will only take a few minutes to finish fetching.
I would like to merge the index from the smaller crawl with the main
larger index every time after the smaller crawl completes -- so that
results from the small crawls are pushed out to the searcher
immediately. However, I'm concerned that doing so might corrupt the
larger index if the larger crawl happens to be in its index or merge
step at the moment the smaller crawl tries to merge.
Are there protections against this? If it's not advised, is there a
better way to have two separate crawls running at once and feeding the
same index?
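
Failing any built-in protection, one workaround would be to make the two
crawl scripts take turns around their index and merge steps with a shared
advisory file lock. The following is only a minimal sketch of that idea in
Python, assuming both crawls run on the same Unix machine; the
`index_lock` helper and the lock path are hypothetical names, not part of
Nutch, and the sketch says nothing about what Nutch or Lucene lock
internally.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def index_lock(lock_path):
    """Hold an exclusive advisory lock on lock_path while the block runs.

    Hypothetical helper: both crawl scripts would have to use the same
    lock_path for the lock to have any effect.
    """
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        # Blocks until no other process holds the lock.
        fcntl.flock(fd, fcntl.LOCK_EX)
        yield
    finally:
        # Unlocking is harmless even if the lock was never acquired.
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)


def run_exclusive(lock_path, step):
    """Run step() (any callable) while holding the shared lock."""
    with index_lock(lock_path):
        return step()
```

Each script would wrap whatever it runs for its index and merge steps in
`run_exclusive`, so the small crawl's merge waits while the large crawl is
indexing or merging, and vice versa. The lock is advisory only: it
protects nothing unless every writer to the index goes through it.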
-Brian