Posted to java-user@lucene.apache.org by Elmer van Chastelet <ev...@gmail.com> on 2012/12/23 17:55:50 UTC
Looking for efficient way to check if suggestion index is up to date
Hi all,
We're currently using Lucene 3.5.0 with the Spellchecker from the
contrib module.
Our search engine uses a single index for multiple 'namespaces' (for
example there is a namespace for each project).
Each document has the 'namespace'-field, enabling searching within a
specific namespace.
We also do this for our suggestion indexes: for each namespace we create
a spell check index using the Dictionary and Spellchecker classes.
At runtime we periodically check, say once an hour, if these spell check
indexes are up to date. This is where I would like to save resources.
We currently use a simple heuristic for this. For each namespace NS:
    if (completeSearchIndexReader.lastModified >
            NS_SpellIndexReader.lastModified) then
        recreate spell index for that NS
    end
This is a very simple, but not very effective, way to save the
resources spent on recreating suggestion indexes.
It does save resources when nothing has changed in the complete search
index. But when a single document changes in namespace X, it triggers
recreation of the suggestion indexes for /all/ namespaces.
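
To make the problem concrete, here is a minimal sketch of the heuristic in plain Java. It does not touch Lucene: the timestamps are passed in as plain longs, and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: mirrors the timestamp heuristic above.
final class TimestampHeuristic {
    private TimestampHeuristic() {}

    /**
     * Returns every namespace whose spell index is older than the shared
     * search index. The search index has only one timestamp, so a change in
     * one namespace makes all older spell indexes look stale.
     */
    static List<String> namespacesToRebuild(long searchIndexLastModified,
                                            Map<String, Long> spellIndexLastModified) {
        List<String> stale = new ArrayList<String>();
        for (Map.Entry<String, Long> entry : spellIndexLastModified.entrySet()) {
            if (searchIndexLastModified > entry.getValue()) {
                stale.add(entry.getKey());
            }
        }
        return stale;
    }
}
```

With spell indexes for two namespaces built at the same moment, any later write to the search index returns both namespaces, regardless of which one the changed document belongs to.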
What I've already tried:
- Using
org.apache.lucene.index.PKIndexSplitter.DocumentFilteredIndexReader to
construct a 'namespace reader', from which I wanted to use the term
enumerator for comparison. Unfortunately, this term enum does not
respect the provided filter, and returns all terms from the complete index.
Another idea I thought about:
- At document addition/deletion for namespace N: flag N to be a dirty
namespace. Then at each periodic check, only renew spell check indexes
for namespaces flagged dirty.
Unfortunately, this won't survive a redeploy/restart/kill of the web
server, since we would keep this information in memory (we don't want
the overhead of writing to disk each time a document is added/removed).
However, we can combine this with our current implementation (comparing
modified timestamps) for the case where this information is not present
(e.g. when the server was killed).
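
A rough sketch of that combination, again in plain Java without Lucene calls. The class name, the method names and the idea of passing the two timestamps in as longs are assumptions made for illustration, not existing code.

```java
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory tracker: dirty flags, with a timestamp fallback after a restart.
final class DirtyNamespaceTracker {
    // Namespaces that saw an add/delete since their flag was last claimed.
    private final Set<String> dirty =
            Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());

    // False until the first periodic check after startup has finished,
    // because flags set before the restart were lost.
    private volatile boolean flagsTrusted = false;

    /** Call on every document addition/deletion for the given namespace. */
    void markDirty(String namespace) {
        dirty.add(namespace);
    }

    /**
     * Returns true if the spell index for this namespace should be rebuilt.
     * The flag is cleared before the rebuild starts, so a change that arrives
     * during the rebuild flags the namespace again for the next check.
     */
    boolean claimRebuild(String namespace,
                         long searchIndexLastModified,
                         long spellIndexLastModified) {
        boolean flagged = dirty.remove(namespace);
        if (flagsTrusted) {
            return flagged;
        }
        // Fallback: flags may be incomplete, so also compare timestamps.
        return flagged || searchIndexLastModified > spellIndexLastModified;
    }

    /** Call once the first full periodic check after startup is done. */
    void completeFallbackPass() {
        flagsTrusted = true;
    }
}
```

The first periodic check after a restart behaves like the current heuristic and may rebuild every namespace once. After that, only flagged namespaces are rebuilt.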
Does anyone know a better way to '/check if a document has changed* in
a subset of all documents of an index, where the subset can be
expressed by a query namespaceField:namespaceValue/'?
*'changed' means added/removed after some timestamp (= the moment the
spell check index was recreated), or detected by comparing the terms in
the subset against the terms in the spell check index.
Upgrading to Lucene 4.0 with DirectSpellChecker is currently out of our
scope due to required refactorings for upgrading other dependencies that
depend on Lucene.
Thanks in advance!
--Elmer van Chastelet