Posted to java-user@lucene.apache.org by Elmer van Chastelet <ev...@gmail.com> on 2012/12/23 17:55:50 UTC

Looking for efficient way to check if suggestion index is up to date

Hi all,

We're currently using Lucene 3.5.0 with the Spellchecker from the 
contrib module.

Our search engine uses a single index for multiple 'namespaces' (for 
example there is a namespace for each project).
Each document has the 'namespace'-field, enabling searching within a 
specific namespace.
We also do this for our suggestion indexes: for each namespace we create 
a spell check index using the Dictionary and Spellchecker classes.

At runtime we periodically check, say once an hour, whether these spell check 
indexes are up to date. This is where I would like to save resources.

We currently use a simple heuristic for this. For each namespace NS:

    if (completeSearchIndexReader.lastModified > NS_SpellIndexReader.lastModified) then
        recreate spell index for that NS
    end

This is a very simple but not very effective way to save the resources 
spent on recreating suggestion indexes.
It does save resources if nothing has changed in the complete search 
index. But when a document changes in namespace X, it triggers 
recreation of the suggestion indexes for /all/ namespaces.
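
To make the problem concrete, here is a rough Java sketch of the heuristic. 
It does not touch Lucene at all: the timestamps are plain long values, and 
the class and method names are made up for illustration only.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TimestampHeuristic {

    /**
     * Returns every namespace whose spell index is older than the
     * complete search index. Timestamps are epoch millis (assumption).
     */
    static List<String> namespacesToRebuild(long searchIndexLastModified,
                                            Map<String, Long> spellIndexLastModified) {
        List<String> stale = new ArrayList<String>();
        for (Map.Entry<String, Long> e : spellIndexLastModified.entrySet()) {
            if (searchIndexLastModified > e.getValue()) {
                stale.add(e.getKey());
            }
        }
        return stale;
    }

    public static void main(String[] args) {
        Map<String, Long> spell = new LinkedHashMap<String, Long>();
        spell.put("projectA", 1000L);
        spell.put("projectB", 1000L);

        // One document changes in projectA, which bumps the timestamp
        // of the single shared search index...
        long searchIndexLastModified = 2000L;

        // ...and every namespace is reported as stale.
        System.out.println(namespacesToRebuild(searchIndexLastModified, spell));
        // prints [projectA, projectB]
    }
}
```

Because the search index carries only one timestamp, the comparison cannot 
tell which namespace the change belonged to.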

What I've already tried:

  - Using org.apache.lucene.index.PKIndexSplitter.DocumentFilteredIndexReader 
to construct a 'namespace reader', from which I wanted to use the term 
enumerator for comparison. Unfortunately, this term enum does not 
respect the provided filter and returns all terms from the complete index.

Another idea I thought about:
  - At document addition/deletion for namespace N:  flag N to be a dirty 
namespace. Then at each periodic check, only renew spell check indexes 
for namespaces flagged dirty.
Unfortunately, this won't survive a redeploy/restart/kill of the web 
server, since we would keep this information in memory (we don't want huge 
overhead like writing to disk each time a document is added/removed). 
However, we can combine this with our current implementation (comparing 
modified timestamps) for the case where this information is not present 
(e.g. when the server was killed).
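
A minimal sketch of what such a dirty-flag tracker could look like in plain 
Java. The class and method names are hypothetical; the indexing code would 
have to call markDirty itself, and the periodic check would have to do the 
actual rebuilds.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

public class DirtyNamespaceTracker {

    // In-memory only: the flags are lost on restart, as described above.
    private final Set<String> dirty =
            Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());

    // True until the first periodic check after startup has run.
    private final AtomicBoolean firstCheck = new AtomicBoolean(true);

    /** Call whenever a document is added to or removed from a namespace. */
    public void markDirty(String namespace) {
        dirty.add(namespace);
    }

    /**
     * Returns true exactly once after construction. Changes made before
     * the restart were never flagged, so the caller should fall back to
     * the timestamp comparison for that first check.
     */
    public boolean needsTimestampFallback() {
        return firstCheck.getAndSet(false);
    }

    /**
     * Returns the namespaces to rebuild and clears their flags. A change
     * that arrives during a rebuild flags the namespace again, so it is
     * picked up by the next periodic check.
     */
    public Set<String> drainDirty() {
        Set<String> snapshot = new HashSet<String>();
        for (String ns : dirty) {
            if (dirty.remove(ns)) {
                snapshot.add(ns);
            }
        }
        return snapshot;
    }
}
```

The periodic check would first ask needsTimestampFallback(); if that returns 
true it runs the existing timestamp heuristic for all namespaces, otherwise 
it rebuilds only what drainDirty() returns.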

Does anyone know a better way to '/check if a document has changed* 
in a subset of all documents of an index, where the subset can be 
expressed by a query namespaceField:namespaceValue/'?
*changed means added/removed after some timestamp (= the moment the 
spell check index was recreated), or detected by comparing the terms in 
the subset with the terms in the spell check index.

Upgrading to Lucene 4.0 with DirectSpellChecker is currently out of our 
scope due to required refactorings for upgrading other dependencies that 
depend on Lucene.

Thanks in advance!

--Elmer van Chastelet