Posted to general@lucene.apache.org by Nalini Kartha <na...@gmail.com> on 2011/09/07 20:49:17 UTC

Lucene Spellchecker versions

Hi,

We want to implement some sort of spell correction for search and we're
looking at the Lucene Spellchecker for this.

We're still stuck on Lucene 2.2 though so it looks like the old version that
requires a separate dictionary index to be built is the only option - is
that correct?

For the k-gram based spell checker that requires the separate dictionary
index, is there any supported method for keeping the dictionary index in
sync with the original index, i.e. what's the best way to propagate
adds/deletes on the original index to the dictionary index? Do you recommend
just rebuilding the dictionary index at some regular interval?

I'm also trying to understand how the new spellchecker (FST based) works. My
understanding so far is that we build an FST from the term we're trying to
find corrections for and then I'm sort of fuzzy on how the FST is
intersected with the term dictionary.  Is there any detailed documentation
that explains this?

Also, are there changes to the term dictionary structure (post 2.2) that are
required to support FST based spell correction? If so, what exactly are
those changes? A presentation I saw alluded to Fast Numeric Range queries
introduced in Lucene 2.9 but I didn't quite understand how the two are
related.

Thanks in advance,
Nalini

Re: Lucene Spellchecker versions

Posted by Robert Muir <rc...@gmail.com>.
On Wed, Sep 7, 2011 at 2:49 PM, Nalini Kartha <na...@gmail.com> wrote:
>
> We're still stuck on Lucene 2.2 though so it looks like the old version that
> requires a separate dictionary index to be built is the only option - is
> that correct?

yes
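
(for reference, a rough sketch of driving the contrib SpellChecker
against its separate dictionary index with the 2.x-era API; untested,
and the paths and the "contents" field name here are made up:)

  import java.io.IOException;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.search.spell.LuceneDictionary;
  import org.apache.lucene.search.spell.SpellChecker;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.FSDirectory;

  public class SpellDemo {
    public static void main(String[] args) throws IOException {
      // build the separate n-gram dictionary index from one field
      // of the main index
      Directory spellDir =
          FSDirectory.getDirectory("/path/to/spellindex", true);
      IndexReader reader = IndexReader.open("/path/to/mainindex");
      SpellChecker spellChecker = new SpellChecker(spellDir);
      spellChecker.indexDictionary(
          new LuceneDictionary(reader, "contents"));

      // ask for up to 5 corrections for a misspelling
      String[] suggestions = spellChecker.suggestSimilar("lucenne", 5);
      for (String s : suggestions) {
        System.out.println(s);
      }
      reader.close();
    }
  }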

>
> For the k-gram based spell checker that requires the separate dictionary
> index, is there any supported method for keeping the dictionary index in
> sync with the original index, i.e. what's the best way to propagate
> adds/deletes on the original index to the dictionary index? Do you recommend
> just rebuilding the dictionary index at some regular interval?

yes, though I think how often you want to rebuild it depends upon
your content, how rapidly your terms change, etc.

note that in more recent lucene 3.x releases, the k-gram spellchecker
becomes more efficient to rebuild and uses less memory: e.g. you can
specify not to optimize(), and it now omits norms, positions, and
frequencies for its internal fields that don't require them.
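
(the rebuild itself is just wipe-and-refill; something along these
lines, scheduled at whatever interval suits you, where "contents" is
again just an example field name:)

  // rebuild the dictionary index so it tracks the main index;
  // clearIndex() empties the n-gram index in place
  void rebuildDictionary(SpellChecker spellChecker, IndexReader reader)
      throws IOException {
    spellChecker.clearIndex();
    spellChecker.indexDictionary(
        new LuceneDictionary(reader, "contents"));
  }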

>
> I'm also trying to understand how the new spellchecker (FST based) works. My
> understanding so far is that we build an FST from the term we're trying to
> find corrections for and then I'm sort of fuzzy on how the FST is
> intersected with the term dictionary.  Is there any detailed documentation
> that explains this?

it builds an FSA, actually; see
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652 for the
exact algorithm and the ugly details.

for the lucene implementation of how an FSA is intersected with the
term dictionary, you can see the code here:
http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/index/AutomatonTermsEnum.java
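
(as a toy illustration of the 'leapfrog' idea in that code: the term
dictionary and the automaton take turns seeking to the other's next
candidate. the Fsa interface and both its methods below are invented
for the sketch, they are not lucene API:)

  import java.util.ArrayList;
  import java.util.List;
  import java.util.TreeSet;

  // invented for the sketch; not lucene API
  interface Fsa {
    boolean accepts(String s);
    String nextAcceptedAtOrAfter(String s); // smallest accepted >= s, or null
  }

  class Leapfrog {
    static List<String> intersect(TreeSet<String> termDict, Fsa fsa) {
      List<String> hits = new ArrayList<String>();
      String smallest = fsa.nextAcceptedAtOrAfter("");
      String term =
          (smallest == null) ? null : termDict.ceiling(smallest);
      while (term != null) {
        if (fsa.accepts(term)) {
          hits.add(term);                  // matched: scan to next term
          term = termDict.higher(term);
        } else {
          String target = fsa.nextAcceptedAtOrAfter(term);
          if (target == null) break;       // automaton exhausted
          term = termDict.ceiling(target); // leap over the rejected gap
        }
      }
      return hits;
    }
  }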

Note, it's a little convoluted because it can also efficiently
intersect FSAs that accept an infinite language (e.g. arbitrary
regular expressions and wildcards), not just Levenshtein automata.
This is just the default 'intersection' implementation for any term
dictionary, but term dictionary implementations in lucene 4 can
override the intersection by implementing Terms.intersect(); one that
does this to improve performance is the BlockTree implementation:
https://issues.apache.org/jira/browse/LUCENE-3030
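
(on trunk the caller's side of that API looks roughly like this;
it assumes you already have an IndexReader open, and "contents" is a
made-up field name:)

  import org.apache.lucene.index.MultiFields;
  import org.apache.lucene.index.Terms;
  import org.apache.lucene.index.TermsEnum;
  import org.apache.lucene.util.BytesRef;
  import org.apache.lucene.util.automaton.Automaton;
  import org.apache.lucene.util.automaton.CompiledAutomaton;
  import org.apache.lucene.util.automaton.LevenshteinAutomata;

  // levenshtein automaton accepting everything within edit distance 2
  Automaton lev =
      new LevenshteinAutomata("lucenne", false).toAutomaton(2);
  CompiledAutomaton compiled = new CompiledAutomaton(lev);

  // the codec's term dictionary picks the intersection strategy:
  // the default walks AutomatonTermsEnum, BlockTree overrides it
  Terms terms = MultiFields.getTerms(reader, "contents");
  TermsEnum matches = terms.intersect(compiled, null); // null = from start
  for (BytesRef t = matches.next(); t != null; t = matches.next()) {
    System.out.println(t.utf8ToString());
  }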

>
> Also, are there changes to the term dictionary structure (post 2.2) that are
> required to support FST based spell correction? If so, what exactly are
> those changes? A presentation I saw alluded to Fast Numeric Range queries
> introduced in Lucene 2.9 but I didn't quite understand how the two are
> related.

theoretically, no; however, it's much easier and more efficient to
implement given API improvements to MultiTermQuery (Uwe's work with
NumericRangeQuery opened this up).
additionally: another reason why this stuff is a lot easier in 4.0 is
that terms in 4.0 are indexed (or at least "appear to be") in binary
order, whereas previous lucene versions indexed and presented them in
character (UTF-16) order.
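
(a quick JDK-only illustration of how the two orders disagree once
supplementary characters show up:)

  import java.nio.charset.StandardCharsets;

  public class OrderDemo {
    public static void main(String[] args) {
      String bmp = "\uFF61";                              // U+FF61, in the BMP
      String supp = new String(Character.toChars(0x10400)); // U+10400

      // UTF-16 code-unit order: the surrogate pair (0xD801 0xDC00)
      // sorts BEFORE 0xFF61, so supp < bmp
      System.out.println(bmp.compareTo(supp) > 0);        // true

      // binary (UTF-8 byte) order: supp starts with byte 0xF0, bmp
      // with 0xEF, so bmp < supp; the two orders flip
      byte[] a = bmp.getBytes(StandardCharsets.UTF_8);
      byte[] b = supp.getBytes(StandardCharsets.UTF_8);
      System.out.println((a[0] & 0xFF) < (b[0] & 0xFF));  // true
    }
  }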

a lot of the API changes necessary to get this stuff performing well
were fairly big, making them difficult/dangerous to backport to older
versions of lucene such as 3.x

-- 
lucidimagination.com