Posted to dev@lucene.apache.org by "Robert Muir (JIRA)" <ji...@apache.org> on 2010/09/28 19:45:35 UTC
[jira] Updated: (LUCENE-2507) automaton spellchecker
[ https://issues.apache.org/jira/browse/LUCENE-2507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Muir updated LUCENE-2507:
--------------------------------
Attachment: LUCENE-2507.patch
We have sped up this seeking a lot recently, and I improved the patch somewhat:
* Avoid calling docFreq on the suggestions, by using the TermsEnum's docFreq instead.
* Mike had the idea that we should actually try lower edit distances first. The
general use case here is a small number of suggestions (e.g. 1), so
we try edit distance=1 first, and only if this doesn't give enough suggestions
do we then try higher distances.
I think this is a good approach here, because we are computing Levenshtein distance directly,
so we don't have the problem the n-gram based spellchecker has (its javadoc is quoted below for reference):
{noformat}
* <p>As the Lucene similarity that is used to fetch the most relevant n-grammed terms
* is not the same as the edit distance strategy used to calculate the best
* matching spell-checked word from the hits that Lucene found, one usually has
* to retrieve a couple of numSug's in order to get the true best match.
*
* <p>I.e. if numSug == 1, don't count on that suggestion being the best one.
* Thus, you should set this value to <b>at least</b> 5 for a good suggestion.
{noformat}
Since we are actually computing Levenshtein distance, you can safely use lower values for numSug,
such as numSug=1.
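
To make the "lowest edit distance first" idea concrete, here is a minimal, self-contained sketch. It is not the code in LUCENE-2507.patch: the class name, the method signatures, and the in-memory map of term to docFreq are all assumptions made for illustration, and it scans every term by brute force where the patch relies on an automaton-driven terms enumeration against the index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; names and structure are not taken from the patch.
final class IncrementalDistanceSuggester {

    /** Two-row dynamic-programming Levenshtein distance. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /**
     * Returns up to numSug suggestions for word. Distance 1 is tried first;
     * higher distances are only tried if there are still too few suggestions.
     * Within one distance, terms with a higher docFreq come first.
     * termDocFreqs stands in for the index (term -> document frequency).
     */
    static List<String> suggest(Map<String, Integer> termDocFreqs, String word,
                                int numSug, int maxEdits) {
        List<String> result = new ArrayList<>();
        for (int d = 1; d <= maxEdits && result.size() < numSug; d++) {
            List<String> atDistance = new ArrayList<>();
            for (String term : termDocFreqs.keySet()) {
                if (levenshtein(word, term) == d) {
                    atDistance.add(term);
                }
            }
            // The docFreq is already at hand here, so no second lookup is needed.
            atDistance.sort((x, y) -> Integer.compare(termDocFreqs.get(y), termDocFreqs.get(x)));
            for (String term : atDistance) {
                if (result.size() == numSug) {
                    break;
                }
                result.add(term);
            }
        }
        return result;
    }
}
```

With numSug=1 the loop usually stops after the distance-1 pass, which is the common case described above; the more expensive higher-distance passes only run when the cheaper pass comes up short.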
> automaton spellchecker
> ----------------------
>
> Key: LUCENE-2507
> URL: https://issues.apache.org/jira/browse/LUCENE-2507
> Project: Lucene - Java
> Issue Type: New Feature
> Components: contrib/spellchecker
> Reporter: Robert Muir
> Fix For: 4.0
>
> Attachments: LUCENE-2507.patch, LUCENE-2507.patch, LUCENE-2507.patch
>
>
> The current spellchecker makes an n-gram index of your terms, and queries this for spellchecking.
> The terms that come back from the n-gram query are then re-ranked by an algorithm such as Levenshtein.
> Alternatively, we could just run a Levenshtein query directly against the index; then we wouldn't need
> a separate index to rebuild.