You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Robert Muir (JIRA)" <ji...@apache.org> on 2010/09/28 19:45:35 UTC

[jira] Updated: (LUCENE-2507) automaton spellchecker

     [ https://issues.apache.org/jira/browse/LUCENE-2507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2507:
--------------------------------

    Attachment: LUCENE-2507.patch

we have sped up this seeking a lot recently, and i improved this patch some:
* avoid calling docfreq on the suggestions, by using the TermsEnum's docfreq
* Mike had the idea that we should actually try lower edit distances first. The 
  general use case here is a small number of suggestions (e.g. 1), so 
  we actually try edit distance=1 first... only if this doesn't give enough suggestions 
  do we then try higher distances. 

I think this is a good approach here, because we are getting levenshtein directly, 
so we don't have the problem the n-gram based spellchecker has... (for reference below)

{noformat}
   * <p>As the Lucene similarity that is used to fetch the most relevant n-grammed terms
   * is not the same as the edit distance strategy used to calculate the best
   * matching spell-checked word from the hits that Lucene found, one usually has
   * to retrieve a couple of numSug's in order to get the true best match.
   *
   * <p>I.e. if numSug == 1, don't count on that suggestion being the best one.
   * Thus, you should set this value to <b>at least</b> 5 for a good suggestion.
{noformat}

Since we are actually doing levenshtein, you can safely use lower values for numSug,
such as numSug=1


> automaton spellchecker
> ----------------------
>
>                 Key: LUCENE-2507
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2507
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/spellchecker
>            Reporter: Robert Muir
>             Fix For: 4.0
>
>         Attachments: LUCENE-2507.patch, LUCENE-2507.patch, LUCENE-2507.patch
>
>
> The current spellchecker makes an n-gram index of your terms, and queries this for spellchecking.
> The terms that come back from the n-gram query are then re-ranked by an algorithm such as Levenshtein.
> Alternatively, we could just do a levenshtein query directly against the index, then we wouldn't need
> a separate index to rebuild.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org