Posted to dev@lucene.apache.org by "Robert Muir (Commented) (JIRA)" <ji...@apache.org> on 2012/03/20 09:07:52 UTC

[jira] [Commented] (LUCENE-3888) split off the spell check word and surface form in spell check dictionary

    [ https://issues.apache.org/jira/browse/LUCENE-3888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13233291#comment-13233291 ] 

Robert Muir commented on LUCENE-3888:
-------------------------------------

Koji: hmm I think the problem is not in the Dictionary interface (which is actually ok),
but instead in the spellcheckers and suggesters themselves?

For spellchecking, I think we need to expose more analysis options in Spellchecker:
currently the analysis is effectively hardcoded to KeywordAnalyzer (it indexes NOT_ANALYZED).
Instead I think you should be able to pass an Analyzer: we would also
have a TokenFilter for Japanese that replaces the term text with the reading from ReadingAttribute.

In the same way, the suggester can analyze its input too. (LUCENE-3842 already covers some work
for that, with the idea of supporting Japanese in exactly this way.)

So in short I think we should:
# create a TokenFilter (similar to BaseFormFilter) which copies ReadingAttribute into termAtt.
# refactor the 'n-gram analysis' in spellchecker to work on actual tokenstreams (this can
  also likely be implemented as tokenstreams), allowing the user to set an Analyzer on Spellchecker
  to control how it analyzes text.
# continue to work on 'analysis for suggest' like LUCENE-3842.
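Step 1 above could be sketched roughly as follows, mirroring how BaseFormFilter copies an attribute into the term. This is only an illustration: the class name ReadingFormFilter and the null-handling policy are my assumptions, not anything already in the codebase.

```java
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ja.tokenattributes.ReadingAttribute;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

/**
 * Sketch of a filter that overwrites the term text with the token's
 * reading from Kuromoji's ReadingAttribute, so downstream consumers
 * (spellchecker n-gramming, suggester) operate on readings instead of
 * surface forms.
 */
public final class ReadingFormFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final ReadingAttribute readingAtt = addAttribute(ReadingAttribute.class);

  public ReadingFormFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    String reading = readingAtt.getReading();
    // Unknown words may have no reading; leaving the surface form in place
    // is one possible policy here (an assumption, not a decided behavior).
    if (reading != null) {
      termAtt.setEmpty().append(reading);
    }
    return true;
  }
}
```

A user would then append this filter to the Japanese tokenizer's chain in whatever Analyzer they hand to the spellchecker or suggester.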

Note this use of analyzers in spellcheck/suggest is unrelated to Solr's current use of 'analyzers',
which is only for some query manipulation and not very useful.

                
> split off the spell check word and surface form in spell check dictionary
> -------------------------------------------------------------------------
>
>                 Key: LUCENE-3888
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3888
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/spellchecker
>            Reporter: Koji Sekiguchi
>            Priority: Minor
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3888.patch
>
>
> The "did you mean?" feature built on Lucene's spell checker unfortunately does not work well in a Japanese environment, and this is a longstanding problem: the logic needs comparatively long text to check spellings, but in some languages (e.g. Japanese), most words are too short for the spell checker to use.
> I think that, at least for Japanese, things can be improved if we split off the spell check word and the surface form in the spell check dictionary. Then we could use ReadingAttribute for spell checking but CharTermAttribute for suggesting, for example.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org