
Posted to dev@lucene.apache.org by "Koji Sekiguchi (Created) (JIRA)" <ji...@apache.org> on 2012/03/20 08:09:44 UTC

[jira] [Created] (LUCENE-3888) split off the spell check word and surface form in spell check dictionary

split off the spell check word and surface form in spell check dictionary
-------------------------------------------------------------------------

                 Key: LUCENE-3888
                 URL: https://issues.apache.org/jira/browse/LUCENE-3888
             Project: Lucene - Java
          Issue Type: Improvement
          Components: modules/spellchecker
            Reporter: Koji Sekiguchi
            Priority: Minor
             Fix For: 3.6, 4.0


The "did you mean?" feature built on Lucene's spell checker unfortunately does not work well for Japanese, and this is a longstanding problem: the logic needs comparatively long text to check spelling, but in some languages (e.g. Japanese) most words are too short for the spell checker to use.

I think that, at least for Japanese, things can be improved if we split off the spell check word and surface form in the spell check dictionary. Then we could, for example, use ReadingAttribute for spell checking but CharTermAttribute for suggesting.
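To make the length problem concrete, here is a minimal sketch in plain Java (not Lucene code) that counts character bigrams. The class name NGramDemo, the fixed n-gram size of 2, and the example word are assumptions for illustration only; Lucene's SpellChecker picks its own n-gram sizes.

```java
import java.util.ArrayList;
import java.util.List;

class NGramDemo {
    // Returns all contiguous character n-grams of the given size.
    // Assumes the text contains only BMP characters (one char per character).
    static List<String> ngrams(String text, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= text.length(); i++) {
            grams.add(text.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        // Surface form: "Tokyo" written with two kanji.
        String surface = "\u6771\u4EAC";
        // Reading: the same word written with five katakana characters.
        String reading = "\u30C8\u30A6\u30AD\u30E7\u30A6";
        System.out.println(ngrams(surface, 2).size()); // 1 bigram
        System.out.println(ngrams(reading, 2).size()); // 4 bigrams
    }
}
```

The two-character surface form yields a single bigram, while the five-character reading yields four, which gives an n-gram based matcher more to work with.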

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3888) split off the spell check word and surface form in spell check dictionary

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13233291#comment-13233291 ] 

Robert Muir commented on LUCENE-3888:
-------------------------------------

Koji: hmm I think the problem is not in the Dictionary interface (which is actually ok),
but instead in the spellcheckers and suggesters themselves?

For spellchecking, I think we need to expose more Analysis options in Spellchecker:
currently this is actually hardcoded to KeywordAnalyzer (it uses NOT_ANALYZED).
Instead I think you should be able to pass an Analyzer: we would also
have a TokenFilter for Japanese that replaces the term text with the Reading from ReadingAttribute.

In the same way, suggest can analyze too. (LUCENE-3842 is already some work for that, especially
with the idea to support Japanese this exact same way).

So in short I think we should:
# create a TokenFilter (similar to BaseFormFilter) which copies ReadingAttribute into termAtt.
# refactor the 'n-gram analysis' in spellchecker to work on actual tokenstreams (this can
  also likely be implemented as tokenstreams), allowing the user to set an Analyzer on Spellchecker
  to control how it analyzes text.
# continue to work on 'analysis for suggest' like LUCENE-3842.

Note this use of analyzers in spellcheck/suggest is unrelated to Solr's current use of 'analyzers' 
which is only for some query manipulation and not very useful.
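The first step in the list above (copying the reading into the term text) can be sketched without Lucene as follows. This is only a plain-Java model of the idea, not the TokenFilter API: the Token class, its fields, and copyReadingIntoTerm are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ReadingFilterSketch {
    // Hypothetical token: term text plus an optional reading (null when unknown).
    static final class Token {
        final String term;
        final String reading;

        Token(String term, String reading) {
            this.term = term;
            this.reading = reading;
        }
    }

    // Replaces each token's term with its reading when one is available;
    // tokens without a reading keep their original term.
    static List<Token> copyReadingIntoTerm(List<Token> tokens) {
        List<Token> out = new ArrayList<>();
        for (Token t : tokens) {
            String term = (t.reading != null) ? t.reading : t.term;
            out.add(new Token(term, t.reading));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Token> in = Arrays.asList(
            // "Tokyo" in kanji, with its katakana reading
            new Token("\u6771\u4EAC", "\u30C8\u30A6\u30AD\u30E7\u30A6"),
            // a token for which no reading is known
            new Token("Lucene", null));
        for (Token t : copyReadingIntoTerm(in)) {
            System.out.println(t.term);
        }
    }
}
```

Falling back to the original term when no reading exists is a design choice made for this sketch; the thread does not say how such tokens should be handled.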

                


[jira] [Updated] (LUCENE-3888) split off the spell check word and surface form in spell check dictionary

Posted by "Robert Muir (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3888:
--------------------------------

    Attachment: LUCENE-3888.patch

fix the obvious reset() problem... the real problem is I need to reset() my coffee mug.
                


[jira] [Assigned] (LUCENE-3888) split off the spell check word and surface form in spell check dictionary

Posted by "Koji Sekiguchi (Assigned) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi reassigned LUCENE-3888:
--------------------------------------

    Assignee: Koji Sekiguchi
    


[jira] [Updated] (LUCENE-3888) split off the spell check word and surface form in spell check dictionary

Posted by "Koji Sekiguchi (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated LUCENE-3888:
-----------------------------------

    Fix Version/s:     (was: 3.6)

Thanks, Robert, for the patches and comments.

{quote}
The only option for 3.6 would be something like my previous patch
(https://issues.apache.org/jira/secure/attachment/12519860/LUCENE-3888.patch) which
has the disadvantages of doing the second-phase re-ranking on surface forms.
{quote}

With those disadvantages, the spell checker won't work well for Japanese anyway. I'm giving up on this for 3.6.
                


[jira] [Commented] (LUCENE-3888) split off the spell check word and surface form in spell check dictionary

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13238397#comment-13238397 ] 

Robert Muir commented on LUCENE-3888:
-------------------------------------

Thanks for the feedback Koji.

I'm not happy with the situation: I thought it would be easy to support
some rough Japanese spellcheck in 3.6.

But it just seems like we need to do a lot of cleanup to make it work.
I would rather fix all of these APIs and do it right the first time so
that things like distributed support work too.

                


[jira] [Commented] (LUCENE-3888) split off the spell check word and surface form in spell check dictionary

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237934#comment-13237934 ] 

Robert Muir commented on LUCENE-3888:
-------------------------------------

In my opinion we should set this as fix for 4.0.

The only option for 3.6 would be something like my previous patch 
(https://issues.apache.org/jira/secure/attachment/12519860/LUCENE-3888.patch) which 
has the disadvantages of doing the second-phase re-ranking on surface forms.

Any other opinions?
                


[jira] [Updated] (LUCENE-3888) split off the spell check word and surface form in spell check dictionary

Posted by "Koji Sekiguchi (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated LUCENE-3888:
-----------------------------------

    Attachment: LUCENE-3888.patch

I added a test for the surface analyzer. I also added code for the analyzer in Solr.

Currently, due to a classpath problem, the test cannot be compiled. I should dig in, but if someone else could take a look, it would be appreciated.

                


[jira] [Commented] (LUCENE-3888) split off the spell check word and surface form in spell check dictionary

Posted by "Christian Moen (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237457#comment-13237457 ] 

Christian Moen commented on LUCENE-3888:
----------------------------------------

This is excellent, Koji and Robert.  We should be able to do basic spellchecking for Japanese with this.
                


[jira] [Updated] (LUCENE-3888) split off the spell check word and surface form in spell check dictionary

Posted by "Koji Sekiguchi (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Koji Sekiguchi updated LUCENE-3888:
-----------------------------------

    Attachment: LUCENE-3888.patch

The patch does not compile yet because I changed the return type of the method in the Dictionary interface but have not yet updated the implementing classes.

Please comment, since I'm new to the spell checker. If there are no objections, I'll continue working on this.
                


[jira] [Commented] (LUCENE-3888) split off the spell check word and surface form in spell check dictionary

Posted by "Koji Sekiguchi (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237458#comment-13237458 ] 

Koji Sekiguchi commented on LUCENE-3888:
----------------------------------------

The test itself is not good.
                


[jira] [Commented] (LUCENE-3888) split off the spell check word and surface form in spell check dictionary

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13237614#comment-13237614 ] 

Robert Muir commented on LUCENE-3888:
-------------------------------------

lemme see if I can help with the test. I feel bad I didn't supply one with the prototype patch.

About the Solr integration: this looks good! We can use a similar approach for autosuggest, too,
so this could configure the analyzer for LUCENE-3842.

I wonder if we should allow separate configuration of "index" and "query" analyzers? I know
I came up with some use-cases for that for autosuggest, but I'm not sure about spellchecking.
I guess it wouldn't be overkill to allow it though.
                


[jira] [Updated] (LUCENE-3888) split off the spell check word and surface form in spell check dictionary

Posted by "Robert Muir (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3888:
--------------------------------

    Attachment: LUCENE-3888.patch

Here is a simple prototype of what I was suggesting: it allows you to specify an Analyzer for SpellChecker.

This Analyzer converts the 'surface form' into 'analyzed form' at index and query time: at index-time it forms n-grams based on the analyzed form, but stores the surface form for retrieval.

At query-time we have a similar process: the docFreq() etc checks are done on the surface form, but the actual spellchecking on the analyzed form.

The default Analyzer is null which means do nothing, and the patch has no tests, refactoring, or any of that.
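The index-time and query-time flow described above can be modeled roughly as below. This is a plain-Java sketch under stated assumptions, not the code from the patch: AnalyzedSpellIndex, the Function standing in for an Analyzer, the fixed bigram size, and the toy reading table are all hypothetical, and ranking and the docFreq() checks are left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

class AnalyzedSpellIndex {
    // Stand-in for an Analyzer: maps a surface form to its analyzed form.
    private final Function<String, String> analyzer;
    // Maps each bigram of an analyzed form to the surface forms stored for it.
    private final Map<String, Set<String>> gramToSurfaces = new HashMap<>();

    AnalyzedSpellIndex(Function<String, String> analyzer) {
        this.analyzer = analyzer;
    }

    static List<String> bigrams(String s) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 2 <= s.length(); i++) {
            grams.add(s.substring(i, i + 2));
        }
        return grams;
    }

    // Index time: n-grams come from the analyzed form, the surface form is stored.
    void add(String surface) {
        for (String g : bigrams(analyzer.apply(surface))) {
            gramToSurfaces.computeIfAbsent(g, k -> new LinkedHashSet<>()).add(surface);
        }
    }

    // Query time: analyze the query, match on n-grams, return stored surface forms.
    Set<String> candidates(String query) {
        Set<String> result = new LinkedHashSet<>();
        for (String g : bigrams(analyzer.apply(query))) {
            Set<String> hits = gramToSurfaces.get(g);
            if (hits != null) {
                result.addAll(hits);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Toy reading table standing in for a morphological analyzer.
        Map<String, String> readings = new HashMap<>();
        readings.put("\u6771\u4EAC", "\u30C8\u30A6\u30AD\u30E7\u30A6"); // kanji -> katakana reading
        AnalyzedSpellIndex index = new AnalyzedSpellIndex(s -> readings.getOrDefault(s, s));
        index.add("\u6771\u4EAC");
        // Query typed in katakana with the final character missing.
        System.out.println(index.candidates("\u30C8\u30A6\u30AD\u30E7"));
    }
}
```

The query shares three bigrams with the stored reading, so the kanji surface form comes back as a candidate even though the query and the surface form have no characters in common.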

                


[jira] [Updated] (LUCENE-3888) split off the spell check word and surface form in spell check dictionary

Posted by "Robert Muir (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3888:
--------------------------------

    Attachment: LUCENE-3888.patch

updated patch (note with this one: Solr does not yet compile).

I went the route of trying to clean up these apis correctly: I think there are serious problems here.

The biggest violation is stuff like:
{code}
// convert to array string: 
// nocommit: why don't we just return SuggestWord[] with all the information?
// consumers such as Solr must be recomputing this stuff again?!
String[] list = new String[sugQueue.size()];
for (int i = sugQueue.size() - 1; i >= 0; i--) {
 list[i] = sugQueue.pop().getSurface();
}

return list;
{code}

DirectSpellChecker already returns all this data, and I think it's doing the right thing, but I think SpellChecker should be fixed. Even in the normal case, surely we are recomputing docFreq etc. on all the candidates, which is wasteful.

I'll keep plugging away but it seems like this will be a pretty serious refactoring (including e.g. distributed spellcheck refactoring) and difficult for 3.6.
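As a rough illustration of the alternative raised in the nocommit comment (returning full suggestion objects instead of flattening to String[]), here is a self-contained sketch. The Suggestion class and the drain method are hypothetical stand-ins, and java.util.PriorityQueue is used in place of the queue class from the patch.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

class SuggestionSketch {
    // Hypothetical suggestion that keeps the data computed during the search.
    static final class Suggestion {
        final String surface;
        final float score;
        final int freq;

        Suggestion(String surface, float score, int freq) {
            this.surface = surface;
            this.score = score;
            this.freq = freq;
        }
    }

    // Creates a min-heap ordered by score, so the weakest suggestion is polled first.
    static PriorityQueue<Suggestion> newQueue() {
        return new PriorityQueue<>(Comparator.comparingDouble((Suggestion s) -> s.score));
    }

    // Drains the queue into an array ordered best-first, keeping score and
    // frequency so that callers do not have to recompute them.
    static Suggestion[] drain(PriorityQueue<Suggestion> queue) {
        Suggestion[] list = new Suggestion[queue.size()];
        for (int i = list.length - 1; i >= 0; i--) {
            list[i] = queue.poll();
        }
        return list;
    }

    public static void main(String[] args) {
        PriorityQueue<Suggestion> queue = newQueue();
        queue.add(new Suggestion("lucene", 0.9f, 12));
        queue.add(new Suggestion("lucent", 0.7f, 3));
        for (Suggestion s : drain(queue)) {
            System.out.println(s.surface + " score=" + s.score + " freq=" + s.freq);
        }
    }
}
```

With this shape, a consumer such as Solr could read the score and frequency directly from each element rather than looking them up again.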

                


[jira] [Updated] (LUCENE-3888) split off the spell check word and surface form in spell check dictionary

Posted by "Robert Muir (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3888:
--------------------------------

    Attachment: LUCENE-3888.patch

I updated the patch and fixed Koji's test. It's passing, BUT there is a nocommit:
{code}
// nocommit: we need to fix SuggestWord to separate surface and analyzed forms.
// currently the 're-rank' is based on the surface forms!
spellChecker.setAccuracy(0F);
{code}

To explain how the patch currently works, using the Japanese case: the spellchecker has two phases:
* Phase 1: n-gram approximation phase. Here we generate an n-gram boolean query on the readings. This is working fine.
* Phase 2: re-rank phase. Here we take the candidates from Phase 1 and do a real comparison (e.g. Levenshtein) to give them their final score. The problem is that this currently uses the surface form!

I think phase 2 should re-rank based on the 'analyzed form' too. Inside the spellchecker itself, I don't think this is very difficult: when analyzed != surface, we just store it for later retrieval.
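
The two phases, with the re-rank done on the analyzed form as proposed, can be sketched roughly as follows. This is a self-contained illustration under stated assumptions, not the patch: `Entry`, `suggest`, `similarity` and `sampleDictionary` are hypothetical names, phase 1 is simplified to "shares at least one n-gram" instead of a real boolean query, and romanized readings stand in for the analyzed form to keep the comparison strings ASCII.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TwoPhaseSpellSketch {

    /** Hypothetical dictionary entry: surface form plus analyzed form (here, a reading). */
    static final class Entry {
        final String surface;
        final String analyzed;
        Entry(String surface, String analyzed) {
            this.surface = surface;
            this.analyzed = analyzed;
        }
    }

    /** All character n-grams of s. */
    static Set<String> ngrams(String s, int n) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= s.length(); i++) {
            grams.add(s.substring(i, i + n));
        }
        return grams;
    }

    /** Plain Levenshtein edit distance using two rolling rows. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = cur; cur = tmp;
        }
        return prev[b.length()];
    }

    /** Distance normalized to [0, 1], where 1 means identical. */
    static float similarity(String a, String b) {
        int max = Math.max(a.length(), b.length());
        return max == 0 ? 1f : 1f - (float) levenshtein(a, b) / max;
    }

    /** Returns surface forms, best first, ranked by similarity of the analyzed forms. */
    static List<String> suggest(String analyzedQuery, List<Entry> dict, int n, float minScore) {
        Set<String> queryGrams = ngrams(analyzedQuery, n);
        List<Entry> kept = new ArrayList<>();
        for (Entry e : dict) {
            // Phase 1: cheap approximation on n-grams of the analyzed form.
            if (Collections.disjoint(queryGrams, ngrams(e.analyzed, n))) continue;
            // Phase 2: real comparison, also on the analyzed form (not the surface form).
            if (similarity(analyzedQuery, e.analyzed) >= minScore) kept.add(e);
        }
        kept.sort((x, y) -> Float.compare(
                similarity(analyzedQuery, y.analyzed), similarity(analyzedQuery, x.analyzed)));
        List<String> surfaces = new ArrayList<>();
        for (Entry e : kept) surfaces.add(e.surface); // suggest the surface form
        return surfaces;
    }

    /** Tiny illustrative dictionary; the romanized readings are an assumption of this sketch. */
    static List<Entry> sampleDictionary() {
        return Arrays.asList(
                new Entry("東京", "toukyou"),
                new Entry("投稿", "toukou"),
                new Entry("京都", "kyouto"),
                new Entry("大阪", "oosaka"));
    }

    public static void main(String[] args) {
        // "toukyo" is a misspelled reading; candidates are ranked on readings only.
        System.out.println(suggest("toukyo", sampleDictionary(), 2, 0.5f));
    }
}
```

Comparing the two-character surface forms directly would leave the re-rank with almost nothing to measure, which is why the sketch scores on the longer analyzed form and only returns the surface form at the end.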

The problem is that the spellcheck comparison APIs such as SuggestWord don't even have any getters or setters, and present no way for me to migrate to surface+analyzed in any backwards-compatible way...

I'll think about this in the meantime. Maybe we should just break and clean up these APIs, since it's a contrib module and they are funky?

                
