Posted to dev@lucene.apache.org by "Robert Muir (Created) (JIRA)" <ji...@apache.org> on 2011/12/12 13:26:30 UTC

[jira] [Created] (LUCENE-3642) EdgeNgrams creates invalid offsets

EdgeNgrams creates invalid offsets
----------------------------------

                 Key: LUCENE-3642
                 URL: https://issues.apache.org/jira/browse/LUCENE-3642
             Project: Lucene - Java
          Issue Type: Bug
    Affects Versions: 3.5
            Reporter: Robert Muir
         Attachments: 6B2Uh.png

A user reported this because it was causing his highlighting to throw an error.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Commented] (LUCENE-3642) EdgeNgrams creates invalid offsets

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167544#comment-13167544 ] 

Robert Muir commented on LUCENE-3642:
-------------------------------------

Thanks Max, I am currently adding more tests/fixes for other broken tokenizers/filters with offset bugs.

I'll update the patch when these are passing, but I think the ngrams stuff is OK.
                
> EdgeNgrams creates invalid offsets
> ----------------------------------
>
>                 Key: LUCENE-3642
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3642
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5
>            Reporter: Robert Muir
>         Attachments: 6B2Uh.png, LUCENE-3642_ngrams.patch, LUCENE-3642_test.patch
>
>
> A user reported this because it was causing his highlighting to throw an error.


[jira] [Updated] (LUCENE-3642) EdgeNgrams creates invalid offsets

Posted by "Robert Muir (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3642:
--------------------------------

    Attachment: LUCENE-3642_test.patch

Here's a test.

The problem is that a previous filter 'lengthens' this term by folding æ -> ae, but EdgeNGramTokenFilter computes the offsets "additively": offsetAtt.setOffset(tokStart + start, tokStart + end);

Because of this, if a word has been 'lengthened' by a previous filter, EdgeNGram will produce offsets that extend past the end of the original text (and probably bogus ones if it's been shortened).

I think we should do what WDF (WordDelimiterFilter) does here: if the original offsets have already been changed (startOffset + termLength != endOffset), then we should simply preserve them for the new subwords.
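
To make the arithmetic concrete, here is a minimal standalone sketch of the two strategies. It does not use Lucene; the class and method names (EdgeNGramOffsets, additive, preserving) are hypothetical and only mirror the expressions quoted in this comment, so treat it as an illustration rather than the code in the patch.

```java
// Standalone illustration; no Lucene classes are used.
// All names here are hypothetical and exist only for this example.
final class EdgeNGramOffsets {

    // "Additive" offsets: assumes the term still has the same length
    // as the span of original text it came from.
    public static int[] additive(int tokStart, int gramStart, int gramEnd) {
        return new int[] { tokStart + gramStart, tokStart + gramEnd };
    }

    // WDF-style rule described above: if an earlier filter already changed
    // the term length (startOffset + termLength != endOffset), keep the
    // incoming token's offsets for every gram instead of computing new ones.
    public static int[] preserving(int tokStart, int tokEnd, int termLength,
                                   int gramStart, int gramEnd) {
        boolean offsetsChanged = tokStart + termLength != tokEnd;
        if (offsetsChanged) {
            return new int[] { tokStart, tokEnd };
        }
        return new int[] { tokStart + gramStart, tokStart + gramEnd };
    }

    public static void main(String[] args) {
        String original = "\u00E6on"; // "æon": 3 chars, token offsets 0..3
        String folded = "aeon";       // 4 chars after folding æ -> ae

        // Longest gram of the folded term, computed additively: end offset 4,
        // which is past the end of the 3-character original text.
        int[] bad = additive(0, 0, folded.length());

        // Same gram with the preserving rule: falls back to 0..3.
        int[] ok = preserving(0, original.length(), folded.length(),
                              0, folded.length());

        System.out.println("additive:   " + bad[0] + "-" + bad[1]);
        System.out.println("preserving: " + ok[0] + "-" + ok[1]);
    }
}
```

With the preserving rule, every gram of a lengthened term maps to the whole original token, which is less precise for highlighting but never points outside the text.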

I added a check for this to BaseTokenStreamTestCase... now to see if anything else fails...
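
The thread does not show the check itself. As a rough idea of what such a sanity check could look like, here is a hypothetical sketch (the class OffsetCheck and its validate method are invented for this example and are not part of Lucene): it rejects offsets that are negative, inverted, or beyond the length of the analyzed text.

```java
// Hypothetical helper, not the actual BaseTokenStreamTestCase code.
final class OffsetCheck {

    // Returns null when the offsets are plausible for a text of the given
    // length, otherwise a short description of the problem.
    public static String validate(int startOffset, int endOffset, int textLength) {
        if (startOffset < 0) {
            return "startOffset must be >= 0, got " + startOffset;
        }
        if (endOffset < startOffset) {
            return "endOffset " + endOffset + " is before startOffset " + startOffset;
        }
        if (endOffset > textLength) {
            return "endOffset " + endOffset + " is past the end of the text (" + textLength + ")";
        }
        return null;
    }

    public static void main(String[] args) {
        // Offsets 0..4 against a 3-character text: reported as invalid.
        System.out.println(validate(0, 4, 3));
        // Offsets 0..3 against the same text: prints "null" (no problem found).
        System.out.println(validate(0, 3, 3));
    }
}
```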
                
> EdgeNgrams creates invalid offsets
> ----------------------------------
>
>                 Key: LUCENE-3642
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3642
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5
>            Reporter: Robert Muir
>         Attachments: 6B2Uh.png, LUCENE-3642_test.patch
>
>
> A user reported this because it was causing his highlighting to throw an error.


[jira] [Commented] (LUCENE-3642) EdgeNgrams creates invalid offsets

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167585#comment-13167585 ] 

Robert Muir commented on LUCENE-3642:
-------------------------------------

Just looking, I see another bug in CharTokenizer... I'll add another test.
                
> EdgeNgrams creates invalid offsets
> ----------------------------------
>
>                 Key: LUCENE-3642
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3642
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>             Fix For: 3.6, 4.0
>
>         Attachments: 6B2Uh.png, LUCENE-3642.patch, LUCENE-3642.patch, LUCENE-3642_ngrams.patch, LUCENE-3642_test.patch
>
>
> A user reported this because it was causing his highlighting to throw an error.


[jira] [Commented] (LUCENE-3642) EdgeNgrams creates invalid offsets

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167491#comment-13167491 ] 

Robert Muir commented on LUCENE-3642:
-------------------------------------

So my assert trips for shit like WhitespaceTokenizer + lowercase... how horrible is that?

There must be offset bugs in CharTokenizer... I'll dig into it.
                
> EdgeNgrams creates invalid offsets
> ----------------------------------
>
>                 Key: LUCENE-3642
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3642
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5
>            Reporter: Robert Muir
>         Attachments: 6B2Uh.png, LUCENE-3642_test.patch
>
>
> A user reported this because it was causing his highlighting to throw an error.


[jira] [Updated] (LUCENE-3642) EdgeNgrams creates invalid offsets

Posted by "Robert Muir (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3642:
--------------------------------

    Attachment: LUCENE-3642.patch

Patch with tests and a fix for the additional bug in CharTokenizer.
                
> EdgeNgrams creates invalid offsets
> ----------------------------------
>
>                 Key: LUCENE-3642
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3642
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>             Fix For: 3.6, 4.0
>
>         Attachments: 6B2Uh.png, LUCENE-3642.patch, LUCENE-3642.patch, LUCENE-3642.patch, LUCENE-3642_ngrams.patch, LUCENE-3642_test.patch
>
>
> A user reported this because it was causing his highlighting to throw an error.


[jira] [Updated] (LUCENE-3642) EdgeNgrams creates invalid offsets

Posted by "Robert Muir (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3642:
--------------------------------

    Attachment: LUCENE-3642.patch

Here's the fix for CharTokenizer.

Tests are passing, I will commit soon.
                
> EdgeNgrams creates invalid offsets
> ----------------------------------
>
>                 Key: LUCENE-3642
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3642
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5
>            Reporter: Robert Muir
>             Fix For: 3.6, 4.0
>
>         Attachments: 6B2Uh.png, LUCENE-3642.patch, LUCENE-3642.patch, LUCENE-3642_ngrams.patch, LUCENE-3642_test.patch
>
>
> A user reported this because it was causing his highlighting to throw an error.


[jira] [Updated] (LUCENE-3642) EdgeNgrams creates invalid offsets

Posted by "Robert Muir (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3642:
--------------------------------

    Attachment: LUCENE-3642.patch

Updated patch with a test+fix for smartchinese, and with a test for CharTokenizer... it currently fails with an off-by-one (incorrect startOffset), which is in turn jacking up the endOffsets too.
                
> EdgeNgrams creates invalid offsets
> ----------------------------------
>
>                 Key: LUCENE-3642
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3642
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5
>            Reporter: Robert Muir
>         Attachments: 6B2Uh.png, LUCENE-3642.patch, LUCENE-3642_ngrams.patch, LUCENE-3642_test.patch
>
>
> A user reported this because it was causing his highlighting to throw an error.


[jira] [Resolved] (LUCENE-3642) EdgeNgrams creates invalid offsets

Posted by "Robert Muir (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir resolved LUCENE-3642.
---------------------------------

    Resolution: Fixed
    
> EdgeNgrams creates invalid offsets
> ----------------------------------
>
>                 Key: LUCENE-3642
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3642
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>             Fix For: 3.6, 4.0
>
>         Attachments: 6B2Uh.png, LUCENE-3642.patch, LUCENE-3642.patch, LUCENE-3642.patch, LUCENE-3642_ngrams.patch, LUCENE-3642_test.patch
>
>
> A user reported this because it was causing his highlighting to throw an error.


[jira] [Updated] (LUCENE-3642) EdgeNgrams creates invalid offsets

Posted by "Robert Muir (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3642:
--------------------------------

    Attachment: 6B2Uh.png

Screenshot from the user.
                
> EdgeNgrams creates invalid offsets
> ----------------------------------
>
>                 Key: LUCENE-3642
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3642
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5
>            Reporter: Robert Muir
>         Attachments: 6B2Uh.png
>
>
> A user reported this because it was causing his highlighting to throw an error.


[jira] [Commented] (LUCENE-3642) EdgeNgrams creates invalid offsets

Posted by "Max Beutel (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167516#comment-13167516 ] 

Max Beutel commented on LUCENE-3642:
------------------------------------

Robert, that patch for the EdgeNGramTokenFilter worked. If any problems occur, I'll let you know. Thanks!
                
> EdgeNgrams creates invalid offsets
> ----------------------------------
>
>                 Key: LUCENE-3642
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3642
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5
>            Reporter: Robert Muir
>         Attachments: 6B2Uh.png, LUCENE-3642_ngrams.patch, LUCENE-3642_test.patch
>
>
> A user reported this because it was causing his highlighting to throw an error.


[jira] [Commented] (LUCENE-3642) EdgeNgrams creates invalid offsets

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13167476#comment-13167476 ] 

Robert Muir commented on LUCENE-3642:
-------------------------------------

I thought up a hackish way we can test for these invalid offsets for all filters... I'll see if it works.
                
> EdgeNgrams creates invalid offsets
> ----------------------------------
>
>                 Key: LUCENE-3642
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3642
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5
>            Reporter: Robert Muir
>         Attachments: 6B2Uh.png
>
>
> A user reported this because it was causing his highlighting to throw an error.


[jira] [Updated] (LUCENE-3642) EdgeNgrams creates invalid offsets

Posted by "Robert Muir (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3642:
--------------------------------

    Attachment: LUCENE-3642_ngrams.patch

Here's a patch fixing the (edge)ngrams filters, using the same logic as WDF (it's well-defined, and I think it's the only thing we can do here).

Still need to fix the CharTokenizer bug, and also add some tests for any other "filters that are actually tokenizers" we might have.
                
> EdgeNgrams creates invalid offsets
> ----------------------------------
>
>                 Key: LUCENE-3642
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3642
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5
>            Reporter: Robert Muir
>         Attachments: 6B2Uh.png, LUCENE-3642_ngrams.patch, LUCENE-3642_test.patch
>
>
> A user reported this because it was causing his highlighting to throw an error.


[jira] [Updated] (LUCENE-3642) EdgeNgrams creates invalid offsets

Posted by "Robert Muir (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3642:
--------------------------------

    Fix Version/s: 4.0
                   3.6
         Assignee: Robert Muir
    
> EdgeNgrams creates invalid offsets
> ----------------------------------
>
>                 Key: LUCENE-3642
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3642
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 3.5
>            Reporter: Robert Muir
>            Assignee: Robert Muir
>             Fix For: 3.6, 4.0
>
>         Attachments: 6B2Uh.png, LUCENE-3642.patch, LUCENE-3642.patch, LUCENE-3642_ngrams.patch, LUCENE-3642_test.patch
>
>
> A user reported this because it was causing his highlighting to throw an error.
