Posted to dev@lucene.apache.org by "Robert Muir (Created) (JIRA)" <ji...@apache.org> on 2012/03/04 20:40:59 UTC

[jira] [Created] (LUCENE-3848) basetokenstreamtestcase should fail if tokenstream starts with posinc=0

basetokenstreamtestcase should fail if tokenstream starts with posinc=0
-----------------------------------------------------------------------

Key: LUCENE-3848
URL: https://issues.apache.org/jira/browse/LUCENE-3848
Project: Lucene - Java
Issue Type: Bug
Reporter: Robert Muir
Fix For: 3.6, 4.0

It is meaningless for a tokenstream to start with posinc=0.

It has also caused problems and hairiness in the indexer (LUCENE-1255, LUCENE-1542),
and it makes senseless tokenstreams. We should add a check and fix any that do this.
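
A minimal sketch of what such a check could look like. This is not the code from the patch: the Tok class and the checkPositionIncrements method are hypothetical stand-ins that model a tokenstream as a plain list, so the example runs without Lucene on the classpath.

```java
import java.util.Arrays;
import java.util.List;

final class FirstPosIncCheck {
    // Hypothetical stand-in for a token: a term plus its position increment.
    static final class Tok {
        final String term;
        final int posInc;
        Tok(String term, int posInc) { this.term = term; this.posInc = posInc; }
    }

    // Fails if any increment is negative or if the stream starts with posInc=0.
    static void checkPositionIncrements(List<Tok> tokens) {
        for (int i = 0; i < tokens.size(); i++) {
            Tok t = tokens.get(i);
            if (t.posInc < 0) {
                throw new AssertionError("negative posInc for '" + t.term + "'");
            }
            if (i == 0 && t.posInc == 0) {
                throw new AssertionError("first token '" + t.term + "' has posInc=0");
            }
        }
    }

    public static void main(String[] args) {
        // A stacked synonym after the first token is fine.
        checkPositionIncrements(Arrays.asList(new Tok("fast", 1), new Tok("quick", 0)));
        try {
            // A stream that starts with posInc=0 is rejected.
            checkPositionIncrements(Arrays.asList(new Tok("quick", 0), new Tok("fox", 1)));
        } catch (AssertionError e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

A posInc of 0 after the first token stays legal here, since that is how stacked tokens such as synonyms are expressed.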

Furthermore, the same bug can exist in removing-filters if they have enablePositionIncrements=false.
I think this option is useful, but it shouldn't mean 'allow broken tokenstream'; it just means we
don't add gaps.

If you remove tokens with enablePositionIncrements=false, it should not cause the TS to start with
positionincrement=0, and it shouldn't 'restructure' the tokenstream (e.g. moving synonyms on top of a different word).
It should just not add any 'holes'.
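
One way to read the behavior described above, as a rough sketch only. The NoHolesRemover class, its Tok type and the remove method are hypothetical and work on a plain list rather than a real TokenFilter. The assumption is that a kept token with posInc=0 stays stacked only when the token it was stacked on was also kept; otherwise it is promoted to posInc=1.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class NoHolesRemover {
    // Hypothetical stand-in for a token: a term plus its position increment.
    static final class Tok {
        final String term;
        final int posInc;
        Tok(String term, int posInc) { this.term = term; this.posInc = posInc; }
        @Override public String toString() { return term + "/" + posInc; }
    }

    // Removes stop words without adding holes, without starting at posInc=0,
    // and without stacking a token on top of a word it was not stacked on before.
    static List<Tok> remove(List<Tok> input, Set<String> stopWords) {
        List<Tok> out = new ArrayList<>();
        int pos = -1;          // absolute position in the original stream
        int lastKeptPos = -1;  // original position of the last kept token
        boolean keptAny = false;
        for (Tok t : input) {
            pos += t.posInc;
            if (stopWords.contains(t.term)) {
                continue;      // dropped; its increment is not carried over (no hole)
            }
            int inc = t.posInc;
            // A stacked token whose position was not kept must open a new position.
            if (inc == 0 && (!keptAny || pos != lastKeptPos)) {
                inc = 1;
            }
            out.add(new Tok(t.term, inc));
            lastKeptPos = pos;
            keptAny = true;
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("the"));
        // "der" is stacked on the removed word "the"; it must not land on "a".
        List<Tok> in = Arrays.asList(
            new Tok("a", 1), new Tok("the", 1), new Tok("der", 0), new Tok("fox", 1));
        System.out.println(remove(in, stop));
    }
}
```

Increments greater than 1 that are already present on kept tokens pass through unchanged in this sketch; only the increments of removed tokens are dropped.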

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Commented] (LUCENE-3848) basetokenstreamtestcase should fail if tokenstream starts with posinc=0

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13230510#comment-13230510 ] 

Robert Muir commented on LUCENE-3848:
-------------------------------------

I think this is ready to go in; I'll wait a bit.

I didn't make any changes re: "graph-restructuring". I still feel we should fix this, but it means
dealing with backwards compatibility, etc.

The changes in this patch are backwards compatible, in the sense that consumers are already correcting
'initial posInc=0' to posInc=1 anyway.
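
To illustrate why this is backwards compatible, here is a small sketch of a lenient consumer. It assumes the consumer simply treats a leading increment of 0 as 1; the actual indexer code is not shown in this thread, and the LenientPositions class and absolutePositions method are made up for the example.

```java
import java.util.Arrays;

final class LenientPositions {
    // Turns position increments into absolute positions, treating a leading
    // increment of 0 as 1 (the correction assumed for this sketch).
    static int[] absolutePositions(int[] posIncs) {
        int[] positions = new int[posIncs.length];
        int pos = -1;
        for (int i = 0; i < posIncs.length; i++) {
            int inc = posIncs[i];
            if (i == 0 && inc == 0) {
                inc = 1;
            }
            pos += inc;
            positions[i] = pos;
        }
        return positions;
    }

    public static void main(String[] args) {
        // Both streams end up at the same positions.
        System.out.println(Arrays.toString(absolutePositions(new int[] {0, 1, 0})));
        System.out.println(Arrays.toString(absolutePositions(new int[] {1, 1, 0})));
    }
}
```

Under that assumption a stream starting with posInc=0 and the same stream starting with posInc=1 yield identical positions, so fixing the producer does not change what such a consumer sees.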

                

[jira] [Resolved] (LUCENE-3848) basetokenstreamtestcase should fail if tokenstream starts with posinc=0

Posted by "Robert Muir (Resolved) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir resolved LUCENE-3848.
---------------------------------

       Resolution: Fixed
    Fix Version/s: 3.6

I opened LUCENE-3873 to integrate MockGraphTokenFilter into tests.
                

[jira] [Commented] (LUCENE-3848) basetokenstreamtestcase should fail if tokenstream starts with posinc=0

Posted by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13230341#comment-13230341 ] 

Michael McCandless commented on LUCENE-3848:
--------------------------------------------

+1

                

[jira] [Updated] (LUCENE-3848) basetokenstreamtestcase should fail if tokenstream starts with posinc=0

Posted by "Robert Muir (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3848:
--------------------------------

    Attachment: LUCENE-3848.patch

Updated patch: I think it's ready to commit.

I didn't integrate Mike's nice MockGraphTokenFilter *yet*, but will do this under a separate issue: it's likely to expose a few bugs :)
                

[jira] [Updated] (LUCENE-3848) basetokenstreamtestcase should fail if tokenstream starts with posinc=0

Posted by "Robert Muir (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3848:
--------------------------------

    Fix Version/s:     (was: 3.6)
    

[jira] [Updated] (LUCENE-3848) basetokenstreamtestcase should fail if tokenstream starts with posinc=0

Posted by "Robert Muir (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3848:
--------------------------------

    Attachment: LUCENE-3848.patch

Patch fixing the bug in WikipediaTokenizer.

But I think we just don't have good tests for the removers.

Ideally, for tests I think we should have a simple 'MockSynonymsFilter' that is just stupid and slow and makes certain synonyms (maybe some multi-word) to use in testing.

Then we can write tests to find and fix the bugs in the removing-filters.
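
A rough idea of how simple such a mock could be, as a sketch only. No MockSynonymsFilter exists in this thread beyond the name; the MockSynonymsSketch class, its Tok type and the expand method below are hypothetical, operate on a plain list, and handle single-word synonyms only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class MockSynonymsSketch {
    // Hypothetical stand-in for a token: a term plus its position increment.
    static final class Tok {
        final String term;
        final int posInc;
        Tok(String term, int posInc) { this.term = term; this.posInc = posInc; }
        @Override public String toString() { return term + "/" + posInc; }
    }

    // Emits every input token unchanged, followed by its synonyms stacked
    // on the same position (posInc=0). Deliberately naive.
    static List<Tok> expand(List<Tok> input, Map<String, List<String>> synonyms) {
        List<Tok> out = new ArrayList<>();
        for (Tok t : input) {
            out.add(t);
            List<String> syns = synonyms.get(t.term);
            if (syns != null) {
                for (String s : syns) {
                    out.add(new Tok(s, 0));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> syns = new HashMap<>();
        syns.put("fast", Arrays.asList("quick", "speedy"));
        List<Tok> in = Arrays.asList(new Tok("fast", 1), new Tok("car", 1));
        System.out.println(expand(in, syns));
    }
}
```

Feeding the output of something like this into a removing-filter gives streams where the removed word has synonyms stacked on it, which is the case the description worries about.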
                

[jira] [Updated] (LUCENE-3848) basetokenstreamtestcase should fail if tokenstream starts with posinc=0

Posted by "Michael McCandless (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-3848:
---------------------------------------

    Attachment: LUCENE-3848-MockGraphTokenFilter.patch

Patch, adding a MockGraphTokenFilter we can use to randomly insert fake graph arcs...
                