Posted to dev@lucene.apache.org by "Robert Muir (JIRA)" <ji...@apache.org> on 2012/05/17 15:41:07 UTC

[jira] [Created] (LUCENE-4065) FilteringTokenFilter should never corrupt the tokenstream graph

Robert Muir created LUCENE-4065:
-----------------------------------

             Summary: FilteringTokenFilter should never corrupt the tokenstream graph
                 Key: LUCENE-4065
                 URL: https://issues.apache.org/jira/browse/LUCENE-4065
             Project: Lucene - Java
          Issue Type: Bug
          Components: modules/analysis
            Reporter: Robert Muir
         Attachments: LUCENE-4065_test.patch

Currently, token removers like StopFilter have an option (true/false) to enable position increments.

If it's true: it both inserts gaps where necessary AND propagates gaps down the stream.
If it's false: it does neither, which can totally mess up the tokenstream graph (e.g. move synonyms to another word).

There are totally valid, natural use cases for false, where you don't want gaps because you want phrase queries to act as if the word was never actually there.

But 'not inserting gaps' is separate from proper propagation of existing gaps.

So I think we should provide an option (either fix 'false' or make it an enum) where you still get a legit tokenstream and don't totally screw it up, but you simply omit gaps.

See LUCENE-3848 for more information (where we at least fixed this case so that the tokenstream does not begin with posinc=0).


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Updated] (LUCENE-4065) FilteringTokenFilter should never corrupt the tokenstream graph

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-4065:
--------------------------------

    Attachment: LUCENE-4065_test.patch

Test case (boiled down from TestRandomChains).

A much simpler one could be made.

[jira] [Commented] (LUCENE-4065) FilteringTokenFilter should never corrupt the tokenstream graph

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-4065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13277805#comment-13277805 ] 

Robert Muir commented on LUCENE-4065:
-------------------------------------

Another way to see it:
Imagine I have 'my test case'
and a synonym set with a single mapping: test=example.

So SynonymFilter makes 'my test/example case', where 'example' has posinc=0.

If we have a StopFilter with posinc=false that has a single stopword (test),
then we end up with 'my/example case'.

But in my opinion this should be 'my example case': i.e. we should propagate
the posinc=1 of 'test' to 'example'. We aren't introducing a gap though, just preventing
insane graph corruption and restacking of synonyms.
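
The example above can be modeled in a few lines of standalone Java. This is only a rough sketch, not Lucene code and not the attached patch: the Token class, the method names (removeNaively, removeKeepingGraph, render) and the "pending" bookkeeping are all made up for illustration, and the rule used here (a kept token stacked on a removed token inherits that token's increment) is just one assumed reading of "propagate without inserting gaps".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class PosIncSketch {
    /** Hypothetical stand-in for a term plus its position increment. */
    static final class Token {
        final String term;
        final int posInc;
        Token(String term, int posInc) { this.term = term; this.posInc = posInc; }
    }

    /** Models the described posinc=false behavior: drop stopwords, adjust nothing. */
    static List<Token> removeNaively(List<Token> in, Set<String> stop) {
        List<Token> out = new ArrayList<>();
        for (Token t : in) {
            if (!stop.contains(t.term)) out.add(t);
        }
        return out;
    }

    /** Assumed fix: no gap is inserted, but a removed token's increment is handed
     *  to the first kept token that was stacked on it (posInc == 0). */
    static List<Token> removeKeepingGraph(List<Token> in, Set<String> stop) {
        List<Token> out = new ArrayList<>();
        int pending = 0; // increment of a removed token that opened a position
        for (Token t : in) {
            if (stop.contains(t.term)) {
                if (t.posInc > 0) pending = t.posInc; // removed token opened a new position
                continue;
            }
            if (t.posInc == 0 && pending > 0) {
                out.add(new Token(t.term, pending)); // stacked token takes over the position
            } else {
                out.add(t); // token opens its own position: the removed one just vanishes
            }
            pending = 0;
        }
        return out;
    }

    /** Renders stacked tokens (posInc == 0) with '/', everything else with a space. */
    static String render(List<Token> tokens) {
        StringBuilder sb = new StringBuilder();
        for (Token t : tokens) {
            if (sb.length() > 0) sb.append(t.posInc == 0 ? "/" : " ");
            sb.append(t.term);
        }
        return sb.toString();
    }

    /** 'my test/example case' as a synonym filter would produce it. */
    static List<Token> example() {
        return List.of(new Token("my", 1), new Token("test", 1),
                       new Token("example", 0), new Token("case", 1));
    }

    public static void main(String[] args) {
        Set<String> stop = Set.of("test");
        System.out.println(render(example()));                           // my test/example case
        System.out.println(render(removeNaively(example(), stop)));      // my/example case
        System.out.println(render(removeKeepingGraph(example(), stop))); // my example case
    }
}
```

Note that in this sketch a removed stopword with nothing stacked on it simply disappears (e.g. 'my the case' becomes 'my case' with posinc=1 on 'case'), so phrase queries still behave as if the word was never there.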
