You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Doron Cohen (JIRA)" <ji...@apache.org> on 2007/12/26 22:56:43 UTC

[jira] Created: (LUCENE-1101) Tokenizers should reset positionIncrement to 1 in their next(Token result)

Tokenizers should reset positionIncrement to 1 in their next(Token result) 
---------------------------------------------------------------------------

                 Key: LUCENE-1101
                 URL: https://issues.apache.org/jira/browse/LUCENE-1101
             Project: Lucene - Java
          Issue Type: Bug
    Affects Versions: 2.3
            Reporter: Doron Cohen
            Assignee: Doron Cohen
             Fix For: 2.3


Tokenizers which implement the reuse form of the next method:
    next(Token result) 
should reset the postionIncrement of the returned token to 1.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-1101) TokenStream.next(Token) reuse 'policy': calling Token.clear() should be responsibility of producer.

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doron Cohen updated LUCENE-1101:
--------------------------------

    Summary: TokenStream.next(Token) reuse 'policy': calling Token.clear() should be responsibility of producer.  (was: Tokenizers should reset positionIncrement to 1 in their next(Token result) )

> TokenStream.next(Token) reuse 'policy': calling Token.clear() should be responsibility of producer.
> ---------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1101
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1101
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 2.3
>            Reporter: Doron Cohen
>            Assignee: Doron Cohen
>             Fix For: 2.3
>
>         Attachments: lucene-1101.patch, lucene-1101.patch
>
>
> Tokenizers which implement the reuse form of the next method:
>     next(Token result) 
> should reset the postionIncrement of the returned token to 1.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1101) Tokenizers should reset positionIncrement to 1 in their next(Token result)

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12554533 ] 

Doron Cohen commented on LUCENE-1101:
-------------------------------------

{quote}
I think it's used for both tokenized and un-tokenized.... see line1319.
It seems redundant to call clear() in both the consumer (DocumentsWriter) and producer (Tokenizer).
{quote}

You're right again Yonik, I missed line 1319.

But I think it would be cleaner/safer to move the responsibility to clear() from consumers to producers.
(Producer being the deepest tokenstream in the call sequence, the one that would instantiate a new Token if it implemented next()).

Otherwise you get bugs like the one I had in testStopPositons() in the patch for LUCENE-1095: 
The test chains two stop filters:
* a = WhiltSpaceAnalyzer().
* f1 = StopFilter(a)
* f2 = StopFilter(f1)

Now the test itself calls next().
StopFilter implements only next(Token).
So this is the sequence we get:
* test call f2.next()
* TokenSteam next() calls t2.next(new Token())
* t2.next(r) calls t1.next(r) repeatedly (until r not stopped).
* t1.next(r) calls a.ts.next(r) repeatedly (until r not stopped).

The cause for the bug was that when t1 returns a token r, it may have set r's pos_incr to something other than1. But when t2 calls t1 again (because t2 stopped r), that pos_incr should have bean cleared to 1. Now this can also be fixed by changing StopFilter to clear() before every call to t1.next(r), except for the first time it calls ts.next(), because for the first call it can assume that its consumer already cleared r. Since most words are not stopped, this distinction between first call to successive calls is important, performance wise.

Now, this is a little complicated (and not only because of my writing style : - ) ), 
and so I think moving the clear() responsibility to the (deep most) producer would make things more simple and safe. (?)


> Tokenizers should reset positionIncrement to 1 in their next(Token result) 
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-1101
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1101
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 2.3
>            Reporter: Doron Cohen
>            Assignee: Doron Cohen
>             Fix For: 2.3
>
>         Attachments: lucene-1101.patch, lucene-1101.patch
>
>
> Tokenizers which implement the reuse form of the next method:
>     next(Token result) 
> should reset the postionIncrement of the returned token to 1.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Resolved: (LUCENE-1101) TokenStream.next(Token) reuse 'policy': calling Token.clear() should be responsibility of producer.

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doron Cohen resolved LUCENE-1101.
---------------------------------

    Resolution: Fixed

Committed, thanks for the reviews Yonik & Mike!

> TokenStream.next(Token) reuse 'policy': calling Token.clear() should be responsibility of producer.
> ---------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1101
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1101
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 2.3
>            Reporter: Doron Cohen
>            Assignee: Doron Cohen
>             Fix For: 2.3
>
>         Attachments: lucene-1101.patch, lucene-1101.patch, lucene-1101.patch
>
>
> Tokenizers which implement the reuse form of the next method:
>     next(Token result) 
> should reset the postionIncrement of the returned token to 1.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1101) Tokenizers should reset positionIncrement to 1 in their next(Token result)

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12554908 ] 

Michael McCandless commented on LUCENE-1101:
--------------------------------------------

{quote}
But I think it would be cleaner/safer to move the responsibility to clear() from consumers to producers.
{quote}

+1

So, the contract of next(Token) is: if you are the source of Tokens, you must call Token.clear() before setting the fields in it & returning it.

> Tokenizers should reset positionIncrement to 1 in their next(Token result) 
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-1101
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1101
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 2.3
>            Reporter: Doron Cohen
>            Assignee: Doron Cohen
>             Fix For: 2.3
>
>         Attachments: lucene-1101.patch, lucene-1101.patch
>
>
> Tokenizers which implement the reuse form of the next method:
>     next(Token result) 
> should reset the postionIncrement of the returned token to 1.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1101) Tokenizers should reset positionIncrement to 1 in their next(Token result)

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12554474 ] 

Doron Cohen commented on LUCENE-1101:
-------------------------------------

I think I checked that and found that Token.clear() is too strong to be invoked in those cases, but let me check again...

> Tokenizers should reset positionIncrement to 1 in their next(Token result) 
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-1101
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1101
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 2.3
>            Reporter: Doron Cohen
>            Assignee: Doron Cohen
>             Fix For: 2.3
>
>         Attachments: lucene-1101.patch
>
>
> Tokenizers which implement the reuse form of the next method:
>     next(Token result) 
> should reset the postionIncrement of the returned token to 1.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1101) Tokenizers should reset positionIncrement to 1 in their next(Token result)

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12554473 ] 

Yonik Seeley commented on LUCENE-1101:
--------------------------------------

Resetting tokens (including the position) is currently handled via Token.clear() before the consumer reuses a token.


> Tokenizers should reset positionIncrement to 1 in their next(Token result) 
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-1101
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1101
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 2.3
>            Reporter: Doron Cohen
>            Assignee: Doron Cohen
>             Fix For: 2.3
>
>         Attachments: lucene-1101.patch
>
>
> Tokenizers which implement the reuse form of the next method:
>     next(Token result) 
> should reset the postionIncrement of the returned token to 1.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1101) Tokenizers should reset positionIncrement to 1 in their next(Token result)

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12554520 ] 

Yonik Seeley commented on LUCENE-1101:
--------------------------------------

{quote}Currently Token.clear() is used only for un-tokenized fields in DocmentsWriter{quote}

I think it's used for both tokenized and un-tokenized.... see line1319.
It seems redundant to call clear() in both the consumer (DocumentsWriter) and producer (Tokenizer).



> Tokenizers should reset positionIncrement to 1 in their next(Token result) 
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-1101
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1101
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 2.3
>            Reporter: Doron Cohen
>            Assignee: Doron Cohen
>             Fix For: 2.3
>
>         Attachments: lucene-1101.patch, lucene-1101.patch
>
>
> Tokenizers which implement the reuse form of the next method:
>     next(Token result) 
> should reset the postionIncrement of the returned token to 1.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1101) Tokenizers should reset positionIncrement to 1 in their next(Token result)

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12554481 ] 

Yonik Seeley commented on LUCENE-1101:
--------------------------------------

In what cases?  The protocol we are currently using requires that the caller of next(token) provide a suitably initialized token (positionIncrement defaults to 1, payload removed, etc).

> Tokenizers should reset positionIncrement to 1 in their next(Token result) 
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-1101
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1101
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 2.3
>            Reporter: Doron Cohen
>            Assignee: Doron Cohen
>             Fix For: 2.3
>
>         Attachments: lucene-1101.patch
>
>
> Tokenizers which implement the reuse form of the next method:
>     next(Token result) 
> should reset the postionIncrement of the returned token to 1.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-1101) Tokenizers should reset positionIncrement to 1 in their next(Token result)

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doron Cohen updated LUCENE-1101:
--------------------------------

    Attachment: lucene-1101.patch

Simple patch fixing this.
Planing to commit this shortly if no objections.

> Tokenizers should reset positionIncrement to 1 in their next(Token result) 
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-1101
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1101
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 2.3
>            Reporter: Doron Cohen
>            Assignee: Doron Cohen
>             Fix For: 2.3
>
>         Attachments: lucene-1101.patch
>
>
> Tokenizers which implement the reuse form of the next method:
>     next(Token result) 
> should reset the postionIncrement of the returned token to 1.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-1101) TokenStream.next(Token) reuse 'policy': calling Token.clear() should be responsibility of producer.

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doron Cohen updated LUCENE-1101:
--------------------------------

    Attachment: lucene-1101.patch

Updated patch calling Token.clear() only by producers and javadocs updated as discussed.

> TokenStream.next(Token) reuse 'policy': calling Token.clear() should be responsibility of producer.
> ---------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1101
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1101
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 2.3
>            Reporter: Doron Cohen
>            Assignee: Doron Cohen
>             Fix For: 2.3
>
>         Attachments: lucene-1101.patch, lucene-1101.patch, lucene-1101.patch
>
>
> Tokenizers which implement the reuse form of the next method:
>     next(Token result) 
> should reset the postionIncrement of the returned token to 1.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-1101) Tokenizers should reset positionIncrement to 1 in their next(Token result)

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doron Cohen updated LUCENE-1101:
--------------------------------

    Attachment: lucene-1101.patch

Updated patch - Tokenizers now calling clear() as suggested.
This seems cleaner - Token manages its data members that should get cleared, so if more data members are added to Token in the future only Token.clear() needs to be updated - thanks Yonik!
All core tests pass.

> Tokenizers should reset positionIncrement to 1 in their next(Token result) 
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-1101
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1101
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 2.3
>            Reporter: Doron Cohen
>            Assignee: Doron Cohen
>             Fix For: 2.3
>
>         Attachments: lucene-1101.patch, lucene-1101.patch
>
>
> Tokenizers which implement the reuse form of the next method:
>     next(Token result) 
> should reset the postionIncrement of the returned token to 1.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1101) TokenStream.next(Token) reuse 'policy': calling Token.clear() should be responsibility of producer.

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12554980 ] 

Doron Cohen commented on LUCENE-1101:
-------------------------------------

Thanks Mike, I'll remove the call from DocumentsWriter and updated the javadocs accordingly.

> TokenStream.next(Token) reuse 'policy': calling Token.clear() should be responsibility of producer.
> ---------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1101
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1101
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 2.3
>            Reporter: Doron Cohen
>            Assignee: Doron Cohen
>             Fix For: 2.3
>
>         Attachments: lucene-1101.patch, lucene-1101.patch, lucene-1101.patch
>
>
> Tokenizers which implement the reuse form of the next method:
>     next(Token result) 
> should reset the postionIncrement of the returned token to 1.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Commented: (LUCENE-1101) Tokenizers should reset positionIncrement to 1 in their next(Token result)

Posted by "Doron Cohen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12554482 ] 

Doron Cohen commented on LUCENE-1101:
-------------------------------------

Currently Token.clear() is used only for un-tokenized fields in DocmentsWriter - Tokenizer implementations of next(Token) do not call it.
I think they can be modified to call it (instead of explicitly reseting just the pos-incr).
But since these methods already set the value for start-offset, calling these method might eat the speed-up gained by reusing tokens.
But then again, shouldn't tokenizers also reset the payload info? (seems wrong to assume there there's no payload in the input reusable token.)
So I guess the right thing to do is to call clear() in all toknizers (3 actually) - will work that path.



> Tokenizers should reset positionIncrement to 1 in their next(Token result) 
> ---------------------------------------------------------------------------
>
>                 Key: LUCENE-1101
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1101
>             Project: Lucene - Java
>          Issue Type: Bug
>    Affects Versions: 2.3
>            Reporter: Doron Cohen
>            Assignee: Doron Cohen
>             Fix For: 2.3
>
>         Attachments: lucene-1101.patch
>
>
> Tokenizers which implement the reuse form of the next method:
>     next(Token result) 
> should reset the postionIncrement of the returned token to 1.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org