Posted to dev@lucene.apache.org by "Christopher Morris (JIRA)" <ji...@apache.org> on 2009/11/05 14:41:35 UTC

[jira] Created: (LUCENE-2035) TokenSources.getTokenStream() does not assign positionIncrement

TokenSources.getTokenStream() does not assign positionIncrement
---------------------------------------------------------------

Key: LUCENE-2035
URL: https://issues.apache.org/jira/browse/LUCENE-2035
Project: Lucene - Java
Issue Type: Bug
Components: contrib/highlighter
Affects Versions: 2.9, 2.4.1, 2.4
Reporter: Christopher Morris

TokenSources.StoredTokenStream does not assign positionIncrement information. This means that all tokens in the stream are considered adjacent. This has implications for the phrase highlighting in QueryScorer when using non-contiguous tokens.

For example:
Consider a token stream that creates tokens for both the stemmed and unstemmed versions of each word, emitted at the same position - the fox (jump|jumped)
When retrieved from the index using TokenSources.getTokenStream(tpv,false), the token stream will be - the fox jump jumped

Now try a search and highlight for the phrase query "fox jumped". The search will correctly find the document; the highlighter will fail to highlight the phrase because it thinks that there is an additional word between "fox" and "jumped". If we use the original (from the analyzer) token stream then the highlighter works.
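
The mismatch can be illustrated with a small, self-contained sketch that does not use Lucene at all. It assumes the usual convention that a token's position is the running sum of position increments, starting from -1, so a first token with increment 1 lands at position 0. The class and method names are hypothetical and are not taken from the highlighter or from the attached patch.

```java
final class PhraseAdjacencySketch {

    /** Turns per-token position increments into absolute positions. */
    static int[] positions(int[] increments) {
        int[] result = new int[increments.length];
        int position = -1; // assumed convention: the first increment of 1 yields position 0
        for (int i = 0; i < increments.length; i++) {
            position += increments[i];
            result[i] = position;
        }
        return result;
    }

    /** True if some occurrence of 'second' sits exactly one position after an occurrence of 'first'. */
    static boolean isExactPhrase(String[] terms, int[] increments, String first, String second) {
        int[] pos = positions(increments);
        for (int i = 0; i < terms.length; i++) {
            if (!terms[i].equals(first)) {
                continue;
            }
            for (int j = 0; j < terms.length; j++) {
                if (terms[j].equals(second) && pos[j] == pos[i] + 1) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] terms = {"the", "fox", "jump", "jumped"};
        int[] fromAnalyzer = {1, 1, 1, 0}; // "jumped" shares a position with "jump"
        int[] flattened = {1, 1, 1, 1};    // every token treated as adjacent to the previous one
        System.out.println(isExactPhrase(terms, fromAnalyzer, "fox", "jumped")); // true
        System.out.println(isExactPhrase(terms, flattened, "fox", "jumped"));    // false
    }
}
```

With the analyzer's increments, "fox" and "jumped" are adjacent; once every increment is flattened to 1, "jump" appears to sit between them.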

Also, consider the converse - the fox did not jump
"not" is a stop word and there is an option to increment the position to account for stop words - (the,0) (fox,1) (did,2) (jump,4)
When retrieved from the index using TokenSources.getTokenStream(tpv,false), the token stream will be - (the,0) (fox,1) (did,2) (jump,3).

So the phrase query "did jump" will cause the "did" and "jump" terms in the text "did not jump" to be highlighted, even though the two terms are not adjacent in the index (positions 2 and 4). If we use the original (from the analyzer) token stream then the highlighter works correctly.
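
One way to recover the missing information, sketched here in plain Java without any Lucene classes, is to derive each token's position increment from the absolute positions once the tokens are sorted by position. This is only an illustration under that assumption; the names are hypothetical and this is not the code from the attached patch.

```java
import java.util.Arrays;

final class PositionIncrementSketch {

    /**
     * Derives position increments from absolute positions that are already
     * sorted in ascending order. The first increment is relative to -1, so a
     * token at position 0 gets an increment of 1.
     */
    static int[] incrementsFromPositions(int[] sortedPositions) {
        int[] increments = new int[sortedPositions.length];
        int previous = -1;
        for (int i = 0; i < sortedPositions.length; i++) {
            increments[i] = sortedPositions[i] - previous;
            previous = sortedPositions[i];
        }
        return increments;
    }

    public static void main(String[] args) {
        // the fox (jump|jumped): both forms of the last word share position 2
        System.out.println(Arrays.toString(incrementsFromPositions(new int[] {0, 1, 2, 2}))); // [1, 1, 1, 0]
        // the fox did [not] jump: the removed stop word leaves a gap before "jump"
        System.out.println(Arrays.toString(incrementsFromPositions(new int[] {0, 1, 2, 4}))); // [1, 1, 1, 2]
    }
}
```

An increment of 0 keeps the stemmed and unstemmed forms at the same position, and an increment of 2 preserves the gap left by the stop word.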

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Assigned: (LUCENE-2035) TokenSources.getTokenStream() does not assign positionIncrement

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller reassigned LUCENE-2035:
-----------------------------------

    Assignee: Mark Miller

> TokenSources.getTokenStream() does not assign positionIncrement
> ---------------------------------------------------------------
>
>                 Key: LUCENE-2035
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2035
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/highlighter
>    Affects Versions: 2.4, 2.4.1, 2.9
>            Reporter: Christopher Morris
>            Assignee: Mark Miller
>             Fix For: 3.1
>
>         Attachments: LUCENE-2305.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h

[jira] Commented: (LUCENE-2035) TokenSources.getTokenStream() does not assign positionIncrement

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791939#action_12791939 ] 

Mark Miller commented on LUCENE-2035:
-------------------------------------

I'll commit this soon.

[jira] Resolved: (LUCENE-2035) TokenSources.getTokenStream() does not assign positionIncrement

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller resolved LUCENE-2035.
---------------------------------

    Resolution: Fixed

Thanks Christopher!

[jira] Commented: (LUCENE-2035) TokenSources.getTokenStream() does not assign positionIncrement

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791152#action_12791152 ] 

Mark Miller commented on LUCENE-2035:
-------------------------------------

Thanks for the tests and fix Christopher!

I've got one more patch coming and I'll commit in a few days.

I'm going to break the tests back out into a separate file again (on second thought, I think how you had it is a good idea) and remove an author tag. Then after one more review I think this is good to go in.

[jira] Commented: (LUCENE-2035) TokenSources.getTokenStream() does not assign positionIncrement

Posted by "Christopher Morris (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792410#action_12792410 ] 

Christopher Morris commented on LUCENE-2035:
--------------------------------------------

Cheers Mark,

The custom collector was probably because I was learning the new API at the time.

The only changes I've made since the patch I submitted were to initialise the ArrayList with tpv.getTerms().length because that represents the minimum size that the list will grow to, and to replace the List and Iterator fields with an array (derived from the list) and an integer pointer. Both of which are probably unnecessary.
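
The two changes described above might look roughly like the following plain-Java sketch. It is an assumption-laden illustration, not the patch itself: `Tok`, `ArrayBackedReplaySketch` and `nextToken` are hypothetical names, and the `terms` / `positionsPerTerm` arrays stand in for what a term vector with positions would provide.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class ArrayBackedReplaySketch {

    /** Hypothetical stand-in for a stored token: a term and its absolute position. */
    static final class Tok {
        final String term;
        final int position;

        Tok(String term, int position) {
            this.term = term;
            this.position = position;
        }
    }

    private final Tok[] tokens; // array derived from the list
    private int next = 0;       // integer pointer replacing an Iterator field

    ArrayBackedReplaySketch(String[] terms, int[][] positionsPerTerm) {
        // Every distinct term occurs at least once, so terms.length is a
        // lower bound on the final size of the list.
        List<Tok> collected = new ArrayList<>(terms.length);
        for (int t = 0; t < terms.length; t++) {
            for (int p : positionsPerTerm[t]) {
                collected.add(new Tok(terms[t], p));
            }
        }
        collected.sort(Comparator.comparingInt((Tok tok) -> tok.position));
        tokens = collected.toArray(new Tok[0]);
    }

    /** Returns the next token in position order, or null when exhausted. */
    Tok nextToken() {
        return next < tokens.length ? tokens[next++] : null;
    }

    public static void main(String[] args) {
        ArrayBackedReplaySketch replay = new ArrayBackedReplaySketch(
                new String[] {"fox", "the"}, new int[][] {{1}, {0, 2}});
        for (Tok t = replay.nextToken(); t != null; t = replay.nextToken()) {
            System.out.println(t.term + "@" + t.position);
        }
    }
}
```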

The tests could be improved - the first case could be fixed in its present form by using the Analyzer to generate the phrase query. If the stemmed word was the middle word of the phrase then that fix wouldn't work.

[jira] Updated: (LUCENE-2035) TokenSources.getTokenStream() does not assign positionIncrement

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller updated LUCENE-2035:
--------------------------------

    Attachment: LUCENE-2035.patch

[jira] Updated: (LUCENE-2035) TokenSources.getTokenStream() does not assign positionIncrement

Posted by "Christopher Morris (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Christopher Morris updated LUCENE-2035:
---------------------------------------

    Attachment: LUCENE-2305.patch

For the highlighter trunk

[jira] Updated: (LUCENE-2035) TokenSources.getTokenStream() does not assign positionIncrement

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller updated LUCENE-2035:
--------------------------------

    Attachment: LUCENE-2035.patch

I've broken the new tests back out into their own file, changed the hit collector code to basically just run a search, and improved the test coverage of TokenSources a bit.

[jira] Commented: (LUCENE-2035) TokenSources.getTokenStream() does not assign positionIncrement

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791680#action_12791680 ] 

Mark Miller commented on LUCENE-2035:
-------------------------------------

Hey Christopher, why are you going through the trouble of the custom collector to check that there are no hits? Why not just do a standard search?

[jira] Updated: (LUCENE-2035) TokenSources.getTokenStream() does not assign positionIncrement

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller updated LUCENE-2035:
--------------------------------

    Fix Version/s: 3.1
