Posted to dev@lucene.apache.org by "Pierre Gossé (JIRA)" <ji...@apache.org> on 2011/05/11 15:47:47 UTC

[jira] [Created] (LUCENE-3087) highlighting exact phrase with overlapping tokens fails.

highlighting exact phrase with overlapping tokens fails.
--------------------------------------------------------

Key: LUCENE-3087
URL: https://issues.apache.org/jira/browse/LUCENE-3087
Project: Lucene - Java
Issue Type: Bug
Components: contrib/highlighter
Affects Versions: 3.1, 2.9.4
Reporter: Pierre Gossé
Priority: Minor

Fields with overlapping tokens are not highlighted in search results when searching for exact phrases with TermVector.WITH_OFFSETS.

The document built in MemoryIndex for highlighting does not preserve token positions in this case. Overlapping tokens get "flattened" (the position increment is always set to 1), so the span query used to find the relevant fragment fails to identify the correct token sequence because of the position shift.

I corrected this by adding a position increment calculation in the StoredTokenStream class. I also added a JUnit test covering this case.

I used the Eclipse code style from trunk, but it introduces quite a few formatting differences between the repository and working-copy files. I tried to reduce them, but some line-wrapping rules still don't match.

Patch attached.
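
To illustrate the flattening described above, here is a minimal sketch in plain Java. It is not Lucene code: the class name, the helper methods and the sample tokens ("hi-speed rail" analyzed into "hi", "hispeed", "speed", "rail") are all assumptions made up for the example. It only shows how forcing every position increment to 1 shifts positions so that a phrase which matches with the real positions no longer matches.

```java
import java.util.Arrays;

public class FlattenedPositions {

    // Sum position increments into absolute positions. When flatten is true,
    // every increment is treated as 1, mimicking the reported behaviour.
    static int[] positions(int[] posIncrs, boolean flatten) {
        int[] out = new int[posIncrs.length];
        int pos = -1;
        for (int i = 0; i < posIncrs.length; i++) {
            pos += flatten ? 1 : posIncrs[i];
            out[i] = pos;
        }
        return out;
    }

    // True if some occurrence of term b sits exactly one position after term a.
    static boolean adjacent(String[] terms, int[] pos, String a, String b) {
        for (int i = 0; i < terms.length; i++) {
            for (int j = 0; j < terms.length; j++) {
                if (terms[i].equals(a) && terms[j].equals(b) && pos[j] == pos[i] + 1) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical analysis: "hispeed" overlaps "hi" (increment 0).
        String[] terms = {"hi", "hispeed", "speed", "rail"};
        int[] incrs = {1, 0, 1, 1};

        int[] real = positions(incrs, false);
        int[] flat = positions(incrs, true);

        // Real positions: the phrase "hi speed" is found.
        System.out.println(Arrays.toString(real) + " -> " + adjacent(terms, real, "hi", "speed"));
        // Flattened positions: "speed" is pushed one slot too far.
        System.out.println(Arrays.toString(flat) + " -> " + adjacent(terms, flat, "hi", "speed"));
    }
}
```

With the real increments the positions are [0, 0, 1, 2] and the phrase matches; flattened, they become [0, 1, 2, 3] and it does not.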

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

[jira] [Issue Comment Edited] (LUCENE-3087) highlighting exact phrase with overlapping tokens fails.

Posted by "Pierre Gossé (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13032421#comment-13032421 ] 

Pierre Gossé edited comment on LUCENE-3087 at 5/12/11 2:17 PM:
---------------------------------------------------------------

Thanks for taking a look at this, Michael.

In fact, I should be in the TermVector.WITH_POSITIONS_OFFSETS case, since I use these parameters in my Solr schema.xml:
<field name="..." type="..." indexed="true" stored="true" compressed="true" omitNorms="true" termVectors="true" termPositions="true" termOffsets="true"/>

Somehow, I end up in TokenSources with the argument tokenPositionsGuaranteedContiguous set to false, which falls back to using offsets instead of positions.

Maybe this is because of my overlapping tokens, maybe not; I'll have to take a couple of hours sometime to figure this out. At first sight, however, it seems this parameter is always set to false when calling TokenSources.getTokenStream with an IndexReader, because some code to use field infos is missing.

Some work to do here, maybe, sometime. :)

> highlighting exact phrase with overlapping tokens fails.
> --------------------------------------------------------
>
>                 Key: LUCENE-3087
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3087
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/highlighter
>    Affects Versions: 2.9.4, 3.1
>            Reporter: Pierre Gossé
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 3.2, 4.0
>
>         Attachments: LUCENE-3087.patch

[jira] [Issue Comment Edited] (LUCENE-3087) highlighting exact phrase with overlapping tokens fails.

Posted by "Pierre Gossé (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13032472#comment-13032472 ] 

Pierre Gossé edited comment on LUCENE-3087 at 5/12/11 3:59 PM:
---------------------------------------------------------------

Yes, that would be the best.

But I'm not sure how to do that:
- Check for positions in the token stream? Not sure it "guarantees" anything. :)
- Add some kind of additional properties to the TermFreqVector returned by IndexReader.getTermFreqVector(), since it already accesses field infos? Not sure it wouldn't have too much impact.
- Ask the index for field infos from TokenSources.getTokenStream? Not sure it is the right place, but it looks like the least dangerous option.

I don't have much time until the end of the month to take a serious look at this, but I'll try to take some time next month.


[jira] [Commented] (LUCENE-3087) highlighting exact phrase with overlapping tokens fails.

Posted by "Pierre Gossé (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13032421#comment-13032421 ] 

Pierre Gossé commented on LUCENE-3087:
--------------------------------------

Thanks for taking a look at this, Michael.

In fact, I should be in the TermVector.WITH_POSITIONS_OFFSETS case, since I use these parameters in my Solr schema.xml:
<field name="highlight_en" type="hst2-en" indexed="true" stored="true" compressed="true" omitNorms="true" termVectors="true" termPositions="true" termOffsets="true"/>

Somehow, I end up in TokenSources with the argument tokenPositionsGuaranteedContiguous set to false, which falls back to using offsets instead of positions.

Maybe this is because of my overlapping tokens, maybe not; I'll have to take a couple of hours sometime to figure this out. At first sight, however, it seems this parameter is always set to false when calling TokenSources.getTokenStream with an IndexReader, because some code to use field infos is missing.

Some work to do here, maybe, sometime. :)


[jira] [Commented] (LUCENE-3087) highlighting exact phrase with overlapping tokens fails.

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13032349#comment-13032349 ] 

Michael McCandless commented on LUCENE-3087:
--------------------------------------------

Thank you for the patch with test case, Pierre!

So this bug only applies if you store offsets but not positions in your term vectors (TermVector.WITH_OFFSETS).  Previously we would always set posIncr=1; now (with your patch) we set it to 0 if the offset didn't change vs. the prior token.  I think this seems reasonable -- we have to fall back on heuristics since positions were not stored.
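
As a rough illustration of that heuristic, here is a minimal sketch in plain Java. It is not the code from the patch: the class and method names are hypothetical, and it assumes that "the offset" means the start offset and that tokens arrive sorted by start offset.

```java
import java.util.Arrays;

public class OffsetHeuristic {

    // Guess position increments from start offsets alone: a token starting at
    // the same offset as the previous token is assumed to overlap it (0),
    // every other token is assumed to advance by one position (1).
    static int[] guessPosIncrs(int[] startOffsets) {
        int[] incrs = new int[startOffsets.length];
        for (int i = 0; i < startOffsets.length; i++) {
            boolean sameAsPrevious = i > 0 && startOffsets[i] == startOffsets[i - 1];
            incrs[i] = sameAsPrevious ? 0 : 1;
        }
        return incrs;
    }

    public static void main(String[] args) {
        // Hypothetical tokens for "hi-speed rail":
        // "hi" at 0, "hispeed" at 0, "speed" at 3, "rail" at 9.
        int[] startOffsets = {0, 0, 3, 9};
        System.out.println(Arrays.toString(guessPosIncrs(startOffsets)));
    }
}
```

Being a guess, this cannot recover position gaps (for example a removed stop word); it only avoids counting tokens that share a start offset as separate positions.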


[jira] [Commented] (LUCENE-3087) highlighting exact phrase with overlapping tokens fails.

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13032488#comment-13032488 ] 

Michael McCandless commented on LUCENE-3087:
--------------------------------------------

> We could try to improve that on a followup issue?

+1, I agree: progress not perfection!

So I'll commit this patch and then open a follow on issue...

Thanks Pierre!


[jira] [Commented] (LUCENE-3087) highlighting exact phrase with overlapping tokens fails.

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13032468#comment-13032468 ] 

Michael McCandless commented on LUCENE-3087:
--------------------------------------------

Ahh, interesting.

So... maybe we should fix the code so that if positions were in fact included in the term vectors, we use them?  Otherwise, we fall back to the offset check to guess at the posIncr?  Could that work?
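
The idea could be sketched roughly as follows, in plain Java. This is only an illustration under assumptions, not a proposal for the actual TokenSources code: the class and method names are hypothetical, a null positions array stands for "positions were not stored", and tokens are assumed to be sorted by position (or by start offset when positions are missing).

```java
import java.util.Arrays;

public class PosIncrSource {

    // Use stored positions when they are available; otherwise guess the
    // increment from start offsets (0 if unchanged from the prior token).
    static int[] posIncrs(int[] positions, int[] startOffsets) {
        int n = startOffsets.length;
        int[] incrs = new int[n];
        for (int i = 0; i < n; i++) {
            if (positions != null) {
                // Exact: difference to the previous position (-1 before the first token).
                int previous = (i == 0) ? -1 : positions[i - 1];
                incrs[i] = positions[i] - previous;
            } else {
                // Heuristic fallback based on offsets only.
                incrs[i] = (i > 0 && startOffsets[i] == startOffsets[i - 1]) ? 0 : 1;
            }
        }
        return incrs;
    }

    public static void main(String[] args) {
        int[] positions = {0, 0, 1, 3};      // overlap at 0, gap before the last token
        int[] startOffsets = {0, 0, 3, 13};

        // Positions stored: overlap and gap are both preserved.
        System.out.println(Arrays.toString(posIncrs(positions, startOffsets)));
        // Positions missing: the overlap is guessed, the gap is lost.
        System.out.println(Arrays.toString(posIncrs(null, startOffsets)));
    }
}
```

With positions the increments are [1, 0, 1, 2]; with offsets only they are [1, 0, 1, 1], which shows why stored positions should be preferred whenever they exist.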


[jira] [Updated] (LUCENE-3087) highlighting exact phrase with overlapping tokens fails.

Posted by "Pierre Gossé (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pierre Gossé updated LUCENE-3087:
---------------------------------

    Attachment: LUCENE-3087.patch

correction patch with junit tests


[jira] [Commented] (LUCENE-3087) highlighting exact phrase with overlapping tokens fails.

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13032476#comment-13032476 ] 

Robert Muir commented on LUCENE-3087:
-------------------------------------

> So... maybe we should fix the code so that if in fact positions were included in the TVs, we use them? Else, we fallback to the offset check to guess at the posIncr? Could that work?

But this patch is still good, right? We introduce this heuristic when positions are not available (or when the highlighter pretends they are not).

From my very vague understanding of highlighting, when overlapping positions or gaps in the position increment exist (tokenPositionsGuaranteedContiguous=false), the highlighter uses this algorithm intentionally (there are comments in the code indicating that the position-based algorithm would fail otherwise).

We could try to improve that on a followup issue?


[jira] [Assigned] (LUCENE-3087) highlighting exact phrase with overlapping tokens fails.

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless reassigned LUCENE-3087:
------------------------------------------

    Assignee: Michael McCandless


[jira] [Commented] (LUCENE-3087) highlighting exact phrase with overlapping tokens fails.

Posted by "Pierre Gossé (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13032472#comment-13032472 ] 

Pierre Gossé commented on LUCENE-3087:
--------------------------------------

Yes, that would be the best.

But I'm not sure how to do that:
- Check for positions in the token stream? Not sure it "guarantees" anything. :)
- Add some kind of additional properties to the token stream returned by IndexReader.getTermFreqVector, since it accesses field infos? Not sure it wouldn't have too much impact.
- Ask the index for field infos from TokenSources.getTokenStream? Not sure it is the right place, but it looks like the least dangerous option.

I don't have much time until the end of the month to take a serious look at this, but I'll try to take some time next month.


[jira] [Updated] (LUCENE-3087) highlighting exact phrase with overlapping tokens fails.

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-3087:
---------------------------------------

    Fix Version/s: 4.0
                   3.2


[jira] [Resolved] (LUCENE-3087) highlighting exact phrase with overlapping tokens fails.

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-3087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-3087.
----------------------------------------

    Resolution: Fixed


[jira] [Commented] (LUCENE-3087) highlighting exact phrase with overlapping tokens fails.

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-3087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13032514#comment-13032514 ] 

Michael McCandless commented on LUCENE-3087:
--------------------------------------------

OK I opened LUCENE-3091.
