You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2008/11/11 10:28:44 UTC

[jira] Created: (LUCENE-1448) add getFinalOffset() to TokenStream

add getFinalOffset() to TokenStream
-----------------------------------

Key: LUCENE-1448
URL: https://issues.apache.org/jira/browse/LUCENE-1448
Project: Lucene - Java
Issue Type: Bug
Components: Analysis
Reporter: Michael McCandless
Assignee: Michael McCandless
Fix For: 2.9

If you add multiple Fieldable instances for the same field name to a document, and you then index those fields with TermVectors storing offsets, it's very likely the offsets for all but the first field instance will be wrong.

This is because IndexWriter under the hood adds a cumulative base to the offsets of each field instance, where that base is 1 + the endOffset of the last token it saw when analyzing that field.

But this logic is overly simplistic. For example, if the WhitespaceAnalyzer is being used, and the text being analyzed ended in 3 whitespace characters, then that information is lost and then next field's offsets are then all 3 too small. Similarly, if a StopFilter appears in the chain, and the last N tokens were stop words, then the base will be 1 + the endOffset of the last non-stopword token.

To fix this, I'd like to add a new getFinalOffset() to TokenStream. I'm thinking by default it returns -1, which means "I don't know so you figure it out", meaning we fallback to the faulty logic we have today.

This has come up several times on the user's list.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Issue Comment Edited: (LUCENE-1448) add getFinalOffset() to TokenStream

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734063#action_12734063 ] 

Uwe Schindler edited comment on LUCENE-1448 at 7/22/09 3:25 AM:
----------------------------------------------------------------

This is not the only problem with multiple Tees: The offsets are also completely mixed together, especially if the two tees feed into the sink at the same time (not after each other). In my opinion, the last call to end should be cached by the sink as end state (so if two tees add a end state to the sink, the second one overwrites the first one).

      was (Author: thetaphi):
    This is not the only problem with multiple Tees: The offsets are also completely mixed together, especially if the two tees feed into the sink at the same time (not after each other). In my opinion, the last call to end should be cached by the sink as end state (so if two tees add a end state to the tee, the second one overwrites the first one).
  
> add getFinalOffset() to TokenStream
> -----------------------------------
>
>                 Key: LUCENE-1448
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1448
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>            Reporter: Michael McCandless
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch
>
>
> If you add multiple Fieldable instances for the same field name to a document, and you then index those fields with TermVectors storing offsets, it's very likely the offsets for all but the first field instance will be wrong.
> This is because IndexWriter under the hood adds a cumulative base to the offsets of each field instance, where that base is 1 + the endOffset of the last token it saw when analyzing that field.
> But this logic is overly simplistic.  For example, if the WhitespaceAnalyzer is being used, and the text being analyzed ended in 3 whitespace characters, then that information is lost and then next field's offsets are then all 3 too small.  Similarly, if a StopFilter appears in the chain, and the last N tokens were stop words, then the base will be 1 + the endOffset of the last non-stopword token.
> To fix this, I'd like to add a new getFinalOffset() to TokenStream.  I'm thinking by default it returns -1, which means "I don't know so you figure it out", meaning we fallback to the faulty logic we have today.
> This has come up several times on the user's list.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1448) add getFinalOffset() to TokenStream

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12651999#action_12651999 ] 

Michael McCandless commented on LUCENE-1448:
--------------------------------------------

I'm torn on how to add getFinalOffset()/getFinalPositionIncrement().

One option is to add a set/getFinalOffset to OffsetAttribute.  The
downside here is it's another int added to that class, that gets
copied for caching streams yet is only used at the very end.

Another option is to "define" the API such that when incrementToken()
returns false, then it has actually advanced to an "end-of-stream
token".  OffsetAttribute.getEndOffset() should return the final
offset.  Since we have not released the new API, we could simply make
this change (and fix all instances in the core/contrib that use the
new API accordingly).  I think I like this option best.

Yet another option is to open up "per stream" attrs rather than "per
token attrs".  This seems like alot of added complexity.  Are there
other things, besides these two, that would be an example of a "per
stream" attr?

> add getFinalOffset() to TokenStream
> -----------------------------------
>
>                 Key: LUCENE-1448
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1448
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch
>
>
> If you add multiple Fieldable instances for the same field name to a document, and you then index those fields with TermVectors storing offsets, it's very likely the offsets for all but the first field instance will be wrong.
> This is because IndexWriter under the hood adds a cumulative base to the offsets of each field instance, where that base is 1 + the endOffset of the last token it saw when analyzing that field.
> But this logic is overly simplistic.  For example, if the WhitespaceAnalyzer is being used, and the text being analyzed ended in 3 whitespace characters, then that information is lost and then next field's offsets are then all 3 too small.  Similarly, if a StopFilter appears in the chain, and the last N tokens were stop words, then the base will be 1 + the endOffset of the last non-stopword token.
> To fix this, I'd like to add a new getFinalOffset() to TokenStream.  I'm thinking by default it returns -1, which means "I don't know so you figure it out", meaning we fallback to the faulty logic we have today.
> This has come up several times on the user's list.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1448) add getFinalOffset() to TokenStream

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12735213#action_12735213 ] 

Michael Busch commented on LUCENE-1448:
---------------------------------------

Note that my latest patch only contains fixes for the core TokenStreams.

I'll open a separate issue to implement end() for the contrib TokenStreams, which we can commit after LUCENE-1460 is resolved.

> add getFinalOffset() to TokenStream
> -----------------------------------
>
>                 Key: LUCENE-1448
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1448
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>            Reporter: Michael McCandless
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: lucene-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch
>
>
> If you add multiple Fieldable instances for the same field name to a document, and you then index those fields with TermVectors storing offsets, it's very likely the offsets for all but the first field instance will be wrong.
> This is because IndexWriter under the hood adds a cumulative base to the offsets of each field instance, where that base is 1 + the endOffset of the last token it saw when analyzing that field.
> But this logic is overly simplistic.  For example, if the WhitespaceAnalyzer is being used, and the text being analyzed ended in 3 whitespace characters, then that information is lost and then next field's offsets are then all 3 too small.  Similarly, if a StopFilter appears in the chain, and the last N tokens were stop words, then the base will be 1 + the endOffset of the last non-stopword token.
> To fix this, I'd like to add a new getFinalOffset() to TokenStream.  I'm thinking by default it returns -1, which means "I don't know so you figure it out", meaning we fallback to the faulty logic we have today.
> This has come up several times on the user's list.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1448) add getFinalOffset() to TokenStream

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660074#action_12660074 ] 

Michael McCandless commented on LUCENE-1448:
--------------------------------------------

See also LUCENE-579 which looks like a dup of this one.

Michael what's the game plan on this issue?  I think your EOSA approach makes sense...

> add getFinalOffset() to TokenStream
> -----------------------------------
>
>                 Key: LUCENE-1448
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1448
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>            Reporter: Michael McCandless
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch
>
>
> If you add multiple Fieldable instances for the same field name to a document, and you then index those fields with TermVectors storing offsets, it's very likely the offsets for all but the first field instance will be wrong.
> This is because IndexWriter under the hood adds a cumulative base to the offsets of each field instance, where that base is 1 + the endOffset of the last token it saw when analyzing that field.
> But this logic is overly simplistic.  For example, if the WhitespaceAnalyzer is being used, and the text being analyzed ended in 3 whitespace characters, then that information is lost and then next field's offsets are then all 3 too small.  Similarly, if a StopFilter appears in the chain, and the last N tokens were stop words, then the base will be 1 + the endOffset of the last non-stopword token.
> To fix this, I'd like to add a new getFinalOffset() to TokenStream.  I'm thinking by default it returns -1, which means "I don't know so you figure it out", meaning we fallback to the faulty logic we have today.
> This has come up several times on the user's list.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1448) add getFinalOffset() to TokenStream

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1448:
---------------------------------------

    Attachment: LUCENE-1448.patch

First cut patch.  It's not ready, though.

First: this patch only addresses the final offset, but shouldn't we also address the final position?  Eg if StopFilter removes the last few tokens, shouldn't we make it possible to report those skipped positonIncrements?  Note that Analyzer does provide a getPositionIncrementGap(String fieldName), which could be used as a workaround if we don't do this.  This does seem less important so I'd be OK with doing nothing about positions for now, but I'd like to hear other's takes.

Second: I can't figure out how to ask StandardTokenizerImpl for the number of chars it pulled from the Reader.  Can someone (who understands JFlex well) help out here?

Third: I haven't done all Tokenizers; eg, SinkTokenizer will need a new ctor so that it's told the final offset.  And I'd like to fix all contrib tokenizers too.

Fourth: I'm nervous about an off-by-one error here.  Here's the diff for DocInverterPerField:

{code}
@@ -161,7 +161,13 @@
                 break;
               }
             }
-            fieldState.offset = offsetEnd+1;
+
+            final int finalOffset = stream.getFinalOffset();
+            if (finalOffset == -1)
+              fieldState.offset = offsetEnd+1;
+            else
+              fieldState.offset += finalOffset;
+
           } finally {
             stream.close();
           }
{code}

What makes me nervous is: our current logic is to set the base for future instances of the same field to the last endOffset, plus 1.  Why do we have that plus 1?  Logically, I think it should be as if you had simply concatenated the strings together, which would not add an extra +1?  So, I'm not adding +1 in the getFinalOffset()'s that I've added, but I'm hoping someone can explain why we are adding +1 currently.

Note that endOffset is normally "exclusive" while startOffset is "inclusive".  Ie if a String starts with "abcd " then WhitespaceTokenizer would report a startOffset of 0 and endOffset of 4 for that first token.

> add getFinalOffset() to TokenStream
> -----------------------------------
>
>                 Key: LUCENE-1448
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1448
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1448.patch
>
>
> If you add multiple Fieldable instances for the same field name to a document, and you then index those fields with TermVectors storing offsets, it's very likely the offsets for all but the first field instance will be wrong.
> This is because IndexWriter under the hood adds a cumulative base to the offsets of each field instance, where that base is 1 + the endOffset of the last token it saw when analyzing that field.
> But this logic is overly simplistic.  For example, if the WhitespaceAnalyzer is being used, and the text being analyzed ended in 3 whitespace characters, then that information is lost and then next field's offsets are then all 3 too small.  Similarly, if a StopFilter appears in the chain, and the last N tokens were stop words, then the base will be 1 + the endOffset of the last non-stopword token.
> To fix this, I'd like to add a new getFinalOffset() to TokenStream.  I'm thinking by default it returns -1, which means "I don't know so you figure it out", meaning we fallback to the faulty logic we have today.
> This has come up several times on the user's list.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Issue Comment Edited: (LUCENE-1448) add getFinalOffset() to TokenStream

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646574#action_12646574 ] 

markrmiller@gmail.com edited comment on LUCENE-1448 at 11/11/08 7:55 AM:
---------------------------------------------------------------

Your tests had an off by error.

I corrected them and added the standardanalyzer piece so that both tests pass. (i didnt correctly put the SA piece in the jflex file)

      was (Author: markrmiller@gmail.com):
    Your tests had an off by error.

I corrected them and added the standardanalyzer piece so that both tests pass.
  
> add getFinalOffset() to TokenStream
> -----------------------------------
>
>                 Key: LUCENE-1448
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1448
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1448.patch, LUCENE-1448.patch
>
>
> If you add multiple Fieldable instances for the same field name to a document, and you then index those fields with TermVectors storing offsets, it's very likely the offsets for all but the first field instance will be wrong.
> This is because IndexWriter under the hood adds a cumulative base to the offsets of each field instance, where that base is 1 + the endOffset of the last token it saw when analyzing that field.
> But this logic is overly simplistic.  For example, if the WhitespaceAnalyzer is being used, and the text being analyzed ended in 3 whitespace characters, then that information is lost and then next field's offsets are then all 3 too small.  Similarly, if a StopFilter appears in the chain, and the last N tokens were stop words, then the base will be 1 + the endOffset of the last non-stopword token.
> To fix this, I'd like to add a new getFinalOffset() to TokenStream.  I'm thinking by default it returns -1, which means "I don't know so you figure it out", meaning we fallback to the faulty logic we have today.
> This has come up several times on the user's list.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1448) add getFinalOffset() to TokenStream

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723591#action_12723591 ] 

Mark Miller commented on LUCENE-1448:
-------------------------------------

Will you have time for this Michael?

It would be great to have this bug fixed for 2.9, but if we have to push to 3.0 its not the end of the word. Wasnt it done till that darn new Token API came along? ;)

> add getFinalOffset() to TokenStream
> -----------------------------------
>
>                 Key: LUCENE-1448
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1448
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>            Reporter: Michael McCandless
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch
>
>
> If you add multiple Fieldable instances for the same field name to a document, and you then index those fields with TermVectors storing offsets, it's very likely the offsets for all but the first field instance will be wrong.
> This is because IndexWriter under the hood adds a cumulative base to the offsets of each field instance, where that base is 1 + the endOffset of the last token it saw when analyzing that field.
> But this logic is overly simplistic.  For example, if the WhitespaceAnalyzer is being used, and the text being analyzed ended in 3 whitespace characters, then that information is lost and then next field's offsets are then all 3 too small.  Similarly, if a StopFilter appears in the chain, and the last N tokens were stop words, then the base will be 1 + the endOffset of the last non-stopword token.
> To fix this, I'd like to add a new getFinalOffset() to TokenStream.  I'm thinking by default it returns -1, which means "I don't know so you figure it out", meaning we fallback to the faulty logic we have today.
> This has come up several times on the user's list.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1448) add getFinalOffset() to TokenStream

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12648122#action_12648122 ] 

Michael McCandless commented on LUCENE-1448:
--------------------------------------------

bq. Hmm now that we have getPositionIncrementGap() and getOffsetGap(), I think it would make sense to also add getFinalPositionIncrement()?

We could do that.  But how would you implement it?  EG StopFilter skips tokens, and (if enabled) already tracks the skippedPositions, so it could return that PLUS whatever its input reports as its getFinalPositionIncrement, I guess?

bq. Could we add this as Attributes using the new API? FinalOffsetAttribute and FinalPositionIncrementAttribute?

Hmm we could do that... but it seems awkward to add new attributes that apply only to ending state of the tokenizer.

I wonder if instead, w/ the new API, we could simply allow querying of certain attributes (offset, posincr) after incrementToken returns "false"?

Why don't you commit the new TokenStream API first, and we can iterate on this issue & commit 2nd?

> add getFinalOffset() to TokenStream
> -----------------------------------
>
>                 Key: LUCENE-1448
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1448
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch
>
>
> If you add multiple Fieldable instances for the same field name to a document, and you then index those fields with TermVectors storing offsets, it's very likely the offsets for all but the first field instance will be wrong.
> This is because IndexWriter under the hood adds a cumulative base to the offsets of each field instance, where that base is 1 + the endOffset of the last token it saw when analyzing that field.
> But this logic is overly simplistic.  For example, if the WhitespaceAnalyzer is being used, and the text being analyzed ended in 3 whitespace characters, then that information is lost and then next field's offsets are then all 3 too small.  Similarly, if a StopFilter appears in the chain, and the last N tokens were stop words, then the base will be 1 + the endOffset of the last non-stopword token.
> To fix this, I'd like to add a new getFinalOffset() to TokenStream.  I'm thinking by default it returns -1, which means "I don't know so you figure it out", meaning we fallback to the faulty logic we have today.
> This has come up several times on the user's list.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1448) add getFinalOffset() to TokenStream

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Busch updated LUCENE-1448:
----------------------------------

    Attachment: lucene-1448.patch

- updated to trunk
- made end() final in all implementing TokenStreams

I'm planning to commit this soon.

> add getFinalOffset() to TokenStream
> -----------------------------------
>
>                 Key: LUCENE-1448
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1448
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>            Reporter: Michael McCandless
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: lucene-1448.patch, lucene-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch
>
>
> If you add multiple Fieldable instances for the same field name to a document, and you then index those fields with TermVectors storing offsets, it's very likely the offsets for all but the first field instance will be wrong.
> This is because IndexWriter under the hood adds a cumulative base to the offsets of each field instance, where that base is 1 + the endOffset of the last token it saw when analyzing that field.
> But this logic is overly simplistic.  For example, if the WhitespaceAnalyzer is being used, and the text being analyzed ended in 3 whitespace characters, then that information is lost and then next field's offsets are then all 3 too small.  Similarly, if a StopFilter appears in the chain, and the last N tokens were stop words, then the base will be 1 + the endOffset of the last non-stopword token.
> To fix this, I'd like to add a new getFinalOffset() to TokenStream.  I'm thinking by default it returns -1, which means "I don't know so you figure it out", meaning we fallback to the faulty logic we have today.
> This has come up several times on the user's list.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1448) add getFinalOffset() to TokenStream

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653079#action_12653079 ] 

Michael McCandless commented on LUCENE-1448:
--------------------------------------------

OK, me too.  I'll move forward with that approach.

> add getFinalOffset() to TokenStream
> -----------------------------------
>
>                 Key: LUCENE-1448
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1448
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch
>
>
> If you add multiple Fieldable instances for the same field name to a document, and you then index those fields with TermVectors storing offsets, it's very likely the offsets for all but the first field instance will be wrong.
> This is because IndexWriter under the hood adds a cumulative base to the offsets of each field instance, where that base is 1 + the endOffset of the last token it saw when analyzing that field.
> But this logic is overly simplistic.  For example, if the WhitespaceAnalyzer is being used, and the text being analyzed ended in 3 whitespace characters, then that information is lost and then next field's offsets are then all 3 too small.  Similarly, if a StopFilter appears in the chain, and the last N tokens were stop words, then the base will be 1 + the endOffset of the last non-stopword token.
> To fix this, I'd like to add a new getFinalOffset() to TokenStream.  I'm thinking by default it returns -1, which means "I don't know so you figure it out", meaning we fallback to the faulty logic we have today.
> This has come up several times on the user's list.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1448) add getFinalOffset() to TokenStream

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646563#action_12646563 ] 

Mark Miller commented on LUCENE-1448:
-------------------------------------

bq. Second: I can't figure out how to ask StandardTokenizerImpl for the number of chars it pulled from the Reader. Can someone (who understands JFlex well) help out here?

Whats wrong with?

<code>

  public int getFinalOffset() {
    return scanner.yychar() + scanner.yylength();
  }

</code>

> add getFinalOffset() to TokenStream
> -----------------------------------
>
>                 Key: LUCENE-1448
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1448
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1448.patch
>
>
> If you add multiple Fieldable instances for the same field name to a document, and you then index those fields with TermVectors storing offsets, it's very likely the offsets for all but the first field instance will be wrong.
> This is because IndexWriter under the hood adds a cumulative base to the offsets of each field instance, where that base is 1 + the endOffset of the last token it saw when analyzing that field.
> But this logic is overly simplistic.  For example, if the WhitespaceAnalyzer is being used, and the text being analyzed ended in 3 whitespace characters, then that information is lost and then next field's offsets are then all 3 too small.  Similarly, if a StopFilter appears in the chain, and the last N tokens were stop words, then the base will be 1 + the endOffset of the last non-stopword token.
> To fix this, I'd like to add a new getFinalOffset() to TokenStream.  I'm thinking by default it returns -1, which means "I don't know so you figure it out", meaning we fallback to the faulty logic we have today.
> This has come up several times on the user's list.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1448) add getFinalOffset() to TokenStream

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1448:
---------------------------------------

    Priority: Minor  (was: Major)

> add getFinalOffset() to TokenStream
> -----------------------------------
>
>                 Key: LUCENE-1448
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1448
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>
> If you add multiple Fieldable instances for the same field name to a document, and you then index those fields with TermVectors storing offsets, it's very likely the offsets for all but the first field instance will be wrong.
> This is because IndexWriter under the hood adds a cumulative base to the offsets of each field instance, where that base is 1 + the endOffset of the last token it saw when analyzing that field.
> But this logic is overly simplistic.  For example, if the WhitespaceAnalyzer is being used, and the text being analyzed ended in 3 whitespace characters, then that information is lost and then next field's offsets are then all 3 too small.  Similarly, if a StopFilter appears in the chain, and the last N tokens were stop words, then the base will be 1 + the endOffset of the last non-stopword token.
> To fix this, I'd like to add a new getFinalOffset() to TokenStream.  I'm thinking by default it returns -1, which means "I don't know so you figure it out", meaning we fallback to the faulty logic we have today.
> This has come up several times on the user's list.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1448) add getFinalOffset() to TokenStream

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734022#action_12734022 ] 

Michael Busch commented on LUCENE-1448:
---------------------------------------

OK I think I have this basically working with old and new API (including 1693 changes).

The approach I took is fairly simple, it doesn't require adding a new Attribute. I added the following method to TokenSteam:

{code:java}
  /**
   * This method is called by the consumer after the last token has been consumed, 
   * i.e. after {@link #incrementToken()} returned <code>false</code> (using the new TokenStream API)
   * or after {@link #next(Token)} or {@link #next()} returned <code>null</code> (old TokenStream API).
   * <p/>
   * This method can be used to perform any end-of-stream operations, such as setting the final
   * offset of a stream. The final offset of a stream might differ from the offset of the last token
   * e.g. in case one or more whitespaces followed after the last token, but a {@link WhitespaceTokenizer}
   * was used.
   * <p/>
   * 
   * @throws IOException
   */
  public void end() throws IOException {
    // do nothing by default
  }
{code}

Then I took Mike's patch and implemented end() in all classes where his patch added getFinalOffset(). 
E.g. in CharTokenizer the implementations looks like this:

{code:java}
  public void end() {
    // set final offset
    int finalOffset = input.correctOffset(offset);
    offsetAtt.setOffset(finalOffset, finalOffset);
  }
{code}

I changed DocInverterPerField to call end() after the stream is fully consumed and use what 
offsetAttribute.endOffset() returns as final offset.

I also added all new tests from Mike's latest patch. 
All unit tests, including the new ones, pass. Also test-tag.

I'm not posting a patch yet, because this depends on 1693.

Mike, Uwe, others: could you please review if this approach makes sense?

> add getFinalOffset() to TokenStream
> -----------------------------------
>
>                 Key: LUCENE-1448
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1448
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>            Reporter: Michael McCandless
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch
>
>
> If you add multiple Fieldable instances for the same field name to a document, and you then index those fields with TermVectors storing offsets, it's very likely the offsets for all but the first field instance will be wrong.
> This is because IndexWriter under the hood adds a cumulative base to the offsets of each field instance, where that base is 1 + the endOffset of the last token it saw when analyzing that field.
> But this logic is overly simplistic.  For example, if the WhitespaceAnalyzer is being used, and the text being analyzed ended in 3 whitespace characters, then that information is lost and then next field's offsets are then all 3 too small.  Similarly, if a StopFilter appears in the chain, and the last N tokens were stop words, then the base will be 1 + the endOffset of the last non-stopword token.
> To fix this, I'd like to add a new getFinalOffset() to TokenStream.  I'm thinking by default it returns -1, which means "I don't know so you figure it out", meaning we fallback to the faulty logic we have today.
> This has come up several times on the user's list.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1448) add getFinalOffset() to TokenStream

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Busch updated LUCENE-1448:
----------------------------------

    Attachment: lucene-1448.patch

I changed Mike's last patch so that it uses the new end() approach that I explained above.

It also applies cleanly on current trunk (now that LUCENE-1693 is committed).

I also made the changes to CachingTokenFilter and TeeSinkTokenFilter that I mentioned above and added two more testcases to TestIndexWriter.

All tests pass.

> add getFinalOffset() to TokenStream
> -----------------------------------
>
>                 Key: LUCENE-1448
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1448
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>            Reporter: Michael McCandless
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: lucene-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch
>
>
> If you add multiple Fieldable instances for the same field name to a document, and you then index those fields with TermVectors storing offsets, it's very likely the offsets for all but the first field instance will be wrong.
> This is because IndexWriter under the hood adds a cumulative base to the offsets of each field instance, where that base is 1 + the endOffset of the last token it saw when analyzing that field.
> But this logic is overly simplistic.  For example, if the WhitespaceAnalyzer is being used, and the text being analyzed ended in 3 whitespace characters, then that information is lost and then next field's offsets are then all 3 too small.  Similarly, if a StopFilter appears in the chain, and the last N tokens were stop words, then the base will be 1 + the endOffset of the last non-stopword token.
> To fix this, I'd like to add a new getFinalOffset() to TokenStream.  I'm thinking by default it returns -1, which means "I don't know so you figure it out", meaning we fallback to the faulty logic we have today.
> This has come up several times on the user's list.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1448) add getFinalOffset() to TokenStream

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12660128#action_12660128 ] 

Michael Busch commented on LUCENE-1448:
---------------------------------------

I'm currently on vacation visiting my family in Germany till the 11th. I'm planning to work on this as soon as I'm back to get all the TokenStream changes (also LUCENE-1460) ready before 2.9.

> add getFinalOffset() to TokenStream
> -----------------------------------
>
>                 Key: LUCENE-1448
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1448
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>            Reporter: Michael McCandless
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch
>
>
> If you add multiple Fieldable instances for the same field name to a document, and you then index those fields with TermVectors storing offsets, it's very likely the offsets for all but the first field instance will be wrong.
> This is because IndexWriter under the hood adds a cumulative base to the offsets of each field instance, where that base is 1 + the endOffset of the last token it saw when analyzing that field.
> But this logic is overly simplistic.  For example, if the WhitespaceAnalyzer is being used, and the text being analyzed ended in 3 whitespace characters, then that information is lost and then next field's offsets are then all 3 too small.  Similarly, if a StopFilter appears in the chain, and the last N tokens were stop words, then the base will be 1 + the endOffset of the last non-stopword token.
> To fix this, I'd like to add a new getFinalOffset() to TokenStream.  I'm thinking by default it returns -1, which means "I don't know so you figure it out", meaning we fallback to the faulty logic we have today.
> This has come up several times on the user's list.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Issue Comment Edited: (LUCENE-1448) add getFinalOffset() to TokenStream

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646549#action_12646549 ] 

markrmiller@gmail.com edited comment on LUCENE-1448 at 11/11/08 6:44 AM:
---------------------------------------------------------------

You need that +1 or you will have the subsequent token starting on the tail of the 'stopword'.

What I can't figure it out is how exactly these offsets are supposed to match up...abcd has offsets of s:0 e:4, which seems to imply it thinks abcd is 5 chars or the end is one greater than the end index (like with spans). In either case, it seems even if you put back the +1, the endoffsets are off somehow, because some will have an end of +1 the end index, while secondary multi-fields will have an end equal to the end index.

Would be cool to have fixed as this also stymies highlighting with multi-fields.

*edit*

I see. You need that +1 you took out and you need fields after the first to have +1 more for an end offset. Looks they are supposed to be end index +1.

*edit2*

Nm :) I am a bad counter. I think you only need the +1 back.

      was (Author: markrmiller@gmail.com):
    You need that +1 or you will have the subsequent token starting on the tail of the 'stopword'.

What I can't figure it out is how exactly these offsets are supposed to match up...abcd has offsets of s:0 e:4, which seems to imply it thinks abcd is 5 chars or the end is one greater than the end index (like with spans). In either case, it seems even if you put back the +1, the endoffsets are off somehow, because some will have an end of +1 the end index, while secondary multi-fields will have an end equal to the end index.

Would be cool to have fixed as this also stymies highlighting with multi-fields.

*edit*

I see. You need that +1 you took out and you need fields after the first to have +1 more for an end offset. Looks they are supposed to be end index +1.
  
> add getFinalOffset() to TokenStream
> -----------------------------------
>
>                 Key: LUCENE-1448
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1448
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1448.patch
>
>
> If you add multiple Fieldable instances for the same field name to a document, and you then index those fields with TermVectors storing offsets, it's very likely the offsets for all but the first field instance will be wrong.
> This is because IndexWriter under the hood adds a cumulative base to the offsets of each field instance, where that base is 1 + the endOffset of the last token it saw when analyzing that field.
> But this logic is overly simplistic.  For example, if the WhitespaceAnalyzer is being used, and the text being analyzed ended in 3 whitespace characters, then that information is lost and then next field's offsets are then all 3 too small.  Similarly, if a StopFilter appears in the chain, and the last N tokens were stop words, then the base will be 1 + the endOffset of the last non-stopword token.
> To fix this, I'd like to add a new getFinalOffset() to TokenStream.  I'm thinking by default it returns -1, which means "I don't know so you figure it out", meaning we fallback to the faulty logic we have today.
> This has come up several times on the user's list.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1448) add getFinalOffset() to TokenStream

Posted by "Uwe Schindler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734063#action_12734063 ] 

Uwe Schindler commented on LUCENE-1448:
---------------------------------------

This is not the only problem with multiple Tees: The offsets are also completely mixed together, especially if the two tees feed into the sink at the same time (not after each other). In my opinion, the last call to end should be cached by the sink as end state (so if two tees add a end state to the tee, the second one overwrites the first one).

> add getFinalOffset() to TokenStream
> -----------------------------------
>
>                 Key: LUCENE-1448
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1448
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>            Reporter: Michael McCandless
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch
>
>
> If you add multiple Fieldable instances for the same field name to a document, and you then index those fields with TermVectors storing offsets, it's very likely the offsets for all but the first field instance will be wrong.
> This is because IndexWriter under the hood adds a cumulative base to the offsets of each field instance, where that base is 1 + the endOffset of the last token it saw when analyzing that field.
> But this logic is overly simplistic.  For example, if the WhitespaceAnalyzer is being used, and the text being analyzed ended in 3 whitespace characters, then that information is lost and then next field's offsets are then all 3 too small.  Similarly, if a StopFilter appears in the chain, and the last N tokens were stop words, then the base will be 1 + the endOffset of the last non-stopword token.
> To fix this, I'd like to add a new getFinalOffset() to TokenStream.  I'm thinking by default it returns -1, which means "I don't know so you figure it out", meaning we fallback to the faulty logic we have today.
> This has come up several times on the user's list.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1448) add getFinalOffset() to TokenStream

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734280#action_12734280 ] 

Michael McCandless commented on LUCENE-1448:
--------------------------------------------

This approach (adding end()) sounds good!

> add getFinalOffset() to TokenStream
> -----------------------------------
>
>                 Key: LUCENE-1448
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1448
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>            Reporter: Michael McCandless
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch
>
>
> If you add multiple Fieldable instances for the same field name to a document, and you then index those fields with TermVectors storing offsets, it's very likely the offsets for all but the first field instance will be wrong.
> This is because IndexWriter under the hood adds a cumulative base to the offsets of each field instance, where that base is 1 + the endOffset of the last token it saw when analyzing that field.
> But this logic is overly simplistic.  For example, if the WhitespaceAnalyzer is being used, and the text being analyzed ended in 3 whitespace characters, then that information is lost and then next field's offsets are then all 3 too small.  Similarly, if a StopFilter appears in the chain, and the last N tokens were stop words, then the base will be 1 + the endOffset of the last non-stopword token.
> To fix this, I'd like to add a new getFinalOffset() to TokenStream.  I'm thinking by default it returns -1, which means "I don't know so you figure it out", meaning we fallback to the faulty logic we have today.
> This has come up several times on the user's list.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1448) add getFinalOffset() to TokenStream

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654044#action_12654044 ] 

Michael Busch commented on LUCENE-1448:
---------------------------------------

{quote}
But: you'd still need a way in this model to serialize finalOffset, once,
at the end?
{quote}

Maybe we can introduce an abstract EndOfStreamAttribute and 
FinalOffsetAttribute and FinalPosIncrAttribute that extend EOSA.

Then in a stream like CachingTokenFilter a AttributeAcceptor can
be used that doesn't accept attributes of type EOSA in increment().

In end() it would use an AttributeAcceptor that accepts EOSA atts
and cache those.

{quote}
OK I'm gonna assign this one to you, Michael ;)
{quote}

Bummer! Why did I say anything? ;) j/k

> add getFinalOffset() to TokenStream
> -----------------------------------
>
>                 Key: LUCENE-1448
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1448
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>            Reporter: Michael McCandless
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch
>
>
> If you add multiple Fieldable instances for the same field name to a document, and you then index those fields with TermVectors storing offsets, it's very likely the offsets for all but the first field instance will be wrong.
> This is because IndexWriter under the hood adds a cumulative base to the offsets of each field instance, where that base is 1 + the endOffset of the last token it saw when analyzing that field.
> But this logic is overly simplistic.  For example, if the WhitespaceAnalyzer is being used, and the text being analyzed ended in 3 whitespace characters, then that information is lost and then next field's offsets are then all 3 too small.  Similarly, if a StopFilter appears in the chain, and the last N tokens were stop words, then the base will be 1 + the endOffset of the last non-stopword token.
> To fix this, I'd like to add a new getFinalOffset() to TokenStream.  I'm thinking by default it returns -1, which means "I don't know so you figure it out", meaning we fallback to the faulty logic we have today.
> This has come up several times on the user's list.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1448) add getFinalOffset() to TokenStream

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12648080#action_12648080 ] 

Michael Busch commented on LUCENE-1448:
---------------------------------------

{quote}
First: this patch only addresses the final offset, but shouldn't we also address the final position? Eg if StopFilter removes the last few tokens, shouldn't we make it possible to report those skipped positonIncrements?
{quote}

Hmm now that we have getPositionIncrementGap() and getOffsetGap(), I think it would make sense to also add getFinalPositionIncrement()?

{quote}
To fix this, I'd like to add a new getFinalOffset() to TokenStream.
{quote}

Could we add this as Attributes using the new API? FinalOffsetAttribute and FinalPositionIncrementAttribute?

> add getFinalOffset() to TokenStream
> -----------------------------------
>
>                 Key: LUCENE-1448
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1448
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch
>
>
> If you add multiple Fieldable instances for the same field name to a document, and you then index those fields with TermVectors storing offsets, it's very likely the offsets for all but the first field instance will be wrong.
> This is because IndexWriter under the hood adds a cumulative base to the offsets of each field instance, where that base is 1 + the endOffset of the last token it saw when analyzing that field.
> But this logic is overly simplistic.  For example, if the WhitespaceAnalyzer is being used, and the text being analyzed ended in 3 whitespace characters, then that information is lost and then next field's offsets are then all 3 too small.  Similarly, if a StopFilter appears in the chain, and the last N tokens were stop words, then the base will be 1 + the endOffset of the last non-stopword token.
> To fix this, I'd like to add a new getFinalOffset() to TokenStream.  I'm thinking by default it returns -1, which means "I don't know so you figure it out", meaning we fallback to the faulty logic we have today.
> This has come up several times on the user's list.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Resolved: (LUCENE-1448) add getFinalOffset() to TokenStream

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Busch resolved LUCENE-1448.
-----------------------------------

    Resolution: Fixed

Committed revision 797715.

Thanks Mike & Mark for the original patch!

> add getFinalOffset() to TokenStream
> -----------------------------------
>
>                 Key: LUCENE-1448
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1448
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>            Reporter: Michael McCandless
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: lucene-1448.patch, lucene-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch
>
>
> If you add multiple Fieldable instances for the same field name to a document, and you then index those fields with TermVectors storing offsets, it's very likely the offsets for all but the first field instance will be wrong.
> This is because IndexWriter under the hood adds a cumulative base to the offsets of each field instance, where that base is 1 + the endOffset of the last token it saw when analyzing that field.
> But this logic is overly simplistic.  For example, if the WhitespaceAnalyzer is being used, and the text being analyzed ended in 3 whitespace characters, then that information is lost and then next field's offsets are then all 3 too small.  Similarly, if a StopFilter appears in the chain, and the last N tokens were stop words, then the base will be 1 + the endOffset of the last non-stopword token.
> To fix this, I'd like to add a new getFinalOffset() to TokenStream.  I'm thinking by default it returns -1, which means "I don't know so you figure it out", meaning we fallback to the faulty logic we have today.
> This has come up several times on the user's list.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1448) add getFinalOffset() to TokenStream

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1448:
---------------------------------------

    Attachment: LUCENE-1448.patch

OK I added test coverage for NOT_ANAlYZED fields, and fixed contrib TokenStreams to implement getFinalOffset.

All tests, and back-compat tests, pass.

I think it's ready to commit... I'll wait a day or two.

(Michael: sorry, I think this will cause some conflicts with the new TokenStream API changes...).

> add getFinalOffset() to TokenStream
> -----------------------------------
>
>                 Key: LUCENE-1448
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1448
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch
>
>
> If you add multiple Fieldable instances for the same field name to a document, and you then index those fields with TermVectors storing offsets, it's very likely the offsets for all but the first field instance will be wrong.
> This is because IndexWriter under the hood adds a cumulative base to the offsets of each field instance, where that base is 1 + the endOffset of the last token it saw when analyzing that field.
> But this logic is overly simplistic.  For example, if the WhitespaceAnalyzer is being used, and the text being analyzed ended in 3 whitespace characters, then that information is lost and then next field's offsets are then all 3 too small.  Similarly, if a StopFilter appears in the chain, and the last N tokens were stop words, then the base will be 1 + the endOffset of the last non-stopword token.
> To fix this, I'd like to add a new getFinalOffset() to TokenStream.  I'm thinking by default it returns -1, which means "I don't know so you figure it out", meaning we fallback to the faulty logic we have today.
> This has come up several times on the user's list.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1448) add getFinalOffset() to TokenStream

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653513#action_12653513 ] 

Michael Busch commented on LUCENE-1448:
---------------------------------------

{quote}
Another option is to "define" the API such that when incrementToken()
returns false, then it has actually advanced to an "end-of-stream
token". OffsetAttribute.getEndOffset() should return the final
offset. Since we have not released the new API, we could simply make
this change (and fix all instances in the core/contrib that use the
new API accordingly). I think I like this option best.
{quote}

This adds some "cleaning up" responsibilities to all existing
TokenFilters out there. So far it is very straightforward to change an
existing TokenFilter to use the new API. You simply have to:
- add  attributes the filter needs in its constructor 
- change next() to incrementToken() and change return calls that
return null to false, others to true (or what input returns)
- don't access a token but the appropriate attributes to set the data

But maybe there's a custom filter in the end of the chain that returns
more tokens even after its input returned the last one. For example a
SynonymExpansionFilter might return a synonym for the last word it
received from its input before it returns false. In this case it might
overwrite endOffset that another filter/stream already set to the
final endOffset. It needs to cache that value and set it when it
returns false.

ALso all filters that currently use an offset need to know now to
clean up before returning false.

I'm not saying this is necessarily bad. I also find this approach
tempting, because it's simple. But it might be a common pitfall for
bugs?

What I'd like to work on soon is an efficient way to buffer attributes
(maybe add methods to attribute that write into a bytebuffer). Then
attributes can implement what variables need to be serialized and
which ones don't. In that case we could add a finalOffset to
OffsetAttribute that does not get serialiezd/deserialized.

And possibly it might be worthwhile to have explicit states defined in
a TokenStream that we can enforce with three methods: start(),
increment(), end(). Then people would now if they have to do something
at the end of a stream they have to do it in end().

> add getFinalOffset() to TokenStream
> -----------------------------------
>
>                 Key: LUCENE-1448
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1448
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch
>
>
> If you add multiple Fieldable instances for the same field name to a document, and you then index those fields with TermVectors storing offsets, it's very likely the offsets for all but the first field instance will be wrong.
> This is because IndexWriter under the hood adds a cumulative base to the offsets of each field instance, where that base is 1 + the endOffset of the last token it saw when analyzing that field.
> But this logic is overly simplistic.  For example, if the WhitespaceAnalyzer is being used, and the text being analyzed ended in 3 whitespace characters, then that information is lost and then next field's offsets are then all 3 too small.  Similarly, if a StopFilter appears in the chain, and the last N tokens were stop words, then the base will be 1 + the endOffset of the last non-stopword token.
> To fix this, I'd like to add a new getFinalOffset() to TokenStream.  I'm thinking by default it returns -1, which means "I don't know so you figure it out", meaning we fallback to the faulty logic we have today.
> This has come up several times on the user's list.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1448) add getFinalOffset() to TokenStream

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734025#action_12734025 ] 

Michael Busch commented on LUCENE-1448:
---------------------------------------

Hmm another reason why I don't like two Tees feeding one Sink:

What is the finalOffset and finalState then?

> add getFinalOffset() to TokenStream
> -----------------------------------
>
>                 Key: LUCENE-1448
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1448
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>            Reporter: Michael McCandless
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch
>
>
> If you add multiple Fieldable instances for the same field name to a document, and you then index those fields with TermVectors storing offsets, it's very likely the offsets for all but the first field instance will be wrong.
> This is because IndexWriter under the hood adds a cumulative base to the offsets of each field instance, where that base is 1 + the endOffset of the last token it saw when analyzing that field.
> But this logic is overly simplistic.  For example, if the WhitespaceAnalyzer is being used, and the text being analyzed ended in 3 whitespace characters, then that information is lost and then next field's offsets are then all 3 too small.  Similarly, if a StopFilter appears in the chain, and the last N tokens were stop words, then the base will be 1 + the endOffset of the last non-stopword token.
> To fix this, I'd like to add a new getFinalOffset() to TokenStream.  I'm thinking by default it returns -1, which means "I don't know so you figure it out", meaning we fallback to the faulty logic we have today.
> This has come up several times on the user's list.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1448) add getFinalOffset() to TokenStream

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718176#action_12718176 ] 

Michael McCandless commented on LUCENE-1448:
--------------------------------------------

Michael are you going to get to this soonish?  Else let's push until after 3.0?

> add getFinalOffset() to TokenStream
> -----------------------------------
>
>                 Key: LUCENE-1448
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1448
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>            Reporter: Michael McCandless
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch
>
>
> If you add multiple Fieldable instances for the same field name to a document, and you then index those fields with TermVectors storing offsets, it's very likely the offsets for all but the first field instance will be wrong.
> This is because IndexWriter under the hood adds a cumulative base to the offsets of each field instance, where that base is 1 + the endOffset of the last token it saw when analyzing that field.
> But this logic is overly simplistic.  For example, if the WhitespaceAnalyzer is being used, and the text being analyzed ended in 3 whitespace characters, then that information is lost and then next field's offsets are then all 3 too small.  Similarly, if a StopFilter appears in the chain, and the last N tokens were stop words, then the base will be 1 + the endOffset of the last non-stopword token.
> To fix this, I'd like to add a new getFinalOffset() to TokenStream.  I'm thinking by default it returns -1, which means "I don't know so you figure it out", meaning we fallback to the faulty logic we have today.
> This has come up several times on the user's list.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1448) add getFinalOffset() to TokenStream

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653065#action_12653065 ] 

Mark Miller commented on LUCENE-1448:
-------------------------------------

bq. Another option is to "define" the API such that when incrementToken()
returns false, then it has actually advanced to an "end-of-stream
token". OffsetAttribute.getEndOffset() should return the final
offset. Since we have not released the new API, we could simply make
this change (and fix all instances in the core/contrib that use the
new API accordingly). I think I like this option best.

+1. I like this.

> add getFinalOffset() to TokenStream
> -----------------------------------
>
>                 Key: LUCENE-1448
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1448
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch
>
>
> If you add multiple Fieldable instances for the same field name to a document, and you then index those fields with TermVectors storing offsets, it's very likely the offsets for all but the first field instance will be wrong.
> This is because IndexWriter under the hood adds a cumulative base to the offsets of each field instance, where that base is 1 + the endOffset of the last token it saw when analyzing that field.
> But this logic is overly simplistic.  For example, if the WhitespaceAnalyzer is being used, and the text being analyzed ended in 3 whitespace characters, then that information is lost and then next field's offsets are then all 3 too small.  Similarly, if a StopFilter appears in the chain, and the last N tokens were stop words, then the base will be 1 + the endOffset of the last non-stopword token.
> To fix this, I'd like to add a new getFinalOffset() to TokenStream.  I'm thinking by default it returns -1, which means "I don't know so you figure it out", meaning we fallback to the faulty logic we have today.
> This has come up several times on the user's list.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1448) add getFinalOffset() to TokenStream

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller updated LUCENE-1448:
--------------------------------

    Attachment: LUCENE-1448.patch

Your tests had an off by error.

I corrected them and added the standardanalyzer piece so that both tests pass.

> add getFinalOffset() to TokenStream
> -----------------------------------
>
>                 Key: LUCENE-1448
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1448
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1448.patch, LUCENE-1448.patch
>
>
> If you add multiple Fieldable instances for the same field name to a document, and you then index those fields with TermVectors storing offsets, it's very likely the offsets for all but the first field instance will be wrong.
> This is because IndexWriter under the hood adds a cumulative base to the offsets of each field instance, where that base is 1 + the endOffset of the last token it saw when analyzing that field.
> But this logic is overly simplistic.  For example, if the WhitespaceAnalyzer is being used, and the text being analyzed ended in 3 whitespace characters, then that information is lost and then next field's offsets are then all 3 too small.  Similarly, if a StopFilter appears in the chain, and the last N tokens were stop words, then the base will be 1 + the endOffset of the last non-stopword token.
> To fix this, I'd like to add a new getFinalOffset() to TokenStream.  I'm thinking by default it returns -1, which means "I don't know so you figure it out", meaning we fallback to the faulty logic we have today.
> This has come up several times on the user's list.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1448) add getFinalOffset() to TokenStream

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12653893#action_12653893 ] 

Michael McCandless commented on LUCENE-1448:
--------------------------------------------

{quote}
What I'd like to work on soon is an efficient way to buffer attributes
(maybe add methods to attribute that write into a bytebuffer). Then
attributes can implement what variables need to be serialized and
which ones don't. In that case we could add a finalOffset to
OffsetAttribute that does not get serialiezd/deserialized.
{quote}

I like that (it'd make streams like CachingTokenFilter much more
efficient).  It'd also presumably lead to more efficiently serialized
token streams.

But: you'd still need a way in this model to serialize finalOffset, once,
at the end?

{quote}
And possibly it might be worthwhile to have explicit states defined in
a TokenStream that we can enforce with three methods: start(),
increment(), end(). Then people would now if they have to do something
at the end of a stream they have to do it in end().
{quote}

This also seems good.  So end() would be the obvious place to set
the OffsetAttribute.finalOffset, PositionIncrementAttribute.positionIncrementGap, etc.

OK I'm gonna assign this one to you, Michael ;)


> add getFinalOffset() to TokenStream
> -----------------------------------
>
>                 Key: LUCENE-1448
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1448
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch
>
>
> If you add multiple Fieldable instances for the same field name to a document, and you then index those fields with TermVectors storing offsets, it's very likely the offsets for all but the first field instance will be wrong.
> This is because IndexWriter under the hood adds a cumulative base to the offsets of each field instance, where that base is 1 + the endOffset of the last token it saw when analyzing that field.
> But this logic is overly simplistic.  For example, if the WhitespaceAnalyzer is being used, and the text being analyzed ended in 3 whitespace characters, then that information is lost and then next field's offsets are then all 3 too small.  Similarly, if a StopFilter appears in the chain, and the last N tokens were stop words, then the base will be 1 + the endOffset of the last non-stopword token.
> To fix this, I'd like to add a new getFinalOffset() to TokenStream.  I'm thinking by default it returns -1, which means "I don't know so you figure it out", meaning we fallback to the faulty logic we have today.
> This has come up several times on the user's list.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Assigned: (LUCENE-1448) add getFinalOffset() to TokenStream

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless reassigned LUCENE-1448:
------------------------------------------

    Assignee: Michael Busch  (was: Michael McCandless)

> add getFinalOffset() to TokenStream
> -----------------------------------
>
>                 Key: LUCENE-1448
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1448
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>            Reporter: Michael McCandless
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch
>
>
> If you add multiple Fieldable instances for the same field name to a document, and you then index those fields with TermVectors storing offsets, it's very likely the offsets for all but the first field instance will be wrong.
> This is because IndexWriter under the hood adds a cumulative base to the offsets of each field instance, where that base is 1 + the endOffset of the last token it saw when analyzing that field.
> But this logic is overly simplistic.  For example, if the WhitespaceAnalyzer is being used, and the text being analyzed ended in 3 whitespace characters, then that information is lost and then next field's offsets are then all 3 too small.  Similarly, if a StopFilter appears in the chain, and the last N tokens were stop words, then the base will be 1 + the endOffset of the last non-stopword token.
> To fix this, I'd like to add a new getFinalOffset() to TokenStream.  I'm thinking by default it returns -1, which means "I don't know so you figure it out", meaning we fallback to the faulty logic we have today.
> This has come up several times on the user's list.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1448) add getFinalOffset() to TokenStream

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12648290#action_12648290 ] 

Michael Busch commented on LUCENE-1448:
---------------------------------------

{quote}
Hmm we could do that... but it seems awkward to add new attributes that apply only to ending state of the tokenizer.
{quote}

Yeah. Also you wouldn't want to pay overhead in TokenFilters that can buffer tokens to serialize or clone those attributes for every token.

{quote}
I wonder if instead, w/ the new API, we could simply allow querying of certain attributes (offset, posincr) after incrementToken returns "false"?
{quote}

Yeah, maybe we can make the AttributeSource more sophisticated, so that it can distinguish between per-field (instance) and per-token attributes. But as a separate patch, not as part of LUCENE-1422.

{quote}
Why don't you commit the new TokenStream API first, and we can iterate on this issue & commit 2nd?
{quote}

OK, will do. I think 1422 is ready now.

> add getFinalOffset() to TokenStream
> -----------------------------------
>
>                 Key: LUCENE-1448
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1448
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch
>
>
> If you add multiple Fieldable instances for the same field name to a document, and you then index those fields with TermVectors storing offsets, it's very likely the offsets for all but the first field instance will be wrong.
> This is because IndexWriter under the hood adds a cumulative base to the offsets of each field instance, where that base is 1 + the endOffset of the last token it saw when analyzing that field.
> But this logic is overly simplistic.  For example, if the WhitespaceAnalyzer is being used, and the text being analyzed ended in 3 whitespace characters, then that information is lost and then next field's offsets are then all 3 too small.  Similarly, if a StopFilter appears in the chain, and the last N tokens were stop words, then the base will be 1 + the endOffset of the last non-stopword token.
> To fix this, I'd like to add a new getFinalOffset() to TokenStream.  I'm thinking by default it returns -1, which means "I don't know so you figure it out", meaning we fallback to the faulty logic we have today.
> This has come up several times on the user's list.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1448) add getFinalOffset() to TokenStream

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646549#action_12646549 ] 

Mark Miller commented on LUCENE-1448:
-------------------------------------

You need that +1 or you will have the subsequent token starting on the tail of the 'stopword'.

What I can't figure it out is how exactly these offsets are supposed to match up...abcd has offsets of s:0 e:4, which seems to imply it thinks abcd is 5 chars or the end is one greater than the end index (like with spans). In either case, it seems even if you put back the +1, the endoffsets are off somehow, because some will have an end of +1 the end index, while secondary multi-fields will have an end equal to the end index.

Would be cool to have fixed as this also stymies highlighting with multi-fields.

> add getFinalOffset() to TokenStream
> -----------------------------------
>
>                 Key: LUCENE-1448
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1448
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1448.patch
>
>
> If you add multiple Fieldable instances for the same field name to a document, and you then index those fields with TermVectors storing offsets, it's very likely the offsets for all but the first field instance will be wrong.
> This is because IndexWriter under the hood adds a cumulative base to the offsets of each field instance, where that base is 1 + the endOffset of the last token it saw when analyzing that field.
> But this logic is overly simplistic.  For example, if the WhitespaceAnalyzer is being used, and the text being analyzed ended in 3 whitespace characters, then that information is lost and then next field's offsets are then all 3 too small.  Similarly, if a StopFilter appears in the chain, and the last N tokens were stop words, then the base will be 1 + the endOffset of the last non-stopword token.
> To fix this, I'd like to add a new getFinalOffset() to TokenStream.  I'm thinking by default it returns -1, which means "I don't know so you figure it out", meaning we fallback to the faulty logic we have today.
> This has come up several times on the user's list.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: [jira] Updated: (LUCENE-1448) add getFinalOffset() to TokenStream

Posted by Mark Miller <ma...@gmail.com>.

Michael McCandless wrote:
> This stuff is confusing!  I think your numbers are not right.  Let's 
> try reformatting with CHAR=POS.
>
> Here's your example without the +1:
>
> 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
> a=0 b=1 c=2 d=3  =4 t=5 h=6 e=7 c=8 r=9 u=10 n=11 c=12 h=13  =14 m=15 
> a=16 n=17
>
>   abcd 0-4
> crunch 8-14
>    man 15-18
>
> This is not how Lucene works today.  Lucene adds the +1 ("virtual
> space character"):
>
> 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
> a=0 b=1 c=2 d=3  =4 t=5 h=6 e=7  =8 c=9 r=10 u=11 n=12 c=13 h=14  =15 
> m=16 a=17 n=18
>
>   abcd 0-4
> crunch 9-15
>    man 16-19
>
> I think?
>
Man, I'm sorry. Just reran my stuff and it didn't jive with my earlier 
results. Keep the +1 off it looks. Don't know what happened...I have 
java code and lucene calculating for me :)

At least that jives with earlier reports of people saying that have to 
insert that space to get things highlighted. Here are the results i get now:

Old:
a=0 b=1 c=2 d=3  =4 t=5 h=6 e=7 c=8 r=9 u=10 n=11 c=12 h=13  =14 m=15 
a=16 n=17
term:abcd s:0 e:4
term:crunch s:5 e:11
term:man s:12 e:15

New Without +1:

a=0 b=1 c=2 d=3  =4 t=5 h=6 e=7 c=8 r=9 u=10 n=11 c=12 h=13  =14 m=15 
a=16 n=17
term:abcd s:0 e:4
term:crunch s:8 e:14
term:man s:15 e:18

New With +1:

a=0 b=1 c=2 d=3  =4 t=5 h=6 e=7 c=8 r=9 u=10 n=11 c=12 h=13  =14 m=15 
a=16 n=17
term:abcd s:0 e:4
term:crunch s:9 e:15
term:man s:16 e:19


We are on the same page and I'm sorry for taking you down that path - 
except now you might be more sure it doesn't belong ;)

I see some of my initial and continued confusion was caused by that char 
tokenizer bug...your original tests now look right (second abcd starting 
at 8 rather than 7).

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: [jira] Updated: (LUCENE-1448) add getFinalOffset() to TokenStream

Posted by Michael McCandless <lu...@mikemccandless.com>.

This stuff is confusing!  I think your numbers are not right.  Let's  
try reformatting with CHAR=POS.

Here's your example without the +1:

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
a=0 b=1 c=2 d=3  =4 t=5 h=6 e=7 c=8 r=9 u=10 n=11 c=12 h=13  =14 m=15  
a=16 n=17

   abcd 0-4
crunch 8-14
    man 15-18

This is not how Lucene works today.  Lucene adds the +1 ("virtual
space character"):

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
a=0 b=1 c=2 d=3  =4 t=5 h=6 e=7  =8 c=9 r=10 u=11 n=12 c=13 h=14  =15  
m=16 a=17 n=18

   abcd 0-4
crunch 9-15
    man 16-19

I think?

Mark Miller wrote:

>
>> Is it? Lets straighten this out. Here is what I see from my test for:
>>
>> field = "abcd the"
>> field = "crunch man"
>>
>> abcd thecrunch man
>> a0b1c2d3 4t5h6e7c8r9u10n11c12h13  14m15a16n17
>>
>> Without the +1 I got:
>> abcd: 0-4
>> crunch: 7-13
>> man: 14-17
>>
>> Something like that anyway.
>>
>> With +1 I got:
>> abcd:0-4
>> crunch:8-14
>> man:14-18
> *correction*
> should have been - man: 15-18
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-dev-help@lucene.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: [jira] Updated: (LUCENE-1448) add getFinalOffset() to TokenStream

Posted by Mark Miller <ma...@gmail.com>.

> Is it? Lets straighten this out. Here is what I see from my test for:
>
> field = "abcd the"
> field = "crunch man"
>
> abcd thecrunch man
> a0b1c2d3 4t5h6e7c8r9u10n11c12h13  14m15a16n17
>
> Without the +1 I got:
> abcd: 0-4
> crunch: 7-13
> man: 14-17
>
> Something like that anyway.
>
> With +1 I got:
> abcd:0-4
> crunch:8-14
> man:14-18
*correction*
should have been - man: 15-18

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: [jira] Updated: (LUCENE-1448) add getFinalOffset() to TokenStream

Posted by Mark Miller <ma...@gmail.com>.

Michael McCandless (JIRA) wrote:
>      [ https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Michael McCandless updated LUCENE-1448:
> ---------------------------------------
>
>     Attachment: LUCENE-1448.patch
>
> Attached new patch (changes described below):
>
> bq. You need that +1 or you will have the subsequent token starting on the tail of the 'stopword'. 
>
> So logically it's like we silently & forcefully insert a space between
> the Fieldable instances?
>   
Is it? Lets straighten this out. Here is what I see from my test for:

field = "abcd the"
field = "crunch man"

abcd thecrunch man
a0b1c2d3 4t5h6e7c8r9u10n11c12h13  14m15a16n17

Without the +1 I got:
abcd: 0-4
crunch: 7-13
man: 14-17

Something like that anyway.

With +1 I got:
abcd:0-4
crunch:8-14
man:14-18

which seems correct right? With no space injected? Your test seemed to 
ensure the equivalent of crunch starting at 7...which is the 'e' 
position and not correct right?
>
> OK I added Analyzer.getOffsetGap(Fieldable), and defaulted it to
> return 1 for analyzed fields and 0 for unanalyzed fields.
>
> bq. What's wrong with public int getFinalOffset() { return scanner.yychar() + scanner.yylength(); }
>
> Does that handle spaces at the end of the text?  (Oh it seems like it
> does...I added a test case...hmm).
>   
> bq. i didnt correctly put the SA piece in the jflex file
>
> I think this change (adding getFinalOffset to StandardTokenizer)
> doesn't need a change to jflex?  (It's only if you edit
> StandardTokenizerImpl.java).
>   
My fault. I was assuming I missed it without really looking close.
> Hmm another complexity is handling a field instance that produced no
> tokens.  Currently, we do not increment the cumulative offset by +1 in
> such cases.  But, for position increment gap we always add this gap in
> between fields if any field from the past have produced a token.  I
> added a couple test cases for this.
>
> Also, I fixed a bug in how CharTokenizer was computing its final
> offset.
>
> Still todo:
>   - add test cases to cover NOT_ANALYZED fields
>   - fix contrib tokenizers to implement getFinalOffset
>
>
>   
>> add getFinalOffset() to TokenStream
>> -----------------------------------
>>
>>                 Key: LUCENE-1448
>>                 URL: https://issues.apache.org/jira/browse/LUCENE-1448
>>             Project: Lucene - Java
>>          Issue Type: Bug
>>          Components: Analysis
>>            Reporter: Michael McCandless
>>            Assignee: Michael McCandless
>>            Priority: Minor
>>             Fix For: 2.9
>>
>>         Attachments: LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch
>>
>>
>> If you add multiple Fieldable instances for the same field name to a document, and you then index those fields with TermVectors storing offsets, it's very likely the offsets for all but the first field instance will be wrong.
>> This is because IndexWriter under the hood adds a cumulative base to the offsets of each field instance, where that base is 1 + the endOffset of the last token it saw when analyzing that field.
>> But this logic is overly simplistic.  For example, if the WhitespaceAnalyzer is being used, and the text being analyzed ended in 3 whitespace characters, then that information is lost and then next field's offsets are then all 3 too small.  Similarly, if a StopFilter appears in the chain, and the last N tokens were stop words, then the base will be 1 + the endOffset of the last non-stopword token.
>> To fix this, I'd like to add a new getFinalOffset() to TokenStream.  I'm thinking by default it returns -1, which means "I don't know so you figure it out", meaning we fallback to the faulty logic we have today.
>> This has come up several times on the user's list.
>>     
>
>   


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Updated: (LUCENE-1448) add getFinalOffset() to TokenStream

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-1448:
---------------------------------------

    Attachment: LUCENE-1448.patch

Attached new patch (changes described below):

bq. You need that +1 or you will have the subsequent token starting on the tail of the 'stopword'. 

So logically it's like we silently & forcefully insert a space between
the Fieldable instances?

Maybe we should add Analyzer.getOffsetGap(String fieldName), which by
default would return 1, and we then add that into the offset for
subsequent field instances?

But then here's another challenge: for NOT_ANALYZED fields we don't
add this extra +1.  We just add the string length.  Hmm.

OK I added Analyzer.getOffsetGap(Fieldable), and defaulted it to
return 1 for analyzed fields and 0 for unanalyzed fields.

bq. What's wrong with public int getFinalOffset() { return scanner.yychar() + scanner.yylength(); }

Does that handle spaces at the end of the text?  (Oh it seems like it
does...I added a test case...hmm).

bq. i didnt correctly put the SA piece in the jflex file

I think this change (adding getFinalOffset to StandardTokenizer)
doesn't need a change to jflex?  (It's only if you edit
StandardTokenizerImpl.java).

Hmm another complexity is handling a field instance that produced no
tokens.  Currently, we do not increment the cumulative offset by +1 in
such cases.  But, for position increment gap we always add this gap in
between fields if any field from the past have produced a token.  I
added a couple test cases for this.

Also, I fixed a bug in how CharTokenizer was computing its final
offset.

Still todo:
  - add test cases to cover NOT_ANALYZED fields
  - fix contrib tokenizers to implement getFinalOffset


> add getFinalOffset() to TokenStream
> -----------------------------------
>
>                 Key: LUCENE-1448
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1448
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch
>
>
> If you add multiple Fieldable instances for the same field name to a document, and you then index those fields with TermVectors storing offsets, it's very likely the offsets for all but the first field instance will be wrong.
> This is because IndexWriter under the hood adds a cumulative base to the offsets of each field instance, where that base is 1 + the endOffset of the last token it saw when analyzing that field.
> But this logic is overly simplistic.  For example, if the WhitespaceAnalyzer is being used, and the text being analyzed ended in 3 whitespace characters, then that information is lost and then next field's offsets are then all 3 too small.  Similarly, if a StopFilter appears in the chain, and the last N tokens were stop words, then the base will be 1 + the endOffset of the last non-stopword token.
> To fix this, I'd like to add a new getFinalOffset() to TokenStream.  I'm thinking by default it returns -1, which means "I don't know so you figure it out", meaning we fallback to the faulty logic we have today.
> This has come up several times on the user's list.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1448) add getFinalOffset() to TokenStream

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734023#action_12734023 ] 

Michael Busch commented on LUCENE-1448:
---------------------------------------

Hmm one thing I haven't done yet is changing Tee/Sink and CachingTokenFilter.

But it should be simple: CachingTokenFilter.end() should call input.end() when 
it is called for the first time and store the captured state locally as finalState. 
Then whenever CachingTokenFilter.end() is called again, it just restores the
finalState.

For Tee/Sink it should work similarly: The tee just puts a finalState into the
sink(s) the first time end() is called. And when end() of a sink is called it 
restores the finalState.

This should work?

> add getFinalOffset() to TokenStream
> -----------------------------------
>
>                 Key: LUCENE-1448
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1448
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>            Reporter: Michael McCandless
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch
>
>
> If you add multiple Fieldable instances for the same field name to a document, and you then index those fields with TermVectors storing offsets, it's very likely the offsets for all but the first field instance will be wrong.
> This is because IndexWriter under the hood adds a cumulative base to the offsets of each field instance, where that base is 1 + the endOffset of the last token it saw when analyzing that field.
> But this logic is overly simplistic.  For example, if the WhitespaceAnalyzer is being used, and the text being analyzed ended in 3 whitespace characters, then that information is lost and then next field's offsets are then all 3 too small.  Similarly, if a StopFilter appears in the chain, and the last N tokens were stop words, then the base will be 1 + the endOffset of the last non-stopword token.
> To fix this, I'd like to add a new getFinalOffset() to TokenStream.  I'm thinking by default it returns -1, which means "I don't know so you figure it out", meaning we fallback to the faulty logic we have today.
> This has come up several times on the user's list.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Issue Comment Edited: (LUCENE-1448) add getFinalOffset() to TokenStream

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646549#action_12646549 ] 

markrmiller@gmail.com edited comment on LUCENE-1448 at 11/11/08 6:29 AM:
---------------------------------------------------------------

You need that +1 or you will have the subsequent token starting on the tail of the 'stopword'.

What I can't figure it out is how exactly these offsets are supposed to match up...abcd has offsets of s:0 e:4, which seems to imply it thinks abcd is 5 chars or the end is one greater than the end index (like with spans). In either case, it seems even if you put back the +1, the endoffsets are off somehow, because some will have an end of +1 the end index, while secondary multi-fields will have an end equal to the end index.

Would be cool to have fixed as this also stymies highlighting with multi-fields.

*edit*

I see. You need that +1 you took out and you need fields after the first to have +1 more for an end offset. Looks they are supposed to be end index +1.

      was (Author: markrmiller@gmail.com):
    You need that +1 or you will have the subsequent token starting on the tail of the 'stopword'.

What I can't figure it out is how exactly these offsets are supposed to match up...abcd has offsets of s:0 e:4, which seems to imply it thinks abcd is 5 chars or the end is one greater than the end index (like with spans). In either case, it seems even if you put back the +1, the endoffsets are off somehow, because some will have an end of +1 the end index, while secondary multi-fields will have an end equal to the end index.

Would be cool to have fixed as this also stymies highlighting with multi-fields.
  
> add getFinalOffset() to TokenStream
> -----------------------------------
>
>                 Key: LUCENE-1448
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1448
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1448.patch
>
>
> If you add multiple Fieldable instances for the same field name to a document, and you then index those fields with TermVectors storing offsets, it's very likely the offsets for all but the first field instance will be wrong.
> This is because IndexWriter under the hood adds a cumulative base to the offsets of each field instance, where that base is 1 + the endOffset of the last token it saw when analyzing that field.
> But this logic is overly simplistic.  For example, if the WhitespaceAnalyzer is being used, and the text being analyzed ended in 3 whitespace characters, then that information is lost and then next field's offsets are then all 3 too small.  Similarly, if a StopFilter appears in the chain, and the last N tokens were stop words, then the base will be 1 + the endOffset of the last non-stopword token.
> To fix this, I'd like to add a new getFinalOffset() to TokenStream.  I'm thinking by default it returns -1, which means "I don't know so you figure it out", meaning we fallback to the faulty logic we have today.
> This has come up several times on the user's list.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1448) add getFinalOffset() to TokenStream

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12734292#action_12734292 ] 

Michael Busch commented on LUCENE-1448:
---------------------------------------

Cool, I will take this approach and submit a patch as soon as LUCENE-1693 is committed.

> add getFinalOffset() to TokenStream
> -----------------------------------
>
>                 Key: LUCENE-1448
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1448
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>            Reporter: Michael McCandless
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch
>
>
> If you add multiple Fieldable instances for the same field name to a document, and you then index those fields with TermVectors storing offsets, it's very likely the offsets for all but the first field instance will be wrong.
> This is because IndexWriter under the hood adds a cumulative base to the offsets of each field instance, where that base is 1 + the endOffset of the last token it saw when analyzing that field.
> But this logic is overly simplistic.  For example, if the WhitespaceAnalyzer is being used, and the text being analyzed ended in 3 whitespace characters, then that information is lost and then next field's offsets are then all 3 too small.  Similarly, if a StopFilter appears in the chain, and the last N tokens were stop words, then the base will be 1 + the endOffset of the last non-stopword token.
> To fix this, I'd like to add a new getFinalOffset() to TokenStream.  I'm thinking by default it returns -1, which means "I don't know so you figure it out", meaning we fallback to the faulty logic we have today.
> This has come up several times on the user's list.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

[jira] Commented: (LUCENE-1448) add getFinalOffset() to TokenStream

Posted by "Michael Busch (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/LUCENE-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723764#action_12723764 ] 

Michael Busch commented on LUCENE-1448:
---------------------------------------

Oh man, what did I do suggesting you as the RM?!? Now there's another guy chasing me! ;)

Currently I have to sacrifice some of my already very limited sleep for everything I do on Lucene. After next week I'll have more time. When everything else for 2.9 is done, then I don't think this should block it. Otherwise, I'll try to do it for 2.9.

> add getFinalOffset() to TokenStream
> -----------------------------------
>
>                 Key: LUCENE-1448
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1448
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>            Reporter: Michael McCandless
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch, LUCENE-1448.patch
>
>
> If you add multiple Fieldable instances for the same field name to a document, and you then index those fields with TermVectors storing offsets, it's very likely the offsets for all but the first field instance will be wrong.
> This is because IndexWriter under the hood adds a cumulative base to the offsets of each field instance, where that base is 1 + the endOffset of the last token it saw when analyzing that field.
> But this logic is overly simplistic.  For example, if the WhitespaceAnalyzer is being used, and the text being analyzed ended in 3 whitespace characters, then that information is lost and then next field's offsets are then all 3 too small.  Similarly, if a StopFilter appears in the chain, and the last N tokens were stop words, then the base will be 1 + the endOffset of the last non-stopword token.
> To fix this, I'd like to add a new getFinalOffset() to TokenStream.  I'm thinking by default it returns -1, which means "I don't know so you figure it out", meaning we fallback to the faulty logic we have today.
> This has come up several times on the user's list.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org