Posted to dev@lucene.apache.org by "Uwe Schindler (JIRA)" <ji...@apache.org> on 2009/09/10 19:15:57 UTC

[jira] Issue Comment Edited: (LUCENE-1906) Problem with CharStream and Tokenizers with custom reset(Reader) method

    [ https://issues.apache.org/jira/browse/LUCENE-1906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753709#action_12753709 ] 

Uwe Schindler edited comment on LUCENE-1906 at 9/10/09 10:15 AM:
-----------------------------------------------------------------

One possibility to prevent this break would be the following:
- revert the CharStream to a Reader in Tokenizer (restore the old class).
- add a method correctOffset(int) to Tokenizer that does:
{code}
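// Map an offset in the (possibly filtered) input back to the original text.
// A plain Reader performs no filtering, so its offsets are returned unchanged.
protected final int correctOffset(int offset) {
  if (input instanceof CharStream) return ((CharStream) input).correctOffset(offset);
  return offset;
}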
{code}
- Change the usage of correctOffset in all Tokenizers in core/contrib (only a few are affected)
- revert the backwards-branch changes

In this case, the input of a Tokenizer always stays a java.io.Reader, and offset correction defaults to a no-op. If a custom Reader extends CharStream, the offset correction is applied automatically. This is 100% backwards compatible, with one exception: old Tokenizers that do not call correctOffset() before passing offsets to Token/OffsetAttribute would produce invalid offsets if used with CharStreams instead of Readers.
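
To illustrate the idea, here is a minimal, self-contained sketch. It does not use the real Lucene classes: SketchCharStream, ShiftingCharStream, SketchTokenizer and WholeInputTokenizer are hypothetical stand-ins invented for this example, and the fixed "shift" is only a toy offset mapping. It is meant to show the instanceof fallback, not the final patch.

```java
import java.io.FilterReader;
import java.io.Reader;
import java.io.StringReader;

class OffsetCorrectionSketch {

  // Hypothetical stand-in for CharStream: a Reader that can also map offsets.
  static abstract class SketchCharStream extends FilterReader {
    protected SketchCharStream(Reader in) { super(in); }
    public abstract int correctOffset(int currentOff);
  }

  // Toy CharStream that pretends a fixed number of chars were removed
  // in front of the text, so every offset must be shifted by that amount.
  static class ShiftingCharStream extends SketchCharStream {
    private final int shift;
    ShiftingCharStream(Reader in, int shift) { super(in); this.shift = shift; }
    @Override public int correctOffset(int currentOff) { return currentOff + shift; }
  }

  // Hypothetical stand-in for Tokenizer: the input is always a Reader.
  static abstract class SketchTokenizer {
    protected Reader input;
    protected SketchTokenizer(Reader input) { this.input = input; }
    public void reset(Reader input) { this.input = input; }

    // Correction only happens if the Reader happens to be a SketchCharStream.
    protected final int correctOffset(int offset) {
      if (input instanceof SketchCharStream) {
        return ((SketchCharStream) input).correctOffset(offset);
      }
      return offset;
    }
  }

  // A subclass that routes its raw offsets through correctOffset().
  static class WholeInputTokenizer extends SketchTokenizer {
    WholeInputTokenizer(Reader input) { super(input); }
    int reportedOffset(int rawOffset) { return correctOffset(rawOffset); }
  }

  public static void main(String[] args) {
    WholeInputTokenizer t = new WholeInputTokenizer(new StringReader("abc"));
    System.out.println(t.reportedOffset(2)); // plain Reader: offset unchanged

    t.reset(new ShiftingCharStream(new StringReader("abc"), 5));
    System.out.println(t.reportedOffset(2)); // CharStream: offset shifted by 5
  }
}
```

Because there is only one reset(Reader) in this sketch, a subclass that overrides it is always called, whether or not the Reader is a CharStream.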

> Problem with CharStream and Tokenizers with custom reset(Reader) method
> -----------------------------------------------------------------------
>
>                 Key: LUCENE-1906
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1906
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Analysis
>    Affects Versions: 2.9
>            Reporter: Uwe Schindler
>            Assignee: Uwe Schindler
>            Priority: Blocker
>             Fix For: 2.9
>
>         Attachments: backwards-break.patch, LUCENE-1906.patch
>
>
> When reviewing the new CharStream code added to Tokenizers, I found a
> serious backwards-compatibility problem with Tokenizers that do
> not override reset(CharStream).
> The problem is that, e.g., CharTokenizer only overrides reset(Reader):
> {code}
>   public void reset(Reader input) throws IOException {
>     super.reset(input);
>     bufferIndex = 0;
>     offset = 0;
>     dataLen = 0;
>   }
> {code}
> If you reset such a Tokenizer with another CharStream (not a plain Reader),
> this method will never be called, breaking the whole Tokenizer.
> As CharStream extends Reader, I propose to remove this reset(CharStream)
> method and simply do an instanceof check to detect whether the supplied Reader
> is not a CharStream and wrap it. We could also remove the extra ctor (because
> most Tokenizers have no support for passing CharStreams). If the ctor also
> checks with instanceof and wraps as needed, the code is backwards compatible
> and we do not need to add additional ctors in subclasses.
> As this instanceof check is always done in CharReader.get(), why not remove
> ctor(CharStream) and reset(CharStream) completely?
> Any thoughts?
> I would like to fix this somehow before RC4, I'm sorry :(
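
The overload problem described in the issue can be reproduced with a small sketch. The classes below are hypothetical stand-ins, not the Lucene ones: SketchCharStream plays the role of CharStream, BaseTokenizer has both reset overloads, and BufferingTokenizer overrides only reset(Reader), like CharTokenizer. Java picks the overload from the static type of the argument, so the subclass override is skipped.

```java
import java.io.FilterReader;
import java.io.Reader;
import java.io.StringReader;

class ResetOverloadSketch {

  // Hypothetical stand-in for CharStream (it is still a Reader).
  static class SketchCharStream extends FilterReader {
    SketchCharStream(Reader in) { super(in); }
  }

  // Base class with two overloads, as in the design being criticized.
  static class BaseTokenizer {
    protected Reader input;
    public void reset(Reader input) { this.input = input; }
    public void reset(SketchCharStream input) { this.input = input; }
  }

  // Like CharTokenizer: only reset(Reader) is overridden to clear state.
  static class BufferingTokenizer extends BaseTokenizer {
    int bufferIndex = 3; // pretend state left over from a previous document

    @Override public void reset(Reader input) {
      super.reset(input);
      bufferIndex = 0;
    }
  }

  public static void main(String[] args) {
    BufferingTokenizer t = new BufferingTokenizer();

    // Static type is SketchCharStream, so reset(SketchCharStream) is chosen
    // and the override above never runs: the stale state survives.
    SketchCharStream cs = new SketchCharStream(new StringReader("x"));
    t.reset(cs);
    System.out.println("after reset(CharStream): bufferIndex=" + t.bufferIndex);

    // Same object, but static type Reader: the override runs and clears state.
    Reader asReader = cs;
    t.reset(asReader);
    System.out.println("after reset(Reader): bufferIndex=" + t.bufferIndex);
  }
}
```

With a single reset(Reader) in the base class, both calls would go through the subclass override, which is the motivation for dropping reset(CharStream).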
