Posted to dev@lucene.apache.org by Koji Sekiguchi <ko...@r.email.ne.jp> on 2008/11/04 18:21:53 UTC

Proposal for introducing CharFilter

I'm working on SOLR-822 and trying to introduce new classes CharStream,
CharReader and CharFilter into Solr:

CharFilter - normalize characters before tokenizer
https://issues.apache.org/jira/browse/SOLR-822

CharFilter(s) will be placed between Reader and Tokenizer:

// CharReader is needed to convert Reader to CharStream
TokenStream stream = new MyTokenFilter( new MyTokenizer(
new MyCharFilter( new CharReader( reader ) ) ) );

and it does character-level filtering, just as a TokenFilter does
token-level filtering.

I attached a nice JPEG sample for "character normalization" in SOLR-822.
Please see:

https://issues.apache.org/jira/secure/attachment/12392639/character-normalization.JPG

As you can see, if you use a CharFilter, Token offsets could be incorrect
because a CharFilter may convert 1 char to 2 chars or the other way
around. So CharFilter has a method "correctOffset()" (CharStream
defines the method as abstract and CharFilter extends CharStream;
see SOLR-822 for details) so that a Tokenizer can correct token
offsets. But a Tokenizer must be "CharStream aware" to call the
method. How do folks feel about introducing CharFilter into Lucene
and changing *all* Tokenizers to "CharStream aware" Tokenizers in
Lucene 2.9/3.0?
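
To make the offset problem concrete, here is a minimal, self-contained
sketch of the idea. It does not use the SOLR-822 classes: SketchCharStream,
SketchCharReader, SharpSCharFilter and CharFilterSketch are hypothetical
stand-ins written only for this illustration, and the sharp-s expansion is
an assumed example of a "1 char to 2 chars" mapping.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for the proposed CharStream (not the SOLR-822 class):
// a Reader that can map an offset in its output back to the original text.
abstract class SketchCharStream extends Reader {
  abstract int correctOffset(int currentOff);
}

// Stand-in for CharReader: adapts a plain Reader, so offsets are unchanged.
final class SketchCharReader extends SketchCharStream {
  private final Reader input;
  SketchCharReader(Reader in) { input = in; }
  @Override int correctOffset(int currentOff) { return currentOff; }
  @Override public int read(char[] buf, int off, int len) throws IOException {
    return input.read(buf, off, len);
  }
  @Override public void close() throws IOException { input.close(); }
}

// Hypothetical filter that expands U+00DF (sharp s) to "ss": 1 char becomes 2.
final class SharpSCharFilter extends SketchCharStream {
  private final SketchCharStream input;
  // Output offsets of the extra chars this filter has inserted so far.
  private final List<Integer> insertedAt = new ArrayList<>();
  private int outPos = 0;            // number of chars emitted so far
  private boolean pendingS = false;  // true while the second 's' is still owed

  SharpSCharFilter(SketchCharStream in) { input = in; }

  @Override public int read() throws IOException {
    if (pendingS) {
      pendingS = false;
      insertedAt.add(outPos++);      // this 's' has no counterpart in the input
      return 's';
    }
    int c = input.read();
    if (c == -1) return -1;
    outPos++;
    if (c == '\u00DF') {
      pendingS = true;
      return 's';
    }
    return c;
  }

  @Override public int read(char[] buf, int off, int len) throws IOException {
    int n = 0;
    while (n < len) {
      int c = read();
      if (c == -1) break;
      buf[off + n++] = (char) c;
    }
    return (n == 0 && len > 0) ? -1 : n;
  }

  @Override public void close() throws IOException { input.close(); }

  @Override int correctOffset(int currentOff) {
    int extra = 0;
    for (int p : insertedAt) if (p < currentOff) extra++;
    // Remove our own insertions, then let the wrapped stream correct further.
    return input.correctOffset(currentOff - extra);
  }
}

final class CharFilterSketch {
  // Convenience helpers so callers need no checked-exception handling.
  static SharpSCharFilter open(String text) {
    return new SharpSCharFilter(new SketchCharReader(new StringReader(text)));
  }
  static String drain(Reader r) {
    StringBuilder sb = new StringBuilder();
    try {
      for (int c = r.read(); c != -1; c = r.read()) sb.append((char) c);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return sb.toString();
  }
}
```

With the 8-char input "Stra", U+00DF, "e 1", the filter emits the 9-char
text "Strasse 1". A tokenizer would see "Strasse" at [0,7) and "1" at [8,9)
in the filtered text; correctOffset() maps those back to [0,6) and [7,8) in
the original, which is what highlighting needs.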

Thank you,

Koji


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


Re: Proposal for introducing CharFilter

Posted by Koji Sekiguchi <ko...@r.email.ne.jp>.
Chris Hostetter wrote:
 > : > If a given Tokenizer does not need to do any character normalization (I
 > : would think most wouldn't) is there any added cost during tokenization with
 > : this change?
 > :
 > : Thank you for your reply, Mike!
 > : There is no added cost if Tokenizer doesn't need to call correctOffset().
 >
 > But every tokenizer *should* call correctOffset on the start/end offset of
 > every token it produces, correct?

Yes.

 > My understanding is that we would make a change like this...
 >
 > 1) change the Tokenizer class to look something like this...

(snip)

 > 2) change all of the Tokenizers shipped with Lucene to use correctOffset
 > when setting all start/end offsets on any Tokens.
 >
 > ...once those two things are done, anyone using out-of-the-box tokenizers
 > can use a CharStream and get correct offsets -- anyone with an existing
 > custom Tokenizer should continue to get the same behavior as before, but
 > if they want to start using a CharStream they need to tweak their code.

Looks great!

 > The only potential downside I can think of is the performance cost of the
 > added method calls -- but if we make NoOpCharStream.correctOffset final
 > the JVM should be able to optimize away the "identity" function,
 > correct?

I haven't looked into JVM optimization; however, we already have
the final class "CharReader" in Solr 1.4:

public final class CharReader extends CharStream {
  protected Reader input;
  public CharReader( Reader in ){
    input = in;
  }
  @Override
  public int correctOffset(int currentOff) {
    return currentOff;
  }
  :
}

and CharReader is instantiated in TokenizerChain.

Koji




Re: Proposal for introducing CharFilter

Posted by Chris Hostetter <ho...@fucit.org>.
: > If a given Tokenizer does not need to do any character normalization (I
: would think most wouldn't) is there any added cost during tokenization with
: this change?
: 
: Thank you for your reply, Mike!
: There is no added cost if Tokenizer doesn't need to call correctOffset().

But every tokenizer *should* call correctOffset on the start/end offset of
every token it produces, correct?

My understanding is that we would make a change like this...

1) change the Tokenizer class to look something like this...

public abstract class Tokenizer extends TokenStream {
  protected CharStream input;
  protected Tokenizer() {}
  protected Tokenizer(Reader input) {
    this(new NoOpCharStream(input));
  }
  protected Tokenizer(CharStream input) {
    this.input = input;
  }
  public void close() throws IOException {
    input.close();
  }
  public void reset(Reader input) throws IOException {
    if (input instanceof CharStream) {
       this.input = (CharStream)input;
    } else {
       this.input = new NoOpCharStream(input);
    }
  }
}

2) change all of the Tokenizers shipped with Lucene to use correctOffset 
when setting all start/end offsets on any Tokens.


...once those two things are done, anyone using out-of-the-box tokenizers
can use a CharStream and get correct offsets -- anyone with an existing
custom Tokenizer should continue to get the same behavior as before, but
if they want to start using a CharStream they need to tweak their code.
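
As a rough illustration of step 2, here is what a "CharStream aware"
tokenizer might look like. This is only a sketch under assumptions, not
Lucene code: OffsetStream, NoOpStream, HyphenStrippingStream and
WhitespaceTokenizerSketch are hypothetical names, with NoOpStream playing
the role of the NoOpCharStream above, and tokens are returned as plain
strings of the form text[start,end) to keep the example short.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for the proposed CharStream, not the real class.
abstract class OffsetStream extends Reader {
  abstract int correctOffset(int currentOff);
}

// Plays the role of NoOpCharStream: wraps a plain Reader, identity offsets.
final class NoOpStream extends OffsetStream {
  private final Reader input;
  NoOpStream(Reader in) { input = in; }
  @Override int correctOffset(int currentOff) { return currentOff; }
  @Override public int read(char[] buf, int off, int len) throws IOException {
    return input.read(buf, off, len);
  }
  @Override public void close() throws IOException { input.close(); }
}

// Hypothetical filter that drops '-' characters, so later offsets shift left.
final class HyphenStrippingStream extends OffsetStream {
  private final Reader input;
  private final List<Integer> deletedAt = new ArrayList<>(); // output offsets of deletions
  private int outPos = 0;
  HyphenStrippingStream(Reader in) { input = in; }
  @Override public int read() throws IOException {
    int c;
    while ((c = input.read()) == '-') deletedAt.add(outPos);
    if (c != -1) outPos++;
    return c;
  }
  @Override public int read(char[] buf, int off, int len) throws IOException {
    int n = 0;
    while (n < len) {
      int c = read();
      if (c == -1) break;
      buf[off + n++] = (char) c;
    }
    return (n == 0 && len > 0) ? -1 : n;
  }
  @Override public void close() throws IOException { input.close(); }
  @Override int correctOffset(int currentOff) {
    int deleted = 0;
    for (int p : deletedAt) if (p <= currentOff) deleted++;
    return currentOff + deleted;
  }
}

// A whitespace tokenizer that is "CharStream aware" in the sense discussed.
final class WhitespaceTokenizerSketch {
  private final OffsetStream input;
  private int pos = 0; // offset into the (possibly filtered) stream

  WhitespaceTokenizerSketch(Reader in) {
    // A plain Reader gets the no-op wrapper, so old callers see no change.
    input = (in instanceof OffsetStream) ? (OffsetStream) in : new NoOpStream(in);
  }

  // Returns the next token as "text[start,end)", or null at end of input.
  String next() throws IOException {
    StringBuilder sb = new StringBuilder();
    int start = -1;
    for (int c = input.read(); c != -1; c = input.read()) {
      int at = pos++;
      if (Character.isWhitespace(c)) {
        if (sb.length() > 0) break;
        continue;
      }
      if (start < 0) start = at;
      sb.append((char) c);
    }
    if (sb.length() == 0) return null;
    int end = start + sb.length();
    // The key step: every start/end offset goes through correctOffset.
    return sb + "[" + input.correctOffset(start) + "," + input.correctOffset(end) + ")";
  }

  // Demo helper: tokenizes text, optionally through the hyphen-stripping filter.
  static List<String> tokenize(String text, boolean stripHyphens) {
    Reader r = new StringReader(text);
    if (stripHyphens) r = new HyphenStrippingStream(r);
    WhitespaceTokenizerSketch t = new WhitespaceTokenizerSketch(r);
    List<String> out = new ArrayList<>();
    try {
      for (String tok = t.next(); tok != null; tok = t.next()) out.add(tok);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return out;
  }
}
```

For the input "co-op shop" with the filter enabled, the tokenizer sees
"coop shop" but reports coop[0,5) and shop[6,10), i.e. offsets into the
original text. With a plain Reader the identity wrapper is used and the
offsets come out exactly as they would without any of this machinery.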

The only potential downside I can think of is the performance cost of the
added method calls -- but if we make NoOpCharStream.correctOffset final
the JVM should be able to optimize away the "identity" function,
correct?



-Hoss




Re: Proposal for introducing CharFilter

Posted by Koji Sekiguchi <ko...@r.email.ne.jp>.
 > This looks like a good idea, thanks!
 >
 > If a given Tokenizer does not need to do any character normalization
 > (I would think most wouldn't) is there any added cost during
 > tokenization with this change?

Thank you for your reply, Mike!
There is no added cost if Tokenizer doesn't need to call correctOffset().

Koji




Re: Proposal for introducing CharFilter

Posted by Michael McCandless <lu...@mikemccandless.com>.
This looks like a good idea, thanks!

If a given Tokenizer does not need to do any character normalization  
(I would think most wouldn't) is there any added cost during  
tokenization with this change?

Mike


