Posted to dev@lucene.apache.org by Koji Sekiguchi <ko...@r.email.ne.jp> on 2008/11/04 18:21:53 UTC
Proposal for introducing CharFilter
I'm working on SOLR-822 and trying to introduce new classes CharStream,
CharReader and CharFilter into Solr:
CharFilter - normalize characters before tokenizer
https://issues.apache.org/jira/browse/SOLR-822
CharFilter(s) will be placed between Reader and Tokenizer:
// CharReader is needed to convert Reader to CharStream
TokenStream stream = new MyTokenFilter( new MyTokenizer(
new MyCharFilter( new CharReader( reader ) ) ) );
and it does character-level filtering, just as TokenFilter does
token-level filtering.
I attached a nice JPEG sample for "character normalization" in SOLR-822.
Please see:
https://issues.apache.org/jira/secure/attachment/12392639/character-normalization.JPG
As you can see, if you use CharFilter, Token offsets could be incorrect
because CharFilters may convert 1 char to 2 chars or the other way
around. So, CharFilter has a method "correctOffset()" (CharStream
defines the method as abstract and CharFilter extends CharStream;
see SOLR-822 for the details) so that Tokenizer can correct token
offsets. But Tokenizer needs to be "CharStream aware" to call the
method. How do folks feel about introducing CharFilter into Lucene
and changing *all* Tokenizers to "CharStream aware" Tokenizers in
Lucene 2.9/3.0?
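To illustrate the offset problem, here is a minimal toy sketch (not the
SOLR-822 code; the class name ToyCharFilter and its internals are made up
for this example). It collapses the two-character sequence "ss" into the
single character 'ß', records the cumulative shift at each collapse point,
and exposes a correctOffset() in the spirit of the proposal so that offsets
in the filtered text can be mapped back to the original text:

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration only: collapses "ss" -> 'ß' and tracks how far output
// offsets lag behind input offsets after each collapse.
class ToyCharFilter {
  private final StringBuilder out = new StringBuilder();
  // (offsetInOutput, cumulativeShift) pairs recorded at each collapse point
  private final List<int[]> corrections = new ArrayList<>();

  ToyCharFilter(String in) {
    int shift = 0;
    for (int i = 0; i < in.length(); ) {
      if (i + 1 < in.length() && in.charAt(i) == 's' && in.charAt(i + 1) == 's') {
        out.append('ß');   // 2 input chars -> 1 output char
        i += 2;
        shift += 1;        // output offsets now lag input offsets by one more
        corrections.add(new int[] { out.length(), shift });
      } else {
        out.append(in.charAt(i));
        i += 1;
      }
    }
  }

  String output() { return out.toString(); }

  // Map an offset in the filtered text back to the original text, as a
  // "CharStream aware" Tokenizer would do via correctOffset().
  int correctOffset(int currentOff) {
    int shift = 0;
    for (int[] c : corrections) {
      if (c[0] <= currentOff) shift = c[1];
    }
    return currentOff + shift;
  }
}
```

For "glasses" the filtered text is "glaßes"; a token ending at offset 6 in
the filtered text is corrected back to offset 7, the true end in the original.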
Thank you,
Koji
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
Re: Proposal for introducing CharFilter
Posted by Koji Sekiguchi <ko...@r.email.ne.jp>.
Chris Hostetter wrote:
> : > If a given Tokenizer does not need to do any character normalization
> : > (I would think most wouldn't) is there any added cost during
> : > tokenization with this change?
> :
> : Thank you for your reply, Mike!
> : There is no added cost if Tokenizer doesn't need to call correctOffset().
>
> But every tokenizer *should* call correctOffset on the start/end offset of
> every token it produces, correct?
Yes.
> My understanding is that we would make a change like this...
>
> 1) change the Tokenizer class to look something like this...
(snip)
> 2) change all of the Tokenizers shipped with Lucene to use correctOffset
> when setting all start/end offsets on any Tokens.
>
> ...once those two things are done, anyone using out-of-the-box tokenizers
> can use a CharStream and get correct offsets -- anyone with an existing
> custom Tokenizer should continue to get the same behavior as before, but
> if they want to start using a CharStream they need to tweak their code.
Looks great!
> The only potential downside I can think of is the performance cost of the
> added method calls -- but if we make NoOpCharStream.correctOffset final
> the JVM should be able to optimize away the "identity" function, correct?
I hadn't considered JVM optimization, but we already have
the final class "CharReader" in Solr 1.4:
public final class CharReader extends CharStream {

  protected Reader input;

  public CharReader( Reader in ){
    input = in;
  }

  @Override
  public int correctOffset(int currentOff) {
    return currentOff;
  }

  :
}
and CharReader is instantiated in TokenizerChain.
Koji
Re: Proposal for introducing CharFilter
Posted by Chris Hostetter <ho...@fucit.org>.
: > If a given Tokenizer does not need to do any character normalization
: > (I would think most wouldn't) is there any added cost during
: > tokenization with this change?
:
: Thank you for your reply, Mike!
: There is no added cost if Tokenizer doesn't need to call correctOffset().
But every tokenizer *should* call correctOffset on the start/end offset of
every token it produces, correct?
My understanding is that we would make a change like this...
1) change the Tokenizer class to look something like this...
public abstract class Tokenizer extends TokenStream {

  protected CharStream input;

  protected Tokenizer() {}

  protected Tokenizer(Reader input) {
    this(new NoOpCharStream(input));
  }

  protected Tokenizer(CharStream input) {
    this.input = input;
  }

  public void close() throws IOException {
    input.close();
  }

  public void reset(Reader input) throws IOException {
    if (input instanceof CharStream) {
      this.input = (CharStream)input;
    } else {
      this.input = new NoOpCharStream(input);
    }
  }
}
2) change all of the Tokenizers shipped with Lucene to use correctOffset
when setting all start/end offsets on any Tokens.
...once those two things are done, anyone using out-of-the-box tokenizers
can use a CharStream and get correct offsets -- anyone with an existing
custom Tokenizer should continue to get the same behavior as before, but
if they want to start using a CharStream they need to tweak their code.
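Step 2 can be sketched with a toy whitespace tokenizer (the names
OffsetCorrector, Tok, and ToyWhitespaceTokenizer are hypothetical stand-ins
for the proposed CharStream and Lucene's Token; this is not the patch code).
The only change from a plain tokenizer is that raw offsets are passed
through correctOffset() before being stored on a token:

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for the proposed CharStream's offset-correction contract.
interface OffsetCorrector {
  int correctOffset(int currentOff);
}

// Minimal stub for a token: term text plus start/end character offsets.
class Tok {
  final String term;
  final int start, end;
  Tok(String term, int start, int end) {
    this.term = term; this.start = start; this.end = end;
  }
}

class ToyWhitespaceTokenizer {
  static List<Tok> tokenize(String filteredText, OffsetCorrector input) {
    List<Tok> toks = new ArrayList<>();
    int i = 0;
    while (i < filteredText.length()) {
      while (i < filteredText.length() && Character.isWhitespace(filteredText.charAt(i))) i++;
      int start = i;
      while (i < filteredText.length() && !Character.isWhitespace(filteredText.charAt(i))) i++;
      if (i > start) {
        // The "CharStream aware" part: offsets are corrected so they point
        // into the ORIGINAL text, not the filtered one.
        toks.add(new Tok(filteredText.substring(start, i),
                         input.correctOffset(start),
                         input.correctOffset(i)));
      }
    }
    return toks;
  }
}
```

With an identity corrector (the NoOpCharStream case) the offsets come out
unchanged, so existing behavior is preserved; with a real corrector the
offsets shift to match the pre-filter text.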
The only potential downside I can think of is the performance cost of the
added method calls -- but if we make NoOpCharStream.correctOffset final
the JVM should be able to optimize away the "identity" function, correct?
-Hoss
Re: Proposal for introducing CharFilter
Posted by Koji Sekiguchi <ko...@r.email.ne.jp>.
> This looks like a good idea, thanks!
>
> If a given Tokenizer does not need to do any character normalization
> (I would think most wouldn't) is there any added cost during
> tokenization with this change?
Thank you for your reply, Mike!
There is no added cost if Tokenizer doesn't need to call correctOffset().
Koji
Re: Proposal for introducing CharFilter
Posted by Michael McCandless <lu...@mikemccandless.com>.
This looks like a good idea, thanks!
If a given Tokenizer does not need to do any character normalization
(I would think most wouldn't) is there any added cost during
tokenization with this change?
Mike
Koji Sekiguchi wrote:
> I'm working on SOLR-822 and trying to introduce new classes
> CharStream,
> CharReader and CharFilter into Solr:
>
> CharFilter - normalize characters before tokenizer
> https://issues.apache.org/jira/browse/SOLR-822
>
> CharFilter(s) will be placed between Reader and Tokenizer:
>
> // CharReader is needed to convert Reader to CharStream
> TokenStream stream = new MyTokenFilter( new MyTokenizer(
> new MyCharFilter( new CharReader( reader ) ) ) );
>
> and it does character-level filtering, just as TokenFilter does
> token-level filtering.
>
> I attached a nice JPEG sample for "character normalization" in
> SOLR-822.
> Please see:
>
> https://issues.apache.org/jira/secure/attachment/12392639/character-normalization.JPG
>
> As you can see, if you use CharFilter, Token offsets could be incorrect
> because CharFilters may convert 1 char to 2 chars or the other way
> around. So, CharFilter has a method "correctOffset()" (CharStream
> defines the method as abstract and CharFilter extends CharStream;
> see SOLR-822 for the details) so that Tokenizer can correct token
> offsets. But Tokenizer needs to be "CharStream aware" to call the
> method. How do folks feel about introducing CharFilter into Lucene
> and changing *all* Tokenizers to "CharStream aware" Tokenizers in
> Lucene 2.9/3.0?
>
> Thank you,
>
> Koji