Posted to java-user@lucene.apache.org by Benson Margulies <bi...@gmail.com> on 2013/09/16 01:05:16 UTC

org.apache.lucene.analysis.icu.ICUNormalizer2Filter -- why Token?

Can anyone shed light as to why this is a token filter and not a char
filter? I'm wishing for one of these _upstream_ of a tokenizer, so that the
tokenizer's lookups in its dictionaries are seeing normalized contents.
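
To make the concern concrete, here is a minimal sketch that uses only the JDK's java.text.Normalizer, not the ICU filter discussed in this thread. The class name and the one-entry dictionary are hypothetical. It shows that a dictionary stored in NFC misses decomposed input unless the text is normalized before the lookup.

```java
import java.text.Normalizer;
import java.util.Set;

public class NormalizedLookupSketch {
    // Hypothetical dictionary, stored in NFC (precomposed) form.
    static final Set<String> DICTIONARY = Set.of("caf\u00e9");

    // Look up the text exactly as it arrived.
    static boolean rawLookup(String text) {
        return DICTIONARY.contains(text);
    }

    // Normalize to NFC first, then look up.
    static boolean normalizedLookup(String text) {
        return DICTIONARY.contains(Normalizer.normalize(text, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        // 'e' followed by a combining acute accent (decomposed form).
        String decomposed = "cafe\u0301";
        System.out.println("raw lookup:        " + rawLookup(decomposed));
        System.out.println("normalized lookup: " + normalizedLookup(decomposed));
    }
}
```

A tokenizer that consults a dictionary internally is in the position of rawLookup here unless something normalizes the character stream before it.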

Re: org.apache.lucene.analysis.icu.ICUNormalizer2Filter -- why Token?

Posted by Robert Muir <rc...@gmail.com>.

That would be great!

On Mon, Sep 16, 2013 at 1:41 PM, Benson Margulies <be...@basistech.com> wrote:
> Thanks, I might pitch in.
>
>
> On Mon, Sep 16, 2013 at 12:58 PM, Robert Muir <rc...@gmail.com> wrote:
>
>> Mostly because our tokenizers like StandardTokenizer will tokenize the
>> same way regardless of normalization form or whether it's normalized at
>> all.
>>
>> But for other tokenizers, such a charfilter should be useful: there is
>> a JIRA for it, but it has some unresolved issues
>>
>> https://issues.apache.org/jira/browse/LUCENE-4072
>>
>> On Sun, Sep 15, 2013 at 7:05 PM, Benson Margulies <bi...@gmail.com> wrote:
>> > Can anyone shed light as to why this is a token filter and not a char
>> > filter? I'm wishing for one of these _upstream_ of a tokenizer, so that the
>> > tokenizer's lookups in its dictionaries are seeing normalized contents.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>

Re: org.apache.lucene.analysis.icu.ICUNormalizer2Filter -- why Token?

Posted by Benson Margulies <be...@basistech.com>.

Thanks, I might pitch in.


On Mon, Sep 16, 2013 at 12:58 PM, Robert Muir <rc...@gmail.com> wrote:

> Mostly because our tokenizers like StandardTokenizer will tokenize the
> same way regardless of normalization form or whether it's normalized at
> all.
>
> But for other tokenizers, such a charfilter should be useful: there is
> a JIRA for it, but it has some unresolved issues
>
> https://issues.apache.org/jira/browse/LUCENE-4072
>
> On Sun, Sep 15, 2013 at 7:05 PM, Benson Margulies <bi...@gmail.com> wrote:
> > Can anyone shed light as to why this is a token filter and not a char
> > filter? I'm wishing for one of these _upstream_ of a tokenizer, so that the
> > tokenizer's lookups in its dictionaries are seeing normalized contents.

Re: org.apache.lucene.analysis.icu.ICUNormalizer2Filter -- why Token?

Posted by Robert Muir <rc...@gmail.com>.

Mostly because our tokenizers like StandardTokenizer will tokenize the
same way regardless of normalization form or whether it's normalized at
all.

But for other tokenizers, such a charfilter should be useful: there is
a JIRA for it, but it has some unresolved issues

https://issues.apache.org/jira/browse/LUCENE-4072
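
For readers who want the general shape of normalizing upstream of a tokenizer, here is a rough sketch in plain JDK code. It is not the patch on that JIRA and not a Lucene CharFilter. The helper names are hypothetical. It buffers the whole input and does no offset tracking, whereas a real Lucene CharFilter is expected to map offsets back to the original text.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.text.Normalizer;

public class UpstreamNormalizationSketch {
    // Hypothetical helper: drain a Reader into a String.
    static String readAll(Reader in) {
        try {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = in.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical helper: return a Reader over the NFC-normalized input.
    // Simplification: reads everything up front and loses original offsets.
    static Reader normalizeUpstream(Reader in) {
        return new StringReader(Normalizer.normalize(readAll(in), Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        Reader raw = new StringReader("cafe\u0301 latte"); // decomposed accent
        String seenByTokenizer = readAll(normalizeUpstream(raw));
        // The normalized text is one char shorter than the input.
        System.out.println(seenByTokenizer.length());
    }
}
```

Because the normalized text can differ in length from the input, any offsets a downstream tokenizer reports refer to the normalized text. A real char filter has to correct for that, and this sketch does not.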

On Sun, Sep 15, 2013 at 7:05 PM, Benson Margulies <bi...@gmail.com> wrote:
> Can anyone shed light as to why this is a token filter and not a char
> filter? I'm wishing for one of these _upstream_ of a tokenizer, so that the
> tokenizer's lookups in its dictionaries are seeing normalized contents.
