Posted to java-user@lucene.apache.org by tsuraan <ts...@gmail.com> on 2010/05/27 00:29:12 UTC

Customer TokenFilter

I'd like to have all my queries and terms run through Unicode
Normalization prior to being executed/indexed.  I've been using the
StandardAnalyzer with pretty good luck for the past few years, so I
think I'd like to write an analyzer that wraps that, and tacks a
custom TokenFilter onto the chain provided by the StandardAnalyzer.
I'm really not clear, though, on how to write a TokenFilter.  My best
guess is that I want to write a class that overrides getAttribute, and
uses java.text.Normalizer to normalize any TermAttribute that is
returned from the upstream filter.  Is that correct, or should I put
my normalization somewhere else?  Are there any docs on making custom
filters/analyzers?  I didn't have much luck finding any.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Customer TokenFilter

Posted by tsuraan <ts...@gmail.com>.

> Looks correct! Wrapping with CharBuffer is very clever! In Lucene 3.1 the
> new Term Attribute will implement CharSequence, so it's even simpler. You
> may also look at 3.1's ICU contrib, which even has support for Normalizer2.

Ok, I've only been looking at 3.0.1 so far; I'll check out the 3.1
work and see what's there.

> Overriding StandardAnalyzer is the wrong way, as in 3.1 it's final (it's only
> accidentally non-final)! You should always extend Analyzer and build the
> chain yourself. Or wrap another analyzer like in PerFieldAnalyzerWrapper.
> The whole Tokenization API in Lucene is based on the delegator pattern, so
> never extend any class not made for it!

Ok, I've changed it to just borrowing the guts of the StandardAnalyzer
rather than inheriting.  I also think I got the reusableTokenStream
method correct this way, so that's a bonus.  Care to have another look
at it?

RE: Customer TokenFilter

Posted by Uwe Schindler <uw...@thetaphi.de>.

Looks correct! Wrapping with CharBuffer is very clever! In Lucene 3.1 the
new Term Attribute will implement CharSequence, so it's even simpler. You
may also look at 3.1's ICU contrib, which even has support for Normalizer2.
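
For readers following along, the CharBuffer idea can be sketched with
only the JDK. This is a minimal illustration, not the code from the
linked repository: the class name TermBufferNormalizer, the method
signature, and the choice of NFC are all assumptions made for the
example. It shows a term held as a char[] plus a length being handed
to java.text.Normalizer through a CharBuffer view instead of being
copied into a String first.

```java
import java.nio.CharBuffer;
import java.text.Normalizer;

// Hypothetical helper, not part of Lucene.
final class TermBufferNormalizer {
    private TermBufferNormalizer() {}

    /**
     * Returns the NFC form of the first `length` chars of `buffer`.
     * NFC is an assumption; the thread does not say which form is wanted.
     */
    static String normalize(char[] buffer, int length) {
        // CharBuffer.wrap gives a CharSequence view over the array, no copy.
        CharSequence view = CharBuffer.wrap(buffer, 0, length);
        // Cheap check first, so already-normalized terms skip the rewrite.
        if (Normalizer.isNormalized(view, Normalizer.Form.NFC)) {
            return new String(buffer, 0, length);
        }
        return Normalizer.normalize(view, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // 'e' followed by a combining acute accent; trailing chars are
        // stale buffer content that must be ignored.
        char[] buffer = {'e', '\u0301', 'z', 'z'};
        System.out.println(normalize(buffer, 2).length());
    }
}
```

In a real filter the result would be written back into the term
attribute's buffer; that step depends on the Lucene version and is
left out here.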

Overriding StandardAnalyzer is the wrong way, as in 3.1 it's final (it's only
accidentally non-final)! You should always extend Analyzer and build the
chain yourself. Or wrap another analyzer like in PerFieldAnalyzerWrapper.
The whole Tokenization API in Lucene is based on the delegator pattern, so
never extend any class not made for it!
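
The delegation advice above can be pictured with a toy chain that
uses no Lucene classes at all. Everything here is hypothetical and
only stands in for the real API: TokenSource plays the role of a
token stream, each stage wraps the one before it, and
AnalysisChainSketch.build plays the role of an analyzer that
assembles the chain itself rather than subclassing an existing one.
NFC is again an assumed choice of normalization form.

```java
import java.text.Normalizer;
import java.util.Locale;

/** Hypothetical stand-in for a token stream; not a Lucene type. */
interface TokenSource {
    /** Returns the next token, or null when the stream is exhausted. */
    String next();
}

/** Source stage: splits the input text on whitespace. */
final class WhitespaceSource implements TokenSource {
    private final String[] parts;
    private int pos = 0;

    WhitespaceSource(String text) {
        String trimmed = text.trim();
        this.parts = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    @Override
    public String next() {
        return pos < parts.length ? parts[pos++] : null;
    }
}

/** Filter stage: delegates to its input and lower-cases each token. */
final class LowerCaseStage implements TokenSource {
    private final TokenSource input;

    LowerCaseStage(TokenSource input) {
        this.input = input;
    }

    @Override
    public String next() {
        String token = input.next();
        return token == null ? null : token.toLowerCase(Locale.ROOT);
    }
}

/** Filter stage: delegates to its input and applies NFC to each token. */
final class NormalizingStage implements TokenSource {
    private final TokenSource input;

    NormalizingStage(TokenSource input) {
        this.input = input;
    }

    @Override
    public String next() {
        String token = input.next();
        return token == null ? null : Normalizer.normalize(token, Normalizer.Form.NFC);
    }
}

/** Builds the chain by composition; no stage extends another stage. */
final class AnalysisChainSketch {
    private AnalysisChainSketch() {}

    static TokenSource build(String text) {
        return new NormalizingStage(new LowerCaseStage(new WhitespaceSource(text)));
    }
}
```

Adding or reordering a stage only changes the build method, which is
the point of wrapping instead of inheriting.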

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: tsuraan [mailto:tsuraan@gmail.com]
> Sent: Thursday, May 27, 2010 10:38 PM
> To: java-user
> Subject: Re: Customer TokenFilter
> 
> > I'd like to have all my queries and terms run through Unicode
> > Normalization prior to being executed/indexed.  I've been using the
> > StandardAnalyzer with pretty good luck for the past few years, so I
> > think I'd like to write an analyzer that wraps that, and tacks a
> > custom TokenFilter onto the chain provided by the StandardAnalyzer.
> > I'm really not clear, though, on how to write a TokenFilter.  My best
> > guess is that I want to write a class that overrides getAttribute, and
> > uses java.text.Normalizer to normalize any TermAttribute that is
> > returned from the upstream filter.  Is that correct, or should I put
> > my normalization somewhere else?  Are there any docs on making custom
> > filters/analyzers?  I didn't have much luck finding any.
> 
> Ok, I think that's probably the wrong approach, and I did something
> different by imitating the LowerCaseFilter.  If somebody could take a
> look at what I've put up at
> http://github.com/tsuraan/StandardNormalizingAnalyzer, and tell me if
> there's something horrible about what I've done, I'd really appreciate
> it.  It passes the small unit tests I've made, but I'd really like to
> know if there's something glaringly wrong about my approach.

Re: Customer TokenFilter

Posted by tsuraan <ts...@gmail.com>.

> I'd like to have all my queries and terms run through Unicode
> Normalization prior to being executed/indexed.  I've been using the
> StandardAnalyzer with pretty good luck for the past few years, so I
> think I'd like to write an analyzer that wraps that, and tacks a
> custom TokenFilter onto the chain provided by the StandardAnalyzer.
> I'm really not clear, though, on how to write a TokenFilter.  My best
> guess is that I want to write a class that overrides getAttribute, and
> uses java.text.Normalizer to normalize any TermAttribute that is
> returned from the upstream filter.  Is that correct, or should I put
> my normalization somewhere else?  Are there any docs on making custom
> filters/analyzers?  I didn't have much luck finding any.

Ok, I think that's probably the wrong approach, and I did something
different by imitating the LowerCaseFilter.  If somebody could take a
look at what I've put up at
http://github.com/tsuraan/StandardNormalizingAnalyzer, and tell me if
there's something horrible about what I've done, I'd really appreciate
it.  It passes the small unit tests I've made, but I'd really like to
know if there's something glaringly wrong about my approach.
