Posted to java-user@lucene.apache.org by Georger Araujo <ge...@gmail.com> on 2011/02/06 21:28:42 UTC

Extending org.apache.lucene.analysis.br.BrazilianAnalyzer to discard numeric tokens

Hi,
I started using Lucene a few weeks ago, and I must say I'm amazed. Hats off
to the developers and the community!
I'd like to write a custom analyzer that differs from
org.apache.lucene.analysis.br.BrazilianAnalyzer only in that it discards
numeric tokens from the input. I've looked at the code and also at the
discussion in [1], but I'm not sure what the simplest and cleanest approach
is.
What do you think?

[1]
http://mail-archives.apache.org/mod_mbox/lucene-java-user/200809.mbox/%3C48D7948B.4030804@gmail.com%3E

Best regards,

Georger

Re: Extending org.apache.lucene.analysis.br.BrazilianAnalyzer to discard numeric tokens

Posted by Georger Araujo <ge...@gmail.com>.
2011/2/7 Robert Muir <rc...@gmail.com>

> On Sun, Feb 6, 2011 at 3:28 PM, Georger Araujo <ge...@gmail.com> wrote:
> > Hi,
> > I started using Lucene a few weeks ago, and I must say I'm amazed. Hats off
> > to the developers and the community!
> > I'd like to write a custom analyzer that differs from
> > org.apache.lucene.analysis.br.BrazilianAnalyzer only in that it discards
> > numeric tokens from the input. I've looked at the code and also at the
> > discussion in [1], but I'm not sure what the simplest and cleanest approach
> > is.
> > What do you think?
>
> Hi, in general the supplied analyzers are really just general-purpose
> examples.
>
> So I would write your own analyzer, using a tokenizer that discards
> numbers (such as LowerCaseTokenizer) instead of StandardTokenizer:
> something like LowerCaseTokenizer + BrazilianStemFilter + a StopFilter
> with the Brazilian stop words.
>
Hi,
I investigated this further and found that StandardTokenizer is actually
what I need, since I have to index email addresses, acronyms, etc. So I'll
use the org.apache.lucene.analysis.StopFilter class as a starting point for
a custom TokenFilter that discards numbers, then extend BrazilianAnalyzer
and add this custom TokenFilter as the final filter in the chain. I believe
the end result will be simpler and cleaner this way.
Best regards,

Georger
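
A minimal sketch of the check such a number-discarding filter might apply,
written in plain Java with no Lucene dependency. The class NumericTokenCheck
and its methods are hypothetical names used only for illustration, and the
definition of "numeric" (digits plus a few common separators) is an
assumption; in a real TokenFilter the same test would presumably run on each
token's text inside incrementToken().

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: decides which tokens to drop.
class NumericTokenCheck {

    // True when the token contains at least one digit and otherwise only
    // common numeric separators ('.', ',', '-', '/', ':').
    static boolean isNumeric(String token) {
        boolean sawDigit = false;
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (Character.isDigit(c)) {
                sawDigit = true;
            } else if (".,-/:".indexOf(c) < 0) {
                return false; // a letter or any other symbol: not numeric
            }
        }
        return sawDigit;
    }

    // Returns a new list holding only the non-numeric tokens, in order.
    static List<String> dropNumeric(List<String> tokens) {
        List<String> kept = new ArrayList<String>();
        for (String token : tokens) {
            if (!isNumeric(token)) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

With this definition, tokens such as "2011" and "1.234,56" are dropped, while
"joao@example.com" and mixed tokens such as "r2d2" are kept, which matches
the goal of keeping StandardTokenizer's email and acronym tokens intact.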

Re: Extending org.apache.lucene.analysis.br.BrazilianAnalyzer to discard numeric tokens

Posted by Robert Muir <rc...@gmail.com>.
On Sun, Feb 6, 2011 at 3:28 PM, Georger Araujo <ge...@gmail.com> wrote:
> Hi,
> I started using Lucene a few weeks ago, and I must say I'm amazed. Hats off
> to the developers and the community!
> I'd like to write a custom analyzer that differs from
> org.apache.lucene.analysis.br.BrazilianAnalyzer only in that it discards
> numeric tokens from the input. I've looked at the code and also at the
> discussion in [1], but I'm not sure what the simplest and cleanest approach
> is.
> What do you think?

Hi, in general the supplied analyzers are really just general-purpose
examples.

So I would write your own analyzer, using a tokenizer that discards
numbers (such as LowerCaseTokenizer) instead of StandardTokenizer:
something like LowerCaseTokenizer + BrazilianStemFilter + a StopFilter
with the Brazilian stop words.
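
To illustrate why a letter-based tokenizer discards numbers, here is a rough
plain-Java imitation of the behaviour: split at every non-letter character
and lowercase what remains. It is only a sketch, not Lucene's code;
LetterLowerCaseSketch and tokenize are hypothetical names, and the stemming
and stop-word steps are left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a letter-only, lowercasing tokenizer.
class LetterLowerCaseSketch {

    // Splits text at every non-letter character and lowercases each run
    // of letters; digits and punctuation never become part of a token.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                tokens.add(current.toString()); // a non-letter ends the token
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString()); // flush the last token
        }
        return tokens;
    }
}
```

For example, tokenize("Pedido 42 para joao@example.com") yields pedido, para,
joao, example, com: the number disappears, but the email address is also
split into pieces, which is the trade-off to weigh against StandardTokenizer.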
