Posted to java-user@lucene.apache.org by Max Lynch <ih...@gmail.com> on 2009/12/30 18:55:01 UTC

Re: Different Analyzers

> I just want to see if it's safe to use two different analyzers for the
> following situation:
>
> I have an index that I want to preserve case with so I can do
> case-sensitive
> searches with my WhitespaceAnalyzer.  However, I also want to do case
> insensitive searches.

> you should also make sure the data is indexed twice, once w/ the original
> case and once w/o. It's like putting a TokenFilter after
> WhitespaceTokenizer
> which returns two tokens - lowercased and the original, both in the same
> position (set posIncr to 0).
>

I finally got around to really needing this, and I'm just a little confused
by the implementation.  Should I physically use two different indexes (one
with StandardAnalyzer, one with WhitespaceAnalyzer?), two separate fields (I
don't think that's possible?), or could you explain your idea a little
more?  Should I implement my own WhitespaceTokenizer with the TokenFilter?

Thanks.
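One way to read the quoted suggestion: a TokenFilter that, for each mixed-case
token, also emits a lowercased twin at the same position. A rough sketch against
the Lucene 2.9-era attribute API (the class name DualCaseFilter is invented, not
from the thread):

```java
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public final class DualCaseFilter extends TokenFilter {
    private final TermAttribute termAtt = addAttribute(TermAttribute.class);
    private final PositionIncrementAttribute posAtt =
        addAttribute(PositionIncrementAttribute.class);
    private State pending; // lowercased twin waiting to be emitted

    public DualCaseFilter(TokenStream in) {
        super(in);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (pending != null) {
            // Emit the lowercased copy at the same position (posIncr = 0).
            restoreState(pending);
            pending = null;
            termAtt.setTermBuffer(termAtt.term().toLowerCase());
            posAtt.setPositionIncrement(0);
            return true;
        }
        if (!input.incrementToken()) {
            return false;
        }
        String term = termAtt.term();
        if (!term.equals(term.toLowerCase())) {
            pending = captureState(); // queue a lowercased duplicate
        }
        return true;
    }
}
```

With this filter wrapped around a WhitespaceTokenizer, a single field holds both
case-sensitive and case-insensitive terms, at the cost of a larger index.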

Re: Different Analyzers

Posted by Max Lynch <ih...@gmail.com>.
> Alternatively, if one of the "regular" analyzers works for you *except*
> for lower-casing, just use that one for your mixed-case field and
> lower-case your input and send it to your lower-case field.
>
> Be careful to do the same steps when querying <G>.
>

Thanks Erick, I didn't think about this.  It seems like the simplest solution
for now.

-max

Re: Different Analyzers

Posted by Erick Erickson <er...@gmail.com>.
See PerFieldAnalyzerWrapper for an easy way to implement two fields
in the same document processed with different analyzers. So basically
you're copying the input to two fields that handle things slightly
differently.
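Concretely, that might look like the following (a sketch against the Lucene
2.9/3.0 API; the field names "exact" and "lower" and the `writer` variable are
illustrative, not from the thread):

```java
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.util.Version;

// Default analyzer lowercases; the "exact" field keeps the original case.
PerFieldAnalyzerWrapper wrapper =
    new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_30));
wrapper.addAnalyzer("exact", new WhitespaceAnalyzer());

// Copy the same input into both fields of one document.
Document doc = new Document();
doc.add(new Field("exact", text, Field.Store.NO, Field.Index.ANALYZED));
doc.add(new Field("lower", text, Field.Store.NO, Field.Index.ANALYZED));
writer.addDocument(doc); // writer constructed with `wrapper` as its analyzer
```

Queries then go against "exact" for case-sensitive matching and "lower" for
case-insensitive matching.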

As far as re-implementing stuff goes, no real re-implementation is necessary;
just create your Analyzers from pre-existing parts. It's much simpler than it
sounds: derive a class from Analyzer and override tokenStream (or, possibly,
reusableTokenStream). Then you have to send your input to both fields (see
above).
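For instance, a case-folding analyzer assembled entirely from existing parts
could be as small as this (a sketch; the class name is made up):

```java
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

// Whitespace tokenization plus lowercasing -- no new tokenizer code needed.
public final class LowercasingWhitespaceAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new LowerCaseFilter(new WhitespaceTokenizer(reader));
    }
}
```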

SynonymAnalyzer in Lucene In Action has an example, and I'm sure
if you look in the mail archives you'll find other examples.

Alternatively, if one of the "regular" analyzers works for you *except*
for lower-casing, just use that one for your mixed-case field and
lower-case your input and send it to your lower-case field.

Be careful to do the same steps when querying <G>.
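That alternative needs no custom analyzer at all; just feed the same text into
two fields, pre-lowercased for one of them (a sketch; field names are
illustrative):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

Document doc = new Document();
doc.add(new Field("mixed", text, Field.Store.NO, Field.Index.ANALYZED));
doc.add(new Field("lower", text.toLowerCase(),
    Field.Store.NO, Field.Index.ANALYZED));
// At query time, lowercase the user's input the same way before
// searching the "lower" field, or the terms will never match.
```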

Also, TeeSinkTokenFilter might give you some joy, but I confess I haven't
looked at it very thoroughly.

HTH
Erick
