You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by Da...@sybase.com on 2005/10/30 17:03:30 UTC

Multiple Analyzers

I am implementing a search that allows the user to toggle the Soundex and
Stem functionality on and off.  When the Soundex or Sten functions are not
on, the search will use the StandardAnalyzer as default.  I implemented the
Soundex (SoundexAnalyzer from book Lucene In Action) and Stem
(SnowballAnalyzer) analyzer successfully individually.  They only work if I
use the analyzers during both indexing and searching.  The problem I have
is that I seem to have to use one of the three analyzers or nothing.  I
heven't been able to combine them.  The following may help better to
understand what I am trying to achieve.  Any help would be greatly
appreciated.

Indexing
========
Index using recursive TokenStream object.

public TokenStream tokenStream(String fieldName, Reader reader) {
      TokenStream result = new SoundexFilter (
                        new SnowballFilter (
                           new StopFilter (
                                                                       new
LowerCaseFilter (

new StandardFilter (
new StandardTokenizer(reader))),

StandardAnalyzer.STOP_WORDS)));
      return result;
}

Searching
==========
If user selects Soundex function, use SoundexAnalyzer in query.
Else if user selects Stem option, use SnowballAnalyzer in query.
Else use StandardAnalyzer.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Daniel Clark, Senior Consultant
Sybase Federal Professional Services
6550 Rock Spring Drive, Suite 800
Bethesda, MD  20817
Office - (301) 896-1103
Office Fax - (301) 896-1604
Mobile - (703) 403-0340
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Multiple Analyzers

Posted by Chris Hostetter <ho...@fucit.org>.

: understand your recommendations.  Is there a way that I can just
: incorporate Stem, Soundex, and Standard into one search.  In other words,
: don't toggle anything.  Just index using custom analyzer that contains
: Stem, Soundex, and Standard analyzers at once.  And search using the custom
: analyzer that has Stem, Soundex, and Standard.  I can live with just having
: all on compared to only having one of the three.  Please, advise.  Thanks
: again for your swift responses.

even if you don't need a toggle to decide which approach to take, you're
still probably going to be better off if you index the fields seperately.
assuming you want the conceptional fields "title", "description", and
"body" to all be searchable, use real fields like...

        Stem          Soundex       Standard
        title_stem    title_sound   title
        desc_stem     desc_sound    desc
        body_stem     body_sound    body

...and at query time, build up a single BooleanQuery (or perhaps
MaxDisjunct query) that contains the users search words tested against
each of the various fields.



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Multiple Analyzers

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

Yes, you can certainly do all three in one shot.   Look at the source  
code to StandardAnalyzer, build a custom analyzer that used it's core  
tokenization, then filtered through the SnowballFilter, and then a  
custom soundex filter (or metaphone or similar).  That would make for  
some mighty fuzzy searches, but I guess that is your goal.

A custom analyzer needs to be written, along with the custom soundex  
filter, but should only be a handful of lines of code.  Use the  
source code of Lucene in Action as a basis if you'd like.

     Erik




On 31 Oct 2005, at 09:22, Daniel.Clark@sybase.com wrote:

> Thanks Erik.  So you're saying that my approach won't work, right?  I
> understand your recommendations.  Is there a way that I can just
> incorporate Stem, Soundex, and Standard into one search.  In other  
> words,
> don't toggle anything.  Just index using custom analyzer that contains
> Stem, Soundex, and Standard analyzers at once.  And search using  
> the custom
> analyzer that has Stem, Soundex, and Standard.  I can live with  
> just having
> all on compared to only having one of the three.  Please, advise.   
> Thanks
> again for your swift responses.
>
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> Daniel Clark, Senior Consultant
> Sybase Federal Professional Services
> 6550 Rock Spring Drive, Suite 800
> Bethesda, MD  20817
> Office - (301) 896-1103
> Office Fax - (301) 896-1604
> Mobile - (703) 403-0340
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
>
>
>              Erik Hatcher
>              <erik@ehatchersol
>               
> utions.com>                                                To
>                                        java-user@lucene.apache.org
>              10/30/2005  
> 07:50                                           cc
>              PM
>                                                                     
> Subject
>                                        Re: Multiple Analyzers
>              Please respond to
>              java-user@lucene.
>                 apache.org
>
>
>
>
>
>
>
>
> On 30 Oct 2005, at 11:03, Daniel.Clark@sybase.com wrote:
>
>> I am implementing a search that allows the user to toggle the
>> Soundex and
>> Stem functionality on and off.  When the Soundex or Sten functions
>> are not
>> on, the search will use the StandardAnalyzer as default.  I
>> implemented the
>> Soundex (SoundexAnalyzer from book Lucene In Action) and Stem
>> (SnowballAnalyzer) analyzer successfully individually.  They only
>> work if I
>> use the analyzers during both indexing and searching.  The problem
>> I have
>> is that I seem to have to use one of the three analyzers or
>> nothing.  I
>> heven't been able to combine them.  The following may help better to
>> understand what I am trying to achieve.  Any help would be greatly
>> appreciated.
>>
>> Indexing
>> ========
>> Index using recursive TokenStream object.
>>
>> public TokenStream tokenStream(String fieldName, Reader reader) {
>>       TokenStream result = new SoundexFilter (
>>                         new SnowballFilter (
>>                            new StopFilter (
>>
>>  new
>> LowerCaseFilter (
>>
>> new StandardFilter (
>> new StandardTokenizer(reader))),
>>
>> StandardAnalyzer.STOP_WORDS)));
>>       return result;
>> }
>>
>
> So you're using a single analyzer for all fields during indexing?
>
>
>> Searching
>> ==========
>> If user selects Soundex function, use SoundexAnalyzer in query.
>> Else if user selects Stem option, use SnowballAnalyzer in query.
>> Else use StandardAnalyzer.
>>
>
> But different analyzers for searching.... you have to be quite
> careful about this sort of thing.  And in this particular case its
> tricky.  You can't stem and soundex during indexing and then
> selectively turn those features off without having indexed the
> various combinations you want available during search.
>
> One way to address this is to index the same text into multiple
> fields with each using a different analysis process.  Then for
> search, line up the query type (stemming, soundex, standard) with the
> proper field.  (side note, this gets even more fun if you're using
> QueryParser and the user enters in "field:term"!).
>
> Another option is to index into multiple indexes - one for each type
> of analysis.  I use this myself for case sensitive and insensitive
> searches.
>
>      Erik
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Multiple Analyzers

Posted by Da...@sybase.com.

Thanks Erik.  So you're saying that my approach won't work, right?  I
understand your recommendations.  Is there a way that I can just
incorporate Stem, Soundex, and Standard into one search.  In other words,
don't toggle anything.  Just index using custom analyzer that contains
Stem, Soundex, and Standard analyzers at once.  And search using the custom
analyzer that has Stem, Soundex, and Standard.  I can live with just having
all on compared to only having one of the three.  Please, advise.  Thanks
again for your swift responses.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Daniel Clark, Senior Consultant
Sybase Federal Professional Services
6550 Rock Spring Drive, Suite 800
Bethesda, MD  20817
Office - (301) 896-1103
Office Fax - (301) 896-1604
Mobile - (703) 403-0340
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


                                                                           
             Erik Hatcher                                                  
             <erik@ehatchersol                                             
             utions.com>                                                To 
                                       java-user@lucene.apache.org         
             10/30/2005 07:50                                           cc 
             PM                                                            
                                                                   Subject 
                                       Re: Multiple Analyzers              
             Please respond to                                             
             java-user@lucene.                                             
                apache.org                                                 
                                                                           
                                                                           
                                                                           





On 30 Oct 2005, at 11:03, Daniel.Clark@sybase.com wrote:
> I am implementing a search that allows the user to toggle the
> Soundex and
> Stem functionality on and off.  When the Soundex or Sten functions
> are not
> on, the search will use the StandardAnalyzer as default.  I
> implemented the
> Soundex (SoundexAnalyzer from book Lucene In Action) and Stem
> (SnowballAnalyzer) analyzer successfully individually.  They only
> work if I
> use the analyzers during both indexing and searching.  The problem
> I have
> is that I seem to have to use one of the three analyzers or
> nothing.  I
> heven't been able to combine them.  The following may help better to
> understand what I am trying to achieve.  Any help would be greatly
> appreciated.
>
> Indexing
> ========
> Index using recursive TokenStream object.
>
> public TokenStream tokenStream(String fieldName, Reader reader) {
>       TokenStream result = new SoundexFilter (
>                         new SnowballFilter (
>                            new StopFilter (
>
>  new
> LowerCaseFilter (
>
> new StandardFilter (
> new StandardTokenizer(reader))),
>
> StandardAnalyzer.STOP_WORDS)));
>       return result;
> }

So you're using a single analyzer for all fields during indexing?

> Searching
> ==========
> If user selects Soundex function, use SoundexAnalyzer in query.
> Else if user selects Stem option, use SnowballAnalyzer in query.
> Else use StandardAnalyzer.

But different analyzers for searching.... you have to be quite
careful about this sort of thing.  And in this particular case its
tricky.  You can't stem and soundex during indexing and then
selectively turn those features off without having indexed the
various combinations you want available during search.

One way to address this is to index the same text into multiple
fields with each using a different analysis process.  Then for
search, line up the query type (stemming, soundex, standard) with the
proper field.  (side note, this gets even more fun if you're using
QueryParser and the user enters in "field:term"!).

Another option is to index into multiple indexes - one for each type
of analysis.  I use this myself for case sensitive and insensitive
searches.

     Erik


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org





---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Multiple Analyzers

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On 30 Oct 2005, at 11:03, Daniel.Clark@sybase.com wrote:
> I am implementing a search that allows the user to toggle the  
> Soundex and
> Stem functionality on and off.  When the Soundex or Sten functions  
> are not
> on, the search will use the StandardAnalyzer as default.  I  
> implemented the
> Soundex (SoundexAnalyzer from book Lucene In Action) and Stem
> (SnowballAnalyzer) analyzer successfully individually.  They only  
> work if I
> use the analyzers during both indexing and searching.  The problem  
> I have
> is that I seem to have to use one of the three analyzers or  
> nothing.  I
> heven't been able to combine them.  The following may help better to
> understand what I am trying to achieve.  Any help would be greatly
> appreciated.
>
> Indexing
> ========
> Index using recursive TokenStream object.
>
> public TokenStream tokenStream(String fieldName, Reader reader) {
>       TokenStream result = new SoundexFilter (
>                         new SnowballFilter (
>                            new StopFilter (
>                                                                        
>  new
> LowerCaseFilter (
>
> new StandardFilter (
> new StandardTokenizer(reader))),
>
> StandardAnalyzer.STOP_WORDS)));
>       return result;
> }

So you're using a single analyzer for all fields during indexing?

> Searching
> ==========
> If user selects Soundex function, use SoundexAnalyzer in query.
> Else if user selects Stem option, use SnowballAnalyzer in query.
> Else use StandardAnalyzer.

But different analyzers for searching.... you have to be quite  
careful about this sort of thing.  And in this particular case its  
tricky.  You can't stem and soundex during indexing and then  
selectively turn those features off without having indexed the  
various combinations you want available during search.

One way to address this is to index the same text into multiple  
fields with each using a different analysis process.  Then for  
search, line up the query type (stemming, soundex, standard) with the  
proper field.  (side note, this gets even more fun if you're using  
QueryParser and the user enters in "field:term"!).

Another option is to index into multiple indexes - one for each type  
of analysis.  I use this myself for case sensitive and insensitive  
searches.

     Erik

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org