Posted to solr-user@lucene.apache.org by "Croci Francesco Luigi (ID SWS)" <fc...@id.ethz.ch> on 2014/03/14 13:17:33 UTC

analyzer with multiple stem-filters for more languages

Is it possible to define an analyzer with more than one stem filter, so that it covers several languages?

Something like this:

<analyzer type="index">
  ...
  <filter class="solr.PorterStemFilterFactory"/>  <!-- default for English -->
  <filter class="solr.SnowballPorterFilterFactory" language="German2"/>
</analyzer>

Greetings
Francesco

Re: analyzer with multiple stem-filters for more languages

Posted by Trey Grainger <so...@gmail.com>.

I wouldn't recommend putting multiple stemmers in the same Analyzer. Like
Jack said, the second stemmer could take the results of the first stemmer
and "stem the stem", wreaking all kinds of havoc on the resulting terms.

Since the Stemmers replace the original word, running two of them in
sequence will mean the second stemmer never sees the original input in
cases where the first stemmer modified it. Also, many languages require
multiple different CharFilters and TokenFilters (some for accent
normalization, some for stopwords and/or synonyms, some for stemming,
etc.), so it will get VERY complicated trying to safely coordinate when
each token filter runs... probably impossible for many language
combinations.

What you CAN do, however, is define multiple language-specific Analyzers
and then invoke both Analyzers separately within your field, stacking the
resulting tokens from each Analyzer's outputted token stream according to
their position increments. Think of it as having sub-fields within a field,
where each sub-field has its own dedicated Analyzer.

Shameless plug: We cover how to do this (and provide the sample code) in
the "Multilingual Search" chapter of *Solr in Action
<http://solrinaction.com>*, the new book from Manning Publications that is
due to be released within the next few days. The source code is all
publicly available, though, if you want to get an idea of how this works:
https://github.com/treygrainger/solr-in-action/tree/master/src/main/java/sia/ch14

Of course, if you want to take a simpler route, you can always just copy
your text to two separate fields (one per language) and then search across
them at query time using the eDisMax query parser. There are pros and cons
to both approaches.
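
As a rough sketch of that simpler route, the configuration might look something like the following. The field names (content, content_en, content_de) are made up for illustration, and text_en and text_de are assumed to be language-specific field types already defined in your schema; adjust both to your own setup.

```xml
<!-- schema.xml: hypothetical field names; text_en and text_de are assumed
     to be language-specific field types defined elsewhere in the schema -->
<field name="content"    type="string"  indexed="false" stored="true"/>
<field name="content_en" type="text_en" indexed="true"  stored="false"/>
<field name="content_de" type="text_de" indexed="true"  stored="false"/>

<!-- Copy the raw text into one field per language -->
<copyField source="content" dest="content_en"/>
<copyField source="content" dest="content_de"/>

<!-- solrconfig.xml: search across both language fields with eDisMax -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">content_en content_de</str>
  </lst>
</requestHandler>
```

Each field runs only its own language's analysis chain, so the stemmers never see each other's output; the cost is a larger index and one extra field per language in qf.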

All the best,

-Trey Grainger

Re: analyzer with multiple stem-filters for more languages

Posted by Jack Krupansky <ja...@basetechnology.com>.

You would have to carefully analyze the source code and tables of these two 
stemmers to determine if one might incorrectly stem words in the other 
language. Technically, that could be fine for indexing, but it might give 
users some unexpected results for queries. There might also be cases where 
the second stemmer would stem a term that was already stemmed by the first 
stemmer.

You could avoid the latter issue by using the duplicate token technique. For 
a single stemmer this is generally:

<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordRepeatFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>

For two (or more) languages:

<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordRepeatFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="German2" />
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>

This would produce the stemmed term for both languages, or either language, 
or neither, as the case may be.
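
For context, the filter list above would sit inside a field type definition roughly like this one. It is only a sketch: the name text_en_de is hypothetical, and the comments describe how the keyword flag travels through the chain. One thing worth checking in the Analysis screen before relying on it: the copy that is not flagged as a keyword passes through both stemmers in turn, so the German stemmer receives the Porter stemmer's output for that copy rather than the original word.

```xml
<!-- "text_en_de" is a hypothetical name used only for this sketch -->
<fieldType name="text_en_de" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Emits every token twice at the same position; the first copy is flagged as a keyword -->
    <filter class="solr.KeywordRepeatFilterFactory"/>
    <!-- Stemmers leave keyword-flagged tokens untouched, so only the second copy is stemmed -->
    <filter class="solr.PorterStemFilterFactory"/>
    <!-- For the unflagged copy this filter receives the Porter output, not the original word -->
    <filter class="solr.SnowballPorterFilterFactory" language="German2"/>
    <!-- Drops the second copy when it is identical to the first, i.e. nothing was stemmed -->
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
```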

-- Jack Krupansky
