Posted to solr-user@lucene.apache.org by Christopher Bottaro <cj...@onespot.com> on 2011/03/24 00:14:07 UTC

which German stemmer to use?

The wiki lists five German stemmers, but it doesn't explain how they differ
or recommend one:

GermanStemFilterFactory
SnowballPorterFilterFactory (German)
SnowballPorterFilterFactory (German2)
GermanLightStemFilterFactory
GermanMinimalStemFilterFactory

Which is the best one to use in general?  Which is the best to use when the
content being indexed is German technology articles?

Thanks for the help.

Re: which German stemmer to use?

Posted by Paul Libbrecht <pa...@hoplahup.net>.

In our ActiveMath project, we have had positive feedback in Lucene with
 SnowballAnalyzer(Version.LUCENE_29, "German")
which probably corresponds to one of the two SnowballPorterFilterFactory entries in your list.

Note that you may want to use one field with exact matching (e.g. whitespace analyzer and lowercase filter) and one field with stemmed matches. That means two fields in the index and a query-expansion mechanism such as dismax to search both with different boosts, e.g.:

  text-de^2.0 text-de.stemmed^1.2
(add the phonetic...)
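
As a rough sketch of the two-field setup above, the request below passes both fields to the dismax parser through the standard `qf` parameter. The field names and boosts are the ones from this message; the Solr URL and the query term "Datenbank" are assumptions for illustration only, and the schema is assumed to define both fields already.

```python
from urllib.parse import urlencode, parse_qs

# Assumption: a local Solr instance with the default select handler.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"

def build_dismax_query(user_query):
    """Build a select URL that searches the exact and the stemmed field."""
    params = {
        "q": user_query,
        "defType": "dismax",
        # Exact matches weigh more than stemmed matches.
        "qf": "text-de^2.0 text-de.stemmed^1.2",
    }
    return SOLR_SELECT_URL + "?" + urlencode(params)

url = build_dismax_query("Datenbank")
# Decode the query string again to check what Solr would receive.
print(parse_qs(url.split("?", 1)[1])["qf"][0])
```

With this arrangement a document that matches the exact form scores higher than one that only matches after stemming, while stemmed-only matches are still returned.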

One of the biggest issues our testers raised is that compound words should be split. I believe this issue is also very common in technology texts. So far, only the compound-word analyzer can do such a split, and it requires you to supply the compound words manually. Maybe that's doable?
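
To illustrate what dictionary-based compound splitting involves, here is a minimal sketch. It is a simplified stand-in, not the algorithm Lucene's compound-word filter uses, and the function name, the word list and the minimum part length are all hypothetical choices for this example.

```python
def split_compound(word, dictionary, min_part=3):
    """Split `word` into dictionary words, or return None if impossible.

    Simplified illustration: it needs the whole word to be covered by
    dictionary entries and ignores linking elements such as "s".
    """
    word = word.lower()
    if word in dictionary:
        return [word]
    # Try the longest possible head first, leaving at least min_part for the tail.
    for i in range(len(word) - min_part, min_part - 1, -1):
        head, tail = word[:i], word[i:]
        if head in dictionary:
            rest = split_compound(tail, dictionary, min_part)
            if rest is not None:
                return [head] + rest
    return None

# Hypothetical, manually maintained word list.
words = {"daten", "bank", "server", "fest", "platte"}
print(split_compound("Datenbankserver", words))
print(split_compound("Festplatte", words))
```

The quality of the split depends entirely on the word list, which is why the manual effort of maintaining it is the real cost here.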

paul


On 24 March 2011 at 00:14, Christopher Bottaro wrote:

> The wiki lists 5 available, but doesn't do a good job at explaining or
> recommending one:
> 
> GermanStemFilterFactory
> SnowballPorterFilterFactory (German)
> SnowballPorterFilterFactory (German2)
> GermanLightStemFilterFactory
> GermanMinimalStemFilterFactory
> 
> Which is the best one to use in general?  Which is the best to use when the
> content being indexed is German technology articles?
> 
> Thanks for the help.