Posted to solr-user@lucene.apache.org by Alexander Rosemann <al...@gmail.com> on 2014/03/04 21:48:37 UTC

Stemming Croatian, Macedonian, Serbian and Slovenian content

Hi,

I need to index and stem Croatian, Macedonian, Serbian and Slovenian 
content. I started by creating a collection _hr_ for the Croatian 
content and configured the HunspellStemFilterFactory with the .dic and 
.aff files provided by OpenOffice. While testing my configuration I 
noticed that only very simple forms, such as

hrvatski -> hrvatska,
algoritamskom -> algoritamska

get "stemmed". I was wondering whether there are better approaches for 
Croatian content. I haven't tested the .dic and .aff files for the other 
languages yet, but I would expect similar results.
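
For context, a Hunspell-based field type in schema.xml usually looks 
something like the sketch below. The field type name (text_hr) and the 
file names (hr_HR.dic, hr_HR.aff) are placeholders for illustration, not 
necessarily the exact names in the OpenOffice package or in this setup, 
and the tokenizer and lowercase filter are an assumed, typical chain.

```xml
<!-- Sketch only: names and file paths are assumed placeholders -->
<fieldType name="text_hr" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Split text into word tokens -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Lowercase before the dictionary lookup -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Dictionary-based stemming; files live in the collection's conf directory -->
    <filter class="solr.HunspellStemFilterFactory"
            dictionary="hr_HR.dic"
            affix="hr_HR.aff"
            ignoreCase="true"/>
  </analyzer>
</fieldType>
```

With a chain like this, a token is reduced only when the .dic/.aff pair 
contains a matching entry and affix rule, so stemming coverage depends 
entirely on how complete the dictionary is.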

I am using Solr 4.1.

Any pointers to better stemmers, open source or commercial, would be 
much appreciated.

Many thanks,
Alex