Posted to solr-user@lucene.apache.org by Jan Murre <ja...@pareto.nl> on 2009/07/12 17:11:41 UTC

DutchStemFilterFactory reducing double vowels bug ?

Hi,

Some time ago I configured my Solr instance to use the
DutchStemFilterFactory.

When used during indexing or querying, the filter reduces double vowels
to single vowels, which is not always what we want.

Words like 'baas', 'paas', 'maan', 'boom' etc. are indexed as 'bas',
'pas', 'man' and 'bom'. Those words have a meaning of their own. Am I
missing something, or should this be considered a bug?

Regards, Jan

Re: DutchStemFilterFactory reducing double vowels bug ?

Posted by Chris Hostetter <ho...@fucit.org>.

: Some time ago I configured my Solr instance to use the
: DutchStemFilterFactory.
	...
: Words like 'baas', 'paas', 'maan', 'boom' etc. are indexed as 'bas',
: 'pas', 'man' and 'bom'. Those words have a meaning of their own. Am I
: missing something, or has this to be considered as a bug?

I know nothing about Dutch, but the DutchStemFilterFactory is just a
factory for the DutchStemFilter, which is just a Lucene TokenFilter
around the DutchStemmer, which is a Java implementation of this algorithm...

http://snowball.tartarus.org/algorithms/dutch/stemmer.html

...according to that page, Step 4 explicitly includes a
reduction of doubled vowels (maan -> man is an explicit example).

So the code seems to be working as specified. Whether it's what you
*want* is a different question.
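
To make the behaviour concrete, here is a minimal Python sketch of only the
vowel-undoubling rule as I read it from that page: if a word ends in
non-vowel + doubled a/e/o/u + non-vowel, one of the doubled vowels is dropped.
This is not the Lucene DutchStemmer code. The function name and the vowel set
are assumptions made for illustration, and the rest of the algorithm (suffix
removal, the special "I" marker, the regions R1/R2) is left out.

```python
# Illustrative sketch only: the vowel-undoubling rule of the Snowball
# Dutch stemmer, isolated from the rest of the algorithm.
# VOWELS and undouble_vowel are hypothetical names, not Lucene/Solr API.

VOWELS = set("aeiouyè")                 # assumed vowel set for this sketch
DOUBLED = ("aa", "ee", "oo", "uu")      # doubled vowels that get reduced


def undouble_vowel(word: str) -> str:
    """Drop one vowel if the word ends non-vowel + doubled vowel + non-vowel."""
    if len(word) < 4:
        return word
    before, pair, last = word[-4], word[-3:-1], word[-1]
    if before not in VOWELS and pair in DOUBLED and last not in VOWELS:
        # Keep everything up to the first vowel of the pair, then the final letter.
        return word[:-2] + last
    return word


for w in ("baas", "paas", "maan", "boom"):
    print(w, "->", undouble_vowel(w))
```

Run on the words from the original question, this reproduces the reductions
reported there, which is consistent with the stemmer following its
specification rather than misbehaving.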


-Hoss