Posted to solr-user@lucene.apache.org by Bernhard Schulz <of...@schubec.com> on 2011/06/10 18:22:57 UTC

Problem with DictionaryCompoundWordTokenFilterFactory in Solr 3.2

Hello everybody!


I am facing a problem with Solr's DictionaryCompoundWordTokenFilterFactory and hope you have some advice for me.
I am using the latest version, Solr 3.2. (I had the same problem with Solr 3.1.)

In the schema, I am using the following settings:

<filter
    class="solr.DictionaryCompoundWordTokenFilterFactory"
    dictionary="words.german.txt"
    minWordSize="5"
    minSubwordSize="3"
    maxSubwordSize="15"
    onlyLongestMatch="true"
/>

Now, when I analyze the word "lederschuh" ("leather shoe" in German) using the analyzer interface, I get the following sub-words:
1.) lederschuh
2.) lederschuh
3.) der
4.) er
5.) schuh

Problem 1: I set "minSubwordSize" to 3. Why does entry 4 ("er") appear, even though it is shorter than 3 characters?
Problem 2: I set "onlyLongestMatch" to true, and there is a "lederschuh" entry in my dictionary. So the longest match would be "lederschuh" itself, and I do not expect it to be split up any further. Why is Solr still splitting it up? Is this a bug, or did I misconfigure something?
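
To make my expectation concrete, here is a minimal Python sketch of the behavior I would expect from these settings. It is not Solr's or Lucene's code and may well differ from what the filter really does. The function name expected_subwords and the tiny DICTIONARY set are hypothetical stand-ins (my real dictionary is words.german.txt), and the reading of "onlyLongestMatch" as "drop every match that lies inside a longer match" is only my assumption.

```python
# Hypothetical stand-in for words.german.txt; only "lederschuh" is known
# to be in the real dictionary, the other entries are assumed.
DICTIONARY = {"lederschuh", "leder", "der", "er", "schuh"}


def expected_subwords(word, dictionary, min_word_size=5, min_subword_size=3,
                      max_subword_size=15, only_longest_match=True):
    """Return the sub-words I would expect (not Solr's actual algorithm)."""
    word = word.lower()
    if len(word) < min_word_size:
        return []  # word too short to be decomposed at all

    # Collect every dictionary hit whose length is within the configured bounds.
    matches = []
    for start in range(len(word)):
        longest = min(max_subword_size, len(word) - start)
        for length in range(min_subword_size, longest + 1):
            candidate = word[start:start + length]
            if candidate in dictionary:
                matches.append((start, candidate))

    if not only_longest_match:
        return [candidate for _, candidate in matches]

    # Assumed meaning of onlyLongestMatch: drop any match that is fully
    # covered by a strictly longer match.
    def covered(start, candidate):
        end = start + len(candidate)
        return any(
            other_start <= start
            and other_start + len(other) >= end
            and len(other) > len(candidate)
            for other_start, other in matches
        )

    return [c for s, c in matches if not covered(s, c)]


if __name__ == "__main__":
    print(expected_subwords("lederschuh", DICTIONARY))
    print(expected_subwords("lederschuh", DICTIONARY, only_longest_match=False))
```

Under these assumptions, "er" is never even tried as a candidate because it is shorter than minSubwordSize, and with onlyLongestMatch enabled only "lederschuh" remains. That is the result I was hoping to see in the analyzer interface.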

Any advice is very welcome!

Thank you,
Bernhard