Posted to solr-user@lucene.apache.org by "Kraus, Ralf | pixelhouse GmbH" <rk...@pixelhouse.de> on 2009/02/06 11:23:51 UTC
Need help with DictionaryCompoundWordTokenFilterFactory
Hi,
I have run into another problem while using the
solr.DictionaryCompoundWordTokenFilterFactory :-(
If I search for the German word "Spargelcremesuppe", which contains
"Spargel", "Creme" and "Suppe", Solr finds far too many results.
That is because Solr matches EVERY entry containing any one of the three
words :-(
Here is my schema.xml
<fieldType name="text_text" class="solr.TextField"
positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter
class="solr.DictionaryCompoundWordTokenFilterFactory"
dictionary="dictionary.txt"
minWordSize="5"
minSubwordSize="2"
maxSubwordSize="15"
onlyLongestMatch="true" />
<filter class="solr.SynonymFilterFactory"
synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory"
language="German" />
</analyzer>
</fieldType>
Any help ?
Greets,
Ralf Kraus
Re: Need help with DictionaryCompoundWordTokenFilterFactory
Posted by Otis Gospodnetic <ot...@yahoo.com>.
Ralf,
Not sure if you got this working or not, but perhaps a simple solution is changing the default boolean operator from OR to AND.
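For reference, a minimal sketch of what that change might look like, assuming a Solr release from around the time of this thread, where schema.xml accepts a solrQueryParser element (later releases moved away from this element in favor of the q.op request parameter). It is also worth verifying in the Analysis tool that this affects the subword tokens at all, since tokens emitted for a single query term may be combined differently from separate query terms.

```xml
<!-- In schema.xml: make the default boolean operator AND instead of OR. -->
<!-- Assumption: a Solr version that still supports this element. -->
<solrQueryParser defaultOperator="AND"/>
```

The same effect can be requested per query by adding q.op=AND to the request parameters.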
Otis --
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
Re: Need help with DictionaryCompoundWordTokenFilterFactory
Posted by Grant Ingersoll <gs...@apache.org>.
Sounds like you need some work on the analysis part. I would start by
using the Solr Admin Analysis tool and playing around with your settings
for that TokenFilter. Sounds to me like you might want a different
approach to compound words. I'm not a German expert, so I can't offer
too much there, but one thought that comes to mind is using phrases or
ngrams or, if it is just that one word, putting it in a protected words
list.
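One possible "different approach", sketched here as an assumption rather than something proposed in this thread, is to split the field into separate index and query analyzers and apply the compound filter at index time only. The field type name and filter settings are taken from the schema in the original message; the split itself is the hypothetical part.

```xml
<!-- Sketch only: decompound while indexing, leave query terms whole. -->
<fieldType name="text_text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Subwords are added to the index next to the original compound. -->
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
            dictionary="dictionary.txt"
            minWordSize="5"
            minSubwordSize="2"
            maxSubwordSize="15"
            onlyLongestMatch="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- No compound filter here: a query term stays a single token. -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/>
  </analyzer>
</fieldType>
```

With a setup like this, a search for "Suppe" should still find documents containing "Spargelcremesuppe", while a search for "Spargelcremesuppe" would only match documents that contain the full compound. The trade-off is that such a query no longer matches documents where the parts appear as separate words, and the field has to be reindexed after the change.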
-Grant
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika) using Solr/
Lucene:
http://www.lucidimagination.com/search