Posted to solr-user@lucene.apache.org by "Kraus, Ralf | pixelhouse GmbH" <rk...@pixelhouse.de> on 2009/02/06 11:23:51 UTC

Need help with DictionaryCompoundWordTokenFilterFactory

Hi,

I have run into another problem, this time with the 
solr.DictionaryCompoundWordTokenFilterFactory :-(
If I search for the German word "Spargelcremesuppe", which contains 
"Spargel", "Creme" and "Suppe", Solr finds way too many results.
That's because Solr matches EVERY entry that contains any one of the 
three words :-(

Here is my schema.xml

        <fieldType name="text_text" class="solr.TextField" positionIncrementGap="100">
            <analyzer>
                <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
                        dictionary="dictionary.txt"
                        minWordSize="5"
                        minSubwordSize="2"
                        maxSubwordSize="15"
                        onlyLongestMatch="true"/>
                <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
                <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
                <filter class="solr.SnowballPorterFilterFactory" language="German"/>
            </analyzer>
        </fieldType>

Any help?

Greets,

Ralf Kraus

Re: Need help with DictionaryCompoundWordTokenFilterFactory

Posted by Otis Gospodnetic <ot...@yahoo.com>.

Ralf,

Not sure if you got this working or not, but perhaps a simple solution is changing the default boolean operator from OR to AND.

Otis --
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch 
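
For reference, a minimal sketch of what the suggestion above could look like in schema.xml, assuming the Solr 1.x-era solrQueryParser element, which sits at the top level of the schema, outside the fieldType definitions. The same behaviour can also be requested per query with the q.op=AND parameter. Whether this narrows the results in this case needs to be verified: the default operator only applies to separate query clauses, and subword tokens that the compound filter emits at the same position as the original word may still be treated as alternatives.

        <!-- schema.xml: all query clauses must match unless the query says otherwise (the default is OR) -->
        <solrQueryParser defaultOperator="AND"/>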





Re: Need help with DictionaryCompoundWordTokenFilterFactory

Posted by Grant Ingersoll <gs...@apache.org>.

Sounds like you need some work on the analysis part. I would start by 
using the Solr Admin Analysis tool and playing around with your settings 
for that TokenFilter. Sounds to me like you might want a different 
approach to compound words. I'm not a German expert, so I can't offer 
too much there, but one thought that comes to mind is using phrases or 
ngrams or, if it is just that one word, putting it in a protected words 
list.

-Grant
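
One possible "different approach", sketched here as an assumption and not something proposed in the thread: decompound only at index time, so that a query for "Spargelcremesuppe" is not split into its parts. Solr lets a fieldType declare separate index and query analyzers. The filters and file names below are the ones from Ralf's schema; only the split into two analyzers is new.

        <fieldType name="text_text" class="solr.TextField" positionIncrementGap="100">
            <!-- Index side: compounds are split, so a search for "Suppe" still finds "Spargelcremesuppe" -->
            <analyzer type="index">
                <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
                        dictionary="dictionary.txt"
                        minWordSize="5"
                        minSubwordSize="2"
                        maxSubwordSize="15"
                        onlyLongestMatch="true"/>
                <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
                <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
                <filter class="solr.SnowballPorterFilterFactory" language="German"/>
            </analyzer>
            <!-- Query side: same chain without the compound filter, so the query word stays whole -->
            <analyzer type="query">
                <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
                <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
                <filter class="solr.SnowballPorterFilterFactory" language="German"/>
            </analyzer>
        </fieldType>

The trade-off is that a query for the full compound would no longer match documents that only contain the parts as separate words. Existing documents have to be reindexed after the change, and the Admin Analysis tool can show the tokens each chain produces.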


--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
