Posted to solr-user@lucene.apache.org by Thomas Michael Engelke <th...@posteo.de> on 2014/11/11 08:52:51 UTC

Suggester not suggesting anything using DictionaryCompoundWordTokenFilterFactory

I'm toying around with the suggester component, as described here: 
http://www.andornot.com/blog/post/Advanced-autocomplete-with-Solr-Ngrams-and-Twitters-typeaheadjs.aspx

So I made 4 fields:

  <field name="text_suggest" type="text_suggest" indexed="true" stored="true" multiValued="true" />
  <copyField source="name" dest="text_suggest" />
  <field name="text_suggest_edge" type="text_suggest_edge" indexed="true" stored="true" multiValued="true" />
  <copyField source="name" dest="text_suggest_edge" />
  <field name="text_suggest_ngram" type="text_suggest_ngram" indexed="true" stored="true" multiValued="true" />
  <copyField source="name" dest="text_suggest_ngram" />
  <field name="text_suggest_dictionary_ngram" type="text_suggest_dictionary_ngram" indexed="true" stored="true" multiValued="true" />
  <copyField source="name" dest="text_suggest_dictionary_ngram" />

with the corresponding definitions:

  <fieldType name="text_suggest" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory" />
      <filter class="solr.LowerCaseFilterFactory" />
    </analyzer>
  </fieldType>
  <fieldType name="text_suggest_edge" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory" />
      <filter class="solr.LowerCaseFilterFactory" />
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="50" side="front" />
    </analyzer>
  </fieldType>
  <fieldType name="text_suggest_ngram" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory" />
      <filter class="solr.LowerCaseFilterFactory" />
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="50" side="front" />
      <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
    </analyzer>
  </fieldType>
  <fieldType name="text_suggest_dictionary_ngram" class="solr.TextField">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory" />
      <filter class="solr.LowerCaseFilterFactory" />
      <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
              dictionary="dictionary.txt" minWordSize="5" minSubwordSize="3"
              maxSubwordSize="30" onlyLongestMatch="false" />
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="50" side="front" />
      <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory" />
      <filter class="solr.LowerCaseFilterFactory" />
    </analyzer>
  </fieldType>

I'm calling the suggester component this way:

http://<address>:8983/solr/<core>/suggest?qf="text_suggest^6.0%20test_suggest_edge^3.0%20text_suggest_ngram^1.0%20text_suggest_dictionary_ngram^0.2"&q=wa

This seems to work fine:

<response>
   <lst name="responseHeader">
     <int name="status">0</int>
     <int name="QTime">0</int>
   </lst>
   <lst name="spellcheck">
     <lst name="suggestions">
       <lst name="wa">
         <int name="numFound">5</int>
         <int name="startOffset">0</int>
         <int name="endOffset">2</int>
         <arr name="suggestion">
           <str>wandelement aus gitter</str>
           <str>wandelement aus stahlblech</str>
           <str>wandelement</str>
           <str>wandhalter für prospekte</str>
           <str>wandascher, h 300 × b 230 × t 60 mm</str>
         </arr>
       </lst>
       <str name="collation">(wandelement aus gitter)</str>
     </lst>
   </lst>
</response>

However, I added the fourth field so I could get low-boosted suggestions 
using the aforementioned DictionaryCompoundWordTokenFilterFactory. A 
sample analysis of the field type text_suggest_dictionary_ngram for 
the word "Geländewagen":

g
ge
gel
gelä
gelän
geländ
gelände
geländew
geländewa
geländewag
geländewage
geländewagen
g
ge
gel
gelä
gelän
geländ
gelände
w
wa
wag
wage
wagen

As we can see, the DictionaryCompoundWordTokenFilterFactory extracts the 
word "wagen" and EdgeNGrams it. However, I cannot get results from these 
NGrams. Trying "wag" as the search term for the suggester, there are no 
results.

However, running the analysis with "Geländewagen" as the index-time 
field value and "wag" as the query-time field value shows a match.

I had the thought that it might be because the underlying component of 
the suggester is a spellchecker, and a spellchecker wouldn't "correct" 
"wag" to "wagen" because there was an NGram that spelled "wag", and so 
the word was spelled correctly already. So I tried without the 
EdgeNGrams, but the result stays the same.

Re: Suggester not suggesting anything using DictionaryCompoundWordTokenFilterFactory

Posted by Thomas Michael Engelke <th...@posteo.de>.
I think I found the problem. The definition of the suggester component
has a "field" option that references the field the suggester uses to
generate suggestions. After changing this to the field that uses the
DictionaryCompoundWordTokenFilterFactory, the suggester also suggests
word parts.
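
For readers following along, here is a minimal sketch of what such a
definition might look like in solrconfig.xml. It assumes the Solr 4.x
spellcheck-based Suggester, which the "spellcheck" section in the response
above points to. The component and handler name "suggest", the buildOnCommit
setting and the handler defaults are assumptions for illustration and are
not taken from the actual configuration. Only the "field" value reflects
the change described above.

```xml
<!-- Sketch only: names and defaults are assumed, not the poster's real config. -->
<searchComponent name="suggest" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <!-- Assumed dictionary name, referenced by the handler below. -->
    <str name="name">suggest</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <!-- The option in question: the field whose indexed terms feed the
         suggester. Pointing it at the compound-word field is what makes
         word parts such as "wagen" available as suggestions. -->
    <str name="field">text_suggest_dictionary_ngram</str>
    <!-- Assumed: rebuild the suggester dictionary on every commit. -->
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>

<requestHandler name="/suggest" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="spellcheck">true</str>
    <str name="spellcheck.dictionary">suggest</str>
    <str name="spellcheck.count">5</str>
    <str name="spellcheck.collate">true</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>
```

With a setup of this kind the suggester reads from exactly one field, so
the qf boosts passed in the request URL would not decide which field the
suggestions come from. After changing "field", the suggester dictionary has
to be rebuilt (for example by a commit when buildOnCommit is enabled, or
with spellcheck.build=true) before the new terms show up.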
