Posted to solr-user@lucene.apache.org by Thomas Michael Engelke <th...@posteo.de> on 2014/11/07 16:23:12 UTC
Autosuggest using EdgeNGrams with strange highlighting

We've moved from an asterisk-based autosuggest ("searchterm*") to a 
version that uses a dedicated field called autosuggest, filled via 
copyField directives. The field type definition:

<fieldType name="autosuggest" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt"
            ignoreCase="true" enablePositionIncrements="true"
            format="snowball"/>
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
            dictionary="dictionary.txt" minWordSize="5" minSubwordSize="3"
            maxSubwordSize="30" onlyLongestMatch="false"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2"
            protected="protwords.txt"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2"
            maxGramSize="15" side="front"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt"
            ignoreCase="true" enablePositionIncrements="true"
            format="snowball"/>
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
            dictionary="dictionary.txt" minWordSize="5" minSubwordSize="3"
            maxSubwordSize="30" onlyLongestMatch="false"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

It works like a charm. Now, we've had highlighting from Solr before, 
using these parameters:

hl=true&hl.simple.pre=<span+class%3D"highlight">&hl.snippets=1&hl.simple.post=</span>&spellcheck=true&hl.fl=description
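
For reference, here is a minimal sketch of how that parameter string could be assembled with proper URL encoding. It uses only Python's standard library; the variable names are just for illustration, and the parameter values are the ones quoted above.

```python
from urllib.parse import urlencode

# Highlighting parameters as quoted above; values are given unencoded
# and urlencode() takes care of escaping the markup.
params = {
    "hl": "true",
    "hl.fl": "description",
    "hl.snippets": "1",
    "hl.simple.pre": '<span class="highlight">',
    "hl.simple.post": "</span>",
    "spellcheck": "true",
}

query_string = urlencode(params)
print(query_string)
```

The resulting string can be appended to the select URL after the q parameter.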

Now we've seen something strange. This is just one example; the problem 
affects more than this record. In this example, the autosuggest field 
contains:

2CV4 Spot, Dekorsatz, für 2CV.

However, the highlighting branch for this autosuggest field in the 
record looks like this:

<lst name="highlighting">
   <lst name="34725">
     <arr name="short_description">
       <str>2CV4 Spot, Dekorsatz, für <em>2CV</em>.</str>
     </arr>
   </lst>
   ...

Although the EdgeNGramFilterFactory generates n-grams for "2CV4" 
("2c", "2cv", "2cv4" after lowercasing, given minGramSize="2"), the term 
"2CV4" is not highlighted. Shouldn't it be? It's not a question of the 
number of highlights: records containing multiple occurrences of "2CV" 
get highlighted multiple times with no problems.
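
To make the expected index-side grams concrete, here is a rough sketch of front-edge n-gram generation with the configured minGramSize="2" and maxGramSize="15". It is not Solr's implementation: the edge_ngrams helper is hypothetical, lowercasing stands in for LowerCaseFilterFactory, and the stop word, compound word, normalization and stemming filters are ignored on the assumption that they leave "2cv4" unchanged.

```python
def edge_ngrams(token, min_gram=2, max_gram=15):
    """Return the front-edge n-grams (prefixes) of a token, shortest first."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


# Lowercasing approximates LowerCaseFilterFactory; other filters are skipped.
indexed = edge_ngrams("2CV4".lower())
print(indexed)           # ['2c', '2cv', '2cv4']
print("2cv" in indexed)  # True
```

With a minimum gram size of 2, the single-character prefix "2" is not emitted, but the query term "2cv" is among the indexed grams, which is why the record is found.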

It seems that words which match the search term only through their 
EdgeNGrams (that is, words that merely start with the search term) are 
not highlighted. As we rely exclusively on Solr's highlighting, this 
leads to records being found but having no highlight at all.