Posted to solr-user@lucene.apache.org by Thomas Michael Engelke <th...@posteo.de> on 2014/11/07 16:23:12 UTC
Autosuggest using EdgeNGrams with strange highlighting
We've moved from an asterisk-based autosuggest ("searchterm*") to a
version that uses a dedicated field called autosuggest, filled via
copyField directives. The field type definition:
<fieldType name="autosuggest" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"
            enablePositionIncrements="true" format="snowball"/>
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
            dictionary="dictionary.txt" minWordSize="5" minSubwordSize="3"
            maxSubwordSize="30" onlyLongestMatch="false"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2"
            protected="protwords.txt"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15"
            side="front"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"
            enablePositionIncrements="true" format="snowball"/>
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
            dictionary="dictionary.txt" minWordSize="5" minSubwordSize="3"
            maxSubwordSize="30" onlyLongestMatch="false"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
It works like a charm. Now, we've had highlighting from Solr before,
using these parameters:
hl=true&hl.simple.pre=<span+class%3D"highlight">&hl.snippets=1&hl.simple.post=</span>&spellcheck=true&hl.fl=description
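For reference, the same parameter set can be built with consistent URL encoding. This is only a minimal Python sketch using the standard library; the name hl_params is illustrative, and the values are simply the ones from the query string above.

```python
from urllib.parse import urlencode, parse_qs

# Highlighting parameters from the request above, in unencoded form.
hl_params = {
    "hl": "true",
    "hl.simple.pre": '<span class="highlight">',
    "hl.snippets": "1",
    "hl.simple.post": "</span>",
    "spellcheck": "true",
    "hl.fl": "description",
}

# urlencode percent-encodes reserved characters and turns spaces into "+".
query_string = urlencode(hl_params)

# Round trip: decoding yields the original, unencoded values again.
decoded = parse_qs(query_string)

print(query_string)
print(decoded["hl.simple.pre"][0])
```

The printed string can be appended to the select URL as-is.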
Now we've seen something strange. The following is just one example;
the problem affects more than this record. Here, the autosuggest field
contains:
2CV4 Spot, Dekorsatz, für 2CV.
However, the highlighting branch for this autosuggest field in the
record looks like this:
<lst name="highlighting">
  <lst name="34725">
    <arr name="short_description">
      <str>2CV4 Spot, Dekorsatz, für <em>2CV</em>.</str>
    </arr>
  </lst>
  ...
Although the EdgeNGramFilterFactory generates the n-grams for "2CV4"
(lowercased, with minGramSize="2": "2c", "2cv", "2cv4"), the term is not
highlighted. Shouldn't it be?
It's not a matter of the number of highlights: records containing
multiple occurrences of "2CV" get highlighted multiple times without
problems.
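To make the n-gram expansion concrete, here is a minimal Python sketch of what a front edge n-gram filter does to a single, already lowercased token. It is not Solr's implementation: edge_ngrams is a hypothetical helper, and the sketch ignores the other filters in the chain (stop words, compound splitting, normalization, stemming). The defaults mirror minGramSize="2" and maxGramSize="15" from the field type above.

```python
def edge_ngrams(token, min_gram=2, max_gram=15):
    """Return the front edge n-grams of a token, shortest first."""
    # Grams cannot be longer than the token itself.
    upper = min(max_gram, len(token))
    # A token shorter than min_gram yields no grams at all.
    return [token[:n] for n in range(min_gram, upper + 1)]


# Index side: the token "2CV4" after lowercasing.
indexed_grams = edge_ngrams("2cv4")

# Query side: no EdgeNGram filter, so "2CV" is only lowercased.
query_term = "2cv"

print(indexed_grams)                # ['2c', '2cv', '2cv4']
print(query_term in indexed_grams)  # True
```

Under these assumptions the query term equals one of the indexed grams of "2CV4", which is consistent with the record being found.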
It seems that words which match the search term only through their
EdgeNGrams (that is, words of which the search term is merely a prefix)
are not highlighted. As we rely exclusively on Solr's highlighting, this
leads to records being found but showing no highlight at all.