Posted to solr-user@lucene.apache.org by Becker Moritz <m....@curecomp.com> on 2015/06/15 20:47:23 UTC
Inconsistent Solr highlighting
Hi,
I have the requirement to index internationalized fields ('name') with Solr.
For this purpose, I want to use dynamic fields and have e.g. 'name_en', 'name_de', 'name_fr' in my Solr documents.
When querying the index, I need to know which language a match was found in. For this, I want to use Solr highlighting.
My problem is that highlighting behaves inconsistently, which breaks this use case.
The configuration of my dynamic field '*_en', for example, is as follows:
<dynamicField name="*_en" type="text_en" indexed="true" stored="true" multiValued="false"/>
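To illustrate the naming scheme, here is a minimal Python sketch of how a document with one 'name_<lang>' field per translation might be assembled before it is sent to Solr. The helper name localized_doc and its input dict are hypothetical, and the sketch assumes that a matching dynamic field (such as '*_it') is defined for every language used.

```python
def localized_doc(doc_id, names):
    """Build a Solr document dict with one name_<lang> field per translation.

    `names` maps a language code (e.g. "en") to the translated name.
    Hypothetical helper, not part of Solr or any client library.
    """
    doc = {"id": doc_id}
    for lang, text in names.items():
        # Field names follow the dynamic field pattern, e.g. name_en
        doc[f"name_{lang}"] = text
    return doc


doc = localized_doc(25, {"it": "Note Test", "en": "Note Test Translation"})
```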
The field type 'text_en' is configured as follows:
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <!-- Case insensitive stop word removal. -->
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="lang/stopwords_en.txt"
    />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <!-- Optionally you may want to use this less aggressive stemmer instead of PorterStemFilterFactory:
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
    -->
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="lang/stopwords_en.txt"
    />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <!-- Optionally you may want to use this less aggressive stemmer instead of PorterStemFilterFactory:
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
    -->
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
My index contains the following document:
<doc>
  <int name="id">25</int>
  <str name="name_it">Note Test</str>
  <str name="description_it"/>
  <str name="name_en">Note Test Translation</str>
  <str name="description_en"/>
  <long name="_version_">1504065955969368064</long>
</doc>
The query defType=edismax&q=Translation&hl=on&hl.fl=name_* returns the above document but does not highlight anything.
The query defType=edismax&q=name_en:Translation&hl=on&hl.fl=name_* returns the above document AND highlights 'Translation' as expected.
Since 'Translation' does not occur in any other field, I do not understand how the match could have occurred on a field other than 'name_en' (which would explain why 'name_en' is not highlighted).
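For anyone who wants to reproduce the two requests, the following Python sketch builds the corresponding URLs with the standard library. The base URL (host, port, and the core name 'mycore') and the function name build_highlight_url are assumptions for illustration; only the query parameters are taken from this post.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; adjust host, port, and core name to your setup.
SOLR_SELECT_URL = "http://localhost:8983/solr/mycore/select"


def build_highlight_url(q, hl_fields="name_*"):
    """Return a select URL that uses edismax and requests highlighting."""
    params = {
        "defType": "edismax",
        "q": q,
        "hl": "on",
        "hl.fl": hl_fields,  # glob over all localized name fields
    }
    # urlencode percent-encodes reserved characters in the values
    return SOLR_SELECT_URL + "?" + urlencode(params)


unqualified_url = build_highlight_url("Translation")        # no highlighting returned
qualified_url = build_highlight_url("name_en:Translation")  # highlights as expected
```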
I already tried the suggestions from:
http://stackoverflow.com/questions/23755097/solr-highlighting-hl-simple-pre-post-doesnt-appear-sometime
http://lucene.472066.n3.nabble.com/Urgent-Highlighting-not-working-as-expected-td3983755.html
http://stackoverflow.com/questions/9842886/why-is-this-simple-solr-highlighting-attempt-failing
None of them worked.
Moreover, when I run defType=edismax&q=Note&hl=on&hl.fl=name_*, the result is:
<doc>
  <int name="id">25</int>
  <str name="name_it">Note Test</str>
  <str name="description_it"/>
  <str name="name_en">Note Test Translation</str>
  <str name="description_en"/>
  <long name="_version_">1504067222466723840</long>
</doc>
<doc>
  <int name="id">27</int>
  <str name="name_de">Note Test child</str>
  <str name="description_de"/>
  <long name="_version_">1504067222528589824</long>
</doc>
However, the highlighting section only contains fields for document 25, not for document 27:
<lst name="highlighting">
  <lst name="25">
    <arr name="name_it">
      <str><em>Note</em> Test</str>
    </arr>
    <arr name="name_en">
      <str><em>Note</em> Test Translation</str>
    </arr>
  </lst>
  <lst name="27"/>
</lst>
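To make the intended use concrete, here is a rough Python sketch of how the matched languages could be read from such a highlighting section. It assumes the fragment is available as a string exactly as shown above; the function name matched_languages and the 'name_' prefix argument are hypothetical. It only inspects field names, not the snippets.

```python
import xml.etree.ElementTree as ET

# Highlighting fragment as shown in this post.
HIGHLIGHTING_XML = """
<lst name="highlighting">
  <lst name="25">
    <arr name="name_it"><str><em>Note</em> Test</str></arr>
    <arr name="name_en"><str><em>Note</em> Test Translation</str></arr>
  </lst>
  <lst name="27"/>
</lst>
"""


def matched_languages(highlighting_xml, prefix="name_"):
    """Map each document id to the language suffixes of its highlighted fields."""
    root = ET.fromstring(highlighting_xml)
    result = {}
    for doc in root.findall("lst"):  # one <lst> per document id
        langs = []
        for arr in doc.findall("arr"):  # one <arr> per highlighted field
            field = arr.get("name", "")
            if field.startswith(prefix):
                langs.append(field[len(prefix):])
        result[doc.get("name")] = langs
    return result


languages_by_doc = matched_languages(HIGHLIGHTING_XML)
```

With the response above, document 27 ends up with an empty list, which is exactly the case where the language of the match cannot be determined.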
I really do not understand what is happening here and what I can do to make the highlighting consistent.
Also, is my approach of using 'name_en', 'name_de', ... fields for indexing localized content reasonable, or is there a better way?
Thank you for your help and best regards
Moritz Becker
Software Development
curecomp Software Services GmbH
Hafenstrasse 47-51
4020 Linz
web: www.curecomp.com
e-Mail: m.becker@curecomp.com