Posted to solr-user@lucene.apache.org by Becker Moritz <m....@curecomp.com> on 2015/06/15 20:47:23 UTC
Inconsistent Solr highlighting

Hi,

I have the requirement to index internationalized fields ('name') with Solr.
For this purpose, I want to use dynamic fields and have e.g. 'name_en', 'name_de', 'name_fr' in my Solr documents.

When querying the index, I need to know which language a match was found in. For this, I want to use Solr highlighting.

My problem is that the highlighting behaves inconsistently, which breaks this use case.
The configuration of e.g. my dynamic field '*_en' is as follows:

<dynamicField name="*_en"  type="text_en"    indexed="true"  stored="true" multiValued="false"/>

The field type 'text_en' is configured as follows:

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <!-- Case insensitive stop word removal. -->
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="lang/stopwords_en.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <!-- Optionally you may want to use this less aggressive stemmer instead of PorterStemFilterFactory:
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
    -->
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="lang/stopwords_en.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <!-- Optionally you may want to use this less aggressive stemmer instead of PorterStemFilterFactory:
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
    -->
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

My index contains the following document:

<doc>
<int name="id">25</int>
<str name="name_it">Note Test</str>
<str name="description_it"/>
<str name="name_en">Note Test Translation</str>
<str name="description_en"/>
<long name="_version_">1504065955969368064</long>
</doc>

The query defType=edismax&q=Translation&hl=on&hl.fl=name_* returns the above document but does not highlight anything.
The query defType=edismax&q=name_en:Translation&hl=on&hl.fl=name_* returns the above document AND highlights 'Translation' as expected.
Since 'Translation' does not occur in any other field, I do not understand how the match could have occurred on a field other than 'name_en' (which would explain why 'name_en' is not highlighted).
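
To make the comparison reproducible, the two requests can be built side by side. This is only a sketch: the base URL and the core name 'collection1' are placeholders (assumptions, not taken from my setup), and the helper name build_url is made up for illustration. The parameters themselves are exactly the ones quoted above.

```python
from urllib.parse import urlencode

# Assumption: placeholder host and core name, replace with the real ones.
BASE_URL = "http://localhost:8983/solr/collection1/select"


def build_url(q):
    """Build a select URL with the edismax and highlighting parameters from the post."""
    params = {
        "defType": "edismax",
        "q": q,
        "hl": "on",
        "hl.fl": "name_*",
    }
    # urlencode percent-encodes ':' and '*' so the URL is safe to send as-is.
    return BASE_URL + "?" + urlencode(params)


# Request 1: no field given, document is returned but nothing is highlighted.
unfielded = build_url("Translation")
# Request 2: explicit field, document is returned and highlighted.
fielded = build_url("name_en:Translation")

print(unfielded)
print(fielded)
```

The only difference between the two URLs is the value of q, so whatever causes the missing highlight has to be related to how the unfielded query is resolved.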
I already tried the suggestions from:
http://stackoverflow.com/questions/23755097/solr-highlighting-hl-simple-pre-post-doesnt-appear-sometime
http://lucene.472066.n3.nabble.com/Urgent-Highlighting-not-working-as-expected-td3983755.html
http://stackoverflow.com/questions/9842886/why-is-this-simple-solr-highlighting-attempt-failing

None of them worked.

Moreover, when I run defType=edismax&q=Note&hl=on&hl.fl=name_* the result is:
<doc>
<int name="id">25</int>
<str name="name_it">Note Test</str>
<str name="description_it"/>
<str name="name_en">Note Test Translation</str>
<str name="description_en"/>
<long name="_version_">1504067222466723840</long>
</doc>
<doc>
<int name="id">27</int>
<str name="name_de">Note Test child</str>
<str name="description_de"/>
<long name="_version_">1504067222528589824</long>
</doc>

However, the highlighting only contains fields of document 25 but not 27:

<lst name="highlighting">
<lst name="25">
<arr name="name_it">
<str>&lt;em&gt;Note&lt;/em&gt; Test</str>
</arr>
<arr name="name_en">
<str>&lt;em&gt;Note&lt;/em&gt; Test Translation</str>
</arr>
</lst>
<lst name="27"/>
</lst>
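
To show why the empty entry for document 27 matters, here is a minimal sketch of how the matched language would be derived from the highlighting section. It assumes the response is requested as JSON (wt=json), where the highlighting section is an object keyed by document id; the function name matched_languages and the 'name_' prefix handling are illustrative assumptions, not existing code.

```python
def matched_languages(highlighting, prefix="name_"):
    """Map each document id to the language suffixes of its highlighted fields."""
    result = {}
    for doc_id, fields in highlighting.items():
        # Keep only fields like 'name_en' that actually carry a snippet,
        # and strip the prefix so that only the language code remains.
        result[doc_id] = sorted(
            name[len(prefix):]
            for name, snippets in fields.items()
            if name.startswith(prefix) and snippets
        )
    return result


# The highlighting section shown above, in its JSON form.
highlighting = {
    "25": {
        "name_it": ["<em>Note</em> Test"],
        "name_en": ["<em>Note</em> Test Translation"],
    },
    "27": {},
}

languages = matched_languages(highlighting)
print(languages)
```

For document 25 this yields 'en' and 'it', but for document 27 the list is empty, so the language of the match ('de') cannot be determined even though the document was returned.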

I really do not understand what is happening here and what I can do to make the highlighting consistent.
Also, is my approach of using 'name_en', 'name_de', ... fields for indexing localized content reasonable, or is there a better way?

Thank you for your help and best regards

Moritz Becker
Software Development

curecomp Software Services GmbH
Hafenstrasse 47-51
4020 Linz

web: www.curecomp.com
e-Mail: m.becker@curecomp.com
