Posted to solr-user@lucene.apache.org by Tanguy Moal <ta...@gmail.com> on 2011/12/21 14:28:46 UTC

Solr 3.5 | Highlighting

Dear all,

I'm trying to get highlighting working, and I'm almost there, but it's not 
perfect yet...

Basically my documents have a title and a description.

I have two kinds of text fields :
text :
> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>             generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>             catenateAll="0" splitOnCaseChange="1"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>             words="stopwords.txt" enablePositionIncrements="true"/>
>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>             ignoreCase="true" expand="true"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.ASCIIFoldingFilterFactory"/>
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>             generateNumberParts="1" catenateWords="0" catenateNumbers="0"
>             catenateAll="0" splitOnCaseChange="1"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>             words="stopwords.txt" enablePositionIncrements="true"/>
>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>             ignoreCase="true" expand="true"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.ASCIIFoldingFilterFactory"/>
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>   </analyzer>
> </fieldType>
and text_french_light :
> <fieldType name="text_french_light" class="solr.TextField"
>            positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>             generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>             catenateAll="0" splitOnCaseChange="1"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>             words="stopwords.txt" enablePositionIncrements="true"/>
>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>             ignoreCase="true" expand="true"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.FrenchLightStemFilterFactory"/>
>     <filter class="solr.ASCIIFoldingFilterFactory"/>
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>             generateNumberParts="1" catenateWords="0" catenateNumbers="0"
>             catenateAll="0" splitOnCaseChange="1"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>             words="stopwords.txt" enablePositionIncrements="true"/>
>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>             ignoreCase="true" expand="true"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.FrenchLightStemFilterFactory"/>
>     <filter class="solr.ASCIIFoldingFilterFactory"/>
>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>   </analyzer>
> </fieldType>
I then define my fields the following way :
> <field name="title" type="text" indexed="true" stored="true"
>        termVectors="true" termPositions="true" termOffsets="true"/>
> <field name="title_stemmed" type="text_french_light" indexed="true"
>        stored="true" termVectors="true" termPositions="true"
>        termOffsets="true"/>
> <field name="title_stemmed_nonorms" type="text_french_light"
>        indexed="true" stored="false" omitNorms="true"
>        omitTermFreqAndPositions="true"/>
> <field name="description" type="text" indexed="true" stored="true"
>        termVectors="true" termPositions="true" termOffsets="true"/>
> <field name="description_stemmed" type="text_french_light"
>        indexed="true" stored="true" termVectors="true" termPositions="true"
>        termOffsets="true"/>
> <field name="description_stemmed_nonorms" type="text_french_light"
>        indexed="true" stored="false" omitNorms="true"
>        omitTermFreqAndPositions="true"/>
I have the following copyField directives :
> <copyField source="title" dest="title_stemmed" />
> <copyField source="title" dest="title_stemmed_nonorms" />
> <copyField source="description" dest="description_stemmed" />
> <copyField source="description" dest="description_stemmed_nonorms" />
I rely on the dismax query handler for relevancy.

I have two different search use cases :
- a "structured search" mode, where my query looks like
  q="Term1 term2"&qf=my_category_field^1.0&hl.q=Word1 word2&mm=100%
- a "free-text search" mode, where my query looks like
  q=Term1 term2&qf=title_stemmed_nonorms^1.0 description_stemmed_nonorms^0.5&mm=-40%

Shared query parameters are as follows :
defType=dismax&hl=on&hl.fl=title_stemmed description_stemmed&hl.useFastVectorHighlighter=true&hl.fragListBuilder=single
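
As an illustration of the two request shapes above, here is a minimal Python sketch that assembles the query strings with the standard library. The function names and the SHARED dictionary are made up for this example; only the parameter names and values come from the message. It also shows that characters such as %, ^, spaces and double quotes get percent-encoded once the parameters go into a real URL.

```python
from urllib.parse import urlencode, parse_qs

# Parameters shared by both search modes, copied from the message.
SHARED = {
    "defType": "dismax",
    "hl": "on",
    "hl.fl": "title_stemmed description_stemmed",
    "hl.useFastVectorHighlighter": "true",
    "hl.fragListBuilder": "single",
}


def structured_search(phrase, highlight_terms):
    """Query string for the "structured search" mode (phrase query + hl.q)."""
    params = dict(SHARED)
    params.update({
        "q": '"%s"' % phrase,          # phrase query, hence the quotes
        "qf": "my_category_field^1.0",
        "hl.q": highlight_terms,       # separate query used only for highlighting
        "mm": "100%",
    })
    return urlencode(params)           # percent-encodes %, ^, quotes and spaces


def free_text_search(terms):
    """Query string for the "free-text search" mode (no hl.q)."""
    params = dict(SHARED)
    params.update({
        "q": terms,
        "qf": "title_stemmed_nonorms^1.0 description_stemmed_nonorms^0.5",
        "mm": "-40%",
    })
    return urlencode(params)


if __name__ == "__main__":
    qs = structured_search("Term1 term2", "Word1 word2")
    print(qs)
    # Decoding the string gives back the original, unescaped values.
    print(parse_qs(qs)["mm"])
```

The resulting string would be appended to whatever select URL the Solr instance exposes; that URL is deliberately left out here because it depends on the installation.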

In both use cases the relevancy parameters are right and my results are 
satisfying.

The trouble is with highlighting :
- in the "free-text search" mode, everything is fine : the query is not 
a phrase query, and highlighted terms may differ from the query terms 
(when stemming comes into play)
- in the "structured search" mode, I have less luck : the query is a 
phrase query, so I rely on the hl.q parameter to get the highlighting I 
need. However, the query given in hl.q doesn't seem to be processed the 
way it should be when highlighting the fields : query analysis seems not 
to be applied.
I can show this easily : I enter a query term that isn't highlighted into 
the analysis.jsp page, take its stemmed form, use that in the hl.q 
parameter, and then the terms are highlighted as expected.

I suspect a bug around the handling of the highlighting query (hl.q) when 
the fields to highlight have a custom analysis (especially when stemming, 
word delimiters, and so on are involved).
I tried playing with hl.usePhraseHighlighter=true and 
hl.highlightMultiTerm=true but that didn't help at all =D

I tried using both legacy highlighter and FVH but the same issue occurs.
The issue only triggers when relying on hl.q.

Thank you very much for any help,

--
Tanguy

Re: Solr 3.5 | Highlighting

Posted by Tanguy Moal <ta...@gmail.com>.

On 21/12/2011 23:49, Koji Sekiguchi wrote:
> (11/12/21 22:28), Tanguy Moal wrote:
>> Dear all,
>>
[...]
>>
>> I tried using both legacy highlighter and FVH but the same issue occurs.
>> The issue only triggers when relying on hl.q.
>>
>> Thank you very much for any help,
>>
>> -- 
>> Tanguy
>>
>
> Tanguy,
>
> Thank you for reporting this!
>
> > The issue only triggers when relying on hl.q.
>
> That is not good. Can you reproduce the problem in the Solr example
> environment?
> If we can share the same environment (solrconfig.xml and schema.xml),
> the request params to reproduce it, and the data, I'd like to look into it.
>
> koji

Koji,

First, thank you for your quick reply.

Indeed, isolating the issue was the key to resolving it!

Once I had isolated it in the distribution's example directory, I couldn't 
reproduce the issue (and got the expected behaviour).

I then looked at my setup a little closer and realized that I wasn't 
running the same Solr distribution on my master server (Solr 3.4) and on 
my slave server (Solr 3.5, with the brand new hl.q parameter).

Since that isn't a recommended setup, I'll simply assume that the error 
was on my side. Sorry for the false alarm :-D
The new highlighter is great!
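
For anyone wanting to rule out the same kind of mismatch, a minimal Python sketch follows. It rests on assumptions that are not in this thread: that each server exposes the system info handler at /admin/system, that its JSON response carries the version under lucene -> solr-spec-version, and that comparing the first two version components is enough. The function names and base URLs are hypothetical.

```python
import json
from urllib.request import urlopen


def fetch_system_info(base_url):
    """Fetch and parse the system info of one Solr server.

    Assumption: the admin handlers are registered, so that
    <base_url>/admin/system?wt=json returns a JSON document.
    """
    with urlopen(base_url.rstrip("/") + "/admin/system?wt=json") as resp:
        return json.load(resp)


def solr_spec_version(system_info):
    """Read the Solr version string from a parsed response (assumed layout)."""
    return system_info["lucene"]["solr-spec-version"]


def major_minor(version):
    """Reduce a version string such as '3.5.0.2011...' to (3, 5)."""
    return tuple(int(part) for part in version.split(".")[:2])


def same_release(master_info, slave_info):
    """True when both servers report the same major.minor Solr release."""
    return (major_minor(solr_spec_version(master_info))
            == major_minor(solr_spec_version(slave_info)))


# Usage with hypothetical hosts:
#   master = fetch_system_info("http://master:8983/solr")
#   slave = fetch_system_info("http://slave:8983/solr")
#   if not same_release(master, slave):
#       print("master and slave run different Solr releases")
```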

--
Tanguy

Re: Solr 3.5 | Highlighting

Posted by Koji Sekiguchi <ko...@r.email.ne.jp>.

(11/12/21 22:28), Tanguy Moal wrote:
> [...]

Tanguy,

Thank you for reporting this!

 > The issue only triggers when relying on hl.q.

That is not good. Can you reproduce the problem in the Solr example environment?
If we can share the same environment (solrconfig.xml and schema.xml), the request
params to reproduce it, and the data, I'd like to look into it.

koji
-- 
http://www.rondhuit.com/en/