Posted to solr-user@lucene.apache.org by Orosz György <or...@itk.ppke.hu> on 2011/07/29 10:55:33 UTC

slow highlighting because of stemming

Dear all,

I am quite new to Solr and would like to ask for your help.
I am developing an application which should be able to highlight the results
of a query. For this I am using the regex fragmenter:
<highlighting>
  <fragmenter name="regex" class="org.apache.solr.highlight.RegexFragmenter">
    <lst name="defaults">
      <int name="hl.fragsize">500</int>
      <float name="hl.regex.slop">0.5</float>
      <str name="hl.pre"><![CDATA[<b>]]></str>
      <str name="hl.post"><![CDATA[</b>]]></str>
      <str name="hl.useFastVectorHighlighter">true</str>
      <str name="hl.regex.pattern">[-\w ,/\n\"']{20,300}[.?!]</str>
      <str name="hl.fl">dokumentum_syn_query</str>
    </lst>
  </fragmenter>
</highlighting>
The field is indexed with term vectors and offsets:
<field name="dokumentum_syn_query" type="huntext_syn" indexed="true"
       stored="true" multiValued="true" termVectors="on" termPositions="on"
       termOffsets="on"/>
<fieldType name="huntext_syn" class="solr.TextField" stored="true"
           indexed="true" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="com.morphologic.solr.huntoken.HunTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords_query.txt" enablePositionIncrements="true"/>
    <filter class="com.morphologic.solr.hunstem.HumorStemFilterFactory"
            lex="/home/oroszgy/workspace/morpho/solrplugins/data/lex"
            cache="alma"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords_query.txt" enablePositionIncrements="true"/>
    <filter class="com.morphologic.solr.hunstem.HumorStemFilterFactory"
            lex="/home/oroszgy/workspace/morpho/solrplugins/data/lex"
            cache="alma"/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms_query.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

The highlighting works well, except that it's really slow. I realized that
this is because the highlighter/fragmenter stems all the result documents
again.

Could you please help me understand why this happens and how I can avoid it?
(I thought that using FastVectorHighlighter would solve my problem, but it
didn't.)

Thanks in advance!
Gyuri Orosz

Re: slow highlighting because of stemming

Posted by Orosz György <or...@itk.ppke.hu>.

Thanks for the answers!
This was the solution! :) (My mistake was that I used the value "on"
instead of "true"; I don't know why.)
Gyuri
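
For reference, here is a sketch of what the corrected field definition presumably looks like. It is the field from the original post with only the three term vector attributes changed from "on" to "true", as the poster describes. Everything else is copied unchanged, and this is an assumption about the final schema, not a copy of it.

```xml
<!-- Assumed corrected version of the field from the original post.
     Only termVectors, termPositions and termOffsets changed ("on" to "true"). -->
<field name="dokumentum_syn_query" type="huntext_syn" indexed="true"
       stored="true" multiValued="true"
       termVectors="true" termPositions="true" termOffsets="true"/>
```

FastVectorHighlighter needs all three attributes enabled on the field, and documents indexed before the change have to be reindexed before their term vectors are available.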

Re: slow highlighting because of stemming

Posted by Michael Sokolov <so...@ifactory.com>.

On 7/30/2011 3:46 AM, Orosz György wrote:
> Hi,
>
> Thanks for the answer!
> I am logging the stemming, and I can see that a lot of tokens are
> stemmed during highlighting. This is the strange part, since I don't
> understand why a highlighter needs to stem again.
Consider that the highlighter needs to match terms from the query with 
terms from the document, just like search. If the indexed document has 
been stemmed, then the query also needs to be stemmed, or you won't see 
matches.

-Mike

Re: slow highlighting because of stemming

Posted by Ahmet Arslan <io...@yahoo.com>.

> I am logging the stemming, and I can see that a lot of tokens are
> stemmed during highlighting. This is the strange part, since I don't
> understand why a highlighter needs to stem again.

Highlighting does re-analyze the text being highlighted.

> Anyway, my documents are not really large, just a few kilobytes, but
> thanks for this suggestion.
>
> If you could help me with how to skip the stemming for highlighting,
> that would be great!

If you store term vectors, then this re-analysis is skipped.
http://wiki.apache.org/solr/FieldOptionsByUseCase

Re: slow highlighting because of stemming

Posted by Orosz György <or...@itk.ppke.hu>.

Hi,

Thanks for the answer!
I am logging the stemming, and I can see that a lot of tokens are
stemmed during highlighting. This is the strange part, since I don't
understand why a highlighter needs to stem again.
Anyway, my documents are not really large, just a few kilobytes, but thanks
for this suggestion.

If you could help me with how to skip the stemming for highlighting,
that would be great!

Thanks,
Gyuri

Re: slow highlighting because of stemming

Posted by Mike Sokolov <so...@ifactory.com>.

I'm not sure I would identify stemming as the culprit here.

Do you have very large documents?  If so, there is a patch for FVH 
committed to limit the number of phrases it looks at; see 
hl.phraseLimit, but this won't be available until 3.4 is released.

You can also limit the amount of each document that is analyzed by the 
regular Highlighter using maxDocCharsToAnalyze (and maybe this applies 
to FVH? not sure)

Using RegexFragmenter is also probably slower than something like 
SimpleFragmenter.
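
As a rough illustration of the suggestions above, the settings could be placed in the defaults of a search request handler in solrconfig.xml. This is only a sketch: the handler name and all numeric values are assumptions chosen for the example. In Solr the limit on analyzed characters is set with hl.maxAnalyzedChars, and the Solr fragmenter that builds on Lucene's SimpleFragmenter is registered under the name "gap".

```xml
<!-- Sketch only: handler name and values are illustrative assumptions. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="hl">true</str>
    <str name="hl.fl">dokumentum_syn_query</str>
    <!-- Use the gap fragmenter instead of the regex fragmenter. -->
    <str name="hl.fragmenter">gap</str>
    <!-- Limit how many characters of each document the standard
         highlighter analyzes. -->
    <int name="hl.maxAnalyzedChars">51200</int>
    <!-- FastVectorHighlighter only, available from Solr 3.4:
         limit the number of phrases examined per document. -->
    <int name="hl.phraseLimit">1000</int>
  </lst>
</requestHandler>
```

The same parameters can also be passed on individual requests, which makes it easier to compare timings before committing to a default.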

There is work to implement faster highlighting for Solr/Lucene, but it 
depends on some basic changes to the search architecture so it might be 
a while before that becomes available.  See 
https://issues.apache.org/jira/browse/LUCENE-3318 if you're interested 
in following that development.

-Mike
