Posted to solr-user@lucene.apache.org by Jeffrey Schmidt <je...@mac.com> on 2012/04/23 21:26:54 UTC

Re: FastVectorHighlighter -> no highlights

This does not appear to be shingle specific.  A non-shingled field is also NOT highlighted in the same manner with FVH.  I can see in the timing information that it takes much longer to run FVH than no highlighting at all, so Solr must be doing something.  But why it just lists the document IDs and little or no field highlights is still a mystery.
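
A minimal sketch, assuming the XML response format shown in the quoted message below, of how one might list the documents whose highlighting entry comes back with no field snippets at all. The function name and the sample payload are hypothetical and only illustrate the check:

```python
import xml.etree.ElementTree as ET


def empty_highlight_ids(response_xml):
    """Return the IDs of documents whose highlighting entry has no snippets."""
    root = ET.fromstring(response_xml)
    # The highlighting section is <lst name="highlighting"> with one <lst> per document.
    if root.tag == "lst" and root.get("name") == "highlighting":
        section = root
    else:
        section = root.find(".//lst[@name='highlighting']")
    if section is None:
        return []
    # A document <lst> with no child elements carries no highlighted fields.
    return [doc.get("name") for doc in section.findall("lst") if len(doc) == 0]


# Hypothetical, shortened payload modelled on the FVH output quoted below.
SAMPLE = """<response>
  <lst name="highlighting">
    <lst name="ING:3lzx"/>
    <lst name="ING:8lj">
      <arr name="n_macromolecule_summary"><str>breast or ovarian cancer.</str></arr>
    </lst>
  </lst>
</response>"""

print(empty_highlight_ids(SAMPLE))
```

Running this against the responses with and without hl.useFastVectorHighlighter=true would show which documents lose all of their highlights.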

Any ideas on where I should look in the configuration, parameters to try etc.?

Cheers,

Jeff

On Apr 19, 2012, at 7:51 AM, Jeff Schmidt wrote:

> I am using Solr 4.0, and debug=timing shows Solr spending the great majority of its time in the HighlightComponent. It seemed logical to look into the FastVectorHighlighter.  It does seem much faster, but on the other hand, I'm not getting the highlights I need. :)
> 
> I've seen references to FVH not supporting MultiTerm and (non-fixed sized) ngrams.  I'm using edismax, and I don't know if a certain configuration of that becomes multi-term and that's my problem, or if there is something completely different. I don't have ngrams, but I do shingle.  For the examples below, I have these fields defined:
> 
>       <field name="n_macromolecule_name" type="text_lc_np_shingle" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true" />
>       <field name="n_protein_family" type="text_lc_np_shingle" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true" />
>       <field name="n_pathway_name" type="text_lc_np_shingle" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true" />
>       <field name="n_cellreg_regulated_by" type="text_lc_np_shingle" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true" />
>       <field name="n_cellreg_disease" type="text_lc_np_shingle" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true" />
>       <field name="n_macromolecule_summary" type="text_lc_np_shingle" indexed="true" stored="true" multiValued="true" termVectors="true" termPositions="true" termOffsets="true"/>
> 
> 
> Note that all are indexed, stored and multi-valued, and I have termVectors="true" termPositions="true" termOffsets="true" to enable FVH. When I had missed that in a field, the log said so and Solr reverted to the regular highlighter. I no longer see those messages.  All of the above fields are of this type:
> 
>         <!-- A text field that forces lowercase, removes punctuation and generates shingles for phrase matching -->
>        <fieldType name="text_lc_np_shingle" class="solr.TextField" positionIncrementGap="100">
>          <analyzer type="index">
>            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>            <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>            <!-- strip punctuation -->
>            <filter class="solr.PatternReplaceFilterFactory"
>                pattern="([\p{Punct}])" replacement="" replace="all"/>
>            <!-- Remove any 0-length tokens. -->
>            <filter class="solr.LengthFilterFactory" min="1" max="100"/>
>            <filter class="solr.LowerCaseFilterFactory"/>
>            <filter class="solr.ShingleFilterFactory" maxShingleSize="4" outputUnigrams="true" />         
>          </analyzer>
>          <analyzer type="query">
>            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>            <!-- strip punctuation -->
>            <filter class="solr.PatternReplaceFilterFactory"
>                pattern="([\p{Punct}])" replacement="" replace="all"/>
>            <!-- Remove any 0-length tokens. -->
>            <filter class="solr.LengthFilterFactory" min="1" max="100"/>
>            <filter class="solr.LowerCaseFilterFactory"/>
>            <filter class="solr.ShingleFilterFactory" maxShingleSize="4" outputUnigrams="false" outputUnigramsIfNoShingles="true"/>
>          </analyzer>
>        </fieldType>
> 
> 
> Using the standard highlight component, for the search term cancer (rows=2), I get the highlights I've come to appreciate:
> 
>     <lst name="highlighting">
>         <lst name="ING:3lzx">
>             <arr name="n_macromolecule_name">
>                 <str>&lt;span class="ingReasonText"&gt;cancer&lt;/span&gt; susceptibility candidate 1</str>
>             </arr>
>             <arr name="n_protein_family">
>                 <str>&lt;span class="ingReasonText"&gt;Cancer&lt;/span&gt; susceptibility candidate 1</str>
>             </arr>
>         </lst>
>         <lst name="ING:8lj">
>             <arr name="n_macromolecule_name">
>                 <str>breast &lt;span class="ingReasonText"&gt;cancer&lt;/span&gt; 2, early onset</str>
>             </arr>
>             <arr name="n_pathway_name">
>                 <str>Hereditary Breast &lt;span class="ingReasonText"&gt;Cancer&lt;/span&gt; Signaling</str>
>             </arr>
>             <arr name="n_cellreg_regulated_by">
>                 <str>prostate &lt;span class="ingReasonText"&gt;cancer&lt;/span&gt; cells</str>
>             </arr>
>             <arr name="n_cellreg_disease">
>                 <str>breast &lt;span class="ingReasonText"&gt;cancer&lt;/span&gt;</str>
>             </arr>
>             <arr name="n_macromolecule_summary">
>                 <str> mutations in BRCA1 and this gene, BRCA2, confer increased lifetime risk of developing breast or ovarian &lt;span class="ingReasonText"&gt;cancer.&lt;/span&gt;</str>
>             </arr>
>         </lst>
>     </lst>
> 
> With everything else being the same, when I set hl.useFastVectorHighlighter=true I get:
> 
>     <lst name="highlighting">
>         <lst name="ING:3lzx"/>
>         <lst name="ING:8lj">
>             <arr name="n_macromolecule_summary">
>                 <str>breast or &lt;span class="ingReasonText"&gt;ovarian&lt;/span&gt; cancer. Both BRCA1 and BRCA2 are involved in maintenance of genome stability, specifically</str>
>             </arr>
>         </lst>
>     </lst>
> 
> Note that the same fields simply do not appear, except for n_macromolecule_summary, in which case it's for some reason highlighting "ovarian" instead of "cancer".
> 
> Highlight related configuration is in the edismax request handler:
> 
>      <str name="hl.requireFieldMatch">true</str>
>      <str name="hl.usePhraseHighlighter">true</str>
>      <str name="hl.phraseLimit">5000</str>
>      <str name="hl.fragListBuilder">simple</str>
>      <str name="hl.fragmentsBuilder">colored</str>
>      <str name="hl.simple.pre"><![CDATA[<span class="ingReasonText">]]></str>
>      <str name="hl.simple.post"><![CDATA[</span>]]></str>
>      <str name="hl.tag.pre"><![CDATA[<span class="ingReasonText">]]></str>
>      <str name="hl.tag.post"><![CDATA[</span>]]></str>
>      
>      <!-- for this field, we want no fragmenting, just highlighting -->
>      <str name="f.name.hl.fragsize">0</str>
>      <!-- instructs Solr to return the field itself if no query terms are
>           found
>      <str name="f.name.hl.alternateField">name</str> -->
>      <str name="f.text.hl.fragmenter">regex</str> <!-- defined below -->
> 
> Any ideas on what I'm doing wrong?  Sorry for the long email, but I'm trying to answer as many anticipated configuration questions as I can. Is there a problem with FVH and shingling?  Hopefully it's something else?
> 
> Thanks,
> 
> Jeff
> --
> Jeff Schmidt
> 535 Consulting
> jas@535consulting.com
> http://www.535consulting.com
> (650) 423-1068
> 

--
Jeff Schmidt
jeff_schmidt@mac.com

Re: FastVectorHighlighter -> no highlights

Posted by Schmidt Jeff <ja...@gmail.com>.

Okay, my fault. I misunderstood the conditions under which DataStax Enterprise 2 re-indexes content. Although I had the field definitions set properly to support FVH, I believe no position and offset data had actually been generated, which would explain the empty highlights.
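
A rough sketch of how one could confirm that positions and offsets exist for a document after re-indexing, using Solr's TermVectorComponent parameters (tv, tv.fl, tv.positions, tv.offsets). The handler URL, the unique key field name "id" and the helper name are assumptions for illustration; a given installation (DataStax Enterprise included) may expose the component differently or not at all:

```python
from urllib.parse import urlencode


def term_vector_check_url(handler_url, doc_id, field):
    """Build a request URL asking Solr for term vector details of one document."""
    params = {
        # Assumption: the schema's unique key field is called "id".
        "q": 'id:"%s"' % doc_id,
        "fl": "id",
        "rows": "1",
        # TermVectorComponent parameters.
        "tv": "true",
        "tv.fl": field,
        "tv.positions": "true",
        "tv.offsets": "true",
    }
    return handler_url + "?" + urlencode(params)


# Hypothetical handler path; it must have the TermVectorComponent registered.
url = term_vector_check_url("http://localhost:8983/solr/tvrh", "ING:6xwoe", "n_name")
print(url)
```

If the response for a field carries no position or offset entries, the stored index predates the schema change and the content still needs to be re-indexed.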

After re-indexing my content, I am now getting highlights from FVH, and I'm getting them noticeably faster.  But, for some reason, the document ID is being highlighted.  For example, for a given document using the old highlighter:

<lst name="ING:6xwoe">
    <arr name="n_name">
        <str><span class="ingReasonText">egfr</span></str>
    </arr>
    <arr name="n_synonym">
        <str><span class="ingReasonText">egfr</span></str>
    </arr>
</lst>

With FVH, I get:

<lst name="ING:6xwoe">
    <arr name="n_name">
        <str><span class="ingReasonText">ING:</span>6xwoe egfr </str>
    </arr>
    <arr name="n_synonym">
        <str><span class="ingReasonText">ING:</span>6xwoe egfr </str>
    </arr>
</lst>
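
To reproduce the difference side by side, one option is to issue the same query twice and change only hl.useFastVectorHighlighter. The sketch below only builds the two request URLs; the select URL, the query term and the helper name are assumptions, not taken from my actual setup:

```python
from urllib.parse import urlencode


def highlight_urls(select_url, query, fields):
    """Return (standard_url, fvh_url), identical except for the FVH switch."""
    common = {
        "q": query,
        "defType": "edismax",
        "hl": "true",
        "hl.fl": ",".join(fields),
        "rows": "2",
    }
    standard = dict(common, **{"hl.useFastVectorHighlighter": "false"})
    fvh = dict(common, **{"hl.useFastVectorHighlighter": "true"})
    return (select_url + "?" + urlencode(standard),
            select_url + "?" + urlencode(fvh))


# Hypothetical endpoint and query; the field names come from the output above.
standard_url, fvh_url = highlight_urls(
    "http://localhost:8983/solr/select", "egfr", ["n_name", "n_synonym"])
print(standard_url)
print(fvh_url)
```

Fetching both URLs and comparing the highlighting sections makes it easier to see which snippets differ only because of the highlighter.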

Anybody ever seen that before?

Thanks,

Jeff
