Posted to solr-user@lucene.apache.org by Teague James <te...@insystechinc.com> on 2017/02/01 16:24:59 UTC

Solr 6.0.0 Returns Blank Highlights for alpha-numeric combos

Hello everyone! I'm still stuck on this issue and could really use some
help. I have a Solr 6.0.0 instance that is storing documents peppered with
text like "1a", "2e", "4c", etc. If I search the documents for a word, "ms",
"in", "the", etc., I get the correct number of hits and the results are
highlighted correctly in the highlighting section. But when I search for
"1a" or "2e" I get hits, but the highlights are blank. Further testing
revealed that the highlighter fails to highlight any two-character
alphanumeric combination, such as n0, b1, 1z, etc.:
<result name="response" numFound="1" start="0">
...
<lst name="highlighting">
<lst name="8667"/>

Where "8667" is the document ID of the record that had the hit, but no
highlight. Other searches, "ms" for example, return:
<result name="response" numFound="1" start="0">
...
<lst name="highlighting">
 <lst name="8667">
  <arr name="text">
   <str>
    <em>MS</em>
   </str>
  </arr>
 </lst>
</lst>

Why does highlighting fail for "1a" type searches? Any help is appreciated!
Thanks!

-Teague James


Re: Solr 6.0.0 Returns Blank Highlights for alpha-numeric combos

Posted by Erick Erickson <er...@gmail.com>.
The termVectors and termOffsets aren't necessary. They can help with
speed, so I'd defer them.

I ran a quick test on 6.0 with your definitions and it works just
fine. I did have to comment out your custom stopwords filter on the
indexing but unless you're substituting for pairs like you indicate
(and I don't see how you could be) that shouldn't matter. I also am
using the default stopwords.txt file and assuming you don't have
anything unusual there.

I specified hl.fl=field and q=field:1a, which worked, as did
hl.fl=field and hl.q=field:1a.
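
For anyone reproducing this, the two request styles can be sketched roughly as follows. This is a minimal sketch, not the exact test that was run: the host, port, and core name (localhost:8983, mycore) are placeholders, "text" stands in for the field name, and the q=*:* used in the hl.q variant is an assumption, since the main query for that case isn't stated.

```python
from urllib.parse import urlencode

# Assumed base URL: host, port and core name are placeholders.
SOLR_SELECT = "http://localhost:8983/solr/mycore/select"


def highlight_url(field, term, use_hl_q=False):
    """Build a select URL that asks Solr to highlight `term` in `field`."""
    params = {"hl": "true", "hl.fl": field}
    if use_hl_q:
        # Assumption: match all documents and let hl.q drive highlighting.
        params["q"] = "*:*"
        params["hl.q"] = f"{field}:{term}"
    else:
        # The main query is also used for highlighting.
        params["q"] = f"{field}:{term}"
    return SOLR_SELECT + "?" + urlencode(params)


if __name__ == "__main__":
    print(highlight_url("text", "1a"))
    print(highlight_url("text", "1a", use_hl_q=True))
```

Pasting either URL into a browser or curl against a test core should show whether the highlighting section comes back populated.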

What that means is probably that you've changed "something" in your
setup that's causing this. I've often found it's easiest to just start
over and add one thing at a time in a test bed (my laptop if you must
know) until I have my "aha" moment.

Do note that unless your hl parameters default to including the "text"
field (which they do from your example) you wouldn't get anything.

Plus if you include "&debug=query" on the URL, the results can
sometimes shed light on what's actually happening as opposed to what
you expect ;)
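
A rough sketch of pulling the parsed query out of a debug response, in case it helps. The base URL and core name are placeholders, and the function names are made up for illustration. It assumes the JSON response writer (wt=json) and that the debug section carries a "parsedquery" entry, which is what debug=query normally returns.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL: host, port and core name are placeholders.
SOLR_SELECT = "http://localhost:8983/solr/mycore/select"


def debug_query_url(query):
    """Build a select URL with highlighting on and debug=query added."""
    params = {"q": query, "hl": "true", "debug": "query", "wt": "json"}
    return SOLR_SELECT + "?" + urlencode(params)


def parsed_query(response):
    """Return the parsed query from a decoded JSON response, or None."""
    return response.get("debug", {}).get("parsedquery")


def fetch_parsed_query(query):
    """Send the request to Solr and return the parsed query string."""
    with urlopen(debug_query_url(query)) as resp:
        return parsed_query(json.load(resp))
```

Calling fetch_parsed_query("text:1a") and then fetch_parsed_query("text:ms") against the same core lets you compare how the two terms come out of the query analyzer.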

Best,
Erick



RE: Solr 6.0.0 Returns Blank Highlights for alpha-numeric combos

Posted by Teague James <te...@insystechinc.com>.
Hi Erick! Thanks for the reply. The goal is to get two character terms like 1a, 1b, 2a, 2b, 3a, etc. to get highlighted in the documents. Additional testing shows that any alpha-numeric combo returns a blank highlight, regardless of length. Thus, "pr0blem" will not highlight because of the zero in the middle of the term.

I came across a ServerFault article where it was suggested that the fieldType must be tokenized in order for highlighting to work correctly. Setting the field type to text_general was suggested as a solution. In my case the data is stored as a string fieldType, which is then copied using copyField to a field that has a fieldType of text_general, but I'm still not getting a good highlight on terms like "1a". Highlighting works for any other non-alpha-numeric term though.

Other articles pointed to termVectors and termOffsets, but none of these seemed to help. Here's my config:

<field name="contents" type="string" indexed="true" stored="true" termPositions="true" termVectors="true" termOffsets="true" />
<field name="text" type="text_general" indexed="true" stored="true" multiValued="true"/>
<copyField source="contents" dest="text"/>

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
	<analyzer type="index">
		<tokenizer class="solr.WhitespaceTokenizerFactory"/>
		<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
		<filter class="solr.WordDelimiterFilterFactory" catenateAll="1" preserveOriginal="1" generateNumberParts="0" generateWordParts="0" />
		<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="true"/>
		<filter class="solr.LowerCaseFilterFactory"/>
		<filter class="solr.PorterStemFilterFactory"/>
		<filter class="solr.ApostropheFilterFactory"/>
	</analyzer>
	<analyzer type="query">
		<tokenizer class="solr.WhitespaceTokenizerFactory"/>
		<filter class="solr.WordDelimiterFilterFactory" catenateAll="1" preserveOriginal="1" generateNumberParts="0" generateWordParts="0" />
		<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
		<filter class="solr.LowerCaseFilterFactory"/>
		<filter class="solr.PorterStemFilterFactory"/>
		<filter class="solr.ApostropheFilterFactory"/>
	</analyzer>
</fieldType>

In the solrconfig file highlighting is set to use the text field: <str name="hl.fl">text</str> 
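
One way to see what the index and query analyzers above do to a term like "1a" is to ask Solr's field analysis handler directly. The sketch below only builds the request URL. It assumes the handler is reachable at /analysis/field on the core, and the host, port, core name, and sample text are placeholders rather than values from this setup.

```python
from urllib.parse import urlencode

# Assumed handler URL: host, port and core name are placeholders.
SOLR_ANALYSIS = "http://localhost:8983/solr/mycore/analysis/field"


def analysis_url(field, index_text, query_text):
    """Build a field-analysis URL for index-time and query-time text."""
    params = {
        "analysis.fieldname": field,       # analyze using this field's type
        "analysis.fieldvalue": index_text,  # text run through the index analyzer
        "analysis.query": query_text,       # text run through the query analyzer
        "analysis.showmatch": "true",       # flag index tokens that match the query
        "wt": "json",
    }
    return SOLR_ANALYSIS + "?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical sample text; substitute a snippet from a real document.
    print(analysis_url("text", "see section 1a of the report", "1a"))
```

The response lists the tokens produced after each tokenizer and filter stage, which is the same information the Analysis screen in the admin UI shows.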

Thoughts?

Appreciate the help! Thanks!

-Teague



Re: Solr 6.0.0 Returns Blank Highlights for alpha-numeric combos

Posted by Erick Erickson <er...@gmail.com>.
How far into the text field are these tokens? The highlighter defaults
to the first 10K characters under control of hl.maxAnalyzedChars. It's
vaguely possible that the values happen to be farther along in the
text than that. Not likely, mind you but possible.
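
To rule this out, the limit can be raised for a single request. A minimal sketch, assuming a local test core: the host, port, and core name are placeholders, and the value 1000000 is an arbitrary large number chosen for illustration, not a recommendation.

```python
from urllib.parse import urlencode

# Assumed base URL: host, port and core name are placeholders.
SOLR_SELECT = "http://localhost:8983/solr/mycore/select"


def highlight_url_with_limit(query, field, max_chars):
    """Build a highlighting request that overrides hl.maxAnalyzedChars."""
    params = {
        "q": query,
        "hl": "true",
        "hl.fl": field,
        "hl.maxAnalyzedChars": str(max_chars),
    }
    return SOLR_SELECT + "?" + urlencode(params)


if __name__ == "__main__":
    print(highlight_url_with_limit("text:1a", "text", 1000000))
```

If the highlight shows up with the larger limit and not without it, the term sits past the analyzed portion of the field.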

Best,
Erick
