Posted to solr-user@lucene.apache.org by Michael Sokolov <ms...@safaribooksonline.com> on 2014/05/02 15:34:58 UTC

PostingHighlighter complains about no offsets

I've been wanting to try out the PostingsHighlighter, so I added 
storeOffsetsWithPositions to my field definition, enabled the 
highlighter in solrconfig.xml,  reindexed and tried it out.  When I 
issue a query I'm getting this error:

field 'text' was indexed without offsets, cannot highlight

java.lang.IllegalArgumentException: field 'text' was indexed without offsets, cannot highlight
	at org.apache.lucene.search.postingshighlight.PostingsHighlighter.highlightDoc(PostingsHighlighter.java:545)
	at org.apache.lucene.search.postingshighlight.PostingsHighlighter.highlightField(PostingsHighlighter.java:467)
	at org.apache.lucene.search.postingshighlight.PostingsHighlighter.highlightFieldsAsObjects(PostingsHighlighter.java:392)
	at org.apache.lucene.search.postingshighlight.PostingsHighlighter.highlightFields(PostingsHighlighter.java:293)

I've been trying to figure out why the field wouldn't have offsets 
indexed, but I just can't see it.  Is there something in the analysis 
chain that could be stripping out offsets?


This is the field definition:

     <field name="text" type="text_en" indexed="true" stored="true"
            multiValued="false" termVectors="true" termPositions="true"
            termOffsets="true" storeOffsetsWithPositions="true" />

(Yes I know PH doesn't require term vectors; I'm keeping them around for 
now while I experiment)
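
Since the PostingsHighlighter reads offsets from the postings rather than from term vectors, a trimmed-down version of the field might look like the sketch below. This is an illustration only, not the configuration that was tested in this thread, and it assumes nothing else in the application still relies on the term vectors.

```xml
<!-- Sketch: the same field with the term vector attributes dropped.
     Assumption: no other component (e.g. another highlighter) needs term vectors. -->
<field name="text" type="text_en" indexed="true" stored="true"
       multiValued="false" storeOffsetsWithPositions="true" />
```

As with any change to how a field is indexed, the documents would need to be reindexed for this to take effect.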

     <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
       <analyzer type="index">
         <!-- We are indexing mostly HTML so we need to ignore the tags -->
         <charFilter class="solr.HTMLStripCharFilterFactory"/>
         <!--<tokenizer class="solr.StandardTokenizerFactory"/>-->
         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
         <!-- lower casing must happen before WordDelimiterFilter or protwords.txt will not work -->
         <filter class="solr.LowerCaseFilterFactory"/>
         <filter class="solr.WordDelimiterFilterFactory" stemEnglishPossessive="1" protected="protwords.txt"/>
         <!-- This deals with contractions -->
         <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" expand="true" ignoreCase="true"/>
         <filter class="solr.HunspellStemFilterFactory" dictionary="en_US.dic" affix="en_US.aff" ignoreCase="true"/>
         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
       </analyzer>
       <analyzer type="query">
         <!--<tokenizer class="solr.StandardTokenizerFactory"/>-->
         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
         <!-- lower casing must happen before WordDelimiterFilter or protwords.txt will not work -->
         <filter class="solr.LowerCaseFilterFactory"/>
         <filter class="solr.WordDelimiterFilterFactory" protected="protwords.txt"/>
         <!-- setting tokenSeparator="" solves issues with compound words and improves phrase search -->
         <filter class="solr.HunspellStemFilterFactory" dictionary="en_US.dic" affix="en_US.aff" ignoreCase="true"/>
         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
       </analyzer>
     </fieldType>
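
For readers following along, "enabled the highlighter in solrconfig.xml" on Solr 4.x usually means something like the fragment below. This is a sketch based on the stock Solr 4 setup, not a copy of the configuration used in this thread.

```xml
<!-- Sketch of a Solr 4.x solrconfig.xml fragment (assumed, not taken from this thread):
     swaps the default highlighter for the PostingsHighlighter-based implementation. -->
<searchComponent class="solr.HighlightComponent" name="highlight">
  <highlighting class="org.apache.solr.highlight.PostingsSolrHighlighter"/>
</searchComponent>
```

Requests would then typically pass hl=true and hl.fl=text to ask for highlights on the field.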

Re: PostingHighlighter complains about no offsets

Posted by Michael Sokolov <ms...@safaribooksonline.com>.

For posterity, in case anybody follows this thread, I tracked the 
problem down to WordDelimiterFilter; apparently it creates an offset of 
-1 in some cases, which PostingsHighlighter rejects.

-Mike
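
One way to confirm a diagnosis like this, sketched below, is to index a few documents into a copy of the field type that leaves out WordDelimiterFilterFactory and check whether the error goes away. The names text_en_nowdf and text_nowdf are hypothetical, and the thread does not say this is how the problem was actually tracked down.

```xml
<!-- Hypothetical diagnostic field type: same chain as text_en, minus WordDelimiterFilter -->
<fieldType name="text_en_nowdf" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" expand="true" ignoreCase="true"/>
    <filter class="solr.HunspellStemFilterFactory" dictionary="en_US.dic" affix="en_US.aff" ignoreCase="true"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.HunspellStemFilterFactory" dictionary="en_US.dic" affix="en_US.aff" ignoreCase="true"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical test field using the type above, with offsets stored in the postings -->
<field name="text_nowdf" type="text_en_nowdf" indexed="true" stored="true"
       multiValued="false" storeOffsetsWithPositions="true" />
```

If highlighting works on the test field but still fails on the original one, that points at the removed filter rather than at the field definition or a stale index.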



Re: PostingHighlighter complains about no offsets

Posted by Michael Sokolov <ms...@safaribooksonline.com>.

I checked using the analysis admin page, and I believe offsets are 
being generated (I assume the start/end values are the offsets).  So I 
don't know; I am going to try reindexing again.  Maybe I neglected to 
reload the config before I indexed last time.

-Mike
