Posted to solr-user@lucene.apache.org by Michael Sokolov <ms...@safaribooksonline.com> on 2014/05/02 15:34:58 UTC
PostingHighlighter complains about no offsets
I've been wanting to try out the PostingsHighlighter, so I added
storeOffsetsWithPositions to my field definition, enabled the
highlighter in solrconfig.xml, reindexed and tried it out. When I
issue a query I'm getting this error:
field 'text' was indexed without offsets, cannot highlight
java.lang.IllegalArgumentException: field 'text' was indexed without offsets, cannot highlight
    at org.apache.lucene.search.postingshighlight.PostingsHighlighter.highlightDoc(PostingsHighlighter.java:545)
    at org.apache.lucene.search.postingshighlight.PostingsHighlighter.highlightField(PostingsHighlighter.java:467)
    at org.apache.lucene.search.postingshighlight.PostingsHighlighter.highlightFieldsAsObjects(PostingsHighlighter.java:392)
    at org.apache.lucene.search.postingshighlight.PostingsHighlighter.highlightFields(PostingsHighlighter.java:293)
I've been trying to figure out why the field wouldn't have offsets
indexed, but I just can't see it. Is there something in the analysis
chain that could be stripping out offsets?
This is the field definition:
<field name="text" type="text_en" indexed="true" stored="true"
       multiValued="false" termVectors="true" termPositions="true"
       termOffsets="true" storeOffsetsWithPositions="true" />
(Yes I know PH doesn't require term vectors; I'm keeping them around for
now while I experiment)
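For comparison, here is a sketch of what the field might look like with the term vector attributes dropped. It assumes the same field name and type as above and is not a configuration taken from this thread. termVectors, termPositions and termOffsets control what is stored in term vectors, while storeOffsetsWithPositions is the attribute that puts offsets into the postings, which is where PostingsHighlighter reads them.

```xml
<!-- Sketch only: same field as above, minus the term vector attributes -->
<field name="text" type="text_en" indexed="true" stored="true"
       multiValued="false" storeOffsetsWithPositions="true" />
```

A change like this only takes effect for documents indexed after the schema has been reloaded.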
<fieldType name="text_en" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <!-- We are indexing mostly HTML so we need to ignore the tags -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <!--<tokenizer class="solr.StandardTokenizerFactory"/>-->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- lower casing must happen before WordDelimiterFilter or
         protwords.txt will not work -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            stemEnglishPossessive="1" protected="protwords.txt"/>
    <!-- This deals with contractions -->
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms.txt" expand="true" ignoreCase="true"/>
    <filter class="solr.HunspellStemFilterFactory"
            dictionary="en_US.dic" affix="en_US.aff" ignoreCase="true"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!--<tokenizer class="solr.StandardTokenizerFactory"/>-->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- lower casing must happen before WordDelimiterFilter or
         protwords.txt will not work -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            protected="protwords.txt"/>
    <!-- setting tokenSeparator="" solves issues with compound
         words and improves phrase search -->
    <filter class="solr.HunspellStemFilterFactory"
            dictionary="en_US.dic" affix="en_US.aff" ignoreCase="true"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
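The solrconfig.xml change mentioned at the top of this message is not shown in the thread. As a rough sketch, and assuming a Solr 4.x setup, enabling the PostingsHighlighter usually means pointing the highlight search component at PostingsSolrHighlighter, along these lines:

```xml
<!-- solrconfig.xml (sketch, Solr 4.x style; not the poster's actual file) -->
<searchComponent class="solr.HighlightComponent" name="highlight">
  <highlighting class="org.apache.solr.highlight.PostingsSolrHighlighter"/>
</searchComponent>
```

With that in place, a request with hl=true and hl.fl=text would go through the PostingsHighlighter, which is consistent with the stack trace above.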
Re: PostingHighlighter complains about no offsets
Posted by Michael Sokolov <ms...@safaribooksonline.com>.
For posterity, in case anybody follows this thread: I tracked the
problem down to WordDelimiterFilter. Apparently it creates an offset of
-1 in some cases, which PostingsHighlighter rejects.
-Mike
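One way to check whether a particular filter is responsible is to index a few documents into a scratch field whose analyzer omits that filter and see whether the error goes away. This is a diagnostic sketch, not something described in this thread. The names text_en_nowdf and text_nowdf below are hypothetical; the remaining filters are copied from the index analyzer of the text_en type quoted in the original message.

```xml
<!-- Hypothetical scratch type: the text_en index chain without WordDelimiterFilterFactory.
     An analyzer with no type attribute is used for both indexing and querying. -->
<fieldType name="text_en_nowdf" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory"
            synonyms="synonyms.txt" expand="true" ignoreCase="true"/>
    <filter class="solr.HunspellStemFilterFactory"
            dictionary="en_US.dic" affix="en_US.aff" ignoreCase="true"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical scratch field that stores offsets in the postings -->
<field name="text_nowdf" type="text_en_nowdf" indexed="true" stored="true"
       multiValued="false" storeOffsetsWithPositions="true" />
```

After reloading the config and reindexing, highlighting on text_nowdf (hl.fl=text_nowdf) and on text with the same query would show whether removing the filter changes the outcome.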
On 5/2/2014 10:20 AM, Michael Sokolov wrote:
> I checked using the analysis admin page, and I believe there are
> offsets being generated (I assume start/end=offsets). So IDK I am
> going to try reindexing again. Maybe I neglected to reload the config
> before I indexed last time.
Re: PostingHighlighter complains about no offsets
Posted by Michael Sokolov <ms...@safaribooksonline.com>.
I checked using the analysis admin page, and I believe offsets are being
generated (I assume start/end are the offsets). So I don't know; I am going
to try reindexing again. Maybe I neglected to reload the config before I
indexed last time.
-Mike