Posted to solr-user@lucene.apache.org by Justin Engelman <ju...@smalldemons.com> on 2012/08/03 06:38:39 UTC

Highlighting error InvalidTokenOffsetsException: Token oedipus exceeds length of provided text sized 11

I have an autocomplete index for which I return highlighting information, but
on Solr 3.5 I am getting an error with certain search strings and fields.
I’ve narrowed it down to a specific field matching a specific search
string. I’ve tried a few different changes to the schema, rebuilding each
time, but so far I cannot get the error to go away. The field that is
failing is an ngram-indexed field for matching on the start of any word.
Any help would be appreciated.



The text being searched for is “ant” (without quotes).

The field value that is matching and causing the error is “Anti-Å’dipus”
(again without quotes).
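
A note on the value above: the two characters “Å’” are what the single character “Œ” (U+0152) looks like when its UTF-8 bytes are decoded as Windows-1252. The Python sketch below only demonstrates that encoding relationship and the resulting string lengths. Whether the indexed value really contains “Œ” is an assumption; the post does not confirm it.

```python
# Assumption: the stored value is "Anti-Œdipus", and the "Å’" shown in the
# post is the same text with its UTF-8 bytes decoded as Windows-1252.
intended = "Anti-\u0152dipus"  # "Anti-Œdipus"
garbled = intended.encode("utf-8").decode("cp1252")

print(garbled)        # the value as it appears in the post
print(len(intended))  # length of the assumed stored value
print(len(garbled))   # length of the garbled form
```

Under that assumption, the shorter form has 11 characters, which would agree with the "text sized 11" in the error, while the garbled form has 12 characters, which agrees with the endOffset of 12 in the field analysis.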



The field schema is (additional fields and field types removed):

<types>
<fieldType name="autocomplete_ngram" class="solr.TextField">
 <analyzer type="index">
  <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
  <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(\|)" replaceWith="or" replace="all"/>
  <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([&amp;])" replaceWith="and" replace="all"/>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" generateNumberParts="1" catenateWords="0"
          catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.EdgeNGramFilterFactory" maxGramSize="20" minGramSize="2"/>
  <filter class="solr.PatternReplaceFilterFactory" pattern="([^\w\d\*æøåÆØÅ ])" replacement=" " replace="all"/>
 </analyzer>
 <analyzer type="query">
  <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
  <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(\|)" replaceWith="or" replace="all"/>
  <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([&amp;])" replaceWith="and" replace="all"/>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.PatternReplaceFilterFactory" pattern="([^\w\d\*æøåÆØÅ ])" replacement=" " replace="all"/>
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="0" generateNumberParts="0" catenateWords="0"
          catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.PatternReplaceFilterFactory" pattern="^(.{20})(.*)?" replacement="$1" replace="all"/>
 </analyzer>
</fieldType>
</types>

<fields>
<field name="ng" type="autocomplete_ngram" indexed="true"
       stored="true" omitNorms="true" omitTermFreqAndPositions="true"/>
</fields>



Things I’ve tried changing in the above: replacing the
PatternReplaceCharFilterFactory charFilters with PatternReplaceFilterFactory
filters, moving around the order of the filters (particularly moving the
PatternReplaceFilterFactory filters to the top or bottom of the filters),
and completely removing the WordDelimiterFilterFactory and the
PatternReplaceFilterFactory that has the pattern="([^\w\d\*æøåÆØÅ ])".
No matter what I do, I still get errors. Sometimes the matched value that
triggers the error seems to change, but the one included here seems to be
the most consistent.



Highlighting is configured as:

<requestHandler name="ac" class="solr.SearchHandler" default="true">
<lst name="defaults">
  <str name="defType">edismax</str>
  <str name="wt">json</str>
  <int name="rows">10</int>
  <bool name="hl">true</bool>
  <str name="hl.fl">ng</str>
  <int name="hl.snippets">4</int>
  <bool name="hl.requireFieldMatch">true</bool>
  <int name="hl.fragsize">2</int>
  <str name="fl">ng score</str>
</lst>
</requestHandler>
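
Because these values are handler defaults, they can be overridden per request, which is one way to check whether the failure comes from highlighting alone. The Python sketch below builds two request URLs, one relying on the defaults and one with `hl=false`. The base URL is the one used later in this post; the helper name `select_url` is made up for this illustration.

```python
from urllib.parse import urlencode

# Base URL as it appears later in this post.
BASE = "http://localhost:8983/solr/autocomplete/select/"


def select_url(query, overrides=None):
    """Build a select URL; `overrides` replaces handler defaults per request."""
    params = {"q": query}
    params.update(overrides or {})
    return BASE + "?" + urlencode(params)


# Uses the handler defaults, so highlighting is on.
with_highlighting = select_url("ng:(ant)")
# Same query with highlighting switched off for this request only.
without_highlighting = select_url("ng:(ant)", {"hl": "false"})

print(with_highlighting)
print(without_highlighting)
```

If the second request succeeds while the first fails, that would point at the highlighting step rather than at query parsing or matching.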



When I do a field analysis using that search term and field value I get:

*Index Analyzer*

*org.apache.solr.analysis.MappingCharFilterFactory {mapping=mapping-ISOLatin1Accent.txt, luceneMatchVersion=LUCENE_35}*

text         Anti-A’dipus

*org.apache.solr.analysis.PatternReplaceCharFilterFactory {replace=all, pattern=(\|), replaceWith=or, luceneMatchVersion=LUCENE_35}*

text         Anti-A’dipus

*org.apache.solr.analysis.PatternReplaceCharFilterFactory {replace=all, pattern=([&]), replaceWith=and, luceneMatchVersion=LUCENE_35}*

text         Anti-A’dipus

*org.apache.solr.analysis.StandardTokenizerFactory {luceneMatchVersion=LUCENE_35}*

position     1           2
term text    Anti        A’dipus
startOffset  0           5
endOffset    4           12
type         <ALPHANUM>  <ALPHANUM>

*org.apache.solr.analysis.WordDelimiterFilterFactory {splitOnCaseChange=1, generateNumberParts=1, catenateWords=0, luceneMatchVersion=LUCENE_35, generateWordParts=1, catenateAll=0, catenateNumbers=0}*

position     1           2           3
term text    Anti        A           dipus
startOffset  0           5           7
endOffset    4           6           12
type         <ALPHANUM>  <ALPHANUM>  <ALPHANUM>

*org.apache.solr.analysis.LowerCaseFilterFactory {luceneMatchVersion=LUCENE_35}*

position     1           2           3
term text    anti        a           dipus
startOffset  0           5           7
endOffset    4           6           12
type         <ALPHANUM>  <ALPHANUM>  <ALPHANUM>

*org.apache.solr.analysis.EdgeNGramFilterFactory {maxGramSize=20, minGramSize=2, luceneMatchVersion=LUCENE_35}*

position     1     2     3     4     5     6     7
term text    an    ant   anti  di    dip   dipu  dipus
startOffset  0     0     0     7     7     7     7
endOffset    2     3     4     9     10    11    12
type         word  word  word  word  word  word  word

*org.apache.solr.analysis.PatternReplaceFilterFactory {replace=all, replacement= , pattern=([^\w\d\*æøåÆØÅ ]), luceneMatchVersion=LUCENE_35}*

position     1     2     3     4     5     6     7
term text    an    ant   anti  di    dip   dipu  dipus
startOffset  0     0     0     7     7     7     7
endOffset    2     3     4     9     10    11    12
type         word  word  word  word  word  word  word

*Query Analyzer*

ant



ant



ant



ant



ant
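
The EdgeNGramFilterFactory rows of the index analysis above can be reproduced with a small model. This is a simplified Python sketch, not Lucene's implementation. It assumes each gram keeps its token's start offset and ends at that start plus the gram length, which is what the offsets in the analysis suggest. The function name `edge_ngrams` is made up for this illustration.

```python
def edge_ngrams(term, start_offset, min_gram=2, max_gram=20):
    """Front-edge n-grams of `term` as (text, startOffset, endOffset) tuples."""
    longest = min(max_gram, len(term))
    return [
        (term[:size], start_offset, start_offset + size)
        for size in range(min_gram, longest + 1)
    ]


# Tokens after LowerCaseFilterFactory, with their start offsets,
# copied from the index analysis.
tokens = [("anti", 0), ("a", 5), ("dipus", 7)]

grams = [gram for term, start in tokens for gram in edge_ngrams(term, start)]
for gram in grams:
    print(gram)
```

The one-character token "a" yields no grams because it is shorter than minGramSize=2, and the query term "ant" corresponds to the second gram of "anti".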



And when I call the search URL:
http://localhost:8983/solr/autocomplete/select/?q=ng%3A%28ant%29

I get the following error stack:
HTTP ERROR 500

Problem accessing /solr/autocomplete/select/. Reason:

    org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token oedipus exceeds length of provided text sized 11

org.apache.solr.common.SolrException: org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token oedipus exceeds length of provided text sized 11
  at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:497)
  at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:401)
  at org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:131)
  at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
  at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1372)
  at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
  at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
  at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
  at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
  at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
  at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
  at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
  at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
  at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
  at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
  at org.mortbay.jetty.Server.handle(Server.java:326)
  at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
  at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
  at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
  at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
  at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
  at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
  at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Caused by: org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token oedipus exceeds length of provided text sized 11
  at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:233)
  at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:490)
  ... 24 more



I am not even sure where the “oedipus” token is coming from. It doesn’t
show up in the analysis. Help please?

Thank you,

Justin

Re: Highlighting error InvalidTokenOffsetsException: Token oedipus exceeds length of provided text sized 11

Posted by Robert Muir <rc...@gmail.com>.
On Fri, Aug 3, 2012 at 12:38 AM, Justin Engelman <ju...@smalldemons.com> wrote:
> I have an autocomplete index that I return highlighting information for but
> am getting an error with certain search strings and fields on Solr 3.5.

try the 3.6 release:

* LUCENE-3642, SOLR-2891, LUCENE-3717: Fixed bugs in CharTokenizer,
  n-gram tokenizers/filters, compound token filters, thai word filter,
  icutokenizer, pattern analyzer, wikipediatokenizer, and smart chinese
  where they would create invalid offsets in some situations, leading to
  problems in highlighting.
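
As an illustration of how an invalid offset can turn into the reported message, here is a simplified Python model. It is one possible reading of the error, not Lucene code, and it rests on three assumptions that the thread does not confirm: the stored value is "Anti-Œdipus" (11 characters), the accent mapping expands "Œ" to "OE", and the n-gram filter computes each end offset as start offset plus gram length. The helper names are made up for this sketch.

```python
def edge_ngrams(term, start_offset, min_gram=2, max_gram=20):
    """Front-edge n-grams with end offset = start offset + gram length."""
    longest = min(max_gram, len(term))
    return [(term[:n], start_offset, start_offset + n)
            for n in range(min_gram, longest + 1)]


def check_offsets(text, tokens):
    """Return a message for the first token that runs past `text`, else None."""
    for term, start, end in tokens:
        if start > len(text) or end > len(text):
            return ("Token %s exceeds length of provided text sized %d"
                    % (term, len(text)))
    return None


stored_text = "Anti-\u0152dipus"  # assumed stored value "Anti-Œdipus"

# Assumed tokens after char mapping (Œ -> OE), tokenizing and lowercasing:
# the term text comes from the expanded string, while the start offsets
# point into the shorter stored text.
tokens = [("anti", 0), ("oedipus", 5)]

grams = [gram for term, start in tokens for gram in edge_ngrams(term, start)]
message = check_offsets(stored_text, grams)
print(message)
```

In this model the gram "oedipus" gets an end offset of 12, one past the end of the 11-character stored text, because its length is measured on the expanded term rather than on the original characters.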

-- 
lucidimagination.com