Posted to dev@lucene.apache.org by "Edwin Steiner (Created) (JIRA)" <ji...@apache.org> on 2011/11/11 08:46:51 UTC

[jira] [Created] (SOLR-2891) InvalidTokenOffsetsException when using MappingCharFilterFactory, DictionaryCompoundWordTokenFilterFactory and Highlighting

InvalidTokenOffsetsException when using MappingCharFilterFactory, DictionaryCompoundWordTokenFilterFactory and Highlighting
---------------------------------------------------------------------------------------------------------------------------

                 Key: SOLR-2891
                 URL: https://issues.apache.org/jira/browse/SOLR-2891
             Project: Solr
          Issue Type: Bug
          Components: highlighter, Schema and Analysis, search
    Affects Versions: 3.4, 3.1
         Environment: MacOS X, Java 1.6, Tomcat 7
            Reporter: Edwin Steiner
            Priority: Critical


I would like to handle German umlauts (Umlaute) by replacing each accented character with its two-letter substitute (e.g. ä => ae). For this I use the char filter solr.MappingCharFilterFactory, configured with a mapping file containing entries like "ä" => "ae". I also want to use solr.DictionaryCompoundWordTokenFilterFactory to find words that are part of compound words (e.g. revision in totalrevision). Finally, I want to use Solr highlighting. However, combining the char filter and the compound word filter with highlighting raises an org.apache.lucene.search.highlight.InvalidTokenOffsetsException.

Here are the details:

types:
--------
    <fieldType name="textAnalyzedFailed" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="words.txt"/>
      </analyzer>
    </fieldType>

schema:
-----------
  <fields>
     <field name="id"         type="string"               indexed="true" stored="true" required="true" /> 
     <field name="title"      type="textAnalyzedFailed"   indexed="true" stored="true"/>
  </fields>

document:
--------------
  <doc>
     <field name="id">1</field> 
     <field name="title">banküberfall</field> 
  </doc>

mapping.txt:
-----------------
"ü" => "ue"

words.txt:
--------------
fall

Searching with the following URL produces the error below:

http://localhost:8080/solr/select/?q=banküberfall&hl=true&hl.fl=title

Nov 4, 2011 4:29:12 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/select/ params={q=bank?berfall&hl.fl=title_hl&hl=true} hits=1 status=0 QTime=13 
Nov 4, 2011 4:29:16 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token fall exceeds length of provided text sized 12
	at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:469)
	at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:378)
	at org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:116)
	at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
	at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
	at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
	at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:224)
	at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
	at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:462)
	at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:164)
	at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:100)
	at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:851)
	at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
	at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:405)
	at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:278)
	at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:515)
	at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:302)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:680)
Caused by: org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token fall exceeds length of provided text sized 12
	at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:228)
	at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:462)
	... 23 more
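
The message in the trace suggests that the highlighter compares each token's offsets against the length of the stored field value. A minimal sketch of that kind of check follows, written in Python for brevity. The function name check_token_offsets is hypothetical, and the comparison is an assumption inferred from the message text, not copied from Highlighter.java:

```python
def check_token_offsets(text, term, start_offset, end_offset):
    """Reject a token whose offsets point past the end of the stored text.

    Assumption: this mirrors the kind of bounds check that produces the
    "exceeds length of provided text" message; it is not Lucene's code.
    """
    if end_offset > len(text) or start_offset > len(text):
        raise ValueError(
            "Token %s exceeds length of provided text sized %d"
            % (term, len(text))
        )


# The stored (unfiltered) field value from the report: 12 characters.
stored = "bank\u00fcberfall"

# The whole-word token reported by the analysis tool fits the stored text.
check_token_offsets(stored, "bankueberfall", 0, 12)
```

With the stored value being 12 characters long, a token "fall" with endOffset 13 fails such a check, while the whole-word token with offsets 0 to 12 passes.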



The analysis tool says the following for field name=title, field value=banküberfall:
------------------------------------------------------------------------------------
Index Analyzer
org.apache.solr.analysis.MappingCharFilterFactory {mapping=mapping.txt, luceneMatchVersion=LUCENE_31}
text	bankueberfall
org.apache.solr.analysis.WhitespaceTokenizerFactory {luceneMatchVersion=LUCENE_31}
position	1
term text	bankueberfall
startOffset	0
endOffset	12
org.apache.solr.analysis.DictionaryCompoundWordTokenFilterFactory {dictionary=words.txt, luceneMatchVersion=LUCENE_31}
term text       position  startOffset  endOffset  flags  type  payload
bankueberfall   1         0            12         0      word
fall            1         9            13         0      word
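
The offsets above can be reproduced with a simplified model, sketched here in Python. This is not Lucene's implementation: apply_mapping, tokenizer_offsets and naive_subword_offsets are hypothetical helpers, and the model assumes that the tokenizer maps offsets back to the original text while the compound filter simply adds the subword's index in the filtered term to the token's start offset.

```python
def apply_mapping(text, mapping):
    """Replace single characters per `mapping`.

    Returns (filtered, origin), where origin[i] is the index in `text`
    that position i of the filtered string maps back to. origin has one
    extra entry for the end-of-text position.
    """
    out, origin = [], []
    for i, ch in enumerate(text):
        for replacement_char in mapping.get(ch, ch):
            out.append(replacement_char)
            origin.append(i)
    origin.append(len(text))
    return "".join(out), origin


def tokenizer_offsets(filtered, origin):
    """Offsets of a single whitespace-free token, corrected to the original text."""
    return origin[0], origin[len(filtered)]


def naive_subword_offsets(token_start, term, subword):
    """Assumed buggy behaviour: index in the filtered term added to the start offset."""
    index = term.find(subword)
    start = token_start + index
    return start, start + len(subword)


original = "bank\u00fcberfall"                       # 12 characters
filtered, origin = apply_mapping(original, {"\u00fc": "ue"})
token_start, token_end = tokenizer_offsets(filtered, origin)
subword = naive_subword_offsets(token_start, filtered, "fall")
```

Under these assumptions the filtered term "bankueberfall" has 13 characters but keeps the corrected offsets 0 to 12, and "fall" is found at index 9 of the filtered term, giving offsets 9 to 13. In the original 12-character text, "fall" actually spans 8 to 12.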


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] [Assigned] (SOLR-2891) InvalidTokenOffsetsException when using MappingCharFilterFactory, DictionaryCompoundWordTokenFilterFactory and Highlighting

Posted by "Robert Muir (Assigned) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir reassigned SOLR-2891:
---------------------------------

    Assignee: Robert Muir
    



[jira] [Commented] (SOLR-2891) InvalidTokenOffsetsException when using MappingCharFilterFactory, DictionaryCompoundWordTokenFilterFactory and Highlighting

Posted by "Vadim Kisselmann (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/SOLR-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13148433#comment-13148433 ] 

Vadim Kisselmann commented on SOLR-2891:
----------------------------------------

It's an old bug. I also have big problems with offset exceptions when I use
highlighting or Carrot.
It looks like a problem with HTMLStripCharFilter.
The patch doesn't work.

https://issues.apache.org/jira/browse/LUCENE-2208
                



[jira] [Updated] (SOLR-2891) InvalidTokenOffsetsException when using MappingCharFilterFactory, DictionaryCompoundWordTokenFilterFactory and Highlighting

Posted by "Robert Muir (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated SOLR-2891:
------------------------------

    Attachment: SOLR-2891.patch

The problem is that CompoundWordTokenFilter has the same bug as LUCENE-3642. There was a note in the source code (I think added by Uwe):
{code}
// TODO: This ignores the original endOffset, if a CharFilter/Tokenizer/Filter removed
// chars from the term, offsets may not match correctly (other filters producing tokens
// may also have this problem):
{code}

Edwin: thanks for providing such good information. I turned this into a test and fixed it the same way as LUCENE-3642.
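
For readers without the patch at hand, one plausible shape of such a fix is sketched below in Python. This is an assumption, not the contents of SOLR-2891.patch: the idea is to fall back to the original token's offsets whenever the term length no longer matches the span given by its offsets, which indicates that an earlier stage changed the text. The function name subword_offsets is hypothetical.

```python
def subword_offsets(token_start, token_end, term, sub_index, sub_length):
    """Offsets for a subword found at term[sub_index:sub_index + sub_length].

    Assumed strategy: if the term length does not match the offset span,
    the term was modified upstream (for example by a char filter), so
    per-subword offsets cannot be trusted and the whole token's offsets
    are reused instead.
    """
    if token_end - token_start != len(term):
        return token_start, token_end
    start = token_start + sub_index
    return start, start + sub_length


# Case from the report: 13-character term, offsets 0..12 -> fall back to 0..12.
modified = subword_offsets(0, 12, "bankueberfall", 9, 4)

# Unmodified term whose length matches its offsets -> precise subword offsets.
unmodified = subword_offsets(0, 13, "bankueberfall", 9, 4)
```

The trade-off in this sketch is precision: with a modified term, highlighting would cover the whole compound word rather than only the subword, but the offsets never point past the stored text.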

                



[jira] [Resolved] (SOLR-2891) InvalidTokenOffsetsException when using MappingCharFilterFactory, DictionaryCompoundWordTokenFilterFactory and Highlighting

Posted by "Robert Muir (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/SOLR-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir resolved SOLR-2891.
-------------------------------

       Resolution: Fixed
    Fix Version/s: 4.0
                   3.6

Thanks Edwin!
                
