Posted to dev@lucene.apache.org by "Tomoko Uchida (JIRA)" <ji...@apache.org> on 2019/07/24 15:16:00 UTC
[jira] [Comment Edited] (LUCENE-8933) JapaneseTokenizer creates Token objects with corrupt offsets
[ https://issues.apache.org/jira/browse/LUCENE-8933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16891940#comment-16891940 ]
Tomoko Uchida edited comment on LUCENE-8933 at 7/24/19 3:15 PM:
----------------------------------------------------------------
I once encountered similar errors, but they were not related to synonyms. I suspect the input might contain surrogate pair characters (perhaps emoji)?
If the whole analysis chain and, ideally, the text string that causes the error are provided, I will look into it. (Otherwise, the error seems difficult to reproduce...)
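To make the surrogate pair guess above concrete, here is a minimal sketch using only the JDK (no Lucene). It shows that a character outside the Basic Multilingual Plane, such as an emoji, occupies two Java chars, so an offset counted in code points no longer matches an offset into a char buffer. The class name, helper names, and sample text are hypothetical, and this does not confirm that surrogate pairs are the cause of this issue.

```java
class SurrogatePairOffsets {
    // Number of UTF-16 code units; this is what char[] offsets count.
    static int charLength(String s) {
        return s.length();
    }

    // Number of Unicode code points; a surrogate pair counts once.
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    // Translate an index counted in code points into a char offset.
    static int charOffsetOf(String s, int codePointIndex) {
        return s.offsetByCodePoints(0, codePointIndex);
    }

    public static void main(String[] args) {
        // Hypothetical sample: four BMP characters around U+1F363 (an emoji),
        // which is encoded as the surrogate pair \uD83C\uDF63.
        String text = "\u5BFF\u53F8\uD83C\uDF63\u304C\u597D\u304D";

        System.out.println("chars:       " + charLength(text));
        System.out.println("code points: " + codePointLength(text));
        // The character after the emoji is code point 3 but starts at char 4.
        System.out.println("char offset of code point 3: " + charOffsetOf(text, 3));
        System.out.println("char 2 is a high surrogate:  "
                + Character.isHighSurrogate(text.charAt(2)));
    }
}
```

If any step of an analysis chain mixes the two ways of counting, every offset after the first surrogate pair is shifted by one, which is one way an offset could end up outside the expected range.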
> JapaneseTokenizer creates Token objects with corrupt offsets
> ------------------------------------------------------------
>
> Key: LUCENE-8933
> URL: https://issues.apache.org/jira/browse/LUCENE-8933
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Adrien Grand
> Priority: Minor
>
> An Elasticsearch user reported the following stack trace when parsing synonyms. It looks like the only way this can occur is if the offset of a {{org.apache.lucene.analysis.ja.Token}} falls outside the expected range.
>
> {noformat}
> Caused by: java.lang.ArrayIndexOutOfBoundsException
> at org.apache.lucene.analysis.tokenattributes.CharTermAttributeImpl.copyBuffer(CharTermAttributeImpl.java:44) ~[lucene-core-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:20]
> at org.apache.lucene.analysis.ja.JapaneseTokenizer.incrementToken(JapaneseTokenizer.java:486) ~[?:?]
> at org.apache.lucene.analysis.synonym.SynonymMap$Parser.analyze(SynonymMap.java:318) ~[lucene-analyzers-common-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
> at org.elasticsearch.index.analysis.ESSolrSynonymParser.analyze(ESSolrSynonymParser.java:57) ~[elasticsearch-6.6.1.jar:6.6.1]
> at org.apache.lucene.analysis.synonym.SolrSynonymParser.addInternal(SolrSynonymParser.java:114) ~[lucene-analyzers-common-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
> at org.apache.lucene.analysis.synonym.SolrSynonymParser.parse(SolrSynonymParser.java:70) ~[lucene-analyzers-common-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:48]
> at org.elasticsearch.index.analysis.SynonymTokenFilterFactory.buildSynonyms(SynonymTokenFilterFactory.java:154) ~[elasticsearch-6.6.1.jar:6.6.1]
> ... 24 more
> {noformat}
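The trace above ends in copyBuffer, which receives a buffer, an offset, and a length. As a rough illustration of the failure mode the issue describes, the sketch below copies a slice with System.arraycopy and shows that an offset/length pair outside the source array fails with an index exception. It assumes the real copy is bounds-checked in a similar way; copySlice and failsWithIndexError are hypothetical helpers, not Lucene API.

```java
class OffsetBoundsCheck {
    // Hypothetical stand-in for a "copy this slice of the buffer" call.
    // Assumption: the real code performs a bounds-checked array copy.
    static char[] copySlice(char[] src, int offset, int length) {
        char[] dst = new char[length];
        System.arraycopy(src, offset, dst, 0, length);
        return dst;
    }

    // Returns true when the slice cannot be copied because
    // offset/length do not lie within the source array.
    static boolean failsWithIndexError(char[] src, int offset, int length) {
        try {
            copySlice(src, offset, length);
            return false;
        } catch (IndexOutOfBoundsException e) {
            // ArrayIndexOutOfBoundsException is a subclass of this type.
            return true;
        }
    }

    public static void main(String[] args) {
        char[] buffer = "abc".toCharArray();
        System.out.println("valid slice fails:      " + failsWithIndexError(buffer, 0, 3));
        System.out.println("offset too large fails: " + failsWithIndexError(buffer, 2, 3));
        System.out.println("negative offset fails:  " + failsWithIndexError(buffer, -1, 1));
    }
}
```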