Posted to dev@lucene.apache.org by "Jim Ferenczi (JIRA)" <ji...@apache.org> on 2018/05/22 07:52:00 UTC
[jira] [Commented] (LUCENE-8325) smartcn analyzer can't handle
SURROGATE char
[ https://issues.apache.org/jira/browse/LUCENE-8325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16483610#comment-16483610 ]
Jim Ferenczi commented on LUCENE-8325:
--------------------------------------
Thanks [~chengpohi], the patch looks good to me. The dictionary covers Simplified Chinese characters only, but I agree that the tokenizer should not split surrogate pairs into multiple tokens. I don't know the smartcn analyzer well enough, though, so I'd like someone else to double-check the patch, maybe [~rcmuir]?
> smartcn analyzer can't handle SURROGATE char
> --------------------------------------------
>
> Key: LUCENE-8325
> URL: https://issues.apache.org/jira/browse/LUCENE-8325
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: chengpohi
> Priority: Minor
> Labels: newbie, patch
> Attachments: handle-surrogate-char-for-smartcn.patch
>
>
> This issue is from [https://github.com/elastic/elasticsearch/issues/30739]
> The smartcn analyzer can't handle SURROGATE chars. Example:
>
>
> {code:java}
> Analyzer ca = new SmartChineseAnalyzer();
> String sentence = "\uD862\uDE0F"; // 𨨏 (U+28A0F), encoded as a surrogate pair
> TokenStream tokenStream = ca.tokenStream("", sentence);
> CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
> tokenStream.reset();
> while (tokenStream.incrementToken()) {
>   String term = charTermAttribute.toString();
>   System.out.println(term); // each half of the pair comes out as its own token
> }
> tokenStream.end();
> tokenStream.close();
> {code}
>
> The code snippet above outputs:
>
> {code:java}
> ?
> ?
> {code}
>
> I have created a *PATCH* to try to fix this; please help review it. Since *smartcn* only supports *GBK* characters, the patch simply handles a surrogate pair as a *single char*.
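
For readers unfamiliar with surrogate pairs, the behaviour reported above can be illustrated without Lucene. The sketch below is not the attached patch and does not use smartcn's segmentation code; the class and method names (SurrogateSplitDemo, splitByChar, splitByCodePoint) are hypothetical and exist only for this example. It assumes the splitting arises from walking the text one UTF-16 code unit at a time, and shows how walking by code point keeps the pair together.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Lucene or the attached patch.
public class SurrogateSplitDemo {

    // Naive split: one token per UTF-16 code unit, which breaks surrogate pairs apart.
    public static List<String> splitByChar(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            tokens.add(String.valueOf(text.charAt(i)));
        }
        return tokens;
    }

    // Code-point-aware split: a surrogate pair stays together in a single token.
    public static List<String> splitByCodePoint(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int codePoint = text.codePointAt(i);
            int width = Character.charCount(codePoint); // 2 for supplementary characters, else 1
            tokens.add(text.substring(i, i + width));
            i += width;
        }
        return tokens;
    }

    public static void main(String[] args) {
        String sentence = "\uD862\uDE0F"; // the same input as in the report
        System.out.println(splitByChar(sentence).size());      // prints 2
        System.out.println(splitByCodePoint(sentence).size()); // prints 1
    }
}
```

Each element returned by splitByChar is an unpaired surrogate, which is not a valid character on its own; a console will typically render it as "?", which would match the two "?" lines in the output above.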