Posted to dev@lucene.apache.org by "Steve Rowe (JIRA)" <ji...@apache.org> on 2018/10/05 19:54:00 UTC

[jira] [Comment Edited] (LUCENE-8526) StandardTokenizer doesn't separate hangul characters from other non-CJK chars

    [ https://issues.apache.org/jira/browse/LUCENE-8526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16640279#comment-16640279 ] 

Steve Rowe edited comment on LUCENE-8526 at 10/5/18 7:53 PM:
-------------------------------------------------------------

bq. We can maybe add a note in the CJKBigram filter regarding this behavior when the StandardTokenizer is used?

+1

How's this, to be added to the CJKBigramFilter class javadoc:

{noformat}
 * <p>
 * Unlike ICUTokenizer, StandardTokenizer does not split at script boundaries.
 * Korean Hangul characters are treated the same as many other scripts'
 * letters, and as a result, StandardTokenizer can produce tokens that mix
 * Hangul and non-Hangul characters, e.g. "한국abc".  Such mixed-script tokens
 * are typed as <code>&lt;ALPHANUM&gt;</code> rather than
 * <code>&lt;HANGUL&gt;</code>, and as a result, will not be converted to 
 * bigrams by CJKBigramFilter. 
{noformat}
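
To make the documented behavior concrete, here's a minimal sketch (the class name and inputs are illustrative; it assumes lucene-core and lucene-analyzers-common on the classpath) that runs StandardTokenizer output through CJKBigramFilter and prints each token with its type:

{noformat}
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKBigramFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

public class HangulBigramDemo {
  public static void main(String[] args) throws IOException {
    // A pure-Hangul token next to a mixed Hangul/Latin token.
    StandardTokenizer tokenizer = new StandardTokenizer();
    tokenizer.setReader(new StringReader("한국어 한국abc"));

    // By default CJKBigramFilter bigrams tokens typed <IDEOGRAPHIC>,
    // <HIRAGANA>, <KATAKANA>, and <HANGUL>.
    TokenStream stream = new CJKBigramFilter(tokenizer);
    CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
    TypeAttribute type = stream.addAttribute(TypeAttribute.class);

    stream.reset();
    while (stream.incrementToken()) {
      System.out.println(term + " " + type.type());
    }
    stream.end();
    stream.close();
  }
}
{noformat}

If I have it right, "한국어" comes out of StandardTokenizer typed <HANGUL> and is bigrammed into "한국" / "국어" (typed <DOUBLE>), while "한국abc" comes out as a single <ALPHANUM> token and passes through the filter unchanged.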



> StandardTokenizer doesn't separate hangul characters from other non-CJK chars
> -----------------------------------------------------------------------------
>
>                 Key: LUCENE-8526
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8526
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Jim Ferenczi
>            Priority: Minor
>
> It was first reported here https://github.com/elastic/elasticsearch/issues/34285.
> I don't know if it's the expected behavior, but the StandardTokenizer does not split words
> that are composed of a mix of non-CJK characters and Hangul syllables. For instance, "한국2018" or "한국abc" is kept as-is by this tokenizer and marked as an alphanumeric group. This breaks the CJKBigram token filter, which will not build bigrams on such groups. The other CJK characters are correctly split when they are mixed with another alphabet, so I'd expect the same for Hangul (see the ICUTokenizer sketch below for the contrast).
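
For contrast, here's a similar sketch with ICUTokenizer (illustrative class name; assumes the lucene-analyzers-icu module on the classpath), which does split at the script boundary, so the Hangul part keeps its <HANGUL> type and remains bigrammable downstream:

{noformat}
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

public class IcuScriptSplitDemo {
  public static void main(String[] args) throws IOException {
    // ICUTokenizer segments at script boundaries, unlike StandardTokenizer.
    ICUTokenizer tokenizer = new ICUTokenizer();
    tokenizer.setReader(new StringReader("한국abc"));
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    TypeAttribute type = tokenizer.addAttribute(TypeAttribute.class);

    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      System.out.println(term + " " + type.type());
    }
    tokenizer.end();
    tokenizer.close();
  }
}
{noformat}

Expected output, as I understand the default ICU config: "한국" typed <HANGUL> followed by "abc" typed <ALPHANUM>, which is why CJKBigramFilter behaves as expected behind ICUTokenizer but not behind StandardTokenizer.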


