Posted to dev@lucene.apache.org by "Kiju Kim (JIRA)" <ji...@apache.org> on 2018/10/16 02:01:00 UTC
[jira] [Updated] (LUCENE-8532) nori analyzer issue with trailing space
[ https://issues.apache.org/jira/browse/LUCENE-8532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Kiju Kim updated LUCENE-8532:
-----------------------------
Description:
We can reproduce it from Elasticsearch.
When we run the following command:
GET _analyze
{
"analyzer": "nori",
"text": "공단시"
}
It returns the following as expected:
{
"tokens": [
{
"token": "공단",
"start_offset": 0,
"end_offset": 2,
"type": "word",
"position": 0
},
{
"token": "시",
"start_offset": 2,
"end_offset": 3,
"type": "word",
"position": 1
}
]
}
But if we run it with "공단시 " (with a trailing space):
GET _analyze
{
"analyzer": "nori",
"text": "공단시 "
}
It returns:
{
"tokens": [
{
"token": "공단",
"start_offset": 0,
"end_offset": 2,
"type": "word",
"position": 0
},
{
*"token": "씨",*
"start_offset": 2,
"end_offset": 3,
"type": "word",
"position": 1
}
]
}
The second token should be "시" instead of "씨".
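The comparison above can also be scripted. The following is a minimal sketch, not part of the original report: it assumes an Elasticsearch node with the analysis-nori plugin reachable at http://localhost:9200 (an assumed address), and the helper names analyze and token_mismatches are hypothetical. It sends both inputs to the _analyze API and lists the positions where the tokens differ.

```python
import json
import urllib.request
from itertools import zip_longest


def analyze(text, analyzer="nori", base_url="http://localhost:9200"):
    """Send text to the _analyze API and return the token strings.

    base_url is an assumption: a local node with analysis-nori installed.
    """
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    request = urllib.request.Request(
        base_url + "/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        payload = json.loads(response.read().decode("utf-8"))
    return [entry["token"] for entry in payload["tokens"]]


def token_mismatches(expected, actual):
    """Return (position, expected_token, actual_token) where the lists differ.

    A missing token on either side is reported as None.
    """
    return [
        (position, want, got)
        for position, (want, got) in enumerate(zip_longest(expected, actual))
        if want != got
    ]


if __name__ == "__main__":
    without_space = analyze("공단시")
    with_space = analyze("공단시 ")  # same text plus a trailing space
    # An empty list means the trailing space did not change the tokens.
    print(token_mismatches(without_space, with_space))
```

With the token lists reported above, token_mismatches(["공단", "시"], ["공단", "씨"]) gives [(1, "시", "씨")], which is the discrepancy this issue describes.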
was:
We can reproduce it from Elasticsearch.
When we run the following command:
GET _analyze
{
"analyzer": "nori",
"text": "공단시"
}
It returns the following as expected:
{
"tokens": [
{
"token": "공단",
"start_offset": 0,
"end_offset": 2,
"type": "word",
"position": 0
},
{
"token": "시",
"start_offset": 2,
"end_offset": 3,
"type": "word",
"position": 1
}
]
}
But if we run with "공단시 " (with a trailing space)
GET _analyze
{
"analyzer": "nori",
"text": "공단시 "
}
It returns
{
"tokens": [
{
"token": "공단",
"start_offset": 0,
"end_offset": 2,
"type": "word",
"position": 0
},
{
*"token": "씨",*
"start_offset": 2,
"end_offset": 3,
"type": "word",
"position": 1
}
]
}
The second token should be " 시" instead of "씨".
> nori analyzer issue with trailing space
> ---------------------------------------
>
> Key: LUCENE-8532
> URL: https://issues.apache.org/jira/browse/LUCENE-8532
> Project: Lucene - Core
> Issue Type: Bug
> Components: modules/analysis
> Affects Versions: 7.4
> Environment: Elasticsearch version: 6.4.2, Build: default/tar/04711c2/2018-09-26T13:34:09.098244Z, JVM: 1.8.0_131
> Plugins installed: [analysis-nori]
> JVM version:
> java version "1.8.0_131"
> Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
> Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)
> OS version: Darwin Kijuui-MacBook-Pro.local 17.7.0 Darwin Kernel Version 17.7.0: Thu Jun 21 22:53:14 PDT 2018; root:xnu-4570.71.2~1/RELEASE_X86_64 x86_64
> Reporter: Kiju Kim
> Priority: Major