Posted to dev@lucene.apache.org by "Kiju Kim (JIRA)" <ji...@apache.org> on 2018/10/16 01:56:00 UTC

[jira] [Created] (LUCENE-8532) nori analyzer issue with trailing space

Kiju Kim created LUCENE-8532:
--------------------------------

             Summary: nori analyzer issue with trailing space
                 Key: LUCENE-8532
                 URL: https://issues.apache.org/jira/browse/LUCENE-8532
             Project: Lucene - Core
          Issue Type: Bug
          Components: modules/analysis
    Affects Versions: 7.4
         Environment: Elasticsearch version: 6.4.2, Build: default/tar/04711c2/2018-09-26T13:34:09.098244Z, JVM: 1.8.0_131

Plugins installed: [analysis-nori]

JVM version:
java version "1.8.0_131"
Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)


OS version: Darwin Kijuui-MacBook-Pro.local 17.7.0 Darwin Kernel Version 17.7.0: Thu Jun 21 22:53:14 PDT 2018; root:xnu-4570.71.2~1/RELEASE_X86_64 x86_64
            Reporter: Kiju Kim


We can reproduce it from Elasticsearch.

When we run the following command:

GET _analyze
{
  "analyzer": "nori",
  "text": "공단시"
}

It returns the following as expected:

{
  "tokens": [
    {
      "token": "공단",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
    {
      "token": "시",
      "start_offset": 2,
      "end_offset": 3,
      "type": "word",
      "position": 1
    }
  ]
}

But if we run it with "공단시 " (note the trailing space):

GET _analyze
{
  "analyzer": "nori",
  "text": "공단시 "
}

It returns:

{
  "tokens": [
    {
      "token": "공단",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
    {
      "token": "씨",
      "start_offset": 2,
      "end_offset": 3,
      "type": "word",
      "position": 1
    }
  ]
}

The second token should be "시" instead of "씨"; the trailing space should not change the output.
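
For anyone who wants to check other inputs for the same behaviour, here is a minimal Python sketch that sends a text to _analyze with and without a trailing space and reports whether the tokens differ. It is only an illustration, not part of the original report: the URL http://localhost:9200 and the helper names (build_request, analyze, token_texts, trailing_space_changes_tokens) are assumptions, and it presumes a node with the analysis-nori plugin installed.

```python
import json
import urllib.request

# Assumption: a local Elasticsearch node with the analysis-nori plugin installed.
ES_URL = "http://localhost:9200/_analyze"


def build_request(text):
    """Return the JSON body for an _analyze call using the nori analyzer."""
    return {"analyzer": "nori", "text": text}


def token_texts(response):
    """Extract just the token strings from a parsed _analyze response."""
    return [t["token"] for t in response["tokens"]]


def analyze(text, url=ES_URL):
    """POST the text to _analyze and return the list of token strings."""
    body = json.dumps(build_request(text)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return token_texts(json.load(resp))


def trailing_space_changes_tokens(analyze_fn, text):
    """Analyze text with and without one trailing space.

    Returns None when both token lists match, otherwise the pair
    (tokens_without_space, tokens_with_space).
    """
    plain = analyze_fn(text)
    padded = analyze_fn(text + " ")
    return None if plain == padded else (plain, padded)
```

Calling trailing_space_changes_tokens(analyze, "공단시") against an affected node should, going by the responses quoted above, return the two differing token lists; on a node without the bug it should return None. The analyze_fn parameter is there so the comparison can also be run against saved responses without a live cluster.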


