Posted to issues@lucene.apache.org by "Christoph Büscher (Jira)" <ji...@apache.org> on 2019/12/10 12:45:00 UTC

[jira] [Created] (LUCENE-9088) JapaneseNumberFilter uses inaccurate PartOfSpeechAttribute

Christoph Büscher created LUCENE-9088:
-----------------------------------------

             Summary: JapaneseNumberFilter uses inaccurate PartOfSpeechAttribute
                 Key: LUCENE-9088
                 URL: https://issues.apache.org/jira/browse/LUCENE-9088
             Project: Lucene - Core
          Issue Type: Bug
          Components: modules/analysis
            Reporter: Christoph Büscher


According to the JapaneseNumberFilter javadocs, the filter uses the attribute values of the last token used to compose the normalized number, which can be wrong. Although this is documented, it leads to a number of incompatibilities with other Japanese token filters.

For example, for an input text of "2008 2009", taking the PartOfSpeechAttribute from the last token used leads to the following output (some attributes left out...):

```
{
  "token" : "2008",
  "start_offset" : 0,
  "end_offset" : 4,
  "type" : "word",
  [...]
  "partOfSpeech" : "記号-空白",
  "partOfSpeech (en)" : "symbol-space"
  [...]
},
{
  "token" : " ",
  "start_offset" : 4,
  "end_offset" : 5,
  "type" : "word",
  [...]
  "partOfSpeech" : "記号-空白",
  "partOfSpeech (en)" : "symbol-space",
  [...]
},
{
  "token" : "2009",
  "start_offset" : 5,
  "end_offset" : 9,
  "type" : "word",
  [...]
  "partOfSpeech" : "名詞-数",
  "partOfSpeech (en)" : "noun-numeric",
}
```

so that, e.g., a subsequent `kuromoji_part_of_speech` filter will eliminate the "2008" token, which is erroneously tagged as "symbol-space".

Even without fixing the other token attributes, the POS attribute should IMHO be set to "noun-numeric", since that's what the filter is supposed to detect.
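
To make the effect concrete, here is a minimal stand-alone sketch in plain Java. It is not Lucene code: `NumberPosSketch`, `Token`, `composeNumbers`, and `dropByPos` are hypothetical names, and the model assumes the emitted number token inherits the POS of whichever token the filter inspected last (here, the look-ahead token that ended the digit run). It compares that behavior with always assigning "名詞-数" (noun-numeric), followed by a stand-in for a part-of-speech stop filter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NumberPosSketch {
    public static final String NOUN_NUMERIC = "名詞-数";
    public static final String SYMBOL_SPACE = "記号-空白";

    // Hypothetical stand-in for a token with a part-of-speech attribute.
    public static final class Token {
        public final String term;
        public final String pos;
        public Token(String term, String pos) { this.term = term; this.pos = pos; }
    }

    static boolean isNumeric(String term) {
        return !term.isEmpty() && term.chars().allMatch(Character::isDigit);
    }

    // Simplified model of number composition (an assumption, not Lucene's logic).
    // forceNumericPos=false: the number token takes the POS of the last token inspected.
    // forceNumericPos=true: the number token is always tagged noun-numeric.
    public static List<Token> composeNumbers(List<Token> in, boolean forceNumericPos) {
        List<Token> out = new ArrayList<>();
        int i = 0;
        while (i < in.size()) {
            Token t = in.get(i);
            if (!isNumeric(t.term)) { out.add(t); i++; continue; }
            StringBuilder number = new StringBuilder();
            String lastPos = t.pos;
            while (i < in.size() && isNumeric(in.get(i).term)) {
                number.append(in.get(i).term);
                lastPos = in.get(i).pos;
                i++;
            }
            // The look-ahead token that ended the run was inspected last; its POS leaks.
            if (i < in.size()) lastPos = in.get(i).pos;
            out.add(new Token(number.toString(), forceNumericPos ? NOUN_NUMERIC : lastPos));
        }
        return out;
    }

    // Stand-in for a part-of-speech stop filter: drops tokens with the given POS tag.
    public static List<String> dropByPos(List<Token> in, String stopPos) {
        List<String> terms = new ArrayList<>();
        for (Token t : in) if (!t.pos.equals(stopPos)) terms.add(t.term);
        return terms;
    }

    public static List<Token> sampleInput() {
        return Arrays.asList(new Token("2008", NOUN_NUMERIC),
                             new Token(" ", SYMBOL_SPACE),
                             new Token("2009", NOUN_NUMERIC));
    }

    public static void main(String[] args) {
        System.out.println(dropByPos(composeNumbers(sampleInput(), false), SYMBOL_SPACE));
        System.out.println(dropByPos(composeNumbers(sampleInput(), true), SYMBOL_SPACE));
    }
}
```

Under these assumptions, the first call loses "2008" once the symbol-space tokens are dropped, while the second keeps both numbers.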


