Posted to dev@lucene.apache.org by "Christian Moen (Jira)" <ji...@apache.org> on 2019/08/29 10:04:00 UTC

[jira] [Commented] (LUCENE-8959) JapaneseNumberFilter does not take whitespaces into account when concatenating numbers

    [ https://issues.apache.org/jira/browse/LUCENE-8959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16918464#comment-16918464 ] 

Christian Moen commented on LUCENE-8959:
----------------------------------------

Sounds like a good idea.  This is also a rather big rabbit hole...

Would it be useful to consider making the digit grouping separators configurable as part of a bigger scheme here?

In Japanese, if you're processing text with SI numbers, I believe a space is a valid digit grouping separator.
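
To make the idea concrete, here is a minimal sketch of configurable digit grouping separators combined with the concatenation limit proposed in the issue description. It is not Lucene code and does not use the TokenStream API: the class DigitGroupJoiner, its join method, and the maxParts parameter are hypothetical names chosen for illustration, and it only handles ASCII digits, whereas the real filter also deals with kanji numerals.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration only; not part of Lucene. */
final class DigitGroupJoiner {
    private final Set<String> separators; // tokens allowed between digit groups
    private final int maxParts;           // cap on digit groups merged into one token

    DigitGroupJoiner(Set<String> separators, int maxParts) {
        if (maxParts < 1) {
            throw new IllegalArgumentException("maxParts must be >= 1");
        }
        this.separators = new HashSet<>(separators);
        this.maxParts = maxParts;
    }

    /** Merges runs of "number (separator number)*", up to maxParts numbers per run. */
    List<String> join(List<String> tokens) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            String token = tokens.get(i);
            if (!isNumber(token)) {
                out.add(token); // non-numeric tokens pass through unchanged
                i++;
                continue;
            }
            StringBuilder merged = new StringBuilder(token);
            int parts = 1;
            int next = i + 1;
            // Consume "separator, number" pairs while the limit allows it.
            while (parts < maxParts
                    && next + 1 < tokens.size()
                    && separators.contains(tokens.get(next))
                    && isNumber(tokens.get(next + 1))) {
                merged.append(tokens.get(next + 1));
                parts++;
                next += 2;
            }
            out.add(merged.toString());
            i = next;
        }
        return out;
    }

    // ASCII digits only; an assumption made to keep the sketch short.
    private static boolean isNumber(String s) {
        return !s.isEmpty() && s.chars().allMatch(c -> c >= '0' && c <= '9');
    }
}
```

With separators {",", " "} the tokens ["10", " ", "100"] are merged into ["10100"]; with separators {","} only, they are left as three tokens. With maxParts set to 2, the tokens ["1", " ", "2", " ", "3"] become ["12", " ", "3"], so a long run of space-separated numbers can no longer grow into one giant token.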

> JapaneseNumberFilter does not take whitespaces into account when concatenating numbers
> --------------------------------------------------------------------------------------
>
>                 Key: LUCENE-8959
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8959
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Jim Ferenczi
>            Priority: Minor
>
> Today the JapaneseNumberFilter tries to concatenate numbers even if they are separated by whitespace. So for instance "10 100" is rewritten into "10100" even if the tokenizer doesn't discard punctuation. In practice this is not an issue, but it can lead to a giant number of tokens if there are a lot of numbers separated by spaces. The number of concatenations should be configurable, with a sane default limit, in order to avoid creating big tokens that slow down the analysis.


