Posted to issues@lucene.apache.org by "Jim Ferenczi (Jira)" <ji...@apache.org> on 2021/08/31 21:37:00 UTC

[jira] [Created] (LUCENE-10081) KoreanTokenizer should check the max backtrace gap on whitespaces

Jim Ferenczi created LUCENE-10081:
-------------------------------------

             Summary: KoreanTokenizer should check the max backtrace gap on whitespaces
                 Key: LUCENE-10081
                 URL: https://issues.apache.org/jira/browse/LUCENE-10081
             Project: Lucene - Core
          Issue Type: Bug
            Reporter: Jim Ferenczi


Today the KoreanTokenizer keeps track of the whitespace characters that appear before a known term in order to apply a space penalty factor. Because this whitespace is considered part of the next term, the backtrace gap limit is not applied to it.
As a result, the position buffer can grow up to the maximum number of consecutive whitespace characters in the input. This is problematic because the buffer is reused on reset(), so we should ensure that the max backtrace gap limit is also applied consistently to consecutive whitespace.
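
The effect described above can be illustrated with a toy model. This is only a sketch and not the KoreanTokenizer code: the class name, the peakBuffered method, the MAX_GAP value and the "reset the counter" stand-in for a forced backtrace are all assumptions made for illustration. It compares the peak number of buffered positions when the gap check skips whitespace with the peak when the check also covers whitespace.

```java
// Toy model only: names and values here are hypothetical, not Lucene's.
final class WhitespaceGapSketch {
    // Illustrative gap limit; the real tokenizer's limit is not assumed here.
    static final int MAX_GAP = 4;

    /**
     * Scans the input one character at a time and returns the peak number
     * of positions held in the (modelled) buffer.
     *
     * @param checkGapOnWhitespace when false, whitespace never triggers the
     *        gap check, which models the behaviour reported in this issue
     */
    static int peakBuffered(String input, boolean checkGapOnWhitespace) {
        int buffered = 0;
        int peak = 0;
        for (int i = 0; i < input.length(); i++) {
            boolean isSpace = Character.isWhitespace(input.charAt(i));
            buffered++;
            peak = Math.max(peak, buffered);
            // Apply the gap limit on non-whitespace always, and on
            // whitespace only when the flag asks for it.
            if ((!isSpace || checkGapOnWhitespace) && buffered >= MAX_GAP) {
                buffered = 0; // stands in for a forced backtrace freeing positions
            }
        }
        return peak;
    }

    public static void main(String[] args) {
        // One Hangul syllable, 100 spaces, one Hangul syllable.
        String input = "\uAC00" + " ".repeat(100) + "\uB098";
        System.out.println("gap check skips whitespace:  " + peakBuffered(input, false));
        System.out.println("gap check covers whitespace: " + peakBuffered(input, true));
    }
}
```

In this model the first call lets the buffer grow with the length of the whitespace run, while the second keeps the peak bounded by MAX_GAP regardless of the input.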



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
