Posted to issues@lucene.apache.org by "Trey Jones (Jira)" <ji...@apache.org> on 2021/02/12 20:45:00 UTC

[jira] [Comment Edited] (LUCENE-9754) ICU Tokenizer: letter-space-number-letter tokenized inconsistently

    [ https://issues.apache.org/jira/browse/LUCENE-9754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17283971#comment-17283971 ] 

Trey Jones edited comment on LUCENE-9754 at 2/12/21, 8:44 PM:
--------------------------------------------------------------

The inconsistency caused by chunking is a very confusing, albeit rare, problem—but I don't think it is what needs to be fixed here. The chunking algorithm assumes that whitespace is a reasonable place to split tokens, and that should be a valid assumption.

Right now the ICU Tokenizer tokenizes _cat 14th γάτα 1ος cat 1ος γάτα 14th_ as _cat | 14th | γάτα | 1οσ | cat | 1 | οσ | γάτα | 14 | th._ Does anyone expect the tokenization of _14th_ or _1ος_ (Greek "1st") to depend on the word before it? It happens across punctuation too, so a word in a different _sentence_ can trigger different tokenization. For example, in "The top results are: 1st is the Greek word for cat, γάτα. 2nd is the French word for cat, chat. 3rd is ...", no one would reasonably expect to get the tokens _1st, 2, nd,_ and _3rd,_ but that's what happens. (Splitting on sentences wouldn't solve this one either—just replace the periods with semicolons and it's one long sentence.)
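For anyone who wants to reproduce this outside of Elasticsearch, here is a rough sketch against Lucene's ICU analysis module (the class names are the standard Lucene analysis API ones; the exact module artifact name varies by Lucene version). Note that the raw tokenizer output is not case- or sigma-folded, so it prints _1ος_ rather than _1οσ_, but the split points are the same:

{code:java}
import java.io.StringReader;

import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class IcuTokenizerRepro {
  public static void main(String[] args) throws Exception {
    // Tokenize the example string and print one token per line. With the current
    // behavior, "14th" and "1ος" come out in two pieces whenever the preceding
    // word is in a different script.
    String text = "cat 14th γάτα 1ος cat 1ος γάτα 14th";
    try (ICUTokenizer tokenizer = new ICUTokenizer()) {
      tokenizer.setReader(new StringReader(text));
      CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
      tokenizer.reset();
      while (tokenizer.incrementToken()) {
        System.out.println(term.toString());
      }
      tokenizer.end();
    }
  }
}
{code}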

The Word Boundary Rules that Robert linked to explicitly say _Do not break within sequences of digits, or digits adjacent to letters (“3a”, or “A3”)._ The [Unicode Segmentation Utility|https://util.unicode.org/UnicodeJsps/breaks.jsp?a=The%20top%20results%20are:%201st%20is%20the%20Greek%20word%20for%20cat,%20%CE%B3%CE%AC%CF%84%CE%B1.%202nd%20is%20the%20French%20word%20for%20cat,%20chat.%203rd%20is%20...] also doesn't split the tokens this way.
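As a sanity check against plain ICU (outside of Lucene entirely), the stock UAX #29 word BreakIterator in ICU4J keeps the digit-plus-letter sequences together regardless of which script comes before them; roughly:

{code:java}
import com.ibm.icu.text.BreakIterator;
import com.ibm.icu.util.ULocale;

public class Uax29WordBreakCheck {
  public static void main(String[] args) {
    // Default ICU word break iterator (plain UAX #29 rules, no per-script tailoring).
    String text = "γάτα 1ος cat 14th";
    BreakIterator words = BreakIterator.getWordInstance(ULocale.ROOT);
    words.setText(text);
    int start = words.first();
    for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
      String segment = text.substring(start, end);
      if (!segment.trim().isEmpty()) {
        // Expected segments: γάτα, 1ος, cat, 14th (no digit/letter splits).
        System.out.println(segment);
      }
    }
  }
}
{code}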

Like I said above, my guess is that there is a flag of some sort for "most recent character set" that should be reset to null or "none" or something at whitespace, line breaks, etc.
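To make that guess concrete, the kind of reset I have in mind looks roughly like the sketch below. This is purely illustrative and not the actual ICUTokenizer/ScriptIterator code; the variable names are made up.

{code:java}
import com.ibm.icu.lang.UScript;

public class ScriptResetSketch {
  // Illustrative only (not the real ICUTokenizer internals): the idea is that a
  // "script of the previous run" value should be cleared at whitespace, so a digit
  // run can never inherit stale script context from an earlier word.
  static void walk(String text) {
    int previousScript = UScript.UNKNOWN; // hypothetical state
    int i = 0;
    while (i < text.length()) {
      int cp = text.codePointAt(i);
      if (Character.isWhitespace(cp)) {
        previousScript = UScript.UNKNOWN; // the reset that appears to be missing
      } else {
        int script = UScript.getScript(cp); // digits report UScript.COMMON
        // ... real boundary bookkeeping would compare script vs. previousScript here ...
        previousScript = script;
      }
      i += Character.charCount(cp);
    }
  }
}
{code}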

Other examples taken from English Wikipedia (it does not use the ICU Tokenizer, but it's a good place to find natural examples): resistor 1.5kΩ 12W (12|w); πρώτη 5G πόλη (5|G); the σ 2p has (2|p); Суворове в 3D (3|D); ФИБА 3x3 (3|x3); интерконективен 400kV (400|kv); collection crosses रु 18cr mark (18|cr); 2019 వేడుక 17th Santosham Awards (17|th); หลวงพี่แจ๊ส 4G (4|g); factor of 2π (2|π); 50m-bazen.pdf 50м базен (50|м); hydroxyprednisolone 16α,17α-acetonide (16|α|17α).

That last one is particularly egregious, since 16α is separated, but 17α is not.




> ICU Tokenizer: letter-space-number-letter tokenized inconsistently
> ------------------------------------------------------------------
>
>                 Key: LUCENE-9754
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9754
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: core/search
>    Affects Versions: 7.5
>         Environment: Tested most recently on Elasticsearch 6.5.4.
>            Reporter: Trey Jones
>            Priority: Major
>         Attachments: LUCENE-9754_prototype.patch
>
>
> The tokenization of strings like _14th_ with the ICU tokenizer is affected by the character that comes before the preceding whitespace.
> For example, _x 14th_ is tokenized as x | 14th; _ァ 14th_ is tokenized as ァ | 14 | th.
> In general, in a letter-space-number-letter sequence, if the writing system before the space is the same as the writing system after the number, then you get two tokens. If the writing systems differ, you get three tokens.
> If the conditions are just right, the chunking that the ICU tokenizer does (trying to split on spaces to create <4k chunks) can create an artificial boundary between the tokens (e.g., between _ァ_ and _14th_) and prevent the unexpected split of the second token (_14th_). Because chunking changes can ripple through a long document, editing text or the effects of a character filter can cause changes in tokenization thousands of lines later in a document.
> My guess is that some "previous character set" flag is not reset at the space, and numbers are not in a character set, so _t_ is compared to _ァ_ and they are not the same—causing a token split at the character set change—but I'm not sure.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org