Posted to issues@lucene.apache.org by "Michael Gibney (Jira)" <ji...@apache.org> on 2021/06/28 20:30:00 UTC

[jira] [Commented] (LUCENE-9177) ICUNormalizer2CharFilter worst case is very slow

    [ https://issues.apache.org/jira/browse/LUCENE-9177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17370836#comment-17370836 ] 

Michael Gibney commented on LUCENE-9177:
----------------------------------------

[~jim.ferenczi] I adapted your informal "performance benchmark" test ([^LUCENE-9177_LUCENE-8972.patch]) to also evaluate a third approach: instead of using {{ICUNormalizer2CharFilter}}, I tried using the new {{ICUTransform2CharFilter}}, under consideration at LUCENE-8972 (and [PR #15|https://github.com/apache/lucene/pull/15]).

In place of the {{nfkc_cf}} normalizer, I used a compound transliterator, configured as {{Any-NFKD; Any-CaseFold; Any-NFC}}, which should (I think) be equivalent.
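For reference, here is a minimal standalone sketch (assuming ICU4J on the classpath; this is illustration on my part, not part of the patch) of the equivalence I have in mind between the {{nfkc_cf}} normalizer and the compound transliterator:
{code:java}
import com.ibm.icu.text.Normalizer2;
import com.ibm.icu.text.Transliterator;

public class EquivalenceCheck {
  public static void main(String[] args) {
    // nfkc_cf as exposed directly by ICU4J:
    Normalizer2 nfkcCf = Normalizer2.getNFKCCasefoldInstance();
    // The compound transliterator used in place of the normalizer:
    Transliterator t = Transliterator.getInstance("Any-NFKD; Any-CaseFold; Any-NFC");
    for (String s : new String[] {"℃", "Ⅸ", "ﬁn", "ＡＢＣ"}) {
      System.out.println(s + " -> " + nfkcCf.normalize(s) + " / " + t.transliterate(s));
    }
  }
}
{code}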

The performance improvement was quite dramatic:
{quote}  1> normalized 400000 chars in 2364ms
  1> transformed 400000 chars in 181ms
  1> normalized 400000 chars in 49ms
{quote}

The first of these uses {{ICUNormalizer2CharFilter}} (the subject of this performance issue); the second uses the new {{ICUTransform2CharFilter}}; the third is your point of comparison using {{ICUNormalizer2Filter}} (a TokenFilter).

Note that the first two should be functionally identical. In contrast, the last ends up normalizing the whole input string at once (whitespace tokenizer over a string with no whitespace) and doesn't track any offsets.
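For clarity, here is roughly how the three chains are wired. The first and third classes exist in the {{analysis/icu}} module today; the second is from the LUCENE-8972 PR, and the constructor shape shown is an assumption on my part rather than a copy from the patch:
{code:java}
import java.io.Reader;
import java.io.StringReader;

import com.ibm.icu.text.Normalizer2;
import org.apache.lucene.analysis.icu.ICUNormalizer2CharFilter;

public class ChainSketch {
  static Reader normalized(String input) {
    // 1) Existing CharFilter (the slow path under discussion): offset-corrected
    //    normalization applied ahead of the Tokenizer.
    return new ICUNormalizer2CharFilter(
        new StringReader(input), Normalizer2.getNFKCCasefoldInstance());
  }

  // 2) The proposed CharFilter from LUCENE-8972 / PR #15 (constructor shape
  //    assumed here, since the class is not yet merged):
  //   new ICUTransform2CharFilter(new StringReader(input),
  //       Transliterator.getInstance("Any-NFKD; Any-CaseFold; Any-NFC"));
  //
  // 3) Existing TokenFilter point of comparison; it normalizes whole tokens
  //    after tokenization, so no offset correction is needed:
  //   new ICUNormalizer2Filter(tokenizer, Normalizer2.getNFKCCasefoldInstance());
}
{code}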

My initial (admittedly fuzzy) hypothesis is that the difference comes down to the fact that the new {{ICUTransform2CharFilter}} uses ICU's {{Replaceable}} abstraction to do offset correction "from within", whereas {{ICUNormalizer2CharFilter}} has to establish firm normalization boundaries up front in order to apply "external" offset correction.
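To illustrate what I mean by "from within": a {{Transliterator}} can operate directly on a mutable {{Replaceable}} buffer, so replacements (and the offset shifts they imply) happen in place. A minimal sketch:
{code:java}
import com.ibm.icu.text.ReplaceableString;
import com.ibm.icu.text.Transliterator;

public class ReplaceableSketch {
  public static void main(String[] args) {
    Transliterator t = Transliterator.getInstance("Any-NFKD; Any-CaseFold; Any-NFC");
    ReplaceableString buf = new ReplaceableString("No. 5℃, Ⅸ");
    // transliterate() mutates the buffer in place and returns the new limit,
    // so the caller can observe how each replacement shifts downstream offsets.
    int newLimit = t.transliterate(buf, 0, buf.length());
    System.out.println(buf.toString() + " (new limit: " + newLimit + ")");
  }
}
{code}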

In any case, it's worth noting that for Japanese text there may be other Transliterator ids that could be especially useful as CharFilters (i.e., applied pre-Tokenizer); this was indeed the main motivation for LUCENE-8972. A couple of candidates are sketched below.
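Purely as illustration (these are stock ICU system transliterator ids; not necessarily the ones that would matter most in practice):
{code:java}
import com.ibm.icu.text.Transliterator;

public class JapanesePreTokenSketch {
  public static void main(String[] args) {
    // Fold half-width katakana to full-width, a common pre-tokenization cleanup:
    Transliterator halfToFull = Transliterator.getInstance("Halfwidth-Fullwidth");
    // Fold katakana to hiragana:
    Transliterator kataToHira = Transliterator.getInstance("Katakana-Hiragana");
    String s = "ﾃｽﾄ";
    System.out.println(kataToHira.transliterate(halfToFull.transliterate(s)));
  }
}
{code}
Applied as CharFilters, these would run before the Tokenizer while still preserving correct offsets into the original text.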

> ICUNormalizer2CharFilter worst case is very slow
> ------------------------------------------------
>
>                 Key: LUCENE-9177
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9177
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Jim Ferenczi
>            Priority: Minor
>         Attachments: LUCENE-9177_LUCENE-8972.patch, lucene.patch
>
>
> ICUNormalizer2CharFilter is fast most of the time, but we've had some reports in Elasticsearch that unrealistic data can slow down the process very significantly. For instance, an input that consists of characters to normalize with no normalization-inert character in between can take up to several seconds to process a few hundred kilobytes on my machine. While the input is not realistic, this worst case can slow down indexing considerably when dealing with unclean data.
> I attached a small test that reproduces the slow processing using a stream that contains many repetitions of the character `℃` and no normalization-inert character. I am not surprised that the processing is slower than usual, but several seconds seems like a lot. Adding a normalization-inert character makes the processing a lot faster, so I wonder if we can improve the process to split the input more eagerly?
>  
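For anyone who wants to reproduce the worst case described above without applying the attached patch, something along these lines (timings will of course vary by machine) should exhibit the slowdown:
{code:java}
import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.icu.ICUNormalizer2CharFilter;

public class WorstCaseRepro {
  public static void main(String[] args) throws Exception {
    // A few hundred KB of '℃' with no normalization-inert character in between:
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 200_000; i++) {
      sb.append('℃');
    }
    // The single-arg constructor defaults to the nfkc_cf normalizer.
    Reader in = new ICUNormalizer2CharFilter(new StringReader(sb.toString()));
    long start = System.nanoTime();
    char[] buf = new char[4096];
    while (in.read(buf) != -1) {
      // drain the filter; all the work happens inside read()
    }
    System.out.println("took " + ((System.nanoTime() - start) / 1_000_000) + "ms");
  }
}
{code}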


