Posted to issues@lucene.apache.org by "Michael Gibney (Jira)" <ji...@apache.org> on 2021/07/01 19:56:00 UTC

[jira] [Commented] (LUCENE-9177) ICUNormalizer2CharFilter worst case is very slow

    [ https://issues.apache.org/jira/browse/LUCENE-9177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17373020#comment-17373020 ] 

Michael Gibney commented on LUCENE-9177:
----------------------------------------

Upon further investigation, I don't think the "concatenation" question is actually a concern. I just opened [PR #199|https://github.com/apache/lucene/pull/199], which avoids the "normalization-inert character" requirement altogether, in a way that I think should be safe/robust. I restored the "performance minitest patch" to something very close to the original posted by Jim ([^LUCENE-9177-benchmark-test.patch]), with encouraging results:
{quote}1> normalized 400000 chars in 71ms
 1> normalized 400000 chars in 64ms
{quote}
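The attached patch itself isn't reproduced here, but a minimal sketch of that kind of minitest might look like the following (assuming the NFKC normalizer, under which '℃' decomposes; the class name, helper, and buffer size are illustrative only):
{code:java}
import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.icu.ICUNormalizer2CharFilter;

import com.ibm.icu.text.Normalizer2;

public class NormalizeMinitest {

  /** Drains {@code input} through an NFKC ICUNormalizer2CharFilter, reporting elapsed time. */
  static void time(String label, String input) throws Exception {
    char[] buf = new char[1024];
    long start = System.nanoTime();
    try (Reader r = new ICUNormalizer2CharFilter(
        new StringReader(input), Normalizer2.getNFKCInstance())) {
      while (r.read(buf, 0, buf.length) != -1) {
        // drain only; we just want the elapsed wall-clock time
      }
    }
    System.out.println(label + ": normalized " + input.length()
        + " chars in " + ((System.nanoTime() - start) / 1_000_000) + "ms");
  }

  public static void main(String[] args) throws Exception {
    // Worst case from this issue: repetitions of '\u2103' ('℃') with no
    // normalization-inert character anywhere in the stream.
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < 400_000; i++) {
      sb.append('\u2103');
    }
    time("no inert chars", sb.toString());
  }
}
{code}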
I don't think this is at all detrimental to performance for the common case. If anything, I think it might help somewhat by reducing required over-buffering even for the common case, because the extant buffering effectively requires a normalization-inert character to occur by the arbitrary "128th char" (the last element in the internal buffer).

> ICUNormalizer2CharFilter worst case is very slow
> ------------------------------------------------
>
>                 Key: LUCENE-9177
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9177
>             Project: Lucene - Core
>          Issue Type: Improvement
>            Reporter: Jim Ferenczi
>            Priority: Minor
>         Attachments: LUCENE-9177-benchmark-test.patch, LUCENE-9177_LUCENE-8972.patch, lucene.patch
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> ICUNormalizer2CharFilter is fast most of the time, but we've had reports in Elasticsearch that some unrealistic data can slow down the process very significantly. For instance, an input that consists of characters to normalize, with no normalization-inert character in between, can take up to several seconds to process a few hundred kilobytes on my machine. While the input is not realistic, this worst case can slow down indexing considerably when dealing with uncleaned data.
> I attached a small test that reproduces the slow processing using a stream that contains many repetitions of the character `℃` and no normalization-inert character. I am not surprised that the processing is slower than usual, but several seconds seems like a lot. Adding normalization-inert characters makes the processing much faster, so I wonder if we can improve the process to split the input more eagerly?
>  
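For contrast with the quoted description, the same minitest's {{time}} helper (from the sketch above) can be fed an input with a normalization-inert character interleaved; a plain ASCII space, which I'm assuming is inert for NFKC, works here. Only the input construction changes:
{code:java}
// Same minitest as sketched above, but interleave a (presumed
// normalization-inert) ASCII space after every '\u2103', giving the
// char filter safe points at which to split the input.
StringBuilder sb = new StringBuilder();
for (int i = 0; i < 400_000; i++) {
  sb.append('\u2103').append(' ');
}
time("inert chars interleaved", sb.toString());
{code}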



