Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/03/15 06:22:21 UTC

[GitHub] [lucene] rmuir commented on pull request #15: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation

rmuir commented on pull request #15:
URL: https://github.com/apache/lucene/pull/15#issuecomment-799147946

Thanks for updating the PR! It's late tonight, but I will look at this this week. I don't think I have looked since the patch, so I am sure there are some "interesting" things that had to be done for any kind of good performance... Sorry if I ask dumb questions you have already answered...!

I think it's good to have the charfilter, especially for the more efficient transforms like `XYZ>Latin`... I think we should mainly optimize the transliterator code for those types of transforms.

The JIRA issue discusses stuff like Traditional->Simplified as the use case, but I am not sure I would make major sacrifices to try to speed up Traditional->Simplified. At least in the past this one was slow as a transliterator, but looking at the rules, maybe it really shouldn't be a transliterator at all: https://github.com/unicode-org/cldr/blob/master/common/transforms/Simplified-Traditional.xml. This file is like 4k lines of simple mappings (longest match), and out of thousands, only 6 context-based rules are used (ScDigit: 16 chars in the set). So as an alternative, we could process that file, expand those 6 rules (resulting in 66 additional "lines"), and produce a text file suitable for MappingCharFilter. It could be part of the ICU upgrade automation in the gradle build, so that it gets regenerated with new ICU versions.
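To make the "expand the context rules" idea concrete, here is a rough sketch in Python. It is not the build automation itself, and every name in it (`expand_context_rule`, `build_mapping_lines`, the sample characters) is hypothetical. It assumes a context rule has the shape "map `source` to `target` only when preceded by one character from a set", which I have not checked against the actual CLDR rules, and it emits lines in the `"source" => "target"` form that MappingCharFilter rule files use.

```python
def escape(text):
    # Backslashes and double quotes must be escaped inside a quoted rule string.
    return text.replace("\\", "\\\\").replace('"', '\\"')


def to_mapping_line(source, target):
    # One rule in the mapping-file form: "source" => "target"
    return '"{}" => "{}"'.format(escape(source), escape(target))


def expand_context_rule(source, target, context_chars):
    # ASSUMPTION: the rule means "source -> target when preceded by any one
    # character of context_chars". Each context character becomes its own
    # plain mapping, with the context character copied through unchanged.
    return [(ctx + source, ctx + target) for ctx in context_chars]


def build_mapping_lines(simple_pairs, context_rules):
    # simple_pairs: iterable of (source, target)
    # context_rules: iterable of (source, target, context_chars)
    pairs = list(simple_pairs)
    for source, target, context_chars in context_rules:
        pairs.extend(expand_context_rule(source, target, context_chars))
    return [to_mapping_line(s, t) for s, t in pairs]
```

Since the mapping is longest match, an expanded entry such as `"1b"` would take precedence over a plain `"b"` entry wherever the context is present. One thing a real generator would still have to handle: if a context character has a simple mapping of its own, the expanded target needs that mapping applied as well.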

I can see why disabling normalization in the output might improve performance here: why normalize twice? But it's a bit scary to provide such an option that makes things "faster" but might result in crazy output and confuse users, or even make tokenizers behave worse, if they don't understand what it means. Even then it's still problematic, as the ICUNormalizer2CharFilter has buggy offset handling that the tests will find if you re-enable testRandom and run it enough times: https://issues.apache.org/jira/browse/LUCENE-5595 . So I'm not a fan of e.g. returning unexpected decomposed output from this thing right now. For which transforms does this optimization impact the performance, and by how much?
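As a small illustration of what "unexpected decomposed output" means for downstream code, the following sketch uses Python's standard `unicodedata` module rather than ICU, so it only shows the general effect of skipping recomposition, not this charfilter's behavior:

```python
import unicodedata

composed = "\u00e9"  # "e with acute" as a single code point (NFC form)
decomposed = unicodedata.normalize("NFD", composed)  # "e" + combining acute accent

# The two strings render identically, but they differ in length and do not
# compare equal, so offsets and exact-match lookups downstream can change.
same_length = len(composed) == len(decomposed)
equal_as_strings = composed == decomposed
recomposed = unicodedata.normalize("NFC", decomposed)
```

Here `same_length` and `equal_as_strings` are both false, and only `recomposed` compares equal to the original again.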
