Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2019/10/02 18:36:56 UTC

[GitHub] [lucene-solr] magibney edited a comment on issue #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation

URL: https://github.com/apache/lucene-solr/pull/892#issuecomment-537538822
 
 
   Here's a thought: what if we provided a boolean configuration option like `assumeExternalUnicodeNormalization`? Many of these transforms work on NFD input and produce NFC output, but they are generally configured defensively: they assume neither that input is NFD nor that output will be externally converted to NFC.
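   
   To make that concrete, here's a minimal sketch of how the option might be wired up via Lucene's CustomAnalyzer -- assuming the char filter from this PR ends up registered under the SPI name `icuTransform` and accepts the proposed parameter (both names are illustrative, not settled API):
   
   ```java
   import org.apache.lucene.analysis.Analyzer;
   import org.apache.lucene.analysis.custom.CustomAnalyzer;
   
   public class TransformConfigSketch {
     public static void main(String[] args) throws Exception {
       Analyzer analyzer = CustomAnalyzer.builder()
           .withTokenizer("standard")
           // Hypothetical: the "icuTransform" SPI name and the
           // "assumeExternalUnicodeNormalization" param are the proposal
           // under discussion, not existing Lucene API.
           .addCharFilter("icuTransform",
               "id", "Cyrillic-Latin",
               "assumeExternalUnicodeNormalization", "true")
           // The "external" normalization the option assumes: NFC applied
           // once, downstream, by the existing icuNormalizer2 filter.
           .addTokenFilter("icuNormalizer2", "name", "nfc", "mode", "compose")
           .build();
       System.out.println(analyzer);
     }
   }
   ```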
   
   This is understandable, but it results in the odd situation (for example) that an analysis component like "ICUTransformFilter(Cyrillic-Latin)" would produce NFC output, but _only_ for characters whose input representation matched the top-level Cyrillic-Latin filter (which is quite restrictive). Input characters that didn't match the top-level filter would pass through untouched by every component of the underlying CompoundTransliterator. So if you want fully Unicode-normalized output (and in the context of an analysis chain, most do), you have to apply post-transform NFC normalization separately anyway.
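   
   A quick ICU4J illustration of that gap (the exact transliteration output doesn't matter here, only the normalization status of the pass-through character; if the transform behaves as described above, this prints false):
   
   ```java
   import com.ibm.icu.text.Normalizer2;
   import com.ibm.icu.text.Transliterator;
   
   public class FilteredNfcGap {
     public static void main(String[] args) {
       Transliterator cyrToLat = Transliterator.getInstance("Cyrillic-Latin");
       // "e" + U+0301 COMBINING ACUTE ACCENT: decomposed Latin, so it
       // doesn't match the top-level Cyrillic filter and passes through
       // untouched -- bypassing the transform's internal NFC step.
       String input = "\u043F\u0440\u0438\u0432\u0435\u0442 e\u0301";
       String output = cyrToLat.transliterate(input);
       // Expected: false -- the pass-through "e" + accent is still
       // decomposed, so the overall output is not NFC.
       System.out.println("NFC? " + Normalizer2.getNFCInstance().isNormalized(output));
     }
   }
   ```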
   
   At best, this ends up doing some redundant work; but for the performance case we're considering here, there are particular implications. NFC, as a trailing transformation step, is both _very_ common and _very_ active -- active in the sense that, in many common contexts, it will block output on nearly every character while waiting to see whether combining diacritics follow. If we know we're applying Unicode normalization externally over the entire output, skipping the baked-in post-NFC for every transform component avoids the redundant work, but more importantly it avoids a common case that's virtually guaranteed to cause a substantial amount of partial transliteration, rollback, etc. I think this can be done relatively cleanly using Transliterator getElements(), toRules(false), and createFromRules(...).
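   
   Roughly, a sketch of that idea (the helper name and the ID-based test for normalization steps are mine; a real implementation would also need to carry over the compound's global filter via getFilter()/setFilter(), which this elides):
   
   ```java
   import com.ibm.icu.text.Transliterator;
   
   public final class TransliteratorUtil {
     /**
      * Rebuilds a (possibly compound) transliterator with its baked-in
      * Unicode-normalization elements stripped, for use when the analysis
      * chain applies normalization externally. Sketch only.
      */
     static Transliterator withoutInternalNormalization(Transliterator t) {
       StringBuilder rules = new StringBuilder();
       for (Transliterator element : t.getElements()) {
         String id = element.getID();
         // Assumption: baked-in normalization steps surface as elements
         // whose IDs end in NFC/NFD/NFKC/NFKD (e.g. "Any-NFD").
         if (id.endsWith("NFC") || id.endsWith("NFD")
             || id.endsWith("NFKC") || id.endsWith("NFKD")) {
           continue;
         }
         rules.append(element.toRules(false)).append('\n');
       }
       return Transliterator.createFromRules(
           t.getID() + "/ExternallyNormalized",
           rules.toString(), Transliterator.FORWARD);
     }
   }
   ```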
   
   I'd be curious to know what you think, @msokolov, and perhaps @rmuir?
