Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/03/15 16:56:53 UTC

[GitHub] [lucene] magibney commented on pull request #15: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation

magibney commented on pull request #15:
URL: https://github.com/apache/lucene/pull/15#issuecomment-799579642


   Thanks for looking at this, @rmuir! And no worries wrt "questions already answered" -- it's been long enough that this all feels fresh to me again, for better or worse :-)
   
   The Traditional->Simplified use case is a good example of where this type of functionality (however accomplished) is really necessary, because of the significance of dictionary-based tokenization for these scripts. It sounds promising to map the transliteration file onto a file for MappingCharFilter. Based on what you say, it looks like that would be viable for Traditional->Simplified. I think that approach would also address some of the weird issues this PR had to work around wrt offset resolution in composite transliterators. I wonder whether the same approach could apply to other scripts that commonly use dictionary-based tokenization?
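
    To make the idea concrete, here is a minimal, self-contained sketch (plain JDK, not Lucene's actual MappingCharFilter/NormalizeCharMap API) of what a flattened Traditional->Simplified table boils down to: a context-free 1:1 character mapping. The character pairs are real T->S correspondences, but the tiny table itself is just illustrative.
    
    ```java
    import java.util.HashMap;
    import java.util.Map;
    
    public class TradToSimpSketch {
        // Illustrative excerpt of a flattened Traditional->Simplified table;
        // the real table would be generated from the transliteration file.
        static final Map<Character, Character> TABLE = new HashMap<>();
        static {
            TABLE.put('國', '国'); // country
            TABLE.put('學', '学'); // study
            TABLE.put('體', '体'); // body
        }
    
        // Apply the mapping char-by-char; unmapped chars pass through unchanged.
        static String map(String in) {
            StringBuilder sb = new StringBuilder(in.length());
            for (char c : in.toCharArray()) {
                sb.append(TABLE.getOrDefault(c, c));
            }
            return sb.toString();
        }
    
        public static void main(String[] args) {
            System.out.println(map("中國學生")); // 中国学生
        }
    }
    ```
    
    Because each input char maps to exactly one output char, offsets are trivially preserved -- which is exactly why this shape of mapping would sidestep the offset-resolution workarounds mentioned above.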
   
    Perhaps this is what you were suggesting, but if the "MappingCharFilter" approach were used as an optimization while still keeping a generic ICUTransformCharFilter directly backed by the ICU Transliterator, it'd be possible to test the two approaches against each other for consistency, and to preserve the "vanilla" ICUTransformCharFilter for unanticipated use cases, etc. ...
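
    The consistency testing could look something like the following randomized cross-check (a hedged sketch with stdlib stand-ins: `vanilla` plays the role of the Transliterator-backed path, `optimized` the MappingCharFilter-backed one -- neither is the real Lucene code):
    
    ```java
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Random;
    
    public class ConsistencyCheck {
        // Illustrative mapping; stand-in for a real flattened transliteration table.
        static final Map<Character, Character> TABLE = new HashMap<>();
        static {
            TABLE.put('國', '国');
            TABLE.put('學', '学');
        }
    
        // "Vanilla" path: straightforward per-char lookup.
        static String vanilla(String s) {
            StringBuilder sb = new StringBuilder(s.length());
            for (char c : s.toCharArray()) sb.append(TABLE.getOrDefault(c, c));
            return sb.toString();
        }
    
        // "Optimized" path: precomputed flat array over the BMP.
        static final char[] FLAT = new char[Character.MAX_VALUE + 1];
        static {
            for (int c = 0; c <= Character.MAX_VALUE; c++) FLAT[c] = (char) c;
            for (Map.Entry<Character, Character> e : TABLE.entrySet()) FLAT[e.getKey()] = e.getValue();
        }
        static String optimized(String s) {
            StringBuilder sb = new StringBuilder(s.length());
            for (char c : s.toCharArray()) sb.append(FLAT[c]);
            return sb.toString();
        }
    
        public static void main(String[] args) {
            Random r = new Random(42);
            for (int i = 0; i < 1000; i++) {
                StringBuilder sb = new StringBuilder();
                for (int j = 0; j < 20; j++) {
                    // Mix mapped chars in with random CJK chars so both branches are exercised.
                    sb.append(r.nextBoolean() ? '國' : (char) (0x4E00 + r.nextInt(256)));
                }
                String s = sb.toString();
                if (!vanilla(s).equals(optimized(s))) throw new AssertionError("mismatch on: " + s);
            }
            System.out.println("consistent over 1000 random inputs");
        }
    }
    ```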
   
    True, `assumeExternalUnicodeNormalization` gets you into "shoot yourself in the foot" territory for sure, but it yields roughly a 4x performance improvement (based on some naive but, I think, representative benchmarks). [This comment](https://github.com/apache/lucene-solr/pull/892#issuecomment-537538822) describes the approach, and the [comment immediately following](https://github.com/apache/lucene-solr/pull/892#issuecomment-537601926) confirms the results. Regarding which transformations this applies to (from the first linked comment):
    > NFC, as a trailing transformation step, is both very common and very active -- active in the sense that it will in many common contexts block output waiting for combining diacritics for literally almost every character
    
    I'm inclined to think the significant performance gain for such a common case is worth it -- as a user I'd certainly not want that type of functionality hidden from me. I wonder if there's a way to "have cake and eat it too" ...
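
    The factoring-out idea can be sketched with the JDK's own `java.text.Normalizer` (a stand-in here for the ICU normalizers the filter actually uses; `transformWithoutTrailingNfc` is a hypothetical transform whose trailing NFC step has been stripped, i.e. the situation `assumeExternalUnicodeNormalization` targets):
    
    ```java
    import java.text.Normalizer;
    
    public class ExternalNfcSketch {
        // Hypothetical transform with its trailing NFC step removed. Here it just
        // emits decomposed (NFD) output, as many transforms do internally before
        // their final composition step.
        static String transformWithoutTrailingNfc(String in) {
            return Normalizer.normalize(in, Normalizer.Form.NFD);
        }
    
        public static void main(String[] args) {
            String input = "café";
            // With NFC factored out, normalization is applied once, externally,
            // after the transform -- so the transform itself never has to block
            // output waiting for possible trailing combining diacritics.
            String external = Normalizer.normalize(transformWithoutTrailingNfc(input), Normalizer.Form.NFC);
            // Equivalent to what a transform with a built-in trailing NFC would emit:
            System.out.println(external.equals(Normalizer.normalize(input, Normalizer.Form.NFC))); // true
        }
    }
    ```
    
    In the actual analysis chain, the external step would be something like an ICUNormalizer2CharFilter ahead of or behind the transform filter, rather than a direct `Normalizer` call.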
   

