Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2019/09/30 16:27:15 UTC

[GitHub] [lucene-solr] magibney commented on issue #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation

magibney commented on issue #892: LUCENE-8972: Add ICUTransformCharFilter, to support pre-tokenizer ICU text transformation
URL: https://github.com/apache/lucene-solr/pull/892#issuecomment-536641016
 
 
   Thank you for the review/feedback @msokolov! In addition to offset accuracy (as you mention), some form of incremental transliteration with rollback is necessary even for baseline support of consistent transliteration over streaming input. The only alternatives I can think of are:
   1. to load the entire input into memory and transliterate it as a single block (which would lose you _all_ offset accuracy and could be problematic memory-wise for large inputs), or
   2. to transliterate the input in arbitrary-sized chunks (which would get you offset accuracy at the granularity of the individual chunks).
   
   The second option sounds ok for cases where you don't care about offsets, but a bigger problem with it is that, even though the input is nominally processed "incrementally", the tail end of each chunk would in some cases undergo only partial transformation, yielding results inconsistent with what single-pass transliteration of the full input would produce.
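
   For a bit more concreteness on the chunk-boundary problem (and on what the "rollback" machinery is working around), here's a small standalone sketch against the ICU4J API -- not PR code; the transform id, input, and chunk size are arbitrary. It contrasts naive per-chunk transliteration with ICU's incremental entry point, where any gap between `Position.start` and `Position.limit` is the uncommitted tail that a streaming CharFilter has to buffer (and potentially re-run) until more input, or end-of-input, arrives:

```java
import com.ibm.icu.text.ReplaceableString;
import com.ibm.icu.text.Transliterator;

public class IncrementalTransliterationDemo {
  public static void main(String[] args) {
    Transliterator t = Transliterator.getInstance("Cyrillic-Latin");
    // "зайка" with й written in decomposed form (и + combining breve U+0306),
    // so that a naive chunk boundary can separate the mark from its base char.
    String input = "заи\u0306ка";

    // Baseline: transliterate the whole input in one pass.
    System.out.println("full:        " + t.transliterate(input));

    // Naive chunking: transliterate each fixed-size chunk independently. The
    // result can differ from the single-pass output whenever a rule (or the
    // trailing NFC step) needs context that falls in the neighboring chunk.
    int chunkSize = 3;
    StringBuilder chunked = new StringBuilder();
    for (int i = 0; i < input.length(); i += chunkSize) {
      chunked.append(t.transliterate(input.substring(i, Math.min(input.length(), i + chunkSize))));
    }
    System.out.println("chunked:     " + chunked);

    // Incremental API: ICU only commits the prefix it can transliterate
    // unambiguously. Any gap between pos.start and pos.limit is the tail a
    // streaming CharFilter must hold back (and potentially re-run) until more
    // input -- or end-of-input -- arrives; that is what rollback manages.
    ReplaceableString buf = new ReplaceableString(input);
    Transliterator.Position pos = new Transliterator.Position(0, buf.length(), 0, buf.length());
    t.transliterate(buf, pos);
    System.out.println("committed through position " + pos.start + " of " + pos.limit);
    t.finishTransliteration(buf, pos); // no more input: force-transliterate the tail
    System.out.println("incremental: " + buf);
  }
}
```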
   
   I was curious as well about the performance implications of rollback (as compared to `ICUTransformTokenFilter`, which has the input pre-split and can thus afford single-pass transliteration of each token). I put together a _really_ quick and dirty comparison that runs inputs consisting of configurably-sized, whitespace-delimited tokens, each a run of a single repeated character. The results (at least initially and superficially) seem to suggest that latency scales ~linearly with the average length of the character runs that can be completely transliterated (advancing the rollback window) -- or, put another way, with the average size to which the rollback buffer content grows before being "committed". See the initial performance evaluation code at:
   
   [charFilterPerformanceTest.txt](https://github.com/apache/lucene-solr/files/3672215/charFilterPerformanceTest.txt)
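
   To make the scaling intuition concrete without opening the attachment (the attachment is the real harness; the sketch below is just a self-contained approximation with made-up sizes and an arbitrary transform id), here's roughly where the ~linear dependence on run length comes from: if a worst-case rollback strategy re-transliterates the pending run each time a character is appended, per-token cost grows ~quadratically in token length, so total cost over a fixed amount of input grows ~linearly with the average token length:

```java
import com.ibm.icu.text.Transliterator;

/**
 * Rough, self-contained illustration (NOT the attached harness) of why latency
 * tracks the average length of the runs that must be buffered before commit.
 */
public class RollbackCostSketch {
  public static void main(String[] args) {
    Transliterator t = Transliterator.getInstance("Cyrillic-Latin");
    final int totalChars = 100_000;
    for (int tokenLen : new int[] {2, 8, 32}) {
      // Whitespace-delimited tokens, each a run of one repeated character.
      StringBuilder input = new StringBuilder();
      while (input.length() < totalChars) {
        for (int i = 0; i < tokenLen; i++) {
          input.append('ж');
        }
        input.append(' ');
      }
      long start = System.nanoTime();
      simulateWorstCaseRollback(t, input.toString());
      long elapsedMs = (System.nanoTime() - start) / 1_000_000;
      System.out.println("tokenLen=" + tokenLen + " -> " + elapsedMs + " ms");
    }
  }

  /** Re-transliterate the pending (uncommitted) run on every appended char; commit at whitespace. */
  private static void simulateWorstCaseRollback(Transliterator t, String input) {
    StringBuilder pending = new StringBuilder();
    for (int i = 0; i < input.length(); i++) {
      char c = input.charAt(i);
      if (Character.isWhitespace(c)) {
        pending.setLength(0); // delimiter: the pending run is committed
      } else {
        pending.append(c);
        t.transliterate(pending.toString()); // "rollback": redo the whole pending run
      }
    }
  }
}
```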
   
   It's a tricky thing to test, and I'd like to see it evaluated against real-world input, but the performance impact of incremental transliteration will also be highly dependent on the particular transform: top-level `RuleBasedTransliterator` instances should never require any rollback, and should be plenty fast; some `CompoundTransliterator` instances should require little or no rollback; others (particularly ones that bundle a trailing NFC transformation, like "Cyrillic-Latin") might require rollback for most non-delimiter character runs.
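
   (As a side note, ICU makes it fairly easy to check this up front. The snippet below -- just diagnostic code, not anything from the PR, with arbitrary transform ids -- lists the elements of each transform; a trailing normalization element is a reasonable predictor that rollback will kick in for most runs.)

```java
import com.ibm.icu.text.Transliterator;

public class TransformElementsDemo {
  public static void main(String[] args) {
    for (String id : new String[] {"Cyrillic-Latin", "Latin-Katakana", "Any-Upper"}) {
      Transliterator t = Transliterator.getInstance(id);
      StringBuilder elements = new StringBuilder(id + " ->");
      // getElements() returns the child transliterators of a compound transform
      // (or the transliterator itself if it isn't compound). A trailing NFC/NFD
      // element is a hint that rollback will be needed for most character runs.
      for (Transliterator element : t.getElements()) {
        elements.append(" [").append(element.getID()).append(']');
      }
      System.out.println(elements);
    }
  }
}
```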
   
   If we somehow know that offsets are not necessary for a given instance, some optimizations would be possible (like initially buffering some large-but-not-too-large number of characters of the input, in the hope that we reach end-of-input and can just do block transliteration).
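
   For what it's worth, a sketch of what such a fast path could look like (entirely hypothetical names and wiring, not what the PR currently does): read ahead up to some cap, and if end-of-input is reached within the cap, skip the incremental machinery and transliterate the whole buffer in one shot.

```java
import java.io.IOException;
import java.io.Reader;

import com.ibm.icu.text.Transliterator;

/** Hypothetical fast path for when offsets don't matter: hope the whole input fits in one buffer. */
final class BlockTransliterationFastPath {
  // Arbitrary cap; beyond this we'd have to fall back to incremental transliteration with rollback.
  private static final int MAX_BLOCK_CHARS = 8192;

  /**
   * Returns the fully transliterated input if end-of-input is reached within MAX_BLOCK_CHARS,
   * or null to signal that the caller should fall back to the incremental path
   * (in a real filter the characters read so far would be handed off, not discarded).
   */
  static String tryBlockTransliterate(Reader input, Transliterator transform) throws IOException {
    char[] buf = new char[MAX_BLOCK_CHARS + 1]; // one extra char so we can detect overflow
    int filled = 0;
    int read;
    while (filled < buf.length && (read = input.read(buf, filled, buf.length - filled)) != -1) {
      filled += read;
    }
    if (filled > MAX_BLOCK_CHARS) {
      return null; // input exceeds the cap; fall back to the incremental (rollback) path
    }
    // End of input reached within the cap: single-pass block transliteration is safe and
    // trivially consistent, since no further context can arrive to change the result.
    return transform.transliterate(new String(buf, 0, filled));
  }
}
```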
