Posted to commits@druid.apache.org by "gianm (via GitHub)" <gi...@apache.org> on 2024/03/27 18:08:52 UTC

[PR] JSONFlattenerMaker: Speed up charsetFix. (druid)

gianm opened a new pull request, #16212:
URL: https://github.com/apache/druid/pull/16212

   JSON parsing has a function, "charsetFix", that fixes up strings so they can round-trip through UTF-8 encoding without loss of fidelity. It was originally introduced to fix a bug where strings could be sorted, encoded, then decoded, and the decoded strings could end up no longer in sorted order (due to character swaps during the encode operation).
   
   The code has been in place for some time, and only applies to JSON. I am not sure if it needs to apply to other formats; it's certainly more difficult to get broken strings from other formats. It's easy in JSON because you can write a JSON string like "foo\uD900", which ends in an unpaired surrogate.
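
To make the round-trip problem concrete, here is a minimal sketch in plain Java. It is not Druid code; the class and method names are hypothetical. It relies only on the documented behavior of `String.getBytes(Charset)`, which replaces malformed input (such as an unpaired surrogate) with the charset's replacement bytes, `?` for UTF-8.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Druid.
class RoundTripDemo {
    // Encode to UTF-8 and decode again, as would happen when a string is written and read back.
    static String roundTrip(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // What a JSON parser yields for the JSON string "foo\uD900": a trailing unpaired surrogate.
        String broken = "foo" + (char) 0xD900;
        String valid = "foo\u0100";

        // Before encoding, 'valid' sorts before 'broken' (0x0100 < 0xD900).
        System.out.println(valid.compareTo(broken) < 0);

        // After the round trip, 'broken' has become "foo?" and now sorts before 'valid'.
        System.out.println(roundTrip(valid).compareTo(roundTrip(broken)) < 0);
    }
}
```

Two strings that were in sorted order before encoding are no longer in sorted order after decoding, which is the class of bug charsetFix guards against.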
   
   At any rate, this patch does not revisit whether charsetFix should be applied to all formats. It merely optimizes it for the JSON case. The function works by using CharsetEncoder.canEncode, which is a relatively slow method (just as expensive as actually encoding). This patch adds a short-circuit to skip canEncode if all chars in a string are in the basic multilingual plane (i.e. if no chars are surrogates).
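
The short-circuit can be sketched roughly as follows. This is an illustration under stated assumptions rather than the code in the patch: the class name, the `hasSurrogate` helper, and the exact repair step (a UTF-8 encode/decode round trip) are hypothetical. Only `CharsetEncoder.canEncode` and `Character.isSurrogate` are real JDK calls.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not the actual Druid implementation.
class CharsetFixSketch {
    // Cheap scan: a string without surrogate chars contains only BMP code points,
    // all of which UTF-8 can encode, so the expensive check can be skipped.
    static boolean hasSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isSurrogate(s.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    static String charsetFix(String s) {
        if (s == null) {
            return null;
        }
        // Fast path: no surrogates, nothing to fix.
        if (!hasSurrogate(s)) {
            return s;
        }
        // Slow path: only now pay for canEncode. A new encoder is created per call
        // here because CharsetEncoder instances are not thread-safe.
        if (StandardCharsets.UTF_8.newEncoder().canEncode(s)) {
            return s;
        }
        // Assumed repair step: round-trip through UTF-8 so unpaired surrogates are replaced.
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }
}
```

Strings with correctly paired surrogates still go through `canEncode` and come back unchanged; only strings with unpaired surrogates are rewritten.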
   
   Benchmarks:
   
   ```
   master
   
   Benchmark                              (discovery)  (readerTypeString)  Mode  Cnt     Score    Error  Units
   JsonInputFormatBenchmark.parseAndRead        false              reader  avgt   10  2645.716 ± 24.261  ns/op
   
   patch
   
   Benchmark                              (discovery)  (readerTypeString)  Mode  Cnt     Score    Error  Units
   JsonInputFormatBenchmark.parseAndRead        false              reader  avgt   10  2307.164 ± 36.656  ns/op
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


Re: [PR] JSONFlattenerMaker: Speed up charsetFix. (druid)

Posted by "cryptoe (via GitHub)" <gi...@apache.org>.
cryptoe merged PR #16212:
URL: https://github.com/apache/druid/pull/16212

