Posted to pr@cassandra.apache.org by GitBox <gi...@apache.org> on 2019/11/11 23:36:24 UTC

[GitHub] [cassandra] yifan-c opened a new pull request #382: Estimate UTF-8 string size based on encodeSize and add benchmarks

URL: https://github.com/apache/cassandra/pull/382
 
 
   Because `encodeSize` is already calculated when encoding, the buffer is known to be large enough for the string. We can use that size and safely reserve just the remaining capacity for the write, which avoids resizing.
   
   A set of benchmarks was run to show the difference. For the long text, the change cuts the string encoding time by more than half, from 571.9 ns to 216.1 ns per operation. For the short text, the time drops from 62.8 ns to 36.4 ns per operation, a reduction of roughly 42%.
   
   The improvement comes from avoiding unnecessary buffer resizing and data copying.
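
To make the resizing cost concrete, here is a minimal sketch using only the JDK. `ReserveSketch` is a hypothetical growable buffer, not Netty's `ByteBuf` or Cassandra's code, and the worst-case estimate of 3 bytes per UTF-16 `char` is an assumption about how a conservative writer might reserve space. The buffer is pre-sized to the exact encoded length, as it would be when `encodeSize` is known up front.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical growable buffer, used only to show when a reservation forces a resize. */
public class ReserveSketch {
    byte[] data;
    int writerIndex = 0;
    int resizeCount = 0;

    ReserveSketch(int capacity) { data = new byte[capacity]; }

    int remaining() { return data.length - writerIndex; }

    /** Grows the backing array (allocation + copy) if fewer than `bytes` writable bytes remain. */
    void reserve(int bytes) {
        if (bytes > remaining()) {
            data = Arrays.copyOf(data, writerIndex + bytes);
            resizeCount++;
        }
    }

    void write(byte[] src) {
        System.arraycopy(src, 0, data, writerIndex, src.length);
        writerIndex += src.length;
    }

    /** Assumed conservative estimate: at most 3 UTF-8 bytes per UTF-16 char. */
    static int worstCaseUtf8Bytes(String s) { return s.length() * 3; }

    /** Resizes triggered when reserving the worst-case size in an exactly sized buffer. */
    static int resizesWithWorstCaseReserve(String s) {
        byte[] encoded = s.getBytes(StandardCharsets.UTF_8);
        ReserveSketch buf = new ReserveSketch(encoded.length);
        buf.reserve(worstCaseUtf8Bytes(s));
        buf.write(encoded);
        return buf.resizeCount;
    }

    /** Resizes triggered when reserving only what the buffer already has left. */
    static int resizesWithRemainingCapacity(String s) {
        byte[] encoded = s.getBytes(StandardCharsets.UTF_8);
        ReserveSketch buf = new ReserveSketch(encoded.length);
        buf.reserve(buf.remaining());
        buf.write(encoded);
        return buf.resizeCount;
    }

    public static void main(String[] args) {
        String text = "hello cassandra";
        System.out.println(resizesWithWorstCaseReserve(text));
        System.out.println(resizesWithRemainingCapacity(text));
    }
}
```

In this sketch the over-estimate forces an allocation and copy even though the encoded bytes would have fit, while reserving the remaining capacity never grows the buffer.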
   
   ```
   [java] Benchmark                                                  Mode  Cnt    Score    Error  Units
   [java] Utf8StringEncodeBench.writeLongText                        avgt    6  571.949 ± 19.791  ns/op
   [java] Utf8StringEncodeBench.writeLongTextWithExactSize           avgt    6  459.932 ± 27.790  ns/op
   [java] Utf8StringEncodeBench.writeLongTextWithExactSizeSkipCalc   avgt    6  216.085 ±  3.480  ns/op
   [java] Utf8StringEncodeBench.writeShortText                       avgt    6   62.775 ±  6.159  ns/op
   [java] Utf8StringEncodeBench.writeShortTextWithExactSize          avgt    6   44.071 ±  5.645  ns/op
   [java] Utf8StringEncodeBench.writeShortTextWithExactSizeSkipCalc  avgt    6   36.358 ±  5.135  ns/op
   ```
   
   - writeLongText: the original implementation, which calls `ByteBufUtils.writeUtf8`. It over-estimates the size of the string, which causes the buffer to be resized.
   - writeLongTextWithExactSize: calls `TypeSizes.encodeUTF8Length` to reserve exactly the number of bytes to write.
   - writeLongTextWithExactSizeSkipCalc: additionally skips the UTF-8 length calculation. Because `encodeSize` is calculated before a message is encoded, the size of the final bytes is already known, so we can simply reserve the remaining capacity.
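
For reference, the exact-size variant needs one pass over the string to count its UTF-8 bytes, and that pass is what the skip-calc variant avoids. The sketch below shows what such a count might look like. `Utf8LengthSketch.utf8Length` is a hypothetical stand-in written for illustration, not the implementation of `TypeSizes.encodeUTF8Length`, and it assumes well-formed input (no unpaired surrogates).

```java
import java.nio.charset.StandardCharsets;

public class Utf8LengthSketch {
    /**
     * Counts the UTF-8 bytes needed for s without encoding it.
     * Assumes well-formed UTF-16: an unpaired surrogate is counted as 3 bytes here,
     * which differs from String.getBytes, which substitutes a replacement byte.
     */
    static int utf8Length(String s) {
        int bytes = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                bytes += 1;                       // ASCII
            } else if (c < 0x800) {
                bytes += 2;                       // two-byte sequence
            } else if (Character.isHighSurrogate(c)
                       && i + 1 < s.length()
                       && Character.isLowSurrogate(s.charAt(i + 1))) {
                bytes += 4;                       // surrogate pair, one supplementary code point
                i++;                              // skip the low surrogate
            } else {
                bytes += 3;                       // rest of the Basic Multilingual Plane
            }
        }
        return bytes;
    }

    public static void main(String[] args) {
        String text = "caf\u00e9 \u20ac \uD83D\uDE00";
        // Both values should agree for well-formed input.
        System.out.println(utf8Length(text));
        System.out.println(text.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

The count is linear in the string length, so it adds work on every write unless the size is already available from an earlier step.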
