Posted to commits@cassandra.apache.org by "Yifan Cai (Jira)" <ji...@apache.org> on 2019/11/11 23:42:00 UTC

[jira] [Commented] (CASSANDRA-15410) Avoid over-allocation of bytes for UTF8 string serialization

    [ https://issues.apache.org/jira/browse/CASSANDRA-15410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971902#comment-16971902 ] 

Yifan Cai commented on CASSANDRA-15410:
---------------------------------------

||Branch||PR||Test||
|[CASSANDRA-15410|https://github.com/yifan-c/cassandra/tree/CASSANDRA-15410]|[PR|https://github.com/apache/cassandra/pull/382]|[Test|https://app.circleci.com/jobs/github/yifan-c/cassandra/56]|

Since the encodeSize is already calculated before encoding, we can leverage that size and safely reserve the remaining capacity for the write, avoiding resizing.
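
A minimal sketch of the idea, written directly against Netty's ByteBuf API (the class and method names below are illustrative, not necessarily what the linked PR does):

{code:java}
import io.netty.buffer.ByteBuf;
import io.netty.buffer.ByteBufUtil;

public final class Utf8WriteSketch
{
    // The frame buffer was already allocated with the precomputed encodeSize,
    // so the string is guaranteed to fit into the remaining writable bytes.
    public static void writeString(ByteBuf out, String s)
    {
        // ByteBufUtil.writeUtf8 reserves the worst case (3 bytes per char),
        // which can exceed the remaining capacity and force a resize plus copy.
        // reserveAndWriteUtf8 lets the caller pass its own reservation instead.
        ByteBufUtil.reserveAndWriteUtf8(out, s, out.writableBytes());
    }
}
{code}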

A set of benchmarks was run to show the difference. For the long text, the change reduces the string encoding time from 571.9 ns to 216.1 ns, i.e. by more than half. The encoding time for the short text is nearly halved as well, from 62.8 ns to 36.4 ns.

The improvement comes from avoiding the unnecessary resizing and data copying.

{code:java}
[java] Benchmark                                                  Mode  Cnt    Score    Error  Units
[java] Utf8StringEncodeBench.writeLongText                        avgt    6  571.949 ± 19.791  ns/op
[java] Utf8StringEncodeBench.writeLongTextWithExactSize           avgt    6  459.932 ± 27.790  ns/op
[java] Utf8StringEncodeBench.writeLongTextWithExactSizeSkipCalc   avgt    6  216.085 ±  3.480  ns/op
[java] Utf8StringEncodeBench.writeShortText                       avgt    6   62.775 ±  6.159  ns/op
[java] Utf8StringEncodeBench.writeShortTextWithExactSize          avgt    6   44.071 ±  5.645  ns/op
[java] Utf8StringEncodeBench.writeShortTextWithExactSizeSkipCalc  avgt    6   36.358 ±  5.135  ns/op
{code}

* writeLongText: the original implementation, which calls ByteBufUtil.writeUtf8. It over-estimates the size of the string, which causes the buffer to be resized.
* writeLongTextWithExactSize: calls TypeSizes.encodeUTF8Length to reserve exactly the number of bytes needed for the write.
* writeLongTextWithExactSizeSkipCalc: optimizes further by removing the UTF8 length calculation. Because the encodeSize is calculated before encoding a message, the size of the final bytes is already known, so we can simply reserve the remaining capacity (see the sketch after this list).
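
For reference, the three benchmarked paths roughly correspond to the following (a sketch using plain Netty calls; the actual benchmark class is not shown in this comment):

{code:java}
import io.netty.buffer.ByteBuf;
import io.netty.buffer.ByteBufUtil;

public final class EncodePathsSketch
{
    // writeLongText / writeShortText: worst-case reservation, 3 bytes per char.
    static void writeWorstCase(ByteBuf out, String s)
    {
        ByteBufUtil.writeUtf8(out, s);
    }

    // ...WithExactSize: compute the exact UTF8 length, then reserve exactly that.
    static void writeExactSize(ByteBuf out, String s)
    {
        ByteBufUtil.reserveAndWriteUtf8(out, s, ByteBufUtil.utf8Bytes(s));
    }

    // ...WithExactSizeSkipCalc: the length is already folded into encodeSize,
    // so just reserve whatever capacity is left in the pre-sized buffer.
    static void writeSkipCalc(ByteBuf out, String s)
    {
        ByteBufUtil.reserveAndWriteUtf8(out, s, out.writableBytes());
    }
}
{code}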


> Avoid over-allocation of bytes for UTF8 string serialization 
> -------------------------------------------------------------
>
>                 Key: CASSANDRA-15410
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15410
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Messaging/Client
>            Reporter: Yifan Cai
>            Priority: Normal
>
> The current message encoding implementation first calculates the `encodeSize` and allocates the bytebuffer with that size. 
> However, during encoding, it assumes the worst case when writing a UTF8 string, i.e. that each character takes 3 bytes. 
> The over-estimation then leads to resizing the underlying array and copying the data. 



