Posted to commits@cassandra.apache.org by "Yifan Cai (Jira)" <ji...@apache.org> on 2019/11/11 23:42:00 UTC
[jira] [Commented] (CASSANDRA-15410) Avoid over-allocation of bytes
for UTF8 string serialization
[ https://issues.apache.org/jira/browse/CASSANDRA-15410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971902#comment-16971902 ]
Yifan Cai commented on CASSANDRA-15410:
---------------------------------------
||Branch||PR||Test||
|[CASSANDRA-15410|https://github.com/yifan-c/cassandra/tree/CASSANDRA-15410]|[PR|https://github.com/apache/cassandra/pull/382]|[Test|https://app.circleci.com/jobs/github/yifan-c/cassandra/56]|
Because the encodeSize has already been calculated by the time a message is encoded, we can use that size and safely reserve the buffer's remaining capacity for the write, which avoids resizing.
A set of benchmarks was run to show the difference. For the long text, the change cuts the string encoding time by about 62%, from 571.9 ns to 216.1 ns. For the short text, the time drops by roughly 40%, from 62.8 ns to 36.4 ns.
The improvement comes from avoiding the unnecessary resizing and data copying.
{code:java}
[java] Benchmark                                                  Mode  Cnt    Score    Error  Units
[java] Utf8StringEncodeBench.writeLongText                        avgt    6  571.949 ± 19.791  ns/op
[java] Utf8StringEncodeBench.writeLongTextWithExactSize           avgt    6  459.932 ± 27.790  ns/op
[java] Utf8StringEncodeBench.writeLongTextWithExactSizeSkipCalc   avgt    6  216.085 ±  3.480  ns/op
[java] Utf8StringEncodeBench.writeShortText                       avgt    6   62.775 ±  6.159  ns/op
[java] Utf8StringEncodeBench.writeShortTextWithExactSize          avgt    6   44.071 ±  5.645  ns/op
[java] Utf8StringEncodeBench.writeShortTextWithExactSizeSkipCalc  avgt    6   36.358 ±  5.135  ns/op
{code}
* writeLongText: the original implementation, which calls ByteBufUtils.writeUtf8. It over-estimates the encoded size of the string, which causes the buffer to be resized.
* writeLongTextWithExactSize: calls TypeSizes.encodeUTF8Length to reserve exactly the number of bytes to be written.
* writeLongTextWithExactSizeSkipCalc: further optimizes by skipping the UTF8 length calculation. Because the encodeSize is calculated before a message is encoded, the size of the final bytes is already known, so we can simply reserve the buffer's remaining capacity.
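
To make the gap between the worst-case estimate and the exact size concrete, here is a minimal sketch. The class Utf8LengthSketch and its methods are hypothetical names made up for illustration; this is not the code of TypeSizes.encodeUTF8Length or of the patch, only a plain-JDK approximation of the idea. It assumes the worst case is 3 bytes per char, as described in the ticket, and counts an unpaired surrogate as 1 byte because the JDK encoder substitutes '?' for it.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only: not Cassandra's or Netty's implementation.
final class Utf8LengthSketch {

    /** Exact number of bytes needed to encode s as UTF-8, computed without allocating. */
    static int utf8Length(CharSequence s) {
        int bytes = 0;
        int len = s.length();
        for (int i = 0; i < len; i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                bytes += 1;                      // ASCII
            } else if (c < 0x800) {
                bytes += 2;                      // two-byte sequence
            } else if (Character.isSurrogate(c)) {
                if (Character.isHighSurrogate(c) && i + 1 < len
                        && Character.isLowSurrogate(s.charAt(i + 1))) {
                    bytes += 4;                  // valid surrogate pair, one code point
                    i++;                         // skip the low surrogate
                } else {
                    bytes += 1;                  // lone surrogate, encoded as '?'
                }
            } else {
                bytes += 3;                      // rest of the Basic Multilingual Plane
            }
        }
        return bytes;
    }

    /** Worst-case reservation described in the ticket: 3 bytes per char. */
    static int worstCaseLength(CharSequence s) {
        return 3 * s.length();
    }

    public static void main(String[] args) {
        String text = "hello, cassandra";
        int exact = utf8Length(text);
        // Cross-check against the JDK encoder.
        int encoded = text.getBytes(StandardCharsets.UTF_8).length;
        System.out.println("exact=" + exact
                + " encoded=" + encoded
                + " worstCase=" + worstCaseLength(text));
    }
}
```

For mostly-ASCII text the worst-case figure is about three times the exact one, so a buffer that was allocated with the exact encodeSize cannot satisfy the worst-case reservation without growing, which is where the resize and copy come from.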
> Avoid over-allocation of bytes for UTF8 string serialization
> -------------------------------------------------------------
>
> Key: CASSANDRA-15410
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15410
> Project: Cassandra
> Issue Type: Improvement
> Components: Messaging/Client
> Reporter: Yifan Cai
> Priority: Normal
>
> The current message encoding implementation first calculates the `encodeSize` and allocates a byte buffer of that size.
> However, during encoding it assumes the worst case when reserving bytes for a UTF8 string, i.e. that each character takes 3 bytes.
> This over-estimation in turn leads to resizing the underlying array and copying the data.