Posted to commits@cassandra.apache.org by "Alexey Zotov (Jira)" <ji...@apache.org> on 2021/04/04 21:01:00 UTC
[jira] [Comment Edited] (CASSANDRA-16360) CRC32 is inefficient on x86
[ https://issues.apache.org/jira/browse/CASSANDRA-16360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17314622#comment-17314622 ]
Alexey Zotov edited comment on CASSANDRA-16360 at 4/4/21, 9:00 PM:
-------------------------------------------------------------------
I've done some initial research on this task. It looks like _CRC32C_ hardware support is only available when intrinsic code is used. Here are a few reference tickets mentioning that:
* [https://bugs.openjdk.java.net/browse/JDK-8189745]
* [https://bugs.java.com/bugdatabase/view_bug.do?bug_id=8191328]
* [https://bugs.java.com/bugdatabase/view_bug.do?bug_id=8073583]
And it looks like intrinsic code is only available for the [native _CRC32C_ class|https://docs.oracle.com/javase/9/docs/api/java/util/zip/CRC32C.html], which was introduced in Java 9.
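For reference, here is a minimal sketch of how the two JDK classes are used through the common _Checksum_ interface. It assumes Java 9+ (because of _CRC32C_); the class and method names (ChecksumDemo, checksumOf) are made up for illustration and are not taken from the Cassandra code base.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.CRC32C;
import java.util.zip.Checksum;

// Illustrative only: ChecksumDemo and checksumOf are hypothetical names.
class ChecksumDemo {
    // Feeds the whole buffer to the given checksum and returns the
    // 32-bit result (stored in the low bits of a long).
    static long checksumOf(Checksum checksum, byte[] data) {
        checksum.reset();
        checksum.update(data, 0, data.length);
        return checksum.getValue();
    }

    public static void main(String[] args) {
        // "123456789" is the conventional check input for CRC algorithms.
        byte[] data = "123456789".getBytes(StandardCharsets.US_ASCII);
        System.out.printf("CRC32  = %08x%n", checksumOf(new CRC32(), data));   // cbf43926
        System.out.printf("CRC32C = %08x%n", checksumOf(new CRC32C(), data));  // e3069283
    }
}
```

Both classes implement the same interface, so swapping one for the other is a one-line change at the call site. The two algorithms use different polynomials, though, so the resulting values are not interchangeable.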
Here are the results of a microbenchmark comparing the native _CRC32_ and _CRC32C_ classes (run on Java 11.0.9):
{code:java}
[java] Benchmark                             (bufferSize)  Mode  Cnt     Score     Error  Units
[java] ChecksumBench.benchCrc32                        31  avgt    5    98.823 ±   2.667  ns/op
[java] ChecksumBench.benchCrc32                       131  avgt    5   133.014 ±   5.831  ns/op
[java] ChecksumBench.benchCrc32                       517  avgt    5   173.939 ±  14.456  ns/op
[java] ChecksumBench.benchCrc32                      2041  avgt    5   270.847 ±   7.071  ns/op
[java] ChecksumBench.benchCrc32NoIntrinsic             31  avgt    5   118.761 ±  77.032  ns/op
[java] ChecksumBench.benchCrc32NoIntrinsic            131  avgt    5   157.799 ± 377.481  ns/op
[java] ChecksumBench.benchCrc32NoIntrinsic            517  avgt    5   238.125 ± 900.150  ns/op
[java] ChecksumBench.benchCrc32NoIntrinsic           2041  avgt    5   276.828 ±   7.814  ns/op
[java] ChecksumBench.benchCrc32c                       31  avgt    5    50.619 ±   2.178  ns/op
[java] ChecksumBench.benchCrc32c                      131  avgt    5    69.229 ±   2.186  ns/op
[java] ChecksumBench.benchCrc32c                      517  avgt    5   190.943 ±   3.741  ns/op
[java] ChecksumBench.benchCrc32c                     2041  avgt    5   276.401 ±   4.161  ns/op
[java] ChecksumBench.benchCrc32cNoIntrinsic            31  avgt    5    56.111 ±   1.834  ns/op
[java] ChecksumBench.benchCrc32cNoIntrinsic           131  avgt    5    75.475 ±   1.912  ns/op
[java] ChecksumBench.benchCrc32cNoIntrinsic           517  avgt    5   196.557 ±   5.209  ns/op
[java] ChecksumBench.benchCrc32cNoIntrinsic          2041  avgt    5   281.095 ±   5.765  ns/op
{code}
Two obvious facts were reaffirmed by this run:
# intrinsic code is faster than non-intrinsic code
# _CRC32C_ is faster than _CRC32_ for small buffers (31 and 131 bytes); at 517 and 2041 bytes the two are roughly on par
----
As I mentioned before, the _CRC32C_ class is unavailable in Java 8. That means we cannot reference it directly in the code base, otherwise the code won't compile with Java 8. So there are two ways to proceed: 1) wait until Java 8 support is dropped and then start using _CRC32C_; 2) use _CRC32C_ when running on Java 9+, and otherwise fall back to some Java-based implementation. The second approach sounds reasonable to me, since we should target newer versions and faster solutions.
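A rough sketch of what the second approach could look like is below. It is not the draft change itself: Crc32cFactory and SimpleCrc32c are hypothetical names, and SimpleCrc32c is a plain byte-at-a-time table implementation standing in for a real fallback such as Snappy's or Guava's. The JDK class is looked up reflectively so that the code still compiles and runs on Java 8.

```java
import java.lang.reflect.Constructor;
import java.nio.charset.StandardCharsets;
import java.util.function.Supplier;
import java.util.zip.Checksum;

// Hypothetical factory: prefers java.util.zip.CRC32C (Java 9+) and
// falls back to a pure-Java implementation when the class is missing.
class Crc32cFactory {
    private static final Supplier<Checksum> SUPPLIER = pickSupplier();

    static Checksum create() {
        return SUPPLIER.get();
    }

    private static Supplier<Checksum> pickSupplier() {
        try {
            // Reflective lookup avoids a compile-time dependency on a Java 9+ class.
            Constructor<? extends Checksum> ctor = Class.forName("java.util.zip.CRC32C")
                                                        .asSubclass(Checksum.class)
                                                        .getConstructor();
            ctor.newInstance(); // probe once so failures surface here
            return () -> {
                try {
                    return ctor.newInstance();
                } catch (ReflectiveOperationException e) {
                    throw new IllegalStateException(e);
                }
            };
        } catch (ReflectiveOperationException | LinkageError e) {
            return SimpleCrc32c::new; // Java 8: class not found
        }
    }

    public static void main(String[] args) {
        byte[] data = "123456789".getBytes(StandardCharsets.US_ASCII);
        Checksum crc = create();
        crc.update(data, 0, data.length);
        System.out.printf("%s -> %08x%n", crc.getClass().getName(), crc.getValue());
    }
}

// Stand-in fallback: table-driven CRC32C (Castagnoli), one byte per step.
final class SimpleCrc32c implements Checksum {
    private static final int[] TABLE = new int[256];
    static {
        for (int i = 0; i < 256; i++) {
            int c = i;
            for (int k = 0; k < 8; k++)
                c = (c & 1) != 0 ? (c >>> 1) ^ 0x82F63B78 : c >>> 1; // reflected polynomial
            TABLE[i] = c;
        }
    }

    private int crc = 0xFFFFFFFF;

    @Override public void update(int b) {
        crc = (crc >>> 8) ^ TABLE[(crc ^ b) & 0xFF];
    }

    @Override public void update(byte[] buf, int off, int len) {
        for (int i = off; i < off + len; i++)
            update(buf[i]);
    }

    @Override public long getValue() {
        return (~crc) & 0xFFFFFFFFL; // final XOR, returned as an unsigned 32-bit value
    }

    @Override public void reset() {
        crc = 0xFFFFFFFF;
    }
}
```

Creating instances through reflection on every call adds some overhead; a real change would more likely resolve a MethodHandle once or reuse checksum instances. The sketch only shows the selection logic.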
So I tried two available Java-based implementations (Guava - _hasherCrc32c_ and Snappy - _pureJavaCrc32c_) in another microbenchmark (run on Java 11.0.9):
{code:java}
[java] Benchmark                          (bufferSize)  Mode  Cnt     Score     Error  Units
[java] ChecksumBench.benchCrc32                     31  avgt    5   113.076 ±  33.568  ns/op
[java] ChecksumBench.benchCrc32                    131  avgt    5    88.348 ±  11.433  ns/op
[java] ChecksumBench.benchCrc32                    517  avgt    5   175.464 ±  56.718  ns/op
[java] ChecksumBench.benchCrc32                   2041  avgt    5   273.945 ±  16.331  ns/op
[java] ChecksumBench.benchHasherCrc32c              31  avgt   25   150.860 ±   2.883  ns/op
[java] ChecksumBench.benchHasherCrc32c             131  avgt   25   559.423 ± 141.932  ns/op
[java] ChecksumBench.benchHasherCrc32c             517  avgt   25  1599.504 ±  82.253  ns/op
[java] ChecksumBench.benchHasherCrc32c            2041  avgt   25  5988.707 ± 103.558  ns/op
[java] ChecksumBench.benchPureJavaCrc32c            31  avgt   25    99.486 ±   1.464  ns/op
[java] ChecksumBench.benchPureJavaCrc32c           131  avgt   25   252.278 ±  20.278  ns/op
[java] ChecksumBench.benchPureJavaCrc32c           517  avgt   25   822.142 ±  38.860  ns/op
[java] ChecksumBench.benchPureJavaCrc32c          2041  avgt   25  3083.762 ± 200.493  ns/op
{code}
This test shows that:
# Snappy's implementation outperforms Guava's - which is not really important, since we can use either of them
# Both are slower than the native _CRC32C_ implementation with intrinsics - which is expected
# Both are slower than the native _CRC32C_ implementation without intrinsics - which is totally fine
# Both are slower than the native _CRC32_ implementation - *which appears to be a problem*
Basically that means that the users who use Java 9+ would benefit, whereas the users who use Java 8 would suffer from this change.
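To put numbers on the Java 8 case, the slowdown relative to native _CRC32_ can be computed directly from the scores in the second table. The sketch below only does that arithmetic; the class name SlowdownTable is made up, and the error margins from the table are ignored.

```java
import java.util.Locale;

// Illustrative arithmetic over the benchmark scores quoted above (ns/op).
class SlowdownTable {
    static final int[] BUFFER_SIZES = {31, 131, 517, 2041};
    static final double[] CRC32  = {113.076, 88.348, 175.464, 273.945};
    static final double[] GUAVA  = {150.860, 559.423, 1599.504, 5988.707};
    static final double[] SNAPPY = {99.486, 252.278, 822.142, 3083.762};

    // How many times slower (values above 1) the candidate is than the baseline.
    static double ratio(double candidate, double baseline) {
        return candidate / baseline;
    }

    public static void main(String[] args) {
        for (int i = 0; i < BUFFER_SIZES.length; i++) {
            System.out.printf(Locale.ROOT, "%4d bytes: Guava %.1fx, Snappy %.1fx%n",
                              BUFFER_SIZES[i],
                              ratio(GUAVA[i], CRC32[i]),
                              ratio(SNAPPY[i], CRC32[i]));
        }
    }
}
```

By these scores, Snappy's implementation is roughly on par with native _CRC32_ at 31 bytes, and the gap grows with buffer size to about 11x for Snappy and about 22x for Guava at 2041 bytes.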
Here I need some input/feedback from the maintainers/community on whether we should proceed with this change or not (I have some draft changes though). Maybe [~tjake] and [~benedict] can help with the decision since they worked on the microbenchmarks and performance-related functionality in C*.
----
On a separate note, Hadoop used to use Snappy's Java-based implementation. They then migrated to the native CRC32C implementation and kept Snappy's Java-based implementation as a fallback for when Java 8 is used. Here are the details: [https://github.com/apache/hadoop/pull/291]. We can do something similar, if we're fine with degraded performance for Java 8 users.
----
The benchmark I used is available here (see the ChecksumBench class): [https://github.com/apache/cassandra/pull/951]. I feel that PR can be merged because it contains only benchmark changes, which seem useful for further development.
> CRC32 is inefficient on x86
> ---------------------------
>
> Key: CASSANDRA-16360
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16360
> Project: Cassandra
> Issue Type: Improvement
> Components: Messaging/Client
> Reporter: Avi Kivity
> Priority: Normal
> Labels: protocolv5
> Fix For: 4.0.x
>
>
> The client/server protocol specifies CRC24 and CRC32 as the checksum algorithm (cql_protocol_V5_framing.asc). Those however are expensive to compute; this affects both the client and the server.
>
> A better checksum algorithm is CRC32C, which has hardware support on x86 (as well as other modern architectures).