You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@cassandra.apache.org by "Alexey Zotov (Jira)" <ji...@apache.org> on 2021/04/04 21:01:00 UTC

[jira] [Comment Edited] (CASSANDRA-16360) CRC32 is inefficient on x86

    [ https://issues.apache.org/jira/browse/CASSANDRA-16360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17314622#comment-17314622 ] 

Alexey Zotov edited comment on CASSANDRA-16360 at 4/4/21, 9:00 PM:
-------------------------------------------------------------------

I've made an initial research on this task. Looks like the _CRC32C_ hardware support is only available when intrinsic code is used. Here are a few reference tickets mentioning that:
 * [https://bugs.openjdk.java.net/browse/JDK-8189745]
 * [https://bugs.java.com/bugdatabase/view_bug.do?bug_id=8191328]
 * [https://bugs.java.com/bugdatabase/view_bug.do?bug_id=8073583]

And it looks like intrinsic code is only available for the [native _CRC32C_ class|[https://docs.oracle.com/javase/9/docs/api/java/util/zip/CRC32C.html]] which comes with Java 9.

Here are the results of a microbenchmark test for native _CRC32_ and _CRC32C_ classes (run was made using Java 11.0.9): 
{code:java}
[java] Benchmark                             (bufferSize)  Mode  Cnt     Score    Error   Units
[java] ChecksumBench.benchCrc32                        31  avgt    5    98.823 ±   2.667  ns/op
[java] ChecksumBench.benchCrc32                       131  avgt    5   133.014 ±   5.831  ns/op
[java] ChecksumBench.benchCrc32                       517  avgt    5   173.939 ±  14.456  ns/op
[java] ChecksumBench.benchCrc32                      2041  avgt    5   270.847 ±   7.071  ns/op
[java] ChecksumBench.benchCrc32NoIntrinsic             31  avgt    5   118.761 ±  77.032  ns/op
[java] ChecksumBench.benchCrc32NoIntrinsic            131  avgt    5   157.799 ± 377.481  ns/op
[java] ChecksumBench.benchCrc32NoIntrinsic            517  avgt    5   238.125 ± 900.150  ns/op
[java] ChecksumBench.benchCrc32NoIntrinsic           2041  avgt    5   276.828 ±   7.814  ns/op
[java] ChecksumBench.benchCrc32c                       31  avgt    5    50.619 ±   2.178  ns/op
[java] ChecksumBench.benchCrc32c                      131  avgt    5    69.229 ±   2.186  ns/op
[java] ChecksumBench.benchCrc32c                      517  avgt    5   190.943 ±   3.741  ns/op
[java] ChecksumBench.benchCrc32c                     2041  avgt    5   276.401 ±   4.161  ns/op
[java] ChecksumBench.benchCrc32cNoIntrinsic            31  avgt    5    56.111 ±   1.834  ns/op
[java] ChecksumBench.benchCrc32cNoIntrinsic           131  avgt    5    75.475 ±   1.912  ns/op
[java] ChecksumBench.benchCrc32cNoIntrinsic           517  avgt    5   196.557 ±   5.209  ns/op
[java] ChecksumBench.benchCrc32cNoIntrinsic          2041  avgt    5   281.095 ±   5.765  ns/op
{code}
Two obvious facts were reaffirmed by this run:
 # intrinsic code works faster than non-intrinsic
 # _CRC32C_ works faster than _CRC32_

----
As I mentioned before, _CRC32C_ class is unavailable in Java 8. That means we cannot directly use that in the code base, otherwise it won't compile with Java 8. So there are two ways to proceed: 1) we wait until Java 8 support is abandoned and start using _CRC32C_ 2) if Java 9+ is used then we use _CRC32C_ , otherwise we fallback to some Java-based implementation. The second approach sounds reasonable to me since we need to target for newer versions and faster solutions.

So I've given a try to two available Java-based implementations (Guava - _hasherCrc32c_ and Snappy - _pureJavaCrc32c_) in another microbenchmark test (run was made using Java 11.0.9):
{code:java}
[java] Benchmark                          (bufferSize)  Mode  Cnt     Score     Error  Units
[java] ChecksumBench.benchCrc32                     31  avgt    5   113.076 ±  33.568  ns/op
[java] ChecksumBench.benchCrc32                    131  avgt    5    88.348 ±  11.433  ns/op
[java] ChecksumBench.benchCrc32                    517  avgt    5   175.464 ±  56.718  ns/op
[java] ChecksumBench.benchCrc32                   2041  avgt    5   273.945 ±  16.331  ns/op
[java] ChecksumBench.benchHasherCrc32c              31  avgt   25   150.860 ±   2.883  ns/op
[java] ChecksumBench.benchHasherCrc32c             131  avgt   25   559.423 ± 141.932  ns/op
[java] ChecksumBench.benchHasherCrc32c             517  avgt   25  1599.504 ±  82.253  ns/op
[java] ChecksumBench.benchHasherCrc32c            2041  avgt   25  5988.707 ± 103.558  ns/op
[java] ChecksumBench.benchPureJavaCrc32c            31  avgt   25    99.486 ±   1.464  ns/op
[java] ChecksumBench.benchPureJavaCrc32c           131  avgt   25   252.278 ±  20.278  ns/op
[java] ChecksumBench.benchPureJavaCrc32c           517  avgt   25   822.142 ±  38.860  ns/op
[java] ChecksumBench.benchPureJavaCrc32c          2041  avgt   25  3083.762 ± 200.493  ns/op

{code}
This test shows that:
 # Snappy's implementation outperforms Guava's  - which basically is not really important, we can use either of them
 # They both work slower than native _CRC32C_ implementation with intrinsic - which is actually expected
 # They both work slower than native _CRC32C_ implementation without intrinsic - which is totally fine
 # They both work slower than native _CRC32_ implementation - *which appears to be a problem*

 Basically that means that the users who use Java 9+ would benefit, whereas the users who use Java 8 would suffer from this change.

Here I need some input/feedback from the maintainers/community whether we need to * *proceed with this change or not (I have some draft changes though). Maybe [~tjake] and [~benedict] can help with the decision since they worked on the microbenchmarks and performance-related functionality in C*.
----
On a separate note, Hadoop used Snappy's Java-based implementation. Then they migrated to native CRC32C implementation and started using Snappy's Java-based implementation as a fallback option if Java 8 is used. Here are the details:  [https://github.com/apache/hadoop/pull/291|https://github.com/apache/hadoop/pull/291.]. We can do something similar, if we're fine that for Java 8 users performance will slightly degrade.
----
The benchmark I used is available here (see ChecksumBench class):  [https://github.com/apache/cassandra/pull/951]. I feel that PR can be merged because it contains only benchmark changes which seem to be useful for further developments.


was (Author: azotcsit):
I've made an initial research on this task. Looks like the _CRC32C_ hardware support is only available when intrinsic code is used. Here are a few reference tickets mentioning that:
 * [https://bugs.openjdk.java.net/browse/JDK-8189745]
 * [https://bugs.java.com/bugdatabase/view_bug.do?bug_id=8191328]
 * [https://bugs.java.com/bugdatabase/view_bug.do?bug_id=8073583]

And it looks like intrinsic code is only available for the [native _CRC32C_ class|[https://docs.oracle.com/javase/9/docs/api/java/util/zip/CRC32C.html]] which comes with Java 9.

Here are the results of a microbenchmark test for native _CRC32_ and _CRC32C_ classes (run was made using Java 11.0.9):

 
{code:java}
[java] Benchmark                             (bufferSize)  Mode  Cnt     Score    Error   Units
[java] ChecksumBench.benchCrc32                        31  avgt    5    98.823 ±   2.667  ns/op
[java] ChecksumBench.benchCrc32                       131  avgt    5   133.014 ±   5.831  ns/op
[java] ChecksumBench.benchCrc32                       517  avgt    5   173.939 ±  14.456  ns/op
[java] ChecksumBench.benchCrc32                      2041  avgt    5   270.847 ±   7.071  ns/op
[java] ChecksumBench.benchCrc32NoIntrinsic             31  avgt    5   118.761 ±  77.032  ns/op
[java] ChecksumBench.benchCrc32NoIntrinsic            131  avgt    5   157.799 ± 377.481  ns/op
[java] ChecksumBench.benchCrc32NoIntrinsic            517  avgt    5   238.125 ± 900.150  ns/op
[java] ChecksumBench.benchCrc32NoIntrinsic           2041  avgt    5   276.828 ±   7.814  ns/op
[java] ChecksumBench.benchCrc32c                       31  avgt    5    50.619 ±   2.178  ns/op
[java] ChecksumBench.benchCrc32c                      131  avgt    5    69.229 ±   2.186  ns/op
[java] ChecksumBench.benchCrc32c                      517  avgt    5   190.943 ±   3.741  ns/op
[java] ChecksumBench.benchCrc32c                     2041  avgt    5   276.401 ±   4.161  ns/op
[java] ChecksumBench.benchCrc32cNoIntrinsic            31  avgt    5    56.111 ±   1.834  ns/op
[java] ChecksumBench.benchCrc32cNoIntrinsic           131  avgt    5    75.475 ±   1.912  ns/op
[java] ChecksumBench.benchCrc32cNoIntrinsic           517  avgt    5   196.557 ±   5.209  ns/op
[java] ChecksumBench.benchCrc32cNoIntrinsic          2041  avgt    5   281.095 ±   5.765  ns/op
{code}
Two obvious facts were reaffirmed by this run:
 # intrinsic code works faster than non-intrinsic
 # _CRC32C_ works faster than _CRC32_

 
----
As I mentioned before, _CRC32C_ class is unavailable in Java 8. That means we cannot directly use that in the code base, otherwise it won't compile with Java 8. So there are two ways to proceed: 1) we wait until Java 8 support is abandoned and start using _CRC32C_ 2) if Java 9+ is used then we use _CRC32C_ , otherwise we fallback to some Java-based implementation. The second approach sounds reasonable to me since we need to target for newer versions and faster solutions.

So I've given a try to two available Java-based implementations (Guava - _hasherCrc32c_ and Snappy - _pureJavaCrc32c_) in another microbenchmark test (run was made using Java 11.0.9):

 
{code:java}
[java] Benchmark                          (bufferSize)  Mode  Cnt     Score     Error  Units
[java] ChecksumBench.benchCrc32                     31  avgt    5   113.076 ±  33.568  ns/op
[java] ChecksumBench.benchCrc32                    131  avgt    5    88.348 ±  11.433  ns/op
[java] ChecksumBench.benchCrc32                    517  avgt    5   175.464 ±  56.718  ns/op
[java] ChecksumBench.benchCrc32                   2041  avgt    5   273.945 ±  16.331  ns/op
[java] ChecksumBench.benchHasherCrc32c              31  avgt   25   150.860 ±   2.883  ns/op
[java] ChecksumBench.benchHasherCrc32c             131  avgt   25   559.423 ± 141.932  ns/op
[java] ChecksumBench.benchHasherCrc32c             517  avgt   25  1599.504 ±  82.253  ns/op
[java] ChecksumBench.benchHasherCrc32c            2041  avgt   25  5988.707 ± 103.558  ns/op
[java] ChecksumBench.benchPureJavaCrc32c            31  avgt   25    99.486 ±   1.464  ns/op
[java] ChecksumBench.benchPureJavaCrc32c           131  avgt   25   252.278 ±  20.278  ns/op
[java] ChecksumBench.benchPureJavaCrc32c           517  avgt   25   822.142 ±  38.860  ns/op
[java] ChecksumBench.benchPureJavaCrc32c          2041  avgt   25  3083.762 ± 200.493  ns/op

{code}
This test shows that:

 
 # Snappy's implementation outperforms Guava's  - which basically is not really important, we can use either of them
 # They both work slower than native _CRC32C_ implementation with intrinsic - which is actually expected
 # They both work slower than native _CRC32C_ implementation without intrinsic - which is totally fine
 # They both work slower than native _CRC32_ implementation - *which appears to be a problem*

 

Basically that means that the users who use Java 9+ would benefit, whereas the users who use Java 8 would suffer from this change.

Here I need some input/feedback from the maintainers/community whether we need to ** proceed with this change or not (I have some draft changes though). Maybe [~tjake] and [~benedict] can help with the decision since they worked on the microbenchmarks and performance-related functionality in C*.
----
On a separate note, Hadoop used Snappy's Java-based implementation. Then they migrated to native CRC32C implementation and started using Snappy's Java-based implementation as a fallback option if Java 8 is used. Here are the details:  [https://github.com/apache/hadoop/pull/291|https://github.com/apache/hadoop/pull/291.]. We can do something similar, if we're fine that for Java 8 users performance will slightly degrade.
----
 

 

The benchmark I used is available here (see ChecksumBench class):  [https://github.com/apache/cassandra/pull/951]. I feel that PR can be merged because it contains only benchmark changes which seem to be useful for further developments.

> CRC32 is inefficient on x86
> ---------------------------
>
>                 Key: CASSANDRA-16360
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16360
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Messaging/Client
>            Reporter: Avi Kivity
>            Priority: Normal
>              Labels: protocolv5
>             Fix For: 4.0.x
>
>
> The client/server protocol specifies CRC24 and CRC32 as the checksum algorithm (cql_protocol_V5_framing.asc). Those however are expensive to compute; this affects both the client and the server.
>  
> A better checksum algorithm is CRC32C, which has hardware support on x86 (as well as other modern architectures).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@cassandra.apache.org
For additional commands, e-mail: commits-help@cassandra.apache.org