Posted to common-dev@hadoop.apache.org by "Owen O'Malley (JIRA)" <ji...@apache.org> on 2009/03/31 14:19:51 UTC

[jira] Created: (HADOOP-5598) Implement a pure Java CRC32 calculator

Implement a pure Java CRC32 calculator
--------------------------------------

                 Key: HADOOP-5598
                 URL: https://issues.apache.org/jira/browse/HADOOP-5598
             Project: Hadoop Core
          Issue Type: Improvement
          Components: dfs
            Reporter: Owen O'Malley
            Assignee: Owen O'Malley


We've seen single-replica HDFS writes spending too much of their time in CRC32 calculation. On a 200 MB write, the CRC is taking 5 of the 6 seconds total.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-5598) Implement a pure Java CRC32 calculator

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12722132#action_12722132 ] 

Todd Lipcon commented on HADOOP-5598:
-------------------------------------

Thanks, Owen. I definitely see your point about the JNI code possibly locking or copying. The API we're using says it tries to avoid doing that when possible, but I guess for large blocks, heap fragmentation is quite likely and we'd run into that issue.

I'm going on vacation this next week, so I'm unlikely to work on it, but if you have a moment to upload your implementation for comparison I'll be back on this the week after next.

> Implement a pure Java CRC32 calculator
> --------------------------------------
>
>                 Key: HADOOP-5598
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5598
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: dfs
>            Reporter: Owen O'Malley
>            Assignee: Todd Lipcon
>         Attachments: crc32-results.txt, hadoop-5598-evil.txt, hadoop-5598-hybrid.txt, hadoop-5598.txt, hadoop-5598.txt, PureJavaCrc32.java, PureJavaCrc32.java, PureJavaCrc32.java, TestCrc32Performance.java, TestCrc32Performance.java, TestCrc32Performance.java, TestPureJavaCrc32.java
>
>
> We've seen a reducer writing 200MB to HDFS with replication = 1 spending a long time in crc calculation. In particular, it was spending 5 seconds in crc calculation out of a total of 6 for the write. I suspect that it is the java-jni border that is causing us grief.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-5598) Implement a pure Java CRC32 calculator

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719232#action_12719232 ] 

dhruba borthakur commented on HADOOP-5598:
------------------------------------------

Looking at the results, it appears that the choice of CRC algorithm, CRC data size, and JVM can produce pretty varied performance numbers. Maybe it would be worthwhile to make ChecksumFileSystem pick a Checksum object based on a config parameter.
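
For illustration, config-driven selection could be as simple as the sketch below; the config key and factory class names are hypothetical, not part of any attached patch.

{code:java}
import java.util.zip.CRC32;
import java.util.zip.Checksum;

import org.apache.hadoop.conf.Configuration;

// Hypothetical factory: pick the Checksum implementation from a config key.
// The key name "fs.checksum.implementation" is made up for illustration only.
public class Crc32ChecksumFactory {
  public static Checksum newCrc32(Configuration conf) {
    String impl = conf.get("fs.checksum.implementation", "native");
    if ("pure-java".equals(impl)) {
      return new PureJavaCrc32();   // the pure Java implementation attached here
    }
    return new CRC32();             // the built-in JNI-backed java.util.zip.CRC32
  }
}
{code}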

Also, the hybrid model of dynamically deciding which algorithm to use (based on the size of the data to checksum) sounds a little scary to me :-)

> Implement a pure Java CRC32 calculator
> --------------------------------------
>
>                 Key: HADOOP-5598
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5598
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: dfs
>            Reporter: Owen O'Malley
>            Assignee: Todd Lipcon
>         Attachments: crc32-results.txt, hadoop-5598-hybrid.txt, hadoop-5598.txt, TestCrc32Performance.java, TestCrc32Performance.java
>
>
> We've seen a reducer writing 200MB to HDFS with replication = 1 spending a long time in crc calculation. In particular, it was spending 5 seconds in crc calculation out of a total of 6 for the write. I suspect that it is the java-jni border that is causing us grief.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-5598) Implement a pure Java CRC32 calculator

Posted by "Scott Carey (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-5598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Scott Carey updated HADOOP-5598:
--------------------------------

    Attachment: PureJavaCrc32.java

This version of PureJavaCrc32.java significantly beats, or equals, the native implementation. For small chunks it is far faster; for large chunks it is roughly equal.

This version is simply a term expansion of the previous one, and it operates on four bytes at a time.

There is a way to nearly double throughput once more for large chunks, but it is proving tricky to nail down correctly.




> Implement a pure Java CRC32 calculator
> --------------------------------------
>
>                 Key: HADOOP-5598
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5598
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: dfs
>            Reporter: Owen O'Malley
>            Assignee: Todd Lipcon
>         Attachments: crc32-results.txt, hadoop-5598-evil.txt, hadoop-5598-hybrid.txt, hadoop-5598.txt, hadoop-5598.txt, PureJavaCrc32.java, TestCrc32Performance.java, TestCrc32Performance.java
>
>
> We've seen a reducer writing 200MB to HDFS with replication = 1 spending a long time in crc calculation. In particular, it was spending 5 seconds in crc calculation out of a total of 6 for the write. I suspect that it is the java-jni border that is causing us grief.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-5598) Implement a pure Java CRC32 calculator

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719234#action_12719234 ] 

Todd Lipcon commented on HADOOP-5598:
-------------------------------------

The other solution that would solve this whole issue is to simply not checksum the data in FSOutputSummer until an entire chunk has been buffered up. This would bring the "write size" up to io.bytes.per.checksum (512 by default), a size at which java.util.zip.CRC32 performs better than the pure Java implementation. Doug expressed some concern over that idea in HADOOP-5318, I guess because data can sit in memory for an arbitrarily long time before getting checksummed, but I would imagine most people are using ECC RAM anyway :)
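
A rough sketch of that buffering idea (not the actual FSOutputSummer code): accumulate bytes until a full io.bytes.per.checksum chunk is available, and only then feed it to the Checksum in one call.

{code:java}
import java.util.zip.Checksum;

// Illustration only: buffer writes so the Checksum always sees full chunks
// (512 bytes by default) instead of many tiny update() calls crossing JNI.
class BufferedSummer {
  private final Checksum sum;
  private final byte[] buf;
  private int count;

  BufferedSummer(Checksum sum, int bytesPerChecksum) {
    this.sum = sum;
    this.buf = new byte[bytesPerChecksum];
  }

  void write(byte[] b, int off, int len) {
    while (len > 0) {
      int n = Math.min(len, buf.length - count);
      System.arraycopy(b, off, buf, count, n);
      count += n;
      off += n;
      len -= n;
      if (count == buf.length) {
        sum.update(buf, 0, count);   // one large update instead of many small ones
        count = 0;                   // real code would also emit the chunk + checksum here
      }
    }
  }
}
{code}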

> Implement a pure Java CRC32 calculator
> --------------------------------------
>
>                 Key: HADOOP-5598
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5598
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: dfs
>            Reporter: Owen O'Malley
>            Assignee: Todd Lipcon
>         Attachments: crc32-results.txt, hadoop-5598-hybrid.txt, hadoop-5598.txt, TestCrc32Performance.java, TestCrc32Performance.java
>
>
> We've seen a reducer writing 200MB to HDFS with replication = 1 spending a long time in crc calculation. In particular, it was spending 5 seconds in crc calculation out of a total of 6 for the write. I suspect that it is the java-jni border that is causing us grief.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-5598) Implement a pure Java CRC32 calculator

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-5598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated HADOOP-5598:
--------------------------------

    Attachment: hadoop-5598-evil.txt

Here's another approach, which is exceedingly evil, but also "best of both worlds" in terms of speed. It switches between the pure Java CRC32 and the built-in CRC32, but avoids the slow crc_combine function. It does so by cheating: using reflection to set the internal (private) crc member of java.util.zip.CRC32.

This is probably not a good idea to actually use, but it does show an "upper bound" in terms of performance. In benchmarks, this "fast" version is always as fast as the faster of the pure Java and built-in CRC32 routines. If the reflection fails, it falls back to the pure Java implementation for everything.
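
For the curious, the reflection trick presumably amounts to something like the sketch below; the private field name is JDK-specific, so treat this purely as an illustration rather than the attached patch.

{code:java}
import java.lang.reflect.Field;
import java.util.zip.CRC32;

// Sketch of the "evil" trick: push an externally computed CRC register value
// into java.util.zip.CRC32 by writing its private "crc" field via reflection.
// The field name depends on the JDK; on failure the caller falls back to pure Java.
final class Crc32Injector {
  static boolean inject(CRC32 target, int crcValue) {
    try {
      Field f = CRC32.class.getDeclaredField("crc");
      f.setAccessible(true);
      f.setInt(target, crcValue);
      return true;
    } catch (Exception e) {
      return false;   // reflection unavailable; use the pure Java path instead
    }
  }
}
{code}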

Owen and Arun: could you comment on the real-life workload where you saw this as an issue? I would imagine most MR workloads don't have this problem, since they're spilling out of an in-memory buffer and can therefore write large chunks.

> Implement a pure Java CRC32 calculator
> --------------------------------------
>
>                 Key: HADOOP-5598
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5598
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: dfs
>            Reporter: Owen O'Malley
>            Assignee: Todd Lipcon
>         Attachments: crc32-results.txt, hadoop-5598-evil.txt, hadoop-5598-hybrid.txt, hadoop-5598.txt, TestCrc32Performance.java, TestCrc32Performance.java
>
>
> We've seen a reducer writing 200MB to HDFS with replication = 1 spending a long time in crc calculation. In particular, it was spending 5 seconds in crc calculation out of a total of 6 for the write. I suspect that it is the java-jni border that is causing us grief.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-5598) Implement a pure Java CRC32 calculator

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721804#action_12721804 ] 

Owen O'Malley commented on HADOOP-5598:
---------------------------------------

I should have commented earlier on this. I think the right solution is to use a pure Java implementation if we can get the performance comparable in the "normal" case. If we use a C implementation in libhadoop, it should use DirectByteBuffers and pool those buffers. Furthermore, that should be a different JIRA, since there are a lot more issues there.

I'd also veto any code that dynamically switches implementations based on anything other than whether libhadoop is present. (i.e. switching based on the size of the input is going to be unmaintainable)

I can upload the code that I wrote for the pure Java version, if you want to see a third implementation.
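
To be concrete, the kind of switch described above would be a one-time choice along these lines; NativeCodeLoader is the existing Hadoop helper, while NativeCrc32 is a purely hypothetical zlib-backed wrapper.

{code:java}
import java.util.zip.Checksum;

import org.apache.hadoop.util.NativeCodeLoader;

// Sketch: choose the implementation once, based solely on libhadoop presence.
// NativeCrc32 is a placeholder for a would-be JNI wrapper around zlib's crc32.
public class Crc32Factory {
  public static Checksum create() {
    if (NativeCodeLoader.isNativeCodeLoaded()) {
      return new NativeCrc32();      // hypothetical native (zlib) implementation
    }
    return new PureJavaCrc32();      // pure Java implementation attached to this issue
  }
}
{code}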

> Implement a pure Java CRC32 calculator
> --------------------------------------
>
>                 Key: HADOOP-5598
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5598
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: dfs
>            Reporter: Owen O'Malley
>            Assignee: Todd Lipcon
>         Attachments: crc32-results.txt, hadoop-5598-evil.txt, hadoop-5598-hybrid.txt, hadoop-5598.txt, hadoop-5598.txt, PureJavaCrc32.java, PureJavaCrc32.java, PureJavaCrc32.java, TestCrc32Performance.java, TestCrc32Performance.java, TestCrc32Performance.java, TestPureJavaCrc32.java
>
>
> We've seen a reducer writing 200MB to HDFS with replication = 1 spending a long time in crc calculation. In particular, it was spending 5 seconds in crc calculation out of a total of 6 for the write. I suspect that it is the java-jni border that is causing us grief.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-5598) Implement a pure Java CRC32 calculator

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720996#action_12720996 ] 

Todd Lipcon commented on HADOOP-5598:
-------------------------------------

Scott: I just tried your version and was unable to get the same performance improvements. I think we've established that the pure Java definitely wins on small blocks. For large blocks, I'm seeing the following on my laptop (64-bit, with 64-bit JRE):

My most recent non-evil pure Java: 250 MB/sec
Scott's patch that unrolls the loop: 260-280 MB/sec
Sun Java 1.6 update 14: 333 MB/sec
OpenJDK 1.6: 795 MB/sec

The OpenJDK implementation is simply wrapping zlib's crc32 routine, which must be highly optimized. Given that we already have a JNI library for native compression using zlib, I'd like to simply add a stub to libhadoop that wraps zlib's crc32. That should give us the same ~800 MB/sec throughput for large blocks. Since we can implement the stub ourselves, we also retain the ability to switch to pure Java for small sizes and get the 20x speedup, with no adversarial workloads that cause bad performance. On systems where the native code isn't available, we can simply use the pure Java version for all sizes, since at worst it's only slightly slower than java.util.zip.CRC32 and at best it's 30x faster.

I imagine that most production systems are using libhadoop, or at least could easily get this deployed if it was shown to have significant performance benefits.

I'll upload a patch later this evening for this.
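
The Java side of such a stub could be as small as the sketch below; the class and method names are hypothetical, since no such stub is attached yet, and the actual zlib call would live in libhadoop's C code.

{code:java}
// Hypothetical Java-side declaration for a libhadoop stub wrapping zlib's crc32.
// Names are illustrative only; the C side would simply call crc32(crc, buf+off, len).
public class NativeCrc32Stub {
  static {
    System.loadLibrary("hadoop");   // libhadoop, which already links against zlib
  }

  /** Returns the updated CRC after feeding buf[off..off+len) to zlib's crc32. */
  public static native int update(int crc, byte[] buf, int off, int len);
}
{code}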

> Implement a pure Java CRC32 calculator
> --------------------------------------
>
>                 Key: HADOOP-5598
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5598
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: dfs
>            Reporter: Owen O'Malley
>            Assignee: Todd Lipcon
>         Attachments: crc32-results.txt, hadoop-5598-evil.txt, hadoop-5598-hybrid.txt, hadoop-5598.txt, hadoop-5598.txt, PureJavaCrc32.java, TestCrc32Performance.java, TestCrc32Performance.java
>
>
> We've seen a reducer writing 200MB to HDFS with replication = 1 spending a long time in crc calculation. In particular, it was spending 5 seconds in crc calculation out of a total of 6 for the write. I suspect that it is the java-jni border that is causing us grief.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-5598) Implement a pure Java CRC32 calculator

Posted by "Scott Carey (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-5598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Scott Carey updated HADOOP-5598:
--------------------------------

    Attachment: PureJavaCrc32.java

This version does some algebraic manipulation and comes out about 10% faster than the native implementation on large blocks on my machine (Java 1.6, Mac OS X, 64-bit, 2.5 GHz Core 2 Duo).

pure java	16MB block:   397.516 MB/sec
sun native 16MB block:  337.731 MB/sec

This version uses the same lookup table as the previous, occupying 1KB.

I have another pure java version that uses four lookup tables (4KB) that I will be posting shortly after I clean it up.

Its results for large blocks are:

pure java	16MB block:   624.390 MB/sec
sun native 16MB block:  342.246 MB/sec

It first breaks 600 MB/sec at a block size of 128 bytes and is over 520 MB/sec at a block size of 32 bytes.


A big remaining question is performance under concurrency.  The larger lookup table footprint may bring this version down a little.
Any version calling out to native code may also slow under concurrency.
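
For context, a four-table ("slicing-by-4") inner loop of the sort described above generally looks like the following. This is a generic sketch of the technique, not the attached PureJavaCrc32.java.

{code:java}
// Generic slicing-by-4 CRC-32 (polynomial 0xEDB88320, as in java.util.zip.CRC32).
// Four 256-entry tables (4 KB total); processes four input bytes per iteration.
final class SlicingBy4Crc32 {
  private static final int[][] T = new int[4][256];
  static {
    for (int i = 0; i < 256; i++) {
      int c = i;
      for (int k = 0; k < 8; k++) {
        c = ((c & 1) != 0) ? (c >>> 1) ^ 0xEDB88320 : c >>> 1;
      }
      T[0][i] = c;
    }
    for (int i = 0; i < 256; i++) {
      for (int k = 1; k < 4; k++) {
        T[k][i] = (T[k - 1][i] >>> 8) ^ T[0][T[k - 1][i] & 0xFF];
      }
    }
  }

  /** crc is the inverted register (start from 0xFFFFFFFF, invert at the end). */
  static int update(int crc, byte[] b, int off, int len) {
    int i = off;
    final int end4 = off + (len & ~3);
    while (i < end4) {
      crc ^= (b[i] & 0xFF) | ((b[i + 1] & 0xFF) << 8)
           | ((b[i + 2] & 0xFF) << 16) | ((b[i + 3] & 0xFF) << 24);
      crc = T[3][crc & 0xFF] ^ T[2][(crc >>> 8) & 0xFF]
          ^ T[1][(crc >>> 16) & 0xFF] ^ T[0][crc >>> 24];
      i += 4;
    }
    for (int end = off + len; i < end; i++) {   // trailing 0-3 bytes
      crc = (crc >>> 8) ^ T[0][(crc ^ b[i]) & 0xFF];
    }
    return crc;
  }
}
{code}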



> Implement a pure Java CRC32 calculator
> --------------------------------------
>
>                 Key: HADOOP-5598
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5598
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: dfs
>            Reporter: Owen O'Malley
>            Assignee: Todd Lipcon
>         Attachments: crc32-results.txt, hadoop-5598-evil.txt, hadoop-5598-hybrid.txt, hadoop-5598.txt, hadoop-5598.txt, PureJavaCrc32.java, PureJavaCrc32.java, TestCrc32Performance.java, TestCrc32Performance.java
>
>
> We've seen a reducer writing 200MB to HDFS with replication = 1 spending a long time in crc calculation. In particular, it was spending 5 seconds in crc calculation out of a total of 6 for the write. I suspect that it is the java-jni border that is causing us grief.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-5598) Implement a pure Java CRC32 calculator

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-5598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Owen O'Malley updated HADOOP-5598:
----------------------------------

    Description: We've seen a reducer writing 200MB to HDFS with replication = 1 spending a long time in crc calculation. In particular, it was spending 5 seconds in crc calculation out of a total of 6 for the write. I suspect that it is the java-jni border that is causing us grief.  (was: We've seen single-replica HDFS writes spending too much of their time in CRC32 calculation. On a 200 MB write, the CRC is taking 5 of the 6 seconds total.)

> Implement a pure Java CRC32 calculator
> --------------------------------------
>
>                 Key: HADOOP-5598
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5598
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: dfs
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>
> We've seen a reducer writing 200MB to HDFS with replication = 1 spending a long time in crc calculation. In particular, it was spending 5 seconds in crc calculation out of a total of 6 for the write. I suspect that it is the java-jni border that is causing us grief.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-5598) Implement a pure Java CRC32 calculator

Posted by "Owen O'Malley (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721803#action_12721803 ] 

Owen O'Malley commented on HADOOP-5598:
---------------------------------------

Our problem with JNI mostly happens when you have a large byte[] that you are using for your input. However, it depends a lot on the fragmentation of the heap and thus is not easy to benchmark. It came up in the context of doing the terabyte sort. The problem with JNI is that to give C code access to a byte[], the runtime may need to copy the array in and out. If the array is 100 MB, that takes a lot of time.

> Implement a pure Java CRC32 calculator
> --------------------------------------
>
>                 Key: HADOOP-5598
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5598
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: dfs
>            Reporter: Owen O'Malley
>            Assignee: Todd Lipcon
>         Attachments: crc32-results.txt, hadoop-5598-evil.txt, hadoop-5598-hybrid.txt, hadoop-5598.txt, hadoop-5598.txt, PureJavaCrc32.java, PureJavaCrc32.java, PureJavaCrc32.java, TestCrc32Performance.java, TestCrc32Performance.java, TestCrc32Performance.java, TestPureJavaCrc32.java
>
>
> We've seen a reducer writing 200MB to HDFS with replication = 1 spending a long time in crc calculation. In particular, it was spending 5 seconds in crc calculation out of a total of 6 for the write. I suspect that it is the java-jni border that is causing us grief.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-5598) Implement a pure Java CRC32 calculator

Posted by "Ben Maurer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12696165#action_12696165 ] 

Ben Maurer commented on HADOOP-5598:
------------------------------------

Is it possible that a fix for HADOOP-5318 would help this issue? The Java/JNI border should only slow things down if very small amounts of data are being CRC'd.

> Implement a pure Java CRC32 calculator
> --------------------------------------
>
>                 Key: HADOOP-5598
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5598
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: dfs
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>
> We've seen a reducer writing 200MB to HDFS with replication = 1 spending a long time in crc calculation. In particular, it was spending 5 seconds in crc calculation out of a total of 6 for the write. I suspect that it is the java-jni border that is causing us grief.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-5598) Implement a pure Java CRC32 calculator

Posted by "Scott Carey (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-5598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Scott Carey updated HADOOP-5598:
--------------------------------

    Attachment: TestCrc32Performance.java
                TestPureJavaCrc32.java
                PureJavaCrc32.java

This version of PureJavaCrc32 is between 1.8x and 10x faster than Sun's native implementation, depending on the chunk size.

Results on my laptop (Mac OS X with 64-bit Java 1.6, -Xmx512m, 2.5 GHz processor) are below. Run to run, results vary by about 5%.

||CRC32Class||num bytes||throughput||
| PureJava	|1	|99.533 MB/sec|
| SunNative	|1	|9.772 MB/sec|
| PureJava	|2	|163.265 MB/sec|
| SunNative	|2	|18.846 MB/sec|
| PureJava	|4	|234.004 MB/sec|
| SunNative	|4	|37.307 MB/sec|
| PureJava	|8	|307.692 MB/sec|
| SunNative	|8	|66.876 MB/sec|
| PureJava	|16	|432.432 MB/sec|
| SunNative	|16	|110.919 MB/sec|
| PureJava	|32	|522.449 MB/sec|
| SunNative	|32	|161.616 MB/sec|
| PureJava	|64	|547.009 MB/sec|
| SunNative	|64	|217.687 MB/sec|
| PureJava	|128	|432.432 MB/sec|
| SunNative	|128	|270.042 MB/sec|
| PureJava	|256	|551.724 MB/sec|
| SunNative	|256	|299.065 MB/sec|
| PureJava	|512	|615.385 MB/sec|
| SunNative	|512	|321.608 MB/sec|
| PureJava	|1024	|551.724 MB/sec|
| SunNative	|1024	|212.625 MB/sec|
| PureJava	|2048	|561.404 MB/sec|
| SunNative	|2048	|309.179 MB/sec|
| PureJava	|4096	|551.724 MB/sec|
| SunNative	|4096	|307.692 MB/sec|
| PureJava	|8192	|589.862 MB/sec|
| SunNative	|8192	|316.049 MB/sec|
| PureJava	|16384	|640.000 MB/sec|
| SunNative	|16384	|343.164 MB/sec|
| PureJava	|32768	|643.216 MB/sec|
| SunNative	|32768	|343.164 MB/sec|
| PureJava	|65536	|621.359 MB/sec|
| SunNative	|65536	|345.013 MB/sec|
| PureJava	|131072	|636.816 MB/sec|
| SunNative	|131072	|345.946 MB/sec|
| PureJava	|262144	|636.816 MB/sec|
| SunNative	|262144	|343.164 MB/sec|
| PureJava	|524288	|646.465 MB/sec|
| SunNative	|524288	|345.946 MB/sec|
| PureJava	|1048576	|640.000 MB/sec|
| SunNative	|1048576	|343.164 MB/sec|
| PureJava	|2097152	|633.663 MB/sec|
| SunNative	|2097152	|347.826 MB/sec|
| PureJava	|4194304	|636.816 MB/sec|
| SunNative	|4194304	|291.572 MB/sec|
| PureJava	|8388608	|618.357 MB/sec|
| SunNative	|8388608	|342.246 MB/sec|
| PureJava	|16777216	|624.390 MB/sec|
| SunNative	|16777216	|307.692 MB/sec|
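
For anyone wanting to reproduce numbers like these, the measurement is essentially a timed update() loop over a fixed buffer at each chunk size, something along these lines (a rough sketch, not the attached TestCrc32Performance.java):

{code:java}
import java.util.zip.CRC32;
import java.util.zip.Checksum;

// Rough throughput measurement: feed a fixed buffer to a Checksum in
// fixed-size chunks and report MB/sec. Illustration only.
public class CrcThroughput {
  static double mbPerSec(Checksum sum, byte[] data, int chunkSize, long totalBytes) {
    long start = System.nanoTime();
    long done = 0;
    while (done < totalBytes) {
      for (int off = 0; off + chunkSize <= data.length && done < totalBytes;
           off += chunkSize) {
        sum.update(data, off, chunkSize);
        done += chunkSize;
      }
    }
    double seconds = (System.nanoTime() - start) / 1e9;
    return (done / (1024.0 * 1024.0)) / seconds;
  }

  public static void main(String[] args) {
    byte[] data = new byte[1 << 20];
    new java.util.Random(0).nextBytes(data);
    for (int chunk = 1; chunk <= data.length; chunk <<= 1) {
      System.out.printf("java.util.zip.CRC32  %8d bytes  %8.1f MB/sec%n",
          chunk, mbPerSec(new CRC32(), data, chunk, 64L << 20));
    }
  }
}
{code}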



> Implement a pure Java CRC32 calculator
> --------------------------------------
>
>                 Key: HADOOP-5598
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5598
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: dfs
>            Reporter: Owen O'Malley
>            Assignee: Todd Lipcon
>         Attachments: crc32-results.txt, hadoop-5598-evil.txt, hadoop-5598-hybrid.txt, hadoop-5598.txt, hadoop-5598.txt, PureJavaCrc32.java, PureJavaCrc32.java, PureJavaCrc32.java, TestCrc32Performance.java, TestCrc32Performance.java, TestCrc32Performance.java, TestPureJavaCrc32.java
>
>
> We've seen a reducer writing 200MB to HDFS with replication = 1 spending a long time in crc calculation. In particular, it was spending 5 seconds in crc calculation out of a total of 6 for the write. I suspect that it is the java-jni border that is causing us grief.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-5598) Implement a pure Java CRC32 calculator

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-5598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated HADOOP-5598:
--------------------------------

    Attachment: crc32-results.txt
                TestCrc32Performance.java
                hadoop-5598.txt

This is a patch to implement CRC32 in pure Java, along with a performance test that shows its improvement. Also attaching the benchmark output from both Sun 1.6.0_12 and OpenJDK 1.6.0_0-b12; the two look pretty different.

The summary is that, on Sun's JDK (which most people use), the pure Java implementation is faster for all chunk sizes less than 32 bytes (by a large factor at the smaller end of the spectrum) and about 33% slower for chunk sizes larger than that. On OpenJDK, the built-in CRC32 implementation is 3-4x faster than the Sun JDK's.

Running the concurrency benchmark from HADOOP-5318 also shows huge improvements (the same as was seen with Ben's buffering patch) by using the pure Java CRC32. This patch contains the change to FSDataOutputStream to make use of it.

Review from someone who understands Java's bit extension semantics better than I do would be appreciated - I bet more performance can be squeezed out of this by a Java bitwise op master.
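
For readers following along, the heart of such a patch is the classic single-table, byte-at-a-time loop below. This is a generic sketch of the technique (same polynomial as java.util.zip.CRC32), not the attached code.

{code:java}
import java.util.zip.Checksum;

// Generic single-table CRC-32 (reflected polynomial 0xEDB88320),
// producing the same values as java.util.zip.CRC32. Sketch only.
public class SimpleCrc32 implements Checksum {
  private static final int[] TABLE = new int[256];
  static {
    for (int i = 0; i < 256; i++) {
      int c = i;
      for (int k = 0; k < 8; k++) {
        c = ((c & 1) != 0) ? (c >>> 1) ^ 0xEDB88320 : c >>> 1;
      }
      TABLE[i] = c;
    }
  }

  private int crc = 0xFFFFFFFF;   // internal register is kept inverted

  @Override public void update(int b) {
    crc = (crc >>> 8) ^ TABLE[(crc ^ b) & 0xFF];
  }

  @Override public void update(byte[] b, int off, int len) {
    for (int i = off; i < off + len; i++) {
      crc = (crc >>> 8) ^ TABLE[(crc ^ b[i]) & 0xFF];
    }
  }

  @Override public long getValue() {
    return (~crc) & 0xFFFFFFFFL;
  }

  @Override public void reset() {
    crc = 0xFFFFFFFF;
  }
}
{code}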

> Implement a pure Java CRC32 calculator
> --------------------------------------
>
>                 Key: HADOOP-5598
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5598
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: dfs
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>         Attachments: crc32-results.txt, hadoop-5598.txt, TestCrc32Performance.java
>
>
> We've seen a reducer writing 200MB to HDFS with replication = 1 spending a long time in crc calculation. In particular, it was spending 5 seconds in crc calculation out of a total of 6 for the write. I suspect that it is the java-jni border that is causing us grief.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-5598) Implement a pure Java CRC32 calculator

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-5598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated HADOOP-5598:
--------------------------------

    Attachment: hadoop-5598.txt

I managed to coerce Java into cooperating with int-sized variables, and it's significantly faster now. It's now about 5-10% slower than the built-in CRC32 for large sizes.

> Implement a pure Java CRC32 calculator
> --------------------------------------
>
>                 Key: HADOOP-5598
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5598
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: dfs
>            Reporter: Owen O'Malley
>            Assignee: Todd Lipcon
>         Attachments: crc32-results.txt, hadoop-5598-evil.txt, hadoop-5598-hybrid.txt, hadoop-5598.txt, hadoop-5598.txt, TestCrc32Performance.java, TestCrc32Performance.java
>
>
> We've seen a reducer writing 200MB to HDFS with replication = 1 spending a long time in crc calculation. In particular, it was spending 5 seconds in crc calculation out of a total of 6 for the write. I suspect that it is the java-jni border that is causing us grief.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Assigned: (HADOOP-5598) Implement a pure Java CRC32 calculator

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-5598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon reassigned HADOOP-5598:
-----------------------------------

    Assignee: Todd Lipcon  (was: Owen O'Malley)

> Implement a pure Java CRC32 calculator
> --------------------------------------
>
>                 Key: HADOOP-5598
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5598
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: dfs
>            Reporter: Owen O'Malley
>            Assignee: Todd Lipcon
>         Attachments: crc32-results.txt, hadoop-5598.txt, TestCrc32Performance.java
>
>
> We've seen a reducer writing 200MB to HDFS with replication = 1 spending a long time in crc calculation. In particular, it was spending 5 seconds in crc calculation out of a total of 6 for the write. I suspect that it is the java-jni border that is causing us grief.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HADOOP-5598) Implement a pure Java CRC32 calculator

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-5598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated HADOOP-5598:
--------------------------------

    Attachment: TestCrc32Performance.java
                hadoop-5598-hybrid.txt

Attached is a new version that is a hybrid implementation. For writes smaller than a threshold it calculates CRC32 in pure Java. Above the threshold, it uses the java.util.zip implementation, which it folds back in lazily using crc32_combine ported from zlib.

On the old TestCrc32Performance benchmark, this version was always at least as fast as java.util.zip's. I added a new benchmark test which sizes the writes randomly; on that one the hybrid version is awful in certain cases, since it spends most of its time in crc32_combine. For this hybrid model to work, there will need to be some kind of hysteresis when switching between implementations, so as to avoid crc32_combine.

If someone has Java 1.6 update 14 handy, I'd be interested to see if the new array bounds checking elimination optimization makes the pure Java fast enough to completely replace java.util.zip's.
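
The hysteresis could be as simple as only switching after a sustained run of writes on the other side of the size threshold; a sketch of the idea (the threshold and run length here are illustrative values, not from the patch):

{code:java}
// Sketch: only switch between the two Checksum implementations after RUN_LENGTH
// consecutive writes land on the other side of the size threshold, so a few
// stray sizes don't trigger a costly crc32_combine on every write.
class HysteresisPolicy {
  private static final int THRESHOLD = 512;   // bytes; illustrative value
  private static final int RUN_LENGTH = 16;   // writes; illustrative value

  private boolean usingNative = false;
  private int runOnOtherSide = 0;

  /** Returns true if the caller should use the native (java.util.zip) path. */
  boolean chooseNative(int writeSize) {
    boolean wantNative = writeSize >= THRESHOLD;
    if (wantNative == usingNative) {
      runOnOtherSide = 0;                      // staying put; reset the counter
    } else if (++runOnOtherSide >= RUN_LENGTH) {
      usingNative = wantNative;                // switch only after a sustained run
      runOnOtherSide = 0;
    }
    return usingNative;
  }
}
{code}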

> Implement a pure Java CRC32 calculator
> --------------------------------------
>
>                 Key: HADOOP-5598
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5598
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: dfs
>            Reporter: Owen O'Malley
>            Assignee: Todd Lipcon
>         Attachments: crc32-results.txt, hadoop-5598-hybrid.txt, hadoop-5598.txt, TestCrc32Performance.java, TestCrc32Performance.java
>
>
> We've seen a reducer writing 200MB to HDFS with replication = 1 spending a long time in crc calculation. In particular, it was spending 5 seconds in crc calculation out of a total of 6 for the write. I suspect that it is the java-jni border that is causing us grief.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-5598) Implement a pure Java CRC32 calculator

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12694192#action_12694192 ] 

Tsz Wo (Nicholas), SZE commented on HADOOP-5598:
------------------------------------------------

There is a CRC class in org.apache.hadoop.io.compress.bzip2. It looks like it supports computing CRC32.

> Implement a pure Java CRC32 calculator
> --------------------------------------
>
>                 Key: HADOOP-5598
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5598
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: dfs
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>
> We've seen a reducer writing 200MB to HDFS with replication = 1 spending a long time in crc calculation. In particular, it was spending 5 seconds in crc calculation out of a total of 6 for the write. I suspect that it is the java-jni border that is causing us grief.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-5598) Implement a pure Java CRC32 calculator

Posted by "Scott Carey (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-5598?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721433#action_12721433 ] 

Scott Carey commented on HADOOP-5598:
-------------------------------------

I have additionally attached my modifications to the test class and the performance test to this JIRA.

> Implement a pure Java CRC32 calculator
> --------------------------------------
>
>                 Key: HADOOP-5598
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5598
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: dfs
>            Reporter: Owen O'Malley
>            Assignee: Todd Lipcon
>         Attachments: crc32-results.txt, hadoop-5598-evil.txt, hadoop-5598-hybrid.txt, hadoop-5598.txt, hadoop-5598.txt, PureJavaCrc32.java, PureJavaCrc32.java, PureJavaCrc32.java, TestCrc32Performance.java, TestCrc32Performance.java, TestCrc32Performance.java, TestPureJavaCrc32.java
>
>
> We've seen a reducer writing 200MB to HDFS with replication = 1 spending a long time in crc calculation. In particular, it was spending 5 seconds in crc calculation out of a total of 6 for the write. I suspect that it is the java-jni border that is causing us grief.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.