Posted to common-issues@hadoop.apache.org by "Todd Lipcon (JIRA)" <ji...@apache.org> on 2011/08/03 21:58:26 UTC

[jira] [Updated] (HADOOP-7445) Implement bulk checksum verification using efficient native code

     [ https://issues.apache.org/jira/browse/HADOOP-7445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon updated HADOOP-7445:
--------------------------------

    Attachment: hadoop-7445.txt

Good point: we don't need the special license header on the tables, since we generated them using your Table class. However, the actual "slicing-by-8" implementation comes from a BSD-licensed project, so I moved that license header to bulk_crc32.c.

This new revision also rebases on the mavenized common.

As for testing performance and correctness against the existing implementation:
- Performance-wise, we don't currently have a canned benchmark for checksum _verification_. This patch doesn't add native checksum _computation_ anywhere, since the umbrella JIRA HDFS-2080 focuses on the read path. I was able to benchmark "hadoop fs -cat /dev/shm/128M /dev/shm/128M /dev/shm/128M [repeated 50 times]" using a ChecksumFileSystem, and saw a ~60% speed improvement. This measures CPU overhead, since the file is read from a RAM disk.
- Correctness-wise, the new test cases in TestDataChecksum verify both the native and non-native code, since they test with direct buffers as well as heap buffers that wrap a byte[]. If the native and non-native code disagreed, the test would fail for one of the two cases (since the checksums being verified are always computed by the Java code).
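
To illustrate the shape of that correctness check, here is a minimal sketch, not the code from the patch. It assumes the JDK's java.util.zip.CRC32 as a stand-in for the checksum implementation, and the class and method names (BulkChecksumSketch, computeChunkedSums, verifyChunkedSums, toDirect) are hypothetical. Checksums are computed once per chunk in pure Java, and the same data is then verified both as a heap buffer wrapping a byte[] and as a direct buffer. In this sketch both buffer kinds go through the same JDK code, so it shows only the structure of the test, not the native dispatch itself.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// Hypothetical illustration only: not Hadoop's DataChecksum and not the native code.
final class BulkChecksumSketch {

    /** Computes one CRC32 per chunk of bytesPerChecksum bytes (last chunk may be shorter). */
    static int[] computeChunkedSums(ByteBuffer data, int bytesPerChecksum) {
        ByteBuffer view = data.duplicate(); // leave the caller's position untouched
        int chunks = (view.remaining() + bytesPerChecksum - 1) / bytesPerChecksum;
        int[] sums = new int[chunks];
        CRC32 crc = new CRC32();
        for (int i = 0; i < chunks; i++) {
            int len = Math.min(bytesPerChecksum, view.remaining());
            ByteBuffer chunk = view.slice();
            chunk.limit(len);
            crc.reset();
            crc.update(chunk); // accepts heap and direct buffers alike
            sums[i] = (int) crc.getValue();
            view.position(view.position() + len);
        }
        return sums;
    }

    /** Returns the index of the first chunk whose checksum mismatches, or -1 if all match. */
    static int verifyChunkedSums(ByteBuffer data, int[] expected, int bytesPerChecksum) {
        int[] actual = computeChunkedSums(data, bytesPerChecksum);
        if (actual.length != expected.length) {
            throw new IllegalArgumentException("checksum count does not match data length");
        }
        for (int i = 0; i < actual.length; i++) {
            if (actual[i] != expected[i]) {
                return i;
            }
        }
        return -1;
    }

    /** Copies a byte[] into a direct buffer, the kind a native code path would receive. */
    static ByteBuffer toDirect(byte[] bytes) {
        ByteBuffer direct = ByteBuffer.allocateDirect(bytes.length);
        direct.put(bytes);
        direct.flip();
        return direct;
    }

    public static void main(String[] args) {
        byte[] bytes = new byte[1000];
        for (int i = 0; i < bytes.length; i++) {
            bytes[i] = (byte) (i * 31 + 7);
        }
        // Expected checksums are always computed by the Java code.
        int[] sums = computeChunkedSums(ByteBuffer.wrap(bytes), 512);
        System.out.println("heap:   " + verifyChunkedSums(ByteBuffer.wrap(bytes), sums, 512));
        System.out.println("direct: " + verifyChunkedSums(toDirect(bytes), sums, 512));
    }
}
```

If the two verification paths were backed by different implementations that disagreed, one of the two calls would report a mismatching chunk index instead of -1.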

> Implement bulk checksum verification using efficient native code
> ----------------------------------------------------------------
>
>                 Key: HADOOP-7445
>                 URL: https://issues.apache.org/jira/browse/HADOOP-7445
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: native, util
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>         Attachments: hadoop-7445.txt, hadoop-7445.txt, hadoop-7445.txt, hadoop-7445.txt, hadoop-7445.txt
>
>
> Once HADOOP-7444 is implemented ("bulk" API for checksums), good performance gains can be had by implementing bulk checksum operations using JNI. This JIRA is to add checksum support to the native libraries. Of course if native libs are not available, it will still fall back to the pure-Java implementations.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira