Posted to common-dev@hadoop.apache.org by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org> on 2008/09/03 23:45:44 UTC

[jira] Commented: (HADOOP-3981) Need a distributed file checksum algorithm for HDFS

    [ https://issues.apache.org/jira/browse/HADOOP-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628156#action_12628156 ] 

Tsz Wo (Nicholas), SZE commented on HADOOP-3981:
------------------------------------------------

Currently, the Datanode stores a CRC-32 for every 512-byte chunk.  Let's call these the first-level CRCs.  Since each CRC-32 is 4 bytes, the total size of the first-level CRCs is about 1/128 of the data size.

How about we compute a second level of checksum over the first-level CRCs?  That is, for every 512 bytes of first-level CRCs, we compute another CRC-32.  The second-level CRCs then amount to about 1/16384 of the data size.  We could use these second-level CRCs as the checksum of the file.

For example, for a 100GB file, the first-level CRCs take 800MB while the second-level CRCs take only 6.25MB.  We would use these 6.25MB of second-level CRCs as the checksum of the entire file.
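
A rough sketch in Java of the two-level computation, using java.util.zip.CRC32, is below.  The class and helper names are made up for illustration; this is not the actual Datanode code, just the size argument written out.

{code:java}
import java.util.zip.CRC32;

// Illustrative sketch only -- not the Datanode implementation.
public class TwoLevelCrcSketch {
  private static final int BYTES_PER_CHUNK = 512;  // data bytes covered by one CRC-32
  private static final int CRC_SIZE = 4;           // a CRC-32 value is 4 bytes

  // First level: one CRC-32 per 512-byte chunk of input (~1/128 of the input size).
  static byte[] crcPer512Bytes(byte[] input) {
    int numChunks = (input.length + BYTES_PER_CHUNK - 1) / BYTES_PER_CHUNK;
    byte[] crcs = new byte[numChunks * CRC_SIZE];
    CRC32 crc = new CRC32();
    for (int i = 0; i < numChunks; i++) {
      int off = i * BYTES_PER_CHUNK;
      int len = Math.min(BYTES_PER_CHUNK, input.length - off);
      crc.reset();
      crc.update(input, off, len);
      writeInt(crcs, i * CRC_SIZE, (int) crc.getValue());
    }
    return crcs;
  }

  // Second level: apply the same rule to the first-level CRC bytes,
  // giving ~1/128 of 1/128 = ~1/16384 of the data size.
  static byte[] fileChecksum(byte[] data) {
    return crcPer512Bytes(crcPer512Bytes(data));
  }

  private static void writeInt(byte[] buf, int off, int v) {
    buf[off]     = (byte) (v >>> 24);
    buf[off + 1] = (byte) (v >>> 16);
    buf[off + 2] = (byte) (v >>> 8);
    buf[off + 3] = (byte) v;
  }
}
{code}

The same 4-bytes-per-512-bytes rule applied twice is what gives the 1/128 and 1/16384 ratios above.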


> Need a distributed file checksum algorithm for HDFS
> ---------------------------------------------------
>
>                 Key: HADOOP-3981
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3981
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: dfs
>            Reporter: Tsz Wo (Nicholas), SZE
>
> Traditional message digest algorithms, like MD5, SHA1, etc., require reading the entire input message sequentially in a central location.  HDFS supports files of multiple terabytes, so the overhead of reading an entire file this way is huge.  A distributed file checksum algorithm is needed for HDFS.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.