Posted to common-dev@hadoop.apache.org by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org> on 2008/08/20 20:06:44 UTC

[jira] Created: (HADOOP-3981) Need a distributed file checksum algorithm for HDFS

Need a distributed file checksum algorithm for HDFS
---------------------------------------------------

                 Key: HADOOP-3981
                 URL: https://issues.apache.org/jira/browse/HADOOP-3981
             Project: Hadoop Core
          Issue Type: New Feature
          Components: dfs
            Reporter: Tsz Wo (Nicholas), SZE


Traditional message digest algorithms, like MD5, SHA1, etc., require reading the entire input message sequentially in a central location.  HDFS supports large files of multiple terabytes, so the overhead of reading an entire file is huge.  A distributed file checksum algorithm is needed for HDFS.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3981) Need a distributed file checksum algorithm for HDFS

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629237#action_12629237 ] 

Tsz Wo (Nicholas), SZE commented on HADOOP-3981:
------------------------------------------------

> Perhaps we can use bytes.per.checksum in the algorithm name, e.g., MD5-of-CRC32-every-512bytes

+1 We definitely need to encode these details in the algorithm name.

> Another API to consider ...

Which API are you talking about, FileSystem API or HDFS API?  If you mean HDFS API, are you saying that we should handle HDFS specially in DistCp?  Currently, DistCp only uses FileSystem API.



[jira] Commented: (HADOOP-3981) Need a distributed file checksum algorithm for HDFS

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629217#action_12629217 ] 

Doug Cutting commented on HADOOP-3981:
--------------------------------------

> Otherwise the file checksum computed depends on the block size.

It still depends on bytes.per.checksum, which can vary per file, just like block size.  If two files have different bytes.per.checksum then we should not compare CRC-derived checksums.  Perhaps we can use bytes.per.checksum in the algorithm name, e.g., MD5-of-CRC32-every-512bytes could be an algorithm name.  If we compute these per-block, then the algorithm name would be MD5-of-CRC32-every-512bytes-with-64Mblocks.

If we compute checksums on demand from CRCs then it will be relatively slow.  DistCp thus needs to be sure to only get checksums when the lengths match and the alternative is copying the entire file.  So long as DistCp is the primary client of checksums this is probably sufficient, and we should not bother storing checksums.

Another API to consider might be:
  - String[] getChecksumAlgorithms(Path)
  - Checksum getChecksum(Path)

This way an HDFS filesystem might return ["MD5-of-CRC32-every-512bytes-with-64Mblocks", "MD5-of-CRC32-every-512bytes", "MD5"] as the possible algorithms for a file, in preferred order.  Then DistCp could call this for two files (whose lengths match) to see if they have any compatible algorithms.  If possible, CRCs would be combined on datanodes, but, if block sizes differ, the CRCs could be summed in the client.  If the CRCs are incompatible, then MD5s could be computed on datanodes.  Is this overkill?  Probably.
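To make this concrete, here is a minimal Java sketch of what such an interface could look like.  The names ChecksummedFileSystem and FileChecksumSketch are invented for illustration; this is not the API that was eventually committed, and a later comment amends getChecksum to also take an algorithm name.

{noformat}
import java.io.IOException;
import org.apache.hadoop.fs.Path;

/** Hypothetical sketch of the checksum API proposed in this comment. */
interface ChecksummedFileSystem {

  /** Algorithm names available for the file, in preferred order,
   *  e.g. "MD5-of-CRC32-every-512bytes-with-64Mblocks". */
  String[] getChecksumAlgorithms(Path file) throws IOException;

  /** Checksum of the whole file. */
  FileChecksumSketch getChecksum(Path file) throws IOException;
}

/** Placeholder for the "Checksum" value object in the proposal. */
class FileChecksumSketch {
  final String algorithm;   // which algorithm produced the bytes
  final byte[] bytes;       // the checksum itself

  FileChecksumSketch(String algorithm, byte[] bytes) {
    this.algorithm = algorithm;
    this.bytes = bytes;
  }
}
{noformat}

DistCp could then intersect the two files' algorithm lists and request the first algorithm both sides support, as described above.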




[jira] Commented: (HADOOP-3981) Need a distributed file checksum algorithm for HDFS

Posted by "Lohit Vijayarenu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12624076#action_12624076 ] 

Lohit Vijayarenu commented on HADOOP-3981:
------------------------------------------

One simple approach could be checksum of checksums/metadata.



[jira] Commented: (HADOOP-3981) Need a distributed file checksum algorithm for HDFS

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628817#action_12628817 ] 

Tsz Wo (Nicholas), SZE commented on HADOOP-3981:
------------------------------------------------

bq. When should we compute checksums? Are they computed on demand, when someone calls FileSystem#getFileChecksum()? Or are they pre-computed and stored? If they're not pre-computed then we certainly ought to compute them from the CRC's. Even if they are to be pre-computed, then we might still use the CRCs, to reduce FileSystem upgrade time.

It is better to compute the file checksum on demand, so that the Datanode storage layout remains unchanged and we won't have to do a distributed upgrade.

bq. My hunch is that we should compute them on demand from CRC data. We extend ClientDatanodeProtocol to add a getChecksum() operation that returns the checksum for a block without transmitting the CRCs to the client, and the client combines block checksums to get a whole-file checksum. This is rather expensive, but still a lot faster than checksumming the entire file on demand.

My idea is similar to this, except that we should not compute block checksums; otherwise the computed file checksum depends on the block size.  That is why I propose computing the second-level CRCs over the first-level CRCs.  This idea is borrowed from hash trees (aka Merkle trees), which are used by ZFS.
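For illustration, a minimal sketch of the second-level computation, assuming the first-level CRCs are already available as a byte array (this is not the committed code):

{noformat}
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.zip.CRC32;

class SecondLevelCrcSketch {
  /** firstLevelCrcs: the concatenated first-level CRC-32s of a file
   *  (4 bytes per 512-byte data chunk).  Returns one CRC-32 per 512 bytes
   *  of that input, i.e. roughly 1/16384 of the original data size. */
  static byte[] secondLevel(byte[] firstLevelCrcs) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    DataOutputStream dataOut = new DataOutputStream(out);
    CRC32 crc = new CRC32();
    for (int off = 0; off < firstLevelCrcs.length; off += 512) {
      int len = Math.min(512, firstLevelCrcs.length - off);
      crc.reset();
      crc.update(firstLevelCrcs, off, len);
      dataOut.writeInt((int) crc.getValue());   // 4-byte second-level CRC
    }
    dataOut.flush();
    return out.toByteArray();
  }
}
{noformat}

Because each level depends only on fixed-size chunks of the level below, the result is independent of the block size, which is the point being made above.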




[jira] Commented: (HADOOP-3981) Need a distributed file checksum algorithm for HDFS

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12631280#action_12631280 ] 

Hadoop QA commented on HADOOP-3981:
-----------------------------------

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12390040/3981_20080912.patch
  against trunk revision 695690.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 6 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3268/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3268/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3268/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3268/console

This message is automatically generated.



[jira] Commented: (HADOOP-3981) Need a distributed file checksum algorithm for HDFS

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12633332#action_12633332 ] 

Hudson commented on HADOOP-3981:
--------------------------------

Integrated in Hadoop-trunk #611 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/611/])



[jira] Commented: (HADOOP-3981) Need a distributed file checksum algorithm for HDFS

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629960#action_12629960 ] 

Doug Cutting commented on HADOOP-3981:
--------------------------------------

+1 This looks good to me.



[jira] Commented: (HADOOP-3981) Need a distributed file checksum algorithm for HDFS

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629854#action_12629854 ] 

Doug Cutting commented on HADOOP-3981:
--------------------------------------

Some comments:
 - It looks like you forgot to include the class MD5MD5CRC32FileChecksum in the patch.
 - Why do you use the datanode's socket/opcode interface rather than adding a method to ClientDatanodeProtocol?
 - WritableUtils#toByteArray can use io.DataOutputBuffer, no?
 - DistCp#sameFile() should be changed to only get checksums when the lengths match.





[jira] Commented: (HADOOP-3981) Need a distributed file checksum algorithm for HDFS

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629886#action_12629886 ] 

Raghu Angadi commented on HADOOP-3981:
--------------------------------------

> Why do you use the datanode's socket/opcode interface rather than adding a method to ClientDatanodeProtocol?

Nicholas had briefly talked to me regarding this.  I was OK with either way.  If RPCs are used, then other RPCs on the port should be prepared to handle delays on the order of minutes, since these checksum RPCs compete with the rest of the disk accesses, and there could be quite a few of these requests.

The Datanode has just 3 RPC handlers.  We probably should not increase the number of handlers just for this, since checksum load would be very rare and the DataNode is already thread-starved.




[jira] Commented: (HADOOP-3981) Need a distributed file checksum algorithm for HDFS

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629256#action_12629256 ] 

Doug Cutting commented on HADOOP-3981:
--------------------------------------

> Which API are you talking about, FileSystem API or HDFS API?

FileSystem.  And that second method above should have been:
 - Checksum getChecksum(Path, String algorithm)

But this is probably overkill for now.  Let's just choose a single algorithm and put its parameters in the algorithm name.  We have to change ClientDatanodeProtocol in any case.  We can either compute per-block checksums on the datanode, or send the CRCs to the client and sum them there.  Let's just pick one or the other for the first version.  My preference would be to compute per-block checksums on the datanode, but I don't feel that strongly and would not veto the other approach.




[jira] Updated: (HADOOP-3981) Need a distributed file checksum algorithm for HDFS

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HADOOP-3981:
-------------------------------------------

    Attachment: 3981_20080910b.patch

3981_20080910b.patch:

Regarding Raghu's comments:
   1.  Removed
   2 & 3. Changed the messages
   4. Moved toByteArray() to WritableUtils
   5. Changed the buffer to 4kB

Also fixed some bugs.



[jira] Commented: (HADOOP-3981) Need a distributed file checksum algorithm for HDFS

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628765#action_12628765 ] 

Raghu Angadi commented on HADOOP-3981:
--------------------------------------

> Why not just use the MD5 or SHA1 of the CRCs?

+1. It is more appealing to have a small, fixed size checksum. 

Issues regarding differences in bytes.per.checksum, blockSize, etc. are essentially the same whether we use a fixed-length checksum or the second-level checksum that Nicholas proposed.




[jira] Commented: (HADOOP-3981) Need a distributed file checksum algorithm for HDFS

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630719#action_12630719 ] 

Tsz Wo (Nicholas), SZE commented on HADOOP-3981:
------------------------------------------------

In case Hudson is not working, here are the ant test-patch results:
{noformat}
     [exec] +1 overall.  

     [exec]     +1 @author.  The patch does not contain any @author tags.

     [exec]     +1 tests included.  The patch appears to include 6 new or modified tests.

     [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.

     [exec]     +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

     [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
{noformat}



[jira] Commented: (HADOOP-3981) Need a distributed file checksum algorithm for HDFS

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628156#action_12628156 ] 

Tsz Wo (Nicholas), SZE commented on HADOOP-3981:
------------------------------------------------

Currently, the Datanode stores a CRC-32 for every 512-byte chunk.  Let's call these CRCs the first-level CRCs.  The total size of the first-level CRCs is about 1/128 of the data size.

How about we compute a second level of checksums over the first-level CRCs?  That is, for every 512 bytes of first-level CRCs, we compute a CRC-32.  The second-level CRCs are then about 1/16384 of the data size.  We could use these second-level CRCs as the checksum of the file.

For example, for a 100GB file, the first-level CRCs take 800MB and the second-level CRCs only 6.25MB.  We would use these 6.25MB of second-level CRCs as the checksum of the entire file.
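The ratios follow from storing a 4-byte CRC-32 per 512 bytes, i.e. 1/128 per level; a quick sanity check of the 100GB example:

{noformat}
class CrcOverheadCheck {
  public static void main(String[] args) {
    long data = 100L * 1024 * 1024 * 1024;           // 100GB of file data
    long firstLevel = data / 512 * 4;                // 4-byte CRC per 512 bytes = 1/128
    long secondLevel = firstLevel / 512 * 4;         // 1/128 of the first level = 1/16384
    System.out.println(firstLevel / (1024 * 1024) + " MB of first-level CRCs");      // 800 MB
    System.out.println(secondLevel / (1024.0 * 1024) + " MB of second-level CRCs");  // 6.25 MB
  }
}
{noformat}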




[jira] Updated: (HADOOP-3981) Need a distributed file checksum algorithm for HDFS

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HADOOP-3981:
-------------------------------------------

    Attachment: 3981_20080910.patch

3981_20080910.patch: 

> It looks like you forgot to include the class MD5MD5CRC32FileChecksum in the patch.
added MD5MD5CRC32FileChecksum

> Why do you use the datanode's socket/opcode interface rather than adding a method to ClientDatanodeProtocol?
There might be an RPC timeout problem, so I use the data transfer protocol instead of ClientDatanodeProtocol.

> WritableUtils#toByteArray can use io.DataOutputBuffer, no?
Moved WritableUtils#toByteArray to io.DataOutputBuffer and implemented it with DataOutputBuffer.

> DistCp#sameFile() should be changed to only get checksums when the lengths match.
Updated DistCp



[jira] Commented: (HADOOP-3981) Need a distributed file checksum algorithm for HDFS

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628814#action_12628814 ] 

Tsz Wo (Nicholas), SZE commented on HADOOP-3981:
------------------------------------------------

bq. Why not just use the MD5 or SHA1 of the CRCs?
MD5 requires sequential access to the data.  One easy implementation of MD5-over-CRCs is for the client to read all the CRCs from the Datanodes and then compute the MD5 over them.  However, that requires reading all the first-level CRCs, which is 800MB for a 100GB file.  Is that too much network traffic?

Raghu has a very good idea for another implementation, which computes the MD5 across datanodes as follows: the client asks Datanode 1 (which has the first block) to compute the MD5.  Datanode 1 returns the intermediate state of the MD5 computation to the client, and the client sends that intermediate state to Datanode 2 (which has the second block).  Datanode 2 then continues the MD5 computation and returns its intermediate state to the client, and so on.

Note that this is a distributed algorithm but not a parallel one.  Another problem with MD5 in this implementation is that there is no easy way to extract the intermediate state of an MD5 computation in Java 1.6.

bq. It is more appealing to have a small, fixed size checksum.
This is probably good.  I will think about this.
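For reference, the "easy" client-side implementation mentioned at the top of this comment could look roughly like the sketch below.  Fetching the CRC bytes from the Datanodes is the expensive part and is elided here; the List<byte[]> input stands in for that transfer step.

{noformat}
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

class ClientSideMd5OverCrcs {
  /** crcsPerBlock: the first-level CRC bytes of each block, in block order,
   *  fetched from the Datanodes (roughly 1/128 of the file size in total). */
  static byte[] md5OverCrcs(List<byte[]> crcsPerBlock) throws NoSuchAlgorithmException {
    MessageDigest md5 = MessageDigest.getInstance("MD5");
    for (byte[] blockCrcs : crcsPerBlock) {
      md5.update(blockCrcs);        // MD5 is sequential, so blocks must be fed in order
    }
    return md5.digest();            // 16-byte whole-file checksum
  }
}
{noformat}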



[jira] Commented: (HADOOP-3981) Need a distributed file checksum algorithm for HDFS

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628743#action_12628743 ] 

Doug Cutting commented on HADOOP-3981:
--------------------------------------

> We use these 6.25MB second level CRCs as the checksum of the entire file.

Why not just use the MD5 or SHA1 of the CRCs?

When should we compute checksums? Are they computed on demand, when someone calls FileSystem#getFileChecksum()? Or are they pre-computed and stored? If they're not pre-computed then we certainly ought to compute them from the CRC's.  Even if they are to be pre-computed, then we might still use the CRCs, to reduce FileSystem upgrade time.

If checksums were pre-computed, where would they be stored?  We could store them in the NameNode, with file metadata, or we could store per-block checksums on datanodes.

My hunch is that we should compute them on demand from CRC data.  We extend ClientDatanodeProtocol to add a getChecksum() operation that returns the checksum for a block without transmitting the CRCs to the client, and the client combines block checksums to get a whole-file checksum.  This is rather expensive, but still a lot faster than checksumming the entire file on demand.  DistCp would be substantially faster if it only used checksums when file lengths match, so we should probably make that optimization.

Longer-term we could think about a checksum API that permits a sequence of checksums to be returned per file, so that, e.g., if a source file has been appended to, we could truncate the destination and append the new data, incrementally updating it.  But until HDFS supports truncation this is moot.
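As a rough sketch of the on-demand scheme described above (a per-block checksum computed on the datanode, combined at the client), with all transport details elided and no claim that this matches the committed code:

{noformat}
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

class OnDemandFileChecksumSketch {

  /** Datanode side: digest a block's stored CRC bytes so that only 16 bytes,
   *  not the CRCs themselves, need to travel to the client. */
  static byte[] blockChecksum(byte[] blockCrcBytes) throws NoSuchAlgorithmException {
    return MessageDigest.getInstance("MD5").digest(blockCrcBytes);
  }

  /** Client side: combine the per-block digests, in block order, into a
   *  whole-file checksum. */
  static byte[] fileChecksum(List<byte[]> blockChecksums) throws NoSuchAlgorithmException {
    MessageDigest md5 = MessageDigest.getInstance("MD5");
    for (byte[] blockDigest : blockChecksums) {
      md5.update(blockDigest);
    }
    return md5.digest();
  }
}
{noformat}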



[jira] Commented: (HADOOP-3981) Need a distributed file checksum algorithm for HDFS

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629319#action_12629319 ] 

Tsz Wo (Nicholas), SZE commented on HADOOP-3981:
------------------------------------------------

How about we implement MD5-of-CRC32-every-512bytes-with-64Mblocks and use it as the default file checksum algorithm for all FileSystems?  Then we don't have to change the FileSystem API at this moment.

A few issues:
- Should we return a list of MD5 (one per block, so the length of the file checksum would depend on the number of blocks) or a fixed-length checksum (e.g. MD5-of-MD5-of-CRC32)?
- If bytes.per.checksum is not 512 or block size is not 64MB in HDFS, how about getFileChecksum(Path f) returns null?
- For other FS, how about we return null?  We should implement a serial version of the MD5-of-CRC32-every-512bytes-with-64Mblocks algorithm later.



[jira] Updated: (HADOOP-3981) Need a distributed file checksum algorithm for HDFS

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HADOOP-3981:
-------------------------------------------

    Hadoop Flags: [Incompatible change, Reviewed]  (was: [Reviewed])

This is an incompatible change since the DATA_TRANSFER_VERSION should be changed.  See HADOOP-4197.



[jira] Commented: (HADOOP-3981) Need a distributed file checksum algorithm for HDFS

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12624092#action_12624092 ] 

Tsz Wo (Nicholas), SZE commented on HADOOP-3981:
------------------------------------------------

> One simple approach could be checksum of checksums/metadata.

+1.  We probably should compute a hash tree or an MD5 over the existing checksums.



[jira] Updated: (HADOOP-3981) Need a distributed file checksum algorithm for HDFS

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HADOOP-3981:
-------------------------------------------

    Attachment: 3981_20080912.patch

3981_20080912.patch: fixed a bug.



[jira] Assigned: (HADOOP-3981) Need a distributed file checksum algorithm for HDFS

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raghu Angadi reassigned HADOOP-3981:
------------------------------------

    Assignee: Tsz Wo (Nicholas), SZE



[jira] Updated: (HADOOP-3981) Need a distributed file checksum algorithm for HDFS

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HADOOP-3981:
-------------------------------------------

       Resolution: Fixed
    Fix Version/s: 0.19.0
     Hadoop Flags: [Reviewed]
           Status: Resolved  (was: Patch Available)

I just committed this.



[jira] Commented: (HADOOP-3981) Need a distributed file checksum algorithm for HDFS

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629322#action_12629322 ] 

Doug Cutting commented on HADOOP-3981:
--------------------------------------

> implement MD5-of-CRC32-every-512bytes-with-64Mblocks and use it as the default file checksum algorithm for all FileSystems?

We should just implement this for HDFS, where CRCs already exist.

> Should we return a list of MD5 [ ... ] ?

No, just a single checksum for the entire file.

> If bytes.per.checksum is not 512 or block size is not 64MB in HDFS, how about getFileChecksum(Path f) returns null?

No, it should return a different algorithm string that includes the file's bytes.per.checksum and block size.  I now think returning null by default is probably best, rather than having a default implementation that uses the file length, since we should check file lengths explicitly and only compare checksums when the lengths match.

> For other FS, how about we return null?

Yes, I agree.
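As a concrete illustration of "a different algorithm string", the name could be assembled from the file's own parameters, in the style of the examples earlier in the thread.  The exact spelling below is only a guess, not necessarily the string the committed code produces:

{noformat}
class ChecksumAlgorithmNameSketch {
  /** e.g. name(512, 64L << 20) returns "MD5-of-CRC32-every-512bytes-with-64Mblocks" */
  static String name(int bytesPerChecksum, long blockSize) {
    return "MD5-of-CRC32-every-" + bytesPerChecksum
        + "bytes-with-" + (blockSize >> 20) + "Mblocks";
  }
}
{noformat}

Two checksums are then comparable only if their algorithm strings are equal, which is exactly the property the naming scheme is meant to capture.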






[jira] Updated: (HADOOP-3981) Need a distributed file checksum algorithm for HDFS

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HADOOP-3981:
-------------------------------------------

    Release Note: 
Implemented MD5-of-xxxMD5-of-yyyCRC32, a distributed file checksum algorithm for HDFS, where xxx is the number of CRCs per block and yyy is the number of bytes per CRC.

Changed DistCp to use file checksum for comparing files if both source and destination FileSystem(s) support getFileChecksum(...).
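A rough sketch of the comparison the release note describes, written against FileSystem.getFileChecksum(Path); the sameFile helper below is illustrative and is not DistCp's actual code:

{noformat}
import java.io.IOException;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

class DistCpCompareSketch {
  /** Files are considered the same if their lengths match and, when both
   *  file systems can produce a checksum, the checksums match as well. */
  static boolean sameFile(FileSystem srcFs, Path src,
                          FileSystem dstFs, Path dst) throws IOException {
    if (srcFs.getFileStatus(src).getLen() != dstFs.getFileStatus(dst).getLen()) {
      return false;                  // different lengths: certainly different files
    }
    FileChecksum srcSum = srcFs.getFileChecksum(src);
    FileChecksum dstSum = dstFs.getFileChecksum(dst);
    if (srcSum == null || dstSum == null) {
      return true;                   // checksum unsupported: fall back to length only
    }
    return srcSum.equals(dstSum);    // same algorithm name and same checksum bytes
  }
}
{noformat}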



[jira] Updated: (HADOOP-3981) Need a distributed file checksum algorithm for HDFS

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HADOOP-3981:
-------------------------------------------

    Status: Patch Available  (was: Open)

Passed all tests locally, submitting...



[jira] Commented: (HADOOP-3981) Need a distributed file checksum algorithm for HDFS

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630050#action_12630050 ] 

Raghu Angadi commented on HADOOP-3981:
--------------------------------------

Mostly minor:

# DFSClient : {{sock.setSendBufferSize(DEFAULT_DATA_SOCKET_SIZE);}} is not required.
# DFSClient : {{LOG.warn("src=" + src, ie);}} is pretty vague. It could include the datanode as well.
# DFSClient : {{IOException("Fail to get block MD5 for block (=" + block)}} could be {{IOException("Failed to get block MD5 for " + block)}}.
# I don't see much need for {{toByteArray()}}: -0.25. At least it could move to some utility class.
# The 1MB buffer for {{MD5Hash.digest()}} seems too big. Are there any CPU benefits with MD5? This is another utility function written as a member.



[jira] Commented: (HADOOP-3981) Need a distributed file checksum algorithm for HDFS

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12631385#action_12631385 ] 

Hudson commented on HADOOP-3981:
--------------------------------

Integrated in Hadoop-trunk #605 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/605/])



[jira] Updated: (HADOOP-3981) Need a distributed file checksum algorithm for HDFS

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HADOOP-3981:
-------------------------------------------

    Attachment: 3981_20080909.patch

3981_20080909.patch: implements MD5-of-xxxMD5-of-yyyCRC, where xxx is the number of CRCs per block and yyy is the number of bytes per CRC.  Some bugs still need to be fixed.
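To spell out what MD5-of-xxxMD5-of-yyyCRC means, here is a serial, single-machine sketch of the digest layering: a CRC-32 per yyy bytes, an MD5 over each block's CRCs, then an MD5 over the block MD5s.  It only illustrates the structure; the patch computes the per-block part on the datanodes, and details such as the CRC byte order may differ from the committed code.

{noformat}
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.zip.CRC32;

class Md5OfMd5OfCrcSketch {

  /** Assumes bytesPerCrc > 0 and blockSize > 0. */
  static byte[] fileChecksum(byte[] data, int bytesPerCrc, long blockSize)
      throws NoSuchAlgorithmException {
    MessageDigest fileMd5 = MessageDigest.getInstance("MD5");       // MD5 of block MD5s
    for (long blockStart = 0; blockStart < data.length; blockStart += blockSize) {
      long blockEnd = Math.min(data.length, blockStart + blockSize);
      MessageDigest blockMd5 = MessageDigest.getInstance("MD5");    // MD5 of this block's CRCs
      CRC32 crc = new CRC32();
      for (long chunk = blockStart; chunk < blockEnd; chunk += bytesPerCrc) {
        int len = (int) Math.min(bytesPerCrc, blockEnd - chunk);
        crc.reset();
        crc.update(data, (int) chunk, len);
        blockMd5.update(intToBytes((int) crc.getValue()));          // one CRC-32 per chunk
      }
      fileMd5.update(blockMd5.digest());
    }
    return fileMd5.digest();
  }

  private static byte[] intToBytes(int v) {
    return new byte[] { (byte) (v >>> 24), (byte) (v >>> 16), (byte) (v >>> 8), (byte) v };
  }
}
{noformat}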
