Posted to common-dev@hadoop.apache.org by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org> on 2008/08/13 01:07:44 UTC

[jira] Created: (HADOOP-3941) Extend FileSystem API to return file-checksums/file-digests

Extend FileSystem API to return file-checksums/file-digests
-----------------------------------------------------------

                 Key: HADOOP-3941
                 URL: https://issues.apache.org/jira/browse/HADOOP-3941
             Project: Hadoop Core
          Issue Type: New Feature
          Components: fs
            Reporter: Tsz Wo (Nicholas), SZE


Suppose we have two files in two locations (possibly two clusters) and the two files have the same size.  How can we tell whether their contents are the same?

Currently, the only way is to read both files and compare their contents.  This is a very expensive operation if the files are huge.

So, we would like to extend the FileSystem API to support returning file-checksums/file-digests.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-3941) Extend FileSystem API to return file-checksums/file-digests

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12623183#action_12623183 ] 

Doug Cutting commented on HADOOP-3941:
--------------------------------------

How about making FileChecksum an abstract class, adding the method:

public abstract boolean equals(Object other);
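Doug's suggestion might look like the following sketch; everything beyond the abstract equals is illustrative, not the committed Hadoop API:

```java
import java.util.Arrays;

// Sketch of the abstract-class idea: the base class forces each
// checksum type to define its own notion of value equality.
abstract class FileChecksum {
    public abstract String getAlgorithmName();
    public abstract byte[] getBytes();

    @Override
    public abstract boolean equals(Object other);
}

// Illustrative subclass: equal iff the algorithm and the bytes match.
class SimpleFileChecksum extends FileChecksum {
    private final String algorithm;
    private final byte[] bytes;

    SimpleFileChecksum(String algorithm, byte[] bytes) {
        this.algorithm = algorithm;
        this.bytes = bytes.clone();
    }

    @Override public String getAlgorithmName() { return algorithm; }
    @Override public byte[] getBytes() { return bytes.clone(); }

    @Override
    public boolean equals(Object other) {
        if (!(other instanceof SimpleFileChecksum)) return false;
        SimpleFileChecksum that = (SimpleFileChecksum) other;
        return algorithm.equals(that.algorithm) && Arrays.equals(bytes, that.bytes);
    }

    @Override
    public int hashCode() {
        return algorithm.hashCode() ^ Arrays.hashCode(bytes);
    }
}
```

Making equals abstract forces every concrete checksum (MD5-based, CRC-based, length-based) to spell out when two values mean "same content".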




[jira] Updated: (HADOOP-3941) Extend FileSystem API to return file-checksums/file-digests

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HADOOP-3941:
-------------------------------------------

    Attachment: 3941_20080827.patch

3941_20080827.patch: changed DistCp to use LengthFileChecksum



[jira] Commented: (HADOOP-3941) Extend FileSystem API to return file-checksums/file-digests

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12626699#action_12626699 ] 

Tsz Wo (Nicholas), SZE commented on HADOOP-3941:
------------------------------------------------

> Do we expect a FileSystem to actually checksum a file on demand? I assume not, that this feature is primarily for accessing pre-computed checksums, ...

For HDFS, I am not sure whether sending all the CRCs to the client is good enough, since the total size of the CRCs is 1/128 of the file size, which is still large for big files.  We might want to reduce the network traffic (especially in the case of distcp) by computing a second level of checksums (e.g. computing an MD5 over all the CRCs of a block).  So I think this feature is not only for accessing pre-computed checksums; it is really a framework for supporting checksum algorithms.

> In (2) copies should use file lengths or perhaps fail, ...

It should not fail; otherwise, we could not copy from the local fs to HDFS.  We are currently using the file length as the checksum, and that makes false positives far too easy.

> In any case, hardwiring distcp to use FileLengthChecksum doesn't seem like an improvement.

It is only temporary.  Once we have a distributed checksum implementation, we could change DistCp to use it.  The distributed checksum implementation will be optimized for HDFS, so that copying from HDFS to HDFS will be very efficient (which is the main purpose of distcp).  If necessary, we could provide an option in distcp for users to specify the checksum algorithm.



[jira] Commented: (HADOOP-3941) Extend FileSystem API to return file-checksums/file-digests

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628866#action_12628866 ] 

Hudson commented on HADOOP-3941:
--------------------------------

Integrated in Hadoop-trunk #595 (See [http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/595/])



[jira] Commented: (HADOOP-3941) Extend FileSystem API to return file-checksums/file-digests

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628786#action_12628786 ] 

Tsz Wo (Nicholas), SZE commented on HADOOP-3941:
------------------------------------------------

I forgot to mention that the failed test has nothing to do with this issue.  See HADOOP-4078.



[jira] Commented: (HADOOP-3941) Extend FileSystem API to return file-checksums/file-digests

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12623799#action_12623799 ] 

Doug Cutting commented on HADOOP-3941:
--------------------------------------

Why not have the default implementation of getFileChecksum() throw the "unsupported operation" exception so that we don't have duplicated code in every subclass?  Also, should this really throw an exception or return null?  I would guess that most applications would want to handle this not as an exceptional condition somewhere higher on the stack, but rather explicitly where getFileChecksum() is called, so perhaps null would be better.
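A minimal sketch of this null-returning default, with illustrative class names (the real FileSystem method takes a Path and returns a FileChecksum):

```java
// Sketch: the base class supplies the "no checksum" default once, so
// subclasses without checksum support add no code at all, and callers
// treat the absence as an ordinary, locally handled result.
abstract class SketchFileSystem {
    /** Default: checksums are optional; absence is signalled by null. */
    public byte[] getFileChecksum(String path) {
        return null;
    }
}

// A file system without checksum support inherits the default unchanged.
class NoChecksumFs extends SketchFileSystem { }
```

A caller would then branch right at the call site, e.g. fall back to comparing file lengths when getFileChecksum() returns null.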

Do you intend to implement this for HDFS here, or as a separate issue?



[jira] Commented: (HADOOP-3941) Extend FileSystem API to return file-checksums/file-digests

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12625689#action_12625689 ] 

Doug Cutting commented on HADOOP-3941:
--------------------------------------

> Below is a summary of the default getFileChecksum() implementation options [ ... ]

The default should minimize code duplication, if possible.  An abstract method should only be used for mandatory methods.  Since this is an optional method, a default implementation should be provided.

The choice of an exception or null depends on the expected use.  An exception should be thrown for unusual situations that are best handled non-locally, somewhere above the call.  The absence of a checksum should probably be handled at the site of the call, so returning null seems a better choice than an exception here.  Another option might be to return a trivial checksum, e.g., the file's length.

Perhaps we should include a use of this new feature in the patch, to better guide its implementation.  Should we extend distcp to use this?  Or do you have another canonical application in mind?  If we add features without applications of them, we risk a design that does not meet any needs.



[jira] Commented: (HADOOP-3941) Extend FileSystem API to return file-checksums/file-digests

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12627146#action_12627146 ] 

Tsz Wo (Nicholas), SZE commented on HADOOP-3941:
------------------------------------------------

> Distcp should not hardwire any algorithm

That is true.  We might need a method for getting the supported algorithms of a file system.  Algorithms would be sorted by preference.  For example, if S3 supports {MD5, FileLength}, HDFS supports {HDFS-Checksum, FileLength} and LocalFS supports {MD5, HDFS-Checksum, FileLength}, then
- S3 -> HDFS or HDFS -> S3 will use FileLength
- S3 -> S3 will use MD5
- S3 -> LocalFS will use MD5
- LocalFS -> HDFS will use HDFS-Checksum
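The rule this example suggests (pick the first algorithm in the source's preference list that the destination also supports) can be written out as follows; ChecksumNegotiator and pickAlgorithm are hypothetical names, not a committed API:

```java
import java.util.List;

// Sketch: each file system advertises its algorithms in preference order;
// a copy uses the first source algorithm the destination also supports.
class ChecksumNegotiator {
    static String pickAlgorithm(List<String> srcPrefs, List<String> dstPrefs) {
        for (String algo : srcPrefs) {
            if (dstPrefs.contains(algo)) {
                return algo;            // first common algorithm wins
            }
        }
        return null;                    // nothing in common: cannot verify
    }
}
```

Applied to the preference lists above, this reproduces the pairings in the comment, e.g. S3 to HDFS falls back to FileLength while LocalFS to HDFS uses HDFS-Checksum.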




[jira] Commented: (HADOOP-3941) Extend FileSystem API to return file-checksums/file-digests

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12623837#action_12623837 ] 

Tsz Wo (Nicholas), SZE commented on HADOOP-3941:
------------------------------------------------

Below is a summary of the options for the default getFileChecksum() implementation.  The first three were mentioned before; I have added a fourth.
# no implementation; declare it as abstract
# return null
# throw a "not supported" IOException
# if the algorithm is MD5, return an MD5FileChecksum; otherwise, do #2 or #3

However, MD5 in #4 may not be efficient for HDFS, since it would read the entire file.



[jira] Updated: (HADOOP-3941) Extend FileSystem API to return file-checksums/file-digests

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HADOOP-3941:
-------------------------------------------

    Attachment: 3941_20080826.patch

3941_20080826.patch:

- The default implementation is provided in the FileSystem class.
- It returns null when the algorithm is not found.
- Added FileLengthChecksum, which uses file lengths as checksums.



[jira] Commented: (HADOOP-3941) Extend FileSystem API to return file-checksums/file-digests

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12624055#action_12624055 ] 

Tsz Wo (Nicholas), SZE commented on HADOOP-3941:
------------------------------------------------

bq. This patch does not implement checksum on HDFS files, right?

You are correct.  The patch only throws a "not supported" exception for HDFS.

bq. Do you plan to generate MD5s for HDFS files too? For HDFS, does it make sense to create a checksum from the blk*.meta files because the size of the meta file will be much much lesser than the size of the data file?

No, the original MD5 algorithm may not be efficient for large files; I think we need a distributed file-digest algorithm for HDFS.  Yes, one way is to compute an MD5 over the meta files, which would reduce the overhead dramatically.  I will probably implement an MD5-over-CRC32 for HDFS.
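A rough sketch of what such an MD5-over-CRC32 digest might compute, assuming fixed-size chunks and big-endian CRC bytes (the real HDFS chunk size, metadata layout, and byte order are not specified here):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.zip.CRC32;

// Sketch: CRC32 each chunk, then MD5 the concatenated CRC values, so the
// MD5 input is a small fraction of the data, much as HDFS's per-chunk CRCs
// are a small fraction of the file.
class Md5OverCrc32 {
    static byte[] digest(byte[] data, int chunkSize) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            for (int off = 0; off < data.length; off += chunkSize) {
                CRC32 crc = new CRC32();
                crc.update(data, off, Math.min(chunkSize, data.length - off));
                long v = crc.getValue();
                // feed the 4 CRC bytes (big-endian) into the MD5
                md5.update(new byte[] {
                        (byte) (v >>> 24), (byte) (v >>> 16),
                        (byte) (v >>> 8), (byte) v });
            }
            return md5.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError("MD5 is mandatory in Java", e);
        }
    }
}
```

The point of the two levels is that the MD5 only ever sees the CRCs, so the expensive per-byte work stays where the CRCs are already computed.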



[jira] Commented: (HADOOP-3941) Extend FileSystem API to return file-checksums/file-digests

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12623822#action_12623822 ] 

Tsz Wo (Nicholas), SZE commented on HADOOP-3941:
------------------------------------------------

bq. Why not have the default implementation of getFileChecksum() throw the "unsupported operation" exception so that we don't have duplicated code in every subclass? Also, should this really throw an exception or return null? I would guess that most applications would want to handle this not as an exceptional condition somewhere higher on the stack, but rather explicitly where getFileChecksum() is called, so perhaps null would be better. 

For other optional operations (e.g. append), we declare an abstract method in FileSystem and let the sub-classes that do not support it throw "Not supported".  Should we do the same for getFileChecksum()?

I think throwing an IOException might be better than returning null.  Otherwise, applications have to check for null, or they may get an NPE, which is a RuntimeException.

The methods defined in java.security.MessageDigest, e.g. getInstance(String algorithm), throw NoSuchAlgorithmException.  We might want to do something similar.

bq. Do you intend to implement this for HDFS here, or as a separate issue?

Yes, as a separate issue, since implementing it for HDFS involves more work.



[jira] Updated: (HADOOP-3941) Extend FileSystem API to return file-checksums/file-digests

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HADOOP-3941:
-------------------------------------------

    Assignee: Tsz Wo (Nicholas), SZE
      Status: Patch Available  (was: Open)

Passed all tests locally; trying Hudson.



[jira] Updated: (HADOOP-3941) Extend FileSystem API to return file-checksums/file-digests

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HADOOP-3941:
-------------------------------------------

    Attachment: 3941_20080819.patch

3941_20080819.patch: implemented MD5FileChecksum for RawLocalFileSystem.  Need unit tests.
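For reference, an MD5 computation over a stream (roughly what an MD5FileChecksum for a local file system would need) can be sketched with java.security.DigestInputStream; a byte-array stream stands in for a real file here, and the class name is illustrative:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch: stream the data through a DigestInputStream so the MD5 is
// accumulated as a side effect of reading; no second pass is needed.
class Md5Checksummer {
    static byte[] md5Of(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            try (DigestInputStream in =
                         new DigestInputStream(new ByteArrayInputStream(data), md)) {
                byte[] buf = new byte[8192];
                while (in.read(buf) != -1) {
                    // digest updates as bytes pass through
                }
            }
            return md.digest();
        } catch (IOException | NoSuchAlgorithmException e) {
            throw new AssertionError(e);  // cannot happen for in-memory data
        }
    }
}
```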



[jira] Updated: (HADOOP-3941) Extend FileSystem API to return file-checksums/file-digests

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HADOOP-3941:
-------------------------------------------

    Status: Open  (was: Patch Available)



[jira] Commented: (HADOOP-3941) Extend FileSystem API to return file-checksums/file-digests

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12622377#action_12622377 ] 

Tsz Wo (Nicholas), SZE commented on HADOOP-3941:
------------------------------------------------

How about we add the following optional method in the FileSystem API?
{code}
//a new optional method in FileSystem.java
public abstract FileChecksum getFileChecksum(String algorithm, Path p);
{code}
where FileChecksum is a new interface in the hadoop.fs package
{code}
interface FileChecksum {
  String getAlgorithm();

  int getLength();

  byte[] getBytes();
}
{code}
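A hypothetical caller, such as a copy tool, might use this interface as sketched below; the of() factory and sameContent() helper are illustrative additions, not part of the proposal:

```java
import java.util.Arrays;

// Sketch of caller-side use: compare two files by checksum without
// reading their contents. The nested interface mirrors the proposal.
class ChecksumCompare {
    interface FileChecksum {
        String getAlgorithm();
        int getLength();
        byte[] getBytes();
    }

    // Illustrative factory for building checksum values in examples.
    static FileChecksum of(String algorithm, byte[] bytes) {
        return new FileChecksum() {
            public String getAlgorithm() { return algorithm; }
            public int getLength() { return bytes.length; }
            public byte[] getBytes() { return bytes.clone(); }
        };
    }

    // Same content only when both checksums exist, use the same algorithm,
    // and carry identical bytes; anything else is "unknown", not "equal".
    static boolean sameContent(FileChecksum a, FileChecksum b) {
        return a != null && b != null
            && a.getAlgorithm().equals(b.getAlgorithm())
            && Arrays.equals(a.getBytes(), b.getBytes());
    }
}
```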



[jira] Commented: (HADOOP-3941) Extend FileSystem API to return file-checksums/file-digests

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628457#action_12628457 ] 

Doug Cutting commented on HADOOP-3941:
--------------------------------------

> We might need a method for getting the supported algorithms of a file system.

If we remove the "algorithm" parameter to getFileChecksum() then each FileSystem would simply return checksums using its native algorithm.  When these match, cross-filesystem copies would be checksummed.  Later, if we have filesystems that implement multiple checksum algorithms, we might consider something more elaborate, but that seems sufficient for now, no?




[jira] Updated: (HADOOP-3941) Extend FileSystem API to return file-checksums/file-digests

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doug Cutting updated HADOOP-3941:
---------------------------------

    Hadoop Flags: [Reviewed]

+1 This looks good to me.



[jira] Commented: (HADOOP-3941) Extend FileSystem API to return file-checksums/file-digests

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12623909#action_12623909 ] 

dhruba borthakur commented on HADOOP-3941:
------------------------------------------

This patch does not implement checksums for HDFS files, right? 

Do you plan to generate MD5s for HDFS files too? For HDFS, does it make sense to compute a checksum from the blk*.meta files, since the size of the meta file is much, much smaller than the size of the data file?
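The idea of digesting the small per-block checksum data instead of the data itself can be sketched as follows; the chunk size and framing here are illustrative, not HDFS's actual blk*.meta layout:

```java
import java.security.MessageDigest;
import java.util.zip.CRC32;

// Sketch of "checksum the checksums": compute a CRC per fixed-size chunk
// (the kind of data a blk*.meta file stores) and then MD5 only the CRCs,
// so the digest covers far fewer bytes than the file itself.
final class Md5OfCrcs {
  static byte[] digest(byte[] data, int bytesPerCrc) throws Exception {
    MessageDigest md5 = MessageDigest.getInstance("MD5");
    for (int off = 0; off < data.length; off += bytesPerCrc) {
      CRC32 crc = new CRC32();
      crc.update(data, off, Math.min(bytesPerCrc, data.length - off));
      long v = crc.getValue();
      // feed each 4-byte CRC into the running MD5, big-endian
      md5.update(new byte[] {
          (byte) (v >>> 24), (byte) (v >>> 16), (byte) (v >>> 8), (byte) v });
    }
    return md5.digest();
  }
}
```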



[jira] Updated: (HADOOP-3941) Extend FileSystem API to return file-checksums/file-digests

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HADOOP-3941:
-------------------------------------------

    Attachment: 3941_20080904.patch

3941_20080904.patch: Changed FileSystem API to getFileChecksum(Path).



[jira] Updated: (HADOOP-3941) Extend FileSystem API to return file-checksums/file-digests

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HADOOP-3941:
-------------------------------------------

    Status: Patch Available  (was: Open)



[jira] Commented: (HADOOP-3941) Extend FileSystem API to return file-checksums/file-digests

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12625847#action_12625847 ] 

Tsz Wo (Nicholas), SZE commented on HADOOP-3941:
------------------------------------------------

> ... so returning null seems a better choice than an exception here. Another option might be to return a trivial checksum, e.g., the file's length.

I think returning null makes sense.  We cannot return a trivial checksum, since an algorithm is specified in the call; we should only return the checksum generated by the specified algorithm.

> Should we extend distcp to use this? 

Yes, the canonical application is distcp.   I could also change distcp to use the new API.



[jira] Updated: (HADOOP-3941) Extend FileSystem API to return file-checksums/file-digests

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HADOOP-3941:
-------------------------------------------

    Attachment: 3941_20080820.patch

3941_20080820.patch: fixed a bug.

All the patches up to now implement option #1, which is our usual approach.



[jira] Updated: (HADOOP-3941) Extend FileSystem API to return file-checksums/file-digests

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HADOOP-3941:
-------------------------------------------

       Resolution: Fixed
    Fix Version/s: 0.19.0
     Release Note: Added new FileSystem APIs: FileChecksum and FileSystem.getFileChecksum(Path).
           Status: Resolved  (was: Patch Available)

I just committed this.



[jira] Commented: (HADOOP-3941) Extend FileSystem API to return file-checksums/file-digests

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12626853#action_12626853 ] 

Doug Cutting commented on HADOOP-3941:
--------------------------------------

Distcp should not hardwire any algorithm, but rather use the preferred algorithm of the filesystems involved.  That way checksums will not be used just for HDFS->HDFS, but also for S3->S3, etc.



[jira] Updated: (HADOOP-3941) Extend FileSystem API to return file-checksums/file-digests

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HADOOP-3941:
-------------------------------------------

    Attachment: 3941_20080818.patch

3941_20080818.patch: API change preview

- Added getFileChecksum(...) in FileSystem

- Added abstract class FileChecksum
-* implemented equals(...) and hashCode()
-* renamed the method getAlgorithm() mentioned above to getAlgorithmName()

I am going to implement MD5FileChecksum for LocalFileSystem.
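The computation such an MD5FileChecksum would perform can be sketched with just the JDK; this streaming helper is illustrative, not the patch code:

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;

// Illustrative sketch of an MD5 file digest for the local filesystem,
// streamed in fixed-size chunks so huge files never have to fit in memory.
final class Md5OfFile {
  static byte[] digest(Path file) throws Exception {
    MessageDigest md5 = MessageDigest.getInstance("MD5");
    try (InputStream in = Files.newInputStream(file)) {
      byte[] buf = new byte[8192];
      int n;
      while ((n = in.read(buf)) != -1) {
        md5.update(buf, 0, n);  // fold each chunk into the running digest
      }
    }
    return md5.digest();
  }
}
```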



[jira] Commented: (HADOOP-3941) Extend FileSystem API to return file-checksums/file-digests

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12626503#action_12626503 ] 

Hadoop QA commented on HADOOP-3941:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12389024/3941_20080827.patch
  against trunk revision 689733.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of release audit warnings.

    -1 core tests.  The patch failed core unit tests.

    -1 contrib tests.  The patch failed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3134/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3134/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3134/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3134/console

This message is automatically generated.



[jira] Commented: (HADOOP-3941) Extend FileSystem API to return file-checksums/file-digests

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628464#action_12628464 ] 

Tsz Wo (Nicholas), SZE commented on HADOOP-3941:
------------------------------------------------

bq. If we remove the "algorithm" parameter to getFileChecksum() then each FileSystem would simply return checksums using its native algorithm. When these match, cross-filesystem copies would be checksummed. Later, if we have filesystems that implement multiple checksum algorithms, we might consider something more elaborate, but that seems sufficient for now, no?

+1  Then, I will only add
{code}
public FileChecksum getFileChecksum(Path f) throws IOException
{code}
in this patch.  If we need more checksum algorithms later, we should add these two methods:
{code}
public FileChecksum getFileChecksum(String algorithm, Path f) throws IOException

public String[] getSupportedChecksumAlgorithms()
{code}



[jira] Updated: (HADOOP-3941) Extend FileSystem API to return file-checksums/file-digests

Posted by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-3941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz Wo (Nicholas), SZE updated HADOOP-3941:
-------------------------------------------

    Attachment: 3941_20080819b.patch

3941_20080819b.patch: added a unit test and fixed a Findbugs warning.



[jira] Commented: (HADOOP-3941) Extend FileSystem API to return file-checksums/file-digests

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12623184#action_12623184 ] 

Doug Cutting commented on HADOOP-3941:
--------------------------------------

Sorry, that should have been something like:

{code}
public boolean equals(Object other) {
  if (!(other instanceof FileChecksum))
    return false;
  FileChecksum that = (FileChecksum)other;
  return this.getAlgorithm().equals(that.getAlgorithm())
    // byte[].equals() is reference equality; compare the digest contents
    && java.util.Arrays.equals(this.getBytes(), that.getBytes());
}
{code}



[jira] Commented: (HADOOP-3941) Extend FileSystem API to return file-checksums/file-digests

Posted by "Hadoop QA (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628766#action_12628766 ] 

Hadoop QA commented on HADOOP-3941:
-----------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12389550/3941_20080904.patch
  against trunk revision 692492.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    -1 core tests.  The patch failed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3190/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3190/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3190/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3190/console

This message is automatically generated.



[jira] Commented: (HADOOP-3941) Extend FileSystem API to return file-checksums/file-digests

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-3941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12626614#action_12626614 ] 

Doug Cutting commented on HADOOP-3941:
--------------------------------------

I don't see the point in passing the checksum algorithm name to getFileChecksum().  Do we expect a FileSystem to actually checksum a file on demand?  I assume not, that this feature is primarily for accessing pre-computed checksums, and that most filesystems will only support a single checksum algorithm.

There are two primary cases to consider:
  1. Copying files between filesystems that have pre-computed checksums using the same algorithm.
  2. Copying files between filesystems which either do not have pre-computed checksums or use different algorithms.

In (2) copies should use file lengths or perhaps fail, and in (1) we should use checksums.  Right?

In any case, hardwiring distcp to use FileLengthChecksum doesn't seem like an improvement.
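The two cases reduce to a simple decision, sketched below with plain values standing in for the FileSystem calls (illustrative only, not distcp code):

```java
import java.util.Arrays;

// Sketch of the fallback described above: verify by checksum when both
// filesystems expose pre-computed checksums in the same algorithm (case 1),
// otherwise fall back to comparing file lengths (case 2).
final class CopyVerifier {
  static boolean verify(String srcAlg, byte[] srcSum, long srcLen,
                        String dstAlg, byte[] dstSum, long dstLen) {
    if (srcSum != null && dstSum != null && srcAlg.equals(dstAlg)) {
      return Arrays.equals(srcSum, dstSum);  // case 1: same algorithm
    }
    return srcLen == dstLen;                 // case 2: lengths only
  }
}
```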
