Posted to common-issues@hadoop.apache.org by "Kihwal Lee (JIRA)" <ji...@apache.org> on 2012/05/15 20:19:20 UTC

[jira] [Commented] (HADOOP-8240) Allow users to specify a checksum type on create()

    [ https://issues.apache.org/jira/browse/HADOOP-8240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13276083#comment-13276083 ] 

Kihwal Lee commented on HADOOP-8240:
------------------------------------

We need this feature to make data copying and verification work across clusters with different configurations. I would appreciate any feedback.

h4. Design Choices

# *Add a new create method to FileSystem that allows the checksum type to be specified.* FileSystem#create() already allows specifying bytesPerChecksum. The new create method may accept a DataChecksum object. Users can use the existing DataChecksum.newDataChecksum(int type, int bytesPerChecksum) to create one. Users who want to specify a non-default type will likely want to control bytesPerChecksum as well.
# *Add checksum types to CreateFlag.* This approach minimizes interface changes, but may not be the most intuitive or consistent way.
# *Add a method to FSDataOutputStream and DFSOutputStream to allow users to override default checksum parameters.* This method should fail if data has already been written. This is somewhat like ioctl. If there are other tunables we want to support, we could generalize the API. But changing an object's internal parameters (as opposed to its encapsulated data) at run time doesn't fit typical Java semantics and may cause confusion, so we need to be careful about this.
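
To make the "fail if data is already written" rule in option 3 concrete, here is a minimal, self-contained sketch. It is not Hadoop code: the class ChecksumTunableStream, the ChecksumType enum, the setChecksum() method, and the default values are all hypothetical names and numbers made up for illustration.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical model of option 3; this is not Hadoop's FSDataOutputStream. */
final class ChecksumTunableStream {
    /** Hypothetical stand-in for the checksum types under discussion. */
    enum ChecksumType { CRC32, CRC32C }

    private final ByteArrayOutputStream sink = new ByteArrayOutputStream();
    private ChecksumType type = ChecksumType.CRC32; // assumed default
    private int bytesPerChecksum = 512;             // assumed default
    private long bytesWritten = 0;

    /** Overrides the checksum parameters; allowed only before the first write. */
    void setChecksum(ChecksumType type, int bytesPerChecksum) {
        if (bytesWritten > 0) {
            throw new IllegalStateException(
                "checksum parameters cannot change after data is written");
        }
        if (type == null || bytesPerChecksum <= 0) {
            throw new IllegalArgumentException("invalid checksum parameters");
        }
        this.type = type;
        this.bytesPerChecksum = bytesPerChecksum;
    }

    /** Appends data to the in-memory sink and records how much was written. */
    void write(byte[] data) {
        sink.write(data, 0, data.length);
        bytesWritten += data.length;
    }

    ChecksumType getChecksumType() { return type; }

    int getBytesPerChecksum() { return bytesPerChecksum; }

    public static void main(String[] args) {
        ChecksumTunableStream out = new ChecksumTunableStream();
        out.setChecksum(ChecksumType.CRC32C, 4096); // fine: nothing written yet
        out.write(new byte[] {1, 2, 3});
        try {
            out.setChecksum(ChecksumType.CRC32, 512); // too late: data exists
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The sketch also shows the concern raised above: the same object behaves differently depending on when the setter is called, which callers may not expect.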

h4. Other previously discussed approaches

# *Setting dfs.checksum.type.* The FileSystem cache causes it to stay the same after the DFSClient is created. Also, conf is shared, so changing it can have unforeseen side effects.
# *Disable the FileSystem cache.* Create a new Configuration and set dfs.checksum.type. Without the cache, the memory bloat is too much.
# *Use conf as a part of the key in the FileSystem cache, in addition to UGI and scheme + authority.* Something along these lines may work. A shallow comparison may not be enough. Do we create a special hashCode/equals to make it safer? There will be memory bloat, but how much? It is still up to users to manage the different configurations, which may make this approach more prone to mistakes.
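
The caching behavior behind approaches 1 and 3 can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions, not the real FileSystem cache: FsCacheSketch, Client, and Key are hypothetical classes, a plain Map stands in for Configuration, the user string stands in for UGI, and "CRC32" is an assumed default.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical model of a client cache; not Hadoop's FileSystem cache. */
final class FsCacheSketch {
    /** Hypothetical client that reads the checksum type once, at creation. */
    static final class Client {
        final String checksumType;
        Client(Map<String, String> conf) {
            this.checksumType = conf.getOrDefault("dfs.checksum.type", "CRC32");
        }
    }

    /** Cache key; confPart stays null when conf is not part of the key. */
    static final class Key {
        final String scheme, authority, user, confPart;
        Key(String scheme, String authority, String user, String confPart) {
            this.scheme = scheme;
            this.authority = authority;
            this.user = user;
            this.confPart = confPart;
        }
        @Override public boolean equals(Object o) {
            if (!(o instanceof Key)) return false;
            Key k = (Key) o;
            return scheme.equals(k.scheme) && authority.equals(k.authority)
                && user.equals(k.user) && Objects.equals(confPart, k.confPart);
        }
        @Override public int hashCode() {
            return Objects.hash(scheme, authority, user, confPart);
        }
    }

    private final Map<Key, Client> cache = new HashMap<>();
    private final boolean keyIncludesConf;

    FsCacheSketch(boolean keyIncludesConf) {
        this.keyIncludesConf = keyIncludesConf;
    }

    /** Returns the cached client for the key, creating it on first use. */
    Client get(String scheme, String authority, String user,
               Map<String, String> conf) {
        String confPart = keyIncludesConf
            ? conf.getOrDefault("dfs.checksum.type", "CRC32") : null;
        Key key = new Key(scheme, authority, user, confPart);
        return cache.computeIfAbsent(key, k -> new Client(conf));
    }

    int size() { return cache.size(); }

    public static void main(String[] args) {
        Map<String, String> a = new HashMap<>();
        a.put("dfs.checksum.type", "CRC32");
        Map<String, String> b = new HashMap<>();
        b.put("dfs.checksum.type", "CRC32C");

        FsCacheSketch plain = new FsCacheSketch(false);
        plain.get("hdfs", "nn1:8020", "alice", a);
        // Same key, so the first client is reused and b's setting is ignored.
        System.out.println(plain.get("hdfs", "nn1:8020", "alice", b).checksumType);

        FsCacheSketch keyed = new FsCacheSketch(true);
        keyed.get("hdfs", "nn1:8020", "alice", a);
        keyed.get("hdfs", "nn1:8020", "alice", b);
        System.out.println(keyed.size()); // one entry per distinct setting
    }
}
```

Keying on a single setting, as the sketch does, sidesteps the shallow-comparison question; keying on the whole conf would need the special hashCode/equals mentioned above, and every distinct configuration would add another cached client.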

                
> Allow users to specify a checksum type on create()
> --------------------------------------------------
>
>                 Key: HADOOP-8240
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8240
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: fs
>    Affects Versions: 0.23.0
>            Reporter: Kihwal Lee
>            Assignee: Kihwal Lee
>             Fix For: 0.23.3, 2.0.0, 3.0.0
>
>         Attachments: hadoop-8240.patch
>
>
> Per the discussion in HADOOP-8060, a way for users to specify a checksum type on create() is needed. The way the FileSystem cache works makes it impossible to use dfs.checksum.type to achieve this. Also, the checksum-related API is at the FileSystem level, so we prefer something at that level, not an HDFS-specific one. The current proposal is to use CreateFlag.
