Posted to common-dev@hadoop.apache.org by "Mahadev konar (JIRA)" <ji...@apache.org> on 2008/04/24 22:35:21 UTC

[jira] Commented: (HADOOP-3307) Archives in Hadoop.

    [ https://issues.apache.org/jira/browse/HADOOP-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12592164#action_12592164 ] 

Mahadev konar commented on HADOOP-3307:
---------------------------------------

Here is the design for the archives. 

Archiving files in HDFS

-- Motivation-- 

The Namenode is a limited resource, and we usually end up with lots of small files that users do not access very often. We would like to create an archiving utility that can archive these files into a form that is semi-transparent and usable by map reduce.

-- Why not just concatenate the files? --
 We understand that concatenating files might be useful, but it is not a full-fledged solution for archiving. Users want to keep their files as distinct files and would sometimes like to unarchive without losing the file layout.

-- Requirements-- 
 -- Transparent or semi-transparent usage of archives. 
 -- Must be able to archive and unarchive in parallel. 
 -- Mutable archives are not a requirement, but the design should not prevent them from being implemented later.
 -- Compression is not a goal.

-- Archive Format --
-- Conventional archive formats like tar are not convenient for parallel archive creation. 
-- Here is a proposal that allows archives to be created in parallel.

The format of an archive as a filesystem path is: 

/user/mahadev/foo.har/_index*
/user/mahadev/foo.har/part-* 

The index files store the file names and their offsets within the part files.
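As a sketch of how such an index could be used, the snippet below maps each archived file name to the part file, offset, and length that hold its data. The one-line-per-file text format shown here (name, part file, offset, length) is an assumption for illustration only, not the actual on-disk _index format:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: assume one index line per archived file, of the
// form "<fileName> <partFile> <offset> <length>". The real _index layout is
// not specified in this proposal.
public class HarIndexSketch {
    static class Entry {
        final String partFile;
        final long offset, length;
        Entry(String partFile, long offset, long length) {
            this.partFile = partFile;
            this.offset = offset;
            this.length = length;
        }
    }

    // Parse the assumed index text into a lookup table.
    static Map<String, Entry> parseIndex(String indexText) {
        Map<String, Entry> index = new HashMap<>();
        for (String line : indexText.split("\n")) {
            if (line.isEmpty()) continue;
            String[] f = line.split(" ");
            index.put(f[0], new Entry(f[1], Long.parseLong(f[2]), Long.parseLong(f[3])));
        }
        return index;
    }

    public static void main(String[] args) {
        String indexText = "dir/a.txt part-0 0 1024\n"
                         + "dir/b.txt part-0 1024 2048\n";
        Entry e = parseIndex(indexText).get("dir/b.txt");
        // To read dir/b.txt: open part-0, seek to offset 1024, read 2048 bytes.
        System.out.println(e.partFile + " " + e.offset + " " + e.length);
    }
}
```

Because each part file carries its own index entries, many writers can build part files and index entries independently, which is what makes parallel archive creation possible.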

-- URI Syntax -- 
The Har FileSystem is a client-side filesystem which is semi-transparent. 
-- har:<archivePath>!<fileInArchive> (similar to jar uri)
example: har:hdfs://host:port/pathinfilesystem/foo.har!path_inside_thearchive
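Resolving such a URI amounts to splitting it at the '!' separator, jar-style. A minimal sketch of that split (the method name and error handling here are illustrative, not the actual HarFileSystem API):

```java
// Minimal sketch: split the proposed har URI at '!' into the underlying
// archive path and the path inside the archive.
public class HarUriSketch {
    static String[] splitHarUri(String uri) {
        if (!uri.startsWith("har:")) {
            throw new IllegalArgumentException("not a har URI: " + uri);
        }
        String rest = uri.substring("har:".length());
        int bang = rest.indexOf('!');
        if (bang < 0) {
            // No '!' means the archive itself is being addressed.
            return new String[] { rest, "/" };
        }
        return new String[] { rest.substring(0, bang), rest.substring(bang + 1) };
    }

    public static void main(String[] args) {
        String[] parts = splitHarUri(
            "har:hdfs://host:port/user/mahadev/foo.har!dir/file.txt");
        System.out.println(parts[0]); // hdfs://host:port/user/mahadev/foo.har
        System.out.println(parts[1]); // dir/file.txt
    }
}
```

The first component is an ordinary filesystem path that locates foo.har; the second is looked up in the archive's index files.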

-- How will map reduce work with this new filesystem? --
   No changes to map reduce are required for archives to be usable as input to map reduce jobs.

-- How will the dfs commands work -- 

   The DFS commands will have to specify the whole URI when operating on files inside archives. Archives are immutable, so renames, deletes, and creates will throw an exception in the initial versions of archives. 
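The immutability rule above can be sketched as a filesystem view in which every mutating operation fails. The class and method names below are generic illustrations, not the actual HarFileSystem code:

```java
// Illustrative sketch: a read-only archive view where all mutating
// operations throw, matching the behaviour described above.
public class ImmutableHarView {
    public void create(String path) {
        throw new UnsupportedOperationException("har is read-only: create " + path);
    }
    public void rename(String src, String dst) {
        throw new UnsupportedOperationException("har is read-only: rename " + src);
    }
    public void delete(String path) {
        throw new UnsupportedOperationException("har is read-only: delete " + path);
    }

    public static void main(String[] args) {
        ImmutableHarView fs = new ImmutableHarView();
        try {
            fs.delete("/user/mahadev/foo.har!dir/file.txt");
        } catch (UnsupportedOperationException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

Keeping archives immutable in the first version sidesteps index-consistency problems; a later mutable design would have to update the index files and part files atomically.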

-- How will permissions work with archives --
   In the first version of HAR, files archived into a HAR will lose the permissions they originally had. In later versions, permissions can be stored in the archive metadata, making it possible to unarchive without losing them.

-- Future Work:

-- Transparent use of archives. 
   This will require changes to the Hadoop File System to support mounts that point to archives, and changes to DFSClient to transparently walk such a mount to the real archive, allowing transparent use of archives.
 
Comments?





> Archives in Hadoop.
> -------------------
>
>                 Key: HADOOP-3307
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3307
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: fs
>            Reporter: Mahadev konar
>            Assignee: Mahadev konar
>             Fix For: 0.18.0
>
>
> This is a new feature for archiving and unarchiving files in HDFS. 
