Posted to hdfs-dev@hadoop.apache.org by "Owen O'Malley (Resolved) (JIRA)" <ji...@apache.org> on 2012/01/28 04:45:11 UTC

[jira] [Resolved] (HDFS-224) I propose a tool for creating and manipulating a new abstraction, Hadoop Archives.

     [ https://issues.apache.org/jira/browse/HDFS-224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Owen O'Malley resolved HDFS-224.
--------------------------------

    Resolution: Duplicate

We now have a different implementation of Hadoop Archives.
                
> I propose a tool for creating and manipulating a new abstraction, Hadoop Archives.
> ----------------------------------------------------------------------------------
>
>                 Key: HDFS-224
>                 URL: https://issues.apache.org/jira/browse/HDFS-224
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Dick King
>
> -- Introduction
> In some Hadoop map/reduce and dfs use cases, including a specific case that arises in my own work, users would like to populate dfs with a family of hundreds or thousands of directory trees, each of which consists of thousands of files.  In our case, each tree holds perhaps 20 gigabytes: two or three files of 3-10 gigabytes, a thousand small ones, and a large number of files of intermediate size.  I am writing this JIRA to encourage discussion of a new facility I want to create and contribute to the dfs core.
> -- The problem
> You can't store such families of trees in dfs in the obvious manner.  The problem is that the namenode can't handle the millions or tens of millions of files that result from such a family, especially if there are a couple of families.  I understand that dfs will not be able to accommodate tens of millions of files in one instance for quite a while.
> -- Exposed API of my proposed solution
> I would therefore like to produce, and contribute to the dfs core, a new tool that implements an abstraction called a Hadoop Archive [or harchive].  Conceptually, a harchive is a unit, but it manages a space that looks like a directory tree.  The tool exposes an interface that allows a user to do the following [a rough sketch of one possible Java interface appears after the list]:
>  * directory-level operations
>    ** create a harchive [either empty, or initially populated from a locally-stored directory tree].  The namespace for harchives is the same as the space of possible dfs directory locators, and a harchive would in fact be implemented as a dfs directory with specialized contents.
>    ** add a directory tree to an existing harchive in a specific place within the harchive
>    ** retrieve a directory tree or subtree at or beneath the root of the harchive directory structure, into a local directory tree
>  * file-level operations
>    ** add a local file to a specific place in the harchive
>    ** modify a file image in a specific place in the harchive to match a local file
>    ** delete a file image in the harchive
>    ** move a file image within the harchive
>    ** open a file image in the harchive for reading or writing
>  * stream operations
>    ** open a harchive file image for reading or writing as a stream, in a manner similar to dfs files, and read or write it [i.e., hdfsRead(...)].  This would include random-access operators for reading.
>  * management operations
>    ** commit a group of changes [applied atomically; a client crash could never leave half of a change visible in the harchive]
>    ** clean up a harchive whose performance has degraded because of extensive editing
>    ** delete a harchive
> We would also implement a command line interface.
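> To make the shape of this API concrete, here is a rough sketch of one possible Java interface [all names below are hypothetical illustrations, not an implemented API]:
>
>   import java.io.File;
>   import java.io.IOException;
>   import java.io.InputStream;
>   import java.io.OutputStream;
>
>   public interface Harchive {
>     // directory-level operations
>     void addTree(File localRoot, String pathInHarchive) throws IOException;
>     void retrieveTree(String pathInHarchive, File localDest) throws IOException;
>     // file-level operations
>     void addFile(File localFile, String pathInHarchive) throws IOException;
>     void modifyFile(String pathInHarchive, File localFile) throws IOException;
>     void deleteFile(String pathInHarchive) throws IOException;
>     void moveFile(String fromPath, String toPath) throws IOException;
>     // stream operations; the returned streams would behave like dfs
>     // streams, with random-access reads
>     InputStream openForRead(String pathInHarchive) throws IOException;
>     OutputStream openForWrite(String pathInHarchive) throws IOException;
>     // management operations
>     void commit() throws IOException;   // apply pending changes atomically
>     void cleanUp() throws IOException;  // reorganize after extensive editing
>     void delete() throws IOException;   // remove the whole harchive
>   }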
> -- Brief sketch of internals
> A harchive would be represented as a small collection of files, called segments, in a dfs directory at the harchive's location.  Each segment would contain some of the harchive's file images, in a format to be determined, plus a harchive index.  We may group files into segments by size, or by some other criterion.  It is likely that harchives would contain only one segment in common cases.
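> For concreteness, an index entry in such a design might carry fields like the following [purely illustrative; the on-disk format is deliberately left open above]:
>
>   // One entry per file image: the index maps a path inside the
>   // harchive to the segment and byte range holding its contents.
>   public class HarchiveIndexEntry {
>     String path;     // location within the harchive, e.g. "a/b/part-00001"
>     String segment;  // name of the segment file that holds the bytes
>     long offset;     // byte offset of the file image within that segment
>     long length;     // length of the file image in bytes
>   }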
> Changes would be made by adding the contents of the new files, either by rewriting an existing segment that contains not much more data than the size of the changes, or by creating a new segment, complete with a new index.  When dfs is enhanced to allow appends to dfs files, as requested in HADOOP-1700, we would be able to take advantage of that.
> Often, when a harchive is initially populated, it could be a single segment, and a file it contains could be accessed with two random accesses into the segment.  The first access retrieves the index, and the second access retrieves the beginning of the file.  We could choose to put smaller files closer to the index to allow lower average amortized costs per byte.
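> A sketch of that read path against the standard Hadoop FileSystem API [harchiveDir, indexOffset, indexLength, and lookUp() are assumptions carried over from the illustration above]:
>
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.hadoop.fs.FSDataInputStream;
>   import org.apache.hadoop.fs.FileSystem;
>   import org.apache.hadoop.fs.Path;
>
>   FileSystem fs = FileSystem.get(new Configuration());
>   FSDataInputStream seg = fs.open(new Path(harchiveDir, "segment-0"));
>   // random access 1: read the index from its known place in the segment
>   byte[] indexBytes = new byte[indexLength];
>   seg.readFully(indexOffset, indexBytes, 0, indexBytes.length);
>   HarchiveIndexEntry e = lookUp(indexBytes, "a/b/part-00001");
>   // random access 2: seek to the file image itself and start reading
>   byte[] buf = new byte[65536];
>   seg.seek(e.offset);
>   int n = seg.read(buf, 0, (int) Math.min(buf.length, e.length));
>   seg.close();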
> We might instead choose to represent a harchive as one file or a few files holding the large represented files, plus separate smaller files for the represented smaller files.  That would let us make modifications by copying at lower cost.
> The segment containing the index is found by a naming convention.  Atomicity is obtained at commit time by writing new indices and renaming the files containing them according to that convention.
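> A sketch of that commit step [the numbered-index naming convention is one possible choice, not fixed above; writeIndex() is a hypothetical serializer]:
>
>   import org.apache.hadoop.fs.FSDataOutputStream;
>
>   // Write the new index under a temporary name, then rename it into
>   // place.  The rename is the commit point: readers open the
>   // highest-numbered index they find, so they see either the old
>   // index.000041 or, after the rename, the new index.000042.
>   Path tmp = new Path(harchiveDir, "_index.tmp");
>   FSDataOutputStream out = fs.create(tmp);
>   writeIndex(out, entries);
>   out.close();
>   if (!fs.rename(tmp, new Path(harchiveDir, "index.000042"))) {
>     throw new IOException("commit failed: could not install the new index");
>   }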
