Posted to common-dev@hadoop.apache.org by "dhruba borthakur (JIRA)" <ji...@apache.org> on 2008/09/03 07:53:44 UTC

[jira] Commented: (HADOOP-4058) Transparent archival and restore of files from HDFS

    [ https://issues.apache.org/jira/browse/HADOOP-4058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12627909#action_12627909 ] 

dhruba borthakur commented on HADOOP-4058:
------------------------------------------

Hadoop users have been using Hadoop clusters as a queryable archive warehouse. This means that once data gets into the warehouse, it is very unlikely to be deleted. This creates constant pressure to add storage capacity to the production cluster.

There could be a set of storage-heavy nodes that cannot be added to the production cluster because they do not have enough memory and CPU. One option would be to run these nodes as a separate cluster (the "old-cluster") and use it to archive old files from the production cluster.

A layer of software can scan the file system in the production cluster to find files with the earliest access times (HADOOP-1869). These files can be moved to the old-cluster, and each original file in the production cluster can be replaced by a symbolic link (via HADOOP-4044). Reading the original path still works because the symbolic link resolves to the archived copy. Another piece of software periodically scans the old-cluster, finds files that were accessed recently, and tries to move them back to the production cluster.
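
The archive and restore passes described above can be sketched as follows. This is only a toy, in-memory model written to illustrate the idea, not HDFS code: the Entry class, the "old:" link prefix, and the function names cold_files, archive, read, and restore are all hypothetical, and each cluster is modelled as a plain dictionary from path to entry.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

LINK_PREFIX = "old:"  # hypothetical marker for a link into the old-cluster


@dataclass
class Entry:
    atime: int                    # last access time (arbitrary time units)
    data: Optional[bytes] = None  # file contents; None when this is a link
    link: Optional[str] = None    # link target such as "old:/logs/a"


Cluster = Dict[str, Entry]  # toy stand-in for a cluster namespace


def cold_files(cluster: Cluster, cutoff: int) -> List[str]:
    """Regular files last accessed before cutoff, oldest first."""
    paths = [p for p, e in cluster.items()
             if e.link is None and e.atime < cutoff]
    return sorted(paths, key=lambda p: cluster[p].atime)


def archive(prod: Cluster, old: Cluster, cutoff: int) -> List[str]:
    """Move cold files to the old-cluster and leave a link behind."""
    moved = cold_files(prod, cutoff)
    for path in moved:
        entry = prod[path]
        # Copy first, then swap the original for a link.
        old[path] = Entry(atime=entry.atime, data=entry.data)
        prod[path] = Entry(atime=entry.atime, link=LINK_PREFIX + path)
    return moved


def read(prod: Cluster, old: Cluster, path: str, now: int) -> bytes:
    """Read through the production namespace, following a link if present."""
    entry = prod[path]
    if entry.link is not None:
        entry = old[entry.link[len(LINK_PREFIX):]]
    entry.atime = now  # the access is recorded where the data lives
    return entry.data


def restore(prod: Cluster, old: Cluster, since: int) -> List[str]:
    """Move files accessed at or after 'since' back to production."""
    hot = [p for p, e in old.items() if e.atime >= since]
    for path in hot:
        entry = old.pop(path)
        prod[path] = Entry(atime=entry.atime, data=entry.data)
    return hot


prod = {"/logs/a": Entry(100, b"A"), "/logs/b": Entry(900, b"B")}
old: Cluster = {}
archive(prod, old, cutoff=500)          # "/logs/a" moves; a link replaces it
read(prod, old, "/logs/a", now=1000)    # still readable via the link
restore(prod, old, since=950)           # recently read, so it moves back
```

In this sketch neither pass needs anything from the file system beyond access times and links, which is the point of keeping the policy outside HDFS. A real tool would also have to handle failures between the copy and the link swap, which the toy model ignores.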

The advantage of this approach is that it is "layered": it is not built into HDFS but depends on two HDFS features, symbolic links and access times. I would rather not put more and more intelligence into core HDFS, because that makes the code bloated and difficult to maintain.


> Transparent archival and restore of files from HDFS
> ---------------------------------------------------
>
>                 Key: HADOOP-4058
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4058
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: dfs
>            Reporter: dhruba borthakur
>            Assignee: dhruba borthakur
>
> There should be a facility to migrate old files away from a production cluster. Access to those files from applications should continue to work transparently, without changing application code, but maybe with reduced performance. The policy engine that does this could be layered on HDFS rather than being built into HDFS itself.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.