Posted to common-dev@hadoop.apache.org by "dhruba borthakur (JIRA)" <ji...@apache.org> on 2008/09/03 07:41:44 UTC

[jira] Created: (HADOOP-4058) Transparent archival and restore of files from HDFS

Transparent archival and restore of files from HDFS
---------------------------------------------------

                 Key: HADOOP-4058
                 URL: https://issues.apache.org/jira/browse/HADOOP-4058
             Project: Hadoop Core
          Issue Type: New Feature
          Components: dfs
            Reporter: dhruba borthakur
            Assignee: dhruba borthakur


There should be a facility to migrate old files away from a production cluster. Access to those files from applications should continue to work transparently, without changing application code, though possibly with reduced performance. The policy engine that does this could be layered on HDFS rather than being built into HDFS itself.
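The selection side of such a policy engine can be sketched with a small, self-contained function. This is only an illustration of the idea, not Hadoop code: `FileInfo` and the threshold parameter are hypothetical names, and a real implementation would read access times from HDFS file status (per-file access times are the subject of HADOOP-1869).

```python
from collections import namedtuple
import time

# Hypothetical metadata record; a real policy layer would obtain the
# path and access time from the HDFS namespace listing.
FileInfo = namedtuple("FileInfo", ["path", "atime"])

def select_cold_files(files, max_age_days, now=None):
    """Return paths of files not accessed within max_age_days,
    oldest (by access time) first."""
    now = now if now is not None else time.time()
    cutoff = now - max_age_days * 86400
    cold = [f for f in files if f.atime < cutoff]
    return [f.path for f in sorted(cold, key=lambda f: f.atime)]
```

A policy daemon layered above HDFS would run a scan like this periodically and hand the resulting list to the migration step.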

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HADOOP-4058) Transparent archival and restore of files from HDFS

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628665#action_12628665 ] 

Doug Cutting commented on HADOOP-4058:
--------------------------------------

> the only issue is that this does not save on inode count for primary hdfs instance
> inodes are as important (or perhaps more) a resource in hdfs as raw storage. 

The har:// format exists to conserve inodes.  One can archive trees as har:// archives, replacing the root with a symbolic link to the archive.  This would conserve inodes in the original and archive filesystems.
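The path rewrite implied by replacing an archived root with a link to a har:// archive can be sketched as follows. The URI layout here is illustrative only; actual har:// URI syntax is defined by the Hadoop archive tooling.

```python
def har_path(archive_uri, root, path):
    """Map a path under an archived root directory to the corresponding
    location inside the har:// archive that replaced it.
    archive_uri and the layout are illustrative, not real har semantics."""
    if not path.startswith(root.rstrip("/") + "/") and path != root:
        raise ValueError("path is not under the archived root")
    rel = path[len(root.rstrip("/")):].lstrip("/")
    return archive_uri.rstrip("/") + "/" + rel if rel else archive_uri
```

For example, with the tree `/data/logs` archived into `logs.har`, a lookup of `/data/logs/2008/09/part-0` resolves to the same relative subpath inside the archive.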




[jira] Commented: (HADOOP-4058) Transparent archival and restore of files from HDFS

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628079#action_12628079 ] 

dhruba borthakur commented on HADOOP-4058:
------------------------------------------

I agree with Allen to a certain extent. The design has to be such that the "retention policy" is part of a layer written above the HDFS file system. I would like the design to be similar to the approach adopted by the block-rebalancing code. The block rebalancer is not in the namenode; it uses a few primitives in the namenode, but most of the logic of what-to-move, when-to-move, etc. is handled outside the namenode.

I would like the "transparent archiving and restoring" code to be designed similarly. I think the implementation would be more than a "bunch of scripts"! It would involve one or more servers that continuously monitor the access patterns in a cluster and take appropriate action.

I like the idea of having a volume abstraction. But I think this idea of auto-archiving-and-restore is orthogonal to "volumes". Can you please explain how they are related?



[jira] Commented: (HADOOP-4058) Transparent archival and restore of files from HDFS

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628687#action_12628687 ] 

Raghu Angadi commented on HADOOP-4058:
--------------------------------------

> the only issue is that this does not save on inode count for primary hdfs instance. 
It should save inodes if the symbolic link points to a directory tree.



[jira] Updated: (HADOOP-4058) Transparent archival and restore of files from HDFS

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dhruba borthakur updated HADOOP-4058:
-------------------------------------

    Attachment: migrator.txt

A very preliminary version of a proposal that supports archival and migration of files across clusters.



[jira] Commented: (HADOOP-4058) Transparent archival and restore of files from HDFS

Posted by "Allen Wittenauer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628070#action_12628070 ] 

Allen Wittenauer commented on HADOOP-4058:
------------------------------------------

Realistically, any data warehousing system needs to have a retention policy that includes "when does this data go away?"

To me, this proposal really sounds like a contrib item: some script that does automatic distcps.

It also sounds like it might be made irrelevant with some of the ideas floating around about simulating autofs-like functionality using volumes.



[jira] Commented: (HADOOP-4058) Transparent archival and restore of files from HDFS

Posted by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628071#action_12628071 ] 

Joydeep Sen Sarma commented on HADOOP-4058:
-------------------------------------------

+1. Would love to have this.

The only issue is that this does not save on inode count for the primary HDFS instance. Longer term I would like to see a junction/mount-point construct that can save on inodes as well (the mount point could refer to a Hadoop archive URI, and the subpath would be resolved relative to that).

The reason for bringing this up is that inodes seem to be as important a resource in HDFS as raw storage, or perhaps more so, so "archival" needs to take care of that as well.
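The junction/mount-point idea can be sketched as a lookup through a mount table, where a mount target may be an archive URI and the subpath is carried over relative to it. The table format and function below are purely illustrative; no such construct exists in HDFS yet.

```python
def resolve(mount_table, path):
    """Resolve a path through a hypothetical mount table, using the
    longest matching mount point. Targets may be har:// archive URIs;
    the remaining subpath is appended relative to the target."""
    best = None
    for mount in mount_table:
        if path == mount or path.startswith(mount.rstrip("/") + "/"):
            if best is None or len(mount) > len(best):
                best = mount
    if best is None:
        return path  # no junction crossed: plain HDFS path
    rel = path[len(best):].lstrip("/")
    target = mount_table[best]
    return target.rstrip("/") + "/" + rel if rel else target
```

Because the junction is a single namespace entry, an entire archived subtree costs one inode on the primary instance regardless of how many files it contains.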



[jira] Updated: (HADOOP-4058) Transparent archival and restore of files from HDFS

Posted by "Rodrigo Schmidt (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rodrigo Schmidt updated HADOOP-4058:
------------------------------------

    Attachment: dfscron.txt

An alternative proposal, a bit more general, that supports different types of services triggered by an HDFS monitor.



[jira] Commented: (HADOOP-4058) Transparent archival and restore of files from HDFS

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HADOOP-4058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12627909#action_12627909 ] 

dhruba borthakur commented on HADOOP-4058:
------------------------------------------

Hadoop users have been using Hadoop clusters as a queryable archive warehouse. This means that once data gets into the warehouse it is very unlikely to be deleted, which puts tremendous pressure on adding storage capacity to the production cluster.

There could be a set of storage-heavy nodes that cannot be added to the production cluster because they do not have enough memory and CPU. One option would be to use this old-cluster to archive old files from the production cluster.

A layer of software can scan the file system in the production cluster to find files with the earliest access times (HADOOP-1869). These files can be moved to the old-cluster, and the original file in the production cluster can be replaced by a symbolic link (via HADOOP-4044). A read of the original file still works because of the symbolic link. Another piece of software periodically scans the old-cluster, finds files that were accessed recently, and moves them back to the production cluster.
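The migrate-and-restore cycle described above can be sketched as follows. `FakeFS` is a stand-in for a filesystem client, and the archive layout, symlink target, and method names are all illustrative; a real implementation would use the HDFS FileSystem API with the symlink support proposed in HADOOP-4044.

```python
class FakeFS:
    """In-memory stand-in for an HDFS client (illustrative only)."""
    def __init__(self):
        self.files = {}   # path -> contents
        self.links = {}   # path -> symlink target

    def copy_from(self, src, src_path, dst_path):
        self.files[dst_path] = src.files[src_path]

    def delete(self, path):
        self.files.pop(path, None)
        self.links.pop(path, None)

    def symlink(self, target, path):
        self.links[path] = target

def archive_file(prod, archive, path):
    """Move a cold file to the archive cluster and leave a symlink
    behind so reads of the original path keep working."""
    archive_path = "/archive" + path                        # illustrative layout
    archive.copy_from(prod, path, archive_path)
    prod.delete(path)
    prod.symlink("hdfs://archive-nn" + archive_path, path)

def restore_file(prod, archive, path):
    """Inverse: bring a recently accessed file back to production,
    replacing the symlink with the real file."""
    archive_path = "/archive" + path
    prod.delete(path)                                       # drop the symlink
    prod.copy_from(archive, archive_path, path)
    archive.delete(archive_path)
```

The two functions are the per-file actions; the monitoring servers decide when each one fires, based on access times on each cluster.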

The advantage of this approach is that it is "layered": it is not built into HDFS but depends on two artifacts of HDFS, symbolic links and access times. I hate to put more and more intelligence into core HDFS; otherwise the code becomes bloated and difficult to maintain.




[jira] Updated: (HADOOP-4058) Transparent archival and restore of files from HDFS

Posted by "dhruba borthakur (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HADOOP-4058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

dhruba borthakur updated HADOOP-4058:
-------------------------------------

    Attachment: archival.pdf

A slide describing archival, presented at Hadoop Summit 2009.
