Posted to common-user@hadoop.apache.org by Nathan Marz <na...@rapleaf.com> on 2009/02/10 01:17:07 UTC

Backing up HDFS?

How do people back up their data that they keep on HDFS? We have many  
TB of data which we need to get backed up but are unclear on how to do  
this efficiently/reliably.

Re: Backing up HDFS?

Posted by Stefan Podkowinski <sp...@gmail.com>.
On Tue, Feb 10, 2009 at 2:22 AM, Allen Wittenauer <aw...@yahoo-inc.com> wrote:
>
> The key here is to prioritize your data. Impossible-to-replicate data gets
> backed up using whatever means necessary; hard-to-regenerate data is next
> priority. Data that's easy to regenerate and ok to nuke doesn't get backed up.
>

I think that's good advice to start with when creating a backup strategy.
For example, what we do at the moment is analyze huge volumes of access
logs: we import those logs into HDFS, create aggregates for several
metrics, and finally store the results in sequence files using block-level
compression. It's kind of an intermediate format that can be used for
further analysis. Those files end up being pretty small and are exported
daily to storage and backed up. In case HDFS goes to hell, we can restore
some raw log data from the servers and only lose historical logs, which
should not be a big deal.

I must also add that I really enjoy the optimization opportunities Hadoop
gives you by letting you implement the serialization strategies directly.
You really get control over every bit and byte that gets recorded. The
same goes for compression. So you can make the best trade-offs possible
and ultimately store only the data you really need.
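The block-compression point is easy to see with a toy sketch (plain zlib standing in for a SequenceFile codec, and a made-up record format): compressing many small records together as one block lets the codec exploit redundancy across records, which per-record compression cannot.

```python
import zlib

# Made-up access-log-style records: short and highly repetitive.
records = [f"user{i % 100},page{i % 10},200,GET".encode() for i in range(1000)]

# Record-level: compress each record on its own; every record pays the
# codec's fixed overhead and sees no cross-record redundancy.
record_level = sum(len(zlib.compress(r)) for r in records)

# Block-level: concatenate many records, then compress the block once.
block_level = len(zlib.compress(b"\n".join(records)))

print(record_level, block_level)  # block-level comes out far smaller
```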

Re: Backing up HDFS?

Posted by Jeff Hammerbacher <ha...@cloudera.com>.
Hey,

There's also a ticket open to enable global snapshots for a single HDFS
instance: https://issues.apache.org/jira/browse/HADOOP-3637. While this
doesn't solve the multi-site backup issue, it does provide stronger
protection against programmatic deletion of data in a single cluster.

Regards,
Jeff

On Mon, Feb 9, 2009 at 5:22 PM, Allen Wittenauer <aw...@yahoo-inc.com> wrote:

> On 2/9/09 4:41 PM, "Amandeep Khurana" <am...@gmail.com> wrote:
> > Why would you want to have another backup beyond HDFS? HDFS itself
> > replicates your data, so the reliability of the system shouldn't be a
> > concern (if at all it is)...
>
> I'm reminded of a previous job where a site administrator refused to make
> tape backups (despite our continual harassment and pointing out that he was
> in violation of the contract) because he said RAID was "good enough".
>
> Then the RAID controller failed. When we couldn't recover data "from the
> other mirror" he was fired.  Not sure how they ever recovered, esp.
> considering what the data was they lost.  Hopefully they had a paper trail.
>
> To answer Nathan's question:
>
> > On Mon, Feb 9, 2009 at 4:17 PM, Nathan Marz <na...@rapleaf.com> wrote:
> >
> >> How do people back up their data that they keep on HDFS? We have many TB
> of
> >> data which we need to get backed up but are unclear on how to do this
> >> efficiently/reliably.
>
> The content of our HDFSes is loaded from elsewhere and is not considered
> 'the source of authority'.  It is the responsibility of the original
> sources
> to maintain backups and we then follow their policies for data retention.
> For user generated content, we provide *limited* (read: quota'ed) NFS space
> that is backed up regularly.
>
> Another strategy we take is multiple grids in multiple locations that get
> the data loaded simultaneously.
>
> The key here is to prioritize your data. Impossible-to-replicate data gets
> backed up using whatever means necessary; hard-to-regenerate data is next
> priority. Data that's easy to regenerate and ok to nuke doesn't get backed up.
>
>

Re: Backing up HDFS?

Posted by Nathan Marz <na...@rapleaf.com>.
Replication only protects against single node failure. If there's a  
fire and we lose the whole cluster, replication doesn't help. Or if  
there's human error and someone accidentally deletes data, then it's  
deleted from all the replicas. We want our backups to protect against  
all these scenarios.

On Feb 9, 2009, at 4:41 PM, Amandeep Khurana wrote:

> Why would you want to have another backup beyond HDFS? HDFS itself
> replicates your data, so the reliability of the system shouldn't be a
> concern (if at all it is)...
>
> Amandeep
>
>
> Amandeep Khurana
> Computer Science Graduate Student
> University of California, Santa Cruz
>
>
> On Mon, Feb 9, 2009 at 4:17 PM, Nathan Marz <na...@rapleaf.com>  
> wrote:
>
>> How do people back up their data that they keep on HDFS? We have  
>> many TB of
>> data which we need to get backed up but are unclear on how to do this
>> efficiently/reliably.
>>


Re: Backing up HDFS?

Posted by Brian Bockelman <bb...@cse.unl.edu>.
On Feb 9, 2009, at 6:41 PM, Amandeep Khurana wrote:

> Why would you want to have another backup beyond HDFS? HDFS itself
> replicates your data, so the reliability of the system shouldn't be a
> concern (if at all it is)...
>

It should be.  HDFS is not an archival system.  Multiple replicas do not
equate to a backup, just as having RAID1 or RAID5 shouldn't make you
feel safe.

HDFS is actively developed with lots of new features. Bugs creep in.
Things can become inconsistent and mis-replicated. Even though the chance
of loss due to hardware failure is small, losses due to new bugs are
still possible!

Brian

> Amandeep
>
>
> Amandeep Khurana
> Computer Science Graduate Student
> University of California, Santa Cruz
>
>
> On Mon, Feb 9, 2009 at 4:17 PM, Nathan Marz <na...@rapleaf.com>  
> wrote:
>
>> How do people back up their data that they keep on HDFS? We have  
>> many TB of
>> data which we need to get backed up but are unclear on how to do this
>> efficiently/reliably.
>>


Re: Backing up HDFS?

Posted by Steve Loughran <st...@apache.org>.
Allen Wittenauer wrote:
> On 2/9/09 4:41 PM, "Amandeep Khurana" <am...@gmail.com> wrote:
>> Why would you want to have another backup beyond HDFS? HDFS itself
>> replicates your data, so the reliability of the system shouldn't be a
>> concern (if at all it is)...
> 
> I'm reminded of a previous job where a site administrator refused to make
> tape backups (despite our continual harassment and pointing out that he was
> in violation of the contract) because he said RAID was "good enough".
> 
> Then the RAID controller failed. When we couldn't recover data "from the
> other mirror" he was fired.  Not sure how they ever recovered, esp.
> considering what the data was they lost.  Hopefully they had a paper trail.

hope that wasn't at SUNW, given that they do their own controllers

1. Controller failure is lethal, especially if you don't notice for a while.
2. Some products -say, databases- didn't like live updates, so a trick 
evolved of taking part of the RAID array offline and putting that to tape. 
Of course, then there's the problem of what happens there.
3. Tape is still very power efficient; good for a bulk off-site store 
(or a local fire safe).
4. Over at last.fm, they had an accidental rm / on their primary dataset. 
Fortunately they did apparently have another copy somewhere else, and 
now that HDFS has user IDs, you can prevent anyone but the admin team 
from accidentally deleting everyone's data.

Re: Backing up HDFS?

Posted by lohit <lo...@yahoo.com>.
We copy selected files over from HDFS to KFS and use an instance of KFS as a backup file system.
We use distcp to take the backups.
Lohit
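A minimal sketch of such a distcp run, with hypothetical host names and paths (distcp executes as a MapReduce job, so the copy itself is parallelized across the cluster):

```shell
hadoop distcp \
    hdfs://namenode.example.com:8020/data/results \
    kfs://kfs-meta.example.com:20000/backup/results

# For recurring backups, -update copies only files that differ on the
# destination rather than recopying everything.
hadoop distcp -update \
    hdfs://namenode.example.com:8020/data/results \
    kfs://kfs-meta.example.com:20000/backup/results
```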



----- Original Message ----
From: Allen Wittenauer <aw...@yahoo-inc.com>
To: core-user@hadoop.apache.org
Sent: Monday, February 9, 2009 5:22:38 PM
Subject: Re: Backing up HDFS?

On 2/9/09 4:41 PM, "Amandeep Khurana" <am...@gmail.com> wrote:
> Why would you want to have another backup beyond HDFS? HDFS itself
> replicates your data, so the reliability of the system shouldn't be a
> concern (if at all it is)...

I'm reminded of a previous job where a site administrator refused to make
tape backups (despite our continual harassment and pointing out that he was
in violation of the contract) because he said RAID was "good enough".

Then the RAID controller failed. When we couldn't recover data "from the
other mirror" he was fired.  Not sure how they ever recovered, esp.
considering what the data was they lost.  Hopefully they had a paper trail.

To answer Nathan's question:

> On Mon, Feb 9, 2009 at 4:17 PM, Nathan Marz <na...@rapleaf.com> wrote:
> 
>> How do people back up their data that they keep on HDFS? We have many TB of
>> data which we need to get backed up but are unclear on how to do this
>> efficiently/reliably.

The content of our HDFSes is loaded from elsewhere and is not considered
'the source of authority'.  It is the responsibility of the original sources
to maintain backups and we then follow their policies for data retention.
For user generated content, we provide *limited* (read: quota'ed) NFS space
that is backed up regularly.

Another strategy we take is multiple grids in multiple locations that get
the data loaded simultaneously.

The key here is to prioritize your data. Impossible-to-replicate data gets
backed up using whatever means necessary; hard-to-regenerate data is next
priority. Data that's easy to regenerate and ok to nuke doesn't get backed up.

Re: Backing up HDFS?

Posted by Allen Wittenauer <aw...@yahoo-inc.com>.
On 2/9/09 4:41 PM, "Amandeep Khurana" <am...@gmail.com> wrote:
> Why would you want to have another backup beyond HDFS? HDFS itself
> replicates your data, so the reliability of the system shouldn't be a
> concern (if at all it is)...

I'm reminded of a previous job where a site administrator refused to make
tape backups (despite our continual harassment and pointing out that he was
in violation of the contract) because he said RAID was "good enough".

Then the RAID controller failed. When we couldn't recover data "from the
other mirror" he was fired.  Not sure how they ever recovered, esp.
considering what the data was they lost.  Hopefully they had a paper trail.

To answer Nathan's question:

> On Mon, Feb 9, 2009 at 4:17 PM, Nathan Marz <na...@rapleaf.com> wrote:
> 
>> How do people back up their data that they keep on HDFS? We have many TB of
>> data which we need to get backed up but are unclear on how to do this
>> efficiently/reliably.

The content of our HDFSes is loaded from elsewhere and is not considered
'the source of authority'.  It is the responsibility of the original sources
to maintain backups and we then follow their policies for data retention.
For user generated content, we provide *limited* (read: quota'ed) NFS space
that is backed up regularly.

Another strategy we take is multiple grids in multiple locations that get
the data loaded simultaneously.

The key here is to prioritize your data. Impossible-to-replicate data gets
backed up using whatever means necessary; hard-to-regenerate data is next
priority. Data that's easy to regenerate and ok to nuke doesn't get backed up.
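That triage can be sketched as a tiny classifier; the tier names and dataset paths below are made up for illustration:

```python
# Tiers, most critical first; the last tier is "ok to nuke" and never backed up.
PRIORITY = {
    "impossible_to_replicate": 0,  # back up by whatever means necessary
    "hard_to_regenerate": 1,       # next priority
    "easy_to_regenerate": 2,       # don't back up
}

datasets = {
    "/data/user_uploads": "impossible_to_replicate",
    "/data/daily_aggregates": "hard_to_regenerate",
    "/data/tmp_job_output": "easy_to_regenerate",
}

def backup_queue(datasets):
    """Paths to back up, most critical first; 'ok to nuke' data is skipped."""
    keep = [(PRIORITY[tier], path) for path, tier in datasets.items()
            if PRIORITY[tier] < 2]
    return [path for _, path in sorted(keep)]

print(backup_queue(datasets))  # ['/data/user_uploads', '/data/daily_aggregates']
```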


Re: Backing up HDFS?

Posted by Amandeep Khurana <am...@gmail.com>.
Why would you want to have another backup beyond HDFS? HDFS itself
replicates your data, so the reliability of the system shouldn't be a
concern (if at all it is)...

Amandeep


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


On Mon, Feb 9, 2009 at 4:17 PM, Nathan Marz <na...@rapleaf.com> wrote:

> How do people back up their data that they keep on HDFS? We have many TB of
> data which we need to get backed up but are unclear on how to do this
> efficiently/reliably.
>