Posted to common-user@hadoop.apache.org by Darrell Taylor <da...@gmail.com> on 2012/05/29 18:19:20 UTC

Pragmatic cluster backup strategies?

Hi,

We are about to build a 10 machine cluster with 40Tb of storage, obviously
as this gets full actually trying to create an offsite backup becomes a
problem unless we build another 10 machine cluster (too expensive right
now).  Not sure if it will help but we have planned the cabinet into an
upper and lower half with separate redundant power, then we plan to put
half of the cluster in the top, half in the bottom, effectively 2 racks, so
in theory we could lose half the cluster and still have the copies of all
the blocks with a replication factor of 3?  Apart from the data centre
burning down or some other disaster that would render the machines totally
unrecoverable, is this approach good enough?
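
For that to hold, HDFS needs to be told which nodes sit in which half; a
minimal rack-awareness topology script might look like the sketch below
(the hostnames and rack names are made up, and the script is wired in via
topology.script.file.name in core-site.xml):

    #!/bin/bash
    # Sketch of a rack-awareness topology script.  HDFS passes it one or
    # more hostnames/IPs and expects one rack path per line in return.
    # Hostnames and rack names here are hypothetical.
    for host in "$@"; do
      case "$host" in
        node0[1-5]*)          echo "/dc1/rack-upper" ;;  # top half of cabinet
        node0[6-9]*|node10*)  echo "/dc1/rack-lower" ;;  # bottom half
        *)                    echo "/dc1/default-rack" ;;
      esac
    done

With rack awareness in place, the default block placement policy puts at
least one replica on the other rack, so losing either half should still
leave a copy of every block at a replication factor of 3.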

I realise this is a very open question and everyone's circumstances are
different, but I'm wondering what other people's experiences/opinions are
for backing up cluster data?

Thanks
Darrell.

Re: Pragmatic cluster backup strategies?

Posted by Darrell Taylor <da...@gmail.com>.
Sounds like Trash is useful for those times when you delete a bunch of
files by mistake and need to get them back quickly; as you say, not a
backup strategy, but at least a first line of defence.

We had a discussion in the office and came up with the following possible
solution, which stems from the technique we currently use for fast MySQL
backups.  Each of the nodes will have 4 x 3Tb drives on board; what we
propose is to use 2 of the drives on each node for the main data and the
other 2 drives for backup.  Using LVM we will be able to take a snapshot
of all the nodes at the same time.  What an LVM snapshot effectively does
is checkpoint the volume and then preserve a copy of any blocks that
subsequently change (copy-on-write), resulting in a partition that you can
mount that is a view of the node at the snapshot time.  We would simply
run this from cron on all the machines at the same time (machines are
synced with ntp), and this would give us a snapshot of the cluster at a
point in time.  The main question I have here is: if the cluster is busy
doing something at the point in time we take the snapshot, and we then do
a subsequent full restore (after shutting down the cluster etc.), what
potential problems might we see with the data nodes?  I guess there will
be blocks in various random states, but the cluster would be essentially
restored.  Also I guess we need to apply the same technique to the main
namenode and jobtracker machines?

Anybody ever tried anything like this before?  Is it even feasible?
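
For illustration, the sort of per-node cron job we have in mind is
sketched below (volume group, sizes and paths are invented, and this has
not been tested against a live datanode):

    #!/bin/bash
    # Sketch: snapshot the partition holding dfs.data.dir, mount it read-only,
    # and copy it onto the two backup drives.  Assumes the data lives on
    # /dev/vg_data/lv_hdfs and the volume group has free extents to hold the
    # copy-on-write delta while the snapshot exists.
    set -e
    STAMP=$(date +%Y%m%d-%H%M)
    lvcreate --snapshot --size 200G --name "hdfs-snap-$STAMP" /dev/vg_data/lv_hdfs
    mkdir -p "/mnt/hdfs-snap-$STAMP"
    mount -o ro "/dev/vg_data/hdfs-snap-$STAMP" "/mnt/hdfs-snap-$STAMP"
    rsync -a "/mnt/hdfs-snap-$STAMP/" /backup/hdfs-snap-latest/
    umount "/mnt/hdfs-snap-$STAMP"
    lvremove -f "/dev/vg_data/hdfs-snap-$STAMP"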

On Wed, May 30, 2012 at 2:36 PM, Robert Evans <ev...@yahoo-inc.com> wrote:

> I am not an expert on the trash so you probably want to verify everything
> I am about to say.  I believe that trash acts oddly when you try to use it
> to delete a trash directory.  Quotas can potentially get out of sync when doing
> this, but I think it still deletes the directory.  Trash is a nice feature,
> but I wouldn't trust it as a true backup.  I just don't think it is mature
> enough for something like that.  There are enough issues with quotas that
> sadly most of our users almost always add -skipTrash.
>
> Where I work we do a combination of several different things depending on
> the project and their requirements.  In some cases where there are
> government regulations involved we do regular tape backups.  In other cases
> we keep the original data around for some time and can re-import it to HDFS
> if necessary.  In other cases we will copy the data to multiple Hadoop
> clusters.  This is usually for the case where we want to do Hot/Warm
> failover between clusters.  Now we may be different from most other users
> because we do run lots of different projects on lots of different clusters.
>
> --Bobby Evans
>
> On 5/30/12 1:31 AM, "Darrell Taylor" <da...@gmail.com> wrote:
>
> Will "hadoop fs -rm -rf" move everything to the the /trash directory or
> will it delete that as well?
>
> I was thinking along the lines of what you suggest, keep the original
> source of the data somewhere and then reprocess it all in the event of a
> problem.
>
> What do other people do?  Do you run another cluster?  Do you backup
> specific parts of the cluster?  Some form of offsite SAN?
>
> On Tue, May 29, 2012 at 6:02 PM, Robert Evans <ev...@yahoo-inc.com> wrote:
>
> > Yes you will have redundancy, so no single point of hardware failure can
> > wipe out your data, short of a major catastrophe.  But you can still have
> > an errant or malicious "hadoop fs -rm -rf" shut you down.  If you still
> > have the original source of your data somewhere else you may be able to
> > recover, by reprocessing the data, but if this cluster is your single
> > repository for all your data you may have a problem.
> >
> > --Bobby Evans
> >
> > On 5/29/12 11:40 AM, "Michael Segel" <mi...@hotmail.com> wrote:
> >
> > Hi,
> > That's not a backup strategy.
> > You could still have joe luser take out a key file or directory. What do
> > you do then?
> >
> > On May 29, 2012, at 11:19 AM, Darrell Taylor wrote:
> >
> > > Hi,
> > >
> > > We are about to build a 10 machine cluster with 40Tb of storage, obviously
> > > as this gets full actually trying to create an offsite backup becomes a
> > > problem unless we build another 10 machine cluster (too expensive right
> > > now).  Not sure if it will help but we have planned the cabinet into an
> > > upper and lower half with separate redundant power, then we plan to put
> > > half of the cluster in the top, half in the bottom, effectively 2 racks, so
> > > in theory we could lose half the cluster and still have the copies of all
> > > the blocks with a replication factor of 3?  Apart from the data centre
> > > burning down or some other disaster that would render the machines totally
> > > unrecoverable, is this approach good enough?
> > >
> > > I realise this is a very open question and everyone's circumstances are
> > > different, but I'm wondering what other people's experiences/opinions are
> > > for backing up cluster data?
> > >
> > > Thanks
> > > Darrell.
> >
> >
> >
>
>

Re: Pragmatic cluster backup strategies?

Posted by alo alt <wg...@googlemail.com>.
Hi,

you could set fs.trash.interval to the number of minutes you want to keep rm'd data before it is lost forever. The data will be moved into .Trash and deleted after the configured time.
A second way could be to use a fuse mount (fuse-dfs) to mount HDFS and back up your data over that mount into a storage tier. That is not the best solution, but it is a usable approach. 
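
For example, something along these lines (the property lives in
core-site.xml and its value is in minutes; the fuse mount command, mount
point and backup target are placeholders, and the exact fuse-dfs
invocation differs between distributions):

    # core-site.xml: keep deleted files in .Trash for 24 hours
    #   <property>
    #     <name>fs.trash.interval</name>
    #     <value>1440</value>
    #   </property>

    # Mount HDFS via fuse-dfs and copy selected directories off-cluster.
    hadoop-fuse-dfs dfs://namenode:8020 /mnt/hdfs
    rsync -a /mnt/hdfs/important/ backuphost:/backups/hdfs/important/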

cheers,
 Alex 

--
Alexander Alten-Lorenz
http://mapredit.blogspot.com
German Hadoop LinkedIn Group: http://goo.gl/N8pCF

On May 30, 2012, at 8:31 AM, Darrell Taylor wrote:

> Will "hadoop fs -rm -rf" move everything to the the /trash directory or
> will it delete that as well?
> 
> I was thinking along the lines of what you suggest, keep the original
> source of the data somewhere and then reprocess it all in the event of a
> problem.
> 
> What do other people do?  Do you run another cluster?  Do you backup
> specific parts of the cluster?  Some form of offsite SAN?
> 
> On Tue, May 29, 2012 at 6:02 PM, Robert Evans <ev...@yahoo-inc.com> wrote:
> 
>> Yes you will have redundancy, so no single point of hardware failure can
>> wipe out your data, short of a major catastrophe.  But you can still have
>> an errant or malicious "hadoop fs -rm -rf" shut you down.  If you still
>> have the original source of your data somewhere else you may be able to
>> recover, by reprocessing the data, but if this cluster is your single
>> repository for all your data you may have a problem.
>> 
>> --Bobby Evans
>> 
>> On 5/29/12 11:40 AM, "Michael Segel" <mi...@hotmail.com> wrote:
>> 
>> Hi,
>> That's not a backup strategy.
>> You could still have joe luser take out a key file or directory. What do
>> you do then?
>> 
>> On May 29, 2012, at 11:19 AM, Darrell Taylor wrote:
>> 
>>> Hi,
>>> 
>>> We are about to build a 10 machine cluster with 40Tb of storage, obviously
>>> as this gets full actually trying to create an offsite backup becomes a
>>> problem unless we build another 10 machine cluster (too expensive right
>>> now).  Not sure if it will help but we have planned the cabinet into an
>>> upper and lower half with separate redundant power, then we plan to put
>>> half of the cluster in the top, half in the bottom, effectively 2 racks, so
>>> in theory we could lose half the cluster and still have the copies of all
>>> the blocks with a replication factor of 3?  Apart from the data centre
>>> burning down or some other disaster that would render the machines totally
>>> unrecoverable, is this approach good enough?
>>> 
>>> I realise this is a very open question and everyone's circumstances are
>>> different, but I'm wondering what other people's experiences/opinions are
>>> for backing up cluster data?
>>> 
>>> Thanks
>>> Darrell.
>> 
>> 
>> 


Re: Pragmatic cluster backup strategies?

Posted by Robert Evans <ev...@yahoo-inc.com>.
I am not an expert on the trash so you probably want to verify everything I am about to say.  I believe that trash acts oddly when you try to use it to delete a trash directory.  Quotas can potentially get out of sync when doing this, but I think it still deletes the directory.  Trash is a nice feature, but I wouldn't trust it as a true backup.  I just don't think it is mature enough for something like that.  There are enough issues with quotas that sadly most of our users almost always add -skipTrash.

Where I work we do a combination of several different things depending on the project and their requirements.  In some cases where there are government regulations involved we do regular tape backups.  In other cases we keep the original data around for some time and can re-import it to HDFS if necessary.  In other cases we will copy the data to multiple Hadoop clusters.  This is usually for the case where we want to do Hot/Warm failover between clusters.  Now we may be different from most other users because we do run lots of different projects on lots of different clusters.
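
A copy like that between clusters is usually driven by distcp; a minimal
sketch (cluster names and paths are invented) looks like:

    # Sketch: copy a dataset from the primary cluster to the backup cluster.
    # -update skips files that already exist at the destination with the
    # same size.
    hadoop distcp -update \
        hdfs://primary-nn:8020/data/events/2012-05-29 \
        hdfs://backup-nn:8020/data/events/2012-05-29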

--Bobby Evans

On 5/30/12 1:31 AM, "Darrell Taylor" <da...@gmail.com> wrote:

Will "hadoop fs -rm -rf" move everything to the the /trash directory or
will it delete that as well?

I was thinking along the lines of what you suggest, keep the original
source of the data somewhere and then reprocess it all in the event of a
problem.

What do other people do?  Do you run another cluster?  Do you backup
specific parts of the cluster?  Some form of offsite SAN?

On Tue, May 29, 2012 at 6:02 PM, Robert Evans <ev...@yahoo-inc.com> wrote:

> Yes you will have redundancy, so no single point of hardware failure can
> wipe out your data, short of a major catastrophe.  But you can still have
> an errant or malicious "hadoop fs -rm -rf" shut you down.  If you still
> have the original source of your data somewhere else you may be able to
> recover, by reprocessing the data, but if this cluster is your single
> repository for all your data you may have a problem.
>
> --Bobby Evans
>
> On 5/29/12 11:40 AM, "Michael Segel" <mi...@hotmail.com> wrote:
>
> Hi,
> That's not a backup strategy.
> You could still have joe luser take out a key file or directory. What do
> you do then?
>
> On May 29, 2012, at 11:19 AM, Darrell Taylor wrote:
>
> > Hi,
> >
> > We are about to build a 10 machine cluster with 40Tb of storage, obviously
> > as this gets full actually trying to create an offsite backup becomes a
> > problem unless we build another 10 machine cluster (too expensive right
> > now).  Not sure if it will help but we have planned the cabinet into an
> > upper and lower half with separate redundant power, then we plan to put
> > half of the cluster in the top, half in the bottom, effectively 2 racks, so
> > in theory we could lose half the cluster and still have the copies of all
> > the blocks with a replication factor of 3?  Apart from the data centre
> > burning down or some other disaster that would render the machines totally
> > unrecoverable, is this approach good enough?
> >
> > I realise this is a very open question and everyone's circumstances are
> > different, but I'm wondering what other people's experiences/opinions are
> > for backing up cluster data?
> >
> > Thanks
> > Darrell.
>
>
>


Re: Pragmatic cluster backup strategies?

Posted by Darrell Taylor <da...@gmail.com>.
Will "hadoop fs -rm -rf" move everything to the the /trash directory or
will it delete that as well?

I was thinking along the lines of what you suggest, keep the original
source of the data somewhere and then reprocess it all in the event of a
problem.

What do other people do?  Do you run another cluster?  Do you backup
specific parts of the cluster?  Some form of offsite SAN?

On Tue, May 29, 2012 at 6:02 PM, Robert Evans <ev...@yahoo-inc.com> wrote:

> Yes you will have redundancy, so no single point of hardware failure can
> wipe out your data, short of a major catastrophe.  But you can still have
> an errant or malicious "hadoop fs -rm -rf" shut you down.  If you still
> have the original source of your data somewhere else you may be able to
> recover, by reprocessing the data, but if this cluster is your single
> repository for all your data you may have a problem.
>
> --Bobby Evans
>
> On 5/29/12 11:40 AM, "Michael Segel" <mi...@hotmail.com> wrote:
>
> Hi,
> That's not a backup strategy.
> You could still have joe luser take out a key file or directory. What do
> you do then?
>
> On May 29, 2012, at 11:19 AM, Darrell Taylor wrote:
>
> > Hi,
> >
> > We are about to build a 10 machine cluster with 40Tb of storage, obviously
> > as this gets full actually trying to create an offsite backup becomes a
> > problem unless we build another 10 machine cluster (too expensive right
> > now).  Not sure if it will help but we have planned the cabinet into an
> > upper and lower half with separate redundant power, then we plan to put
> > half of the cluster in the top, half in the bottom, effectively 2 racks, so
> > in theory we could lose half the cluster and still have the copies of all
> > the blocks with a replication factor of 3?  Apart from the data centre
> > burning down or some other disaster that would render the machines totally
> > unrecoverable, is this approach good enough?
> >
> > I realise this is a very open question and everyone's circumstances are
> > different, but I'm wondering what other people's experiences/opinions are
> > for backing up cluster data?
> >
> > Thanks
> > Darrell.
>
>
>

Re: Pragmatic cluster backup strategies?

Posted by Robert Evans <ev...@yahoo-inc.com>.
Yes you will have redundancy, so no single point of hardware failure can wipe out your data, short of a major catastrophe.  But you can still have an errant or malicious "hadoop fs -rm -rf" shut you down.  If you still have the original source of your data somewhere else you may be able to recover, by reprocessing the data, but if this cluster is your single repository for all your data you may have a problem.

--Bobby Evans

On 5/29/12 11:40 AM, "Michael Segel" <mi...@hotmail.com> wrote:

Hi,
That's not a backup strategy.
You could still have joe luser take out a key file or directory. What do you do then?

On May 29, 2012, at 11:19 AM, Darrell Taylor wrote:

> Hi,
>
> We are about to build a 10 machine cluster with 40Tb of storage, obviously
> as this gets full actually trying to create an offsite backup becomes a
> problem unless we build another 10 machine cluster (too expensive right
> now).  Not sure if it will help but we have planned the cabinet into an
> upper and lower half with separate redundant power, then we plan to put
> half of the cluster in the top, half in the bottom, effectively 2 racks, so
> in theory we could lose half the cluster and still have the copies of all
> the blocks with a replication factor of 3?  Apart from the data centre
> burning down or some other disaster that would render the machines totally
> unrecoverable, is this approach good enough?
>
> I realise this is a very open question and everyone's circumstances are
> different, but I'm wondering what other people's experiences/opinions are
> for backing up cluster data?
>
> Thanks
> Darrell.



Re: Pragmatic cluster backup strategies?

Posted by Michael Segel <mi...@hotmail.com>.
Hi,
That's not a backup strategy.
You could still have joe luser take out a key file or directory. What do you do then?

On May 29, 2012, at 11:19 AM, Darrell Taylor wrote:

> Hi,
> 
> We are about to build a 10 machine cluster with 40Tb of storage, obviously
> as this gets full actually trying to create an offsite backup becomes a
> problem unless we build another 10 machine cluster (too expensive right
> now).  Not sure if it will help but we have planned the cabinet into an
> upper and lower half with separate redundant power, then we plan to put
> half of the cluster in the top, half in the bottom, effectively 2 racks, so
> in theory we could lose half the cluster and still have the copies of all
> the blocks with a replication factor of 3?  Apart from the data centre
> burning down or some other disaster that would render the machines totally
> unrecoverable, is this approach good enough?
> 
> I realise this is a very open question and everyone's circumstances are
> different, but I'm wondering what other people's experiences/opinions are
> for backing up cluster data?
> 
> Thanks
> Darrell.