Posted to user@hive.apache.org by prasenjit mukherjee <pr...@gmail.com> on 2010/03/15 04:29:34 UTC

Re: HFile backup while cluster running

This is the kind of use case I was looking for (persistent HDFS
across EC2 cluster restarts).

Correct me if I am wrong: I probably don't even need to take
snapshots if I am bringing down and restarting the entire EC2
cluster.  I am using Cloudera's hadoop-ec2 launch/terminate-cluster
commands to start and shut down my Hadoop clusters running on EC2.
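
A minimal sketch of what I mean, assuming Cloudera's hadoop-ec2 script
(the cluster name and size here are made up, and the exact sub-commands
vary between versions):

    # bring up a cluster with 10 slave nodes
    hadoop-ec2 launch-cluster my-hadoop-cluster 10

    # ... use the cluster ...

    # tear the whole cluster down again
    hadoop-ec2 terminate-cluster my-hadoop-cluster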

Now, is there anything additional I need to do (while bringing up the
cluster) to reuse the files previously stored in HDFS (i.e. on the EBS
volumes)?  Do we need the EXACT same number of nodes/slaves, or is that
not required?  Anything beyond that?  Or can any slave be attached to any
EBS volume (without maintaining any history)?

I didn't see any wiki/article on this particular use case (except some
mail threads in HBase). I think this requirement is probably specific
to HDFS (and hence affects HBase and Hive).

-Prasen


On Thu, Mar 4, 2010 at 7:26 AM, Vaibhav Puranik <vp...@gmail.com> wrote:
> Kevin,
>
> Are you using EBS? If so, just take a snapshot of your volumes and create
> new volumes from the snapshots.
>
> Regards,
> Vaibhav Puranik
> GumGum
>
> On Wed, Mar 3, 2010 at 1:12 PM, Jonathan Gray <jl...@streamy.com> wrote:
>
>> Kevin,
>>
>> Taking writes during the transition time will be the issue.
>>
>> If you don't take any writes, then you can flush all your tables and do an
>> HDFS copy the same way.  HBase doesn't actually have to be shut down;
>> that's just recommended to prevent things from changing mid-backup.  If
>> you're careful not to write data, it should be OK.
>>
>> JG
>>
>> -----Original Message-----
>> From: Ted Yu [mailto:yuzhihong@gmail.com]
>> Sent: Wednesday, March 03, 2010 11:40 AM
>> To: hbase-user@hadoop.apache.org
>> Subject: Re: HFile backup while cluster running
>>
>> If you disable writing, you can use
>> org.apache.hadoop.hbase.mapreduce.Export
>> to export all your data, copy the output to your new HDFS, then use
>> org.apache.hadoop.hbase.mapreduce.Import, and finally switch your clients
>> to the new HBase cluster.
>>
>> On Wed, Mar 3, 2010 at 11:27 AM, Kevin Peterson <kpeterson@biz360.com>
>> wrote:
>>
>> > My current setup in EC2 is a Hadoop Map Reduce cluster and HBase
>> > cluster sharing the same HDFS. That is, I have a batch of nodes that
>> > run datanode and tasktracker and a bunch of nodes that run datanode
>> > and regionserver. I'm trying to move HBase off this cluster to a new
>> > cluster with its own HDFS.
>> >
>> > My plan is to shut down the cluster, copy the HFiles using distcp, and
>> > then start up the new cluster. My problem is that it looks like it
>> > will take several hours to transfer the > 1TB of data. I don't want to
>> > be offline that long. Is it possible to copy the HFiles while the
>> > cluster is up? Do I need to take any special precautions? I think my
>> > plan would be to turn off any jobs writing, take what tables I can
>> > offline, and leave the critical tables online but only serving reads.
>> >
>> > Jonathan Gray mentioned he has copied the files with HBase running
>> > successfully in https://issues.apache.org/jira/browse/HBASE-1684
>> >
>>
>>
>
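
For reference, the Export/Import route Ted describes above might look
roughly like the sketch below. The table name, paths, and NameNode
addresses are placeholders, and the exact invocation depends on the
HBase version:

    # with writes disabled, export a table to a directory of sequence files
    hbase org.apache.hadoop.hbase.mapreduce.Export mytable /backup/mytable

    # copy the export to the new cluster's HDFS
    hadoop distcp hdfs://old-nn:8020/backup/mytable \
        hdfs://new-nn:8020/backup/mytable

    # re-load the data into a pre-created table on the new cluster
    hbase org.apache.hadoop.hbase.mapreduce.Import mytable /backup/mytable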

Re: Re: HFile backup while cluster running

Posted by Vaibhav Puranik <vp...@gmail.com>.
That is correct. There is no need to reconfigure the property files if you
use an Elastic IP for each node.
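
A minimal sketch with the EC2 API tools (the instance id and address are
made up; the address must first be allocated with ec2-allocate-address):

    # pin a previously allocated Elastic IP to a node
    ec2-associate-address -i i-12345678 75.101.130.5

    # its public DNS name, ec2-75-101-130-5.compute-1.amazonaws.com,
    # can then go straight into conf/masters, conf/slaves and the
    # property files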

Regards,
Vaibhav

On Mon, Mar 15, 2010 at 2:23 PM, <ch...@gmail.com> wrote:

> I mentioned this on a previous thread, but I think it's worth restating:
> in EC2, the public DNS hostnames follow a well-known naming convention and
> the internal DNS servers automatically resolve the public hostnames to the
> internal IP addresses. So I believe that if you assign Elastic IP addresses
> to the machines in your cluster, you can use the public DNS hostnames in
> your config files and the DNS service will use the internal IPs, avoiding
> the data transfer fee. If a machine fails, you'll be able to replace it by
> re-assigning the Elastic IP to the new instance once it's up and hijacking
> the ailing machine's config files. Granted, if you're just using EC2 as a
> dev cluster then this may not make sense, since you'll be charged for idle
> Elastic IPs while your cluster is not running, but for a longer-running or
> semi-permanent setup it may make sense. Of course, the EC2 network is a
> bit flaky, so you may want to make sure you have something in place that
> provides a bit of DNS fault tolerance (e.g. dnscache).
>
> e.g. in the us-east-1 region, ec2-www-xxx-yyy-zzz.compute-1.amazonaws.com
> is the public hostname for the IP address www.xxx.yyy.zzz.
>
>> Of course you will have to reconfigure your cluster with the new DNS
>> names, but besides that you don't need to do anything.

Re: Re: HFile backup while cluster running

Posted by ch...@gmail.com.
I mentioned this on a previous thread, but I think it's worth restating:
in EC2, the public DNS hostnames follow a well-known naming convention and
the internal DNS servers automatically resolve the public hostnames to the
internal IP addresses. So I believe that if you assign Elastic IP addresses
to the machines in your cluster, you can use the public DNS hostnames in
your config files and the DNS service will use the internal IPs, avoiding
the data transfer fee. If a machine fails, you'll be able to replace it by
re-assigning the Elastic IP to the new instance once it's up and hijacking
the ailing machine's config files. Granted, if you're just using EC2 as a
dev cluster then this may not make sense, since you'll be charged for idle
Elastic IPs while your cluster is not running, but for a longer-running or
semi-permanent setup it may make sense. Of course, the EC2 network is a bit
flaky, so you may want to make sure you have something in place that
provides a bit of DNS fault tolerance (e.g. dnscache).

e.g. in the us-east-1 region, ec2-www-xxx-yyy-zzz.compute-1.amazonaws.com
is the public hostname for the IP address www.xxx.yyy.zzz.
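
A quick way to see this split-horizon behavior for yourself (the hostname
below is made up; use any instance's public DNS name):

    # run from an EC2 instance: resolves to the internal 10.x.y.z address
    dig +short ec2-75-101-130-5.compute-1.amazonaws.com

    # run from outside EC2: resolves to the public address 75.101.130.5
    dig +short ec2-75-101-130-5.compute-1.amazonaws.com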

> Of course you will have to reconfigure your cluster with the new DNS
> names, but besides that you don't need to do anything.


Re: HFile backup while cluster running

Posted by Vaibhav Puranik <vp...@gmail.com>.
Prasenjit,

You don't need to take a snapshot if you plan to reuse the exact same
volumes. If you want to add machines, you can simply add them to the
cluster as soon as it comes up.

Of course you will have to reconfigure your cluster with the new DNS names,
but besides that you don't need to do anything.

We have done exactly that. We store all our Hadoop/HBase data on EBS
volumes: only the Hadoop data directory lives on EBS, while the install
itself is on the instance.
This helps with upgrades too. We just prepare new machines with the new
version of Hadoop/HBase, shut down the cluster, mount the existing volumes
on the new machines, run our configuration script, and bring up the cluster.
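
A rough sketch of the per-node steps (the volume/instance ids, the device,
and the mount point are made up; adjust them to your own layout):

    # attach the node's existing EBS volume to the new instance
    ec2-attach-volume vol-11111111 -i i-22222222 -d /dev/sdf

    # mount it where dfs.data.dir in hdfs-site.xml points
    mount /dev/sdf /mnt/hbase-data

    # start the datanode against the old blocks
    hadoop-daemon.sh start datanode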

Regards,
Vaibhav Puranik

On Sun, Mar 14, 2010 at 8:29 PM, prasenjit mukherjee
<pr...@gmail.com> wrote:

> This is the kind of use case I was looking for (persistent HDFS
> across EC2 cluster restarts).
>
> Correct me if I am wrong: I probably don't even need to take
> snapshots if I am bringing down and restarting the entire EC2
> cluster.  I am using Cloudera's hadoop-ec2 launch/terminate-cluster
> commands to start and shut down my Hadoop clusters running on EC2.
>
> Now, is there anything additional I need to do (while bringing up the
> cluster) to reuse the files previously stored in HDFS (i.e. on the EBS
> volumes)?  Do we need the EXACT same number of nodes/slaves, or is that
> not required?  Anything beyond that?  Or can any slave be attached to any
> EBS volume (without maintaining any history)?
>
> I didn't see any wiki/article on this particular use case (except some
> mail threads in HBase). I think this requirement is probably specific
> to HDFS (and hence affects HBase and Hive).
>
> -Prasen
>
>
> On Thu, Mar 4, 2010 at 7:26 AM, Vaibhav Puranik <vp...@gmail.com>
> wrote:
> > Kevin,
> >
> > Are you using EBS? If so, just take a snapshot of your volumes and
> > create new volumes from the snapshots.
> >
> > Regards,
> > Vaibhav Puranik
> > GumGum
> >
> > On Wed, Mar 3, 2010 at 1:12 PM, Jonathan Gray <jl...@streamy.com> wrote:
> >
> >> Kevin,
> >>
> >> Taking writes during the transition time will be the issue.
> >>
> >> If you don't take any writes, then you can flush all your tables and
> >> do an HDFS copy the same way.  HBase doesn't actually have to be shut
> >> down; that's just recommended to prevent things from changing
> >> mid-backup.  If you're careful not to write data, it should be OK.
> >>
> >> JG
> >>
> >> -----Original Message-----
> >> From: Ted Yu [mailto:yuzhihong@gmail.com]
> >> Sent: Wednesday, March 03, 2010 11:40 AM
> >> To: hbase-user@hadoop.apache.org
> >> Subject: Re: HFile backup while cluster running
> >>
> >> If you disable writing, you can use
> >> org.apache.hadoop.hbase.mapreduce.Export
> >> to export all your data, copy the output to your new HDFS, then use
> >> org.apache.hadoop.hbase.mapreduce.Import, and finally switch your
> >> clients to the new HBase cluster.
> >>
> >> On Wed, Mar 3, 2010 at 11:27 AM, Kevin Peterson <kpeterson@biz360.com>
> >> wrote:
> >>
> >> > My current setup in EC2 is a Hadoop Map Reduce cluster and HBase
> >> > cluster sharing the same HDFS. That is, I have a batch of nodes that
> >> > run datanode and tasktracker and a bunch of nodes that run datanode
> >> > and regionserver. I'm trying to move HBase off this cluster to a new
> >> > cluster with its own HDFS.
> >> >
> >> > My plan is to shut down the cluster, copy the HFiles using distcp, and
> >> > then start up the new cluster. My problem is that it looks like it
> >> > will take several hours to transfer the > 1TB of data. I don't want to
> >> > be offline that long. Is it possible to copy the HFiles while the
> >> > cluster is up? Do I need to take any special precautions? I think my
> >> > plan would be to turn off any jobs writing, take what tables I can
> >> > offline, and leave the critical tables online but only serving reads.
> >> >
> >> > Jonathan Gray mentioned he has copied the files with HBase running
> >> > successfully in https://issues.apache.org/jira/browse/HBASE-1684
> >> >
> >>
> >>
> >
>