Posted to user@hbase.apache.org by "M. C. Srivas" <mc...@gmail.com> on 2012/07/09 07:04:23 UTC

Re: When node is down

On Sun, Jun 24, 2012 at 8:14 PM, Michel Segel <mi...@hotmail.com> wrote:

> You don't notice it faster, it's the timeout.
> You can reduce the timeout, it's configurable. Default is 10 min.
>
> There shouldn't be downtime of the cluster, just the node.
>
> Note this is for Apache. MapR is different and someone from MapR should be
> able to provide details...
>

No downtime for MapR ... the failed drive is detected in 30 seconds or so
(if the controller is jammed, Linux takes about 2 minutes to "un-hang" the
entire system, so it could take as long as that). The drive can be pulled
out and a new one inserted while the system is live. MapR will
automatically reformat and start using the newly added drive in under a
minute.

While you are fetching the replacement drive, the data that was on the bad
drive is immediately rebuilt and redistributed automatically.
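
For the stock Apache/HDFS case in the quoted message, the 10-minute figure is
the NameNode's dead-node interval, roughly
2 x dfs.namenode.heartbeat.recheck-interval + 10 x dfs.heartbeat.interval,
which comes to about 10.5 minutes with the defaults. A minimal hdfs-site.xml
sketch for lowering it; the values are illustrative, and the property names
here follow Hadoop 2.x (older releases use slightly different keys):

  <property>
    <!-- how often the NameNode re-checks for dead DataNodes, in ms (default 300000) -->
    <name>dfs.namenode.heartbeat.recheck-interval</name>
    <value>60000</value>
  </property>
  <property>
    <!-- DataNode heartbeat interval, in seconds (default 3) -->
    <name>dfs.heartbeat.interval</name>
    <value>3</value>
  </property>

With those values a lost DataNode is declared dead after roughly
2*60s + 10*3s = 150 seconds. Don't push it too low, or long GC pauses and
brief network blips will get healthy nodes declared dead and trigger
needless re-replication.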




>
> Sent from a remote device. Please excuse any typos...
>
> Mike Segel
>
> On Jun 22, 2012, at 8:41 AM, Tom Brown <to...@gmail.com> wrote:
>
> > Can it notice the node is down sooner? If that node is serving an active
> > region (or if it's a datanode for an active region), that would be a
> > potentially large amount of downtime. With commodity hardware and a large
> > enough cluster, there will always be a machine or two being rebuilt...
> >
> > Thanks!
> >
> > -Tom
> >
> > On Thursday, June 21, 2012, Michael Segel wrote:
> >
> >> Assuming that you have an Apache release (Apache, Hortonworks, Cloudera) ...
> >> (If MapR, replace the drive and you should be able to repair the cluster
> >> from the console. The node doesn't go down.)
> >> Node goes down.
> >> 10 minutes later, the cluster sees the node is down and should then be
> >> able to replicate the missing blocks.
> >>
> >> Replace the bad disk with a new one and rebuild the file system.
> >> Bring the node up.
> >> Rebalance the cluster.
> >>
> >> That should be pretty much it.
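
A rough shell sketch of those steps on an Apache-style node; the device name,
mount point, and daemon scripts below are illustrative and depend on your
install:

  # rebuild a filesystem on the replacement disk and mount it where the
  # DataNode expects a data directory (device and path are examples)
  mkfs.ext4 /dev/sdb
  mount /dev/sdb /data/1/dfs

  # bring the node back up
  hadoop-daemon.sh start datanode
  hbase-daemon.sh start regionserver

  # spread blocks back onto the node; use 'hadoop balancer' on 1.x releases
  hdfs balancer -threshold 10

The HBase master will hand regions back to the region server on its own, so
usually only the HDFS block balancer needs to be run by hand.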
> >>
> >>
> >> On Jun 21, 2012, at 10:17 PM, David Charle wrote:
> >>
> >>> What is the best practice to remove a node and add the same node back
> >>> for hbase/hadoop?
> >>>
> >>> Currently in our 10-node cluster, 2 nodes went down (bad disk, so the
> >>> node is down since it's the root volume + data); we need to replace the
> >>> disks and add the nodes back. Any quick suggestions or pointers to docs
> >>> for the right procedure?
> >>>
> >>> --
> >>> David
> >>
> >>
>

Re: When node is down

Posted by Kevin O'dell <ke...@cloudera.com>.
Depending on your setup (not MapR), you can also raise the number of failed
volumes a DataNode will tolerate; this lets you keep a node up until you are
ready to replace the single bad drive.
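
A minimal hdfs-site.xml sketch of that knob; the value is illustrative, and
the default of 0 makes the DataNode shut down on the first failed volume:

  <property>
    <!-- number of volumes that may fail before the DataNode takes itself down -->
    <name>dfs.datanode.failed.volumes.tolerated</name>
    <value>1</value>
  </property>

With that set, a DataNode keeps serving from its remaining disks after a
single drive failure, and you can schedule the swap instead of losing the
whole node.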




-- 
Kevin O'Dell
Customer Operations Engineer, Cloudera