Posted to user@hbase.apache.org by Lucas Stanley <lu...@gmail.com> on 2013/06/11 03:42:29 UTC

Failure scenarios in HBase

Hi,

I'm trying to understand how failures are handled in HBase.

One Disk Failure:
If one disk on a Region Server fails and some HFiles are lost on that
machine, how will that Region Server handle incoming reads for the missing
data? Will the HRegion read from a remote node's replicated HFile over the
network? Will this cause the reads to be slow for this particular set of
data?


Full node failure:
Also, if a Region Server completely crashes/panics, will some reads fail for
a few minutes? If that crashed Region Server was hosting 5 regions, I guess
it will take some time for other nodes to take over those regions and
replay the WAL. So, can I expect a few minutes of downtime before I can
read from the crashed regions again?

Re: Failure scenarios in HBase

Posted by Ted Yu <yu...@gmail.com>.

Lucas:
You can also find some interesting discussion in HBASE-8701 where we try to
handle the case where concurrent writes to the region server carry the same
timestamp as some of the Puts that are being replayed.

Cheers

On Mon, Jun 10, 2013 at 7:18 PM, Sergey Shelukhin <se...@hortonworks.com> wrote:

> HBase stores HFiles (and other files) in HDFS, so HDFS replication should
> take care of the lost replica. The region server may indeed end up reading
> those files from a remote machine; if it keeps running, however, compaction
> of the files will eventually restore locality.
>
> In case of a full node failure, ideally there should not be any downtime;
> some requests may just take longer as they retry through the period when
> that node is down. Often, recovery can be very fast.
> Take a look at HBASE-5843; it has a summary of the recent MTTR (mean time
> to recover) improvement work.
> There's also HBASE-7006 and some related JIRAs that may allow us to serve
> the region faster after recovery.

Re: Failure scenarios in HBase

Posted by Sergey Shelukhin <se...@hortonworks.com>.

HBase stores HFiles (and other files) in HDFS, so HDFS replication should
take care of the lost replica. The region server may indeed end up reading
those files from a remote machine; if it keeps running, however, compaction
of the files will eventually restore locality.
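
A rough sketch of that read path, as a toy model rather than actual HDFS or
HBase code. The names Block, pick_replica, locality and compact are made up
for illustration, and "first live remote host in sorted order" is only a
stand-in for however the HDFS client really ranks remote replicas:

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """One block of an HFile and the hosts that hold a replica of it."""
    replicas: set = field(default_factory=set)


def pick_replica(block, local_host, dead_hosts=frozenset()):
    """Return the host to read from: the local one if it still has a replica."""
    live = block.replicas - set(dead_hosts)
    if not live:
        raise IOError("no live replica for this block")
    if local_host in live:
        return local_host
    # Assumption: any deterministic choice among the remaining remote hosts.
    return sorted(live)[0]


def locality(blocks, local_host):
    """Fraction of blocks that have a replica on local_host."""
    if not blocks:
        return 0.0
    return sum(local_host in b.replicas for b in blocks) / len(blocks)


def compact(blocks, local_host, other_hosts):
    """Model a compaction: rewritten blocks get a replica on the writing host."""
    return [Block({local_host, *other_hosts}) for _ in blocks]


blocks = [Block({"rs1", "rs2", "rs3"}), Block({"rs1", "rs3", "rs4"})]
blocks[0].replicas.discard("rs1")        # a disk on rs1 fails, one replica is lost
print(pick_replica(blocks[0], "rs1"))    # read is served from a remote host
print(locality(blocks, "rs1"))           # locality on rs1 has dropped
blocks = compact(blocks, "rs1", ["rs2", "rs3"])
print(locality(blocks, "rs1"))           # rewritten files are local again
```

Reads of the affected block go over the network until the files are
rewritten; the other block is still read locally.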

In case of a full node failure, ideally there should not be any downtime;
some requests may just take longer as they retry through the period when
that node is down. Often, recovery can be very fast.
Take a look at HBASE-5843; it has a summary of the recent MTTR (mean time
to recover) improvement work.
There's also HBASE-7006 and some related JIRAs that may allow us to serve
the region faster after recovery.
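
To illustrate the "retry through the downtime" behaviour, here is a minimal
simulation, not the HBase client itself. RegionUnavailable, read_with_retries
and make_flaky_read are hypothetical names, and the plain doubling backoff is
an assumption for the sketch; in the real client the number of retries and
the base pause are governed by settings such as hbase.client.retries.number
and hbase.client.pause, with its own backoff schedule.

```python
import time


class RegionUnavailable(Exception):
    """Raised while a region is being reassigned after its server crashed."""


def read_with_retries(read, max_attempts=5, pause=0.1, sleep=time.sleep):
    """Call read() until it succeeds, doubling the pause after each failure."""
    for attempt in range(max_attempts):
        try:
            return read()
        except RegionUnavailable:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            sleep(pause * 2 ** attempt)


def make_flaky_read(failures, value):
    """Return a read() that fails `failures` times, then returns `value`."""
    state = {"left": failures}

    def read():
        if state["left"] > 0:
            state["left"] -= 1
            raise RegionUnavailable("region is being recovered")
        return value

    return read


# The region is unavailable for the first three attempts, then comes back.
print(read_with_retries(make_flaky_read(3, "row-1"), pause=0.01))
```

From the caller's point of view the read simply takes longer; it only fails
if the region is still unavailable once the attempts are exhausted.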

On Mon, Jun 10, 2013 at 6:42 PM, Lucas Stanley <lu...@gmail.com> wrote:

> Hi,
>
> I'm trying to understand how failures are handled in HBase.
>
> One Disk Failure:
> If one disk on a Region Server fails and some HFiles are lost on that
> machine, how will that Region Server handle incoming reads for the missing
> data? Will the HRegion read from a remote node's replicated HFile over the
> network? Will this cause the reads to be slow for this particular set of
> data?
>
>
> Full node failure:
> Also, if a Region Server completely crashes/panics, will some reads fail for
> a few minutes? If that crashed Region Server was hosting 5 regions, I guess
> it will take some time for other nodes to take over those regions and
> replay the WAL. So, can I expect a few minutes of downtime before I can
> read from the crashed regions again?
>