Posted to dev@hbase.apache.org by Claudiu Soroiu <cs...@gmail.com> on 2014/04/14 22:47:30 UTC

HBase region server failure issues

Hi all,

My name is Claudiu Soroiu and I am new to hbase/hadoop but not new to
distributed computing in FT/HA environments, and I see there are a lot of
issues reported related to region server failure.

The main problem I see is related to recovery time in case of a node
failure and distributed log splitting. After some tuning I managed to
reduce it to 8 seconds in total, and for the moment that fits the needs.

I have one question: *Why is there only one WAL file per region server and
not one WAL per region itself?*
I haven't found the exact answer anywhere, which is why I'm asking on this
list; please point me in the right direction if I missed it.

My point is that eliminating the need to split a log in case of failure
reduces the downtime for the regions, and the only delay we would see
would be related to transferring data over the network to the region servers
that take over the failed regions.
This is feasible only if having multiple WALs per region server does not
affect the overall write performance.

Thanks,
Claudiu

Re: HBase region server failure issues

Posted by Kevin O'dell <ke...@cloudera.com>.
Andrew,

  I agree, there is definitely a chance HDFS doesn't have an extra 3GB of
NN heap to squeak out for HBase.  It would be interesting to check in with
the Flurry guys and see what their NN pressure looks like.  As clusters
become more multi-tenant, HDFS pressure could become a real concern.  I have
not seen too many clusters that have a ton of files and are choking the NN
into large GC pauses.  Usually, the end user is doing something wrong and
we can use something similar to HAR to help clean up some of the FS.
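
For a rough sense of where an extra ~3 GB of NN heap would come from: using
the common rule of thumb of roughly 1 GB of NameNode heap per million
files/blocks (an assumption on my part, not a figure anyone states in this
thread), the 3.2 million WAL files discussed below work out to about

\[
3.2 \times 10^{6}\ \text{files} \times \frac{1\ \text{GB heap}}{10^{6}\ \text{files}} \approx 3.2\ \text{GB of extra NameNode heap.}
\]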


On Tue, Apr 15, 2014 at 12:29 PM, Andrew Purtell <ap...@apache.org>wrote:

> You'd probably know better than I Kevin but I'd worry about the
> 1000*1000*32 case, where HDFS is as (over)committed as the HBase tier.
>
>
> On Tue, Apr 15, 2014 at 9:26 AM, Kevin O'dell <kevin.odell@cloudera.com
> >wrote:
>
> > In general I have never seen nor heard of Federated Namespaces in the
> wild,
> > so I would be hesitant to go down that path.  But you know for "Science"
> I
> > would be interested in seeing how that worked out.  Would we be looking
> at
> > 32 WALs per region?  At a large cluster with 1000nodes, 100 regions per
> > node, and a WAL per region(I like easy math):
> >
> > 1000*100*32= 3.2 million files for WALs  This is not ideal, but it is not
> > horrible if we are using 128MB block sizes etc.
> >
> > I feel like I am missing something above though.  Thoughts?
> >
> >
> > On Tue, Apr 15, 2014 at 12:20 PM, Andrew Purtell <apurtell@apache.org
> > >wrote:
> >
> > > # of WALs as roughly spindles / replication factor seems intuitive.
> Would
> > > be interesting to benchmark.
> > >
> > > As for one WAL per region, the BigTable paper IIRC says they didn't
> > because
> > > of concerns about the number of seeks in the filesystems underlying GFS
> > and
> > > because it would reduce the effectiveness of group commit throughput
> > > optimization. If WALs are backed by SSD certainly the first
> consideration
> > > no longer holds. We also had a global HDFS file limit to contend with.
> I
> > > know HDFS is incrementally improving the scalabilty of a namespace, but
> > > this is still an active consideration. (Or we could try partitioning a
> > > deploy over a federated namespace? Could be "interesting". Has anyone
> > tried
> > > that? I haven't heard.)
> > >
> > >
> > >
> > > On Tue, Apr 15, 2014 at 7:11 AM, Jonathan Hsieh <jo...@cloudera.com>
> > wrote:
> > >
> > > > It makes sense to have as many wals as # of spindles / replication
> > factor
> > > > per machine.  This should be decoupled from the number of regions on
> a
> > > > region server.  So for a cluster with 12 spindles we should likely
> have
> > > at
> > > > least 4 wals (12 spindles / 3 replication factor), and need to do
> > > > experiments to see if going to 8 or some higher number makes sense
> (new
> > > wal
> > > > uses a disruptor pattern which avoids much contention on individual
> > > > writes).   So with your example, your 1000 regions would get sharded
> > into
> > > > the 4 wals which would maximize io throughput, disk utilization, and
> > > reduce
> > > > time for recovery in the face of failure.
> > > >
> > > > In the case of an SSD world, it makes more sense to have one wal per
> > node
> > > > once we have decent HSM support in HDFS.  The key win here will be in
> > > > recovery time -- if any RS goes down we only have to replay a regions
> > > edits
> > > > and not have to split or demux different region's edits.
> > > >
> > > > Jon.
> > > >
> > > >
> > > > On Mon, Apr 14, 2014 at 10:37 PM, Vladimir Rodionov
> > > > <vl...@gmail.com>wrote:
> > > >
> > > > > Todd, how about 300 regions with 3x replication?  Or 1000 regions?
> > This
> > > > is
> > > > > going to be 3000 files. on HDFS. per one RS. When I said that it
> does
> > > not
> > > > > scale, I meant that exactly that.
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > // Jonathan Hsieh (shay)
> > > > // HBase Tech Lead, Software Engineer, Cloudera
> > > > // jon@cloudera.com // @jmhsieh
> > > >
> > >
> > >
> > >
> > > --
> > > Best regards,
> > >
> > >    - Andy
> > >
> > > Problems worthy of attack prove their worth by hitting back. - Piet
> Hein
> > > (via Tom White)
> > >
> >
> >
> >
> > --
> > Kevin O'Dell
> > Systems Engineer, Cloudera
> >
>
>
>
> --
> Best regards,
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
>



-- 
Kevin O'Dell
Systems Engineer, Cloudera

Re: HBase region server failure issues

Posted by Andrew Purtell <ap...@apache.org>.
You'd probably know better than I, Kevin, but I'd worry about the
1000*1000*32 case, where HDFS is as (over)committed as the HBase tier.


On Tue, Apr 15, 2014 at 9:26 AM, Kevin O'dell <ke...@cloudera.com>wrote:

> In general I have never seen nor heard of Federated Namespaces in the wild,
> so I would be hesitant to go down that path.  But you know for "Science" I
> would be interested in seeing how that worked out.  Would we be looking at
> 32 WALs per region?  At a large cluster with 1000nodes, 100 regions per
> node, and a WAL per region(I like easy math):
>
> 1000*100*32= 3.2 million files for WALs  This is not ideal, but it is not
> horrible if we are using 128MB block sizes etc.
>
> I feel like I am missing something above though.  Thoughts?
>
>
> On Tue, Apr 15, 2014 at 12:20 PM, Andrew Purtell <apurtell@apache.org
> >wrote:
>
> > # of WALs as roughly spindles / replication factor seems intuitive. Would
> > be interesting to benchmark.
> >
> > As for one WAL per region, the BigTable paper IIRC says they didn't
> because
> > of concerns about the number of seeks in the filesystems underlying GFS
> and
> > because it would reduce the effectiveness of group commit throughput
> > optimization. If WALs are backed by SSD certainly the first consideration
> > no longer holds. We also had a global HDFS file limit to contend with. I
> > know HDFS is incrementally improving the scalabilty of a namespace, but
> > this is still an active consideration. (Or we could try partitioning a
> > deploy over a federated namespace? Could be "interesting". Has anyone
> tried
> > that? I haven't heard.)
> >
> >
> >
> > On Tue, Apr 15, 2014 at 7:11 AM, Jonathan Hsieh <jo...@cloudera.com>
> wrote:
> >
> > > It makes sense to have as many wals as # of spindles / replication
> factor
> > > per machine.  This should be decoupled from the number of regions on a
> > > region server.  So for a cluster with 12 spindles we should likely have
> > at
> > > least 4 wals (12 spindles / 3 replication factor), and need to do
> > > experiments to see if going to 8 or some higher number makes sense (new
> > wal
> > > uses a disruptor pattern which avoids much contention on individual
> > > writes).   So with your example, your 1000 regions would get sharded
> into
> > > the 4 wals which would maximize io throughput, disk utilization, and
> > reduce
> > > time for recovery in the face of failure.
> > >
> > > In the case of an SSD world, it makes more sense to have one wal per
> node
> > > once we have decent HSM support in HDFS.  The key win here will be in
> > > recovery time -- if any RS goes down we only have to replay a regions
> > edits
> > > and not have to split or demux different region's edits.
> > >
> > > Jon.
> > >
> > >
> > > On Mon, Apr 14, 2014 at 10:37 PM, Vladimir Rodionov
> > > <vl...@gmail.com>wrote:
> > >
> > > > Todd, how about 300 regions with 3x replication?  Or 1000 regions?
> This
> > > is
> > > > going to be 3000 files. on HDFS. per one RS. When I said that it does
> > not
> > > > scale, I meant that exactly that.
> > > >
> > >
> > >
> > >
> > > --
> > > // Jonathan Hsieh (shay)
> > > // HBase Tech Lead, Software Engineer, Cloudera
> > > // jon@cloudera.com // @jmhsieh
> > >
> >
> >
> >
> > --
> > Best regards,
> >
> >    - Andy
> >
> > Problems worthy of attack prove their worth by hitting back. - Piet Hein
> > (via Tom White)
> >
>
>
>
> --
> Kevin O'Dell
> Systems Engineer, Cloudera
>



-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

Re: HBase region server failure issues

Posted by Kevin O'dell <ke...@cloudera.com>.
In general I have never seen nor heard of Federated Namespaces in the wild,
so I would be hesitant to go down that path.  But, you know, for "Science" I
would be interested in seeing how that worked out.  Would we be looking at
32 WALs per region?  On a large cluster with 1000 nodes, 100 regions per
node, and a WAL per region (I like easy math):

1000*100*32 = 3.2 million files for WALs.  This is not ideal, but it is not
horrible if we are using 128MB block sizes etc.

I feel like I am missing something above though.  Thoughts?


On Tue, Apr 15, 2014 at 12:20 PM, Andrew Purtell <ap...@apache.org>wrote:

> # of WALs as roughly spindles / replication factor seems intuitive. Would
> be interesting to benchmark.
>
> As for one WAL per region, the BigTable paper IIRC says they didn't because
> of concerns about the number of seeks in the filesystems underlying GFS and
> because it would reduce the effectiveness of group commit throughput
> optimization. If WALs are backed by SSD certainly the first consideration
> no longer holds. We also had a global HDFS file limit to contend with. I
> know HDFS is incrementally improving the scalabilty of a namespace, but
> this is still an active consideration. (Or we could try partitioning a
> deploy over a federated namespace? Could be "interesting". Has anyone tried
> that? I haven't heard.)
>
>
>
> On Tue, Apr 15, 2014 at 7:11 AM, Jonathan Hsieh <jo...@cloudera.com> wrote:
>
> > It makes sense to have as many wals as # of spindles / replication factor
> > per machine.  This should be decoupled from the number of regions on a
> > region server.  So for a cluster with 12 spindles we should likely have
> at
> > least 4 wals (12 spindles / 3 replication factor), and need to do
> > experiments to see if going to 8 or some higher number makes sense (new
> wal
> > uses a disruptor pattern which avoids much contention on individual
> > writes).   So with your example, your 1000 regions would get sharded into
> > the 4 wals which would maximize io throughput, disk utilization, and
> reduce
> > time for recovery in the face of failure.
> >
> > In the case of an SSD world, it makes more sense to have one wal per node
> > once we have decent HSM support in HDFS.  The key win here will be in
> > recovery time -- if any RS goes down we only have to replay a regions
> edits
> > and not have to split or demux different region's edits.
> >
> > Jon.
> >
> >
> > On Mon, Apr 14, 2014 at 10:37 PM, Vladimir Rodionov
> > <vl...@gmail.com>wrote:
> >
> > > Todd, how about 300 regions with 3x replication?  Or 1000 regions? This
> > is
> > > going to be 3000 files. on HDFS. per one RS. When I said that it does
> not
> > > scale, I meant that exactly that.
> > >
> >
> >
> >
> > --
> > // Jonathan Hsieh (shay)
> > // HBase Tech Lead, Software Engineer, Cloudera
> > // jon@cloudera.com // @jmhsieh
> >
>
>
>
> --
> Best regards,
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
>



-- 
Kevin O'Dell
Systems Engineer, Cloudera

Re: HBase region server failure issues

Posted by Andrew Purtell <ap...@apache.org>.
# of WALs as roughly spindles / replication factor seems intuitive. It would
be interesting to benchmark.

As for one WAL per region, the BigTable paper IIRC says they didn't do it because
of concerns about the number of seeks in the filesystems underlying GFS, and
because it would reduce the effectiveness of the group commit throughput
optimization. If WALs are backed by SSD, the first consideration certainly
no longer holds. We also had a global HDFS file limit to contend with. I
know HDFS is incrementally improving the scalability of a namespace, but
this is still an active consideration. (Or we could try partitioning a
deploy over a federated namespace? Could be "interesting". Has anyone tried
that? I haven't heard.)



On Tue, Apr 15, 2014 at 7:11 AM, Jonathan Hsieh <jo...@cloudera.com> wrote:

> It makes sense to have as many wals as # of spindles / replication factor
> per machine.  This should be decoupled from the number of regions on a
> region server.  So for a cluster with 12 spindles we should likely have at
> least 4 wals (12 spindles / 3 replication factor), and need to do
> experiments to see if going to 8 or some higher number makes sense (new wal
> uses a disruptor pattern which avoids much contention on individual
> writes).   So with your example, your 1000 regions would get sharded into
> the 4 wals which would maximize io throughput, disk utilization, and reduce
> time for recovery in the face of failure.
>
> In the case of an SSD world, it makes more sense to have one wal per node
> once we have decent HSM support in HDFS.  The key win here will be in
> recovery time -- if any RS goes down we only have to replay a regions edits
> and not have to split or demux different region's edits.
>
> Jon.
>
>
> On Mon, Apr 14, 2014 at 10:37 PM, Vladimir Rodionov
> <vl...@gmail.com>wrote:
>
> > Todd, how about 300 regions with 3x replication?  Or 1000 regions? This
> is
> > going to be 3000 files. on HDFS. per one RS. When I said that it does not
> > scale, I meant that exactly that.
> >
>
>
>
> --
> // Jonathan Hsieh (shay)
> // HBase Tech Lead, Software Engineer, Cloudera
> // jon@cloudera.com // @jmhsieh
>



-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)

Re: HBase region server failure issues

Posted by Jonathan Hsieh <jo...@cloudera.com>.
Thanks for catching that -- it was a typo -- one wal per region.


On Tue, Apr 15, 2014 at 8:21 AM, Ted Yu <yu...@gmail.com> wrote:

> bq. In the case of an SSD world, it makes more sense to have one wal per
> node
>
> Was there a typo in the sentence above (one wal per node) ?
>
> Cheers
>
>
> On Tue, Apr 15, 2014 at 7:11 AM, Jonathan Hsieh <jo...@cloudera.com> wrote:
>
> > It makes sense to have as many wals as # of spindles / replication factor
> > per machine.  This should be decoupled from the number of regions on a
> > region server.  So for a cluster with 12 spindles we should likely have
> at
> > least 4 wals (12 spindles / 3 replication factor), and need to do
> > experiments to see if going to 8 or some higher number makes sense (new
> wal
> > uses a disruptor pattern which avoids much contention on individual
> > writes).   So with your example, your 1000 regions would get sharded into
> > the 4 wals which would maximize io throughput, disk utilization, and
> reduce
> > time for recovery in the face of failure.
> >
> > In the case of an SSD world, it makes more sense to have one wal per node
> > once we have decent HSM support in HDFS.  The key win here will be in
> > recovery time -- if any RS goes down we only have to replay a regions
> edits
> > and not have to split or demux different region's edits.
> >
> > Jon.
> >
> >
> > On Mon, Apr 14, 2014 at 10:37 PM, Vladimir Rodionov
> > <vl...@gmail.com>wrote:
> >
> > > Todd, how about 300 regions with 3x replication?  Or 1000 regions? This
> > is
> > > going to be 3000 files. on HDFS. per one RS. When I said that it does
> not
> > > scale, I meant that exactly that.
> > >
> >
> >
> >
> > --
> > // Jonathan Hsieh (shay)
> > // HBase Tech Lead, Software Engineer, Cloudera
> > // jon@cloudera.com // @jmhsieh
> >
>



-- 
// Jonathan Hsieh (shay)
// HBase Tech Lead, Software Engineer, Cloudera
// jon@cloudera.com // @jmhsieh

Re: HBase region server failure issues

Posted by Ted Yu <yu...@gmail.com>.
bq. In the case of an SSD world, it makes more sense to have one wal per
node

Was there a typo in the sentence above (one wal per node) ?

Cheers


On Tue, Apr 15, 2014 at 7:11 AM, Jonathan Hsieh <jo...@cloudera.com> wrote:

> It makes sense to have as many wals as # of spindles / replication factor
> per machine.  This should be decoupled from the number of regions on a
> region server.  So for a cluster with 12 spindles we should likely have at
> least 4 wals (12 spindles / 3 replication factor), and need to do
> experiments to see if going to 8 or some higher number makes sense (new wal
> uses a disruptor pattern which avoids much contention on individual
> writes).   So with your example, your 1000 regions would get sharded into
> the 4 wals which would maximize io throughput, disk utilization, and reduce
> time for recovery in the face of failure.
>
> In the case of an SSD world, it makes more sense to have one wal per node
> once we have decent HSM support in HDFS.  The key win here will be in
> recovery time -- if any RS goes down we only have to replay a regions edits
> and not have to split or demux different region's edits.
>
> Jon.
>
>
> On Mon, Apr 14, 2014 at 10:37 PM, Vladimir Rodionov
> <vl...@gmail.com>wrote:
>
> > Todd, how about 300 regions with 3x replication?  Or 1000 regions? This
> is
> > going to be 3000 files. on HDFS. per one RS. When I said that it does not
> > scale, I meant that exactly that.
> >
>
>
>
> --
> // Jonathan Hsieh (shay)
> // HBase Tech Lead, Software Engineer, Cloudera
> // jon@cloudera.com // @jmhsieh
>

Re: HBase region server failure issues

Posted by Jonathan Hsieh <jo...@cloudera.com>.
It makes sense to have as many wals as # of spindles / replication factor
per machine.  This should be decoupled from the number of regions on a
region server.  So for a cluster with 12 spindles we should likely have at
least 4 wals (12 spindles / 3 replication factor), and we need to do
experiments to see if going to 8 or some higher number makes sense (the new wal
uses a disruptor pattern which avoids much contention on individual
writes).   So with your example, your 1000 regions would get sharded into
the 4 wals, which would maximize io throughput and disk utilization, and reduce
the time for recovery in the face of failure.

In the case of an SSD world, it makes more sense to have one wal per node
once we have decent HSM support in HDFS.  The key win here will be in
recovery time -- if any RS goes down we only have to replay a region's edits
and not have to split or demux different regions' edits.

Jon.
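
To make the sharding concrete, here is a minimal illustrative sketch -- not
HBase's actual WAL code, and the class and method names are hypothetical:
size the number of WAL groups from spindles / replication factor, then hash
each region into one of those groups.

import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative sketch only -- not HBase internals.  It sizes the number of
 * WAL groups from spindles / replication factor and hashes each region into
 * a group.
 */
public class WalGroupSketch {

  /** e.g. 12 spindles / 3x replication = 4 WAL groups. */
  static int walGroupCount(int spindles, int replicationFactor) {
    return Math.max(1, spindles / replicationFactor);
  }

  /** Hash a region's encoded name into one of the WAL groups. */
  static int walGroupFor(String encodedRegionName, int groups) {
    return (encodedRegionName.hashCode() & Integer.MAX_VALUE) % groups;
  }

  public static void main(String[] args) {
    int groups = walGroupCount(12, 3); // 4
    List<String> regions = new ArrayList<String>();
    for (int i = 0; i < 1000; i++) {
      regions.add("region-" + i);
    }
    int[] perGroup = new int[groups];
    for (String region : regions) {
      perGroup[walGroupFor(region, groups)]++;
    }
    // 1000 regions end up spread over 4 WALs instead of 1000 per-region WALs.
    for (int g = 0; g < groups; g++) {
      System.out.println("WAL group " + g + ": " + perGroup[g] + " regions");
    }
  }
}

The multiwal provider that later shipped in HBase takes a similar bounded
region-grouping approach, as far as I understand it.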


On Mon, Apr 14, 2014 at 10:37 PM, Vladimir Rodionov
<vl...@gmail.com>wrote:

> Todd, how about 300 regions with 3x replication?  Or 1000 regions? This is
> going to be 3000 files. on HDFS. per one RS. When I said that it does not
> scale, I meant that exactly that.
>



-- 
// Jonathan Hsieh (shay)
// HBase Tech Lead, Software Engineer, Cloudera
// jon@cloudera.com // @jmhsieh

Re: HBase region server failure issues

Posted by Vladimir Rodionov <vl...@gmail.com>.
Todd, how about 300 regions with 3x replication?  Or 1000 regions? This is
going to be 3000 files, on HDFS, per one RS. When I said that it does not
scale, I meant exactly that.

Re: HBase region server failure issues

Posted by Todd Lipcon <to...@cloudera.com>.
On Mon, Apr 14, 2014 at 6:32 PM, Vladimir Rodionov
<vl...@gmail.com>wrote:

> *On the other hand, 95% of HBase users don't actually configure HDFS to
> fsync() every edit. Given that, the random writes aren't actually going to
> cause one seek per write -- they'll get buffered up and written back
> periodically in a much more efficient fashion.*
>
> Todd, this is in theory. Reality is different. 1 writer is definitely more
> efficient than 100. This won't scale well.
>

I'd actually disagree. 100 is probably significantly faster than 1, given
that most machines have 12 spindles. So, yes, you'd be multiplexing 8 or so
logs per spindle, but even 100 logs only requires a few hundred MB worth of
buffer cache in order to get good coalescing of writes into large physical
IOs.

If memory is really constrained on your machine, you'll probably get some
throughput collapse as you enter some really inefficient dirty throttling,
but so long as you leave a few GB unallocated, I bet the reality is much
closer to what I said than you might think.

-Todd


>
>
> On Mon, Apr 14, 2014 at 6:20 PM, Todd Lipcon <to...@cloudera.com> wrote:
>
> > On the other hand, 95% of HBase users don't actually configure HDFS to
> > fsync() every edit. Given that, the random writes aren't actually going
> to
> > cause one seek per write -- they'll get buffered up and written back
> > periodically in a much more efficient fashion.
> >
> > Plus, in some small number of years, I believe SSDs will be available on
> > most server machines (in a hybrid configuration) so the seeks will cost
> > less even with fsync on.
> >
> > -Todd
> >
> >
> > On Mon, Apr 14, 2014 at 4:54 PM, Vladimir Rodionov
> > <vl...@gmail.com>wrote:
> >
> > > I do not think its a good idea to have one WAL file per region. All WAL
> > > file idea is based on assumption that  writing data sequentially
> reduces
> > > average latency and increases total throughput. This is no longer a
> case
> > in
> > > a one WAL file per region approach, you may have hundreds active
> regions
> > > per RS and all sequential writes become random ones and random IO for
> > > rotational media is very bad, very bad.
> > >
> > > -Vladimir Rodionov
> > >
> > >
> > >
> > > On Mon, Apr 14, 2014 at 2:41 PM, Ted Yu <yu...@gmail.com> wrote:
> > >
> > > > There is on-going effort to address this issue.
> > > >
> > > > See the following:
> > > > HBASE-8610 Introduce interfaces to support MultiWAL
> > > > HBASE-10378 Divide HLog interface into User and Implementor specific
> > > > interfaces
> > > >
> > > > Cheers
> > > >
> > > >
> > > > On Mon, Apr 14, 2014 at 1:47 PM, Claudiu Soroiu <cs...@gmail.com>
> > > wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > My name is Claudiu Soroiu and I am new to hbase/hadoop but not new
> to
> > > > > distributed computing in FT/HA environments and I see there are a
> lot
> > > of
> > > > > issues reported related to the region server failure.
> > > > >
> > > > > The main problem I see it is related to recovery time in case of a
> > node
> > > > > failure and distributed log splitting. After some tunning I managed
> > to
> > > > > reduce it to 8 seconds in total and for the moment it fits the
> needs.
> > > > >
> > > > > I have one question: *Why there is only one WAL file per region
> > server
> > > > and
> > > > > not one WAL per region itself? *
> > > > > I haven't found the exact answer anywhere, that's why i'm asking on
> > > this
> > > > > list and please point me to the right direction if i missed the
> list.
> > > > >
> > > > > My point is that eliminating the need of splitting a log in case of
> > > > failure
> > > > > reduces the downtime for the regions and the only delay that we
> will
> > > see
> > > > > will be related to transferring data over network to the region
> > servers
> > > > > that will take over the failed regions.
> > > > > This is feasible only if having multiple WAL's per Region Server
> does
> > > not
> > > > > affect the overall write performance.
> > > > >
> > > > > Thanks,
> > > > > Claudiu
> > > > >
> > > >
> > >
> >
> >
> >
> > --
> > Todd Lipcon
> > Software Engineer, Cloudera
> >
>



-- 
Todd Lipcon
Software Engineer, Cloudera

Re: HBase region server failure issues

Posted by Vladimir Rodionov <vl...@gmail.com>.
*On the other hand, 95% of HBase users don't actually configure HDFS to
fsync() every edit. Given that, the random writes aren't actually going to
cause one seek per write -- they'll get buffered up and written back
periodically in a much more efficient fashion.*

Todd, this is in theory. Reality is different. 1 writer is definitely more
efficient than 100. This won't scale well.


On Mon, Apr 14, 2014 at 6:20 PM, Todd Lipcon <to...@cloudera.com> wrote:

> On the other hand, 95% of HBase users don't actually configure HDFS to
> fsync() every edit. Given that, the random writes aren't actually going to
> cause one seek per write -- they'll get buffered up and written back
> periodically in a much more efficient fashion.
>
> Plus, in some small number of years, I believe SSDs will be available on
> most server machines (in a hybrid configuration) so the seeks will cost
> less even with fsync on.
>
> -Todd
>
>
> On Mon, Apr 14, 2014 at 4:54 PM, Vladimir Rodionov
> <vl...@gmail.com>wrote:
>
> > I do not think its a good idea to have one WAL file per region. All WAL
> > file idea is based on assumption that  writing data sequentially reduces
> > average latency and increases total throughput. This is no longer a case
> in
> > a one WAL file per region approach, you may have hundreds active regions
> > per RS and all sequential writes become random ones and random IO for
> > rotational media is very bad, very bad.
> >
> > -Vladimir Rodionov
> >
> >
> >
> > On Mon, Apr 14, 2014 at 2:41 PM, Ted Yu <yu...@gmail.com> wrote:
> >
> > > There is on-going effort to address this issue.
> > >
> > > See the following:
> > > HBASE-8610 Introduce interfaces to support MultiWAL
> > > HBASE-10378 Divide HLog interface into User and Implementor specific
> > > interfaces
> > >
> > > Cheers
> > >
> > >
> > > On Mon, Apr 14, 2014 at 1:47 PM, Claudiu Soroiu <cs...@gmail.com>
> > wrote:
> > >
> > > > Hi all,
> > > >
> > > > My name is Claudiu Soroiu and I am new to hbase/hadoop but not new to
> > > > distributed computing in FT/HA environments and I see there are a lot
> > of
> > > > issues reported related to the region server failure.
> > > >
> > > > The main problem I see it is related to recovery time in case of a
> node
> > > > failure and distributed log splitting. After some tunning I managed
> to
> > > > reduce it to 8 seconds in total and for the moment it fits the needs.
> > > >
> > > > I have one question: *Why there is only one WAL file per region
> server
> > > and
> > > > not one WAL per region itself? *
> > > > I haven't found the exact answer anywhere, that's why i'm asking on
> > this
> > > > list and please point me to the right direction if i missed the list.
> > > >
> > > > My point is that eliminating the need of splitting a log in case of
> > > failure
> > > > reduces the downtime for the regions and the only delay that we will
> > see
> > > > will be related to transferring data over network to the region
> servers
> > > > that will take over the failed regions.
> > > > This is feasible only if having multiple WAL's per Region Server does
> > not
> > > > affect the overall write performance.
> > > >
> > > > Thanks,
> > > > Claudiu
> > > >
> > >
> >
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>

Re: HBase region server failure issues

Posted by Todd Lipcon <to...@cloudera.com>.
On the other hand, 95% of HBase users don't actually configure HDFS to
fsync() every edit. Given that, the random writes aren't actually going to
cause one seek per write -- they'll get buffered up and written back
periodically in a much more efficient fashion.

Plus, in some small number of years, I believe SSDs will be available on
most server machines (in a hybrid configuration) so the seeks will cost
less even with fsync on.

-Todd


On Mon, Apr 14, 2014 at 4:54 PM, Vladimir Rodionov
<vl...@gmail.com>wrote:

> I do not think its a good idea to have one WAL file per region. All WAL
> file idea is based on assumption that  writing data sequentially reduces
> average latency and increases total throughput. This is no longer a case in
> a one WAL file per region approach, you may have hundreds active regions
> per RS and all sequential writes become random ones and random IO for
> rotational media is very bad, very bad.
>
> -Vladimir Rodionov
>
>
>
> On Mon, Apr 14, 2014 at 2:41 PM, Ted Yu <yu...@gmail.com> wrote:
>
> > There is on-going effort to address this issue.
> >
> > See the following:
> > HBASE-8610 Introduce interfaces to support MultiWAL
> > HBASE-10378 Divide HLog interface into User and Implementor specific
> > interfaces
> >
> > Cheers
> >
> >
> > On Mon, Apr 14, 2014 at 1:47 PM, Claudiu Soroiu <cs...@gmail.com>
> wrote:
> >
> > > Hi all,
> > >
> > > My name is Claudiu Soroiu and I am new to hbase/hadoop but not new to
> > > distributed computing in FT/HA environments and I see there are a lot
> of
> > > issues reported related to the region server failure.
> > >
> > > The main problem I see it is related to recovery time in case of a node
> > > failure and distributed log splitting. After some tunning I managed to
> > > reduce it to 8 seconds in total and for the moment it fits the needs.
> > >
> > > I have one question: *Why there is only one WAL file per region server
> > and
> > > not one WAL per region itself? *
> > > I haven't found the exact answer anywhere, that's why i'm asking on
> this
> > > list and please point me to the right direction if i missed the list.
> > >
> > > My point is that eliminating the need of splitting a log in case of
> > failure
> > > reduces the downtime for the regions and the only delay that we will
> see
> > > will be related to transferring data over network to the region servers
> > > that will take over the failed regions.
> > > This is feasible only if having multiple WAL's per Region Server does
> not
> > > affect the overall write performance.
> > >
> > > Thanks,
> > > Claudiu
> > >
> >
>



-- 
Todd Lipcon
Software Engineer, Cloudera

Re: HBase region server failure issues

Posted by Vladimir Rodionov <vl...@gmail.com>.
I do not think it's a good idea to have one WAL file per region. The whole WAL
file idea is based on the assumption that writing data sequentially reduces
average latency and increases total throughput. This is no longer the case in
a one-WAL-file-per-region approach: you may have hundreds of active regions
per RS, all sequential writes become random ones, and random IO on
rotational media is very bad, very bad.

-Vladimir Rodionov



On Mon, Apr 14, 2014 at 2:41 PM, Ted Yu <yu...@gmail.com> wrote:

> There is on-going effort to address this issue.
>
> See the following:
> HBASE-8610 Introduce interfaces to support MultiWAL
> HBASE-10378 Divide HLog interface into User and Implementor specific
> interfaces
>
> Cheers
>
>
> On Mon, Apr 14, 2014 at 1:47 PM, Claudiu Soroiu <cs...@gmail.com> wrote:
>
> > Hi all,
> >
> > My name is Claudiu Soroiu and I am new to hbase/hadoop but not new to
> > distributed computing in FT/HA environments and I see there are a lot of
> > issues reported related to the region server failure.
> >
> > The main problem I see it is related to recovery time in case of a node
> > failure and distributed log splitting. After some tunning I managed to
> > reduce it to 8 seconds in total and for the moment it fits the needs.
> >
> > I have one question: *Why there is only one WAL file per region server
> and
> > not one WAL per region itself? *
> > I haven't found the exact answer anywhere, that's why i'm asking on this
> > list and please point me to the right direction if i missed the list.
> >
> > My point is that eliminating the need of splitting a log in case of
> failure
> > reduces the downtime for the regions and the only delay that we will see
> > will be related to transferring data over network to the region servers
> > that will take over the failed regions.
> > This is feasible only if having multiple WAL's per Region Server does not
> > affect the overall write performance.
> >
> > Thanks,
> > Claudiu
> >
>

Re: HBase region server failure issues

Posted by Ted Yu <yu...@gmail.com>.
There is on-going effort to address this issue.

See the following:
HBASE-8610 Introduce interfaces to support MultiWAL
HBASE-10378 Divide HLog interface into User and Implementor specific
interfaces
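
For readers finding this thread later: my understanding is that the MultiWAL
work above eventually surfaced as a pluggable WAL provider configured roughly
like the sketch below. The property names (hbase.wal.provider,
hbase.wal.regiongrouping.numgroups) and the values are assumptions to verify
against your HBase release, not something stated in these JIRAs.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class MultiWalConfigSketch {

  /** Sketch of enabling multiple WALs per region server; keys are assumed. */
  public static Configuration multiWalConf() {
    Configuration conf = HBaseConfiguration.create();
    // Swap the default single-WAL provider for the region-grouping provider.
    conf.set("hbase.wal.provider", "multiwal");
    // Number of WAL groups per region server; regions are grouped into these
    // WALs rather than each region getting its own WAL.
    conf.setInt("hbase.wal.regiongrouping.numgroups", 4);
    return conf;
  }
}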

Cheers


On Mon, Apr 14, 2014 at 1:47 PM, Claudiu Soroiu <cs...@gmail.com> wrote:

> Hi all,
>
> My name is Claudiu Soroiu and I am new to hbase/hadoop but not new to
> distributed computing in FT/HA environments and I see there are a lot of
> issues reported related to the region server failure.
>
> The main problem I see it is related to recovery time in case of a node
> failure and distributed log splitting. After some tunning I managed to
> reduce it to 8 seconds in total and for the moment it fits the needs.
>
> I have one question: *Why there is only one WAL file per region server and
> not one WAL per region itself? *
> I haven't found the exact answer anywhere, that's why i'm asking on this
> list and please point me to the right direction if i missed the list.
>
> My point is that eliminating the need of splitting a log in case of failure
> reduces the downtime for the regions and the only delay that we will see
> will be related to transferring data over network to the region servers
> that will take over the failed regions.
> This is feasible only if having multiple WAL's per Region Server does not
> affect the overall write performance.
>
> Thanks,
> Claudiu
>

Re: HBase region server failure issues

Posted by Claudiu Soroiu <cs...@gmail.com>.
Thanks for the hints.
I will take a look and explore the idea.

Claudiu
On Tue, Apr 15, 2014 at 1:43 PM, Claudiu Soroiu <cs...@gmail.com> wrote:

> First of all, thanks for the clarifications.
>
> **how about 300 regions with 3x replication?  Or 1000 regions? This
> is going to be 3000 files. on HDFS. per one RS.**
>
> Now i see that the trade-off is how to reduce the recovery time without
> affecting the overall performance of the cluster.
> Having too many WAL's affects the write performance.
> Basically multiple WAL's might improve the process but the number of WAL's
> should be relatively small.
>
> Would it be feasible to know ahead of time where a region might activate
in
> case of a failure and have for each region server a second WAL file
> containing backup edits?
> E.g. If machine B crashes then a region will be assigned to node A,  one
to
> node C, etc.
> Also another view would be: Server A will backup a region from Server B if
> crashes, a region from server C, etc. Basically this second WAL will
> contain the data that is needed to fast recover a crashed node.
> This adds additional redundancy and some degree of complexity to the
> solution but ensures data locality in case of a crash and faster recovery.
>
>
This sounds like what I called Shadow Memstores.  This depends on hdfs file
affinity groups (favored nodes could help but aren't guaranteed), and could
be used for super fast edit recovery.  See this thread and jira.  Here's a
link to a doc I posted on the HBASE-10070 jira.  This requires some
simplifications on the master side, and should be compatible with the
current approach in HBASE-10070.

https://docs.google.com/document/d/1q5kJTOA3sZ760sHkORGZNeWgNuMzP41PnAXtaCgPgEU/edit#heading=h.pyxl4wbui0l



> **What did you do Claudiu to get the time down?**
>
>  Decreased the hdfs block size to 64 megs for now.
>  Enabled settings to avoid hdfs stale nodes.
>  Cluster I tested this was relatively small - 10 computers.
>  I did tuning for zookeeper sessions to keep the heartbeat at 5 seconds
for
> the moment, and plan to decrease this value.
>  At this point dfs.heartbeat.interval is set at the default 3 seconds, but
> this I also plan to decrease and perform a more intensive test.
>  (Decreasing the times is based on the experience with our current system
> configured at 1.2 seconds and didn't had any issues even under heavy
loads,
> obviously stop the world GC times should be smaller that the heartbeat
> interval)
>  And I remember i did some changes for the reconnect intervals of the
> client to allow him to reconnect to the region as fast as possible.
>  I am in an early stage of experimenting with hbase but there are lot of
> things to test/check...
>
>
>
>
> On Tue, Apr 15, 2014 at 11:03 PM, Vladimir Rodionov
> <vl...@gmail.com>wrote:
>
> > *We also had a global HDFS file limit to contend with*
> >
> > Yes, we have been seeing this from time to time in our production
> clusters.
> > Periodic purging of old files helps, but the issue is obvious.
> >
> > -Vladimir Rodionov
> >
> >
> > On Tue, Apr 15, 2014 at 11:58 AM, Stack <st...@duboce.net> wrote:
> >
> > > On Mon, Apr 14, 2014 at 1:47 PM, Claudiu Soroiu <cs...@gmail.com>
> > wrote:
> > >
> > > > ....
> > >
> > > After some tunning I managed to
> > > > reduce it to 8 seconds in total and for the moment it fits the
needs.
> > > >
> > >
> > > What did you do Claudiu to get the time down?
> > > Thanks,
> > > St.Ack
> > >
> >
>



--
// Jonathan Hsieh (shay)
// HBase Tech Lead, Software Engineer, Cloudera
// jon@cloudera.com // @jmhsieh

Re: HBase region server failure issues

Posted by Jonathan Hsieh <jo...@cloudera.com>.
On Tue, Apr 15, 2014 at 1:43 PM, Claudiu Soroiu <cs...@gmail.com> wrote:

> First of all, thanks for the clarifications.
>
> **how about 300 regions with 3x replication?  Or 1000 regions? This
> is going to be 3000 files. on HDFS. per one RS.**
>
> Now i see that the trade-off is how to reduce the recovery time without
> affecting the overall performance of the cluster.
> Having too many WAL's affects the write performance.
> Basically multiple WAL's might improve the process but the number of WAL's
> should be relatively small.
>
> Would it be feasible to know ahead of time where a region might activate in
> case of a failure and have for each region server a second WAL file
> containing backup edits?
> E.g. If machine B crashes then a region will be assigned to node A,  one to
> node C, etc.
> Also another view would be: Server A will backup a region from Server B if
> crashes, a region from server C, etc. Basically this second WAL will
> contain the data that is needed to fast recover a crashed node.
> This adds additional redundancy and some degree of complexity to the
> solution but ensures data locality in case of a crash and faster recovery.
>
>
This sounds like what I called Shadow Memstores.  This depends on hdfs file
affinity groups (favored nodes could help but aren't guaranteed), and could
be used for super fast edit recovery.  See this thread and jira.  Here's a
link to a doc I posted on the HBASE-10070 jira.  This requires some
simplifications on the master side, and should be compatible with the
current approach in HBASE-10070.

https://docs.google.com/document/d/1q5kJTOA3sZ760sHkORGZNeWgNuMzP41PnAXtaCgPgEU/edit#heading=h.pyxl4wbui0l



> **What did you do Claudiu to get the time down?**
>
>  Decreased the hdfs block size to 64 megs for now.
>  Enabled settings to avoid hdfs stale nodes.
>  Cluster I tested this was relatively small - 10 computers.
>  I did tuning for zookeeper sessions to keep the heartbeat at 5 seconds for
> the moment, and plan to decrease this value.
>  At this point dfs.heartbeat.interval is set at the default 3 seconds, but
> this I also plan to decrease and perform a more intensive test.
>  (Decreasing the times is based on the experience with our current system
> configured at 1.2 seconds and didn't had any issues even under heavy loads,
> obviously stop the world GC times should be smaller that the heartbeat
> interval)
>  And I remember i did some changes for the reconnect intervals of the
> client to allow him to reconnect to the region as fast as possible.
>  I am in an early stage of experimenting with hbase but there are lot of
> things to test/check...
>
>
>
>
> On Tue, Apr 15, 2014 at 11:03 PM, Vladimir Rodionov
> <vl...@gmail.com>wrote:
>
> > *We also had a global HDFS file limit to contend with*
> >
> > Yes, we have been seeing this from time to time in our production
> clusters.
> > Periodic purging of old files helps, but the issue is obvious.
> >
> > -Vladimir Rodionov
> >
> >
> > On Tue, Apr 15, 2014 at 11:58 AM, Stack <st...@duboce.net> wrote:
> >
> > > On Mon, Apr 14, 2014 at 1:47 PM, Claudiu Soroiu <cs...@gmail.com>
> > wrote:
> > >
> > > > ....
> > >
> > > After some tunning I managed to
> > > > reduce it to 8 seconds in total and for the moment it fits the needs.
> > > >
> > >
> > > What did you do Claudiu to get the time down?
> > > Thanks,
> > > St.Ack
> > >
> >
>



-- 
// Jonathan Hsieh (shay)
// HBase Tech Lead, Software Engineer, Cloudera
// jon@cloudera.com // @jmhsieh

Re: HBase region server failure issues

Posted by Nicolas Liochon <nk...@gmail.com>.
What you described seems to be the favored nodes feature, but there are
still some open (and stale...) jiras there: HBASE-9116 and co.
You may also want to look at the hbase.master.distributed.log.replay
option, as it allows writes during recovery.
And for the client there is hbase.status.published that can help...
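
A minimal sketch of turning on the two options mentioned above, assuming they
are plain boolean properties (the key names come from the mail; the values and
the idea of setting them programmatically are mine -- normally these would
live in hbase-site.xml on the cluster side):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class RecoveryOptionsSketch {

  /** Sketch only; in practice these belong in hbase-site.xml. */
  public static Configuration recoveryTunedConf() {
    Configuration conf = HBaseConfiguration.create();
    // Distributed log replay: allow writes to recovering regions while
    // WAL edits are replayed, instead of waiting for a full log split.
    conf.setBoolean("hbase.master.distributed.log.replay", true);
    // Status publishing: the master broadcasts dead-server notifications so
    // clients stop retrying a dead region server sooner.
    conf.setBoolean("hbase.status.published", true);
    return conf;
  }
}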





On Wed, Apr 16, 2014 at 6:35 AM, Claudiu Soroiu <cs...@gmail.com> wrote:

> Yes, overall the second WAL would contain the same data but differently
> distributed.
> A server will have in the second WAL data from the regions that it will
> take over if they fail.
> It is just an idea as it might not be good to duplicate the data across the
> cluster.
>
>
>
>
> On Wed, Apr 16, 2014 at 12:36 AM, Ted Yu <yu...@gmail.com> wrote:
>
> > Would the second WAL contain the same contents as the first ?
> >
> > We already have the code that adds interceptor on the calls to the
> > namenode#getBlockLocations so that blocks on the same DN as the dead RS
> are
> > placed at the end of the priority queue..
> > See addLocationsOrderInterceptor()
> > in hbase-server/src/main/java/org/apache/hadoop/hbase/fs/HFileSystem.java
> >
> > This is for faster recovery in case regionserver is deployed on the same
> > box as the datanode.
> >
> >
> > On Tue, Apr 15, 2014 at 1:43 PM, Claudiu Soroiu <cs...@gmail.com>
> wrote:
> >
> > > First of all, thanks for the clarifications.
> > >
> > > **how about 300 regions with 3x replication?  Or 1000 regions? This
> > > is going to be 3000 files. on HDFS. per one RS.**
> > >
> > > Now i see that the trade-off is how to reduce the recovery time without
> > > affecting the overall performance of the cluster.
> > > Having too many WAL's affects the write performance.
> > > Basically multiple WAL's might improve the process but the number of
> > WAL's
> > > should be relatively small.
> > >
> > > Would it be feasible to know ahead of time where a region might
> activate
> > in
> > > case of a failure and have for each region server a second WAL file
> > > containing backup edits?
> > > E.g. If machine B crashes then a region will be assigned to node A,
>  one
> > to
> > > node C, etc.
> > > Also another view would be: Server A will backup a region from Server B
> > if
> > > crashes, a region from server C, etc. Basically this second WAL will
> > > contain the data that is needed to fast recover a crashed node.
> > > This adds additional redundancy and some degree of complexity to the
> > > solution but ensures data locality in case of a crash and faster
> > recovery.
> > >
> > > **What did you do Claudiu to get the time down?**
> > >
> > >  Decreased the hdfs block size to 64 megs for now.
> > >  Enabled settings to avoid hdfs stale nodes.
> > >  Cluster I tested this was relatively small - 10 computers.
> > >  I did tuning for zookeeper sessions to keep the heartbeat at 5 seconds
> > for
> > > the moment, and plan to decrease this value.
> > >  At this point dfs.heartbeat.interval is set at the default 3 seconds,
> > but
> > > this I also plan to decrease and perform a more intensive test.
> > >  (Decreasing the times is based on the experience with our current
> system
> > > configured at 1.2 seconds and didn't had any issues even under heavy
> > loads,
> > > obviously stop the world GC times should be smaller that the heartbeat
> > > interval)
> > >  And I remember i did some changes for the reconnect intervals of the
> > > client to allow him to reconnect to the region as fast as possible.
> > >  I am in an early stage of experimenting with hbase but there are lot
> of
> > > things to test/check...
> > >
> > >
> > >
> > >
> > > On Tue, Apr 15, 2014 at 11:03 PM, Vladimir Rodionov
> > > <vl...@gmail.com>wrote:
> > >
> > > > *We also had a global HDFS file limit to contend with*
> > > >
> > > > Yes, we have been seeing this from time to time in our production
> > > clusters.
> > > > Periodic purging of old files helps, but the issue is obvious.
> > > >
> > > > -Vladimir Rodionov
> > > >
> > > >
> > > > On Tue, Apr 15, 2014 at 11:58 AM, Stack <st...@duboce.net> wrote:
> > > >
> > > > > On Mon, Apr 14, 2014 at 1:47 PM, Claudiu Soroiu <csoroiu@gmail.com
> >
> > > > wrote:
> > > > >
> > > > > > ....
> > > > >
> > > > > After some tunning I managed to
> > > > > > reduce it to 8 seconds in total and for the moment it fits the
> > needs.
> > > > > >
> > > > >
> > > > > What did you do Claudiu to get the time down?
> > > > > Thanks,
> > > > > St.Ack
> > > > >
> > > >
> > >
> >
>

Re: HBase region server failure issues

Posted by Claudiu Soroiu <cs...@gmail.com>.
Yes, overall the second WAL would contain the same data, just distributed
differently.
A server would have in its second WAL the data from the regions that it will
take over if they fail.
It is just an idea, as it might not be good to duplicate the data across the
cluster.




On Wed, Apr 16, 2014 at 12:36 AM, Ted Yu <yu...@gmail.com> wrote:

> Would the second WAL contain the same contents as the first ?
>
> We already have the code that adds interceptor on the calls to the
> namenode#getBlockLocations so that blocks on the same DN as the dead RS are
> placed at the end of the priority queue..
> See addLocationsOrderInterceptor()
> in hbase-server/src/main/java/org/apache/hadoop/hbase/fs/HFileSystem.java
>
> This is for faster recovery in case regionserver is deployed on the same
> box as the datanode.
>
>
> On Tue, Apr 15, 2014 at 1:43 PM, Claudiu Soroiu <cs...@gmail.com> wrote:
>
> > First of all, thanks for the clarifications.
> >
> > **how about 300 regions with 3x replication?  Or 1000 regions? This
> > is going to be 3000 files. on HDFS. per one RS.**
> >
> > Now i see that the trade-off is how to reduce the recovery time without
> > affecting the overall performance of the cluster.
> > Having too many WAL's affects the write performance.
> > Basically multiple WAL's might improve the process but the number of
> WAL's
> > should be relatively small.
> >
> > Would it be feasible to know ahead of time where a region might activate
> in
> > case of a failure and have for each region server a second WAL file
> > containing backup edits?
> > E.g. If machine B crashes then a region will be assigned to node A,  one
> to
> > node C, etc.
> > Also another view would be: Server A will backup a region from Server B
> if
> > crashes, a region from server C, etc. Basically this second WAL will
> > contain the data that is needed to fast recover a crashed node.
> > This adds additional redundancy and some degree of complexity to the
> > solution but ensures data locality in case of a crash and faster
> recovery.
> >
> > **What did you do Claudiu to get the time down?**
> >
> >  Decreased the hdfs block size to 64 megs for now.
> >  Enabled settings to avoid hdfs stale nodes.
> >  Cluster I tested this was relatively small - 10 computers.
> >  I did tuning for zookeeper sessions to keep the heartbeat at 5 seconds
> for
> > the moment, and plan to decrease this value.
> >  At this point dfs.heartbeat.interval is set at the default 3 seconds,
> but
> > this I also plan to decrease and perform a more intensive test.
> >  (Decreasing the times is based on the experience with our current system
> > configured at 1.2 seconds and didn't had any issues even under heavy
> loads,
> > obviously stop the world GC times should be smaller that the heartbeat
> > interval)
> >  And I remember i did some changes for the reconnect intervals of the
> > client to allow him to reconnect to the region as fast as possible.
> >  I am in an early stage of experimenting with hbase but there are lot of
> > things to test/check...
> >
> >
> >
> >
> > On Tue, Apr 15, 2014 at 11:03 PM, Vladimir Rodionov
> > <vl...@gmail.com>wrote:
> >
> > > *We also had a global HDFS file limit to contend with*
> > >
> > > Yes, we have been seeing this from time to time in our production
> > clusters.
> > > Periodic purging of old files helps, but the issue is obvious.
> > >
> > > -Vladimir Rodionov
> > >
> > >
> > > On Tue, Apr 15, 2014 at 11:58 AM, Stack <st...@duboce.net> wrote:
> > >
> > > > On Mon, Apr 14, 2014 at 1:47 PM, Claudiu Soroiu <cs...@gmail.com>
> > > wrote:
> > > >
> > > > > ....
> > > >
> > > > After some tunning I managed to
> > > > > reduce it to 8 seconds in total and for the moment it fits the
> needs.
> > > > >
> > > >
> > > > What did you do Claudiu to get the time down?
> > > > Thanks,
> > > > St.Ack
> > > >
> > >
> >
>

Re: HBase region server failure issues

Posted by Ted Yu <yu...@gmail.com>.
Would the second WAL contain the same contents as the first?

We already have code that adds an interceptor on the calls to
namenode#getBlockLocations so that blocks on the same DN as the dead RS are
placed at the end of the priority queue.
See addLocationsOrderInterceptor()
in hbase-server/src/main/java/org/apache/hadoop/hbase/fs/HFileSystem.java

This is for faster recovery in case the regionserver is deployed on the same
box as the datanode.


On Tue, Apr 15, 2014 at 1:43 PM, Claudiu Soroiu <cs...@gmail.com> wrote:

> First of all, thanks for the clarifications.
>
> **how about 300 regions with 3x replication?  Or 1000 regions? This
> is going to be 3000 files. on HDFS. per one RS.**
>
> Now i see that the trade-off is how to reduce the recovery time without
> affecting the overall performance of the cluster.
> Having too many WAL's affects the write performance.
> Basically multiple WAL's might improve the process but the number of WAL's
> should be relatively small.
>
> Would it be feasible to know ahead of time where a region might activate in
> case of a failure and have for each region server a second WAL file
> containing backup edits?
> E.g. If machine B crashes then a region will be assigned to node A,  one to
> node C, etc.
> Also another view would be: Server A will backup a region from Server B if
> crashes, a region from server C, etc. Basically this second WAL will
> contain the data that is needed to fast recover a crashed node.
> This adds additional redundancy and some degree of complexity to the
> solution but ensures data locality in case of a crash and faster recovery.
>
> **What did you do Claudiu to get the time down?**
>
>  Decreased the hdfs block size to 64 megs for now.
>  Enabled settings to avoid hdfs stale nodes.
>  Cluster I tested this was relatively small - 10 computers.
>  I did tuning for zookeeper sessions to keep the heartbeat at 5 seconds for
> the moment, and plan to decrease this value.
>  At this point dfs.heartbeat.interval is set at the default 3 seconds, but
> this I also plan to decrease and perform a more intensive test.
>  (Decreasing the times is based on the experience with our current system
> configured at 1.2 seconds and didn't had any issues even under heavy loads,
> obviously stop the world GC times should be smaller that the heartbeat
> interval)
>  And I remember i did some changes for the reconnect intervals of the
> client to allow him to reconnect to the region as fast as possible.
>  I am in an early stage of experimenting with hbase but there are lot of
> things to test/check...
>
>
>
>
> On Tue, Apr 15, 2014 at 11:03 PM, Vladimir Rodionov
> <vl...@gmail.com>wrote:
>
> > *We also had a global HDFS file limit to contend with*
> >
> > Yes, we have been seeing this from time to time in our production
> clusters.
> > Periodic purging of old files helps, but the issue is obvious.
> >
> > -Vladimir Rodionov
> >
> >
> > On Tue, Apr 15, 2014 at 11:58 AM, Stack <st...@duboce.net> wrote:
> >
> > > On Mon, Apr 14, 2014 at 1:47 PM, Claudiu Soroiu <cs...@gmail.com>
> > wrote:
> > >
> > > > ....
> > >
> > > After some tunning I managed to
> > > > reduce it to 8 seconds in total and for the moment it fits the needs.
> > > >
> > >
> > > What did you do Claudiu to get the time down?
> > > Thanks,
> > > St.Ack
> > >
> >
>

Re: HBase region server failure issues

Posted by Claudiu Soroiu <cs...@gmail.com>.
First of all, thanks for the clarifications.

**how about 300 regions with 3x replication?  Or 1000 regions? This
is going to be 3000 files. on HDFS. per one RS.**

Now I see that the trade-off is how to reduce the recovery time without
affecting the overall performance of the cluster.
Having too many WALs affects the write performance.
Basically, multiple WALs might improve the process, but the number of WALs
should be relatively small.

Would it be feasible to know ahead of time where a region might activate in
case of a failure and have for each region server a second WAL file
containing backup edits?
E.g. if machine B crashes then one region will be assigned to node A, one to
node C, etc.
Another view would be: server A will back up a region from server B if it
crashes, a region from server C, etc. Basically this second WAL would
contain the data needed to quickly recover a crashed node.
This adds additional redundancy and some degree of complexity to the
solution, but ensures data locality in case of a crash and faster recovery.

**What did you do Claudiu to get the time down?**

 Decreased the hdfs block size to 64 megs for now.
 Enabled the settings to avoid stale hdfs datanodes.
 The cluster I tested this on was relatively small - 10 computers.
 I did tuning for zookeeper sessions to keep the heartbeat at 5 seconds for
the moment, and plan to decrease this value.
 At this point dfs.heartbeat.interval is set at the default 3 seconds, but
this I also plan to decrease and then perform a more intensive test.
 (Decreasing the times is based on experience with our current system,
configured at 1.2 seconds, which didn't have any issues even under heavy loads;
obviously stop-the-world GC times should be smaller than the heartbeat
interval.)
 And I remember I did some changes to the reconnect intervals of the
client to allow it to reconnect to the region as fast as possible.
 I am at an early stage of experimenting with hbase but there are a lot of
things to test/check...
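
To make the above reproducible, here is a hedged sketch of the settings I
believe correspond to each item (HDFS block size, stale-datanode avoidance,
ZooKeeper session timeout, DataNode heartbeat, client retry pause). The
values mirror the numbers in the mail where given; the client pause is only a
placeholder, and most of these are cluster-side hdfs-site.xml /
hbase-site.xml settings -- the Java object below is just a compact way to
list the keys.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class FastRecoveryTuningSketch {

  /** Sketch only; verify each key against your Hadoop/HBase versions. */
  public static Configuration fastRecoveryConf() {
    Configuration conf = HBaseConfiguration.create();
    // "Decreased the hdfs block size to 64 megs for now."
    conf.setLong("dfs.blocksize", 64L * 1024 * 1024);
    // "Enabled the settings to avoid stale hdfs datanodes."
    conf.setBoolean("dfs.namenode.avoid.read.stale.datanode", true);
    conf.setBoolean("dfs.namenode.avoid.write.stale.datanode", true);
    // ZooKeeper session timeout of ~5 seconds so a dead RS is noticed quickly.
    conf.setInt("zookeeper.session.timeout", 5000);
    // DataNode heartbeat left at the default 3 seconds for now.
    conf.setInt("dfs.heartbeat.interval", 3);
    // The mail mentions tuning client reconnect intervals but gives no
    // number; 100 ms here is only a placeholder.
    conf.setInt("hbase.client.pause", 100);
    return conf;
  }
}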




On Tue, Apr 15, 2014 at 11:03 PM, Vladimir Rodionov
<vl...@gmail.com>wrote:

> *We also had a global HDFS file limit to contend with*
>
> Yes, we have been seeing this from time to time in our production clusters.
> Periodic purging of old files helps, but the issue is obvious.
>
> -Vladimir Rodionov
>
>
> On Tue, Apr 15, 2014 at 11:58 AM, Stack <st...@duboce.net> wrote:
>
> > On Mon, Apr 14, 2014 at 1:47 PM, Claudiu Soroiu <cs...@gmail.com>
> wrote:
> >
> > > ....
> >
> > After some tunning I managed to
> > > reduce it to 8 seconds in total and for the moment it fits the needs.
> > >
> >
> > What did you do Claudiu to get the time down?
> > Thanks,
> > St.Ack
> >
>

Re: HBase region server failure issues

Posted by Vladimir Rodionov <vl...@gmail.com>.
*We also had a global HDFS file limit to contend with*

Yes, we have been seeing this from time to time in our production clusters.
Periodic purging of old files helps, but the issue is obvious.

-Vladimir Rodionov


On Tue, Apr 15, 2014 at 11:58 AM, Stack <st...@duboce.net> wrote:

> On Mon, Apr 14, 2014 at 1:47 PM, Claudiu Soroiu <cs...@gmail.com> wrote:
>
> > ....
>
> After some tunning I managed to
> > reduce it to 8 seconds in total and for the moment it fits the needs.
> >
>
> What did you do Claudiu to get the time down?
> Thanks,
> St.Ack
>

Re: HBase region server failure issues

Posted by Stack <st...@duboce.net>.
On Mon, Apr 14, 2014 at 1:47 PM, Claudiu Soroiu <cs...@gmail.com> wrote:

> ....

After some tunning I managed to
> reduce it to 8 seconds in total and for the moment it fits the needs.
>

What did you do Claudiu to get the time down?
Thanks,
St.Ack