Posted to user@hbase.apache.org by Heng Chen <he...@gmail.com> on 2016/02/25 00:31:04 UTC

Some problems in one accident on my production cluster

The story: I ran an MR job on my production cluster (0.98.6) that scans one
table during the map phase.

Because of the heavy load from the job, all my RS crashed due to OOM.

After I restarted all the RS, I found one problem.

All regions were reopened on a single RS, and the balancer could not run
because two regions were stuck in transition. The cluster stayed stuck for a
long time until I restarted the master.

1.  Why did this happen?

2.  If the cluster has a lot of regions, how should it be restarted after all
RS crash? Restarting the RS one by one means OOM may happen again, because
one RS ends up holding all the regions, and it takes a long time.

3.  Is it possible to give each table a request quota, so that when one table
is requested heavily it has no impact on the other tables in the cluster?


Thanks

Re: Some problems in one accident on my production cluster

Posted by Ted Yu <yu...@gmail.com>.
bq. RegionStates: THIS SHOULD NOT HAPPEN: unexpected {
ad283942aff2bba6c0b94ff98a904d1a state=SPLITTING_NEW

Looks like the above wouldn't have happened if you were running 0.98.11+.

See HBASE-12958


Re: Some problems in one accident on my production cluster

Posted by Heng Chen <he...@gmail.com>.
Thanks @ted, your suggestions for #2 and #3 are exactly what I needed!


Re: Some problems in one accident on my production cluster

Posted by Heng Chen <he...@gmail.com>.
I picked out some lines from master.log about one region,
"ad283942aff2bba6c0b94ff98a904d1a":


2016-02-24 16:24:35,610 INFO  [AM.ZK.Worker-pool2-t3491]
master.RegionStates: Transition null to {ad283942aff2bba6c0b94ff98a904d1a
state=SPLITTING_NEW, ts=1456302275610,
server=dx-common-regionserver1-online,60020,1456302268068}
2016-02-24 16:25:40,472 WARN
 [MASTER_SERVER_OPERATIONS-dx-common-hmaster1-online:60000-0]
master.RegionStates: THIS SHOULD NOT HAPPEN: unexpected
{ad283942aff2bba6c0b94ff98a904d1a state=SPLITTING_NEW, ts=1456302275610,
server=dx-common-regionserver1-online,60020,1456302268068}
2016-02-24 16:34:24,769 DEBUG
[dx-common-hmaster1-online,60000,1433937470611-BalancerChore]
master.HMaster: Not running balancer because 2 region(s) in transition:
{ad283942aff2bba6c0b94ff98a904d1a={ad283942aff2bba6c0b94ff98a904d1a
state=SPLITTING_NEW, ts=1456302275610,
server=dx-common-regionserver1-online,60020,1456302268068},
ab07d6fbcef39be032ba11ca6ba252ef={ab07d6fbcef39be032ba11ca6ba252ef
state=SPLITTING_NEW...
2016-02-24 16:39:24,768 DEBUG
[dx-common-hmaster1-online,60000,1433937470611-BalancerChore]
master.HMaster: Not running balancer because 2 region(s) in transition:
{ad283942aff2bba6c0b94ff98a904d1a={ad283942aff2bba6c0b94ff98a904d1a
state=SPLITTING_NEW, ts=1456302275610,
server=dx-common-regionserver1-online,60020,1456302268068},
ab07d6fbcef39be032ba11ca6ba252ef={ab07d6fbcef39be032ba11ca6ba252ef
state=SPLITTING_NEW...
2016-02-24 16:44:24,768 DEBUG
[dx-common-hmaster1-online,60000,1433937470611-BalancerChore]
master.HMaster: Not running balancer because 2 region(s) in transition:
{ad283942aff2bba6c0b94ff98a904d1a={ad283942aff2bba6c0b94ff98a904d1a
state=SPLITTING_NEW, ts=1456302275610,
server=dx-common-regionserver1-online,60020,1456302268068},
ab07d6fbcef39be032ba11ca6ba252ef={ab07d6fbcef39be032ba11ca6ba252ef
state=SPLITTING_NEW...
2016-02-24 16:45:37,749 DEBUG [FifoRpcScheduler.handler1-thread-10]
master.HMaster: Not running balancer because 2 region(s) in transition:
{ad283942aff2bba6c0b94ff98a904d1a={ad283942aff2bba6c0b94ff98a904d1a
state=SPLITTING_NEW, ts=1456302275610,
server=dx-common-regionserver1-online,60020,1456302268068},
ab07d6fbcef39be032ba11ca6ba252ef={ab07d6fbcef39be032ba11ca6ba252ef
state=SPLITTING_NEW...
2016-02-24 16:49:24,769 DEBUG
[dx-common-hmaster1-online,60000,1433937470611-BalancerChore]
master.HMaster: Not running balancer because 2 region(s) in transition:
{ad283942aff2bba6c0b94ff98a904d1a={ad283942aff2bba6c0b94ff98a904d1a
state=SPLITTING_NEW, ts=1456302275610,
server=dx-common-regionserver1-online,60020,1456302268068},
ab07d6fbcef39be032ba11ca6ba252ef={ab07d6fbcef39be032ba11ca6ba252ef
state=SPLITTING_NEW...
2016-02-24 16:54:24,768 DEBUG
[dx-common-hmaster1-online,60000,1433937470611-BalancerChore]
master.HMaster: Not running balancer because 2 region(s) in transition:
{ad283942aff2bba6c0b94ff98a904d1a={ad283942aff2bba6c0b94ff98a904d1a
state=SPLITTING_NEW, ts=1456302275610,
server=dx-common-regionserver1-online,60020,1456302268068},
ab07d6fbcef39be032ba11ca6ba252ef={ab07d6fbcef39be032ba11ca6ba252ef
state=SPLITTING_NEW...
2016-02-24 16:59:24,768 DEBUG
[dx-common-hmaster1-online,60000,1433937470611-BalancerChore]
master.HMaster: Not running balancer because 2 region(s) in transition:
{ad283942aff2bba6c0b94ff98a904d1a={ad283942aff2bba6c0b94ff98a904d1a
state=SPLITTING_NEW, ts=1456302275610,
server=dx-common-regionserver1-online,60020,1456302268068},
ab07d6fbcef39be032ba11ca6ba252ef={ab07d6fbcef39be032ba11ca6ba252ef
state=SPLITTING_NEW...
2016-02-24 17:04:24,769 DEBUG
[dx-common-hmaster1-online,60000,1433937470611-BalancerChore]
master.HMaster: Not running balancer because 2 region(s) in transition:
{ad283942aff2bba6c0b94ff98a904d1a={ad283942aff2bba6c0b94ff98a904d1a
state=SPLITTING_NEW, ts=1456302275610,
server=dx-common-regionserver1-online,60020,1456302268068},
ab07d6fbcef39be032ba11ca6ba252ef={ab07d6fbcef39be032ba11ca6ba252ef
state=SPLITTING_NEW...
2016-02-24 17:09:24,768 DEBUG
[dx-common-hmaster1-online,60000,1433937470611-BalancerChore]
master.HMaster: Not running balancer because 2 region(s) in transition:
{ad283942aff2bba6c0b94ff98a904d1a={ad283942aff2bba6c0b94ff98a904d1a
state=SPLITTING_NEW, ts=1456302275610,
server=dx-common-regionserver1-online,60020,1456302268068},
ab07d6fbcef39be032ba11ca6ba252ef={ab07d6fbcef39be032ba11ca6ba252ef
state=SPLITTING_NEW...






Re: Some problems in one accident on my production cluster

Posted by Ted Yu <yu...@gmail.com>.
bq. two regions were in transition

Can you pastebin the related server logs w.r.t. these two regions so that we
can get more clues?

For #2, please see http://hbase.apache.org/book.html#big.cluster.config
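That section boils down to making the master wait for more region servers to
check in before it starts assigning regions. A hbase-site.xml sketch; the
value is a placeholder to tune to your fleet size:

```xml
<!-- Sketch only: have the master wait for at least this many region
     servers before it begins assigning regions, so one early RS does
     not end up holding everything. Tune the value to your cluster. -->
<property>
  <name>hbase.master.wait.on.regionservers.mintostart</name>
  <value>20</value>
</property>
```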

For #3, please see
http://hbase.apache.org/book.html#_running_multiple_workloads_on_a_single_cluster
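The per-table throttling described there (available from 1.1 on) looks
roughly like this in the HBase shell; the table name and limit below are
placeholder values:

```
# Requires HBase 1.1+ with hbase.quota.enabled=true in hbase-site.xml.
# 't1' and the limit are placeholders.
hbase> set_quota TYPE => THROTTLE, TABLE => 't1', LIMIT => '1000req/sec'
# To remove the throttle later:
hbase> set_quota TYPE => THROTTLE, TABLE => 't1', LIMIT => NONE
```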


Re: Some problems in one accident on my production cluster

Posted by Heng Chen <he...@gmail.com>.
Thanks stack and ted for your help.

After checking the code, I think the reason is: the RS sent a split request
covering the parent region and the two daughter regions, and then the RS
crashed.

The master moved the two daughter regions into SPLITTING_NEW state and put
them in regionsInTransition, which is kept only in the master's memory.

And in 0.98.11 and earlier, serverOffline did not handle the case where a
region is in SPLITTING_NEW state, so we had to restart the master.

As Ted said, HBASE-12958 fixed this.

As for the "set_quota" command, it was introduced in 1.1, so I will upgrade
my cluster.

Thanks guys for your help.




Re: Some problems in one accident on my production cluster

Posted by Stack <st...@duboce.net>.
On Wed, Feb 24, 2016 at 3:31 PM, Heng Chen <he...@gmail.com> wrote:

> The story is I run one MR job on my production cluster (0.98.6),   it needs
> to scan one table during map procedure.
>
> Because of the heavy load from the job,  all my RS crashed due to OOM.
>
>
Really big rows? If so, can you narrow your scan or ask for partial rows
(IIRC, you can do this in 0.98.x), or move up to HBase 1.1+, where scanning
does 'chunking'?
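For the 0.98.x "partial rows" route, the client-side knobs would be along
these lines; a sketch only, with placeholder table/family names, assuming
hbase-client on the classpath:

```java
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: bound how much a single scan RPC can pull into RS memory.
Scan scan = new Scan();
scan.setCaching(100);                    // rows fetched per RPC -- keep modest
scan.setBatch(50);                       // columns per Result ("partial rows")
scan.setMaxResultSize(2L * 1024 * 1024); // cap bytes returned per RPC
scan.addFamily(Bytes.toBytes("cf"));     // scan only the family the job needs
```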


> After i restart all RS,  i found one problem.
>
> All regions were reopened on one RS,



... the others took a while to check in? That's the usual reason one RS gets
a bunch of regions.



> and balancer could not run because of
> two regions were in transition.   The cluster got in stuck a long time
> until i restarted master.
>
> 1.  why this happened?
>
Would need logs. I see you posted some later. Good to go to the server that
was doing the split and look in its log around the time of the split failure.


> 2.  If cluster has a lots of regions, after all RS crash,  how to restart
> the cluster.  If restart RS one by one, it means OOM may happen because one
> RS has to hold all regions and it will cost a long time.
>
>
Best to restart the whole cluster in this case (after figuring out why the
others took a while to check in... look at their logs around startup time to
see why they dallied).
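In script form, the full restart (as opposed to one-by-one) is roughly the
following; paths assume a standard tarball install:

```
# Full stop/start, so regions spread across all RS as they check in
# together instead of piling onto whichever RS registers first.
$HBASE_HOME/bin/stop-hbase.sh
$HBASE_HOME/bin/start-hbase.sh

# For routine maintenance (not this recovery case) there is also:
# $HBASE_HOME/bin/rolling-restart.sh
```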


> 3.  Is it possible to make each table with some requests quotas,  it means
> when one table is requested heavily, it has no impact to other tables on
> cluster.
>
>
Not sure what the state of this is in 0.98. Maybe someone closer to 0.98
knows.

St.Ack



>
> Thanks
>