Posted to user@hbase.apache.org by 张磊 <zh...@youku.com> on 2012/10/18 13:30:48 UTC

one RegionServer crashed and the whole cluster was blocked

Hi, All

  One of the RegionServers in our company’s cluster crashed. While it was
down, I observed the following:

1.       All the RegionServers stopped handling client requests
(requestsPerSecond=0 on the master-status UI page).

2.       Recovery took about 12-15 minutes.

3.       I have set hbase.regionserver.restart.on.zk.expire to true, but it
had no effect.

  For 1, I know the cluster began splitting logs and recovering the data
from the crashed RegionServer. Does the recovery operation block all
client requests?

  For 2, is there any way to reduce the recovery time?

  For 3, I checked the log and found a “session is timeout” exception;
possibly a full GC caused the session to time out. But why does
hbase.regionserver.restart.on.zk.expire not work? My HBase version is
0.94.0.

 

  Thanks for any suggestions and feedback!

 

Fowler Zhang

Re: one RegionServer crashed and the whole cluster was blocked

Posted by Nicolas Liochon <nk...@gmail.com>.

Hi,

Some comments inline below:

On Thu, Oct 18, 2012 at 1:30 PM, 张磊 <zh...@youku.com> wrote:

> Hi, All
>
>   One of the RegionServers in our company’s cluster crashed. While it was
> down, I observed the following:
>
> 1.       All the RegionServers stopped handling client requests
> (requestsPerSecond=0 on the master-status UI page).
>
> 2.       Recovery took about 12-15 minutes.
>
> 3.       I have set hbase.regionserver.restart.on.zk.expire to true, but it
> had no effect.
>
>   For 1, I know the cluster began splitting logs and recovering the data
> from the crashed RegionServer. Does the recovery operation block all
> client requests?
>

No. But it's worth checking whether the region server that died was the one
hosting the .META. region. If so, that could be an explanation: clients do
cache region locations, but on first access to a region they have to go to
the .META. region first.
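
To illustrate why losing the server that hosts the catalog (.META.) region
can look like a cluster-wide stall, here is a minimal toy model of a client
location cache. It is not the HBase client API: ToyClient, MetaUnavailable
and the region and server names are all hypothetical, and a real client
would retry instead of raising immediately.

```python
class MetaUnavailable(Exception):
    """Raised when the catalog region cannot be reached (hypothetical)."""


class ToyClient:
    """Toy model of a client-side region location cache; not the HBase API."""

    def __init__(self, meta_table, meta_online=True):
        self.meta_table = meta_table      # region name -> server name
        self.meta_online = meta_online    # False while .META. is being recovered
        self.cache = {}                   # locations this client already knows

    def locate(self, region):
        # Cached locations are served without touching the catalog region.
        if region in self.cache:
            return self.cache[region]
        # First access to a region needs a catalog lookup.
        if not self.meta_online:
            raise MetaUnavailable(f"cannot look up {region}")
        server = self.meta_table[region]
        self.cache[region] = server
        return server


client = ToyClient({"r1": "rs-a", "r2": "rs-b"})
client.locate("r1")            # warms the cache while .META. is online
client.meta_online = False     # the server hosting .META. has died

cached = client.locate("r1")   # still answered from the cache
try:
    client.locate("r2")        # never seen before, needs .META.
    first_access_blocked = False
except MetaUnavailable:
    first_access_blocked = True
```

In this model, requests to already-cached regions keep working, while every
first-time lookup stalls until the catalog region is back online.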

>   For 2, is there any way to reduce the recovery time?
>

12 minutes for a single region server crash (i.e. the datanode is still
there and the rest of the cluster is fine) seems huge.
You need to look at:
- a possible root cause: if the region server got disconnected, it may be
because the network or ZooKeeper was in bad shape anyway. In that case the
recovery is slow because the cause of the crash is still there.
- the state of your cluster: do you have a lot of regions to recover? Did
you have a lot of writes on this region server?

>   For 3, I checked the log and found a “session is timeout” exception;
> possibly a full GC caused the session to time out. But why does
> hbase.regionserver.restart.on.zk.expire not work? My HBase version is
> 0.94.0.
>

I'm not sure this setting is still in the code base; that needs to be
checked. Also, there may be a root cause that makes the server stop
regardless.
But there are two sides to a ZK disconnect anyway:
1) the region server: if it is disconnected but actually still alive, it
may decide to kill itself, or not.
2) the cluster: after the timeout, the timed-out region server is considered
dead and the recovery starts, regardless of what happens in 1). So
whatever happens in 1) does not change much from an MTTR point of view,
except if your cluster is small, or if you're losing multiple nodes.
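
As a rough sketch of the two sides described above: a pause (for example a
full GC) longer than the ZooKeeper session timeout loses the session, and the
cluster starts recovery once the timeout has elapsed, whatever the region
server then does. The function names and the 180-second value are
illustrative assumptions, not HBase code or a recommended setting.

```python
def session_expired(pause_s, session_timeout_s):
    """Roughly: a pause longer than the session timeout loses the ZK session."""
    return pause_s > session_timeout_s


def recovery_start_time(last_heartbeat_s, session_timeout_s):
    """Cluster side: the server is declared dead once the timeout elapses.

    Nothing the region server does afterwards (restart, abort, carry on)
    appears in this formula, which is the point made in 2) above.
    """
    return last_heartbeat_s + session_timeout_s


SESSION_TIMEOUT_S = 180   # example value only; check your own configuration

expired = session_expired(pause_s=200, session_timeout_s=SESSION_TIMEOUT_S)
starts_at = recovery_start_time(last_heartbeat_s=100,
                                session_timeout_s=SESSION_TIMEOUT_S)
print(expired, starts_at)
```

The detection delay is therefore a fixed part of the MTTR; log splitting and
region reassignment only begin after it.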

There is an autorestart option in the 0.96 scripts. It does not change
the MTTR itself, but it covers more cases of region server crashes. See the
release notes in HBASE-5939.

Good luck,

Nicolas

RE: one RegionServer crashed and the whole cluster was blocked

Posted by "Ramkrishna.S.Vasudevan" <ra...@huawei.com>.

>   For 1, I know the cluster began splitting logs and recovering the data
> from the crashed RegionServer. Does the recovery operation block all
> client requests?


Ideally it should not. But if your client was generating data for the
regions hosted on the dead server, then those requests will not be served
until the regions are back online on some other region server after log
splitting.
Client requests going to other region servers should ideally keep working.
Did you look at the thread dumps on the other RSs at that time? They should
give some clue.

>   For 2, is there any way to reduce the recovery time?

The recovery time depends on the amount of data, and particularly on the
size of the HLog files. By default every HLog file is of size 256MB.
In 0.94.0 a good number of changes have gone in to make recovery faster in
terms of HLog splitting.
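
A back-of-the-envelope way to see how HLog volume drives recovery time is
sketched below. Only the 256MB file size comes from the message above; the
number of outstanding HLog files and the split throughput are made-up
example values, so measure your own cluster before relying on the result.

```python
def estimate_split_seconds(num_hlogs, hlog_size_mb, split_throughput_mb_s):
    """Estimate log-splitting time as total WAL volume / split throughput."""
    total_mb = num_hlogs * hlog_size_mb
    return total_mb / split_throughput_mb_s


# Assumed example: 32 outstanding HLog files of 256 MB, split at 50 MB/s.
seconds = estimate_split_seconds(num_hlogs=32,
                                 hlog_size_mb=256,
                                 split_throughput_mb_s=50)
print(f"{seconds:.0f} s to split {32 * 256} MB of HLogs")
```

Fewer or smaller outstanding HLog files, or more servers sharing the split
work, shrink this term; it comes on top of the failure-detection delay.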


> 3.       I have set hbase.regionserver.restart.on.zk.expire to true, but it
> had no effect.

I am not sure how the code handles this property. I will check this
part.

Regards
Ram


