Posted to user@hbase.apache.org by Mohit <mo...@huawei.com> on 2011/03/25 09:56:48 UTC

Query Regarding Design Strategy behind Abortable.

Hello Users/Authors

 

We've observed in our cluster that the HMaster went down after ZooKeeper
triggered a watched event of type session expired.

 

Why not reconnect to ZooKeeper (at least try once, and abort only if that
fails) and reset the trackers/watchers, instead of aborting/killing the
HMaster/HRegionServers? That is how it is done in one implementation of
Abortable, namely HConnectionImplementation in HConnectionManager.

 

Kindly brief me on this design strategy.

 

Thanks

-Mohit

Re: Query Regarding Design Strategy behind Abortable.

Posted by Stack <st...@duboce.net>.

On Fri, Mar 25, 2011 at 1:56 AM, Mohit <mo...@huawei.com> wrote:
> Why not reconnect back to the zookeeper(at least try once and then abort, if
> unsuccessful) and resetting trackers/watchers instead of aborting/killing
> HMaster/HRegionServers just like it is done in one of the implementation of
> abort able named HConnectionImplementation present in HConnectionManager?
>

Hello Mohit:

The ZooKeeper client is doing what you describe, sort of.  On session
timeout, it does a reconnect to the ensemble to ask if its session has
indeed expired.  If it has, then it'll log session expired.

The regionserver will kill itself on loss of session because it's
likely that the data it was hosting has been taken over by another
regionserver.
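
The two cases above (a dropped connection that the client retries on its
own, versus an expired session that leads to an abort) can be sketched in
plain Java. This is only an illustration under stated assumptions: every
type below is a hypothetical stand-in written for this example, not the
real ZooKeeper or HBase class, and the Abortable interface here is merely
modeled on the idea of the HBase one.

```java
// Hypothetical, self-contained sketch. None of these types are the real
// ZooKeeper or HBase classes; they only mirror the behaviour described above.
class SessionExpirySketch {

    // Stand-in for the session states a ZooKeeper watcher can observe.
    enum SessionState { SYNC_CONNECTED, DISCONNECTED, EXPIRED }

    // Local stand-in modeled on the idea of an abortable server.
    interface Abortable {
        void abort(String why, Throwable cause);
    }

    // Reacts to session state changes on behalf of a server.
    static class SessionWatcher {
        private final Abortable server;

        SessionWatcher(Abortable server) {
            this.server = server;
        }

        void process(SessionState state) {
            switch (state) {
                case SYNC_CONNECTED:
                    // Session is healthy; nothing to do.
                    break;
                case DISCONNECTED:
                    // Connection dropped but the session may still be alive.
                    // The client library is assumed to reconnect by itself,
                    // so the server keeps running.
                    break;
                case EXPIRED:
                    // The ensemble has declared the session dead. Other
                    // servers may already have taken over this server's
                    // data, so carrying on is unsafe: abort.
                    server.abort("ZooKeeper session expired", null);
                    break;
            }
        }
    }

    // Test double that records whether abort was requested.
    static class RecordingServer implements Abortable {
        boolean aborted = false;
        String reason = null;

        @Override
        public void abort(String why, Throwable cause) {
            aborted = true;
            reason = why;
        }
    }

    public static void main(String[] args) {
        RecordingServer server = new RecordingServer();
        SessionWatcher watcher = new SessionWatcher(server);

        watcher.process(SessionState.DISCONNECTED);
        System.out.println("aborted after DISCONNECTED: " + server.aborted);

        watcher.process(SessionState.EXPIRED);
        System.out.println("aborted after EXPIRED: " + server.aborted);
    }
}
```

The point of the sketch is the asymmetry: a disconnect is recoverable
because the session, and whatever is tied to it, still exists, while an
expiry means the rest of the cluster may already have acted on the
server's absence.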

The retry you refer to, IIRC, is something different -- it's before
session setup?  Please cite it if you'd like me to explain.

Do you think the session timed out because of a long GC pause?  If
you are on 0.90.1, there may be some things you can do.  See
http://hbase.apache.org/book/performance.html#jvm

Yours,
St.Ack