Posted to dev@hbase.apache.org by "Du, Jingcheng" <ji...@intel.com> on 2014/03/20 08:43:04 UTC

Backup HMasters will go down if the zk connection expires without recovery

Dear Devs,

  I have run into a problem with the HMaster.
  I currently run multiple HMasters in a cluster. If the ZK session of one of the backup HMasters expires, that backup HMaster goes down immediately without trying to recover the ZK connection.
I saw the code below in HMaster.abortNow(); the fail.fast setting only applies to the active HMaster. Should the backup masters also recover when their ZK connection expires? Please advise. Thanks.

if (!this.isActiveMaster || this.stopped) {
  return true;
}
boolean failFast = conf.getBoolean("fail.fast.expired.active.master", false);


Regards,
Jingcheng
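
For readers following the thread, the check quoted in the message above can be modelled as a small decision function. This is only a sketch: AbortDecision and Action are hypothetical names, and the branch taken after failFast is read is an assumption based on the later replies (the active master attempts ZK session recovery unless fail-fast is set). It is not a copy of HMaster.abortNow().

```java
// Simplified model of the decision discussed in this thread.
// AbortDecision and Action are hypothetical names, not HBase classes.
final class AbortDecision {
    enum Action { ABORT, TRY_RECOVER_SESSION }

    /**
     * Decide what a master does when its ZooKeeper session expires.
     * failFast stands in for the "fail.fast.expired.active.master" setting.
     */
    static Action onZkSessionExpired(boolean isActiveMaster, boolean stopped, boolean failFast) {
        // Backup (or already stopped) masters return early: no recovery attempt.
        if (!isActiveMaster || stopped) {
            return Action.ABORT;
        }
        // Assumption: failFast=true skips recovery, the default (false) tries it.
        return failFast ? Action.ABORT : Action.TRY_RECOVER_SESSION;
    }
}
```

Under this model a backup master always aborts, whatever the value of the fail-fast setting, which matches the behaviour described in the question.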

Re: Backup HMasters will go down if the zk connection expires without recovery

Posted by ramkrishna vasudevan <ra...@gmail.com>.
I think if we need the recovery behaviour for the backup master as well, we
could do that by introducing a similar config and retrying the ZK connection.
Is there any specific reason why it was not retried? Maybe someone familiar
with this part of the design can comment.

Regards
Ram



Re: Backup HMasters will go down if the zk connection expires without recovery

Posted by Nick Dimiduk <nd...@gmail.com>.
I agree that resuming the process is best handled by site-local tooling.
Perhaps we could do a better job of informing that tooling about the nature
of the failure. Well-defined exit codes, for instance, may be useful.
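
As an illustration of the exit-code idea, here is a minimal sketch of how a process could report why it stopped and how site-local tooling could decide whether a restart makes sense. Every name and numeric value below is hypothetical; HBase does not define these codes.

```java
// Hypothetical exit codes for illustration only; HBase defines none of these.
final class MasterExitCodes {
    enum ExitReason {
        CLEAN_SHUTDOWN(0, false),      // operator asked for a stop
        ZK_SESSION_EXPIRED(64, true),  // transient: a restart is reasonable
        BAD_CONFIGURATION(65, false),  // restarting would fail again
        UNKNOWN_FAILURE(70, true);

        final int code;
        final boolean restartable;

        ExitReason(int code, boolean restartable) {
            this.code = code;
            this.restartable = restartable;
        }
    }

    /** Map a process exit status to a reason; unrecognised codes become UNKNOWN_FAILURE. */
    static ExitReason fromCode(int code) {
        for (ExitReason reason : ExitReason.values()) {
            if (reason.code == code) {
                return reason;
            }
        }
        return ExitReason.UNKNOWN_FAILURE;
    }

    /** What a supervisor would ask before relaunching the process. */
    static boolean shouldRestart(int code) {
        return fromCode(code).restartable;
    }
}
```

The process side would call System.exit(ExitReason.ZK_SESSION_EXPIRED.code) on session expiry, and the tooling side would call shouldRestart with the status it observed.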

On Thursday, March 20, 2014, Du, Jingcheng <ji...@intel.com> wrote:

> Thanks a lot for the comments.
>
> I think we could have another service or supervisor to bring the backup
> masters back when they go down.
>
> Regards,
> Jingcheng

RE: Backup HMasters will go down if the zk connection expires without recovery

Posted by "Du, Jingcheng" <ji...@intel.com>.
Thanks a lot for the comments.

I think we could have another service or supervisor to bring the backup masters back when they go down.

Regards,
Jingcheng

-----Original Message-----
From: ramkrishna vasudevan [mailto:ramkrishna.s.vasudevan@gmail.com] 
Sent: Friday, March 21, 2014 12:02 PM
To: dev@hbase.apache.org
Subject: Re: Backup HMasters will go down if the zk connection expires without recovery

We discussed this internally too. Maybe the intention was to see whether it can be handled in code. Generally, management of these backup masters can be done outside of HBase through monitoring services.
@Jingcheng
What do you think?

Regards
Ram



Re: Backup HMasters will go down if the zk connection expires without recovery

Posted by ramkrishna vasudevan <ra...@gmail.com>.
We discussed this internally too. Maybe the intention was to see whether
it can be handled in code. Generally, management of these backup
masters can be done outside of HBase through monitoring services.
@Jingcheng
What do you think?

Regards
Ram


On Fri, Mar 21, 2014 at 3:25 AM, Enis Söztutar <en...@gmail.com> wrote:

> Zk session recovery in the active master was added some time ago, but it
> requires a complex state management in regards to what services inside
> master to reinitialize or keep. We discussed that we should remove it
> altogether since this increases the code complexity by a lot, and makes the
> recovery from zk session lost very error prone (I remember 1-2 issues
> fixing this area).
>
> I think architecturally, we remove zk session recovery from active master,
> and not add this to backup masters at all. Another service, like Ambari, or
> a supervisor should be responsible to bring the master / backup master
> nodes back.
>
> Enis

Re: Backup HMasters will go down if the zk connection expires without recovery

Posted by Enis Söztutar <en...@gmail.com>.
ZK session recovery in the active master was added some time ago, but it
requires complex state management to decide which services inside the
master to reinitialize or keep. We discussed removing it altogether,
since it increases code complexity considerably and makes recovery
from a lost ZK session very error prone (I remember 1-2 issues fixing
this area).

I think that, architecturally, we should remove ZK session recovery from the
active master and not add it to backup masters at all. Another service, like
Ambari, or a supervisor should be responsible for bringing the master / backup
master nodes back.

Enis


On Thu, Mar 20, 2014 at 11:35 AM, Andrew Purtell <ap...@apache.org> wrote:

> Why did the backup master's zookeeper session expire? That indicates a
> problem somewhere on the network or with zookeeper.
>
> The active master and regionservers also shut down when their sessions
> expire. If our zookeeper session expires we have been partitioned and have
> a high degree of uncertainty from our vantage point on the state of the
> world. We shut down to avoid accidentally taking incorrect actions with bad
> or out of date state. This simplifies design and removes corner cases.  In
> a production environment I would expect a site local strategy (could be
> daemontools etc.) for automatic service recovery, if that is desired.

Re: Backup HMasters will go down if the zk connection expires without recovery

Posted by Andrew Purtell <ap...@apache.org>.
Why did the backup master's ZooKeeper session expire? That indicates a
problem somewhere on the network or with ZooKeeper.

The active master and regionservers also shut down when their sessions
expire. If our ZooKeeper session expires, we have been partitioned and have
a high degree of uncertainty, from our vantage point, about the state of the
world. We shut down to avoid accidentally taking incorrect actions with bad
or out-of-date state. This simplifies design and removes corner cases. In
a production environment I would expect a site-local strategy (could be
daemontools etc.) for automatic service recovery, if that is desired.
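
To make the site-local recovery idea concrete, here is a minimal sketch of a supervisor loop that relaunches a process when it exits abnormally. It is not a replacement for daemontools or any other real supervisor. MasterSupervisor, supervise and command are hypothetical names, and the restart limit and backoff are illustrative assumptions.

```java
import java.io.IOException;
import java.util.function.IntSupplier;

// Hypothetical supervisor sketch; not part of HBase or daemontools.
final class MasterSupervisor {

    /**
     * Run launchAndWait until it returns exit code 0 or the restart budget
     * is used up. Returns the total number of launches performed.
     */
    static int supervise(IntSupplier launchAndWait, int maxRestarts, long backoffMillis) {
        int launches = 0;
        while (true) {
            int exitCode = launchAndWait.getAsInt();
            launches++;
            // Stop on a clean exit, or once the first launch plus maxRestarts retries are spent.
            if (exitCode == 0 || launches > maxRestarts) {
                return launches;
            }
            try {
                Thread.sleep(backoffMillis); // fixed pause before relaunching
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return launches;
            }
        }
    }

    /** Wrap an external command as a "launch and wait for exit code" step. */
    static IntSupplier command(String... cmd) {
        return () -> {
            try {
                return new ProcessBuilder(cmd).inheritIO().start().waitFor();
            } catch (IOException e) {
                return 127; // assumed convention: command could not be started
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return 130; // assumed convention: interrupted while waiting
            }
        };
    }
}
```

For example, MasterSupervisor.supervise(MasterSupervisor.command("bin/hbase", "master", "start"), 5, 10000L) would relaunch a foreground master up to five times with a ten second pause between attempts. The command line is an assumption about the local installation.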



-- 
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)