Posted to dev@hbase.apache.org by "Andrew Purtell (JIRA)" <ji...@apache.org> on 2009/04/05 23:22:12 UTC

[jira] Created: (HBASE-1312) ZooKeeper: Master's ephemeral node went away while it was still up and functioning normally

ZooKeeper: Master's ephemeral node went away while it was still up and functioning normally
-------------------------------------------------------------------------------------------

                 Key: HBASE-1312
                 URL: https://issues.apache.org/jira/browse/HBASE-1312
             Project: Hadoop HBase
          Issue Type: Bug
            Reporter: Andrew Purtell


Does the master watch its own znode? Right around the time of regionserver problems described in HBASE-1311, clients could no longer find the master, but according to its log it was up and functioning normally. I think the master and regionserver sessions expired at the same time, as they were started within seconds of each other.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Re: [jira] Created: (HBASE-1312) ZooKeeper: Master's ephemeral node went away while it was still up and functioning normally

Posted by Nitay <ni...@gmail.com>.
Internal reset. To the outside world it will look like the node dying and
coming back up fairly quickly.

On Sun, Apr 5, 2009 at 3:17 PM, Ryan Rawson <ry...@gmail.com> wrote:

> When you say 'node restart' - do you mean a JVM reboot, or will we be able
> to have an internal reset?
>
> Most people don't run hbase under some job control, so when hbase jvms die,
> they stay dead...
>
> -ryan

Re: [jira] Created: (HBASE-1312) ZooKeeper: Master's ephemeral node went away while it was still up and functioning normally

Posted by Andrew Purtell <ap...@apache.org>.
> From: Ryan Rawson
> Most people don't run hbase under some job control, so when
> hbase jvms die, they stay dead...

Well... We can leave it up to the user to do process recovery
should, e.g., an HRS abort, or we can consider providing some
basic automatic recovery. For example, in my old deployment we
launched HBase (and Hadoop) daemons as children of monitoring
processes. Our monitoring and recovery framework was pretty
elaborate and I'm not suggesting to roll something like that.
However, to get people over the hump at this point in the
project's maturity where HRS may go down for "avoidable"
problems like OOME or filesystem glitches, running HBase
processes as children of simple monitors that can
automatically launch new children under some specific failure
scenarios is not necessarily a bad idea. For one thing it
specifically counteracts the "spiral of death" scenario where
OOME of one HRS takes it down, distributing increasing load
to others, which go down in turn, in an accelerating chain
reaction. 
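
A minimal sketch of the "simple monitor" idea above, in Python using only the standard library. The function name `supervise`, the set of restartable exit codes, and the restart cap are hypothetical choices made for illustration; none of this is part of HBase. Exit code 1 stands in for an "avoidable" failure such as an OOME, and any other code is treated as not worth relaunching.

```python
import subprocess
import sys
import time

# Hypothetical policy: exit codes the monitor treats as recoverable.
RESTARTABLE_EXIT_CODES = frozenset({1})


def supervise(cmd, max_restarts=3, backoff_seconds=0.0,
              restartable=RESTARTABLE_EXIT_CODES):
    """Run cmd as a child process and relaunch it after recoverable exits.

    Returns the list of exit codes observed, one per launch.
    """
    exit_codes = []
    restarts = 0
    while True:
        code = subprocess.call(cmd)  # blocks until the child exits
        exit_codes.append(code)
        # Stop on a clean or non-recoverable exit, or once the cap is hit,
        # so a persistently failing child is not relaunched forever.
        if code not in restartable or restarts >= max_restarts:
            return exit_codes
        restarts += 1
        time.sleep(backoff_seconds)


if __name__ == "__main__":
    # Demo child that always fails with exit code 1.
    crashing_child = [sys.executable, "-c", "import sys; sys.exit(1)"]
    print(supervise(crashing_child, max_restarts=2))
```

The restart cap matters for the "spiral of death" case: a monitor that relaunches without limit can keep feeding load to a server that will only fall over again.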

   - Andy



      

Re: [jira] Created: (HBASE-1312) ZooKeeper: Master's ephemeral node went away while it was still up and functioning normally

Posted by Ryan Rawson <ry...@gmail.com>.
When you say 'node restart' - do you mean a JVM reboot, or will we be able
to have an internal reset?

Most people don't run hbase under some job control, so when hbase jvms die,
they stay dead...

-ryan


Re: [jira] Created: (HBASE-1312) ZooKeeper: Master's ephemeral node went away while it was still up and functioning normally

Posted by Andrew Purtell <ap...@apache.org>.
That's an unfortunate side effect of some aspect of the ZK
implementation, I suppose. 

HBase clients, regionservers, and masters with watches on
ephemeral nodes will have to treat their disappearance as
advisory only and check back once or twice before taking
any recovery actions. It lengthens the time for recovery
beyond what would be necessary without this wrinkle, which
is unfortunate. 
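
The "advisory only, check back once or twice" approach could look roughly like the sketch below. It is written in Python against a caller-supplied `exists` probe rather than any real ZooKeeper client; `confirm_gone`, `checks`, and `delay_seconds` are hypothetical names introduced here.

```python
import time


def confirm_gone(exists, checks=3, delay_seconds=0.0):
    """Return True only if exists() reports False on every probe.

    exists: zero-argument callable that returns True while the node is present.
    checks: total number of probes made before the node is declared gone.
    """
    for attempt in range(checks):
        if exists():
            # The node is present (again), so the disappearance was transient.
            return False
        if attempt < checks - 1:
            time.sleep(delay_seconds)
    return True
```

A caller would take recovery action only when `confirm_gone(...)` returns True. The cost is the one noted above: recovery is delayed by roughly `(checks - 1) * delay_seconds`.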

Just to be clear, by restart you are talking about
re-initializing the ZK wrapper only, correct? It should not be
necessary to restart everything on a node to deal with an
expired ZK session, right? 

> From: Nitay
>
> The master did not respond correctly to a SessionExpired
> event. I don't think there's a ZK bug. This is like
> HBASE-1232. Both the master and regionserver got a
> SessionExpired event. The bug I fixed for Ryan was just
> with the client getting a SessionExpired. Andrew's
> cluster shows us that it's just as likely for the master/
> RS to get this event.
> 
> The only thing you can do on a SessionExpired event is to
> completely restart the node. SessionExpired means your
> ZooKeeper handle is dead, and your ephemeral nodes will go
> away. Since every server in HBase has some ephemeral
> node that indicates its liveness (e.g. /hbase/master,
> /hbase/rs/...), the node has to completely restart.
> 
> HBASE-1232, HBASE-1311, and HBASE-1312 are all the same
> problem, just with three different points of view (client,
> RS, master).
> 
> On Sun, Apr 5, 2009 at 2:32 PM, Ryan Rawson wrote:
> 
> > ZK keeps the node up as long as the session is still
> > valid.
> > So the question is:
> > - did the master not respond correctly to an expired
> > session?
> > - is there a ZK bug (HOPE NOT!)
> >
> > -ryan
> >
> > On Sun, Apr 5, 2009 at 2:22 PM, Andrew Purtell wrote:
> > > ZooKeeper: Master's ephemeral node went away
> > > while it was still up and functioning normally



      

Re: [jira] Created: (HBASE-1312) ZooKeeper: Master's ephemeral node went away while it was still up and functioning normally

Posted by Nitay <ni...@gmail.com>.
The master did not respond correctly to a SessionExpired event. I don't
think there's a ZK bug. This is like HBASE-1232. Both the master and
regionserver got a SessionExpired event. The bug I fixed for Ryan was just
with the client getting a SessionExpired. Andrew's cluster shows us that
it's just as likely for the master/RS to get this event.

The only thing you can do on a SessionExpired event is to completely restart
the node. SessionExpired means your ZooKeeper handle is dead, and your
ephemeral nodes will go away. Since every server in HBase has some ephemeral
node that indicates its liveness (e.g. /hbase/master, /hbase/rs/...), the
node has to completely restart.

HBASE-1232, HBASE-1311, and HBASE-1312 are all the same problem, just with
three different points of view (client, RS, master).
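
To illustrate why an expired session forces the server to re-register, here is a small in-memory model in Python. It is not the ZooKeeper API: `FakeCoordinator`, `Server`, and `on_session_expired` are hypothetical stand-ins that only mimic the behaviour described above, namely that expiry kills the handle and removes the session's ephemeral nodes.

```python
class FakeCoordinator:
    """In-memory stand-in for a coordination service (not the ZooKeeper API)."""

    def __init__(self):
        self._next_session_id = 1
        self._ephemerals = {}  # path -> id of the session that owns it

    def open_session(self):
        session_id = self._next_session_id
        self._next_session_id += 1
        return session_id

    def create_ephemeral(self, session_id, path):
        self._ephemerals[path] = session_id

    def expire(self, session_id):
        # Expiry removes every ephemeral node owned by the expired session.
        self._ephemerals = {
            path: owner
            for path, owner in self._ephemerals.items()
            if owner != session_id
        }

    def exists(self, path):
        return path in self._ephemerals


class Server:
    """A server that advertises liveness through one ephemeral node."""

    def __init__(self, coordinator, path):
        self.coordinator = coordinator
        self.path = path
        self.session_id = None
        self.register()

    def register(self):
        # Open a fresh session and recreate the liveness node under it.
        self.session_id = self.coordinator.open_session()
        self.coordinator.create_ephemeral(self.session_id, self.path)

    def on_session_expired(self):
        # The old handle is dead and cannot be reused, so start over.
        self.register()
```

In this model, a master that ignores the expiry keeps running while `exists("/hbase/master")` is False, which is the symptom reported in this issue. Calling `on_session_expired()` restores the node under a new session.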


Re: [jira] Created: (HBASE-1312) ZooKeeper: Master's ephemeral node went away while it was still up and functioning normally

Posted by Ryan Rawson <ry...@gmail.com>.
ZK keeps the node up as long as the session is still valid.

So the question is:
- did the master not respond correctly to an expired session?
- is there a ZK bug (HOPE NOT!)

-ryan


[jira] Resolved: (HBASE-1312) ZooKeeper: Master's ephemeral node went away while it was still up and functioning normally

Posted by "Andrew Purtell (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-1312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Purtell resolved HBASE-1312.
-----------------------------------

       Resolution: Invalid
    Fix Version/s:     (was: 0.20.0)

It's fine to close this as invalid. I agree there is no contribution here beyond 1311.



[jira] Assigned: (HBASE-1312) ZooKeeper: Master's ephemeral node went away while it was still up and functioning normally

Posted by "Nitay Joffe (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-1312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nitay Joffe reassigned HBASE-1312:
----------------------------------

    Assignee: Nitay Joffe



[jira] Updated: (HBASE-1312) ZooKeeper: Master's ephemeral node went away while it was still up and functioning normally

Posted by "Nitay Joffe (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HBASE-1312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nitay Joffe updated HBASE-1312:
-------------------------------

    Fix Version/s: 0.20.0

Master does not watch its own ZNode. Andrew, what is this JIRA describing beyond HBASE-1311?
