
Posted to user@hbase.apache.org by Jeff Whiting <je...@qualtrics.com> on 2010/10/28 18:56:41 UTC

Sanity date time check when a region server joins the cluster

We recently had a problem where one machine in our cluster had a clock that was 6 hours 
behind the others (NTP was supposed to be set up on that machine but wasn't).  We subsequently 
restarted our cluster and the '-ROOT-' table was assigned to that machine.  When it tried to 
update the value (info:server) recording which server was holding the '.META.' table, the value 
didn't change and stayed set to the previous machine.  I'm pretty sure the cause was that the 
timestamp from the new server was older than the timestamp from the previous server, which 
prevented the value from updating correctly.  Having the incorrect info:server in the ROOT table 
basically made the cluster unusable.
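
The suspected failure mode can be illustrated with a toy model of a versioned cell, where a read returns the value carrying the highest timestamp. This is only a sketch of the idea described above, not HBase's storage code; the class name VersionedCell and the host names are made up for illustration.

```java
import java.util.TreeMap;

/** Toy model of one versioned cell: reads return the value with the highest timestamp. */
final class VersionedCell {
    // timestamp (ms) -> value, sorted so lastEntry() is the newest version
    private final TreeMap<Long, String> versions = new TreeMap<>();

    void put(long timestampMs, String value) {
        versions.put(timestampMs, value);
    }

    /** Returns the value with the largest timestamp, or null if the cell is empty. */
    String get() {
        return versions.isEmpty() ? null : versions.lastEntry().getValue();
    }

    public static void main(String[] args) {
        long now = 1_288_288_601_000L;          // arbitrary wall-clock time in ms
        long sixHoursMs = 6L * 60 * 60 * 1000;

        VersionedCell infoServer = new VersionedCell();
        // Written first, by a server with a correct clock (hypothetical host name).
        infoServer.put(now, "old-host:60020");
        // Written later in real time, but stamped by a clock that is 6 hours behind.
        infoServer.put(now - sixHoursMs, "new-host:60020");

        // The later write is hidden because its timestamp is older.
        System.out.println(infoServer.get());
    }
}
```

In this model the read still returns the old host, because the second write is stored as an older version rather than as the newest one.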

So my question is, would it make sense to have a sanity time check when a region server joins the 
cluster?  Basically, when the region server joins it would send its current time, and the master would 
check that time against its own; if the difference is too large, the master would prevent the 
region server from joining.  I know this is basic server configuration stuff, but because of human 
error these things happen, and they seem able to cause major problems for the cluster if the servers' 
clocks aren't synchronized.
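
A minimal sketch of the comparison proposed here, assuming both times are expressed as milliseconds since the epoch. The class name, method name, and the one-minute threshold are placeholders for illustration, not part of HBase.

```java
/** Sketch of a join-time clock comparison; all names here are hypothetical. */
final class ClockSkewCheck {
    /** True when the two clocks differ by no more than maxSkewMs. */
    static boolean withinSkew(long masterTimeMs, long regionServerTimeMs, long maxSkewMs) {
        return Math.abs(masterTimeMs - regionServerTimeMs) <= maxSkewMs;
    }

    public static void main(String[] args) {
        long masterNow = System.currentTimeMillis();
        long reportedByRs = masterNow - 6L * 60 * 60 * 1000; // a server 6 hours behind
        long maxSkewMs = 60_000L;                             // assumed threshold

        // A 6-hour difference is far outside a 1-minute window, so the join is refused.
        System.out.println(withinSkew(masterNow, reportedByRs, maxSkewMs)); // false
    }
}
```

Because the reported time travels over the network before the master compares it, the measured difference also includes transit delay, so a real threshold needs some slack for that.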

~Jeff

-- 

Jeff Whiting
Qualtrics Senior Software Engineer
jeffw@qualtrics.com

RE: Sanity date time check when a region server joins the cluster

Posted by Jonathan Gray <jg...@facebook.com>.

I continued this discussion on the JIRA:

https://issues.apache.org/jira/browse/HBASE-3168

Please comment over there where we're working on implementing this.

(And I'm planning to run with the threshold at about 5 seconds.)

> -----Original Message-----
> From: M. C. Srivas [mailto:mcsrivas@gmail.com]
> Sent: Sunday, October 31, 2010 11:35 AM
> To: user@hbase.apache.org
> Subject: Re: Sanity date time check when a region server joins the cluster
> 
> How about 5 mins? 1 min is too narrow -- if the clocks are off, NTP will not
> slam clocks around that fast to get into within-one-minute resolution quickly.

Re: Sanity date time check when a region server joins the cluster

Posted by "M. C. Srivas" <mc...@gmail.com>.

How about 5 mins? 1 min is too narrow -- if the clocks are off, NTP will not
slam clocks around that fast to get into within-one-minute resolution
quickly.
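
As a rough illustration of how slowly a gradual correction converges, the sketch below estimates the time to remove an offset by slewing alone. It assumes a slew-only correction at 500 parts per million, the ceiling commonly cited for ntpd; a daemon configured to step the clock would correct it almost immediately, so treat the numbers as an upper-bound scenario rather than a description of any particular setup.

```java
/** Back-of-the-envelope slew-time estimate; the 500 ppm rate is an assumption. */
final class SlewTime {
    /** Seconds needed to remove offsetSeconds when the clock is adjusted at slewPpm parts per million. */
    static double secondsToSlew(double offsetSeconds, double slewPpm) {
        return offsetSeconds / (slewPpm / 1_000_000.0);
    }

    public static void main(String[] args) {
        double ppm = 500.0; // assumed maximum slew rate
        System.out.printf("60 s offset:  %.1f hours%n", secondsToSlew(60, ppm) / 3600.0);
        System.out.printf("300 s offset: %.1f hours%n", secondsToSlew(300, ppm) / 3600.0);
    }
}
```

Under that assumption a one-minute offset takes roughly 33 hours to slew away, which is the kind of delay a narrow threshold would have to tolerate.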


On Fri, Oct 29, 2010 at 11:03 AM, Jean-Daniel Cryans <jd...@apache.org> wrote:

> A minute? Although it could be configurable.
>
> J-D

Re: Sanity date time check when a region server joins the cluster

Posted by Jean-Daniel Cryans <jd...@apache.org>.

A minute? Although it could be configurable.

J-D

On Fri, Oct 29, 2010 at 10:58 AM, Jeff Whiting <je...@qualtrics.com> wrote:
> Created HBASE-3168 for this issue.  It seems pretty straightforward and I
> wouldn't mind tackling this problem.  How much of a skew do we want to allow
> between the RS and the rest of the cluster?
>
> ~Jeff

Re: Sanity date time check when a region server joins the cluster

Posted by Jeff Whiting <je...@qualtrics.com>.

Created HBASE-3168 for this issue.  It seems pretty straightforward and I wouldn't mind tackling 
this problem.  How much skew do we want to allow between the RS and the rest of the cluster?

~Jeff

On 10/28/2010 12:08 PM, Jonathan Gray wrote:
> I was discussing this exact issue this morning.  Ran into a problem where master was timing out a region in transition because the RS was 5 minutes behind the master.
>
> I like the idea of the RS sending its timestamp on startup and if it is outside a certain threshold, the master throws it a ClockOutOfSync-like exception and the RS goes down.
>
> Please do file a jira, Jeff.  Or let me know and I can do it.
>
> JG

-- 
Jeff Whiting
Qualtrics Senior Software Engineer
jeffw@qualtrics.com

RE: Sanity date time check when a region server joins the cluster

Posted by Jonathan Gray <jg...@facebook.com>.

I was discussing this exact issue this morning.  I ran into a problem where the master was timing out a region in transition because the RS was 5 minutes behind the master.

I like the idea of the RS sending its timestamp on startup and, if it is outside a certain threshold, the master throws it a ClockOutOfSync-like exception and the RS goes down.
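
The exception-based flow described above might look roughly like the following. This is a sketch under assumptions: the exception class name, the checkClockSkew method, the host name, and the threshold are all invented for illustration and are not taken from the HBase code base.

```java
/** Sketch of a master-side rejection and region-server-side abort; all names are hypothetical. */
final class ClockOutOfSyncDemo {
    /** Hypothetical exception, named after the "ClockOutOfSync-like" wording in this thread. */
    static final class ClockOutOfSyncException extends Exception {
        ClockOutOfSyncException(String message) {
            super(message);
        }
    }

    /** Master side: reject a region server whose reported time is too far from the master's. */
    static void checkClockSkew(String serverName, long rsTimeMs, long masterTimeMs, long maxSkewMs)
            throws ClockOutOfSyncException {
        long skewMs = Math.abs(masterTimeMs - rsTimeMs);
        if (skewMs > maxSkewMs) {
            throw new ClockOutOfSyncException("Server " + serverName + " rejected: skew of "
                + skewMs + " ms exceeds the allowed " + maxSkewMs + " ms");
        }
    }

    public static void main(String[] args) {
        long masterNow = System.currentTimeMillis();
        long fiveMinutesMs = 5L * 60 * 1000;
        try {
            // A region server 5 minutes behind, checked against an assumed 1-minute limit.
            checkClockSkew("rs1.example.com", masterNow - fiveMinutesMs, masterNow, 60_000L);
            System.out.println("region server accepted");
        } catch (ClockOutOfSyncException e) {
            // Region server side: treat the rejection as fatal and stop.
            System.out.println("aborting startup: " + e.getMessage());
        }
    }
}
```

Throwing a dedicated exception, rather than returning a flag, lets the region server distinguish a clock problem from other startup failures and log a clear reason before going down.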

Please do file a jira, Jeff.  Or let me know and I can do it.

JG

> -----Original Message-----
> From: jdcryans@gmail.com [mailto:jdcryans@gmail.com] On Behalf Of Jean-
> Daniel Cryans
> Sent: Thursday, October 28, 2010 10:00 AM
> To: user@hbase.apache.org
> Subject: Re: Sanity date time check when a region server joins the
> cluster
> 
> That could be done easily when the server checks in by looking at the
> given start code. In ServerManager we already do:
> 
>     HServerInfo info = new HServerInfo(serverInfo);
>     checkIsDead(info.getServerName(), "STARTUP");
>     checkAlreadySameHostPort(info);
>     recordNewServer(info, false, null);
> 
> A new check in there would fit nicely. Can you open a jira Jeff?
> 
> Thx!
> 
> J-D

Re: Sanity date time check when a region server joins the cluster

Posted by Jean-Daniel Cryans <jd...@apache.org>.

That could be done easily when the server checks in by looking at the
given start code. In ServerManager we already do:

    HServerInfo info = new HServerInfo(serverInfo);
    checkIsDead(info.getServerName(), "STARTUP");
    checkAlreadySameHostPort(info);
    recordNewServer(info, false, null);

A new check in there would fit nicely. Can you open a jira, Jeff?

Thx!

J-D
