You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@kudu.apache.org by Franco Venturi <fv...@comcast.net> on 2017/11/01 02:12:55 UTC

Error message: 'Tried to update clock beyond the max. error.'

A few days ago at work our Kudu servers started having fatal errors and shutting down with the following error message:

Couldn't get the current time: Clock unsynchronized. Status: Service unavailable: Error: Clock synchronized but error wastoo high (10000016 us).

After some research in the community forums, I found a post by Todd that pointed to this JIRA issue: https://issues.apache.org/jira/browse/KUDU-2079

I then checked our ntpd configuration and sure enough we had the '-x' option in the daemon command, so I went ahead, removed that option, restarted ntpd, and a few minutes later I restarted all the Kudu processes (one master and three tablet servers).
A few minutes later a couple of those Kudu processes were down again, this time with this new time sync related error message:

Tried to update clock beyond the max. error.

To try to address this new error, I brought down all the Kudu processes, stopped ntpd, resync'd the time on all the servers with ntpdate, brought ntpd back up, waited a bit, and restarted Kudu (master and tablet servers). A few minutes or less later a couple of them were down again with the same 'Tried to update clock beyond the max. error.'

I eventually ended up doubling the parameter 'max_clock_sync_error_usec' to 20,000,000 (20 seconds) and everything stayed up (and is still up).

Looking at the source code in git, I found the relevant section here (source file https://github.com/apache/kudu/blob/master/src/kudu/clock/hybrid_clock.cc):

// we won't update our clock if to_update is more than 'max_clock_sync_error_usec'
// into the future as it might have been corrupted or originated from an out-of-sync
// server.
if ((to_update_physical - now_physical) > FLAGS_max_clock_sync_error_usec) {
return Status::InvalidArgument("Tried to update clock beyond the max. error.");
}

If I understand this code correctly, it is complaining because for some reason Kudu is trying to update its clock by more than 10 seconds - however I ran ntptime and several ntpq queries, and I don't see the time between the servers being off by that much (or even by say half a second, since they are all synchronized with a stratum 3 NTP server).

Has anyone in this group seen anything similar or does anyone have a better understanding of what this message means and what could be causing it?

Thanks,
Franco

Re: Error message: 'Tried to update clock beyond the max. error.'

Posted by Franco Venturi <fv...@comcast.net>.

Thanks Todd. 
Much appreciated. 

Franco 

----- Original Message -----

From: "Todd Lipcon" <to...@cloudera.com> 
To: user@kudu.apache.org 
Sent: Wednesday, November 1, 2017 9:46:15 PM 
Subject: Re: Error message: 'Tried to update clock beyond the max. error.' 

Thanks Franco. I filed https://issues.apache.org/jira/browse/KUDU-2209 and put up a patch. 

I'm also going to work on a change to try to allow Kudu to ride over brief interruptions in ntp synchronization status. Hopefully this will help folks who have some issues with occasional ntp instability. 

-Todd 

On Wed, Nov 1, 2017 at 6:31 PM, Franco Venturi < fventuri@comcast.net > wrote: 

From 'tablet_bootsratp.cc': 

1030 14:29:37.324306 60682 tablet_bootstrap.cc:884] Check failed: _s.ok() Bad status: Invalid argument: Tried to update clock beyond the max. error. 

Franco 

From: "Todd Lipcon" < todd@cloudera.com > 
To: user@kudu.apache.org 
Sent: Wednesday, November 1, 2017 8:00:09 PM 
Subject: Re: Error message: 'Tried to update clock beyond the max. error.' 

What's the full log line where you're seeing this crash? Is it coming from tablet_bootstrap.cc, raft_consensus.cc, or elsewhere? 

-Todd 

2017-11-01 15:45 GMT-07:00 Franco Venturi < fventuri@comcast.net > : 

<blockquote>

Our version is kudu 1.5.0-cdh5.13.0. 

Franco 

<blockquote>

</blockquote>

-- 
Todd Lipcon 
Software Engineer, Cloudera 

</blockquote>

-- 
Todd Lipcon 
Software Engineer, Cloudera

Re: Error message: 'Tried to update clock beyond the max. error.'

Posted by Todd Lipcon <to...@cloudera.com>.

Thanks Franco. I filed https://issues.apache.org/jira/browse/KUDU-2209 and
put up a patch.

I'm also going to work on a change to try to allow Kudu to ride over brief
interruptions in ntp synchronization status. Hopefully this will help folks
who have some issues with occasional ntp instability.

-Todd

On Wed, Nov 1, 2017 at 6:31 PM, Franco Venturi <fv...@comcast.net> wrote:

>
> From 'tablet_bootsratp.cc':
>
>         1030 14:29:37.324306 60682 tablet_bootstrap.cc:884] Check failed:
> _s.ok() Bad status: Invalid argument: Tried to update clock beyond the max.
> error.
>
> Franco
>
>
> ------------------------------
> *From: *"Todd Lipcon" <to...@cloudera.com>
> *To: *user@kudu.apache.org
> *Sent: *Wednesday, November 1, 2017 8:00:09 PM
> *Subject: *Re: Error message: 'Tried to update clock beyond the max.
> error.'
>
>
> What's the full log line where you're seeing this crash? Is it coming from
> tablet_bootstrap.cc, raft_consensus.cc, or elsewhere?
>
> -Todd
>
> 2017-11-01 15:45 GMT-07:00 Franco Venturi <fv...@comcast.net>:
>
>> Our version is kudu 1.5.0-cdh5.13.0.
>>
>> Franco
>>
>>
>>
>>
>>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
>

-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Error message: 'Tried to update clock beyond the max. error.'

Posted by Franco Venturi <fv...@comcast.net>.

From 'tablet_bootsratp.cc': 



1030 14:29:37.324306 60682 tablet_bootstrap.cc:884] Check failed: _s.ok() Bad status: Invalid argument: Tried to update clock beyond the max. error. 

Franco 

----- Original Message -----

From: "Todd Lipcon" <to...@cloudera.com> 
To: user@kudu.apache.org 
Sent: Wednesday, November 1, 2017 8:00:09 PM 
Subject: Re: Error message: 'Tried to update clock beyond the max. error.' 

What's the full log line where you're seeing this crash? Is it coming from tablet_bootstrap.cc, raft_consensus.cc, or elsewhere? 

-Todd 

2017-11-01 15:45 GMT-07:00 Franco Venturi < fventuri@comcast.net > : 



Our version is kudu 1.5.0-cdh5.13.0. 

Franco 



<blockquote>







</blockquote>




-- 
Todd Lipcon 
Software Engineer, Cloudera

Re: Error message: 'Tried to update clock beyond the max. error.'

Posted by Todd Lipcon <to...@cloudera.com>.

Actually I think I understand the root cause of this. I think at some point
NTP can switch the clock from a microseconds-based mode to a
nanoseconds-based mode, at which point Kudu starts interpreting the results
of the ntp_gettime system call incorrectly, resulting in incorrect error
estimates and even time values up to 1000 seconds in the future (we read 1
billion nanoseconds as 1 billion microseconds (=1000 seconds)). I'll work
on reproducing this and a patch, to backport to previous versions.

-Todd

On Wed, Nov 1, 2017 at 5:00 PM, Todd Lipcon <to...@cloudera.com> wrote:

> What's the full log line where you're seeing this crash? Is it coming from
> tablet_bootstrap.cc, raft_consensus.cc, or elsewhere?
>
> -Todd
>
> 2017-11-01 15:45 GMT-07:00 Franco Venturi <fv...@comcast.net>:
>
>> Our version is kudu 1.5.0-cdh5.13.0.
>>
>> Franco
>>
>>
>>
>>
>>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>

-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Error message: 'Tried to update clock beyond the max. error.'

Posted by Todd Lipcon <to...@cloudera.com>.

What's the full log line where you're seeing this crash? Is it coming from
tablet_bootstrap.cc, raft_consensus.cc, or elsewhere?

-Todd

2017-11-01 15:45 GMT-07:00 Franco Venturi <fv...@comcast.net>:

> Our version is kudu 1.5.0-cdh5.13.0.
>
> Franco
>
>
>
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Error message: 'Tried to update clock beyond the max. error.'

Posted by Franco Venturi <fv...@comcast.net>.

Our version is kudu 1.5.0-cdh5.13.0. 

Franco

Re: Error message: 'Tried to update clock beyond the max. error.'

Posted by Zhen Zhang <zh...@gmail.com>.

Which version of Kudu do you use? I also encountered this error a few days
ago using Kudu1.3.0: "Tried to update clock beyond the max. error.", after
restarting the cluster, everything goes normal. I checked the dmesg and
asked SRE to check the ntp service, everything looks normal. Still have no
idea what cause that error.

2017-11-01 10:12 GMT+08:00 Franco Venturi <fv...@comcast.net>:

> A few days ago at work our Kudu servers started having fatal errors and
> shutting down with the following error message:
>
>
>      Couldn't get the current time: Clock unsynchronized. Status: Service
> unavailable: Error: Clock synchronized but error wastoo high (10000016 us).
>
>
> After some research in the community forums, I found a post by Todd that
> pointed to this JIRA issue: https://issues.apache.org/
> jira/browse/KUDU-2079
>
> I then checked our ntpd configuration and sure enough we had the '-x'
> option in the daemon command, so I went ahead, removed that option,
> restarted ntpd, and a few minutes later I restarted all the Kudu processes
> (one master and three tablet servers).
> A few minutes later a couple of those Kudu processes were down again, this
> time with this new time sync related error message:
>
>
>      Tried to update clock beyond the max. error.
>
>
> To try to address this new error, I brought down all the Kudu processes,
> stopped ntpd, resync'd the time on all the servers with ntpdate, brought
> ntpd back up, waited a bit, and restarted Kudu (master and tablet servers).
> A few minutes or less later a couple of them were down again with the same
> 'Tried to update clock beyond the max. error.'
>
>
> I eventually ended up doubling the parameter 'max_clock_sync_error_usec'
> to 20,000,000 (20 seconds) and everything stayed up (and is still up).
>
>
> Looking at the source code in git, I found the relevant section here
> (source file https://github.com/apache/kudu/blob/master/src/kudu/
> clock/hybrid_clock.cc):
>
>
>      // we won't update our clock if to_update is more than
> 'max_clock_sync_error_usec'
>      // into the future as it might have been corrupted or originated from
> an out-of-sync
>      // server.
>      if ((to_update_physical - now_physical) > FLAGS_max_clock_sync_error_usec)
> {
>        return Status::InvalidArgument("Tried to update clock beyond the
> max. error.");
>      }
>
>
> If I understand this code correctly, it is complaining because for some
> reason Kudu is trying to update its clock by more than 10 seconds - however
> I ran ntptime and several ntpq queries, and I don't see the time between
> the servers being off by that much (or even by say half a second, since
> they are all synchronized with a stratum 3 NTP server).
>
>
> Has anyone in this group seen anything similar or does anyone have a
> better understanding of what this message means and what could be causing
> it?
>
>
> Thanks,
> Franco
>