You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@kudu.apache.org by Franco Venturi <fv...@comcast.net> on 2017/11/01 02:12:55 UTC
Error message: 'Tried to update clock beyond the max. error.'
A few days ago at work our Kudu servers started having fatal errors and shutting down with the following error message:
Couldn't get the current time: Clock unsynchronized. Status: Service unavailable: Error: Clock synchronized but error wastoo high (10000016 us).
After some research in the community forums, I found a post by Todd that pointed to this JIRA issue: https://issues.apache.org/jira/browse/KUDU-2079
I then checked our ntpd configuration and sure enough we had the '-x' option in the daemon command, so I went ahead, removed that option, restarted ntpd, and a few minutes later I restarted all the Kudu processes (one master and three tablet servers).
A few minutes later a couple of those Kudu processes were down again, this time with this new time sync related error message:
Tried to update clock beyond the max. error.
To try to address this new error, I brought down all the Kudu processes, stopped ntpd, resync'd the time on all the servers with ntpdate, brought ntpd back up, waited a bit, and restarted Kudu (master and tablet servers). A few minutes or less later a couple of them were down again with the same 'Tried to update clock beyond the max. error.'
I eventually ended up doubling the parameter 'max_clock_sync_error_usec' to 20,000,000 (20 seconds) and everything stayed up (and is still up).
Looking at the source code in git, I found the relevant section here (source file https://github.com/apache/kudu/blob/master/src/kudu/clock/hybrid_clock.cc):
// we won't update our clock if to_update is more than 'max_clock_sync_error_usec'
// into the future as it might have been corrupted or originated from an out-of-sync
// server.
if ((to_update_physical - now_physical) > FLAGS_max_clock_sync_error_usec) {
return Status::InvalidArgument("Tried to update clock beyond the max. error.");
}
If I understand this code correctly, it is complaining because for some reason Kudu is trying to update its clock by more than 10 seconds - however I ran ntptime and several ntpq queries, and I don't see the time between the servers being off by that much (or even by say half a second, since they are all synchronized with a stratum 3 NTP server).
Has anyone in this group seen anything similar or does anyone have a better understanding of what this message means and what could be causing it?
Thanks,
Franco
Re: Error message: 'Tried to update clock beyond the max. error.'
Posted by Franco Venturi <fv...@comcast.net>.
Thanks Todd.
Much appreciated.
Franco
----- Original Message -----
From: "Todd Lipcon" <to...@cloudera.com>
To: user@kudu.apache.org
Sent: Wednesday, November 1, 2017 9:46:15 PM
Subject: Re: Error message: 'Tried to update clock beyond the max. error.'
Thanks Franco. I filed https://issues.apache.org/jira/browse/KUDU-2209 and put up a patch.
I'm also going to work on a change to try to allow Kudu to ride over brief interruptions in ntp synchronization status. Hopefully this will help folks who have some issues with occasional ntp instability.
-Todd
On Wed, Nov 1, 2017 at 6:31 PM, Franco Venturi < fventuri@comcast.net > wrote:
From 'tablet_bootsratp.cc':
1030 14:29:37.324306 60682 tablet_bootstrap.cc:884] Check failed: _s.ok() Bad status: Invalid argument: Tried to update clock beyond the max. error.
Franco
From: "Todd Lipcon" < todd@cloudera.com >
To: user@kudu.apache.org
Sent: Wednesday, November 1, 2017 8:00:09 PM
Subject: Re: Error message: 'Tried to update clock beyond the max. error.'
What's the full log line where you're seeing this crash? Is it coming from tablet_bootstrap.cc, raft_consensus.cc, or elsewhere?
-Todd
2017-11-01 15:45 GMT-07:00 Franco Venturi < fventuri@comcast.net > :
<blockquote>
Our version is kudu 1.5.0-cdh5.13.0.
Franco
<blockquote>
</blockquote>
--
Todd Lipcon
Software Engineer, Cloudera
</blockquote>
--
Todd Lipcon
Software Engineer, Cloudera
Re: Error message: 'Tried to update clock beyond the max. error.'
Posted by Todd Lipcon <to...@cloudera.com>.
Thanks Franco. I filed https://issues.apache.org/jira/browse/KUDU-2209 and
put up a patch.
I'm also going to work on a change to try to allow Kudu to ride over brief
interruptions in ntp synchronization status. Hopefully this will help folks
who have some issues with occasional ntp instability.
-Todd
On Wed, Nov 1, 2017 at 6:31 PM, Franco Venturi <fv...@comcast.net> wrote:
>
> From 'tablet_bootsratp.cc':
>
> 1030 14:29:37.324306 60682 tablet_bootstrap.cc:884] Check failed:
> _s.ok() Bad status: Invalid argument: Tried to update clock beyond the max.
> error.
>
> Franco
>
>
> ------------------------------
> *From: *"Todd Lipcon" <to...@cloudera.com>
> *To: *user@kudu.apache.org
> *Sent: *Wednesday, November 1, 2017 8:00:09 PM
> *Subject: *Re: Error message: 'Tried to update clock beyond the max.
> error.'
>
>
> What's the full log line where you're seeing this crash? Is it coming from
> tablet_bootstrap.cc, raft_consensus.cc, or elsewhere?
>
> -Todd
>
> 2017-11-01 15:45 GMT-07:00 Franco Venturi <fv...@comcast.net>:
>
>> Our version is kudu 1.5.0-cdh5.13.0.
>>
>> Franco
>>
>>
>>
>>
>>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
>
--
Todd Lipcon
Software Engineer, Cloudera
Re: Error message: 'Tried to update clock beyond the max. error.'
Posted by Franco Venturi <fv...@comcast.net>.
From 'tablet_bootsratp.cc':
1030 14:29:37.324306 60682 tablet_bootstrap.cc:884] Check failed: _s.ok() Bad status: Invalid argument: Tried to update clock beyond the max. error.
Franco
----- Original Message -----
From: "Todd Lipcon" <to...@cloudera.com>
To: user@kudu.apache.org
Sent: Wednesday, November 1, 2017 8:00:09 PM
Subject: Re: Error message: 'Tried to update clock beyond the max. error.'
What's the full log line where you're seeing this crash? Is it coming from tablet_bootstrap.cc, raft_consensus.cc, or elsewhere?
-Todd
2017-11-01 15:45 GMT-07:00 Franco Venturi < fventuri@comcast.net > :
Our version is kudu 1.5.0-cdh5.13.0.
Franco
<blockquote>
</blockquote>
--
Todd Lipcon
Software Engineer, Cloudera
Re: Error message: 'Tried to update clock beyond the max. error.'
Posted by Todd Lipcon <to...@cloudera.com>.
Actually I think I understand the root cause of this. I think at some point
NTP can switch the clock from a microseconds-based mode to a
nanoseconds-based mode, at which point Kudu starts interpreting the results
of the ntp_gettime system call incorrectly, resulting in incorrect error
estimates and even time values up to 1000 seconds in the future (we read 1
billion nanoseconds as 1 billion microseconds (=1000 seconds)). I'll work
on reproducing this and a patch, to backport to previous versions.
-Todd
On Wed, Nov 1, 2017 at 5:00 PM, Todd Lipcon <to...@cloudera.com> wrote:
> What's the full log line where you're seeing this crash? Is it coming from
> tablet_bootstrap.cc, raft_consensus.cc, or elsewhere?
>
> -Todd
>
> 2017-11-01 15:45 GMT-07:00 Franco Venturi <fv...@comcast.net>:
>
>> Our version is kudu 1.5.0-cdh5.13.0.
>>
>> Franco
>>
>>
>>
>>
>>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
--
Todd Lipcon
Software Engineer, Cloudera
Re: Error message: 'Tried to update clock beyond the max. error.'
Posted by Todd Lipcon <to...@cloudera.com>.
What's the full log line where you're seeing this crash? Is it coming from
tablet_bootstrap.cc, raft_consensus.cc, or elsewhere?
-Todd
2017-11-01 15:45 GMT-07:00 Franco Venturi <fv...@comcast.net>:
> Our version is kudu 1.5.0-cdh5.13.0.
>
> Franco
>
>
>
>
>
--
Todd Lipcon
Software Engineer, Cloudera
Re: Error message: 'Tried to update clock beyond the max. error.'
Posted by Franco Venturi <fv...@comcast.net>.
Our version is kudu 1.5.0-cdh5.13.0.
Franco
Re: Error message: 'Tried to update clock beyond the max. error.'
Posted by Zhen Zhang <zh...@gmail.com>.
Which version of Kudu do you use? I also encountered this error a few days
ago using Kudu1.3.0: "Tried to update clock beyond the max. error.", after
restarting the cluster, everything goes normal. I checked the dmesg and
asked SRE to check the ntp service, everything looks normal. Still have no
idea what cause that error.
2017-11-01 10:12 GMT+08:00 Franco Venturi <fv...@comcast.net>:
> A few days ago at work our Kudu servers started having fatal errors and
> shutting down with the following error message:
>
>
> Couldn't get the current time: Clock unsynchronized. Status: Service
> unavailable: Error: Clock synchronized but error wastoo high (10000016 us).
>
>
> After some research in the community forums, I found a post by Todd that
> pointed to this JIRA issue: https://issues.apache.org/
> jira/browse/KUDU-2079
>
> I then checked our ntpd configuration and sure enough we had the '-x'
> option in the daemon command, so I went ahead, removed that option,
> restarted ntpd, and a few minutes later I restarted all the Kudu processes
> (one master and three tablet servers).
> A few minutes later a couple of those Kudu processes were down again, this
> time with this new time sync related error message:
>
>
> Tried to update clock beyond the max. error.
>
>
> To try to address this new error, I brought down all the Kudu processes,
> stopped ntpd, resync'd the time on all the servers with ntpdate, brought
> ntpd back up, waited a bit, and restarted Kudu (master and tablet servers).
> A few minutes or less later a couple of them were down again with the same
> 'Tried to update clock beyond the max. error.'
>
>
> I eventually ended up doubling the parameter 'max_clock_sync_error_usec'
> to 20,000,000 (20 seconds) and everything stayed up (and is still up).
>
>
> Looking at the source code in git, I found the relevant section here
> (source file https://github.com/apache/kudu/blob/master/src/kudu/
> clock/hybrid_clock.cc):
>
>
> // we won't update our clock if to_update is more than
> 'max_clock_sync_error_usec'
> // into the future as it might have been corrupted or originated from
> an out-of-sync
> // server.
> if ((to_update_physical - now_physical) > FLAGS_max_clock_sync_error_usec)
> {
> return Status::InvalidArgument("Tried to update clock beyond the
> max. error.");
> }
>
>
> If I understand this code correctly, it is complaining because for some
> reason Kudu is trying to update its clock by more than 10 seconds - however
> I ran ntptime and several ntpq queries, and I don't see the time between
> the servers being off by that much (or even by say half a second, since
> they are all synchronized with a stratum 3 NTP server).
>
>
> Has anyone in this group seen anything similar or does anyone have a
> better understanding of what this message means and what could be causing
> it?
>
>
> Thanks,
> Franco
>