You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Patrick Freeman (Code Review)" <ge...@cloudera.org> on 2019/01/03 04:06:49 UTC

[kudu-CR] Take absolute value of ntp maxerror to avoid undefined behavior from negative values

Patrick Freeman has uploaded this change for review. ( http://gerrit.cloudera.org:8080/12149


Change subject: Take absolute value of ntp maxerror to avoid undefined behavior from negative values
......................................................................

Take absolute value of ntp maxerror to avoid undefined behavior from negative values

The adjtimex system call returns type signed long for maxerror. If
negative, Kudu will crash due to wrapping to a large uint64 value.

Change-Id: I3fdc5807d227bb3f6f7f039239d1a9755b9a5725
---
M src/kudu/clock/system_ntp.cc
1 file changed, 1 insertion(+), 1 deletion(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/49/12149/1
-- 
To view, visit http://gerrit.cloudera.org:8080/12149
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I3fdc5807d227bb3f6f7f039239d1a9755b9a5725
Gerrit-Change-Number: 12149
Gerrit-PatchSet: 1
Gerrit-Owner: Patrick Freeman <wi...@gmail.com>

[kudu-CR] Take absolute value of ntp maxerror to avoid undefined behavior from negative values

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has abandoned this change. ( http://gerrit.cloudera.org:8080/12149 )

Change subject: Take absolute value of ntp maxerror to avoid undefined behavior from negative values
......................................................................


Abandoned

The issue has been addressed with http://gerrit.cloudera.org:8080/20626 (see https://issues.apache.org/jira/browse/KUDU-3521 for details), so closing/abandoning this request.

Thank you for putting together this patch!  It was useful on the route of eventually addressing the issue.
-- 
To view, visit http://gerrit.cloudera.org:8080/12149
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: abandon
Gerrit-Change-Id: I3fdc5807d227bb3f6f7f039239d1a9755b9a5725
Gerrit-Change-Number: 12149
Gerrit-PatchSet: 1
Gerrit-Owner: Patrick Freeman <wi...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Patrick Freeman <wi...@gmail.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] Take absolute value of ntp maxerror to avoid undefined behavior from negative values

Posted by "Patrick Freeman (Code Review)" <ge...@cloudera.org>.
Patrick Freeman has posted comments on this change. ( http://gerrit.cloudera.org:8080/12149 )

Change subject: Take absolute value of ntp maxerror to avoid undefined behavior from negative values
......................................................................


Patch Set 1:

> > Patch Set 1:
 > >
 > > > (1 comment)
 > >
 > > We have seen this in practice when using PTP instead of NTP.
 > >
 > > We first noticed this behavior in the binary ntptime (/usr/sbin),
 > seemingly when the adjtimex system call returns any negative value
 > for maxerror.
 > > The output from ntptime results in very large maxerror values due
 > to ntptime casting long maxerror to an unsigned long.
 > >
 > > Here is the initial log entry from Kudu crashing:
 > >
 > > F0102 21:07:23.614316 18235 hybrid_clock.cc:340] Check failed:
 > _s.ok() unable to get current time with error bound:
 > > Service unavailable: clock error estimate (18446744073709551608us)
 > too high (clock considered synchronized by the kernel)
 > 
 > Interesting. So, the maxerror here is -8. That's a really really
 > low maxerror. I guess PTP can give you tight synchronization but
 > I'm surprised it's that tighta as a strict max bound.
 > 
 > Are you sure that taking the absolute value is the correct thing to
 > do instead of clamping to some arbitrary minimum like 50us? The
 > fact that maxerror can go negative implies that it might also end
 > up being 0, which is obviously not correct, and I imagine some of
 > our consistency algorithms will produce incorrect results in that
 > case.
 I have not put much thought into a proper fix, taking into account upstream effects.
 > Which PTP implementation are you using? Might be good to look at
 > how it's calculating maxerror and see if this indeed is a bug in
 > its calculations.
We are using PTPd. 

FWIW, I have also created a PR for NTP here https://github.com/ntp-project/ntp/pull/29.
It appears there may be something going on closer to the system level, which would make the assumption
(for lack of a better word) that maxerror will always be greater than 0 sensible.


-- 
To view, visit http://gerrit.cloudera.org:8080/12149
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3fdc5807d227bb3f6f7f039239d1a9755b9a5725
Gerrit-Change-Number: 12149
Gerrit-PatchSet: 1
Gerrit-Owner: Patrick Freeman <wi...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Patrick Freeman <wi...@gmail.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Fri, 04 Jan 2019 01:51:33 +0000
Gerrit-HasComments: No

[kudu-CR] Take absolute value of ntp maxerror to avoid undefined behavior from negative values

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/12149 )

Change subject: Take absolute value of ntp maxerror to avoid undefined behavior from negative values
......................................................................


Patch Set 1:

> Patch Set 1:
> 
> > (1 comment)
> 
> We have seen this in practice when using PTP instead of NTP.
> 
> We first noticed this behavior in the binary ntptime (/usr/sbin), seemingly when the adjtimex system call returns any negative value for maxerror.
> The output from ntptime results in very large maxerror values due to ntptime casting long maxerror to an unsigned long.
> 
> Here is the initial log entry from Kudu crashing:
> 
> F0102 21:07:23.614316 18235 hybrid_clock.cc:340] Check failed: _s.ok() unable to get current time with error bound:
> Service unavailable: clock error estimate (18446744073709551608us) too high (clock considered synchronized by the kernel)

Interesting. So, the maxerror here is -8. That's a really really low maxerror. I guess PTP can give you tight synchronization but I'm surprised it's that tighta as a strict max bound.

Are you sure that taking the absolute value is the correct thing to do instead of clamping to some arbitrary minimum like 50us? The fact that maxerror can go negative implies that it might also end up being 0, which is obviously not correct, and I imagine some of our consistency algorithms will produce incorrect results in that case.

Which PTP implementation are you using? Might be good to look at how it's calculating maxerror and see if this indeed is a bug in its calculations.


-- 
To view, visit http://gerrit.cloudera.org:8080/12149
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3fdc5807d227bb3f6f7f039239d1a9755b9a5725
Gerrit-Change-Number: 12149
Gerrit-PatchSet: 1
Gerrit-Owner: Patrick Freeman <wi...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Patrick Freeman <wi...@gmail.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Thu, 03 Jan 2019 20:42:20 +0000
Gerrit-HasComments: No

[kudu-CR] Take absolute value of ntp maxerror to avoid undefined behavior from negative values

Posted by "Patrick Freeman (Code Review)" <ge...@cloudera.org>.
Patrick Freeman has posted comments on this change. ( http://gerrit.cloudera.org:8080/12149 )

Change subject: Take absolute value of ntp maxerror to avoid undefined behavior from negative values
......................................................................


Patch Set 1:

> (1 comment)

We have seen this in practice when using PTP instead of NTP.

We first noticed this behavior in the binary ntptime (/usr/sbin), seemingly when the adjtimex system call returns any negative value for maxerror.
The output from ntptime results in very large maxerror values due to ntptime casting long maxerror to an unsigned long.

Here is the initial log entry from Kudu crashing:

F0102 21:07:23.614316 18235 hybrid_clock.cc:340] Check failed: _s.ok() unable to get current time with error bound:
Service unavailable: clock error estimate (18446744073709551608us) too high (clock considered synchronized by the kernel)


Some ref links for convenience
https://github.com/ntp-project/ntp/blob/1a399a03e674da08cfce2cdb847bfb65d65df237/util/ntptime.c#L347
https://manpages.ubuntu.com/manpages/xenial/man2/adjtimex.2.html

 > (1 comment)


-- 
To view, visit http://gerrit.cloudera.org:8080/12149
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3fdc5807d227bb3f6f7f039239d1a9755b9a5725
Gerrit-Change-Number: 12149
Gerrit-PatchSet: 1
Gerrit-Owner: Patrick Freeman <wi...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Patrick Freeman <wi...@gmail.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Thu, 03 Jan 2019 05:53:35 +0000
Gerrit-HasComments: No

[kudu-CR] Take absolute value of ntp maxerror to avoid undefined behavior from negative values

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/12149 )

Change subject: Take absolute value of ntp maxerror to avoid undefined behavior from negative values
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/12149/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12149/1//COMMIT_MSG@9
PS1, Line 9: The adjtimex system call returns type signed long for maxerror. If
did you see a case in practice where maxerror returns a negative value? We've had this code around for 4 years or so and running on a bunch of different kernels and NTP implementations and haven't seen this happen yet. Looking at the kernel source, I agree there is nothing that prevents it from getting set to a negative value by another adjtimex call, but I don't see how NTP could calculate a negative value there since it casts to unsigned before setting it in the loopfilter implementation:

  ntv.maxerror = (u_int32)((sys_rootdelay / 2 +
			    sys_rootdisp) * 1e6);



-- 
To view, visit http://gerrit.cloudera.org:8080/12149
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I3fdc5807d227bb3f6f7f039239d1a9755b9a5725
Gerrit-Change-Number: 12149
Gerrit-PatchSet: 1
Gerrit-Owner: Patrick Freeman <wi...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Thu, 03 Jan 2019 04:26:59 +0000
Gerrit-HasComments: Yes