You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Adar Dembo (Code Review)" <ge...@cloudera.org> on 2019/05/13 21:56:10 UTC

[kudu-CR] system ntp: wait for sync via kernel and shorten wait time

Hello Will Berkeley,

I'd like you to do a code review. Please visit

    http://gerrit.cloudera.org:8080/13323

to review the following change.


Change subject: system_ntp: wait for sync via kernel and shorten wait time
......................................................................

system_ntp: wait for sync via kernel and shorten wait time

We've been having more and more trouble with dist-test slave clocks becoming
desynchronized. Because slaves run inside docker containers, we can't use
ntpd/chrony-based tools such as ntp-wait to wait for synchronization.
Moreover, this highlights a general issue with the existing wait-for-sync
approach: we can't differentiate between environments where ntpd isn't
running but should be, and environments where it's not but synchronization
is still expected.

This patch does two things:
1. Switches from ntp-wait and friends to ntp_adjtime-based waiting.
2. Cuts the wait time from 60s to 30s. This is an overture for users running
   in the first environment, who would prefer not to wait a full minute to
   know that their ntpd isn't running.

I also looked at the source code for ntp-wait and chronyc, but didn't see
them doing anything fundamentally more interesting than what we're now
doing with ntp_adjtime.

Change-Id: Ic133c4f9b5fd933216fb27f9f01396a5fee8276b
---
M src/kudu/clock/system_ntp.cc
1 file changed, 19 insertions(+), 28 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/23/13323/1
-- 
To view, visit http://gerrit.cloudera.org:8080/13323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: Ic133c4f9b5fd933216fb27f9f01396a5fee8276b
Gerrit-Change-Number: 13323
Gerrit-PatchSet: 1
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>

[kudu-CR] system ntp: wait for sync via kernel

Posted by "Will Berkeley (Code Review)" <ge...@cloudera.org>.
Will Berkeley has posted comments on this change. ( http://gerrit.cloudera.org:8080/13323 )

Change subject: system_ntp: wait for sync via kernel
......................................................................


Patch Set 2: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/13323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic133c4f9b5fd933216fb27f9f01396a5fee8276b
Gerrit-Change-Number: 13323
Gerrit-PatchSet: 2
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-Comment-Date: Tue, 14 May 2019 20:58:39 +0000
Gerrit-HasComments: No

[kudu-CR] system ntp: wait for sync via kernel and shorten wait time

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/13323 )

Change subject: system_ntp: wait for sync via kernel and shorten wait time
......................................................................


Patch Set 1:

One of the Java tests failed with:

  22:12:12.416 [INFO - cluster stderr printer] (MiniKuduCluster.java:543) I0513 
  22:12:12.415963 12376 system_ntp.cc:126] Waiting up to --ntp_initial_sync_wait_secs=30 seconds for the clock to synchronize
  22:12:42.370 [INFO - cluster stderr printer] (MiniKuduCluster.java:543) W0513 
  22:12:42.370260 12369 subprocess.cc:282] Child process 12376 (kudu-master --fs_wal_dir=/tmp/dist-test-taskQk3Nj3/test-tmp/mini-kudu-cluster1920508555886468812/master-0/wal --fs_data_dirs=/tmp/dist-test-taskQk3Nj3/test-tmp/mini-kudu-cluster1920508555886468812/master-0/data --block_manager=log --webserver_interface=localhost --ipki_ca_key_size=1024 --tsk_num_rsa_bits=512 --rpc_bind_addresses=127.12.20.126:43729 --webserver_interface=127.12.20.126 --webserver_port=0 --never_fsync --ipki_server_key_size=1024 --enable_minidumps=false --redact=none --metrics_log_interval_ms=1000 --logtostderr --logbuflevel=-1 --log_dir=/tmp/dist-test-taskQk3Nj3/test-tmp/mini-kudu-cluster1920508555886468812/master-0/logs --server_dump_info_path=/tmp/dist-test-taskQk3Nj3/test-tmp/mini-kudu-cluster1920508555886468812/master-0/data/info.pb --server_dump_info_format=pb --rpc_server_allow_ephemeral_ports --unlock_experimental_flags --unlock_unsafe_flags --rpc_reuseport=true --master_addresses=127.12.20.126:43729,127.12.20.125:38023,127.12.20.124:41667) was orphaned. Sending signal 9...
  22:12:42.374 [DEBUG - main] (MiniKuduCluster.java:165) Response: error {
  code: TIMED_OUT
  message: "failed to start masters: Unable to start Master at index 0: Timed out after 30.000ss waiting for process (/tmp/dist-test taskQk3Nj3/build/debug/bin/kudu-master) to write info file (/tmp/dist-test-taskQk3Nj3/test-tmp/mini-kudu-cluster1920508555886468812/master-0/data/info.pb)"
}

So 30s wasn't enough to ride over the desynchronization. I'll switch back to 60s.


-- 
To view, visit http://gerrit.cloudera.org:8080/13323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic133c4f9b5fd933216fb27f9f01396a5fee8276b
Gerrit-Change-Number: 13323
Gerrit-PatchSet: 1
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-Comment-Date: Mon, 13 May 2019 22:34:52 +0000
Gerrit-HasComments: No

[kudu-CR] system ntp: wait for sync via kernel

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/13323 )

Change subject: system_ntp: wait for sync via kernel
......................................................................


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/13323/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/13323/1//COMMIT_MSG@21
PS1, Line 21: insufficient to ride out dist-test clock desynchronization.
> That seems like a reasonable approach to me. I have seen cases where 30s is
As another data point, since this conversation I saw (in precommit) a failure even after waiting for 30s, so I had to bump it back up to 60s anyway. We'll see whether that's truly enough once this gets merged.



-- 
To view, visit http://gerrit.cloudera.org:8080/13323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic133c4f9b5fd933216fb27f9f01396a5fee8276b
Gerrit-Change-Number: 13323
Gerrit-PatchSet: 2
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-Comment-Date: Tue, 14 May 2019 20:42:24 +0000
Gerrit-HasComments: Yes

[kudu-CR] system ntp: wait for sync via kernel

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Hello Will Berkeley, Kudu Jenkins, Todd Lipcon, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/13323

to look at the new patch set (#2).

Change subject: system_ntp: wait for sync via kernel
......................................................................

system_ntp: wait for sync via kernel

We've been having more and more trouble with dist-test slave clocks becoming
desynchronized. Because slaves run inside docker containers, we can't use
ntpd/chrony-based tools such as ntp-wait to wait for synchronization.
Moreover, this highlights a general issue with the existing wait-for-sync
approach: we can't differentiate between environments where ntpd isn't
running but should be, and environments where it's not but synchronization
is still expected.

This patch switches from ntp-wait and friends to ntp_adjtime-based waiting.
Originally it also cut the wait time from 60s to 30s (as an overture for
users running in the first environment who would prefer not to wait a full
minute to know that their ntpd isn't running), but this proved to be
insufficient to ride out dist-test clock desynchronization.

I also looked at the source code for ntp-wait and chronyc, but didn't see
them doing anything fundamentally more interesting than what we're now
doing with ntp_adjtime.

Change-Id: Ic133c4f9b5fd933216fb27f9f01396a5fee8276b
---
M src/kudu/clock/system_ntp.cc
1 file changed, 18 insertions(+), 27 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/23/13323/2
-- 
To view, visit http://gerrit.cloudera.org:8080/13323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic133c4f9b5fd933216fb27f9f01396a5fee8276b
Gerrit-Change-Number: 13323
Gerrit-PatchSet: 2
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>

[kudu-CR] system ntp: wait for sync via kernel

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/13323 )

Change subject: system_ntp: wait for sync via kernel
......................................................................

system_ntp: wait for sync via kernel

We've been having more and more trouble with dist-test slave clocks becoming
desynchronized. Because slaves run inside docker containers, we can't use
ntpd/chrony-based tools such as ntp-wait to wait for synchronization.
Moreover, this highlights a general issue with the existing wait-for-sync
approach: we can't differentiate between environments where ntpd isn't
running but should be, and environments where it's not but synchronization
is still expected.

This patch switches from ntp-wait and friends to ntp_adjtime-based waiting.
Originally it also cut the wait time from 60s to 30s (as an overture for
users running in the first environment who would prefer not to wait a full
minute to know that their ntpd isn't running), but this proved to be
insufficient to ride out dist-test clock desynchronization.

I also looked at the source code for ntp-wait and chronyc, but didn't see
them doing anything fundamentally more interesting than what we're now
doing with ntp_adjtime.

Change-Id: Ic133c4f9b5fd933216fb27f9f01396a5fee8276b
Reviewed-on: http://gerrit.cloudera.org:8080/13323
Tested-by: Kudu Jenkins
Reviewed-by: Will Berkeley <wd...@gmail.com>
---
M src/kudu/clock/system_ntp.cc
1 file changed, 18 insertions(+), 27 deletions(-)

Approvals:
  Kudu Jenkins: Verified
  Will Berkeley: Looks good to me, approved

-- 
To view, visit http://gerrit.cloudera.org:8080/13323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Ic133c4f9b5fd933216fb27f9f01396a5fee8276b
Gerrit-Change-Number: 13323
Gerrit-PatchSet: 3
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>

[kudu-CR] system ntp: wait for sync via kernel and shorten wait time

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/13323 )

Change subject: system_ntp: wait for sync via kernel and shorten wait time
......................................................................


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/13323/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/13323/1//COMMIT_MSG@15
PS1, Line 15: is still expected.
this can't be determined from ntp_adjtime output in any way?


http://gerrit.cloudera.org:8080/#/c/13323/1//COMMIT_MSG@21
PS1, Line 21:    know that their ntpd isn't running.
Could we do something like: _if_ we see NTP is running, then wait 60s (or longer). If we don't see that NTP is running, wait 30 seconds (hoping that maybe we're inside a container, and NTP might be running even though we can't see it)?



-- 
To view, visit http://gerrit.cloudera.org:8080/13323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic133c4f9b5fd933216fb27f9f01396a5fee8276b
Gerrit-Change-Number: 13323
Gerrit-PatchSet: 1
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-Comment-Date: Mon, 13 May 2019 21:59:53 +0000
Gerrit-HasComments: Yes

[kudu-CR] system ntp: wait for sync via kernel and shorten wait time

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/13323 )

Change subject: system_ntp: wait for sync via kernel and shorten wait time
......................................................................


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/13323/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/13323/1//COMMIT_MSG@15
PS1, Line 15: is still expected.
> this can't be determined from ntp_adjtime output in any way?
Here's the output from ntpdate run by a dist-test slave:

ntp_gettime() returns code 5 (ERROR)
  time e06f6f8e.2852d000  Sun, Apr 28 2019  0:17:18.157, (.157514),
  maximum error 16000000 us, estimated error 16000000 us, TAI offset 0
ntp_adjtime() returns code 5 (ERROR)
  modes 0x0 (),
  offset 0.000 us, frequency 500.000 ppm, interval 1 s,
  maximum error 16000000 us, estimated error 16000000 us,
  status 0x41 (PLL,UNSYNC),
  time constant 7, precision 1.000 us, tolerance 500 ppm,

And here's the output from my laptop which is using systemd-timesyncd for synchronization:

ntp_gettime() returns code 0 (OK)
  time e0846bcd.73d86a90  Mon, May 13 2019 15:18:53.452, (.452521241),
  maximum error 929000 us, estimated error 0 us, TAI offset 0
ntp_adjtime() returns code 0 (OK)
  modes 0x0 (),
  offset -571.086 us, frequency -8.080 ppm, interval 1 s,
  maximum error 929000 us, estimated error 0 us,
  status 0x2001 (PLL,NANO),
  time constant 7, precision 0.001 us, tolerance 500 ppm,

And the output from an el6.6 machine running ntpd:

ntp_gettime() returns code 0 (OK)
  time e0846bd1.70e47538  Mon, May 13 2019 15:18:57.440, (.440986597),
  maximum error 1175929 us, estimated error 681 us, TAI offset 0
ntp_adjtime() returns code 0 (OK)
  modes 0x0 (),
  offset 1489.979 us, frequency 6.703 ppm, interval 1 s,
  maximum error 1175929 us, estimated error 681 us,
  status 0x6001 (PLL,NANO,MODE),
  time constant 10, precision 0.001 us, tolerance 500 ppm,

The only tangible difference I see is the status field, which is FLL mode for the first two cases and PLL mode for ntpd. But that's not particularly helpful given that we do want this to work across a variety of time sync daemons.


http://gerrit.cloudera.org:8080/#/c/13323/1//COMMIT_MSG@21
PS1, Line 21:    know that their ntpd isn't running.
> Could we do something like: _if_ we see NTP is running, then wait 60s (or l
I had considered something like that: start with ntp-wait/chronyc with the usual 60s, and if that fails but there's still time left, wait out the rest manually via looping on ntp_adjtime. Still, the only gain in doing that is the separation of these two waiting budgets, and there's not a huge difference between 30s and 60s. Do you think it's worth doing?



-- 
To view, visit http://gerrit.cloudera.org:8080/13323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic133c4f9b5fd933216fb27f9f01396a5fee8276b
Gerrit-Change-Number: 13323
Gerrit-PatchSet: 1
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-Comment-Date: Mon, 13 May 2019 22:31:27 +0000
Gerrit-HasComments: Yes

[kudu-CR] system ntp: wait for sync via kernel

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/13323 )

Change subject: system_ntp: wait for sync via kernel
......................................................................


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/13323/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/13323/1//COMMIT_MSG@21
PS1, Line 21: insufficient to ride out dist-test clock desynchronization.
> I had considered something like that: start with ntp-wait/chronyc with the 
That seems like a reasonable approach to me. I have seen cases where 30s is not enough to come to sync at startup, and maybe one minute would be enough. Not a huge deal, though, so if it's a pain, don't bother.



-- 
To view, visit http://gerrit.cloudera.org:8080/13323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic133c4f9b5fd933216fb27f9f01396a5fee8276b
Gerrit-Change-Number: 13323
Gerrit-PatchSet: 2
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-Comment-Date: Tue, 14 May 2019 20:39:59 +0000
Gerrit-HasComments: Yes

[kudu-CR] system ntp: wait for sync via kernel

Posted by "Will Berkeley (Code Review)" <ge...@cloudera.org>.
Will Berkeley has posted comments on this change. ( http://gerrit.cloudera.org:8080/13323 )

Change subject: system_ntp: wait for sync via kernel
......................................................................


Patch Set 2: Code-Review+1


-- 
To view, visit http://gerrit.cloudera.org:8080/13323
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic133c4f9b5fd933216fb27f9f01396a5fee8276b
Gerrit-Change-Number: 13323
Gerrit-PatchSet: 2
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-Comment-Date: Tue, 14 May 2019 20:44:15 +0000
Gerrit-HasComments: No