Posted to issues@kudu.apache.org by "Grant Henke (Jira)" <ji...@apache.org> on 2020/06/03 15:55:00 UTC

[jira] [Updated] (KUDU-2988) built-in NTP client: sometimes minichronyd fails to start with address already in use

     [ https://issues.apache.org/jira/browse/KUDU-2988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Henke updated KUDU-2988:
------------------------------
    Component/s: test

> built-in NTP client: sometimes minichronyd fails to start with address already in use
> -------------------------------------------------------------------------------------
>
>                 Key: KUDU-2988
>                 URL: https://issues.apache.org/jira/browse/KUDU-2988
>             Project: Kudu
>          Issue Type: Sub-task
>          Components: clock, ntp-client, test
>    Affects Versions: 1.11.0
>            Reporter: Adar Dembo
>            Priority: Major
>              Labels: clock
>
> From time to time some tests that use the built-in NTP client and MiniChronyd fail. The failure usually looks like this:
> {noformat}
> I1013 22:02:28.429188 23364 hybrid_clock.cc:162] waiting up to --ntp_initial_sync_wait_secs=10 seconds for the clock to synchronize
> I1013 22:02:38.430480 23364 builtin_ntp.cc:552] server 127.16.18.212:42817: addresses=127.16.18.212:42817 current_address=127.16.18.212:42817 i_pkt_total_num=0 i_pkt_valid_num=0 o_pkt_total_num=20 o_pkt_timedout_num=14
> is_synchronized=false
> last_mono=0
> last_wall=0
> last_error=0
> now_mono=264459136130
> F1013 22:02:38.430524 23364 master_main.cc:105] Check failed: _s.ok() Bad status: Service unavailable: Cannot initialize clock: timed out waiting for clock synchronisation: wallclock is not synchronized: no valid NTP responses yet
> *** Check failure stack trace: ***
> {noformat}
> I've also seen one failure like this:
> {noformat}
> [ RUN      ] BlockManagerType/TsRecoveryITest.TestCrashDuringLogReplay/0
> 2019-11-01T00:37:31Z chronyd version 3.5 starting (+CMDMON +NTP +REFCLOCK +RTC -PRIVDROP -SCFILTER -SIGND +ASYNCDNS -SECHASH -IPV6 +DEBUG)
> 2019-11-01T00:37:32Z Disabled control of system clock
> Could not open connection to daemon
> W1101 00:37:32.553658   235 mini_chronyd.cc:189] Time spent starting chronyd: real 1.110s	user 0.000s	sys 0.011s
> 2019-11-01T00:37:32Z chronyd exiting
> /home/jenkins-slave/workspace/kudu-master/2/src/kudu/integration-tests/external_mini_cluster-itest-base.cc:66: Failure
> Failed
> Bad status: Timed out: failed to start NTP server 0: failed to contact chronyd in 1.000s
> /home/jenkins-slave/workspace/kudu-master/2/src/kudu/integration-tests/ts_recovery-itest.cc:428: Failure
> Expected: StartClusterOneTs({ "--fault_crash_during_log_replay=0.05" }) doesn't generate new fatal failures in the current thread.
>   Actual: it does.
> I1101 00:37:32.564697   235 external_mini_cluster-itest-base.cc:80] Found fatal failure
> I1101 00:37:32.566167   235 test_util.cc:136] -----------------------------------------------
> I1101 00:37:32.566222   235 test_util.cc:137] Had fatal failures, leaving test files at /tmp/dist-test-taskBEeRDj/test-tmp/ts_recovery-itest.0.BlockManagerType_TsRecoveryITest.TestCrashDuringLogReplay_0.1572568627695135-235
> [  FAILED  ] BlockManagerType/TsRecoveryITest.TestCrashDuringLogReplay/0, where GetParam() = "file" (1125 ms)
> {noformat}
> In the first case, it's odd that even though MiniChronyd failed to bind its socket, we still convinced ourselves it was up and running. My working theory is that chronyd doesn't exit when the bind fails and keeps listening on the UNIX domain socket for our "server stats" query, which satisfies the liveness check and convinces us that chronyd is alive and well.
> Either way, this is the cause of some test flakiness, so we should look into it.
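[Editor's note] The "address already in use" failure mode described in the title boils down to a port race: if another process grabs the UDP port between the test harness choosing it and chronyd binding it, the bind fails with EADDRINUSE. The sketch below is not Kudu or chronyd code; it is a minimal, self-contained Python illustration of that underlying error, to make the race concrete:

```python
# Hedged sketch (not Kudu code): demonstrates the EADDRINUSE condition that
# the issue title describes. Binding a second UDP socket to a port that is
# already bound fails with "address already in use".
import errno
import socket

def try_bind_udp(port):
    """Try to bind a UDP socket on 127.0.0.1:port.

    Returns the bound socket on success, or None if the port is already
    in use (the condition minichronyd trips over in this issue).
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.bind(("127.0.0.1", port))
        return s
    except OSError as e:
        s.close()
        if e.errno == errno.EADDRINUSE:
            return None
        raise

first = try_bind_udp(0)        # port 0: let the OS pick a free port
port = first.getsockname()[1]  # the port the OS actually assigned
second = try_bind_udp(port)    # binding the same port again fails
assert second is None          # -> EADDRINUSE, mapped to None
first.close()
```

If the working theory above is right, a more robust health check would probe the NTP UDP port itself (e.g. with an NTP mode-3 client request and a short timeout) rather than only the UNIX-domain control socket, so a chronyd whose NTP bind failed is not mistaken for a healthy server.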


