You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Alexey Serbin (Code Review)" <ge...@cloudera.org> on 2023/04/03 16:41:05 UTC

[kudu-CR](branch-1.17.x) [clock] add sanity check to detect wall clock jumps

Alexey Serbin has uploaded this change for review. ( http://gerrit.cloudera.org:8080/19681


Change subject: [clock] add sanity check to detect wall clock jumps
......................................................................

[clock] add sanity check to detect wall clock jumps

There was a case when a timestamp read from system/local clock using
the ntp_adjtime() call jumped 40+ years ahead when running kudu-tserver
on an Azure VM, while ntp_adjtime() didn't report an error on clock
being unsynchronized or any other error at all. That came along with
a huge number of kernel messages, and other software (such as the SASL
library used by SSSD) detected the strange jump in the local clock
as well.  My multiple attempts to reproduce the issue on a real
hardware node, Dockerized environment run at a real hardware server
in a datacenter, and GCE & EC2 VMs were not successful.

This patch adds a sanity check to detect such strange jumps in wall
clock readings.  The idea is to rely on the readings from the
CLOCK_MONOTONIC_RAW clock captured along with the wall clock readings.
A jump should manifest itself in a big difference between the wall clock
delta and the corresponding CLOCK_MONOTONIC_RAW delta.  If such a
condition is detected, then HybridClock::NowWithErrorUnlocked() dumps
diagnostic information about clock NTP synchronisation status and
returns Status::ServiceUnavailable() with appropriate error message.

This patch also adds a unit test for the newly added functionality.

As a part of this changelist, the following new flags are introduced:
  * --wall_clock_jump_detection
      This is to control the newly introduced sanity check for readings
      of the wall clock.  Acceptable values are 'auto', 'enabled,
      and 'disabled'.  It is set to 'auto' by default, meaning the
      sanity check for timestamps is enabled if the process detects
      that it's running on a VM in Azure cloud.
  * --wall_clock_jump_threshold_sec
      This is to control the threshold (in seconds) for the difference
      in deltas of the wall clock's and CLOCK_MONOTONIC_RAW clock's
      readings.  It is set to 900 (15 minutes) by default.

The reasoning behind having --wall_clock_jump_detection=auto by
default is to skip an extra check at the majority of nodes out there
since NTP-synchronized system clock isn't supposed to jump that much
at all.  However, since the issue has been observed only at VMs in Azure
cloud, it's enabled automatically at Azure nodes to detect the issue.
If the clock jump went unnoticed, the timestamp would be persisted with
an operation in the WAL and propagated to other replicas.  That could
lead to crashes during tablet bootstrapping and would require manual
intervention to remove the orphaned operations with out-of-whack
timestamps from the WALs of affected tablet replicas.

To assess the performance hit induced by the jump clock detection,
I also added a parameterized HybridClockJumpProtectionTest.BasicPerf
scenario to compare how fast HybridClock::NowWithError() runs
with and without the newly introduced sanity check.

I found that with sanity check enabled the method runs 1.12 times slower
than without the check.  The details are below.

To collect stats on the performance of the HybridClock::NowWithError()
method, I used the command below, running it 10 times for RELEASE and
DEBUG builds to run the new test scenario.  I recorded the timings
output by the test scenarios and computed the average time taken to
invoke HybridClock::NowWithError() 1000000 times.

  KUDU_ALLOW_SLOW_TESTS=1 ./bin/hybrid_clock-test --gtest_filter='*Perf*'

  -------------------------------------------------------------------

  RELEASE build:

  without clock jump detection:
    1000000 iterations in 158.553514 ms
    1000000 iterations in 159.200062 ms
    1000000 iterations in 157.840572 ms
    1000000 iterations in 159.079716 ms
    1000000 iterations in 161.235647 ms

    average: 159.181902 ms

  with clock jump detection:
    1000000 iterations in 178.415508 ms
    1000000 iterations in 178.102677 ms
    1000000 iterations in 177.465617 ms
    1000000 iterations in 180.552167 ms
    1000000 iterations in 179.644688 ms

    average: 178.836131 ms

  -------------------------------------------------------------------

  DEBUG build:

  without clock jump detection:
    1000000 iterations in 592.051281 ms
    1000000 iterations in 591.087500 ms
    1000000 iterations in 597.142767 ms
    1000000 iterations in 591.066335 ms
    1000000 iterations in 590.717687 ms

    average: 592.413114 ms

  with clock jump detection:
    1000000 iterations in 682.691815 ms
    1000000 iterations in 682.087873 ms
    1000000 iterations in 685.602277 ms
    1000000 iterations in 683.057099 ms
    1000000 iterations in 682.095996 ms

    average: 683.107012 ms

  -------------------------------------------------------------------

RELEASE, running under perf stat without clock jump detection:

 Performance counter stats for './bin/hybrid_clock-test --gtest_filter=Perf/HybridClockJumpProtectionTest.BasicPerf/0' (10 runs):

        171.499485 task-clock                #    0.999 CPUs utilized            ( +-  0.18% )
                 3 context-switches          #    0.016 K/sec                    ( +-  4.76% )
                 0 cpu-migrations            #    0.000 K/sec
             2,829 page-faults               #    0.016 M/sec                    ( +-  1.92% )
       562,222,108 cycles                    #    3.278 GHz                      ( +-  0.18% )
       698,278,773 instructions              #    1.24  insns per cycle          ( +-  0.09% )
       122,793,006 branches                  #  715.996 M/sec                    ( +-  0.11% )
           319,525 branch-misses             #    0.26% of all branches          ( +-  0.16% )

       0.171751822 seconds time elapsed                                          ( +-  0.18% )

  -------------------------------------------------------------------

RELEASE, running under perf stat with clock jump detection:

 Performance counter stats for './bin/hybrid_clock-test --gtest_filter=Perf/HybridClockJumpProtectionTest.BasicPerf/1' (10 runs):

        193.189743 task-clock                #    0.999 CPUs utilized            ( +-  0.84% )
                 3 context-switches          #    0.015 K/sec                    ( +-  8.05% )
                 0 cpu-migrations            #    0.000 K/sec
             2,741 page-faults               #    0.014 M/sec                    ( +-  2.35% )
       627,522,941 cycles                    #    3.248 GHz                      ( +-  0.22% )
       797,617,361 instructions              #    1.27  insns per cycle          ( +-  0.13% )
       138,636,629 branches                  #  717.619 M/sec                    ( +-  0.13% )
           339,705 branch-misses             #    0.25% of all branches          ( +-  0.31% )

       0.193459251 seconds time elapsed                                          ( +-  0.84% )

  -------------------------------------------------------------------

Change-Id: I630783653717d975a9b2ad668e8bd47b7796d275
Reviewed-on: http://gerrit.cloudera.org:8080/19473
Tested-by: Alexey Serbin <al...@apache.org>
Reviewed-by: Yingchun Lai <la...@apache.org>
(cherry picked from commit 555854178b9b498701619f4bb0dbbbbeab8e69e7)
---
M src/kudu/clock/hybrid_clock-test.cc
M src/kudu/clock/hybrid_clock.cc
M src/kudu/clock/hybrid_clock.h
M src/kudu/clock/system_ntp.h
M src/kudu/server/server_base.cc
M src/kudu/server/server_base.h
6 files changed, 377 insertions(+), 40 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/81/19681/1
-- 
To view, visit http://gerrit.cloudera.org:8080/19681
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: branch-1.17.x
Gerrit-MessageType: newchange
Gerrit-Change-Id: I630783653717d975a9b2ad668e8bd47b7796d275
Gerrit-Change-Number: 19681
Gerrit-PatchSet: 1
Gerrit-Owner: Alexey Serbin <al...@apache.org>

[kudu-CR](branch-1.17.x) [clock] add sanity check to detect wall clock jumps

Posted by "Yingchun Lai (Code Review)" <ge...@cloudera.org>.
Yingchun Lai has posted comments on this change. ( http://gerrit.cloudera.org:8080/19681 )

Change subject: [clock] add sanity check to detect wall clock jumps
......................................................................


Patch Set 1: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/19681
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: branch-1.17.x
Gerrit-MessageType: comment
Gerrit-Change-Id: I630783653717d975a9b2ad668e8bd47b7796d275
Gerrit-Change-Number: 19681
Gerrit-PatchSet: 1
Gerrit-Owner: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yingchun Lai <la...@apache.org>
Gerrit-Comment-Date: Tue, 04 Apr 2023 03:23:53 +0000
Gerrit-HasComments: No

[kudu-CR](branch-1.17.x) [clock] add sanity check to detect wall clock jumps

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/19681 )

Change subject: [clock] add sanity check to detect wall clock jumps
......................................................................

[clock] add sanity check to detect wall clock jumps

There was a case when a timestamp read from system/local clock using
the ntp_adjtime() call jumped 40+ years ahead when running kudu-tserver
on an Azure VM, while ntp_adjtime() didn't report an error on clock
being unsynchronized or any other error at all. That came along with
a huge number of kernel messages, and other software (such as the SASL
library used by SSSD) detected the strange jump in the local clock
as well.  My multiple attempts to reproduce the issue on a real
hardware node, Dockerized environment run at a real hardware server
in a datacenter, and GCE & EC2 VMs were not successful.

This patch adds a sanity check to detect such strange jumps in wall
clock readings.  The idea is to rely on the readings from the
CLOCK_MONOTONIC_RAW clock captured along with the wall clock readings.
A jump should manifest itself in a big difference between the wall clock
delta and the corresponding CLOCK_MONOTONIC_RAW delta.  If such a
condition is detected, then HybridClock::NowWithErrorUnlocked() dumps
diagnostic information about clock NTP synchronisation status and
returns Status::ServiceUnavailable() with appropriate error message.

This patch also adds a unit test for the newly added functionality.

As a part of this changelist, the following new flags are introduced:
  * --wall_clock_jump_detection
      This is to control the newly introduced sanity check for readings
      of the wall clock.  Acceptable values are 'auto', 'enabled,
      and 'disabled'.  It is set to 'auto' by default, meaning the
      sanity check for timestamps is enabled if the process detects
      that it's running on a VM in Azure cloud.
  * --wall_clock_jump_threshold_sec
      This is to control the threshold (in seconds) for the difference
      in deltas of the wall clock's and CLOCK_MONOTONIC_RAW clock's
      readings.  It is set to 900 (15 minutes) by default.

The reasoning behind having --wall_clock_jump_detection=auto by
default is to skip an extra check at the majority of nodes out there
since NTP-synchronized system clock isn't supposed to jump that much
at all.  However, since the issue has been observed only at VMs in Azure
cloud, it's enabled automatically at Azure nodes to detect the issue.
If the clock jump went unnoticed, the timestamp would be persisted with
an operation in the WAL and propagated to other replicas.  That could
lead to crashes during tablet bootstrapping and would require manual
intervention to remove the orphaned operations with out-of-whack
timestamps from the WALs of affected tablet replicas.

To assess the performance hit induced by the jump clock detection,
I also added a parameterized HybridClockJumpProtectionTest.BasicPerf
scenario to compare how fast HybridClock::NowWithError() runs
with and without the newly introduced sanity check.

I found that with sanity check enabled the method runs 1.12 times slower
than without the check.  The details are below.

To collect stats on the performance of the HybridClock::NowWithError()
method, I used the command below, running it 10 times for RELEASE and
DEBUG builds to run the new test scenario.  I recorded the timings
output by the test scenarios and computed the average time taken to
invoke HybridClock::NowWithError() 1000000 times.

  KUDU_ALLOW_SLOW_TESTS=1 ./bin/hybrid_clock-test --gtest_filter='*Perf*'

  -------------------------------------------------------------------

  RELEASE build:

  without clock jump detection:
    1000000 iterations in 158.553514 ms
    1000000 iterations in 159.200062 ms
    1000000 iterations in 157.840572 ms
    1000000 iterations in 159.079716 ms
    1000000 iterations in 161.235647 ms

    average: 159.181902 ms

  with clock jump detection:
    1000000 iterations in 178.415508 ms
    1000000 iterations in 178.102677 ms
    1000000 iterations in 177.465617 ms
    1000000 iterations in 180.552167 ms
    1000000 iterations in 179.644688 ms

    average: 178.836131 ms

  -------------------------------------------------------------------

  DEBUG build:

  without clock jump detection:
    1000000 iterations in 592.051281 ms
    1000000 iterations in 591.087500 ms
    1000000 iterations in 597.142767 ms
    1000000 iterations in 591.066335 ms
    1000000 iterations in 590.717687 ms

    average: 592.413114 ms

  with clock jump detection:
    1000000 iterations in 682.691815 ms
    1000000 iterations in 682.087873 ms
    1000000 iterations in 685.602277 ms
    1000000 iterations in 683.057099 ms
    1000000 iterations in 682.095996 ms

    average: 683.107012 ms

  -------------------------------------------------------------------

RELEASE, running under perf stat without clock jump detection:

 Performance counter stats for './bin/hybrid_clock-test --gtest_filter=Perf/HybridClockJumpProtectionTest.BasicPerf/0' (10 runs):

        171.499485 task-clock                #    0.999 CPUs utilized            ( +-  0.18% )
                 3 context-switches          #    0.016 K/sec                    ( +-  4.76% )
                 0 cpu-migrations            #    0.000 K/sec
             2,829 page-faults               #    0.016 M/sec                    ( +-  1.92% )
       562,222,108 cycles                    #    3.278 GHz                      ( +-  0.18% )
       698,278,773 instructions              #    1.24  insns per cycle          ( +-  0.09% )
       122,793,006 branches                  #  715.996 M/sec                    ( +-  0.11% )
           319,525 branch-misses             #    0.26% of all branches          ( +-  0.16% )

       0.171751822 seconds time elapsed                                          ( +-  0.18% )

  -------------------------------------------------------------------

RELEASE, running under perf stat with clock jump detection:

 Performance counter stats for './bin/hybrid_clock-test --gtest_filter=Perf/HybridClockJumpProtectionTest.BasicPerf/1' (10 runs):

        193.189743 task-clock                #    0.999 CPUs utilized            ( +-  0.84% )
                 3 context-switches          #    0.015 K/sec                    ( +-  8.05% )
                 0 cpu-migrations            #    0.000 K/sec
             2,741 page-faults               #    0.014 M/sec                    ( +-  2.35% )
       627,522,941 cycles                    #    3.248 GHz                      ( +-  0.22% )
       797,617,361 instructions              #    1.27  insns per cycle          ( +-  0.13% )
       138,636,629 branches                  #  717.619 M/sec                    ( +-  0.13% )
           339,705 branch-misses             #    0.25% of all branches          ( +-  0.31% )

       0.193459251 seconds time elapsed                                          ( +-  0.84% )

  -------------------------------------------------------------------

Change-Id: I630783653717d975a9b2ad668e8bd47b7796d275
Reviewed-on: http://gerrit.cloudera.org:8080/19473
Tested-by: Alexey Serbin <al...@apache.org>
Reviewed-by: Yingchun Lai <la...@apache.org>
(cherry picked from commit 555854178b9b498701619f4bb0dbbbbeab8e69e7)
Reviewed-on: http://gerrit.cloudera.org:8080/19681
Tested-by: Kudu Jenkins
---
M src/kudu/clock/hybrid_clock-test.cc
M src/kudu/clock/hybrid_clock.cc
M src/kudu/clock/hybrid_clock.h
M src/kudu/clock/system_ntp.h
M src/kudu/server/server_base.cc
M src/kudu/server/server_base.h
6 files changed, 377 insertions(+), 40 deletions(-)

Approvals:
  Kudu Jenkins: Verified
  Yingchun Lai: Looks good to me, approved

-- 
To view, visit http://gerrit.cloudera.org:8080/19681
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: branch-1.17.x
Gerrit-MessageType: merged
Gerrit-Change-Id: I630783653717d975a9b2ad668e8bd47b7796d275
Gerrit-Change-Number: 19681
Gerrit-PatchSet: 2
Gerrit-Owner: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yingchun Lai <la...@apache.org>

[kudu-CR](branch-1.17.x) [clock] add sanity check to detect wall clock jumps

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/19681 )

Change subject: [clock] add sanity check to detect wall clock jumps
......................................................................


Patch Set 1:

> Patch Set 1: Code-Review+2

Thank you for fast review, Yingchun!


-- 
To view, visit http://gerrit.cloudera.org:8080/19681
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: branch-1.17.x
Gerrit-MessageType: comment
Gerrit-Change-Id: I630783653717d975a9b2ad668e8bd47b7796d275
Gerrit-Change-Number: 19681
Gerrit-PatchSet: 1
Gerrit-Owner: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yingchun Lai <la...@apache.org>
Gerrit-Comment-Date: Tue, 04 Apr 2023 06:00:15 +0000
Gerrit-HasComments: No