You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Alexey Serbin (Code Review)" <ge...@cloudera.org> on 2023/02/04 06:06:14 UTC

[kudu-CR] [clock] add sanity check to detect wall clock jumps

Alexey Serbin has uploaded this change for review. ( http://gerrit.cloudera.org:8080/19473


Change subject: [clock] add sanity check to detect wall clock jumps
......................................................................

[clock] add sanity check to detect wall clock jumps

There was a case when a timestamp read from system/local clock using
the ntp_adjtime() call jumped 40+ years ahead when running kudu-tserver
on an Azure VM. That came along with a huge number of kernel messages,
and other software (such as the SASL library used by SSSD) detected
the strange jump in the local clock as well.  My multiple attempts to
reproduce the issue on a real hardware node, Dockerized environment
run at a real hardware server in a datacenter, and GCE & EC2 VMs were
not successful.

This patch adds a sanity check to detect such strange jumps in wall
clock readings.  The idea is to rely on the readings from the
CLOCK_MONOTONIC clock captured along with the wall clock readings.
A jump should manifest itself in big difference between the wall clock
delta and the corresponding CLOCK_MONOTONIC delta.  If such a
condition detected, then HybridClock::NowWithErrorUnlocked() dumps
diagnostic information about clock synchronisation status and aborts.

This patch also adds a unit test for the newly added functionality.

As a part of this changelist, the following new flags are introduced:
  * --enable_wall_clock_jump_check
      Whether to enable the newly introduced sanity check for readings
      of the wall clock.  Set to 'false' by default.
  * --wall_clock_jump_threshold_s
      The threshold (in seconds) for the difference in corresponding
      deltas of the wall clock's and CLOCK_MONOTONIC_RAW clock's
      readings.  Set to 900 (15 minutes) by default.

The reasoning behind having --enable_wall_clock_jump_check=false by
default is to skip an extra check at the majority of nodes out there
since NTP-synchronized system clock isn't supposed to jump that much
at all.  However, if a problem is observed at some inadequate VMs such
as ones in Azure Cloud, it's now possible to enable the guardrail
to detect such an issue.  If this goes unnoticed, the timestamp might
be persisted with an operation in the WAL and propagated to other
replicas as an orphaned REPLICATE operation.  That leads to crashes
during tablet bootstrapping, and requires manual intervention to
remove the orphaned opeations with out-of-wack timestamps from the WAL.

Change-Id: I630783653717d975a9b2ad668e8bd47b7796d275
---
M src/kudu/clock/hybrid_clock-test.cc
M src/kudu/clock/hybrid_clock.cc
M src/kudu/clock/hybrid_clock.h
M src/kudu/clock/system_ntp.h
4 files changed, 132 insertions(+), 11 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/73/19473/1
-- 
To view, visit http://gerrit.cloudera.org:8080/19473
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I630783653717d975a9b2ad668e8bd47b7796d275
Gerrit-Change-Number: 19473
Gerrit-PatchSet: 1
Gerrit-Owner: Alexey Serbin <al...@apache.org>

[kudu-CR] [clock] add sanity check to detect wall clock jumps

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has removed a vote on this change.

Change subject: [clock] add sanity check to detect wall clock jumps
......................................................................


Removed Verified-1 by Kudu Jenkins (120)
-- 
To view, visit http://gerrit.cloudera.org:8080/19473
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: deleteVote
Gerrit-Change-Id: I630783653717d975a9b2ad668e8bd47b7796d275
Gerrit-Change-Number: 19473
Gerrit-PatchSet: 8
Gerrit-Owner: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Ashwani Raina <ar...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Yingchun Lai <la...@apache.org>
Gerrit-Reviewer: Yuqi Du <sh...@gmail.com>
Gerrit-Reviewer: Zoltan Chovan <zc...@cloudera.com>

[kudu-CR] [clock] add sanity check to detect wall clock jumps

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/19473 )

Change subject: [clock] add sanity check to detect wall clock jumps
......................................................................


Patch Set 2:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/19473/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/19473/2//COMMIT_MSG@10
PS2, Line 10: the ntp_adjtime() call jumped 40+ years ahead when running kudu-tserver
> Sorry, reply the message too late. 
It's not a problem.  Feel free to post an update whenever you have a chance to capture more details on the case observed in your cluster.

Thanks!


http://gerrit.cloudera.org:8080/#/c/19473/3/src/kudu/clock/hybrid_clock.cc
File src/kudu/clock/hybrid_clock.cc:

http://gerrit.cloudera.org:8080/#/c/19473/3/src/kudu/clock/hybrid_clock.cc@293
PS3, Line 293:   uint64_t now_latest = GetPhysicalValueMicros(now) + error;
> warning: The right operand of '+' is a garbage value [clang-analyzer-core.U
Done


http://gerrit.cloudera.org:8080/#/c/19473/3/src/kudu/clock/hybrid_clock.cc@780
PS3, Line 780: }
> warning: Undefined or garbage value returned to caller [clang-analyzer-core
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/19473
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I630783653717d975a9b2ad668e8bd47b7796d275
Gerrit-Change-Number: 19473
Gerrit-PatchSet: 2
Gerrit-Owner: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Ashwani Raina <ar...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Yingchun Lai <la...@apache.org>
Gerrit-Reviewer: Yuqi Du <sh...@gmail.com>
Gerrit-Reviewer: Zoltan Chovan <zc...@cloudera.com>
Gerrit-Comment-Date: Tue, 07 Mar 2023 18:51:19 +0000
Gerrit-HasComments: Yes

[kudu-CR] [clock] add sanity check to detect wall clock jumps

Posted by "Yuqi Du (Code Review)" <ge...@cloudera.org>.
Yuqi Du has posted comments on this change. ( http://gerrit.cloudera.org:8080/19473 )

Change subject: [clock] add sanity check to detect wall clock jumps
......................................................................


Patch Set 4: Code-Review+1

(1 comment)

LGTM

http://gerrit.cloudera.org:8080/#/c/19473/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/19473/2//COMMIT_MSG@10
PS2, Line 10: the ntp_adjtime() call jumped 40+ years ahead when running kudu-tserver
> It's not a problem.  Feel free to post an update whenever you have a chance
OK?I get it.



-- 
To view, visit http://gerrit.cloudera.org:8080/19473
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I630783653717d975a9b2ad668e8bd47b7796d275
Gerrit-Change-Number: 19473
Gerrit-PatchSet: 4
Gerrit-Owner: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Ashwani Raina <ar...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Yingchun Lai <la...@apache.org>
Gerrit-Reviewer: Yuqi Du <sh...@gmail.com>
Gerrit-Reviewer: Zoltan Chovan <zc...@cloudera.com>
Gerrit-Comment-Date: Wed, 08 Mar 2023 12:51:16 +0000
Gerrit-HasComments: Yes

[kudu-CR] [clock] add sanity check to detect wall clock jumps

Posted by "Yuqi Du (Code Review)" <ge...@cloudera.org>.
Yuqi Du has posted comments on this change. ( http://gerrit.cloudera.org:8080/19473 )

Change subject: [clock] add sanity check to detect wall clock jumps
......................................................................


Patch Set 5: Code-Review+1


-- 
To view, visit http://gerrit.cloudera.org:8080/19473
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I630783653717d975a9b2ad668e8bd47b7796d275
Gerrit-Change-Number: 19473
Gerrit-PatchSet: 5
Gerrit-Owner: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Ashwani Raina <ar...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Yingchun Lai <la...@apache.org>
Gerrit-Reviewer: Yuqi Du <sh...@gmail.com>
Gerrit-Reviewer: Zoltan Chovan <zc...@cloudera.com>
Gerrit-Comment-Date: Fri, 24 Mar 2023 08:55:53 +0000
Gerrit-HasComments: No

[kudu-CR] [clock] add sanity check to detect wall clock jumps

Posted by "Yuqi Du (Code Review)" <ge...@cloudera.org>.
Yuqi Du has posted comments on this change. ( http://gerrit.cloudera.org:8080/19473 )

Change subject: [clock] add sanity check to detect wall clock jumps
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/19473/2/src/kudu/clock/hybrid_clock.cc
File src/kudu/clock/hybrid_clock.cc:

http://gerrit.cloudera.org:8080/#/c/19473/2/src/kudu/clock/hybrid_clock.cc@439
PS2, Line 439:     if (is_wall_clock_jump_check_enabled_) {
> The idea in using constant member fields of the HybridClock class is to all
I see. It's reasonable for your explains.



-- 
To view, visit http://gerrit.cloudera.org:8080/19473
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I630783653717d975a9b2ad668e8bd47b7796d275
Gerrit-Change-Number: 19473
Gerrit-PatchSet: 4
Gerrit-Owner: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Ashwani Raina <ar...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Yingchun Lai <la...@apache.org>
Gerrit-Reviewer: Yuqi Du <sh...@gmail.com>
Gerrit-Reviewer: Zoltan Chovan <zc...@cloudera.com>
Gerrit-Comment-Date: Fri, 10 Feb 2023 04:07:56 +0000
Gerrit-HasComments: Yes

[kudu-CR] [clock] add sanity check to detect wall clock jumps

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/19473 )

Change subject: [clock] add sanity check to detect wall clock jumps
......................................................................


Patch Set 2:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/19473/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/19473/2//COMMIT_MSG@10
PS2, Line 10: the ntp_adjtime() call jumped 40+ years ahead when running kudu-tserver
> That is a valuable information, Yuqi.
+1

Yuqi, it would be great if could you share more information on what was the environment when this issue happened in your cluster.  Was there anything specific that you think might be the culprit?

As for Azure VMs, I'm suspecting (but it's not yet confirmed) that it somehow related to the activity of so-called VMICTimeSync service which updates VM's clock after no-downtime maintenance (a.k.a. memory preserving maintenance) [1] https://learn.microsoft.com/en-us/azure/virtual-machines/linux/time-sync

--- snip ---
Virtual machine interactions with the host can also affect the clock. During memory preserving maintenance, VMs are paused for up to 30 seconds. For example, before maintenance begins the VM clock shows 10:00:00 AM and lasts 28 seconds. After the VM resumes, the clock on the VM would still show 10:00:00 AM, which would be 28 seconds off. To correct for this, the VMICTimeSync service monitors what is happening on the host and updates the time-of-day clock in Linux VMs to compensate.
--- snip ---

[1] https://learn.microsoft.com/en-us/azure/virtual-machines/linux/time-sync


http://gerrit.cloudera.org:8080/#/c/19473/2/src/kudu/clock/hybrid_clock.cc
File src/kudu/clock/hybrid_clock.cc:

http://gerrit.cloudera.org:8080/#/c/19473/2/src/kudu/clock/hybrid_clock.cc@439
PS2, Line 439:     if (is_wall_clock_jump_check_enabled_) {
> I am not sure If it is better that use 'FLAGS_enable_wall_clock_jump_check'
The idea in using constant member fields of the HybridClock class is to allow the compiler to optimize this code as much as possible since it is in the hot path of processing WriteRequestPB, etc.

Supporting toggling this flag in run time doesn't bring a lot of benefits IMO.  If the system clock jumped ahead when this flag was set 'true', the kudu-master/kudu-tserver would crash anyways.  Vice versa, if this flag wasn't set 'true' when the clock jumped, the aftermath would be having far-in-the-future timestamps in the tablets' WALs of this server and other tablet servers where the write operation was replicated, so they would crash while replaying WAL during tablet bootstrap.  So, there is 'forced' restart in both cases :)


http://gerrit.cloudera.org:8080/#/c/19473/2/src/kudu/clock/hybrid_clock.cc@448
PS2, Line 448: if (abs(wall_delta_usec - mono_delta_usec) > wall_clock_jump_threshold_usec_)
> what about PREDICT_FALSE
I think is a good idea!  Done.



-- 
To view, visit http://gerrit.cloudera.org:8080/19473
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I630783653717d975a9b2ad668e8bd47b7796d275
Gerrit-Change-Number: 19473
Gerrit-PatchSet: 2
Gerrit-Owner: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Ashwani Raina <ar...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Yingchun Lai <la...@apache.org>
Gerrit-Reviewer: Yuqi Du <sh...@gmail.com>
Gerrit-Reviewer: Zoltan Chovan <zc...@cloudera.com>
Gerrit-Comment-Date: Wed, 08 Feb 2023 19:27:41 +0000
Gerrit-HasComments: Yes

[kudu-CR] [clock] add sanity check to detect wall clock jumps

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Zoltan Chovan, Yuqi Du, Ashwani Raina, Yingchun Lai, Attila Bukor, Kudu Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/19473

to look at the new patch set (#8).

Change subject: [clock] add sanity check to detect wall clock jumps
......................................................................

[clock] add sanity check to detect wall clock jumps

There was a case when a timestamp read from system/local clock using
the ntp_adjtime() call jumped 40+ years ahead when running kudu-tserver
on an Azure VM, while ntp_adjtime() didn't report an error on clock
being unsynchronized or any other error at all. That came along with
a huge number of kernel messages, and other software (such as the SASL
library used by SSSD) detected the strange jump in the local clock
as well.  My multiple attempts to reproduce the issue on a real
hardware node, Dockerized environment run at a real hardware server
in a datacenter, and GCE & EC2 VMs were not successful.

This patch adds a sanity check to detect such strange jumps in wall
clock readings.  The idea is to rely on the readings from the
CLOCK_MONOTONIC_RAW clock captured along with the wall clock readings.
A jump should manifest itself in a big difference between the wall clock
delta and the corresponding CLOCK_MONOTONIC_RAW delta.  If such a
condition is detected, then HybridClock::NowWithErrorUnlocked() dumps
diagnostic information about clock NTP synchronisation status and
returns Status::ServiceUnavailable() with appropriate error message.

This patch also adds a unit test for the newly added functionality.

As a part of this changelist, the following new flags are introduced:
  * --wall_clock_jump_detection
      This is to control the newly introduced sanity check for readings
      of the wall clock.  Acceptable values are 'auto', 'enabled,
      and 'disabled'.  It is set to 'auto' by default, meaning the
      sanity check for timestamps is enabled if the process detects
      that it's running on a VM in Azure cloud.
  * --wall_clock_jump_threshold_s
      This is to control the threshold (in seconds) for the difference
      in deltas of the wall clock's and CLOCK_MONOTONIC_RAW clock's
      readings.  It is set to 900 (15 minutes) by default.

The reasoning behind having --wall_clock_jump_detection=auto by
default is to skip an extra check at the majority of nodes out there
since NTP-synchronized system clock isn't supposed to jump that much
at all.  However, since the issue has been observed only at VMs in Azure
cloud, it's enabled automatically at Azure nodes to detect the issue.
If the clock jump went unnoticed, the timestamp would be persisted with
an operation in the WAL and propagated to other replicas.  That could
lead to crashes during tablet bootstrapping and would require manual
intervention to remove the orphaned operations with out-of-whack
timestamps from the WALs of affected tablet replicas.

To assess the performance hit induced by the jump clock detection,
I also added a parameterized HybridClockJumpProtectionTest.BasicPerf
scenario to compare how fast HybridClock::NowWithError() runs
with and without the newly introduced sanity check.

I found that with sanity check enabled the method runs 1.12 times slower
than without the check.  The details are below.

To collect stats on the performance of the HybridClock::NowWithError()
method, I used the command below, running it 10 times for RELEASE and
DEBUG builds to run the new test scenario.  I recorded the timings
output by the test scenarios and computed the average time taken to
invoke HybridClock::NowWithError() 1000000 times.

  KUDU_ALLOW_SLOW_TESTS=1 ./bin/hybrid_clock-test --gtest_filter='*Perf*'

  -------------------------------------------------------------------

  RELEASE build:

  without clock jump detection:
    1000000 iterations in 158.553514 ms
    1000000 iterations in 159.200062 ms
    1000000 iterations in 157.840572 ms
    1000000 iterations in 159.079716 ms
    1000000 iterations in 161.235647 ms

    average: 159.181902 ms

  with clock jump detection:
    1000000 iterations in 178.415508 ms
    1000000 iterations in 178.102677 ms
    1000000 iterations in 177.465617 ms
    1000000 iterations in 180.552167 ms
    1000000 iterations in 179.644688 ms

    average: 178.836131 ms

  -------------------------------------------------------------------

  DEBUG build:

  without clock jump detection:
    1000000 iterations in 592.051281 ms
    1000000 iterations in 591.087500 ms
    1000000 iterations in 597.142767 ms
    1000000 iterations in 591.066335 ms
    1000000 iterations in 590.717687 ms

    average: 592.413114 ms

  with clock jump detection:
    1000000 iterations in 682.691815 ms
    1000000 iterations in 682.087873 ms
    1000000 iterations in 685.602277 ms
    1000000 iterations in 683.057099 ms
    1000000 iterations in 682.095996 ms

    average: 683.107012 ms

  -------------------------------------------------------------------

RELEASE, running under perf stat without clock jump detection:

 Performance counter stats for './bin/hybrid_clock-test --gtest_filter=Perf/HybridClockJumpProtectionTest.BasicPerf/0' (10 runs):

        171.499485 task-clock                #    0.999 CPUs utilized            ( +-  0.18% )
                 3 context-switches          #    0.016 K/sec                    ( +-  4.76% )
                 0 cpu-migrations            #    0.000 K/sec
             2,829 page-faults               #    0.016 M/sec                    ( +-  1.92% )
       562,222,108 cycles                    #    3.278 GHz                      ( +-  0.18% )
       698,278,773 instructions              #    1.24  insns per cycle          ( +-  0.09% )
       122,793,006 branches                  #  715.996 M/sec                    ( +-  0.11% )
           319,525 branch-misses             #    0.26% of all branches          ( +-  0.16% )

       0.171751822 seconds time elapsed                                          ( +-  0.18% )

  -------------------------------------------------------------------

RELEASE, running under perf stat with clock jump detection:

 Performance counter stats for './bin/hybrid_clock-test --gtest_filter=Perf/HybridClockJumpProtectionTest.BasicPerf/1' (10 runs):

        193.189743 task-clock                #    0.999 CPUs utilized            ( +-  0.84% )
                 3 context-switches          #    0.015 K/sec                    ( +-  8.05% )
                 0 cpu-migrations            #    0.000 K/sec
             2,741 page-faults               #    0.014 M/sec                    ( +-  2.35% )
       627,522,941 cycles                    #    3.248 GHz                      ( +-  0.22% )
       797,617,361 instructions              #    1.27  insns per cycle          ( +-  0.13% )
       138,636,629 branches                  #  717.619 M/sec                    ( +-  0.13% )
           339,705 branch-misses             #    0.25% of all branches          ( +-  0.31% )

       0.193459251 seconds time elapsed                                          ( +-  0.84% )

  -------------------------------------------------------------------

Change-Id: I630783653717d975a9b2ad668e8bd47b7796d275
---
M src/kudu/clock/hybrid_clock-test.cc
M src/kudu/clock/hybrid_clock.cc
M src/kudu/clock/hybrid_clock.h
M src/kudu/clock/system_ntp.h
M src/kudu/server/server_base.cc
M src/kudu/server/server_base.h
6 files changed, 377 insertions(+), 40 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/73/19473/8
-- 
To view, visit http://gerrit.cloudera.org:8080/19473
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I630783653717d975a9b2ad668e8bd47b7796d275
Gerrit-Change-Number: 19473
Gerrit-PatchSet: 8
Gerrit-Owner: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Ashwani Raina <ar...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Yingchun Lai <la...@apache.org>
Gerrit-Reviewer: Yuqi Du <sh...@gmail.com>
Gerrit-Reviewer: Zoltan Chovan <zc...@cloudera.com>

[kudu-CR] [clock] add sanity check to detect wall clock jumps

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Zoltan Chovan, Yuqi Du, Ashwani Raina, Yingchun Lai, Kudu Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/19473

to look at the new patch set (#2).

Change subject: [clock] add sanity check to detect wall clock jumps
......................................................................

[clock] add sanity check to detect wall clock jumps

There was a case when a timestamp read from system/local clock using
the ntp_adjtime() call jumped 40+ years ahead when running kudu-tserver
on an Azure VM. That came along with a huge number of kernel messages,
and other software (such as the SASL library used by SSSD) detected
the strange jump in the local clock as well.  My multiple attempts to
reproduce the issue on a real hardware node, Dockerized environment
run at a real hardware server in a datacenter, and GCE & EC2 VMs were
not successful.

This patch adds a sanity check to detect such strange jumps in wall
clock readings.  The idea is to rely on the readings from the
CLOCK_MONOTONIC clock captured along with the wall clock readings.
A jump should manifest itself in big difference between the wall clock
delta and the corresponding CLOCK_MONOTONIC delta.  If such a condition
is detected, then HybridClock::NowWithErrorUnlocked() dumps diagnostic
information about clock synchronisation status and aborts.

This patch also adds a unit test for the newly added functionality.

As a part of this changelist, the following new flags are introduced:
  * --enable_wall_clock_jump_check
      Whether to enable the newly introduced sanity check for readings
      of the wall clock.  Set to 'false' by default.
  * --wall_clock_jump_threshold_s
      The threshold (in seconds) for the difference in corresponding
      deltas of the wall clock's and CLOCK_MONOTONIC_RAW clock's
      readings.  Set to 900 (15 minutes) by default.

The reasoning behind having --enable_wall_clock_jump_check=false by
default is to skip an extra check at the majority of nodes out there
since NTP-synchronized system clock isn't supposed to jump that much
at all.  However, if a problem is observed at some inadequate VMs such
as ones in Azure Cloud, it's now possible to enable the guardrail
to detect such an issue.  If this goes unnoticed, the timestamp might
be persisted with an operation in the WAL and propagated to other
replicas as an orphaned REPLICATE operation.  That leads to crashes
during tablet bootstrapping, and requires manual intervention to
remove the orphaned operations with out-of-wack timestamps from the WAL.

Change-Id: I630783653717d975a9b2ad668e8bd47b7796d275
---
M src/kudu/clock/hybrid_clock-test.cc
M src/kudu/clock/hybrid_clock.cc
M src/kudu/clock/hybrid_clock.h
M src/kudu/clock/system_ntp.h
4 files changed, 138 insertions(+), 11 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/73/19473/2
-- 
To view, visit http://gerrit.cloudera.org:8080/19473
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I630783653717d975a9b2ad668e8bd47b7796d275
Gerrit-Change-Number: 19473
Gerrit-PatchSet: 2
Gerrit-Owner: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Ashwani Raina <ar...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Yingchun Lai <la...@apache.org>
Gerrit-Reviewer: Yuqi Du <sh...@gmail.com>
Gerrit-Reviewer: Zoltan Chovan <zc...@cloudera.com>

[kudu-CR] [clock] add sanity check to detect wall clock jumps

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/19473 )

Change subject: [clock] add sanity check to detect wall clock jumps
......................................................................


Patch Set 8: Verified+1

unrelated test failures (RELEASE):
* LogRollingITest.TestLogCleanupOnStartup
* MasterReplicationAndRpcSizeLimitTest.TabletReports


-- 
To view, visit http://gerrit.cloudera.org:8080/19473
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I630783653717d975a9b2ad668e8bd47b7796d275
Gerrit-Change-Number: 19473
Gerrit-PatchSet: 8
Gerrit-Owner: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Ashwani Raina <ar...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Yingchun Lai <la...@apache.org>
Gerrit-Reviewer: Yuqi Du <sh...@gmail.com>
Gerrit-Reviewer: Zoltan Chovan <zc...@cloudera.com>
Gerrit-Comment-Date: Wed, 29 Mar 2023 05:23:12 +0000
Gerrit-HasComments: No

[kudu-CR] [clock] add sanity check to detect wall clock jumps

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has removed a vote on this change.

Change subject: [clock] add sanity check to detect wall clock jumps
......................................................................


Removed Verified-1 by Kudu Jenkins (120)
-- 
To view, visit http://gerrit.cloudera.org:8080/19473
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: deleteVote
Gerrit-Change-Id: I630783653717d975a9b2ad668e8bd47b7796d275
Gerrit-Change-Number: 19473
Gerrit-PatchSet: 9
Gerrit-Owner: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Abhishek Chennaka <ac...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Ashwani Raina <ar...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Yingchun Lai <la...@apache.org>
Gerrit-Reviewer: Yuqi Du <sh...@gmail.com>
Gerrit-Reviewer: Zoltan Chovan <zc...@cloudera.com>

[kudu-CR] [clock] add sanity check to detect wall clock jumps

Posted by "Zoltan Chovan (Code Review)" <ge...@cloudera.org>.
Zoltan Chovan has posted comments on this change. ( http://gerrit.cloudera.org:8080/19473 )

Change subject: [clock] add sanity check to detect wall clock jumps
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/19473/1/src/kudu/clock/hybrid_clock.cc
File src/kudu/clock/hybrid_clock.cc:

http://gerrit.cloudera.org:8080/#/c/19473/1/src/kudu/clock/hybrid_clock.cc@444
PS1, Line 444: monotomic
nit: monotonic



-- 
To view, visit http://gerrit.cloudera.org:8080/19473
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I630783653717d975a9b2ad668e8bd47b7796d275
Gerrit-Change-Number: 19473
Gerrit-PatchSet: 1
Gerrit-Owner: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Ashwani Raina <ar...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Yingchun Lai <la...@apache.org>
Gerrit-Reviewer: Yuqi Du <sh...@gmail.com>
Gerrit-Reviewer: Zoltan Chovan <zc...@cloudera.com>
Gerrit-Comment-Date: Mon, 06 Feb 2023 09:21:22 +0000
Gerrit-HasComments: Yes

[kudu-CR] [clock] add sanity check to detect wall clock jumps

Posted by "Ashwani Raina (Code Review)" <ge...@cloudera.org>.
Ashwani Raina has posted comments on this change. ( http://gerrit.cloudera.org:8080/19473 )

Change subject: [clock] add sanity check to detect wall clock jumps
......................................................................


Patch Set 2:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/19473/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/19473/1//COMMIT_MSG@40
PS1, Line 40: However, if a problem is observed at some inadequate VMs such
            : as ones in Azure Cloud, it's now possible to enable the guardrail
            : to detect such an issue.
> The idea is to turn on this flag if it's found that the cluster is prone cl
Thank you for the response.
Just summarising yesterday's discussion here:
- It would make sense to enable this by default since it might be too late when we find out the issue and then enable this check since the wrong timestamp may have already been persisted on disk. As you mentioned, you have a WIP patch where you are enabling this check by default so that should take care of it.
- We are awaiting more data on this, if and when received might helps us augment this patch further. That may also make it more robust and target oriented in terms of fixing only part/case that is broken.


http://gerrit.cloudera.org:8080/#/c/19473/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/19473/2//COMMIT_MSG@10
PS2, Line 10: the ntp_adjtime() call jumped 40+ years ahead when running kudu-tserver
> Good work., it's would be very useful. we have encountered several time thi
That is a valuable information, Yuqi.
Do you think this issue is reproducible? If yes, it would be great if you could capture the steps here.
Also, did you hit this issue on an Azure VM?



-- 
To view, visit http://gerrit.cloudera.org:8080/19473
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I630783653717d975a9b2ad668e8bd47b7796d275
Gerrit-Change-Number: 19473
Gerrit-PatchSet: 2
Gerrit-Owner: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Ashwani Raina <ar...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Yingchun Lai <la...@apache.org>
Gerrit-Reviewer: Yuqi Du <sh...@gmail.com>
Gerrit-Reviewer: Zoltan Chovan <zc...@cloudera.com>
Gerrit-Comment-Date: Tue, 07 Feb 2023 10:01:21 +0000
Gerrit-HasComments: Yes

[kudu-CR] [clock] add sanity check to detect wall clock jumps

Posted by "Yuqi Du (Code Review)" <ge...@cloudera.org>.
Yuqi Du has posted comments on this change. ( http://gerrit.cloudera.org:8080/19473 )

Change subject: [clock] add sanity check to detect wall clock jumps
......................................................................


Patch Set 2:

(3 comments)

A nice work.

http://gerrit.cloudera.org:8080/#/c/19473/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/19473/2//COMMIT_MSG@10
PS2, Line 10: the ntp_adjtime() call jumped 40+ years ahead when running kudu-tserver
Good work., it's would be very useful. we have encountered several time this same problem.
It's pending at my todo list more than 1 year, but I haven't study it now, of course I also don't know the reason.

In a word, good work.


http://gerrit.cloudera.org:8080/#/c/19473/2/src/kudu/clock/hybrid_clock.cc
File src/kudu/clock/hybrid_clock.cc:

http://gerrit.cloudera.org:8080/#/c/19473/2/src/kudu/clock/hybrid_clock.cc@439
PS2, Line 439:     if (is_wall_clock_jump_check_enabled_) {
I am not sure If it is better that use 'FLAGS_enable_wall_clock_jump_check' and 'wall_clock_jump_threshold_s'  / 'FLAGS_wall_clock_jump_threshold_us'  directly and remove the two member variables.

Using them directly, we can support turn on/off the gflags at runtime and avoid restarting kudu modules.


http://gerrit.cloudera.org:8080/#/c/19473/2/src/kudu/clock/hybrid_clock.cc@448
PS2, Line 448: if (abs(wall_delta_usec - mono_delta_usec) > wall_clock_jump_threshold_usec_)
what about PREDICT_FALSE



-- 
To view, visit http://gerrit.cloudera.org:8080/19473
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I630783653717d975a9b2ad668e8bd47b7796d275
Gerrit-Change-Number: 19473
Gerrit-PatchSet: 2
Gerrit-Owner: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Ashwani Raina <ar...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Yingchun Lai <la...@apache.org>
Gerrit-Reviewer: Yuqi Du <sh...@gmail.com>
Gerrit-Reviewer: Zoltan Chovan <zc...@cloudera.com>
Gerrit-Comment-Date: Tue, 07 Feb 2023 09:42:15 +0000
Gerrit-HasComments: Yes

[kudu-CR] [clock] add sanity check to detect wall clock jumps

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Zoltan Chovan, Yuqi Du, Ashwani Raina, Yingchun Lai, Attila Bukor, Kudu Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/19473

to look at the new patch set (#7).

Change subject: [clock] add sanity check to detect wall clock jumps
......................................................................

[clock] add sanity check to detect wall clock jumps

There was a case when a timestamp read from system/local clock using
the ntp_adjtime() call jumped 40+ years ahead when running kudu-tserver
on an Azure VM, while ntp_adjtime() didn't report an error on clock
being unsynchronized or any other error at all. That came along with
a huge number of kernel messages, and other software (such as the SASL
library used by SSSD) detected the strange jump in the local clock
as well.  My multiple attempts to reproduce the issue on a real
hardware node, Dockerized environment run at a real hardware server
in a datacenter, and GCE & EC2 VMs were not successful.

This patch adds a sanity check to detect such strange jumps in wall
clock readings.  The idea is to rely on the readings from the
CLOCK_MONOTONIC_RAW clock captured along with the wall clock readings.
A jump should manifest itself in a big difference between the wall clock
delta and the corresponding CLOCK_MONOTONIC_RAW delta.  If such a
condition is detected, then HybridClock::NowWithErrorUnlocked() dumps
diagnostic information about clock NTP synchronisation status and
returns Status::ServiceUnavailable() with appropriate error message.

This patch also adds a unit test for the newly added functionality.

As a part of this changelist, the following new flags are introduced:
  * --wall_clock_jump_detection
      This is to control the newly introduced sanity check for readings
      of the wall clock.  Acceptable values are 'auto', 'enabled,
      and 'disabled'.  It is set to 'auto' by default, meaning the
      sanity check for timestamps is enabled if the process detects
      that it's running on a VM in Azure cloud.
  * --wall_clock_jump_threshold_s
      This is to control the threshold (in seconds) for the difference
      in deltas of the wall clock's and CLOCK_MONOTONIC_RAW clock's
      readings.  It is set to 900 (15 minutes) by default.

The reasoning behind having --wall_clock_jump_detection=auto by
default is to skip an extra check at the majority of nodes out there
since NTP-synchronized system clock isn't supposed to jump that much
at all.  However, since the issue has been observed only at VMs in Azure
cloud, it's enabled automatically at Azure nodes to detect the issue.
If the clock jump went unnoticed, the timestamp would be persisted with
an operation in the WAL and propagated to other replicas.  That could
lead to crashes during tablet bootstrapping and would require manual
intervention to remove the orphaned operations with out-of-whack
timestamps from the WALs of affected tablet replicas.

To assess the performance hit induced by the jump clock detection,
I also added a parameterized HybridClockJumpProtectionTest.BasicPerf
scenario to compare how fast HybridClock::NowWithError() runs
with and without the newly introduced sanity check.

I found that with sanity check enabled the method runs 1.12 times slower
than without the check.  The details are below.

To collect stats on the performance of the HybridClock::NowWithError()
method, I used the command below, running it 10 times for RELEASE and
DEBUG builds to run the new test scenario.  I recorded the timings
output by the test scenarios and computed the average time taken to
invoke HybridClock::NowWithError() 2000000 times.

  KUDU_ALLOW_SLOW_TESTS=1 ./bin/hybrid_clock-test --gtest_filter='*Perf*'

  -------------------------------------------------------------------

  RELEASE build:

  without clock jump detection:
    1000000 iterations in 158.553514 ms
    1000000 iterations in 159.200062 ms
    1000000 iterations in 157.840572 ms
    1000000 iterations in 159.079716 ms
    1000000 iterations in 161.235647 ms

    average: 159.181902 ms

  with clock jump detection:
    1000000 iterations in 178.415508 ms
    1000000 iterations in 178.102677 ms
    1000000 iterations in 177.465617 ms
    1000000 iterations in 180.552167 ms
    1000000 iterations in 179.644688 ms

    average: 178.836131 ms

  -------------------------------------------------------------------

  DEBUG build:

  without clock jump detection:
    1000000 iterations in 592.051281 ms
    1000000 iterations in 591.087500 ms
    1000000 iterations in 597.142767 ms
    1000000 iterations in 591.066335 ms
    1000000 iterations in 590.717687 ms

    average: 592.413114 ms

  with clock jump detection:
    1000000 iterations in 682.691815 ms
    1000000 iterations in 682.087873 ms
    1000000 iterations in 685.602277 ms
    1000000 iterations in 683.057099 ms
    1000000 iterations in 682.095996 ms

    average: 683.107012 ms

  -------------------------------------------------------------------

RELEASE, running under perf stat without clock jump detection:

 Performance counter stats for './bin/hybrid_clock-test --gtest_filter=Perf/HybridClockJumpProtectionTest.BasicPerf/0' (10 runs):

        171.499485 task-clock                #    0.999 CPUs utilized            ( +-  0.18% )
                 3 context-switches          #    0.016 K/sec                    ( +-  4.76% )
                 0 cpu-migrations            #    0.000 K/sec
             2,829 page-faults               #    0.016 M/sec                    ( +-  1.92% )
       562,222,108 cycles                    #    3.278 GHz                      ( +-  0.18% )
       698,278,773 instructions              #    1.24  insns per cycle          ( +-  0.09% )
       122,793,006 branches                  #  715.996 M/sec                    ( +-  0.11% )
           319,525 branch-misses             #    0.26% of all branches          ( +-  0.16% )

       0.171751822 seconds time elapsed                                          ( +-  0.18% )

  -------------------------------------------------------------------

RELEASE, running under perf stat with clock jump detection:

 Performance counter stats for './bin/hybrid_clock-test --gtest_filter=Perf/HybridClockJumpProtectionTest.BasicPerf/1' (10 runs):

        193.189743 task-clock                #    0.999 CPUs utilized            ( +-  0.84% )
                 3 context-switches          #    0.015 K/sec                    ( +-  8.05% )
                 0 cpu-migrations            #    0.000 K/sec
             2,741 page-faults               #    0.014 M/sec                    ( +-  2.35% )
       627,522,941 cycles                    #    3.248 GHz                      ( +-  0.22% )
       797,617,361 instructions              #    1.27  insns per cycle          ( +-  0.13% )
       138,636,629 branches                  #  717.619 M/sec                    ( +-  0.13% )
           339,705 branch-misses             #    0.25% of all branches          ( +-  0.31% )

       0.193459251 seconds time elapsed                                          ( +-  0.84% )

  -------------------------------------------------------------------

Change-Id: I630783653717d975a9b2ad668e8bd47b7796d275
---
M src/kudu/clock/hybrid_clock-test.cc
M src/kudu/clock/hybrid_clock.cc
M src/kudu/clock/hybrid_clock.h
M src/kudu/clock/system_ntp.h
M src/kudu/server/server_base.cc
M src/kudu/server/server_base.h
6 files changed, 377 insertions(+), 40 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/73/19473/7
-- 
To view, visit http://gerrit.cloudera.org:8080/19473
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I630783653717d975a9b2ad668e8bd47b7796d275
Gerrit-Change-Number: 19473
Gerrit-PatchSet: 7
Gerrit-Owner: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Ashwani Raina <ar...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Yingchun Lai <la...@apache.org>
Gerrit-Reviewer: Yuqi Du <sh...@gmail.com>
Gerrit-Reviewer: Zoltan Chovan <zc...@cloudera.com>

[kudu-CR] [clock] add sanity check to detect wall clock jumps

Posted by "Yingchun Lai (Code Review)" <ge...@cloudera.org>.
Yingchun Lai has posted comments on this change. ( http://gerrit.cloudera.org:8080/19473 )

Change subject: [clock] add sanity check to detect wall clock jumps
......................................................................


Patch Set 8: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/19473
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I630783653717d975a9b2ad668e8bd47b7796d275
Gerrit-Change-Number: 19473
Gerrit-PatchSet: 8
Gerrit-Owner: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Ashwani Raina <ar...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Yingchun Lai <la...@apache.org>
Gerrit-Reviewer: Yuqi Du <sh...@gmail.com>
Gerrit-Reviewer: Zoltan Chovan <zc...@cloudera.com>
Gerrit-Comment-Date: Fri, 31 Mar 2023 07:22:36 +0000
Gerrit-HasComments: No

[kudu-CR] [clock] add sanity check to detect wall clock jumps

Posted by "Yingchun Lai (Code Review)" <ge...@cloudera.org>.
Yingchun Lai has posted comments on this change. ( http://gerrit.cloudera.org:8080/19473 )

Change subject: [clock] add sanity check to detect wall clock jumps
......................................................................


Patch Set 9: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/19473
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I630783653717d975a9b2ad668e8bd47b7796d275
Gerrit-Change-Number: 19473
Gerrit-PatchSet: 9
Gerrit-Owner: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Abhishek Chennaka <ac...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Ashwani Raina <ar...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Yingchun Lai <la...@apache.org>
Gerrit-Reviewer: Yuqi Du <sh...@gmail.com>
Gerrit-Reviewer: Zoltan Chovan <zc...@cloudera.com>
Gerrit-Comment-Date: Sat, 01 Apr 2023 07:12:47 +0000
Gerrit-HasComments: No

[kudu-CR] [clock] add sanity check to detect wall clock jumps

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/19473 )

Change subject: [clock] add sanity check to detect wall clock jumps
......................................................................


Patch Set 5: Verified+1

unrelated test failure in ToolTest.TestNonRandomWorkloadLoadgen (DEBUG)


-- 
To view, visit http://gerrit.cloudera.org:8080/19473
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I630783653717d975a9b2ad668e8bd47b7796d275
Gerrit-Change-Number: 19473
Gerrit-PatchSet: 5
Gerrit-Owner: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Ashwani Raina <ar...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Yingchun Lai <la...@apache.org>
Gerrit-Reviewer: Yuqi Du <sh...@gmail.com>
Gerrit-Reviewer: Zoltan Chovan <zc...@cloudera.com>
Gerrit-Comment-Date: Fri, 24 Mar 2023 21:27:06 +0000
Gerrit-HasComments: No

[kudu-CR] [clock] add sanity check to detect wall clock jumps

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Zoltan Chovan, Yuqi Du, Ashwani Raina, Yingchun Lai, Kudu Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/19473

to look at the new patch set (#3).

Change subject: [clock] add sanity check to detect wall clock jumps
......................................................................

[clock] add sanity check to detect wall clock jumps

There was a case when a timestamp read from system/local clock using
the ntp_adjtime() call jumped 40+ years ahead when running kudu-tserver
on an Azure VM, while ntp_adjtime() didn't report an error on clock
being unsynchronized. That came along with a huge number of kernel
messages, and other software (such as the SASL library used by SSSD)
detected the strange jump in the local clock as well.  My multiple
attempts to reproduce the issue on a real hardware node, Dockerized
environment run at a real hardware server in a datacenter, and GCE & EC2
VMs were not successful.

This patch adds a sanity check to detect such strange jumps in wall
clock readings.  The idea is to rely on the readings from the
CLOCK_MONOTONIC clock captured along with the wall clock readings.
A jump should manifest itself in big difference between the wall clock
delta and the corresponding CLOCK_MONOTONIC delta.  If such a condition
is detected, then HybridClock::NowWithErrorUnlocked() dumps diagnostic
information about clock synchronisation status and aborts.

This patch also adds a unit test for the newly added functionality.

As a part of this changelist, the following new flags are introduced:
  * --enable_wall_clock_jump_check
      Whether to enable the newly introduced sanity check for readings
      of the wall clock.  Set to 'false' by default.
  * --wall_clock_jump_threshold_s
      The threshold (in seconds) for the difference in corresponding
      deltas of the wall clock's and CLOCK_MONOTONIC_RAW clock's
      readings.  Set to 900 (15 minutes) by default.

The reasoning behind having --enable_wall_clock_jump_check=false by
default is to skip an extra check at the majority of nodes out there
since NTP-synchronized system clock isn't supposed to jump that much
at all.  However, if a problem is observed at some inadequate VMs such
as ones in Azure Cloud, it's now possible to enable the guardrail
to detect such an issue.  If this goes unnoticed, the timestamp might
be persisted with an operation in the WAL and propagated to other
replicas as an orphaned REPLICATE operation.  That leads to crashes
during tablet bootstrapping, and requires manual intervention to
remove the orphaned operations with out-of-wack timestamps from the WAL.

Change-Id: I630783653717d975a9b2ad668e8bd47b7796d275
---
M src/kudu/clock/hybrid_clock-test.cc
M src/kudu/clock/hybrid_clock.cc
M src/kudu/clock/hybrid_clock.h
M src/kudu/clock/system_ntp.h
4 files changed, 139 insertions(+), 11 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/73/19473/3
-- 
To view, visit http://gerrit.cloudera.org:8080/19473
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I630783653717d975a9b2ad668e8bd47b7796d275
Gerrit-Change-Number: 19473
Gerrit-PatchSet: 3
Gerrit-Owner: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Ashwani Raina <ar...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Yingchun Lai <la...@apache.org>
Gerrit-Reviewer: Yuqi Du <sh...@gmail.com>
Gerrit-Reviewer: Zoltan Chovan <zc...@cloudera.com>

[kudu-CR] [clock] add sanity check to detect wall clock jumps

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Zoltan Chovan, Yuqi Du, Ashwani Raina, Yingchun Lai, Kudu Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/19473

to look at the new patch set (#4).

Change subject: [clock] add sanity check to detect wall clock jumps
......................................................................

[clock] add sanity check to detect wall clock jumps

There was a case when a timestamp read from system/local clock using
the ntp_adjtime() call jumped 40+ years ahead when running kudu-tserver
on an Azure VM, while ntp_adjtime() didn't report an error on clock
being unsynchronized. That came along with a huge number of kernel
messages, and other software (such as the SASL library used by SSSD)
detected the strange jump in the local clock as well.  My multiple
attempts to reproduce the issue on a real hardware node, Dockerized
environment run at a real hardware server in a datacenter, and GCE & EC2
VMs were not successful.

This patch adds a sanity check to detect such strange jumps in wall
clock readings.  The idea is to rely on the readings from the
CLOCK_MONOTONIC clock captured along with the wall clock readings.
A jump should manifest itself in big difference between the wall clock
delta and the corresponding CLOCK_MONOTONIC delta.  If such a condition
is detected, then HybridClock::NowWithErrorUnlocked() dumps diagnostic
information about clock synchronisation status and aborts.

This patch also adds a unit test for the newly added functionality.

As a part of this changelist, the following new flags are introduced:
  * --enable_wall_clock_jump_check
      Whether to enable the newly introduced sanity check for readings
      of the wall clock.  Set to 'false' by default.
  * --wall_clock_jump_threshold_s
      The threshold (in seconds) for the difference in corresponding
      deltas of the wall clock's and CLOCK_MONOTONIC_RAW clock's
      readings.  Set to 900 (15 minutes) by default.

The reasoning behind having --enable_wall_clock_jump_check=false by
default is to skip an extra check at the majority of nodes out there
since NTP-synchronized system clock isn't supposed to jump that much
at all.  However, if a problem is observed at some inadequate VMs such
as ones in Azure Cloud, it's now possible to enable the guardrail
to detect such an issue.  If this goes unnoticed, the timestamp might
be persisted with an operation in the WAL and propagated to other
replicas as an orphaned REPLICATE operation.  That leads to crashes
during tablet bootstrapping, and requires manual intervention to
remove the orphaned operations with out-of-wack timestamps from the WAL.

Change-Id: I630783653717d975a9b2ad668e8bd47b7796d275
---
M src/kudu/clock/hybrid_clock-test.cc
M src/kudu/clock/hybrid_clock.cc
M src/kudu/clock/hybrid_clock.h
M src/kudu/clock/system_ntp.h
4 files changed, 143 insertions(+), 16 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/73/19473/4
-- 
To view, visit http://gerrit.cloudera.org:8080/19473
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I630783653717d975a9b2ad668e8bd47b7796d275
Gerrit-Change-Number: 19473
Gerrit-PatchSet: 4
Gerrit-Owner: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Ashwani Raina <ar...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Yingchun Lai <la...@apache.org>
Gerrit-Reviewer: Yuqi Du <sh...@gmail.com>
Gerrit-Reviewer: Zoltan Chovan <zc...@cloudera.com>

[kudu-CR] [clock] add sanity check to detect wall clock jumps

Posted by "Yuqi Du (Code Review)" <ge...@cloudera.org>.
Yuqi Du has posted comments on this change. ( http://gerrit.cloudera.org:8080/19473 )

Change subject: [clock] add sanity check to detect wall clock jumps
......................................................................


Patch Set 8: Code-Review+1


-- 
To view, visit http://gerrit.cloudera.org:8080/19473
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I630783653717d975a9b2ad668e8bd47b7796d275
Gerrit-Change-Number: 19473
Gerrit-PatchSet: 8
Gerrit-Owner: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Ashwani Raina <ar...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Yingchun Lai <la...@apache.org>
Gerrit-Reviewer: Yuqi Du <sh...@gmail.com>
Gerrit-Reviewer: Zoltan Chovan <zc...@cloudera.com>
Gerrit-Comment-Date: Thu, 30 Mar 2023 09:51:28 +0000
Gerrit-HasComments: No

[kudu-CR] [clock] add sanity check to detect wall clock jumps

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Zoltan Chovan, Yuqi Du, Ashwani Raina, Yingchun Lai, Attila Bukor, Kudu Jenkins, Abhishek Chennaka, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/19473

to look at the new patch set (#9).

Change subject: [clock] add sanity check to detect wall clock jumps
......................................................................

[clock] add sanity check to detect wall clock jumps

There was a case when a timestamp read from system/local clock using
the ntp_adjtime() call jumped 40+ years ahead when running kudu-tserver
on an Azure VM, while ntp_adjtime() didn't report an error on clock
being unsynchronized or any other error at all. That came along with
a huge number of kernel messages, and other software (such as the SASL
library used by SSSD) detected the strange jump in the local clock
as well.  My multiple attempts to reproduce the issue on a real
hardware node, Dockerized environment run at a real hardware server
in a datacenter, and GCE & EC2 VMs were not successful.

This patch adds a sanity check to detect such strange jumps in wall
clock readings.  The idea is to rely on the readings from the
CLOCK_MONOTONIC_RAW clock captured along with the wall clock readings.
A jump should manifest itself in a big difference between the wall clock
delta and the corresponding CLOCK_MONOTONIC_RAW delta.  If such a
condition is detected, then HybridClock::NowWithErrorUnlocked() dumps
diagnostic information about clock NTP synchronisation status and
returns Status::ServiceUnavailable() with appropriate error message.

This patch also adds a unit test for the newly added functionality.

As a part of this changelist, the following new flags are introduced:
  * --wall_clock_jump_detection
      This is to control the newly introduced sanity check for readings
      of the wall clock.  Acceptable values are 'auto', 'enabled,
      and 'disabled'.  It is set to 'auto' by default, meaning the
      sanity check for timestamps is enabled if the process detects
      that it's running on a VM in Azure cloud.
  * --wall_clock_jump_threshold_sec
      This is to control the threshold (in seconds) for the difference
      in deltas of the wall clock's and CLOCK_MONOTONIC_RAW clock's
      readings.  It is set to 900 (15 minutes) by default.

The reasoning behind having --wall_clock_jump_detection=auto by
default is to skip an extra check at the majority of nodes out there
since NTP-synchronized system clock isn't supposed to jump that much
at all.  However, since the issue has been observed only at VMs in Azure
cloud, it's enabled automatically at Azure nodes to detect the issue.
If the clock jump went unnoticed, the timestamp would be persisted with
an operation in the WAL and propagated to other replicas.  That could
lead to crashes during tablet bootstrapping and would require manual
intervention to remove the orphaned operations with out-of-whack
timestamps from the WALs of affected tablet replicas.

To assess the performance hit induced by the jump clock detection,
I also added a parameterized HybridClockJumpProtectionTest.BasicPerf
scenario to compare how fast HybridClock::NowWithError() runs
with and without the newly introduced sanity check.

I found that with sanity check enabled the method runs 1.12 times slower
than without the check.  The details are below.

To collect stats on the performance of the HybridClock::NowWithError()
method, I used the command below, running it 10 times for RELEASE and
DEBUG builds to run the new test scenario.  I recorded the timings
output by the test scenarios and computed the average time taken to
invoke HybridClock::NowWithError() 1000000 times.

  KUDU_ALLOW_SLOW_TESTS=1 ./bin/hybrid_clock-test --gtest_filter='*Perf*'

  -------------------------------------------------------------------

  RELEASE build:

  without clock jump detection:
    1000000 iterations in 158.553514 ms
    1000000 iterations in 159.200062 ms
    1000000 iterations in 157.840572 ms
    1000000 iterations in 159.079716 ms
    1000000 iterations in 161.235647 ms

    average: 159.181902 ms

  with clock jump detection:
    1000000 iterations in 178.415508 ms
    1000000 iterations in 178.102677 ms
    1000000 iterations in 177.465617 ms
    1000000 iterations in 180.552167 ms
    1000000 iterations in 179.644688 ms

    average: 178.836131 ms

  -------------------------------------------------------------------

  DEBUG build:

  without clock jump detection:
    1000000 iterations in 592.051281 ms
    1000000 iterations in 591.087500 ms
    1000000 iterations in 597.142767 ms
    1000000 iterations in 591.066335 ms
    1000000 iterations in 590.717687 ms

    average: 592.413114 ms

  with clock jump detection:
    1000000 iterations in 682.691815 ms
    1000000 iterations in 682.087873 ms
    1000000 iterations in 685.602277 ms
    1000000 iterations in 683.057099 ms
    1000000 iterations in 682.095996 ms

    average: 683.107012 ms

  -------------------------------------------------------------------

RELEASE, running under perf stat without clock jump detection:

 Performance counter stats for './bin/hybrid_clock-test --gtest_filter=Perf/HybridClockJumpProtectionTest.BasicPerf/0' (10 runs):

        171.499485 task-clock                #    0.999 CPUs utilized            ( +-  0.18% )
                 3 context-switches          #    0.016 K/sec                    ( +-  4.76% )
                 0 cpu-migrations            #    0.000 K/sec
             2,829 page-faults               #    0.016 M/sec                    ( +-  1.92% )
       562,222,108 cycles                    #    3.278 GHz                      ( +-  0.18% )
       698,278,773 instructions              #    1.24  insns per cycle          ( +-  0.09% )
       122,793,006 branches                  #  715.996 M/sec                    ( +-  0.11% )
           319,525 branch-misses             #    0.26% of all branches          ( +-  0.16% )

       0.171751822 seconds time elapsed                                          ( +-  0.18% )

  -------------------------------------------------------------------

RELEASE, running under perf stat with clock jump detection:

 Performance counter stats for './bin/hybrid_clock-test --gtest_filter=Perf/HybridClockJumpProtectionTest.BasicPerf/1' (10 runs):

        193.189743 task-clock                #    0.999 CPUs utilized            ( +-  0.84% )
                 3 context-switches          #    0.015 K/sec                    ( +-  8.05% )
                 0 cpu-migrations            #    0.000 K/sec
             2,741 page-faults               #    0.014 M/sec                    ( +-  2.35% )
       627,522,941 cycles                    #    3.248 GHz                      ( +-  0.22% )
       797,617,361 instructions              #    1.27  insns per cycle          ( +-  0.13% )
       138,636,629 branches                  #  717.619 M/sec                    ( +-  0.13% )
           339,705 branch-misses             #    0.25% of all branches          ( +-  0.31% )

       0.193459251 seconds time elapsed                                          ( +-  0.84% )

  -------------------------------------------------------------------

Change-Id: I630783653717d975a9b2ad668e8bd47b7796d275
---
M src/kudu/clock/hybrid_clock-test.cc
M src/kudu/clock/hybrid_clock.cc
M src/kudu/clock/hybrid_clock.h
M src/kudu/clock/system_ntp.h
M src/kudu/server/server_base.cc
M src/kudu/server/server_base.h
6 files changed, 377 insertions(+), 40 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/73/19473/9
-- 
To view, visit http://gerrit.cloudera.org:8080/19473
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I630783653717d975a9b2ad668e8bd47b7796d275
Gerrit-Change-Number: 19473
Gerrit-PatchSet: 9
Gerrit-Owner: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Abhishek Chennaka <ac...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Ashwani Raina <ar...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Yingchun Lai <la...@apache.org>
Gerrit-Reviewer: Yuqi Du <sh...@gmail.com>
Gerrit-Reviewer: Zoltan Chovan <zc...@cloudera.com>

[kudu-CR] [clock] add sanity check to detect wall clock jumps

Posted by "Yuqi Du (Code Review)" <ge...@cloudera.org>.
Yuqi Du has posted comments on this change. ( http://gerrit.cloudera.org:8080/19473 )

Change subject: [clock] add sanity check to detect wall clock jumps
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/19473/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/19473/2//COMMIT_MSG@10
PS2, Line 10: the ntp_adjtime() call jumped 40+ years ahead when running kudu-tserver
> +1
Sorry, reply the message too late. 

I have reviewed the case messages that I simply saved at my docs. But I have no more valued messages about this problem.  I will try to best to review some information about this, and then share my messages.



-- 
To view, visit http://gerrit.cloudera.org:8080/19473
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I630783653717d975a9b2ad668e8bd47b7796d275
Gerrit-Change-Number: 19473
Gerrit-PatchSet: 4
Gerrit-Owner: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Ashwani Raina <ar...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Yingchun Lai <la...@apache.org>
Gerrit-Reviewer: Yuqi Du <sh...@gmail.com>
Gerrit-Reviewer: Zoltan Chovan <zc...@cloudera.com>
Gerrit-Comment-Date: Fri, 24 Feb 2023 10:45:20 +0000
Gerrit-HasComments: Yes

[kudu-CR] [clock] add sanity check to detect wall clock jumps

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/19473 )

Change subject: [clock] add sanity check to detect wall clock jumps
......................................................................


Patch Set 1:

(9 comments)

http://gerrit.cloudera.org:8080/#/c/19473/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/19473/1//COMMIT_MSG@23
PS1, Line 23: condition 
> nit: condition is
Done


http://gerrit.cloudera.org:8080/#/c/19473/1//COMMIT_MSG@40
PS1, Line 40: However, if a problem is observed at some inadequate VMs such
            : as ones in Azure Cloud, it's now possible to enable the guardrail
            : to detect such an issue.
> There are chances that this clock issue may go unnoticed unless there is a 
The idea is to turn on this flag if it's found that the cluster is prone clock jump issues.  Doing so, it would help to detect such issues in the future.

From the other side, I have a follow-up WIP patch where the flag changes its default value to 'true' if a Kudu server founds itself running at an Azure VM.

Another alternative would be enabling this check by default, but the issue hasn't been observed to happen often enough in the wild.


http://gerrit.cloudera.org:8080/#/c/19473/1//COMMIT_MSG@46
PS1, Line 46: opeations
> nit: operations
Done


http://gerrit.cloudera.org:8080/#/c/19473/1/src/kudu/clock/hybrid_clock.h
File src/kudu/clock/hybrid_clock.h:

http://gerrit.cloudera.org:8080/#/c/19473/1/src/kudu/clock/hybrid_clock.h@238
PS1, Line 238: captured 
> nit: Did you mean "captured at" ?
Done


http://gerrit.cloudera.org:8080/#/c/19473/1/src/kudu/clock/hybrid_clock.cc
File src/kudu/clock/hybrid_clock.cc:

http://gerrit.cloudera.org:8080/#/c/19473/1/src/kudu/clock/hybrid_clock.cc@165
PS1, Line 165: 15 * 60
> Any specific reason to choose 15 minutes of max allowed divergence?
Nothing very specific except that the idea was to keep this under 1000 seconds.  That's the default threshold for ntpd unless '-g' options is provided. ntpd would exit without trying to adjust time if it sees the difference between the reference time and local clock time is greater than 1000 seconds.

I added corresponding comment.


http://gerrit.cloudera.org:8080/#/c/19473/1/src/kudu/clock/hybrid_clock.cc@287
PS1, Line 287:   uint64_t now_latest = GetPhysicalValueMicros(now) + error;
> warning: The right operand of '+' is a garbage value [clang-analyzer-core.U
This warning is bogus.


http://gerrit.cloudera.org:8080/#/c/19473/1/src/kudu/clock/hybrid_clock.cc@353
PS1, Line 353:   uint64_t now_earliest_usec = now_usec - error;
> warning: The right operand of '-' is a garbage value [clang-analyzer-core.U
This warning is bogus.


http://gerrit.cloudera.org:8080/#/c/19473/1/src/kudu/clock/hybrid_clock.cc@444
PS1, Line 444: monotomic
> nit: monotonic
Done


http://gerrit.cloudera.org:8080/#/c/19473/1/src/kudu/clock/hybrid_clock.cc@773
PS1, Line 773:   return error;
> warning: Undefined or garbage value returned to caller [clang-analyzer-core
This warning is bogus.



-- 
To view, visit http://gerrit.cloudera.org:8080/19473
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I630783653717d975a9b2ad668e8bd47b7796d275
Gerrit-Change-Number: 19473
Gerrit-PatchSet: 1
Gerrit-Owner: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Ashwani Raina <ar...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Yingchun Lai <la...@apache.org>
Gerrit-Reviewer: Yuqi Du <sh...@gmail.com>
Gerrit-Reviewer: Zoltan Chovan <zc...@cloudera.com>
Gerrit-Comment-Date: Tue, 07 Feb 2023 02:43:50 +0000
Gerrit-HasComments: Yes

[kudu-CR] [clock] add sanity check to detect wall clock jumps

Posted by "Ashwani Raina (Code Review)" <ge...@cloudera.org>.
Ashwani Raina has posted comments on this change. ( http://gerrit.cloudera.org:8080/19473 )

Change subject: [clock] add sanity check to detect wall clock jumps
......................................................................


Patch Set 1:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/19473/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/19473/1//COMMIT_MSG@23
PS1, Line 23: condition 
nit: condition is


http://gerrit.cloudera.org:8080/#/c/19473/1//COMMIT_MSG@40
PS1, Line 40: However, if a problem is observed at some inadequate VMs such
            : as ones in Azure Cloud, it's now possible to enable the guardrail
            : to detect such an issue.
There are chances that this clock issue may go unnoticed unless there is a specific monitoring for such an error going on. If the clock error goes unnoticed (probably that's what happened in this case as well) and wrong clock timestamp persists  and gets propagated to replicas, how would enabling the flag help after that?


http://gerrit.cloudera.org:8080/#/c/19473/1//COMMIT_MSG@46
PS1, Line 46: opeations
nit: operations


http://gerrit.cloudera.org:8080/#/c/19473/1/src/kudu/clock/hybrid_clock.h
File src/kudu/clock/hybrid_clock.h:

http://gerrit.cloudera.org:8080/#/c/19473/1/src/kudu/clock/hybrid_clock.h@238
PS1, Line 238: captured 
nit: Did you mean "captured at" ?


http://gerrit.cloudera.org:8080/#/c/19473/1/src/kudu/clock/hybrid_clock.cc
File src/kudu/clock/hybrid_clock.cc:

http://gerrit.cloudera.org:8080/#/c/19473/1/src/kudu/clock/hybrid_clock.cc@165
PS1, Line 165: 15 * 60
Any specific reason to choose 15 minutes of max allowed divergence?



-- 
To view, visit http://gerrit.cloudera.org:8080/19473
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I630783653717d975a9b2ad668e8bd47b7796d275
Gerrit-Change-Number: 19473
Gerrit-PatchSet: 1
Gerrit-Owner: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Ashwani Raina <ar...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Yingchun Lai <la...@apache.org>
Gerrit-Reviewer: Yuqi Du <sh...@gmail.com>
Gerrit-Comment-Date: Mon, 06 Feb 2023 06:21:41 +0000
Gerrit-HasComments: Yes

[kudu-CR] [clock] add sanity check to detect wall clock jumps

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Zoltan Chovan, Yuqi Du, Ashwani Raina, Yingchun Lai, Attila Bukor, Kudu Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/19473

to look at the new patch set (#5).

Change subject: [clock] add sanity check to detect wall clock jumps
......................................................................

[clock] add sanity check to detect wall clock jumps

There was a case when a timestamp read from system/local clock using
the ntp_adjtime() call jumped 40+ years ahead when running kudu-tserver
on an Azure VM, while ntp_adjtime() didn't report an error on clock
being unsynchronized. That came along with a huge number of kernel
messages, and other software (such as the SASL library used by SSSD)
detected the strange jump in the local clock as well.  My multiple
attempts to reproduce the issue on a real hardware node, Dockerized
environment run at a real hardware server in a datacenter, and GCE & EC2
VMs were not successful.

This patch adds a sanity check to detect such strange jumps in wall
clock readings.  The idea is to rely on the readings from the
CLOCK_MONOTONIC clock captured along with the wall clock readings.
A jump should manifest itself in big difference between the wall clock
delta and the corresponding CLOCK_MONOTONIC delta.  If such a condition
is detected, then HybridClock::NowWithErrorUnlocked() dumps diagnostic
information about clock synchronisation status and aborts.

This patch also adds a unit test for the newly added functionality.

As a part of this changelist, the following new flags are introduced:
  * --wall_clock_jump_detection
      The flag to control the newly introduced sanity check for readings
      of the wall clock.  Acceptable values are 'auto', 'enabled,
      and 'disabled'.  Set to 'auto' by default, meaning the sanity
      check for timestamps is enabled if the process detects that it's
      running on an Azure VM.
  * --wall_clock_jump_threshold_s
      The threshold (in seconds) for the difference in corresponding
      deltas of the wall clock's and CLOCK_MONOTONIC_RAW clock's
      readings.  Set to 900 (15 minutes) by default.

The reasoning behind having --wall_clock_jump_detection=auto by
default is to skip an extra check at the majority of nodes out there
since NTP-synchronized system clock isn't supposed to jump that much
at all.  However, since a problem is observed at some inadequate VMs
such as ones in Azure cloud, it's now enabled automatically on such
nodes to detect such issues.  If the clock jump went unnoticed, the
timestamp would be persisted with an operation in the WAL and propagated
to other replicas.  That could lead to crashes during tablet
bootstrapping, and would require manual intervention to remove
the orphaned operations with out-of-whack timestamps from the WAL.

To assess the performance hit of the jump detection, I also added
a parameterized HybridClockJumpProtectionTest.BasicPerf scenario
to compare how fast HybridClock::NowWithError() method runs with
and without the clock jump detection.

I found the difference in runtimes to be non-material, especially for
the RELEASE build.  The details are below.

I used the command below, running it 10 times for DEBUG and RELEASE
builds to run the test scenarios.  I recorded the timings output by
the test scenarios and computed the average time taken to call invoke
HybridClock::NowWithError() 2000000 times.

  KUDU_ALLOW_SLOW_TESTS=1 ./bin/hybrid_clock-test --gtest_filter='*Perf*'

  DEBUG build:

  without clock jump detection:

    2000000 iterations in 1299.592535 ms
    2000000 iterations in 1232.211616 ms
    2000000 iterations in 1247.237539 ms
    2000000 iterations in 1242.360072 ms
    2000000 iterations in 1368.855273 ms
    2000000 iterations in 1334.929966 ms
    2000000 iterations in 1325.736948 ms
    2000000 iterations in 1332.034905 ms
    2000000 iterations in 1291.398501 ms
    2000000 iterations in 1308.596095 ms

    average: 1298.2953450 ms

  with clock jump detection:

    2000000 iterations in 1281.678966 ms
    2000000 iterations in 1230.120822 ms
    2000000 iterations in 1295.044968 ms
    2000000 iterations in 1234.285532 ms
    2000000 iterations in 1365.331334 ms
    2000000 iterations in 1358.448343 ms
    2000000 iterations in 1336.828018 ms
    2000000 iterations in 1352.825380 ms
    2000000 iterations in 1260.334167 ms
    2000000 iterations in 1343.749751 ms

    average: 1305.8647281 ms

  -------------------------------------------------------------------

  RELEASE build:

  without clock jump detection:

    2000000 iterations in 288.288766 ms
    2000000 iterations in 288.252244 ms
    2000000 iterations in 289.027414 ms
    2000000 iterations in 288.729931 ms
    2000000 iterations in 294.876418 ms
    2000000 iterations in 288.053229 ms
    2000000 iterations in 291.324382 ms
    2000000 iterations in 289.249022 ms
    2000000 iterations in 288.665515 ms
    2000000 iterations in 288.629355 ms

    average: 289.5096276 ms

  with clock jump detection:

    2000000 iterations in 288.331674 ms
    2000000 iterations in 288.380279 ms
    2000000 iterations in 289.147557 ms
    2000000 iterations in 288.010613 ms
    2000000 iterations in 294.012716 ms
    2000000 iterations in 288.398767 ms
    2000000 iterations in 288.180751 ms
    2000000 iterations in 288.699496 ms
    2000000 iterations in 288.403666 ms
    2000000 iterations in 288.395125 ms

    average: 288.9960644 ms

Change-Id: I630783653717d975a9b2ad668e8bd47b7796d275
---
M src/kudu/clock/hybrid_clock-test.cc
M src/kudu/clock/hybrid_clock.cc
M src/kudu/clock/hybrid_clock.h
M src/kudu/clock/system_ntp.h
M src/kudu/server/server_base.cc
M src/kudu/server/server_base.h
6 files changed, 374 insertions(+), 40 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/73/19473/5
-- 
To view, visit http://gerrit.cloudera.org:8080/19473
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I630783653717d975a9b2ad668e8bd47b7796d275
Gerrit-Change-Number: 19473
Gerrit-PatchSet: 5
Gerrit-Owner: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Ashwani Raina <ar...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Yingchun Lai <la...@apache.org>
Gerrit-Reviewer: Yuqi Du <sh...@gmail.com>
Gerrit-Reviewer: Zoltan Chovan <zc...@cloudera.com>

[kudu-CR] [clock] add sanity check to detect wall clock jumps

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has removed a vote on this change.

Change subject: [clock] add sanity check to detect wall clock jumps
......................................................................


Removed Verified-1 by Kudu Jenkins (120)
-- 
To view, visit http://gerrit.cloudera.org:8080/19473
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: deleteVote
Gerrit-Change-Id: I630783653717d975a9b2ad668e8bd47b7796d275
Gerrit-Change-Number: 19473
Gerrit-PatchSet: 5
Gerrit-Owner: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Ashwani Raina <ar...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Yingchun Lai <la...@apache.org>
Gerrit-Reviewer: Yuqi Du <sh...@gmail.com>
Gerrit-Reviewer: Zoltan Chovan <zc...@cloudera.com>

[kudu-CR] [clock] add sanity check to detect wall clock jumps

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Zoltan Chovan, Yuqi Du, Ashwani Raina, Yingchun Lai, Attila Bukor, Kudu Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/19473

to look at the new patch set (#6).

Change subject: [clock] add sanity check to detect wall clock jumps
......................................................................

[clock] add sanity check to detect wall clock jumps

There was a case when a timestamp read from system/local clock using
the ntp_adjtime() call jumped 40+ years ahead when running kudu-tserver
on an Azure VM, while ntp_adjtime() didn't report an error on clock
being unsynchronized or any other error at all. That came along with
a huge number of kernel messages, and other software (such as the SASL
library used by SSSD) detected the strange jump in the local clock
as well.  My multiple attempts to reproduce the issue on a real
hardware node, Dockerized environment run at a real hardware server
in a datacenter, and GCE & EC2 VMs were not successful.

This patch adds a sanity check to detect such strange jumps in wall
clock readings.  The idea is to rely on the readings from the
CLOCK_MONOTONIC clock captured along with the wall clock readings.
A jump should manifest itself in a big difference between the wall clock
delta and the corresponding CLOCK_MONOTONIC delta.  If such a condition
is detected, then HybridClock::NowWithErrorUnlocked() dumps diagnostic
information about clock NTP synchronisation status and returns
Status::ServiceUnavailable() with appropriate error message.

This patch also adds a unit test for the newly added functionality.

As a part of this changelist, the following new flags are introduced:
  * --wall_clock_jump_detection
      This is to control the newly introduced sanity check for readings
      of the wall clock.  Acceptable values are 'auto', 'enabled,
      and 'disabled'.  It is set to 'auto' by default, meaning the
      sanity check for timestamps is enabled if the process detects
      that it's running on a VM in Azure cloud.
  * --wall_clock_jump_threshold_s
      This is to control the threshold (in seconds) for the difference
      in deltas of the wall clock's and CLOCK_MONOTONIC_RAW clock's
      readings.  It is set to 900 (15 minutes) by default.

The reasoning behind having --wall_clock_jump_detection=auto by
default is to skip an extra check at the majority of nodes out there
since NTP-synchronized system clock isn't supposed to jump that much
at all.  However, since the issue has been observed only at VMs in Azure
cloud, it's enabled automatically at Azure nodes to detect the issue.
If the clock jump went unnoticed, the timestamp would be persisted with
an operation in the WAL and propagated to other replicas.  That could
lead to crashes during tablet bootstrapping and would require manual
intervention to remove the orphaned operations with out-of-whack
timestamps from the WALs of affected tablet replicas.

To assess the performance hit induced by the jump clock detection,
I also added a parameterized HybridClockJumpProtectionTest.BasicPerf
scenario to compare how fast HybridClock::NowWithError() runs
with and without the newly introduced sanity check.

I found that with sanity check enabled the method runs 1.12 times slower
than without the check.  The details are below.

To collect stats on the performance of the HybridClock::NowWithError()
method, I used the command below, running it 10 times for RELEASE and
DEBUG builds to run the new test scenario.  I recorded the timings
output by the test scenarios and computed the average time taken to
invoke HybridClock::NowWithError() 2000000 times.

  KUDU_ALLOW_SLOW_TESTS=1 ./bin/hybrid_clock-test --gtest_filter='*Perf*'

  -------------------------------------------------------------------

  RELEASE build:

  without clock jump detection:
    1000000 iterations in 158.553514 ms
    1000000 iterations in 159.200062 ms
    1000000 iterations in 157.840572 ms
    1000000 iterations in 159.079716 ms
    1000000 iterations in 161.235647 ms

    average: 159.181902 ms

  with clock jump detection:
    1000000 iterations in 178.415508 ms
    1000000 iterations in 178.102677 ms
    1000000 iterations in 177.465617 ms
    1000000 iterations in 180.552167 ms
    1000000 iterations in 179.644688 ms

    average: 178.836131 ms

  -------------------------------------------------------------------

  DEBUG build:

  without clock jump detection:
    1000000 iterations in 592.051281 ms
    1000000 iterations in 591.087500 ms
    1000000 iterations in 597.142767 ms
    1000000 iterations in 591.066335 ms
    1000000 iterations in 590.717687 ms

    average: 592.413114 ms

  with clock jump detection:
    1000000 iterations in 682.691815 ms
    1000000 iterations in 682.087873 ms
    1000000 iterations in 685.602277 ms
    1000000 iterations in 683.057099 ms
    1000000 iterations in 682.095996 ms

    average: 683.107012 ms

  -------------------------------------------------------------------

RELEASE, running under perf stat without clock jump detection:

 Performance counter stats for './bin/hybrid_clock-test --gtest_filter=Perf/HybridClockJumpProtectionTest.BasicPerf/0' (10 runs):

        171.499485 task-clock                #    0.999 CPUs utilized            ( +-  0.18% )
                 3 context-switches          #    0.016 K/sec                    ( +-  4.76% )
                 0 cpu-migrations            #    0.000 K/sec
             2,829 page-faults               #    0.016 M/sec                    ( +-  1.92% )
       562,222,108 cycles                    #    3.278 GHz                      ( +-  0.18% )
       698,278,773 instructions              #    1.24  insns per cycle          ( +-  0.09% )
       122,793,006 branches                  #  715.996 M/sec                    ( +-  0.11% )
           319,525 branch-misses             #    0.26% of all branches          ( +-  0.16% )

       0.171751822 seconds time elapsed                                          ( +-  0.18% )

  -------------------------------------------------------------------

RELEASE, running under perf stat with clock jump detection:

 Performance counter stats for './bin/hybrid_clock-test --gtest_filter=Perf/HybridClockJumpProtectionTest.BasicPerf/1' (10 runs):

        193.189743 task-clock                #    0.999 CPUs utilized            ( +-  0.84% )
                 3 context-switches          #    0.015 K/sec                    ( +-  8.05% )
                 0 cpu-migrations            #    0.000 K/sec
             2,741 page-faults               #    0.014 M/sec                    ( +-  2.35% )
       627,522,941 cycles                    #    3.248 GHz                      ( +-  0.22% )
       797,617,361 instructions              #    1.27  insns per cycle          ( +-  0.13% )
       138,636,629 branches                  #  717.619 M/sec                    ( +-  0.13% )
           339,705 branch-misses             #    0.25% of all branches          ( +-  0.31% )

       0.193459251 seconds time elapsed                                          ( +-  0.84% )

  -------------------------------------------------------------------

Change-Id: I630783653717d975a9b2ad668e8bd47b7796d275
---
M src/kudu/clock/hybrid_clock-test.cc
M src/kudu/clock/hybrid_clock.cc
M src/kudu/clock/hybrid_clock.h
M src/kudu/clock/system_ntp.h
M src/kudu/server/server_base.cc
M src/kudu/server/server_base.h
6 files changed, 377 insertions(+), 40 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/73/19473/6
-- 
To view, visit http://gerrit.cloudera.org:8080/19473
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I630783653717d975a9b2ad668e8bd47b7796d275
Gerrit-Change-Number: 19473
Gerrit-PatchSet: 6
Gerrit-Owner: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Ashwani Raina <ar...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Yingchun Lai <la...@apache.org>
Gerrit-Reviewer: Yuqi Du <sh...@gmail.com>
Gerrit-Reviewer: Zoltan Chovan <zc...@cloudera.com>

[kudu-CR] [clock] add sanity check to detect wall clock jumps

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/19473 )

Change subject: [clock] add sanity check to detect wall clock jumps
......................................................................


Patch Set 8:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/19473/8//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/19473/8//COMMIT_MSG@37
PS8, Line 37: _s
> nit: _sec?
Done


http://gerrit.cloudera.org:8080/#/c/19473/8/src/kudu/server/server_base.h
File src/kudu/server/server_base.h:

http://gerrit.cloudera.org:8080/#/c/19473/8/src/kudu/server/server_base.h@235
PS8, Line 235: _s
> nit: _sec
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/19473
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I630783653717d975a9b2ad668e8bd47b7796d275
Gerrit-Change-Number: 19473
Gerrit-PatchSet: 8
Gerrit-Owner: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Abhishek Chennaka <ac...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Ashwani Raina <ar...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Yingchun Lai <la...@apache.org>
Gerrit-Reviewer: Yuqi Du <sh...@gmail.com>
Gerrit-Reviewer: Zoltan Chovan <zc...@cloudera.com>
Gerrit-Comment-Date: Fri, 31 Mar 2023 21:45:20 +0000
Gerrit-HasComments: Yes

[kudu-CR] [clock] add sanity check to detect wall clock jumps

Posted by "Abhishek Chennaka (Code Review)" <ge...@cloudera.org>.
Abhishek Chennaka has posted comments on this change. ( http://gerrit.cloudera.org:8080/19473 )

Change subject: [clock] add sanity check to detect wall clock jumps
......................................................................


Patch Set 8:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/19473/8//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/19473/8//COMMIT_MSG@37
PS8, Line 37: _s
nit: _sec?


http://gerrit.cloudera.org:8080/#/c/19473/8/src/kudu/server/server_base.h
File src/kudu/server/server_base.h:

http://gerrit.cloudera.org:8080/#/c/19473/8/src/kudu/server/server_base.h@235
PS8, Line 235: _s
nit: _sec



-- 
To view, visit http://gerrit.cloudera.org:8080/19473
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I630783653717d975a9b2ad668e8bd47b7796d275
Gerrit-Change-Number: 19473
Gerrit-PatchSet: 8
Gerrit-Owner: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Abhishek Chennaka <ac...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Ashwani Raina <ar...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Yingchun Lai <la...@apache.org>
Gerrit-Reviewer: Yuqi Du <sh...@gmail.com>
Gerrit-Reviewer: Zoltan Chovan <zc...@cloudera.com>
Gerrit-Comment-Date: Fri, 31 Mar 2023 21:18:55 +0000
Gerrit-HasComments: Yes

[kudu-CR] [clock] add sanity check to detect wall clock jumps

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/19473 )

Change subject: [clock] add sanity check to detect wall clock jumps
......................................................................

[clock] add sanity check to detect wall clock jumps

There was a case when a timestamp read from system/local clock using
the ntp_adjtime() call jumped 40+ years ahead when running kudu-tserver
on an Azure VM, while ntp_adjtime() didn't report an error on clock
being unsynchronized or any other error at all. That came along with
a huge number of kernel messages, and other software (such as the SASL
library used by SSSD) detected the strange jump in the local clock
as well.  My multiple attempts to reproduce the issue on a real
hardware node, Dockerized environment run at a real hardware server
in a datacenter, and GCE & EC2 VMs were not successful.

This patch adds a sanity check to detect such strange jumps in wall
clock readings.  The idea is to rely on the readings from the
CLOCK_MONOTONIC_RAW clock captured along with the wall clock readings.
A jump should manifest itself in a big difference between the wall clock
delta and the corresponding CLOCK_MONOTONIC_RAW delta.  If such a
condition is detected, then HybridClock::NowWithErrorUnlocked() dumps
diagnostic information about clock NTP synchronisation status and
returns Status::ServiceUnavailable() with appropriate error message.

This patch also adds a unit test for the newly added functionality.

As a part of this changelist, the following new flags are introduced:
  * --wall_clock_jump_detection
      This is to control the newly introduced sanity check for readings
      of the wall clock.  Acceptable values are 'auto', 'enabled,
      and 'disabled'.  It is set to 'auto' by default, meaning the
      sanity check for timestamps is enabled if the process detects
      that it's running on a VM in Azure cloud.
  * --wall_clock_jump_threshold_sec
      This is to control the threshold (in seconds) for the difference
      in deltas of the wall clock's and CLOCK_MONOTONIC_RAW clock's
      readings.  It is set to 900 (15 minutes) by default.

The reasoning behind having --wall_clock_jump_detection=auto by
default is to skip an extra check at the majority of nodes out there
since NTP-synchronized system clock isn't supposed to jump that much
at all.  However, since the issue has been observed only at VMs in Azure
cloud, it's enabled automatically at Azure nodes to detect the issue.
If the clock jump went unnoticed, the timestamp would be persisted with
an operation in the WAL and propagated to other replicas.  That could
lead to crashes during tablet bootstrapping and would require manual
intervention to remove the orphaned operations with out-of-whack
timestamps from the WALs of affected tablet replicas.

To assess the performance hit induced by the jump clock detection,
I also added a parameterized HybridClockJumpProtectionTest.BasicPerf
scenario to compare how fast HybridClock::NowWithError() runs
with and without the newly introduced sanity check.

I found that with sanity check enabled the method runs 1.12 times slower
than without the check.  The details are below.

To collect stats on the performance of the HybridClock::NowWithError()
method, I used the command below, running it 10 times for RELEASE and
DEBUG builds to run the new test scenario.  I recorded the timings
output by the test scenarios and computed the average time taken to
invoke HybridClock::NowWithError() 1000000 times.

  KUDU_ALLOW_SLOW_TESTS=1 ./bin/hybrid_clock-test --gtest_filter='*Perf*'

  -------------------------------------------------------------------

  RELEASE build:

  without clock jump detection:
    1000000 iterations in 158.553514 ms
    1000000 iterations in 159.200062 ms
    1000000 iterations in 157.840572 ms
    1000000 iterations in 159.079716 ms
    1000000 iterations in 161.235647 ms

    average: 159.181902 ms

  with clock jump detection:
    1000000 iterations in 178.415508 ms
    1000000 iterations in 178.102677 ms
    1000000 iterations in 177.465617 ms
    1000000 iterations in 180.552167 ms
    1000000 iterations in 179.644688 ms

    average: 178.836131 ms

  -------------------------------------------------------------------

  DEBUG build:

  without clock jump detection:
    1000000 iterations in 592.051281 ms
    1000000 iterations in 591.087500 ms
    1000000 iterations in 597.142767 ms
    1000000 iterations in 591.066335 ms
    1000000 iterations in 590.717687 ms

    average: 592.413114 ms

  with clock jump detection:
    1000000 iterations in 682.691815 ms
    1000000 iterations in 682.087873 ms
    1000000 iterations in 685.602277 ms
    1000000 iterations in 683.057099 ms
    1000000 iterations in 682.095996 ms

    average: 683.107012 ms

  -------------------------------------------------------------------

RELEASE, running under perf stat without clock jump detection:

 Performance counter stats for './bin/hybrid_clock-test --gtest_filter=Perf/HybridClockJumpProtectionTest.BasicPerf/0' (10 runs):

        171.499485 task-clock                #    0.999 CPUs utilized            ( +-  0.18% )
                 3 context-switches          #    0.016 K/sec                    ( +-  4.76% )
                 0 cpu-migrations            #    0.000 K/sec
             2,829 page-faults               #    0.016 M/sec                    ( +-  1.92% )
       562,222,108 cycles                    #    3.278 GHz                      ( +-  0.18% )
       698,278,773 instructions              #    1.24  insns per cycle          ( +-  0.09% )
       122,793,006 branches                  #  715.996 M/sec                    ( +-  0.11% )
           319,525 branch-misses             #    0.26% of all branches          ( +-  0.16% )

       0.171751822 seconds time elapsed                                          ( +-  0.18% )

  -------------------------------------------------------------------

RELEASE, running under perf stat with clock jump detection:

 Performance counter stats for './bin/hybrid_clock-test --gtest_filter=Perf/HybridClockJumpProtectionTest.BasicPerf/1' (10 runs):

        193.189743 task-clock                #    0.999 CPUs utilized            ( +-  0.84% )
                 3 context-switches          #    0.015 K/sec                    ( +-  8.05% )
                 0 cpu-migrations            #    0.000 K/sec
             2,741 page-faults               #    0.014 M/sec                    ( +-  2.35% )
       627,522,941 cycles                    #    3.248 GHz                      ( +-  0.22% )
       797,617,361 instructions              #    1.27  insns per cycle          ( +-  0.13% )
       138,636,629 branches                  #  717.619 M/sec                    ( +-  0.13% )
           339,705 branch-misses             #    0.25% of all branches          ( +-  0.31% )

       0.193459251 seconds time elapsed                                          ( +-  0.84% )

  -------------------------------------------------------------------

Change-Id: I630783653717d975a9b2ad668e8bd47b7796d275
Reviewed-on: http://gerrit.cloudera.org:8080/19473
Tested-by: Alexey Serbin <al...@apache.org>
Reviewed-by: Yingchun Lai <la...@apache.org>
---
M src/kudu/clock/hybrid_clock-test.cc
M src/kudu/clock/hybrid_clock.cc
M src/kudu/clock/hybrid_clock.h
M src/kudu/clock/system_ntp.h
M src/kudu/server/server_base.cc
M src/kudu/server/server_base.h
6 files changed, 377 insertions(+), 40 deletions(-)

Approvals:
  Alexey Serbin: Verified
  Yingchun Lai: Looks good to me, approved

-- 
To view, visit http://gerrit.cloudera.org:8080/19473
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I630783653717d975a9b2ad668e8bd47b7796d275
Gerrit-Change-Number: 19473
Gerrit-PatchSet: 10
Gerrit-Owner: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Abhishek Chennaka <ac...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Ashwani Raina <ar...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Yingchun Lai <la...@apache.org>
Gerrit-Reviewer: Yuqi Du <sh...@gmail.com>
Gerrit-Reviewer: Zoltan Chovan <zc...@cloudera.com>

[kudu-CR] [clock] add sanity check to detect wall clock jumps

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/19473 )

Change subject: [clock] add sanity check to detect wall clock jumps
......................................................................


Patch Set 9: Verified+1

unrelated test failure (known flaky test RaftConsensusITest.TestSlowFollower)


-- 
To view, visit http://gerrit.cloudera.org:8080/19473
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I630783653717d975a9b2ad668e8bd47b7796d275
Gerrit-Change-Number: 19473
Gerrit-PatchSet: 9
Gerrit-Owner: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Abhishek Chennaka <ac...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <al...@apache.org>
Gerrit-Reviewer: Ashwani Raina <ar...@cloudera.com>
Gerrit-Reviewer: Attila Bukor <ab...@apache.org>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Yingchun Lai <la...@apache.org>
Gerrit-Reviewer: Yuqi Du <sh...@gmail.com>
Gerrit-Reviewer: Zoltan Chovan <zc...@cloudera.com>
Gerrit-Comment-Date: Fri, 31 Mar 2023 23:24:39 +0000
Gerrit-HasComments: No