Posted to issues@kudu.apache.org by "Alexey Serbin (Jira)" <ji...@apache.org> on 2023/06/13 05:29:00 UTC
[jira] [Resolved] (KUDU-2906) Don't allow elections when server clocks are too out of sync

     [ https://issues.apache.org/jira/browse/KUDU-2906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Serbin resolved KUDU-2906.
---------------------------------
    Fix Version/s: 1.17.0
       Resolution: Implemented

The generic provision to detect system/local clock jumps has been implemented in [555854178b9b498701619f4bb0dbbbbeab8e69e7|https://github.com/apache/kudu/commit/555854178b9b498701619f4bb0dbbbbeab8e69e7].  By default, it is enabled only on Azure VM instances, but it can be turned on anywhere: see the commit description for details.

With that, I'm resolving this JIRA item.

> Don't allow elections when server clocks are too out of sync
> ------------------------------------------------------------
>
>                 Key: KUDU-2906
>                 URL: https://issues.apache.org/jira/browse/KUDU-2906
>             Project: Kudu
>          Issue Type: Bug
>          Components: consensus
>    Affects Versions: 1.10.0
>            Reporter: Andrew Wong
>            Priority: Major
>             Fix For: 1.17.0
>
>
> In cases where machine clocks are not properly synchronized, if a tablet replica whose clock happens to be very far in the future (greater than --max_clock_sync_error_usec=10 sec) is elected leader, it's possible that any writes that go to that tablet will be rejected by the followers, but persisted to the leader's WAL.
> Then, upon fixing the clock on that machine, the replica may try to replay the future op, but fail to replay it because the op timestamp is too far in the future, with errors like:
> {code:java}
> F0715 12:03:09.369819  3500 tablet_bootstrap.cc:904] Check failed: _s.ok() Bad status: Invalid argument: Tried to update clock beyond the max. error.{code}
> Dumping a recovery WAL, I could see:
> {code:java}
> 130.138@6400743143334211584 REPLICATE NO_OP
> id { term: 130 index: 138 } timestamp: 6400743143334211584 op_type: NO_OP noop_request { }
> COMMIT 130.138
> op_type: NO_OP commited_op_id { term: 130 index: 138 }
> 131.139@6400743925559676928 REPLICATE NO_OP
> id { term: 131 index: 139 } timestamp: 6400743925559676928 op_type: NO_OP noop_request { }
> COMMIT 131.139
> op_type: NO_OP commited_op_id { term: 131 index: 139 }
> 132.140@11589864471731939930 REPLICATE NO_OP
> id { term: 132 index: 140 } timestamp: 11589864471731939930 op_type: NO_OP noop_request { }{code}
> Note the drastic jump in timestamp.
> In this specific case, we verified that the replayed WAL wasn't that far behind the recovery WAL, which had the future timestamps, so we could just delete the recovery WAL and bootstrap from the replayed WAL.
> It would have been nice had those bad ops not been written at all, maybe by preventing an election between such mismatched servers in the first place.
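
To make the timestamp jump in the quoted WAL dump easier to see, here is a minimal Python sketch that decodes the three timestamps. It assumes Kudu's hybrid timestamp layout (physical microseconds since the Unix epoch in the high bits, a 12-bit logical counter in the low bits). The helper names are hypothetical and not part of Kudu, and is_too_far_ahead() only approximates the kind of comparison behind the "beyond the max. error" failure; it is not the server's actual code.

```python
from datetime import datetime, timedelta, timezone

# Assumption: hybrid timestamps keep a logical counter in the low 12 bits
# and physical microseconds since the Unix epoch in the remaining high bits.
LOGICAL_BITS = 12

# Value quoted in the issue description: 10 seconds, in microseconds.
MAX_CLOCK_SYNC_ERROR_USEC = 10_000_000

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def physical_micros(hybrid_ts: int) -> int:
    """Return the physical (wall-clock) component in microseconds."""
    return hybrid_ts >> LOGICAL_BITS


def to_utc(hybrid_ts: int) -> datetime:
    """Convert the physical component to a UTC datetime."""
    return EPOCH + timedelta(microseconds=physical_micros(hybrid_ts))


def is_too_far_ahead(hybrid_ts: int, local_now_micros: int,
                     max_error_usec: int = MAX_CLOCK_SYNC_ERROR_USEC) -> bool:
    """Hypothetical check: is the op's physical time further ahead of the
    local clock reading than the allowed maximum error?"""
    return physical_micros(hybrid_ts) - local_now_micros > max_error_usec


# Timestamps copied from the WAL dump in the issue description.
wal_timestamps = {
    "130.138": 6400743143334211584,
    "131.139": 6400743925559676928,
    "132.140": 11589864471731939930,
}

if __name__ == "__main__":
    for op_id, ts in wal_timestamps.items():
        print(op_id, to_utc(ts).isoformat())
```

Under that assumption, ops 130.138 and 131.139 decode to July 2019, which is consistent with the date in the crash log, while op 132.140 decodes to the year 2059. Any replica with a correctly set clock would therefore see op 132.140 as decades beyond the 10-second limit.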


