You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Todd Lipcon (Code Review)" <ge...@cloudera.org> on 2016/11/03 03:16:57 UTC

[kudu-CR] KUDU-1735. Fix crash when aborting a skipped config change round

Hello David Ribeiro Alves, Mike Percy, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/4916

to look at the new patch set (#2).

Change subject: KUDU-1735. Fix crash when aborting a skipped config change round
......................................................................

KUDU-1735. Fix crash when aborting a skipped config change round

This fixes a crash seen in a test cluster when deleting a tablet
in which there is a pending (uncommitted) REPLICATE message for a
skipped CONFIG_CHANGE operation. The CONFIG_CHANGE was skipped because
the tablet already had a higher indexed configuration written to disk.

This can happen in a couple cases: one, exercised by a test included
here, is if a tablet server crashes just after committing a config
change to disk but before writing the COMMIT message to the WAL. If it's
then removed from the configuration and restarted, the config change
operation will stay pending until the tablet is eventually deleted, at
which point the crash will be triggered. The newly added test provokes
this condition and fails prior to the included bug fix.

The approach taken to fix this is as follows:

When aborting a config change operation, we only clear the 'pending'
config if we see that its index matches the index of the operation being
aborted. Otherwise, we ignore the abort (whereas we used to crash).

In order to achieve this, we have to remember the index of the pending
config, which previously wasn't set until after the commit. So, this
assignment is moved out of the commit path and into
AddPendingOperationUnlocked(). Changing the assumption of where the
index was set involved making a few corresponding changes to DCHECKs
elsewhere in the code.

There is a slight wire/upgrade compatibility issue here: previously the
leader would replicate config change messages with no opid set in the
'new_config'. Now, it does. I think this would cause DCHECKs to fire in
a mixed-version cluster, though no CHECKs. Additionally, we aren't
purporting to fully support rolling upgrade yet, so I don't think it's a
big deal. I was able to upgrade the test cluster which experienced this
issue in-place and it came up fine.

This patch removes raft_consensus_state-test, which had some various
assertions against setting pending configuration changes with opids
set. After looking at this test, I realized that it's purporting to test
ReplicaState but in fact all of the methods that it tests are just
pass-throughs to ConsensusMeta and thus are already covered by
consensus_meta-test. So, I figured there's no sense spending time
updating this redundant test code.

I ran raft_consensus-itest 1000 times[1] and the new test alone another
1000 times[2] to test.

[1] http://dist-test.cloudera.org/job?job_id=todd.1478141668.26073
[2] http://dist-test.cloudera.org/job?job_id=todd.1478141941.26651

Change-Id: I2b4e4eb1d60ef66527edf33357fabc22f4b5b32f
---
M src/kudu/consensus/CMakeLists.txt
M src/kudu/consensus/consensus_meta-test.cc
M src/kudu/consensus/quorum_util.cc
M src/kudu/consensus/raft_consensus.cc
D src/kudu/consensus/raft_consensus_state-test.cc
M src/kudu/consensus/raft_consensus_state.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
7 files changed, 120 insertions(+), 175 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/16/4916/2
-- 
To view, visit http://gerrit.cloudera.org:8080/4916
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2b4e4eb1d60ef66527edf33357fabc22f4b5b32f
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>