You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@kudu.apache.org by to...@apache.org on 2016/03/08 21:26:46 UTC
[3/3] incubator-kudu git commit: client_failover-itest: fix flakiness with opid mismatches

client_failover-itest: fix flakiness with opid mismatches

This fixes a common source of flakiness, particular in TSAN builds. The issue
was that we were assuming that, if the TestWorkload wrote N batches, that would
correspond exactly to N log operations on the server side. That actually isn't
the case -- there are some interleavings in which the client 'Batcher' can
split a single Flush call into multiple RPCs, and we don't make any strong
guarantees that a Flush is atomic, even though it is almost all the time.

The fix is simple: switch to single-row batches, which can't be split up
into the client.

On an earlier revision of this patch, I was able to run the
DeleteLeaderWhileScanning tests 5000 times in TSAN with only a few failures[1].
I ran 1000 on the latest revision[2]. The remaining failures seem to be an
unrelated data race on RaftConsensus shutdown.

[1] http://dist-test.cloudera.org/job?job_id=todd.1457413245.29963
[2] http://dist-test.cloudera.org/job?job_id=todd.1457464975.21171

Change-Id: Ib3df1b3f5b0903f069a5e7ae3ba2a64c1c52a427
Reviewed-on: http://gerrit.cloudera.org:8080/2479
Reviewed-by: Mike Percy <mp...@apache.org>
Tested-by: Todd Lipcon <to...@apache.org>


Project: http://git-wip-us.apache.org/repos/asf/incubator-kudu/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-kudu/commit/ebb37133
Tree: http://git-wip-us.apache.org/repos/asf/incubator-kudu/tree/ebb37133
Diff: http://git-wip-us.apache.org/repos/asf/incubator-kudu/diff/ebb37133

Branch: refs/heads/master
Commit: ebb371335c3ccc481d2e6f16da8b1193bb2c7af0
Parents: d24aa18
Author: Todd Lipcon <to...@apache.org>
Authored: Mon Mar 7 19:10:59 2016 -0800
Committer: Todd Lipcon <to...@apache.org>
Committed: Tue Mar 8 20:26:30 2016 +0000

----------------------------------------------------------------------
 src/kudu/integration-tests/client_failover-itest.cc | 7 +++++++
 src/kudu/integration-tests/test_workload.h          | 3 +++
 2 files changed, 10 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-kudu/blob/ebb37133/src/kudu/integration-tests/client_failover-itest.cc
----------------------------------------------------------------------
diff --git a/src/kudu/integration-tests/client_failover-itest.cc b/src/kudu/integration-tests/client_failover-itest.cc
index eb57fcc..de32b5f 100644
--- a/src/kudu/integration-tests/client_failover-itest.cc
+++ b/src/kudu/integration-tests/client_failover-itest.cc
@@ -69,6 +69,12 @@ TEST_P(ClientFailoverParamITest, TestDeleteLeaderWhileScanning) {
   // Create the test table.
   TestWorkload workload(cluster_.get());
   workload.set_write_timeout_millis(kTimeout.ToMilliseconds());
+  // We count on each flush from the client corresponding to exactly one
+  // consensus operation in this test. If we use batches with more than one row,
+  // the client is allowed to (and will on rare occasion) break the batches
+  // up into more than one write request, resulting in more than one op
+  // in the log.
+  workload.set_write_batch_size(1);
   workload.Setup();
 
   // Figure out the tablet id.
@@ -120,6 +126,7 @@ TEST_P(ClientFailoverParamITest, TestDeleteLeaderWhileScanning) {
     SleepFor(MonoDelta::FromMilliseconds(10));
   }
   workload.StopAndJoin();
+  LOG(INFO) << "workload completed " << workload.batches_completed() << " batches";
 
   // We don't want the leader that takes over after we kill the first leader to
   // be unsure whether the writes have been committed, so wait until all

http://git-wip-us.apache.org/repos/asf/incubator-kudu/blob/ebb37133/src/kudu/integration-tests/test_workload.h
----------------------------------------------------------------------
diff --git a/src/kudu/integration-tests/test_workload.h b/src/kudu/integration-tests/test_workload.h
index 4d106ed..307ae92 100644
--- a/src/kudu/integration-tests/test_workload.h
+++ b/src/kudu/integration-tests/test_workload.h
@@ -119,6 +119,9 @@ class TestWorkload {
 
   // Return the number of batches in which we have successfully inserted at
   // least one row.
+  // NOTE: it is not safe to assume that this is exactly equal to the number
+  // of log operations generated on the TS side. The client may split a single
+  // Flush() call into multiple batches.
   int64_t batches_completed() const {
     return batches_completed_.Load();
   }