You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@kudu.apache.org by to...@apache.org on 2016/03/08 21:26:46 UTC
[3/3] incubator-kudu git commit: client_failover-itest: fix flakiness
with opid mismatches
client_failover-itest: fix flakiness with opid mismatches
This fixes a common source of flakiness, particular in TSAN builds. The issue
was that we were assuming that, if the TestWorkload wrote N batches, that would
correspond exactly to N log operations on the server side. That actually isn't
the case -- there are some interleavings in which the client 'Batcher' can
split a single Flush call into multiple RPCs, and we don't make any strong
guarantees that a Flush is atomic, even though it is almost all the time.
The fix is simple: switch to single-row batches, which can't be split up
into the client.
On an earlier revision of this patch, I was able to run the
DeleteLeaderWhileScanning tests 5000 times in TSAN with only a few failures[1].
I ran 1000 on the latest revision[2]. The remaining failures seem to be an
unrelated data race on RaftConsensus shutdown.
[1] http://dist-test.cloudera.org/job?job_id=todd.1457413245.29963
[2] http://dist-test.cloudera.org/job?job_id=todd.1457464975.21171
Change-Id: Ib3df1b3f5b0903f069a5e7ae3ba2a64c1c52a427
Reviewed-on: http://gerrit.cloudera.org:8080/2479
Reviewed-by: Mike Percy <mp...@apache.org>
Tested-by: Todd Lipcon <to...@apache.org>
Project: http://git-wip-us.apache.org/repos/asf/incubator-kudu/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-kudu/commit/ebb37133
Tree: http://git-wip-us.apache.org/repos/asf/incubator-kudu/tree/ebb37133
Diff: http://git-wip-us.apache.org/repos/asf/incubator-kudu/diff/ebb37133
Branch: refs/heads/master
Commit: ebb371335c3ccc481d2e6f16da8b1193bb2c7af0
Parents: d24aa18
Author: Todd Lipcon <to...@apache.org>
Authored: Mon Mar 7 19:10:59 2016 -0800
Committer: Todd Lipcon <to...@apache.org>
Committed: Tue Mar 8 20:26:30 2016 +0000
----------------------------------------------------------------------
src/kudu/integration-tests/client_failover-itest.cc | 7 +++++++
src/kudu/integration-tests/test_workload.h | 3 +++
2 files changed, 10 insertions(+)
----------------------------------------------------------------------
http://git-wip-us.apache.org/repos/asf/incubator-kudu/blob/ebb37133/src/kudu/integration-tests/client_failover-itest.cc
----------------------------------------------------------------------
diff --git a/src/kudu/integration-tests/client_failover-itest.cc b/src/kudu/integration-tests/client_failover-itest.cc
index eb57fcc..de32b5f 100644
--- a/src/kudu/integration-tests/client_failover-itest.cc
+++ b/src/kudu/integration-tests/client_failover-itest.cc
@@ -69,6 +69,12 @@ TEST_P(ClientFailoverParamITest, TestDeleteLeaderWhileScanning) {
// Create the test table.
TestWorkload workload(cluster_.get());
workload.set_write_timeout_millis(kTimeout.ToMilliseconds());
+ // We count on each flush from the client corresponding to exactly one
+ // consensus operation in this test. If we use batches with more than one row,
+ // the client is allowed to (and will on rare occasion) break the batches
+ // up into more than one write request, resulting in more than one op
+ // in the log.
+ workload.set_write_batch_size(1);
workload.Setup();
// Figure out the tablet id.
@@ -120,6 +126,7 @@ TEST_P(ClientFailoverParamITest, TestDeleteLeaderWhileScanning) {
SleepFor(MonoDelta::FromMilliseconds(10));
}
workload.StopAndJoin();
+ LOG(INFO) << "workload completed " << workload.batches_completed() << " batches";
// We don't want the leader that takes over after we kill the first leader to
// be unsure whether the writes have been committed, so wait until all
http://git-wip-us.apache.org/repos/asf/incubator-kudu/blob/ebb37133/src/kudu/integration-tests/test_workload.h
----------------------------------------------------------------------
diff --git a/src/kudu/integration-tests/test_workload.h b/src/kudu/integration-tests/test_workload.h
index 4d106ed..307ae92 100644
--- a/src/kudu/integration-tests/test_workload.h
+++ b/src/kudu/integration-tests/test_workload.h
@@ -119,6 +119,9 @@ class TestWorkload {
// Return the number of batches in which we have successfully inserted at
// least one row.
+ // NOTE: it is not safe to assume that this is exactly equal to the number
+ // of log operations generated on the TS side. The client may split a single
+ // Flush() call into multiple batches.
int64_t batches_completed() const {
return batches_completed_.Load();
}