Posted to commits@kudu.apache.org by mp...@apache.org on 2017/09/07 22:24:03 UTC

kudu git commit: Fix flakiness of ts_tablet_manager_itest TestFailedTabletsAreReplaced

Repository: kudu
Updated Branches:
  refs/heads/master 01003ee6b -> 59e9a9287


Fix flakiness of ts_tablet_manager_itest TestFailedTabletsAreReplaced

TestFailedTabletsAreReplaced manually fails the replica after only
verifying that the tablet exists, with no regard for its state. This can
cause the replica's bootstrap process to fail a check:
F0907 00:05:46.153576  2697 tablet_replica.cc:173] Check failed: BOOTSTRAPPING == state_ (0 vs. 2)

This is a test-only race: the replica successfully completes its
bootstrap process, the test then fails the tablet, and
TabletReplica::Start() is called on the replica, which requires its
state to be BOOTSTRAPPING. This is not an issue in production, where
bootstrapping normally runs only if the replica has not failed, but it
caused 6/1000 test runs to fail in release mode with
--stress_cpu_threads=32.
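
The interleaving described above can be modeled with a toy state machine.
This is only an illustrative sketch, not Kudu code: ToyReplica, ToyState and
StartTripsCheck are hypothetical names, the enum values are picked to line up
with the "(0 vs. 2)" in the log line (treating 2 as the failed state is an
assumption), and an exception stands in for the fatal CHECK so that the
outcome can be observed.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical, simplified lifecycle states (values are an assumption).
enum class ToyState { BOOTSTRAPPING = 0, RUNNING = 1, FAILED = 2 };

// Hypothetical stand-in for a replica; not the real TabletReplica.
class ToyReplica {
 public:
  void BeginBootstrap() { state_ = ToyState::BOOTSTRAPPING; }

  // Test-only hook: mark the replica as failed regardless of its state.
  void Fail() { state_ = ToyState::FAILED; }

  // Start() is only valid while the replica is still BOOTSTRAPPING.
  // The sketch throws instead of aborting the process.
  void Start() {
    if (state_ != ToyState::BOOTSTRAPPING) {
      throw std::logic_error("Check failed: BOOTSTRAPPING == state_");
    }
    state_ = ToyState::RUNNING;
  }

  ToyState state() const { return state_; }

 private:
  ToyState state_ = ToyState::BOOTSTRAPPING;
};

// Returns true if Start() trips its precondition for the given ordering.
bool StartTripsCheck(bool fail_before_start) {
  ToyReplica replica;
  replica.BeginBootstrap();
  assert(replica.state() == ToyState::BOOTSTRAPPING);
  if (fail_before_start) {
    replica.Fail();  // The test wins the race and fails the replica early.
  }
  try {
    replica.Start();
  } catch (const std::logic_error&) {
    return true;
  }
  return false;
}
```

In this model, StartTripsCheck(true) corresponds to the flaky ordering and
StartTripsCheck(false) to the ordering in which Start() runs first.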

To fix this, the test now fails the replica only after verifying that it
is running. With this change, failures went from 6/1000 to 0/1000.
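
The "wait until running, then fail" ordering can be sketched in isolation as
a poll with a deadline. This is a minimal sketch under stated assumptions:
WaitUntil, SlowReplica, PollState and FailOnlyAfterRunning are hypothetical
names for illustration and are not part of Kudu or its test utilities, and
the replica here "finishes bootstrapping" after a fixed number of polls
rather than on a real background thread.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

enum class PollState { BOOTSTRAPPING, RUNNING, FAILED };

// Hypothetical replica that reports RUNNING after a fixed number of polls.
class SlowReplica {
 public:
  explicit SlowReplica(int polls_until_running)
      : remaining_(polls_until_running) {}

  PollState state() {
    if (state_ == PollState::BOOTSTRAPPING) {
      if (remaining_ > 0) {
        --remaining_;
      } else {
        state_ = PollState::RUNNING;
      }
    }
    return state_;
  }

  void Fail() { state_ = PollState::FAILED; }

 private:
  int remaining_;
  PollState state_ = PollState::BOOTSTRAPPING;
};

// Polls `cond` until it returns true or `timeout` elapses.
// Returns whether the condition was observed to hold.
bool WaitUntil(const std::function<bool()>& cond,
               std::chrono::milliseconds timeout,
               std::chrono::milliseconds interval =
                   std::chrono::milliseconds(1)) {
  assert(cond);  // An empty std::function would throw when called.
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (true) {
    if (cond()) return true;
    if (std::chrono::steady_clock::now() >= deadline) return false;
    std::this_thread::sleep_for(interval);
  }
}

// Fails the replica only once it has been observed RUNNING.
// Returns the replica's state afterwards; on timeout it is left untouched.
PollState FailOnlyAfterRunning(int polls_until_running,
                               std::chrono::milliseconds timeout) {
  SlowReplica replica(polls_until_running);
  const bool running = WaitUntil(
      [&replica] { return replica.state() == PollState::RUNNING; }, timeout);
  if (running) {
    replica.Fail();
  }
  return replica.state();
}
```

With a generous timeout the helper returns FAILED, because the failure is
injected only after RUNNING was seen. With a timeout too short for bootstrap
to finish, the replica is never failed and stays BOOTSTRAPPING.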

Change-Id: I93b41c8196397ea5af42ed9e2aa47e967f7a520e
Reviewed-on: http://gerrit.cloudera.org:8080/7993
Tested-by: Kudu Jenkins
Reviewed-by: Adar Dembo <ad...@cloudera.com>
Reviewed-by: Hao Hao <ha...@cloudera.com>
Reviewed-by: Mike Percy <mp...@apache.org>


Project: http://git-wip-us.apache.org/repos/asf/kudu/repo
Commit: http://git-wip-us.apache.org/repos/asf/kudu/commit/59e9a928
Tree: http://git-wip-us.apache.org/repos/asf/kudu/tree/59e9a928
Diff: http://git-wip-us.apache.org/repos/asf/kudu/diff/59e9a928

Branch: refs/heads/master
Commit: 59e9a9287d75daec43e69a0fbcb886c2d4543f73
Parents: 01003ee
Author: Andrew Wong <aw...@cloudera.com>
Authored: Thu Sep 7 02:20:53 2017 -0700
Committer: Mike Percy <mp...@apache.org>
Committed: Thu Sep 7 22:23:45 2017 +0000

----------------------------------------------------------------------
 .../ts_tablet_manager-itest.cc                  | 23 ++++++++++++++------
 1 file changed, 16 insertions(+), 7 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/kudu/blob/59e9a928/src/kudu/integration-tests/ts_tablet_manager-itest.cc
----------------------------------------------------------------------
diff --git a/src/kudu/integration-tests/ts_tablet_manager-itest.cc b/src/kudu/integration-tests/ts_tablet_manager-itest.cc
index a09fdb1..6755065 100644
--- a/src/kudu/integration-tests/ts_tablet_manager-itest.cc
+++ b/src/kudu/integration-tests/ts_tablet_manager-itest.cc
@@ -131,7 +131,8 @@ TEST_F(TsTabletManagerITest, TestFailedTabletsAreReplaced) {
   work.Setup();
   work.Start();
 
-  // Insert data until until the tablet becomes visible to the server.
+  // Insert data until the tablet becomes visible to the server.
+  // We'll operate on the first tablet server, chosen arbitrarily.
   MiniTabletServer* ts = cluster->mini_tablet_server(0);
   string tablet_id;
   ASSERT_EVENTUALLY([&] {
@@ -139,8 +140,20 @@ TEST_F(TsTabletManagerITest, TestFailedTabletsAreReplaced) {
     ASSERT_EQ(1, tablet_ids.size());
     tablet_id = tablet_ids[0];
   });
+
+  // Wait until the replica is running before failing it.
+  const auto wait_until_running = [&]() {
+    AssertEventually([&]{
+      scoped_refptr<TabletReplica> replica;
+      ASSERT_OK(ts->server()->tablet_manager()->GetTabletReplica(tablet_id, &replica));
+      ASSERT_EQ(replica->state(), tablet::RUNNING);
+    }, MonoDelta::FromSeconds(60));
+    NO_PENDING_FATALS();
+  };
+  wait_until_running();
+
   {
-    // Fail one of the replicas (the first one, chosen arbitrarily).
+    // Inject an error to the replica. Shutting it down will leave it FAILED.
     scoped_refptr<TabletReplica> replica;
     ASSERT_OK(ts->server()->tablet_manager()->GetTabletReplica(tablet_id, &replica));
     replica->SetError(Status::IOError("INJECTED ERROR: tablet failed"));
@@ -149,11 +162,7 @@ TEST_F(TsTabletManagerITest, TestFailedTabletsAreReplaced) {
   }
 
   // Ensure the tablet eventually is replicated.
-  ASSERT_EVENTUALLY([&]{
-    scoped_refptr<TabletReplica> replica;
-    ASSERT_OK(ts->server()->tablet_manager()->GetTabletReplica(tablet_id, &replica));
-    ASSERT_EQ(replica->state(), tablet::RUNNING);
-  });
+  wait_until_running();
   work.StopAndJoin();
 }