Posted to commits@kudu.apache.org by aw...@apache.org on 2019/10/02 06:31:38 UTC

[kudu] branch master updated (f73671a -> fc0e973)

This is an automated email from the ASF dual-hosted git repository.

awong pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/kudu.git.


    from f73671a  TestKuduPartitioner: deflake testBuildTimeout
     new 51ccb11  test: deflake TestFailedTServerInMaintenanceModeDoesntRereplicate
     new d5447c1  Fix compilation on XCode 11
     new fc0e973  test: deflake TestBackgroundFailureDuringMaintenanceMode

The 3 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 src/kudu/consensus/quorum_util-test.cc             |  2 +-
 .../integration-tests/maintenance_mode-itest.cc    | 60 +++++++++++++++-------
 2 files changed, 43 insertions(+), 19 deletions(-)


[kudu] 02/03: Fix compilation on XCode 11

Posted by aw...@apache.org.

awong pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/kudu.git

commit d5447c12c4676c2b176c45ed6a863a1ae9082726
Author: Grant Henke <gr...@apache.org>
AuthorDate: Tue Oct 1 22:39:53 2019 -0500

    Fix compilation on XCode 11
    
    I recently upgraded my Mac to XCode 11, and the build now fails with:
    
    …/kudu/src/kudu/consensus/quorum_util-test.cc:49:16: error: constexpr variable cannot have non-literal type 'const std::initializer_list<char>'
    constexpr auto kHealthStatuses = { '?', '-', 'x', '+' };
                   ^
    /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/initializer_list:59:28: note: 'initializer_list<char>' is not literal because it is not an aggregate and has no constexpr constructors other than copy or move constructors
    class _LIBCPP_TEMPLATE_VIS initializer_list
    
    This looks to be related to some ambiguity in compiler support for this syntax:
    https://stackoverflow.com/questions/36863503/struct-is-non-literal-type/36863691
    
    I fixed this by converting it to a const instead of a constexpr.
    
    Change-Id: Iad3db5ac4d65fd9bdfa8f357b350f24d0e0d3ec0
    Reviewed-on: http://gerrit.cloudera.org:8080/14343
    Tested-by: Kudu Jenkins
    Reviewed-by: Andrew Wong <aw...@cloudera.com>
---
 src/kudu/consensus/quorum_util-test.cc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/kudu/consensus/quorum_util-test.cc b/src/kudu/consensus/quorum_util-test.cc
index a0b3851..b4d5f65 100644
--- a/src/kudu/consensus/quorum_util-test.cc
+++ b/src/kudu/consensus/quorum_util-test.cc
@@ -46,7 +46,7 @@ constexpr auto U = RaftPeerPB::UNKNOWN_MEMBER_TYPE; // NOLINT(readability-identi
 constexpr auto V = RaftPeerPB::VOTER;               // NOLINT(readability-identifier-naming)
 
 // The various possible health statuses.
-constexpr auto kHealthStatuses = { '?', '-', 'x', '+' };
+const auto kHealthStatuses = { '?', '-', 'x', '+' };
 
 typedef std::pair<string, bool> Attr;
 


[kudu] 03/03: test: deflake TestBackgroundFailureDuringMaintenanceMode

Posted by aw...@apache.org.

awong pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/kudu.git

commit fc0e973ed056c21f19492239f9be9199bbafa4f9
Author: Andrew Wong <aw...@cloudera.com>
AuthorDate: Tue Oct 1 17:54:51 2019 -0700

    test: deflake TestBackgroundFailureDuringMaintenanceMode
    
    We previously weren't accounting for the possibility of a slow
    bootstrap, asserting early that we had the expected number of running
    replicas.
    
    This also fixes the check for 0 failed replicas. Really, we should have
    been checking for explicit healthy/failed states, rather than "not
    failed".
    
    Without this, the test failed 6/1000 times in debug mode. With it, it
    passed 4997/5000 times (failures due to a different issue).
    
    Change-Id: I67eed008a983cd9ed528496777ffe3d30126a55f
    Reviewed-on: http://gerrit.cloudera.org:8080/14341
    Reviewed-by: Adar Dembo <ad...@cloudera.com>
    Tested-by: Andrew Wong <aw...@cloudera.com>
    Reviewed-by: Alexey Serbin <as...@cloudera.com>
---
 .../integration-tests/maintenance_mode-itest.cc    | 56 +++++++++++++++-------
 1 file changed, 38 insertions(+), 18 deletions(-)

diff --git a/src/kudu/integration-tests/maintenance_mode-itest.cc b/src/kudu/integration-tests/maintenance_mode-itest.cc
index c3bb288..43f73a2 100644
--- a/src/kudu/integration-tests/maintenance_mode-itest.cc
+++ b/src/kudu/integration-tests/maintenance_mode-itest.cc
@@ -134,9 +134,10 @@ class MaintenanceModeITest : public ExternalMiniClusterITestBase {
 
   // Return the number of failed replicas there are in the cluster, according
   // to the tablet leaders.
-  Status GetNumFailedReplicas(const unordered_map<string, TServerDetails*>& ts_map,
-                              int* num_replicas_failed) {
+  Status GetReplicaHealthCounts(const unordered_map<string, TServerDetails*>& ts_map,
+                                int* num_replicas_failed, int* num_replicas_healthy) {
     int num_failed = 0;
+    int num_healthy = 0;
     for (int i = 0; i < cluster_->num_tablet_servers(); i++) {
       ExternalTabletServer* tserver = cluster_->tablet_server(i);
       if (tserver->IsShutdown()) {
@@ -158,23 +159,36 @@ class MaintenanceModeITest : public ExternalMiniClusterITestBase {
         const auto& committed_config = consensus_state.committed_config();
         for (int p = 0; p < committed_config.peers_size(); p++) {
           const auto& peer = committed_config.peers(p);
-          if (peer.has_health_report() &&
-              peer.health_report().overall_health() == HealthReportPB::FAILED) {
-            num_failed++;
+          if (peer.has_health_report()) {
+            switch (peer.health_report().overall_health()) {
+              case HealthReportPB::FAILED:
+                num_failed++;
+                break;
+              case HealthReportPB::HEALTHY:
+                num_healthy++;
+                break;
+              default:
+                continue;
+            }
           }
         }
       }
     }
     *num_replicas_failed = num_failed;
+    *num_replicas_healthy = num_healthy;
     return Status::OK();
   }
 
-  void AssertEventuallyNumFailedReplicas(const unordered_map<string, TServerDetails*>& ts_map,
-                                         int expected_failed) {
+  // Asserts that the ts_map will eventually have the given number of failed and healthy replicas
+  // across the cluster.
+  void AssertEventuallyHealthCounts(const unordered_map<string, TServerDetails*>& ts_map,
+                                    int expected_failed, int expected_healthy) {
     ASSERT_EVENTUALLY([&] {
       int num_failed;
-      ASSERT_OK(GetNumFailedReplicas(ts_map, &num_failed));
+      int num_healthy;
+      ASSERT_OK(GetReplicaHealthCounts(ts_map, &num_failed, &num_healthy));
       ASSERT_EQ(expected_failed, num_failed);
+      ASSERT_EQ(expected_healthy, num_healthy);
     });
   }
 
@@ -216,6 +230,7 @@ TEST_F(MaintenanceModeRF3ITest, TestFailedTServerInMaintenanceModeDoesntRereplic
   // receive heartbeats.
   SKIP_IF_SLOW_NOT_ALLOWED();
   const int kNumTablets = 6;
+  const int total_replicas = kNumTablets * 3;
 
   // Create the table with three tablet servers and then add one so we're
   // guaranteed that the replicas are all on the first three servers.
@@ -235,7 +250,7 @@ TEST_F(MaintenanceModeRF3ITest, TestFailedTServerInMaintenanceModeDoesntRereplic
   const auto& ts_map = ts_map_and_deleter.first;
 
   // Do a sanity check that all our replicas are healthy.
-  NO_FATALS(AssertEventuallyNumFailedReplicas(ts_map, 0));
+  NO_FATALS(AssertEventuallyHealthCounts(ts_map, 0, total_replicas));
 
   // Put one of the servers in maintenance mode.
   ExternalTabletServer* maintenance_ts = cluster_->tablet_server(0);
@@ -253,7 +268,7 @@ TEST_F(MaintenanceModeRF3ITest, TestFailedTServerInMaintenanceModeDoesntRereplic
   // Now wait a bit for this failure to make its way to the master. The failure
   // shouldn't have led to any re-replication.
   SleepFor(kDurationForSomeHeartbeats);
-  NO_FATALS(AssertEventuallyNumFailedReplicas(ts_map, kNumTablets));
+  NO_FATALS(AssertEventuallyHealthCounts(ts_map, kNumTablets, total_replicas - kNumTablets));
   NO_FATALS(ExpectStartedTabletCopies(/*should_have_started*/false));
 
   // Restarting the masters shouldn't lead to re-replication either, even
@@ -266,7 +281,7 @@ TEST_F(MaintenanceModeRF3ITest, TestFailedTServerInMaintenanceModeDoesntRereplic
   // Now bring the server back up and wait for it to become healthy. It should
   // be able to do this without tablet copies.
   ASSERT_OK(maintenance_ts->Restart());
-  NO_FATALS(AssertEventuallyNumFailedReplicas(ts_map, 0));
+  NO_FATALS(AssertEventuallyHealthCounts(ts_map, 0, total_replicas));
 
   // Since our server is healthy, leaving maintenance mode shouldn't trigger
   // any re-replication either.
@@ -279,7 +294,7 @@ TEST_F(MaintenanceModeRF3ITest, TestFailedTServerInMaintenanceModeDoesntRereplic
   // result in tablet copies.
   ASSERT_OK(ChangeTServerState(maintenance_uuid, TServerStateChangePB::ENTER_MAINTENANCE_MODE));
   NO_FATALS(maintenance_ts->Shutdown());
-  NO_FATALS(AssertEventuallyNumFailedReplicas(ts_map, kNumTablets));
+  NO_FATALS(AssertEventuallyHealthCounts(ts_map, kNumTablets, total_replicas - kNumTablets));
   ASSERT_OK(ChangeTServerState(maintenance_uuid, TServerStateChangePB::EXIT_MAINTENANCE_MODE));
   ASSERT_EVENTUALLY([&] {
     NO_FATALS(ExpectStartedTabletCopies(/*should_have_started*/true));
@@ -408,6 +423,7 @@ TEST_F(MaintenanceModeRF5ITest, TestBackgroundFailureDuringMaintenanceMode) {
   SKIP_IF_SLOW_NOT_ALLOWED();
   // Create some tables with RF=5.
   const int kNumTablets = 3;
+  const int total_replicas = kNumTablets * 5;
   TestWorkload create_table(cluster_.get());
   create_table.set_num_tablets(kNumTablets);
   create_table.set_num_replicas(5);
@@ -421,7 +437,7 @@ TEST_F(MaintenanceModeRF5ITest, TestBackgroundFailureDuringMaintenanceMode) {
   const auto& ts_map = ts_map_and_deleter.first;
 
   // Do a sanity check that all our replicas are healthy.
-  NO_FATALS(AssertEventuallyNumFailedReplicas(ts_map, 0));
+  NO_FATALS(AssertEventuallyHealthCounts(ts_map, 0, total_replicas));
 
   // Enter maintenance mode on a tserver and shut it down.
   ExternalTabletServer* maintenance_ts = cluster_->tablet_server(0);
@@ -431,7 +447,7 @@ TEST_F(MaintenanceModeRF5ITest, TestBackgroundFailureDuringMaintenanceMode) {
   SleepFor(kDurationForSomeHeartbeats);
 
   // Wait for the failure to be recognized by the other replicas.
-  NO_FATALS(AssertEventuallyNumFailedReplicas(ts_map, kNumTablets));
+  NO_FATALS(AssertEventuallyHealthCounts(ts_map, kNumTablets, total_replicas - kNumTablets));
   NO_FATALS(ExpectStartedTabletCopies(/*should_have_started*/false));
 
   // Now kill another server. We should be able to see some copies.
@@ -439,7 +455,9 @@ TEST_F(MaintenanceModeRF5ITest, TestBackgroundFailureDuringMaintenanceMode) {
   ASSERT_EVENTUALLY([&] {
     NO_FATALS(ExpectStartedTabletCopies(/*should_have_started*/true));
   });
-  NO_FATALS(AssertEventuallyNumFailedReplicas(ts_map, 0));
+  // Eventually we'll be left with just the unresponsive maintenance mode
+  // failed replicas.
+  NO_FATALS(AssertEventuallyHealthCounts(ts_map, kNumTablets, total_replicas - kNumTablets));
   // The previously empty tablet server should hold all the replicas that were
   // re-replicated.
   const TServerDetails* added_details = FindOrDie(ts_map, cluster_->tablet_server(5)->uuid());
@@ -452,11 +470,13 @@ TEST_F(MaintenanceModeRF5ITest, TestBackgroundFailureDuringMaintenanceMode) {
   // original tablets on the maintenance mode tserver should still exist.
   ASSERT_OK(ChangeTServerState(maintenance_uuid, TServerStateChangePB::EXIT_MAINTENANCE_MODE));
   ASSERT_OK(maintenance_ts->Restart());
-  SleepFor(kDurationForSomeHeartbeats);
   const TServerDetails* maintenance_details = FindOrDie(ts_map, maintenance_uuid);
   vector<string> mnt_tablet_ids;
-  ASSERT_OK(ListRunningTabletIds(maintenance_details, MonoDelta::FromSeconds(30), &mnt_tablet_ids));
-  ASSERT_EQ(kNumTablets, mnt_tablet_ids.size());
+  ASSERT_EVENTUALLY([&] {
+    ASSERT_OK(ListRunningTabletIds(
+        maintenance_details, MonoDelta::FromSeconds(30), &mnt_tablet_ids));
+    ASSERT_EQ(kNumTablets, mnt_tablet_ids.size());
+  });
 
   // All the while, our workload should not have been interrupted. Assert
   // eventually to wait for the rows to converge.


[kudu] 01/03: test: deflake TestFailedTServerInMaintenanceModeDoesntRereplicate

Posted by aw...@apache.org.

awong pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/kudu.git

commit 51ccb1165a1ca6f243dfdcbf86fe6af3abe7fc77
Author: Andrew Wong <aw...@cloudera.com>
AuthorDate: Tue Oct 1 18:34:04 2019 -0700

    test: deflake TestFailedTServerInMaintenanceModeDoesntRereplicate
    
    Ongoing TestWorkloads don't like it when the only master gets restarted. That
    made the test fail with such errors as:
    
    F1002 01:26:04.854351 26289 test_workload.cc:255] Network error: LookupRpc { table: 'test-workload', partition-key: (RANGE (key): 1143140152), attempt: 1 } failed: Client connection negotiation failed: client connection to 127.25.46.190:34359: connect: Connection refused (error 111)
    
    So I've allowed network errors in the TestWorkload. Without this change,
    the test failed 3/5000 times in debug mode. With it, it passed 4999/5000
    times (the remaining failure was due to KUDU-2431).
    
    Change-Id: Ia9f52a65ea76996c503f922af3e068048142e5f5
    Reviewed-on: http://gerrit.cloudera.org:8080/14342
    Reviewed-by: Adar Dembo <ad...@cloudera.com>
    Tested-by: Kudu Jenkins
---
 src/kudu/integration-tests/maintenance_mode-itest.cc | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/src/kudu/integration-tests/maintenance_mode-itest.cc b/src/kudu/integration-tests/maintenance_mode-itest.cc
index f982586..c3bb288 100644
--- a/src/kudu/integration-tests/maintenance_mode-itest.cc
+++ b/src/kudu/integration-tests/maintenance_mode-itest.cc
@@ -219,8 +219,12 @@ TEST_F(MaintenanceModeRF3ITest, TestFailedTServerInMaintenanceModeDoesntRereplic
 
   // Create the table with three tablet servers and then add one so we're
   // guaranteed that the replicas are all on the first three servers.
+  // Restarting our only master may lead to network errors and timeouts, but
+  // that shouldn't matter w.r.t maintenance mode.
   TestWorkload create_table(cluster_.get());
   create_table.set_num_tablets(kNumTablets);
+  create_table.set_network_error_allowed(true);
+  create_table.set_timeout_allowed(true);
   create_table.Setup();
   create_table.Start();
   // Add a server so there's one we could move to after bringing down a