Posted to commits@kudu.apache.org by al...@apache.org on 2018/10/29 18:34:12 UTC

[1/2] kudu git commit: Improve consensus queue overflow logging

Repository: kudu
Updated Branches:
  refs/heads/master eb6f4e8a8 -> 2c5c9d06b


Improve consensus queue overflow logging

Suppose tablet server X is the leader of T tablets for which tablet server Y
is a follower. The relevant situation is when T is on the order of 100-1000.
If Y strains under its consensus load and falls behind processing consensus
service requests, UpdateConsensus requests from the leader are rejected, and a
warning is logged on the leader X for each of the T tablets. The message looks
like:

W1022 17:20:59.767554 13057 consensus_peers.cc:422] T 9255fdf03ad4451e9fcd62f26741bfe6 P 892cc0d4442c4cdaaf633ed2732f9246 -> Peer dc0af5867d52468f8fd47abf13c08040 (tablet_server_Y.kudu.com:7050): Couldn't send request to peer dc0af5867d52468f8fd47abf13c08040 for tablet 9255fdf03ad4451e9fcd62f26741bfe6. Status: Remote error: Service unavailable: UpdateConsensus request on kudu.consensus.ConsensusService from 10.1.1.1:55528 dropped due to backpressure. The service queue is full; it has 50 items.. Retrying in the next heartbeat period. Already tried 1 times.

Y's consensus service pool also logs the rejection, but it doesn't have the
tablet id or peer ids available to it, and its message is throttled to at most
once per second:

W1022 17:37:33.535168  4330 service_pool.cc:130] UpdateConsensus request on kudu.consensus.ConsensusService from 10.45.26.115:36820 dropped due to backpressure. The service queue is full; it has 50 items.

This patch reduces the spam from the first message by throttling it to once
every 5 retries. It is still logged for every tablet peer, but those per-peer
messages are useful when tracing the history of a particular tablet.
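
To make the throttling concrete: with a counter-based check like the one
below, the warning fires on the first failed attempt and then on every 5th
retry (attempts 1, 6, 11, ...). This is a minimal standalone sketch of the
pattern only; the loop and the stand-in log statement are illustrative and are
not Kudu's Peer code.

  // Sketch of retry-count-based log throttling (hypothetical names).
  #include <iostream>

  constexpr int kNumRetriesBetweenLoggingFailedRequest = 5;

  int main() {
    // Simulate 12 consecutive failed attempts against a single peer.
    for (int failed_attempts = 1; failed_attempts <= 12; ++failed_attempts) {
      // Attempts 1, 6, 11, ... are logged; the rest are suppressed.
      if (failed_attempts % kNumRetriesBetweenLoggingFailedRequest == 1) {
        std::cout << "WARNING: couldn't send request to peer; this is attempt "
                  << failed_attempts << ", message repeats every "
                  << kNumRetriesBetweenLoggingFailedRequest << "th retry\n";
      }
    }
    return 0;
  }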

I also added the suppressed-message count to Y's output, so it now reads

W1022 17:37:33.535168  4330 service_pool.cc:130] UpdateConsensus request on kudu.consensus.ConsensusService from 10.45.26.115:36820 dropped due to backpressure. The service queue is full; it has 50 items. [suppressed 5 similar messages]

when, for example, 5 similar messages have been suppressed.
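
As a rough illustration of that suppressed-count behavior, the sketch below
hand-rolls a once-per-second throttle that counts dropped log lines and
appends the count to the next line it does emit. It runs over simulated
timestamps and is only a model, not Kudu's LogThrottler or the
KLOG_EVERY_N_SECS macro.

  // Hypothetical stand-alone model of a 1-per-second log throttle that
  // reports how many similar messages were suppressed in between.
  #include <iostream>
  #include <string>

  int main() {
    // Simulated timestamps in milliseconds: a ~3.5s burst of rejections,
    // one rejection every 10 ms.
    long last_logged_ms = -1000000;
    int suppressed = 0;
    for (long now_ms = 0; now_ms < 3500; now_ms += 10) {
      if (now_ms - last_logged_ms >= 1000) {  // At most one line per second.
        std::string suffix;
        if (suppressed > 0) {
          suffix = " [suppressed " + std::to_string(suppressed) +
                   " similar messages]";
        }
        std::cout << "W ... request dropped due to backpressure." << suffix
                  << "\n";
        last_logged_ms = now_ms;
        suppressed = 0;
      } else {
        ++suppressed;  // Dropped; counted toward the next emitted line.
      }
    }
    return 0;
  }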

Change-Id: I7697c63babefac0f76bcc8c87d70f7e7125e55cc
Reviewed-on: http://gerrit.cloudera.org:8080/11801
Tested-by: Will Berkeley <wd...@gmail.com>
Reviewed-by: Alexey Serbin <as...@cloudera.com>


Project: http://git-wip-us.apache.org/repos/asf/kudu/repo
Commit: http://git-wip-us.apache.org/repos/asf/kudu/commit/be64560f
Tree: http://git-wip-us.apache.org/repos/asf/kudu/tree/be64560f
Diff: http://git-wip-us.apache.org/repos/asf/kudu/diff/be64560f

Branch: refs/heads/master
Commit: be64560fe44c0466684cb9627e8bf515a7b96726
Parents: eb6f4e8
Author: Will Berkeley <wd...@gmail.org>
Authored: Fri Oct 26 11:17:20 2018 -0700
Committer: Will Berkeley <wd...@gmail.com>
Committed: Fri Oct 26 20:45:33 2018 +0000

----------------------------------------------------------------------
 src/kudu/consensus/consensus_peers.cc | 25 +++++++++++++++++++------
 src/kudu/consensus/consensus_peers.h  |  3 +--
 src/kudu/rpc/service_pool.cc          |  2 +-
 src/kudu/util/logging.h               |  2 +-
 4 files changed, 22 insertions(+), 10 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/kudu/blob/be64560f/src/kudu/consensus/consensus_peers.cc
----------------------------------------------------------------------
diff --git a/src/kudu/consensus/consensus_peers.cc b/src/kudu/consensus/consensus_peers.cc
index a99b2c0..8d6b5e7 100644
--- a/src/kudu/consensus/consensus_peers.cc
+++ b/src/kudu/consensus/consensus_peers.cc
@@ -100,6 +100,9 @@ using strings::Substitute;
 namespace kudu {
 namespace consensus {
 
+// The number of retries between failed requests whose failure is logged.
+constexpr auto kNumRetriesBetweenLoggingFailedRequest = 5;
+
 Status Peer::NewRemotePeer(RaftPeerPB peer_pb,
                            string tablet_id,
                            string leader_uuid,
@@ -444,12 +447,22 @@ void Peer::ProcessResponseError(const Status& status) {
                                TabletServerErrorPB::Code_Name(response_.error().code()),
                                response_.error().code());
   }
-  LOG_WITH_PREFIX_UNLOCKED(WARNING) << "Couldn't send request to peer " << peer_pb_.permanent_uuid()
-      << " for tablet " << tablet_id_ << "."
-      << resp_err_info
-      << " Status: " << status.ToString() << "."
-      << " Retrying in the next heartbeat period."
-      << " Already tried " << failed_attempts_ << " times.";
+  // We log the warning at the first failure, then every
+  // 'kNumRetriesBetweenLoggingFailedRequest' retries.
+  // TODO(wdberkeley): If a use case comes up elsewhere, consider adding a
+  // KLOG_EVERY_N macro that supports an appropriate LogThrottler. For now,
+  // this class has 'failed_attempts_' available so it's overkill to add
+  // the throttler support.
+  if (failed_attempts_ % kNumRetriesBetweenLoggingFailedRequest == 1) {
+    LOG_WITH_PREFIX_UNLOCKED(WARNING) <<
+      Substitute("Couldn't send request to peer $0.$1 Status: $2. This is "
+                 "attempt $3: this message will repeat every $4th retry.",
+                 peer_pb_.permanent_uuid(),
+                 resp_err_info,
+                 status.ToString(),
+                 failed_attempts_,
+                 kNumRetriesBetweenLoggingFailedRequest);
+  }
   request_pending_ = false;
 }
 

http://git-wip-us.apache.org/repos/asf/kudu/blob/be64560f/src/kudu/consensus/consensus_peers.h
----------------------------------------------------------------------
diff --git a/src/kudu/consensus/consensus_peers.h b/src/kudu/consensus/consensus_peers.h
index b111fe9..3bcdcb5 100644
--- a/src/kudu/consensus/consensus_peers.h
+++ b/src/kudu/consensus/consensus_peers.h
@@ -189,12 +189,11 @@ class Peer : public std::enable_shared_from_this<Peer> {
   // Repeating timer responsible for scheduling heartbeats to this peer.
   std::shared_ptr<rpc::PeriodicTimer> heartbeater_;
 
-  // lock that protects Peer state changes, initialization, etc.
+  // Lock that protects Peer state changes, initialization, etc.
   mutable simple_spinlock peer_lock_;
   bool request_pending_ = false;
   bool closed_ = false;
   bool has_sent_first_request_ = false;
-
 };
 
 // A proxy to another peer. Usually a thin wrapper around an rpc proxy but can

http://git-wip-us.apache.org/repos/asf/kudu/blob/be64560f/src/kudu/rpc/service_pool.cc
----------------------------------------------------------------------
diff --git a/src/kudu/rpc/service_pool.cc b/src/kudu/rpc/service_pool.cc
index 62d46d6..50a0af7 100644
--- a/src/kudu/rpc/service_pool.cc
+++ b/src/kudu/rpc/service_pool.cc
@@ -126,7 +126,7 @@ void ServicePool::RejectTooBusy(InboundCall* c) {
                  c->remote_address().ToString(),
                  service_queue_.max_size());
   rpcs_queue_overflow_->Increment();
-  KLOG_EVERY_N_SECS(WARNING, 1) << err_msg;
+  KLOG_EVERY_N_SECS(WARNING, 1) << err_msg << THROTTLE_MSG;
   c->RespondFailure(ErrorStatusPB::ERROR_SERVER_TOO_BUSY,
                     Status::ServiceUnavailable(err_msg));
   DLOG(INFO) << err_msg << " Contents of service queue:\n"

http://git-wip-us.apache.org/repos/asf/kudu/blob/be64560f/src/kudu/util/logging.h
----------------------------------------------------------------------
diff --git a/src/kudu/util/logging.h b/src/kudu/util/logging.h
index 428dadc..e041831 100644
--- a/src/kudu/util/logging.h
+++ b/src/kudu/util/logging.h
@@ -121,7 +121,7 @@ class ScopedDisableRedaction {
 // the given severity.
 //
 // The log message may include the special token 'THROTTLE_MSG' which expands
-// to either an empty string or '[suppressed <n> similar messages]'.
+// to either an empty string or ' [suppressed <n> similar messages]'.
 //
 // Example usage:
 //   KLOG_EVERY_N_SECS(WARNING, 1) << "server is low on memory" << THROTTLE_MSG;


[2/2] kudu git commit: [master] update placement logic for RF % 2 == 0

Posted by al...@apache.org.
[master] update placement logic for RF % 2 == 0

Updated the replica placement logic in the master to properly handle even
replication factors, and added corresponding unit tests.

It's possible to configure Kudu to allow creating tables with an even
replication factor. I think it's easier to handle those cases properly in the
placement logic than to deal with possible inconsistencies in the rebalancer
and other tools and to add extra paragraphs to the release notes.
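
To make the intended distributions concrete, here is a small self-contained
sketch of the overflow-group condition this patch adds to
PlacementPolicy::SelectLocation(). It is simplified: the real code also weighs
per-location load and operates on the policy's internal maps, and the function
name here is made up for illustration.

  #include <iostream>

  // Returns true if placing one more replica in a location that already
  // holds 'location_replicas_num' replicas should push that location into
  // the overflow group (i.e. it would concentrate too many replicas there).
  bool GoesToOverflowGroup(int num_locations, int num_replicas,
                           int location_replicas_num) {
    if (num_locations == 2 && num_replicas % 2 == 0) {
      // With 2 locations and an even RF the best split is half-and-half,
      // e.g. 4 replicas -> (2 + 2), 6 replicas -> (3 + 3).
      return location_replicas_num + 1 > num_replicas / 2;
    }
    if (num_locations > 2) {
      // With 3+ locations, avoid giving any single location the majority,
      // e.g. 6 replicas over 3 locations -> (2 + 2 + 2).
      return location_replicas_num + 1 >= (num_replicas + 1) / 2;
    }
    return false;
  }

  int main() {
    // 2 locations, RF=4: a second replica in a location is fine, a third
    // is not, so the placement converges on (2 + 2).
    std::cout << GoesToOverflowGroup(2, 4, 1) << " "    // 0
              << GoesToOverflowGroup(2, 4, 2) << "\n";  // 1
    // 3 locations, RF=6: no location should go beyond 2 replicas, so the
    // placement converges on (2 + 2 + 2).
    std::cout << GoesToOverflowGroup(3, 6, 1) << " "    // 0
              << GoesToOverflowGroup(3, 6, 2) << "\n";  // 1
    return 0;
  }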

Change-Id: I4259f805090dd350f31bf3bf2b6477898214ece4
Reviewed-on: http://gerrit.cloudera.org:8080/11763
Tested-by: Kudu Jenkins
Reviewed-by: Will Berkeley <wd...@gmail.com>


Project: http://git-wip-us.apache.org/repos/asf/kudu/repo
Commit: http://git-wip-us.apache.org/repos/asf/kudu/commit/2c5c9d06
Tree: http://git-wip-us.apache.org/repos/asf/kudu/tree/2c5c9d06
Diff: http://git-wip-us.apache.org/repos/asf/kudu/diff/2c5c9d06

Branch: refs/heads/master
Commit: 2c5c9d06bce6897d78f31178ad6a437d7c48f29b
Parents: be64560
Author: Alexey Serbin <as...@cloudera.com>
Authored: Tue Oct 23 17:54:05 2018 -0700
Committer: Alexey Serbin <as...@cloudera.com>
Committed: Mon Oct 29 18:32:12 2018 +0000

----------------------------------------------------------------------
 src/kudu/master/placement_policy-test.cc | 116 ++++++++++++++++++++++++++
 src/kudu/master/placement_policy.cc      |  21 ++++-
 src/kudu/master/placement_policy.h       |   5 +-
 3 files changed, 137 insertions(+), 5 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/kudu/blob/2c5c9d06/src/kudu/master/placement_policy-test.cc
----------------------------------------------------------------------
diff --git a/src/kudu/master/placement_policy-test.cc b/src/kudu/master/placement_policy-test.cc
index 951bcd0..541d0eb 100644
--- a/src/kudu/master/placement_policy-test.cc
+++ b/src/kudu/master/placement_policy-test.cc
@@ -769,5 +769,121 @@ TEST_F(PlacementPolicyTest, PlaceExtraTablet_L5_TS16_RF5) {
   EXPECT_GT(1500, placement_stats["B_ts3"]);
 }
 
+// Even RF case: edge cases with 2 locations.
+TEST_F(PlacementPolicyTest, PlaceTabletReplicasEmptyCluster_L2_EvenRFEdgeCase) {
+  const vector<LocationInfo> cluster_info = {
+    { "A", { { "A_ts0", 10 }, { "A_ts1", 10 }, { "A_ts2", 10 }, } },
+    { "B", { { "B_ts0", 0 }, { "B_ts1", 0 }, { "B_ts2", 1 }, { "B_ts3", 10 }, } },
+  };
+  ASSERT_OK(Prepare(cluster_info));
+
+  const auto& all = descriptors();
+  PlacementPolicy policy(all, rng());
+
+  {
+    static constexpr auto num_replicas = 2;
+    TSDescriptorVector result;
+    ASSERT_OK(policy.PlaceTabletReplicas(num_replicas, &result));
+    ASSERT_EQ(num_replicas, result.size());
+    TSDescriptorsMap m;
+    ASSERT_OK(TSDescriptorVectorToMap(result, &m));
+    ASSERT_EQ(1, m.count("A_ts0") + m.count("A_ts1") + m.count("A_ts2"));
+    ASSERT_EQ(1, m.count("B_ts0") + m.count("B_ts1") + m.count("B_ts2"));
+  }
+  {
+    static constexpr auto num_replicas = 4;
+    TSDescriptorVector result;
+    ASSERT_OK(policy.PlaceTabletReplicas(num_replicas, &result));
+    ASSERT_EQ(num_replicas, result.size());
+    TSDescriptorsMap m;
+    ASSERT_OK(TSDescriptorVectorToMap(result, &m));
+    ASSERT_EQ(2, m.count("A_ts0") + m.count("A_ts1") + m.count("A_ts2"));
+    ASSERT_EQ(2, m.count("B_ts0") + m.count("B_ts1") + m.count("B_ts2"));
+  }
+  {
+    static constexpr auto num_replicas = 6;
+    TSDescriptorVector result;
+    ASSERT_OK(policy.PlaceTabletReplicas(num_replicas, &result));
+    ASSERT_EQ(num_replicas, result.size());
+    TSDescriptorsMap m;
+    ASSERT_OK(TSDescriptorVectorToMap(result, &m));
+    ASSERT_EQ(1, m.count("A_ts0"));
+    ASSERT_EQ(1, m.count("A_ts1"));
+    ASSERT_EQ(1, m.count("A_ts2"));
+    ASSERT_EQ(1, m.count("B_ts0"));
+    ASSERT_EQ(1, m.count("B_ts1"));
+    ASSERT_EQ(1, m.count("B_ts2") + m.count("B_ts3"));
+  }
+}
+
+// Even RF case: place tablet replicas into a cluster with 3 locations.
+TEST_F(PlacementPolicyTest, PlaceTabletReplicasEmptyCluster_L3_EvenRF) {
+  const vector<LocationInfo> cluster_info = {
+    { "A", { { "A_ts0", 10 }, { "A_ts1", 10 }, { "A_ts2", 10 }, } },
+    { "B", { { "B_ts0", 0 }, { "B_ts1", 0 }, { "B_ts2", 10 }, } },
+    { "C", { { "C_ts0", 0 }, { "C_ts1", 0 }, { "C_ts2", 10 }, } },
+  };
+  ASSERT_OK(Prepare(cluster_info));
+
+  const auto& all = descriptors();
+  PlacementPolicy policy(all, rng());
+
+  {
+    static constexpr auto num_replicas = 2;
+    TSDescriptorVector result;
+    ASSERT_OK(policy.PlaceTabletReplicas(num_replicas, &result));
+    ASSERT_EQ(num_replicas, result.size());
+    TSDescriptorsMap m;
+    ASSERT_OK(TSDescriptorVectorToMap(result, &m));
+    // Two locations are to have one replica each; one location is to have none.
+    ASSERT_GE(1, m.count("A_ts0") + m.count("A_ts1") + m.count("A_ts2"));
+    ASSERT_GE(1, m.count("B_ts0") + m.count("B_ts1"));
+    ASSERT_GE(1, m.count("C_ts0") + m.count("C_ts1"));
+  }
+  {
+    static constexpr auto num_replicas = 4;
+    TSDescriptorVector result;
+    ASSERT_OK(policy.PlaceTabletReplicas(num_replicas, &result));
+    ASSERT_EQ(num_replicas, result.size());
+    TSDescriptorsMap m;
+    ASSERT_OK(TSDescriptorVectorToMap(result, &m));
+    // One location is to have two replicas, the rest are to have just one.
+    ASSERT_LE(1, m.count("A_ts0") + m.count("A_ts1") + m.count("A_ts2"));
+    ASSERT_GE(2, m.count("A_ts0") + m.count("A_ts1") + m.count("A_ts2"));
+    ASSERT_LE(1, m.count("B_ts0") + m.count("B_ts1"));
+    ASSERT_GE(2, m.count("B_ts0") + m.count("B_ts1"));
+    ASSERT_LE(1, m.count("C_ts0") + m.count("C_ts1"));
+    ASSERT_GE(2, m.count("C_ts0") + m.count("C_ts1"));
+  }
+  {
+    static constexpr auto num_replicas = 6;
+    TSDescriptorVector result;
+    ASSERT_OK(policy.PlaceTabletReplicas(num_replicas, &result));
+    ASSERT_EQ(num_replicas, result.size());
+    TSDescriptorsMap m;
+    ASSERT_OK(TSDescriptorVectorToMap(result, &m));
+    ASSERT_EQ(2, m.count("A_ts0") + m.count("A_ts1") + m.count("A_ts2"));
+    ASSERT_EQ(1, m.count("B_ts0"));
+    ASSERT_EQ(1, m.count("B_ts1"));
+    ASSERT_EQ(1, m.count("C_ts0"));
+    ASSERT_EQ(1, m.count("C_ts1"));
+  }
+  {
+    static constexpr auto num_replicas = 8;
+    TSDescriptorVector result;
+    ASSERT_OK(policy.PlaceTabletReplicas(num_replicas, &result));
+    ASSERT_EQ(num_replicas, result.size());
+    TSDescriptorsMap m;
+    ASSERT_OK(TSDescriptorVectorToMap(result, &m));
+    ASSERT_EQ(2, m.count("A_ts0") + m.count("A_ts1") + m.count("A_ts2"));
+    ASSERT_EQ(1, m.count("B_ts0"));
+    ASSERT_EQ(1, m.count("B_ts1"));
+    ASSERT_EQ(1, m.count("B_ts2"));
+    ASSERT_EQ(1, m.count("C_ts0"));
+    ASSERT_EQ(1, m.count("C_ts1"));
+    ASSERT_EQ(1, m.count("C_ts2"));
+  }
+}
+
 } // namespace master
 } // namespace kudu

http://git-wip-us.apache.org/repos/asf/kudu/blob/2c5c9d06/src/kudu/master/placement_policy.cc
----------------------------------------------------------------------
diff --git a/src/kudu/master/placement_policy.cc b/src/kudu/master/placement_policy.cc
index 45f9ef7..5b0b0d6 100644
--- a/src/kudu/master/placement_policy.cc
+++ b/src/kudu/master/placement_policy.cc
@@ -299,6 +299,8 @@ Status PlacementPolicy::SelectLocation(
     string* location) const {
   DCHECK(location);
 
+  const auto num_locations = ltd_.size();
+
   // A pair of the location-per-load maps. The idea is to get a group to select
   // the best location based on the load, while not placing the majority of
   // replicas in same location, if possible. Using multimap (but not
@@ -317,11 +319,24 @@ Status PlacementPolicy::SelectLocation(
         // per tablet server.
         continue;
       }
-      if (location_replicas_num + 1 > num_replicas / 2) {
+      // When placing the replicas of a tablet, it's necessary to take into
+      // account number of available locations, since the maximum number
+      // of replicas per non-overflow location depends on that. For example,
+      // in case of 2 locations the best placement for 4 replicas would be
+      // (2 + 2), while in case of 4 and more locations that's (1 + 1 + 1 + 1).
+      // Similarly, in case of 2 locations and 6 replicas, the best placement
+      // is (3 + 3), while for 3 locations that's (2 + 2 + 2).
+      if ((num_locations == 2 && num_replicas % 2 == 0 &&
+           location_replicas_num + 1 > num_replicas / 2) ||
+          (num_locations > 2 &&
+           location_replicas_num + 1 >= (num_replicas + 1) / 2)) {
         // If possible, avoid placing the majority of the tablet's replicas
         // into a single location even if load-based criterion would favor that.
-        // So, if placing one extra replica will add up to the majority, place
-        // this location into the overflow group.
+        // Prefer such a distribution of replicas that will keep the majority
+        // of replicas alive if any single location fails. So, if placing one
+        // extra replica would add up to the majority in case of odd replication
+        // factor or add up to the half of all replicas in case of even
+        // replication factor, place this location into the overflow group.
         location_per_load_overflow.emplace(
             GetLocationLoad(location, locations_info), location);
         continue;

http://git-wip-us.apache.org/repos/asf/kudu/blob/2c5c9d06/src/kudu/master/placement_policy.h
----------------------------------------------------------------------
diff --git a/src/kudu/master/placement_policy.h b/src/kudu/master/placement_policy.h
index b645818..924c0d1 100644
--- a/src/kudu/master/placement_policy.h
+++ b/src/kudu/master/placement_policy.h
@@ -113,7 +113,7 @@ class PlacementPolicy {
                          const ReplicaLocationsInfo& locations_info) const;
 
   // Select locations to place the given number of replicas ('nreplicas') for
-  // a new tablet. The locations are be chosen according to the placement
+  // a new tablet. The locations are chosen according to the placement
   // policies.
   //
   // TODO (aserbin): add the reference to the document once it's in the repo.
@@ -135,7 +135,8 @@ class PlacementPolicy {
 
   // Select location for next replica of a tablet with the specified replication
   // factor. In essence, the algorithm picks the least loaded location,
-  // making sure no location contains the majority of the replicas.
+  // making sure no location contains the majority of replicas of the tablet,
+  // if possible.
   //
   // Parameters:
   //   'num_replicas'   The total number of tablet replicas to place.