Posted to commits@kudu.apache.org by to...@apache.org on 2016/12/05 04:52:57 UTC

[1/2] kudu git commit: client: throttle some warning messages when lookup RPCs fail

Repository: kudu
Updated Branches:
  refs/heads/master 80ac8bae3 -> 47c25ceae


client: throttle some warning messages when lookup RPCs fail

In an Impala stress workload, we found that LookupRpc::SendRpcCb was
spewing warnings at an alarming rate. The warnings themselves were
causing so much load (due to Impala's glog lock contention) that it
produced a vicious cycle: the log lock contention caused more timeouts,
which caused more warnings, which caused more lock contention, etc...

I figured out the address of the glog mutex in an impalad process and
used perf to capture the stack traces of its lock acquisitions over a
5-second period:

[todd@ve1120 kudu]$ sudo perf record -a -e syscalls:sys_enter_futex --filter 'uaddr == "0x7f6950a7714c"' -g sleep 5
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 4.191 MB perf.data (~183118 samples) ]
[todd@ve1120 kudu]$ sudo perf script | grep -A1 'LogMess' | grep client:: | sort | uniq -c
    272             7f6950477224 kudu::client::internal::WriteRpc::Finish(kudu::Status const&)
     92             7f695049be85 kudu::client::internal::RemoteTablet::MarkReplicaFailed(kudu::client::internal::RemoteTabletServer*, kudu::Status const&)
  14453             7f69504a1bc5 kudu::client::internal::LookupRpc::SendRpcCb(kudu::Status const&)

Here we can see that the vast majority came from LookupRpc::SendRpcCb and
WriteRpc::Finish. This patch throttles those call sites, as well as a few
others that look like they could occur frequently.
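
For reference, the KLOG_EVERY_N_SECS macro used below (from
kudu/util/logging.h) throttles a log statement so that it fires at most
roughly once per the given number of seconds per call site. A minimal
sketch of that kind of time-based throttling (illustrative only, not
Kudu's actual implementation; the helper names here are hypothetical):

  #include <atomic>
  #include <chrono>
  #include <cstdint>
  #include <iostream>
  #include <string>

  // Log 'msg' at most once every 'n_secs' seconds per call site. The caller
  // provides a static atomic per call site to track the last emission time.
  void LogThrottled(std::atomic<int64_t>* last_log_micros, int n_secs,
                    const std::string& msg) {
    using namespace std::chrono;
    int64_t now = duration_cast<microseconds>(
        steady_clock::now().time_since_epoch()).count();
    int64_t last = last_log_micros->load(std::memory_order_relaxed);
    if (now - last < n_secs * 1000000LL) {
      return;  // Suppressed: this call site logged too recently.
    }
    // Only one racing thread wins the right to log for this window.
    if (last_log_micros->compare_exchange_strong(last, now)) {
      std::cerr << "W " << msg << std::endl;
    }
  }

  // Usage: one static counter per call site approximates per-line throttling.
  void OnLookupFailed(const std::string& status_str) {
    static std::atomic<int64_t> last_log_micros{0};
    LogThrottled(&last_log_micros, /*n_secs=*/1,
                 "Lookup RPC failed: " + status_str);
  }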

Change-Id: I840bd57976ccefba4453667f82c7aa32756922d3
Reviewed-on: http://gerrit.cloudera.org:8080/5332
Tested-by: Kudu Jenkins
Reviewed-by: Adar Dembo <ad...@cloudera.com>


Project: http://git-wip-us.apache.org/repos/asf/kudu/repo
Commit: http://git-wip-us.apache.org/repos/asf/kudu/commit/f0c27b5a
Tree: http://git-wip-us.apache.org/repos/asf/kudu/tree/f0c27b5a
Diff: http://git-wip-us.apache.org/repos/asf/kudu/diff/f0c27b5a

Branch: refs/heads/master
Commit: f0c27b5aa0616be8dd4db30a0682eab0a6218f51
Parents: 80ac8ba
Author: Todd Lipcon <to...@apache.org>
Authored: Fri Dec 2 13:38:06 2016 -0800
Committer: Todd Lipcon <to...@apache.org>
Committed: Mon Dec 5 04:51:19 2016 +0000

----------------------------------------------------------------------
 src/kudu/client/batcher.cc         |  2 +-
 src/kudu/client/client-internal.cc | 22 +++++++++++++---------
 src/kudu/client/meta_cache.cc      | 15 ++++++++-------
 3 files changed, 22 insertions(+), 17 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/kudu/blob/f0c27b5a/src/kudu/client/batcher.cc
----------------------------------------------------------------------
diff --git a/src/kudu/client/batcher.cc b/src/kudu/client/batcher.cc
index fb1d8a5..0c3c720 100644
--- a/src/kudu/client/batcher.cc
+++ b/src/kudu/client/batcher.cc
@@ -316,7 +316,7 @@ void WriteRpc::Finish(const Status& status) {
     final_status = final_status.CloneAndPrepend(
         Substitute("Failed to write batch of $0 ops to tablet $1 after $2 attempt(s)",
                    ops_.size(), tablet_id_, num_attempts()));
-    LOG(WARNING) << final_status.ToString();
+    KLOG_EVERY_N_SECS(WARNING, 1) << final_status.ToString();
   }
   batcher_->ProcessWriteResponse(*this, final_status);
 }

http://git-wip-us.apache.org/repos/asf/kudu/blob/f0c27b5a/src/kudu/client/client-internal.cc
----------------------------------------------------------------------
diff --git a/src/kudu/client/client-internal.cc b/src/kudu/client/client-internal.cc
index c6f7a31..e483be9 100644
--- a/src/kudu/client/client-internal.cc
+++ b/src/kudu/client/client-internal.cc
@@ -40,6 +40,7 @@
 #include "kudu/rpc/rpc.h"
 #include "kudu/rpc/rpc_controller.h"
 #include "kudu/rpc/rpc_header.pb.h"
+#include "kudu/util/logging.h"
 #include "kudu/util/net/dns_resolver.h"
 #include "kudu/util/net/net_util.h"
 #include "kudu/util/thread_restrictions.h"
@@ -185,9 +186,10 @@ Status KuduClient::Data::SyncLeaderMasterRpc(
     }
 
     if (s.IsNetworkError()) {
-      LOG(WARNING) << "Unable to send the request (" << req.ShortDebugString()
-                   << ") to leader Master (" << leader_master_hostport().ToString()
-                   << "): " << s.ToString();
+      KLOG_EVERY_N_SECS(WARNING, 1)
+          << "Unable to send the request (" << req.ShortDebugString()
+          << ") to leader Master (" << leader_master_hostport().ToString() << "): "
+          << s.ToString();
       if (client->IsMultiMaster()) {
         LOG(INFO) << "Determining the new leader Master and retrying...";
         WARN_NOT_OK(SetMasterServerProxy(client, deadline),
@@ -198,9 +200,10 @@ Status KuduClient::Data::SyncLeaderMasterRpc(
 
     if (s.IsTimedOut()) {
       if (MonoTime::Now() < deadline) {
-        LOG(WARNING) << "Unable to send the request (" << req.ShortDebugString()
-                     << ") to leader Master (" << leader_master_hostport().ToString()
-                     << "): " << s.ToString();
+        KLOG_EVERY_N_SECS(WARNING, 1)
+            << "Unable to send the request (" << req.ShortDebugString()
+            << ") to leader Master (" << leader_master_hostport().ToString()
+            << "): " << s.ToString();
         if (client->IsMultiMaster()) {
           LOG(INFO) << "Determining the new leader Master and retrying...";
           WARN_NOT_OK(SetMasterServerProxy(client, deadline),
@@ -218,7 +221,7 @@ Status KuduClient::Data::SyncLeaderMasterRpc(
       if (resp->error().code() == MasterErrorPB::NOT_THE_LEADER ||
           resp->error().code() == MasterErrorPB::CATALOG_MANAGER_NOT_INITIALIZED) {
         if (client->IsMultiMaster()) {
-          LOG(INFO) << "Determining the new leader Master and retrying...";
+          KLOG_EVERY_N_SECS(INFO, 1) << "Determining the new leader Master and retrying...";
           WARN_NOT_OK(SetMasterServerProxy(client, deadline),
                       "Unable to determine the new leader Master");
           continue;
@@ -649,8 +652,9 @@ void KuduClient::Data::SetMasterServerProxyAsync(KuduClient* client,
       return;
     }
     if (addrs.size() > 1) {
-      LOG(WARNING) << "Specified master server address '" << master_server_addr << "' "
-                   << "resolved to multiple IPs. Using " << addrs[0].ToString();
+      KLOG_EVERY_N_SECS(WARNING, 1)
+          << "Specified master server address '" << master_server_addr << "' "
+          << "resolved to multiple IPs. Using " << addrs[0].ToString();
     }
     master_sockaddrs.push_back(addrs[0]);
   }

http://git-wip-us.apache.org/repos/asf/kudu/blob/f0c27b5a/src/kudu/client/meta_cache.cc
----------------------------------------------------------------------
diff --git a/src/kudu/client/meta_cache.cc b/src/kudu/client/meta_cache.cc
index 82bcc5d..5de9652 100644
--- a/src/kudu/client/meta_cache.cc
+++ b/src/kudu/client/meta_cache.cc
@@ -33,6 +33,7 @@
 #include "kudu/rpc/messenger.h"
 #include "kudu/rpc/rpc.h"
 #include "kudu/tserver/tserver_service.proxy.h"
+#include "kudu/util/logging.h"
 #include "kudu/util/net/dns_resolver.h"
 #include "kudu/util/net/net_util.h"
 
@@ -193,7 +194,7 @@ void RemoteTablet::MarkReplicaFailed(RemoteTabletServer *ts,
   std::lock_guard<simple_spinlock> l(lock_);
   VLOG(2) << "Tablet " << tablet_id_ << ": Current remote replicas in meta cache: "
           << ReplicasAsStringUnlocked();
-  LOG(WARNING) << "Tablet " << tablet_id_ << ": Replica " << ts->ToString()
+  KLOG_EVERY_N_SECS(WARNING, 1) << "Tablet " << tablet_id_ << ": Replica " << ts->ToString()
                << " has failed: " << status.ToString();
   for (RemoteReplica& rep : replicas_) {
     if (rep.ts == ts) {
@@ -682,7 +683,7 @@ void LookupRpc::NewLeaderMasterDeterminedCb(const Status& status) {
     mutable_retrier()->mutable_controller()->Reset();
     SendRpc();
   } else {
-    LOG(WARNING) << "Failed to determine new Master: " << status.ToString();
+    KLOG_EVERY_N_SECS(WARNING, 1) << "Failed to determine new Master: " << status.ToString();
     mutable_retrier()->DelayedRetry(this, status);
   }
 }
@@ -707,7 +708,7 @@ void LookupRpc::SendRpcCb(const Status& status) {
     if (resp_.error().code() == master::MasterErrorPB::NOT_THE_LEADER ||
         resp_.error().code() == master::MasterErrorPB::CATALOG_MANAGER_NOT_INITIALIZED) {
       if (meta_cache_->client_->IsMultiMaster()) {
-        LOG(WARNING) << "Leader Master has changed, re-trying...";
+        KLOG_EVERY_N_SECS(WARNING, 1) << "Leader Master has changed, re-trying...";
         ResetMasterLeaderAndRetry();
         ignore_result(delete_me.release());
         return;
@@ -720,7 +721,7 @@ void LookupRpc::SendRpcCb(const Status& status) {
   if (new_status.IsTimedOut()) {
     if (MonoTime::Now() < retrier().deadline()) {
       if (meta_cache_->client_->IsMultiMaster()) {
-        LOG(WARNING) << "Leader Master timed out, re-trying...";
+        KLOG_EVERY_N_SECS(WARNING, 1) << "Leader Master timed out, re-trying...";
         ResetMasterLeaderAndRetry();
         ignore_result(delete_me.release());
         return;
@@ -734,8 +735,8 @@ void LookupRpc::SendRpcCb(const Status& status) {
 
   if (new_status.IsNetworkError()) {
     if (meta_cache_->client_->IsMultiMaster()) {
-      LOG(WARNING) << "Encountered a network error from the Master: " << new_status.ToString()
-                     << ", retrying...";
+      KLOG_EVERY_N_SECS(WARNING, 1) << "Encountered a network error from the Master: "
+                                    << new_status.ToString() << ", retrying...";
       ResetMasterLeaderAndRetry();
       ignore_result(delete_me.release());
       return;
@@ -760,7 +761,7 @@ void LookupRpc::SendRpcCb(const Status& status) {
     }
   } else {
     new_status = new_status.CloneAndPrepend(Substitute("$0 failed", ToString()));
-    LOG(WARNING) << new_status.ToString();
+    KLOG_EVERY_N_SECS(WARNING, 1) << new_status.ToString();
   }
   user_cb_.Run(new_status);
 }


[2/2] kudu git commit: ts_recovery-itest: reduce test flakiness

Posted by to...@apache.org.
ts_recovery-itest: reduce test flakiness

In the flaky test dashboard, this test case was flaky in two ways:

1) In TSAN builds, the test sometimes wouldn't write fast enough to trigger
lots of rolls. In that case, the crash might not happen within the
60-second "wait for crash" window.

2) In DEBUG builds, the 50% crash probability sometimes allowed the test to
write 10+ segments before crashing (0.5^10, roughly 1/1000 of runs). With
that many log segments, bootstrap could take long enough that the
verification on restart would time out.

This patch addresses both issues by (a) writing larger ops, so that the logs
roll more quickly with fewer ops, and (b) increasing the crash probability,
so that fewer rolls are needed before a crash on average.
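
As a back-of-the-envelope check (not part of the patch; it just restates
the arithmetic above using the new 0.9 probability from the diff below):
with a per-roll crash probability p, the chance of surviving k rolls is
(1-p)^k and the expected number of rolls before a crash is 1/p.

  #include <cmath>
  #include <cstdio>
  #include <initializer_list>

  int main() {
    // Compare the old (0.5) and new (0.9) crash probabilities.
    for (double p : {0.5, 0.9}) {
      std::printf("p=%.1f: P(survive 10 rolls)=%.2e, expected rolls to crash=%.2f\n",
                  p, std::pow(1.0 - p, 10), 1.0 / p);
    }
    return 0;
  }
  // p=0.5: P(survive 10 rolls)=9.77e-04, expected rolls to crash=2.00
  // p=0.9: P(survive 10 rolls)=1.00e-10, expected rolls to crash=1.11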

Change-Id: I4d5018636d20a4663e84071e91f573b86e309a29
Reviewed-on: http://gerrit.cloudera.org:8080/5351
Tested-by: Kudu Jenkins
Reviewed-by: Adar Dembo <ad...@cloudera.com>


Project: http://git-wip-us.apache.org/repos/asf/kudu/repo
Commit: http://git-wip-us.apache.org/repos/asf/kudu/commit/47c25cea
Tree: http://git-wip-us.apache.org/repos/asf/kudu/tree/47c25cea
Diff: http://git-wip-us.apache.org/repos/asf/kudu/diff/47c25cea

Branch: refs/heads/master
Commit: 47c25ceae1ad2a35bc30bfa94dc16d88c1a10035
Parents: f0c27b5
Author: Todd Lipcon <to...@apache.org>
Authored: Sat Dec 3 21:16:34 2016 -0800
Committer: Todd Lipcon <to...@apache.org>
Committed: Mon Dec 5 04:51:45 2016 +0000

----------------------------------------------------------------------
 src/kudu/integration-tests/ts_recovery-itest.cc | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/kudu/blob/47c25cea/src/kudu/integration-tests/ts_recovery-itest.cc
----------------------------------------------------------------------
diff --git a/src/kudu/integration-tests/ts_recovery-itest.cc b/src/kudu/integration-tests/ts_recovery-itest.cc
index ec20120..8ab57d1 100644
--- a/src/kudu/integration-tests/ts_recovery-itest.cc
+++ b/src/kudu/integration-tests/ts_recovery-itest.cc
@@ -204,12 +204,13 @@ TEST_F(TsRecoveryITest, TestCrashBeforeWriteLogSegmentHeader) {
   work.set_num_replicas(1);
   work.set_write_timeout_millis(100);
   work.set_timeout_allowed(true);
+  work.set_payload_bytes(10000); // make logs roll without needing lots of ops.
   work.Setup();
 
   // Enable the fault point after creating the table, but before writing any data.
   // Otherwise, we'd crash during creation of the tablet.
   ASSERT_OK(cluster_->SetFlag(cluster_->tablet_server(0),
-                              "fault_crash_before_write_log_segment_header", "0.5"));
+                              "fault_crash_before_write_log_segment_header", "0.9"));
   work.Start();
 
   // Wait for the process to crash during log roll.