You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@kudu.apache.org by to...@apache.org on 2016/08/17 23:02:59 UTC

[1/3] kudu git commit: KUDU-763 consensus queue metrics on followers are messed up

Repository: kudu
Updated Branches:
  refs/heads/master 34b7f1eb5 -> 13ea28029


KUDU-763 consensus queue metrics on followers are messed up

On follower tablet replicas, the majority_done_ops and
in_progress_ops metrics are wrong.

majority_done_ops = committed_index - all_replicated_opid
in_progress_ops = last_appended - committed_index

There are two reasons why:

1) followers do not update their consensus queue's committed index
2) followers do not maintain a correct value for all_replicated_opid,
since their queues generally only track the local peer and the leader
does not notify followers when ops are all-replicated.

This patch fixes 1 by having consensus notify the follower queues of
the updated committed index when the consensus committed index is
updated. This makes in_progress_ops meaningful for followers. Note
that a follower queue's committed index is not used for anything
besides the metrics.

Fixing 2 would require having the leader notify followers when
operations are all-replicated. This isn't needed for consensus, and
would be used by the followers just for the majority_done_ops metric,
so I think it's best just to zero the metric for followers and
document that it is not meaningful in that case.

Change-Id: I9fb0d45f85786b9e2631b5dc0bf044a9d3192a39
Reviewed-on: http://gerrit.cloudera.org:8080/3501
Tested-by: Kudu Jenkins
Reviewed-by: Todd Lipcon <to...@apache.org>


Project: http://git-wip-us.apache.org/repos/asf/kudu/repo
Commit: http://git-wip-us.apache.org/repos/asf/kudu/commit/0174fda2
Tree: http://git-wip-us.apache.org/repos/asf/kudu/tree/0174fda2
Diff: http://git-wip-us.apache.org/repos/asf/kudu/diff/0174fda2

Branch: refs/heads/master
Commit: 0174fda21257ef7c9ba7901671f97a58b3f1521e
Parents: 34b7f1e
Author: Will Berkeley <wd...@gmail.com>
Authored: Sun Jun 26 22:55:46 2016 -0400
Committer: Todd Lipcon <to...@apache.org>
Committed: Wed Aug 17 21:29:40 2016 +0000

----------------------------------------------------------------------
 src/kudu/consensus/consensus_queue-test.cc | 22 ++++++++++++++++++
 src/kudu/consensus/consensus_queue.cc      | 31 ++++++++++++++++++-------
 src/kudu/consensus/consensus_queue.h       |  7 ++++++
 src/kudu/consensus/raft_consensus.cc       |  3 +++
 4 files changed, 54 insertions(+), 9 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/kudu/blob/0174fda2/src/kudu/consensus/consensus_queue-test.cc
----------------------------------------------------------------------
diff --git a/src/kudu/consensus/consensus_queue-test.cc b/src/kudu/consensus/consensus_queue-test.cc
index 9f7a20c..41c539e 100644
--- a/src/kudu/consensus/consensus_queue-test.cc
+++ b/src/kudu/consensus/consensus_queue-test.cc
@@ -815,5 +815,27 @@ TEST_F(ConsensusQueueTest, TestTriggerTabletCopyIfTabletNotFound) {
             tc_req.copy_peer_addr().ShortDebugString());
 }
 
+TEST_F(ConsensusQueueTest, TestFollowerCommittedIndexAndMetrics) {
+  queue_->Init(MinimumOpId());
+  queue_->SetNonLeaderMode();
+
+  AppendReplicateMessagesToQueue(queue_.get(), clock_, 1, 10);
+  WaitForLocalPeerToAckIndex(10);
+
+  // The committed_index should be MinimumOpId() since UpdateFollowerCommittedIndex
+  // has not been called.
+  queue_->observers_pool_->Wait();
+  ASSERT_OPID_EQ(queue_->GetCommittedIndexForTests(), MinimumOpId());
+
+  // Update the committed index. In real life, this would be done by the consensus
+  // implementation when it receives an updated committed index from the leader.
+  queue_->UpdateFollowerCommittedIndex(MakeOpId(1, 10));
+  ASSERT_OPID_EQ(queue_->GetCommittedIndexForTests(), MakeOpId(1, 10));
+
+  // Check the metrics have the right values based on the updated committed index.
+  ASSERT_EQ(queue_->metrics_.num_majority_done_ops->value(), 0);
+  ASSERT_EQ(queue_->metrics_.num_in_progress_ops->value(), 0);
+}
+
 }  // namespace consensus
 }  // namespace kudu

http://git-wip-us.apache.org/repos/asf/kudu/blob/0174fda2/src/kudu/consensus/consensus_queue.cc
----------------------------------------------------------------------
diff --git a/src/kudu/consensus/consensus_queue.cc b/src/kudu/consensus/consensus_queue.cc
index a7a6fcf..d684bcd 100644
--- a/src/kudu/consensus/consensus_queue.cc
+++ b/src/kudu/consensus/consensus_queue.cc
@@ -74,10 +74,10 @@ using strings::Substitute;
 METRIC_DEFINE_gauge_int64(tablet, majority_done_ops, "Leader Operations Acked by Majority",
                           MetricUnit::kOperations,
                           "Number of operations in the leader queue ack'd by a majority but "
-                          "not all peers.");
-METRIC_DEFINE_gauge_int64(tablet, in_progress_ops, "Leader Operations in Progress",
+                          "not all peers. This metric is always zero for followers.");
+METRIC_DEFINE_gauge_int64(tablet, in_progress_ops, "Operations in Progress",
                           MetricUnit::kOperations,
-                          "Number of operations in the leader queue ack'd by a minority of "
+                          "Number of operations in the peer's queue ack'd by a minority of "
                           "peers.");
 
 std::string PeerMessageQueue::TrackedPeer::ToString() const {
@@ -466,6 +466,18 @@ void PeerMessageQueue::AdvanceQueueWatermark(const char* type,
   }
 }
 
+void PeerMessageQueue::UpdateFollowerCommittedIndex(const OpId& committed_index) {
+  if (queue_state_.mode == NON_LEADER) {
+    std::lock_guard<simple_spinlock> l(queue_lock_);
+    UpdateFollowerCommittedIndexUnlocked(committed_index);
+  }
+}
+
+void PeerMessageQueue::UpdateFollowerCommittedIndexUnlocked(const OpId& committed_index) {
+  queue_state_.committed_index.CopyFrom(committed_index);
+  UpdateMetrics();
+}
+
 void PeerMessageQueue::NotifyPeerIsResponsiveDespiteError(const std::string& peer_uuid) {
   std::lock_guard<simple_spinlock> l(queue_lock_);
   TrackedPeer* peer = FindPtrOrNull(peers_map_, peer_uuid);
@@ -676,12 +688,13 @@ OpId PeerMessageQueue::GetMajorityReplicatedOpIdForTests() const {
 void PeerMessageQueue::UpdateMetrics() {
   // Since operations have consecutive indices we can update the metrics based
   // on simple index math.
-  metrics_.num_majority_done_ops->set_value(
-      queue_state_.committed_index.index() -
-      queue_state_.all_replicated_opid.index());
+  // For non-leaders, majority_done_ops isn't meaningful because followers don't
+  // track when an op is replicated to all peers.
+  metrics_.num_majority_done_ops->set_value(queue_state_.mode == LEADER ?
+    queue_state_.committed_index.index() - queue_state_.all_replicated_opid.index()
+    : 0);
   metrics_.num_in_progress_ops->set_value(
-    queue_state_.last_appended.index() -
-    queue_state_.committed_index.index());
+    queue_state_.last_appended.index() - queue_state_.committed_index.index());
 }
 
 void PeerMessageQueue::DumpToStrings(vector<string>* lines) const {
@@ -739,7 +752,7 @@ string PeerMessageQueue::ToString() const {
 }
 
 string PeerMessageQueue::ToStringUnlocked() const {
-  return Substitute("Consensus queue metrics:"
+  return Substitute("Consensus queue metrics: "
                     "Only Majority Done Ops: $0, In Progress Ops: $1, Cache: $2",
                     metrics_.num_majority_done_ops->value(), metrics_.num_in_progress_ops->value(),
                     log_cache_.StatsString());

http://git-wip-us.apache.org/repos/asf/kudu/blob/0174fda2/src/kudu/consensus/consensus_queue.h
----------------------------------------------------------------------
diff --git a/src/kudu/consensus/consensus_queue.h b/src/kudu/consensus/consensus_queue.h
index 16d2e4c..cccee1b 100644
--- a/src/kudu/consensus/consensus_queue.h
+++ b/src/kudu/consensus/consensus_queue.h
@@ -213,6 +213,10 @@ class PeerMessageQueue {
                                 const ConsensusResponsePB& response,
                                 bool* more_pending);
 
+  // Called by the consensus implementation to update the follower queue's
+  // committed index, which is mostly used for metrics.
+  void UpdateFollowerCommittedIndex(const OpId& committed_index);
+
   // Closes the queue, peers are still allowed to call UntrackPeer() and
   // ResponseFromPeer() but no additional peers can be tracked or messages
   // queued.
@@ -258,6 +262,7 @@ class PeerMessageQueue {
 
  private:
   FRIEND_TEST(ConsensusQueueTest, TestQueueAdvancesCommittedIndex);
+  FRIEND_TEST(ConsensusQueueTest, TestFollowerCommittedIndexAndMetrics);
 
   // Mode specifies how the queue currently behaves:
   // LEADER - Means the queue tracks remote peers and replicates whatever messages
@@ -370,6 +375,8 @@ class PeerMessageQueue {
                              int num_peers_required,
                              const TrackedPeer* who_caused);
 
+  void UpdateFollowerCommittedIndexUnlocked(const OpId& committed_index);
+
   std::vector<PeerMessageQueueObserver*> observers_;
 
   // The pool which executes observer notifications.

http://git-wip-us.apache.org/repos/asf/kudu/blob/0174fda2/src/kudu/consensus/raft_consensus.cc
----------------------------------------------------------------------
diff --git a/src/kudu/consensus/raft_consensus.cc b/src/kudu/consensus/raft_consensus.cc
index 4847953..51af20a 100644
--- a/src/kudu/consensus/raft_consensus.cc
+++ b/src/kudu/consensus/raft_consensus.cc
@@ -1217,6 +1217,7 @@ Status RaftConsensus::UpdateReplica(const ConsensusRequestPB* request,
     VLOG_WITH_PREFIX_UNLOCKED(1) << "Marking committed up to " << apply_up_to.ShortDebugString();
     TRACE(Substitute("Marking committed up to $0", apply_up_to.ShortDebugString()));
     CHECK_OK(state_->AdvanceCommittedIndexUnlocked(apply_up_to, &committed_index_changed));
+    queue_->UpdateFollowerCommittedIndex(apply_up_to);
 
     // We can now update the last received watermark.
     //
@@ -1772,6 +1773,8 @@ void RaftConsensus::DumpStatusHtml(std::ostream& out) const {
   out << "<h1>Raft Consensus State</h1>" << std::endl;
 
   out << "<h2>State</h2>" << std::endl;
+  out << "<pre>" << EscapeForHtmlToString(state_->ToString()) << "</pre>" << std::endl;
+  out << "<h2>Queue</h2>" << std::endl;
   out << "<pre>" << EscapeForHtmlToString(queue_->ToString()) << "</pre>" << std::endl;
 
   // Dump the queues on a leader.


[2/3] kudu git commit: improvements to /table

Posted by to...@apache.org.
improvements to /table

This is a bunch of improvements to the /table endpoint:

0. The order of elements has been made more logical. It's now basic
info, then schema, then partition schema, then the tablets table,
then the Impala create table statement, and finally the tasks list.
Every section now has a title.

1. The partition schema section has been reformatted and improved to
show the table's range bounds.

2. The column RaftConfig in the tablets table has been renamed to
Peers.

3. The peers are now sorted so the leader appears at the top of the
list in the Peers column.

4. A rendering issue where the Impala create table's container would
"leak" a little bit of background to a line above and below it has
been fixed.

See https://raw.githubusercontent.com/wdberkeley/kudu-cr/master/table_after_top.png
for a sample of the new look.

Change-Id: Ib6724348b1cd199c4d651c1282f1eadb58226bea
Reviewed-on: http://gerrit.cloudera.org:8080/4021
Reviewed-by: Todd Lipcon <to...@apache.org>
Tested-by: Kudu Jenkins


Project: http://git-wip-us.apache.org/repos/asf/kudu/repo
Commit: http://git-wip-us.apache.org/repos/asf/kudu/commit/07c50b80
Tree: http://git-wip-us.apache.org/repos/asf/kudu/tree/07c50b80
Diff: http://git-wip-us.apache.org/repos/asf/kudu/diff/07c50b80

Branch: refs/heads/master
Commit: 07c50b803476f17728dd0c9ad1bfab12caf924e2
Parents: 0174fda
Author: Will Berkeley <wd...@gmail.com>
Authored: Wed Aug 17 15:32:00 2016 -0400
Committer: Todd Lipcon <to...@apache.org>
Committed: Wed Aug 17 23:01:13 2016 +0000

----------------------------------------------------------------------
 src/kudu/common/partition.cc            | 20 ++++++++-----
 src/kudu/master/master-path-handlers.cc | 44 ++++++++++++++++++++++------
 src/kudu/server/webui_util.cc           | 15 +++++-----
 src/kudu/server/webui_util.h            |  5 ++++
 4 files changed, 61 insertions(+), 23 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/kudu/blob/07c50b80/src/kudu/common/partition.cc
----------------------------------------------------------------------
diff --git a/src/kudu/common/partition.cc b/src/kudu/common/partition.cc
index b3b31c7..5a24228 100644
--- a/src/kudu/common/partition.cc
+++ b/src/kudu/common/partition.cc
@@ -770,27 +770,33 @@ string PartitionSchema::DisplayString(const Schema& schema) const {
   string display_string;
 
   if (!hash_bucket_schemas_.empty()) {
-    display_string.append("Hash bucket schemas:\n");
+    display_string.append("Hash components:\n");
     for (const HashBucketSchema& hash_bucket_schema : hash_bucket_schemas_) {
-      display_string.append("  Key columns:\n");
+      display_string.append("  (");
+      vector<string> hash_components;
+      hash_components.reserve(hash_bucket_schema.column_ids.size());
       for (const ColumnId& col_id : hash_bucket_schema.column_ids) {
         const ColumnSchema& col = schema.column_by_id(col_id);
-        SubstituteAndAppend(&display_string, "    $0 $1\n", col.name(), col.type_info()->name());
+        hash_components.push_back(Substitute("$0 $1", col.name(), col.type_info()->name()));
       }
-      SubstituteAndAppend(&display_string, "  Bucket count: $0\n", hash_bucket_schema.num_buckets);
+      display_string.append(JoinStrings(hash_components, ", "));
+      SubstituteAndAppend(&display_string, ") bucket count: $0", hash_bucket_schema.num_buckets);
       if (hash_bucket_schema.seed != 0) {
-        SubstituteAndAppend(&display_string, "  Seed: $0\n", hash_bucket_schema.seed);
+        SubstituteAndAppend(&display_string, " seed: $0", hash_bucket_schema.seed);
       }
       display_string.append("\n");
     }
   }
 
   if (!range_schema_.column_ids.empty()) {
-    display_string.append("Range columns:\n");
+    display_string.append("Range component:\n");
+    vector<string> range_component;
+    range_component.reserve(range_schema_.column_ids.size());
     for (const ColumnId& col_id : range_schema_.column_ids) {
       const ColumnSchema& col = schema.column_by_id(col_id);
-      SubstituteAndAppend(&display_string, "  $0 $1\n", col.name(), col.type_info()->name());
+      range_component.push_back(Substitute("$0 $1", col.name(), col.type_info()->name()));
     }
+    SubstituteAndAppend(&display_string, "  ($0)\n", JoinStrings(range_component, ", "));
   }
   return display_string;
 }

http://git-wip-us.apache.org/repos/asf/kudu/blob/07c50b80/src/kudu/master/master-path-handlers.cc
----------------------------------------------------------------------
diff --git a/src/kudu/master/master-path-handlers.cc b/src/kudu/master/master-path-handlers.cc
index d97281c..68f090c 100644
--- a/src/kudu/master/master-path-handlers.cc
+++ b/src/kudu/master/master-path-handlers.cc
@@ -21,6 +21,7 @@
 #include <boost/bind.hpp>
 #include <map>
 #include <memory>
+#include <set>
 #include <string>
 #include <utility>
 #include <vector>
@@ -43,7 +44,6 @@
 #include "kudu/util/string_case.h"
 #include "kudu/util/url-coding.h"
 
-
 namespace kudu {
 
 using consensus::ConsensusStatePB;
@@ -120,9 +120,16 @@ void MasterPathHandlers::HandleCatalogManager(const Webserver::WebRequest& req,
 
 namespace {
 
+int RoleToSortIndex(RaftPeerPB::Role r) {
+  switch (r) {
+    case RaftPeerPB::LEADER: return 0;
+    default: return 1 + static_cast<int>(r);
+  }
+}
+
 bool CompareByRole(const pair<string, RaftPeerPB::Role>& a,
                    const pair<string, RaftPeerPB::Role>& b) {
-  return a.second < b.second;
+  return RoleToSortIndex(a.second) < RoleToSortIndex(b.second);
 }
 
 } // anonymous namespace
@@ -186,11 +193,17 @@ void MasterPathHandlers::HandleTablePage(const Webserver::WebRequest& req,
     table->GetAllTablets(&tablets);
   }
 
+  *output << "<h3>Schema</h3>";
   HtmlOutputSchemaTable(schema, output);
 
-  *output << "<table class='table table-striped'>\n";
-  *output << "  <tr><th>Tablet ID</th><th>Partition</th><th>State</th>"
-      "<th>Message</th><th>RaftConfig</th></tr>\n";
+  // Prepare the tablets table first because the tablet partition information is
+  // also used to make the range bounds.
+  std::set<std::pair<string, string>> range_bounds;
+  std::stringstream tablets_output;
+  tablets_output << "<h3>Tablets</h3>";
+  tablets_output << "<table class='table table-striped'>\n";
+  tablets_output << "  <tr><th>Tablet ID</th><th>Partition</th><th>State</th>"
+      "<th>Message</th><th>Peers</th></tr>\n";
   for (const scoped_refptr<TabletInfo>& tablet : tablets) {
     vector<pair<string, RaftPeerPB::Role>> sorted_replicas;
     TabletMetadataLock l(tablet.get(), TabletMetadataLock::READ);
@@ -227,10 +240,12 @@ void MasterPathHandlers::HandleTablePage(const Webserver::WebRequest& req,
 
     Partition partition;
     Partition::FromPB(l.data().pb.partition(), &partition);
+    range_bounds.insert({partition.range_key_start().ToString(),
+                         partition.range_key_end().ToString()});
 
     string state = SysTabletsEntryPB_State_Name(l.data().pb.state());
     Capitalize(&state);
-    *output << Substitute(
+    tablets_output << Substitute(
         "<tr><th>$0</th><td>$1</td><td>$2</td><td>$3</td><td>$4</td></tr>\n",
         tablet->tablet_id(),
         EscapeForHtmlToString(partition_schema.PartitionDebugString(partition, schema)),
@@ -238,15 +253,25 @@ void MasterPathHandlers::HandleTablePage(const Webserver::WebRequest& req,
         EscapeForHtmlToString(l.data().pb.state_msg()),
         raft_config_html.str());
   }
-  *output << "</table>\n";
+  tablets_output << "</table>\n";
 
-  *output << "<h2>Partition schema</h2>";
+  // Write out the partition schema and range bound information...
+  *output << "<h3>Partition schema &amp; range bounds</h3>";
   *output << "<pre>";
   *output << EscapeForHtmlToString(partition_schema.DisplayString(schema));
+  *output << "Range bounds:\n";
+  for (const auto& pair : range_bounds) {
+    string range_bound = partition_schema.RangePartitionDebugString(pair.first,
+                                                                    pair.second,
+                                                                    schema);
+    *output << Substitute("  $0\n", EscapeForHtmlToString(range_bound));
+  }
   *output << "</pre>";
 
-  *output << "<h2>Impala CREATE TABLE statement</h2>\n";
+  // ...then the tablets table.
+  *output << tablets_output.str();
 
+  *output << "<h3>Impala CREATE TABLE statement</h3>\n";
   string master_addresses;
   if (master_->opts().IsDistributed()) {
     vector<string> all_addresses;
@@ -268,6 +293,7 @@ void MasterPathHandlers::HandleTablePage(const Webserver::WebRequest& req,
   }
   HtmlOutputImpalaSchema(table_name, schema, master_addresses, output);
 
+  *output << "<h3>Tasks</h3>";
   std::vector<scoped_refptr<MonitoredTask> > task_list;
   table->GetTaskList(&task_list);
   HtmlOutputTaskList(task_list, output);

http://git-wip-us.apache.org/repos/asf/kudu/blob/07c50b80/src/kudu/server/webui_util.cc
----------------------------------------------------------------------
diff --git a/src/kudu/server/webui_util.cc b/src/kudu/server/webui_util.cc
index 1b90700..05047ce 100644
--- a/src/kudu/server/webui_util.cc
+++ b/src/kudu/server/webui_util.cc
@@ -19,7 +19,6 @@
 
 #include <string>
 
-#include "kudu/common/schema.h"
 #include "kudu/gutil/strings/join.h"
 #include "kudu/gutil/map-util.h"
 #include "kudu/gutil/strings/human_readable.h"
@@ -70,7 +69,7 @@ void HtmlOutputImpalaSchema(const std::string& table_name,
                             const Schema& schema,
                             const string& master_addresses,
                             std::stringstream* output) {
-  *output << "<code><pre>\n";
+  *output << "<pre><code>\n";
 
   // Escape table and column names with ` to avoid conflicts with Impala reserved words.
   *output << "CREATE EXTERNAL TABLE " << EscapeForHtmlToString("`" + table_name + "`")
@@ -134,11 +133,14 @@ void HtmlOutputImpalaSchema(const std::string& table_name,
 
   *output << "TBLPROPERTIES(\n";
   *output << "  'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler',\n";
-  *output << "  'kudu.table_name' = '" << table_name << "',\n";
-  *output << "  'kudu.master_addresses' = '" << master_addresses << "',\n";
-  *output << "  'kudu.key_columns' = '" << JoinElements(key_columns, ", ") << "'\n";
+  *output << "  'kudu.table_name' = '";
+  *output << EscapeForHtmlToString(table_name) << "',\n";
+  *output << "  'kudu.master_addresses' = '";
+  *output << EscapeForHtmlToString(master_addresses) << "',\n";
+  *output << "  'kudu.key_columns' = '";
+  *output << EscapeForHtmlToString(JoinElements(key_columns, ", ")) << "'\n";
   *output << ");\n";
-  *output << "</pre></code>\n";
+  *output << "</code></pre>\n";
 }
 
 void HtmlOutputTaskList(const std::vector<scoped_refptr<MonitoredTask> >& tasks,
@@ -182,5 +184,4 @@ void HtmlOutputTaskList(const std::vector<scoped_refptr<MonitoredTask> >& tasks,
   }
   *output << "</table>\n";
 }
-
 } // namespace kudu

http://git-wip-us.apache.org/repos/asf/kudu/blob/07c50b80/src/kudu/server/webui_util.h
----------------------------------------------------------------------
diff --git a/src/kudu/server/webui_util.h b/src/kudu/server/webui_util.h
index 43ac6ea..397a8aa 100644
--- a/src/kudu/server/webui_util.h
+++ b/src/kudu/server/webui_util.h
@@ -21,6 +21,7 @@
 #include <sstream>
 #include <vector>
 
+#include "kudu/common/schema.h"
 #include "kudu/gutil/ref_counted.h"
 
 namespace kudu {
@@ -36,6 +37,10 @@ void HtmlOutputImpalaSchema(const std::string& table_name,
                             std::stringstream* output);
 void HtmlOutputTaskList(const std::vector<scoped_refptr<MonitoredTask> >& tasks,
                         std::stringstream* output);
+
+string FriendlyEncodingTypeName(EncodingType enc);
+
+string FriendlyCompressionTypeName(CompressionType enc);
 } // namespace kudu
 
 #endif // KUDU_SERVER_WEBUI_UTIL_H


[3/3] kudu git commit: docs: design for handling permanent master failures

Posted by to...@apache.org.
docs: design for handling permanent master failures

Here's a design doc that describes how we might address permanent master
failures. The downside of the proposed solution is that it requires DNS
manipulation, but the upside is that it can be adapted to migrate single
node deployments to multiple masters.

Change-Id: I2f05c319c89cf37e2d71fdc4b7ec951b2932a2b2
Reviewed-on: http://gerrit.cloudera.org:8080/3393
Tested-by: Kudu Jenkins
Reviewed-by: Todd Lipcon <to...@apache.org>


Project: http://git-wip-us.apache.org/repos/asf/kudu/repo
Commit: http://git-wip-us.apache.org/repos/asf/kudu/commit/13ea2802
Tree: http://git-wip-us.apache.org/repos/asf/kudu/tree/13ea2802
Diff: http://git-wip-us.apache.org/repos/asf/kudu/diff/13ea2802

Branch: refs/heads/master
Commit: 13ea28029773c042c77f6a0325e21c30f298bb62
Parents: 07c50b8
Author: Adar Dembo <ad...@cloudera.com>
Authored: Thu Jun 16 11:32:02 2016 -0700
Committer: Todd Lipcon <to...@apache.org>
Committed: Wed Aug 17 23:02:04 2016 +0000

----------------------------------------------------------------------
 docs/design-docs/README.md                  |   1 +
 docs/design-docs/master-perm-failure-1.0.md | 111 +++++++++++++++++++++++
 2 files changed, 112 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/kudu/blob/13ea2802/docs/design-docs/README.md
----------------------------------------------------------------------
diff --git a/docs/design-docs/README.md b/docs/design-docs/README.md
index c3903af..faf547c 100644
--- a/docs/design-docs/README.md
+++ b/docs/design-docs/README.md
@@ -39,3 +39,4 @@ made.
 | [Scan Token API](scan-tokens.md) | Client | [gerrit](http://gerrit.cloudera.org:8080/2443) |
 | [Full multi-master support for Kudu 1.0](multi-master-1.0.md) | Master, Client | [gerrit](http://gerrit.cloudera.org:8080/2527) |
 | [Non-covering Range Partitions](non-covering-range-partitions.md) | Master, Client | [gerrit](http://gerrit.cloudera.org:8080/2772) |
+| [Permanent failure handling of masters for Kudu 1.0](master-perm-failure-1.0.md) | Master | |

http://git-wip-us.apache.org/repos/asf/kudu/blob/13ea2802/docs/design-docs/master-perm-failure-1.0.md
----------------------------------------------------------------------
diff --git a/docs/design-docs/master-perm-failure-1.0.md b/docs/design-docs/master-perm-failure-1.0.md
new file mode 100644
index 0000000..3c81327
--- /dev/null
+++ b/docs/design-docs/master-perm-failure-1.0.md
@@ -0,0 +1,111 @@
+<!---
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Permanent failure handling of masters for Kudu 1.0
+
+## Background
+
+Kudu's 1.0 release includes various improvements to multi-master support so that
+it can be used in production safely. The original release plan emphasized using
+multiple masters for high availability in the event of transient failures, but
+unfortunately didn't talk much (if at all) about permanent failures. This
+document compares transient and permanent failures, and provides a design for
+addressing the latter.
+
+### Transient failures
+
+Kudu's handling of transient master failures is best illustrated with an
+example. Assume we have a healthy Raft configuration consisting of three
+masters. If one master suffers a transient failure and is offline for a short
+time before returning, there's no harm. If the failed node was a follower, the
+leader can still replicate to a majority of nodes. If the leader itself failed,
+a majority of nodes can still elect a new leader. No new machinery is needed to
+support this; all of the code has been written and tested with the caveat that
+there are some bugs that we have been squashing over the past few months.
+
+### Permanent failures
+
+What's missing, however, is handling for permanent failures. If a node dies and
+is not coming back, we need to replace it with a healthy one ASAP. If we don't,
+a second failure (transient or otherwise) will lead to a loss of availability.
+
+## Design proposal for handling permanent failures
+
+In practice, the most straight-forward approach to handling any permanent
+failure is to extend Raft configuration change support to the master; currently
+it\u2019s only possible to do it in the tserver. However, this just isn't possible
+given time constraints. Therefore, we will use a DNS-dependent alternative.
+
+Here is the algorithm:
+
+1. Base state:
+   1. There's a healthy Raft configuration of three nodes: **A**, **B**, and
+      **C**.
+   2. **A** is the leader.
+   3. The value of **--master_addresses** (the master-only gflag describing the
+      locations of the entire master Raft configuration) is {**A**, **B**,
+      **C**} on each node.
+   4. Each of **A**, **B**, and **C** are DNS cnames.
+   5. The value of **--tserver_master_address** (the tserver-only gflag
+      describing the locations of the masters) on each tserver is {**A**, **B**,
+      **C**}
+2. **C** dies, permanently. If **A** dies, the directions below are the same,
+   except replace **A** with whichever node was elected the new leader.
+3. Make sure **C** is completely dead and cannot come back to life. If possible,
+   destroy its on-disk master state.
+4. Find a replacement machine **D**.
+5. Modify DNS records such that **D** assumes **C**'s cname.
+6. Invoke new command line tool on **D** that uses remote bootstrap to copy
+   master state from **A** to **D**.
+7. Start a new master on **D**. It should use the same value of
+   **--master_addresses** as used by the other masters.
+
+In order to implement this design, we'll need to make the following changes:
+
+1. Make remote bootstrap available for masters (currently it's tserver-only).
+2. Implement new remote bootstrap "client" command line tool.
+
+## Migration from single-master deployments
+
+While not exactly related to failure handling, the remote bootstrap
+modifications described above can be used to ease migration from a single master
+deployment to a multi-master one. Since migration is a rare and singular event
+in the lifetime of a cluster, it is assumed that a temporary loss of
+availability during the migration is acceptable.
+
+Here is the algorithm:
+
+1. There exists a healthy single-node master deployment called **A**.
+2. Find new master machines, creating DNS cnames for all of them. Create a DNS
+   cname for **A** too, if it's not already a cname. Note: the total number of
+   masters must be odd. To figure out how many masters there should be, consider
+   that **N** failures can be tolerated by a deployment of **2N+1** masters.
+3. Stop the master running on **A**.
+4. Invoke new command line tool to format a filesystem on each new master node.
+5. Invoke new command line tool to print the filesystem uuid on each master node
+   and on existing master node **A**. Record these UUIDs.
+6. Invoke new command line tool on **A** to rewrite the on-disk consensus
+   metadata (cmeta) file describing the Raft configuration. Provide the uuid and
+   cname for each new master node as well as for **A**.
+7. Start the master running on **A**.
+8. Invoke remote bootstrap "client" tool from above on each new node to copy
+   **A**'s master state onto new node. These invocations can be done in parallel
+   to speed up the process, though in practice master state is quite small.
+9. Start the master on each new node.
+
+In order to implement this design, we'll need the following additional changes:
+
+1. Implement new command line tool to format filesystems.
+2. Implement new command line tool to print filesystem uuids.
+3. Implement new command line tool to rewrite cmeta files.