Posted to commits@kudu.apache.org by da...@apache.org on 2016/10/25 19:39:52 UTC

[1/3] kudu git commit: Add a design doc for rpc retry/failover semantics

Repository: kudu
Updated Branches:
  refs/heads/master ff24651cb -> 6ed6989ab


Add a design doc for rpc retry/failover semantics

This adds the final version of the rpc retry/failover doc and
includes details of the final implementation. It also includes
a guide and hints on how to implement exactly once semantics
for other RPCs.

Change-Id: Idc2aa40486153b39724e1c9bd09c626b829274c6
Reviewed-on: http://gerrit.cloudera.org:8080/2642
Tested-by: Kudu Jenkins
Reviewed-by: Todd Lipcon <to...@apache.org>


Project: http://git-wip-us.apache.org/repos/asf/kudu/repo
Commit: http://git-wip-us.apache.org/repos/asf/kudu/commit/a07c5ada
Tree: http://git-wip-us.apache.org/repos/asf/kudu/tree/a07c5ada
Diff: http://git-wip-us.apache.org/repos/asf/kudu/diff/a07c5ada

Branch: refs/heads/master
Commit: a07c5adac0aa68390fa6a34a625cb0236986f611
Parents: ff24651
Author: David Alves <da...@cloudera.com>
Authored: Sat Feb 20 16:12:50 2016 -0800
Committer: David Ribeiro Alves <dr...@apache.org>
Committed: Tue Oct 25 01:14:07 2016 +0000

----------------------------------------------------------------------
 docs/design-docs/README.md                 |   1 +
 docs/design-docs/rpc-retry-and-failover.md | 232 ++++++++++++++++++++++++
 2 files changed, 233 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/kudu/blob/a07c5ada/docs/design-docs/README.md
----------------------------------------------------------------------
diff --git a/docs/design-docs/README.md b/docs/design-docs/README.md
index faf547c..6ceb3c3 100644
--- a/docs/design-docs/README.md
+++ b/docs/design-docs/README.md
@@ -40,3 +40,4 @@ made.
 | [Full multi-master support for Kudu 1.0](multi-master-1.0.md) | Master, Client | [gerrit](http://gerrit.cloudera.org:8080/2527) |
 | [Non-covering Range Partitions](non-covering-range-partitions.md) | Master, Client | [gerrit](http://gerrit.cloudera.org:8080/2772) |
 | [Permanent failure handling of masters for Kudu 1.0](master-perm-failure-1.0.md) | Master | |
+| [RPC Retry/Failover semantics](rpc-retry-and-failover.md) | Client/TS/Master | [gerrit](http://gerrit.cloudera.org:8080/2642) |

http://git-wip-us.apache.org/repos/asf/kudu/blob/a07c5ada/docs/design-docs/rpc-retry-and-failover.md
----------------------------------------------------------------------
diff --git a/docs/design-docs/rpc-retry-and-failover.md b/docs/design-docs/rpc-retry-and-failover.md
new file mode 100644
index 0000000..70c026d
--- /dev/null
+++ b/docs/design-docs/rpc-retry-and-failover.md
@@ -0,0 +1,232 @@
+
+<!---
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+# Motivation
+
+Kudu has many RPCs that need error handling. When an RPC fails due to a timeout
+or network failure, the sender needs to retry it on the same or on a different
+server, but it often doesn't know whether the original RPC succeeded. Certain
+RPCs are not idempotent, i.e. if handled twice they will lead to incorrect state,
+so we need a mechanism by which the sender can retry the RPC while the receiver
+is guaranteed to execute it **Exactly Once**.
+
+# Problem breakdown
+
+[1] introduces four subproblems that need to be solved to obtain exactly-once semantics on
+replicated RPCs; we'll use the same terminology to avoid redefining the concepts:
+
+- Identification - Each RPC must have a unique identifier.
+- Completion Record - Each RPC must have a record of its completion.
+- Retry rendezvous - When an RPC is retried it must find the completion record of the
+                     previous attempt.
+- Garbage collection - After a while, completion records must be garbage collected.
+
+## Design options
+
+We can address these problems in multiple ways and, in particular, at multiple abstraction
+layers:
+
+1. Completely encapsulated at the RPC layer - This is the option that [1] presents at length,
+allowing these properties to be "bolted on" almost independently of what the underlying RPCs
+are. This option, while very general, would require a lot of work, particularly since for
+replicated RPCs we'd need to come up with a generic, durable Completion Record storage mechanism.
+While "elegant" in some ways, this option is also awkward in others: for instance,
+it's not totally clear what happens when an RPC mutates multiple "server objects", or
+what these server objects really are.
+
+2. Handled ad-hoc by the client and the tablet server/master, outside of the RPC layer -
+With this option, we would handle errors ad-hoc at the TabletServer/TabletPeer, separately for
+each different RPC. For specific RPCs like Write() this option seems to map to existing
+components quite well. For instance, the completion record can be the Raft log, and replica
+replay would make it relatively easy to implement a retry rendezvous
+mechanism. However, this option lacks generality and it would likely require duplicating
+error handling for operations that are not tablet server transactions.
+
+3. A hybrid of the above - Certain parts of the problem would be handled by the RPC layer,
+such as retrying and retry rendezvous logic, but other parts of the logic would be implemented
+ad-hoc in other layers. For instance, for Write()s, the RPC layer would know how to sequence and
+retry the RPCs but would delegate durability and cross-server replicated operation rendezvous
+to the tablet peer/transaction manager.
+
+# Design choices
+
+We chose option 3, in which most of the mechanism is encapsulated in the RPC layer, providing
+limited (non-persistent) bolt-on exactly once semantics for generic RPCs, while the API is
+exposed so that we can perform ad-hoc handling of persistence where required (e.g. for Write RPCs).
+
+We addressed each of the four sub-problems mentioned in the previous section in the following way:
+
+1. Identification - Each individual RPC *attempt* must be uniquely identified. That is, not only do
+we need to identify each RPC a client executes, but we need to distinguish between different attempts
+(retries) of the same RPC. This diverges from [1], which identifies each RPC but not different
+attempts of the same request. We need to keep this additional information because of the hybrid
+design choice we made: because part of the exactly once mechanism is bolt-on and
+part is ad-hoc, there are actually two serialization points for replicated RPC rendezvous, in which
+case we need to distinguish between different retries and thus need a new identifier.
+
+Each individual RPC attempt contains the following information:
+- Client ID - A UUID that is generated independently, per client.
+- Sequence number - An integer ID that uniquely identifies a request, even across machines and attempts.
+- Attempt number - An integer that uniquely identifies each attempt of a request.
+
+2. Completion Record - The completion record of an RPC is the response that is sent back to the client
+once the RPC is complete. Once a response for a request is built, it is cached in the RPC subsystem
+and clients will always receive the same response for the same request. The RPC subsystem does not
+handle response durability, though; it is up to the ad-hoc integration to make sure that responses
+are either durably stored or can be rebuilt consistently.
+
+3. Retry rendezvous - We make sure that multiple attempts at the same RPC on the same server meet
+by making them go through a new component, the **ResultTracker**. This component is responsible for
+determining which of the following states the RPC is in:
+- NEW - It's the first time the server has seen the RPC and it should be executed.
+- IN_PROGRESS - The RPC has previously been allowed to execute but hasn't yet completed.
+- COMPLETED - The RPC has already completed and its response has been cached.
+- STALE - The RPC is old enough that the server no longer caches its response, but recent enough that
+the server still remembers that it deleted the corresponding response.
+
+4. Garbage collection - We opted to implement the basic watermark-based mechanism for garbage
+collection mentioned in [1]: each time a client sends a request, it also sends the "Request ID" of the
+first incomplete request it knows about. This lets the server know that it can delete the responses
+to all earlier requests (see the sketch below). In addition, we also implemented a time-based garbage
+collection mechanism (which was needed to garbage collect whole clients anyway, since we didn't
+implement the client lease system of [1]). Time-based garbage collection deletes responses for a client
+that are older than a certain time (and not in-progress). Whole client state is deleted after
+another (longer) time period has elapsed.
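+
+As a concrete illustration of the watermark-based scheme, here is a minimal C++ sketch. The class
+and method names are hypothetical and this is not Kudu's actual **ResultTracker** code; it only
+shows how the first incomplete sequence number reported by the client lets the server discard
+cached responses.
+
+```
+#include <cstdint>
+#include <map>
+#include <string>
+
+// Cached responses for a single client, keyed by sequence number.
+class CachedResponses {
+ public:
+  void Remember(int64_t seq_no, std::string response) {
+    responses_[seq_no] = std::move(response);
+  }
+
+  // Called with the first_incomplete_seq_no carried by each incoming request.
+  // Every response below the watermark has been acknowledged by the client
+  // and can never be asked for again, so it is safe to delete.
+  void GarbageCollect(int64_t first_incomplete_seq_no) {
+    responses_.erase(responses_.begin(),
+                     responses_.lower_bound(first_incomplete_seq_no));
+  }
+
+  // Returns the cached response for a retry, or nullptr if none is cached.
+  const std::string* Find(int64_t seq_no) const {
+    auto it = responses_.find(seq_no);
+    return it == responses_.end() ? nullptr : &it->second;
+  }
+
+ private:
+  std::map<int64_t, std::string> responses_;
+};
+```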
+
+## Note on client leases
+
+We opted not to implement the distributed client lease mechanism of [1].
+This does allow for cases where an RPC is executed twice: for instance, if a client attempts an RPC
+and then is silent for a period greater than the time-based garbage collection period before
+re-attempting the same RPC. We consider this unlikely enough that it does not justify implementing
+a complex lease management mechanism, but for documentation purposes **the following scenario could
+cause double execution**:
+
+- The user sets a timeout on an operation that is longer than the time-based garbage collection period
+(10 minutes, by default). The client then attempts the operation, which succeeds on the server,
+but the client is partitioned from the server before receiving the reply. The partition lasts
+more than the time-based garbage collection period but less than the user-set timeout, meaning the
+client continues to retry the operation. In this case the operation could be executed twice.
+
+# Lifecycle of an Exactly-Once RPC
+
+Clients assign themselves a UUID that acts as their unique identifier for their lifetime.
+Clients have a component, the **Request Tracker**, that is responsible for assigning new RPC
+*Sequence number*s and tracking which ones are outstanding. Along with the outstanding requests,
+the *Request Tracker* is responsible for tracking the *First incomplete sequence number*, i.e. the id
+of the first outstanding RPC. This is important for garbage collection purposes, as discussed above.
+The client system is also responsible for keeping a counter of the times an RPC is attempted and
+making sure that each time an attempt is performed it is assigned a new *AttemptNumber* based on this
+counter. Together these four elements form a *Request Id*, which is set in the RPC header and takes
+the form:
+
+    =========================================
+    |              RequestIdPB              |
+    ----------------------------------------
+    |    - client_id : string               |
+    |    - seq_no : int64                   |
+    |    - first_incomplete_seq_no : int64  |
+    |    - attempt_no : int64               |
+    =========================================
+
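+The following C++ sketch shows the client-side bookkeeping this implies. The names are illustrative
+only and do not reflect Kudu's actual client code; the sketch captures how sequence numbers and the
+first incomplete sequence number are tracked, while the per-RPC attempt counter is assumed to be
+kept by each outstanding RPC and is not shown.
+
+```
+#include <cstdint>
+#include <set>
+
+// Client-side tracker for the identifiers carried in RequestIdPB.
+class RequestTrackerSketch {
+ public:
+  // Assigns a new sequence number to an RPC and marks it as outstanding.
+  int64_t NewSeqNo() {
+    int64_t seq_no = next_seq_no_++;
+    outstanding_.insert(seq_no);
+    return seq_no;
+  }
+
+  // Called once a final response for the RPC has been received.
+  void MarkComplete(int64_t seq_no) { outstanding_.erase(seq_no); }
+
+  // The value to send as first_incomplete_seq_no; it lets the server
+  // garbage collect responses for all earlier, completed requests.
+  int64_t FirstIncompleteSeqNo() const {
+    return outstanding_.empty() ? next_seq_no_ : *outstanding_.begin();
+  }
+
+ private:
+  int64_t next_seq_no_ = 0;
+  std::set<int64_t> outstanding_;  // sequence numbers still in flight
+};
+```
+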
+When an RPC reaches the server, the *Request Id* is passed to the **ResultTracker**, the server-side
+component responsible for storing Completion Records (the response protobufs) and for performing
+the Retry Rendezvous. For each RPC, this component determines which of the following
+states it is in:
+
+- NEW - This is the first time the server has seen this request from the client. In this case
+the request will be passed on to an RPC handler for execution.
+- IN_PROGRESS - This is a retry of a previous request, but the original request is still being
+executed somewhere in the server. The request is dropped and not executed. It will receive a
+response at the same time as the original attempt, once the latter completes.
+- COMPLETED - This is a retry for a request that has previously completed. The request is dropped
+and not executed. The response of the original request, which must still be in memory, is sent back
+to the client.
+- STALE - This is either a new request or a retry for a request that is no longer being tracked,
+although the client itself still is. The request is dropped and not executed. An appropriate error
+is sent back to the client.
+
+If the **ResultTracker** returns that the request is *NEW*, then execution proceeds; for instance,
+in the case of a Write() this means that it will be sent to the *TransactionManager* for replication
+and eventually applied to the tablet.
+
+Once execution of the request completes successfully, the response is handed to the **ResultTracker**,
+which stores it in memory (to be able to reply to future retries) and replies back to the
+client.
+
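+To make the flow concrete, here is a minimal C++ sketch of the rendezvous logic. It is only an
+illustration under the assumptions above: the class and method names are hypothetical and do not
+match Kudu's actual **ResultTracker** API, and the STALE state and garbage collection are omitted
+for brevity.
+
+```
+#include <cstdint>
+#include <map>
+#include <string>
+
+enum class RpcState { NEW, IN_PROGRESS, COMPLETED };
+
+// Per-server rendezvous point for all attempts of a client's RPCs.
+class ResultTrackerSketch {
+ public:
+  // Called when an attempt arrives; decides whether it should be executed.
+  RpcState TrackRpc(const std::string& client_id, int64_t seq_no) {
+    Entry& entry = entries_[client_id][seq_no];
+    if (entry.completed) return RpcState::COMPLETED;      // reply with cached response
+    if (entry.in_progress) return RpcState::IN_PROGRESS;  // drop; original attempt will respond
+    entry.in_progress = true;                             // first attempt: let it execute
+    return RpcState::NEW;
+  }
+
+  // Called by the handler once execution finishes; caches the response so
+  // that any later retry receives exactly the same bytes.
+  void RecordCompletion(const std::string& client_id, int64_t seq_no,
+                        std::string response) {
+    Entry& entry = entries_[client_id][seq_no];
+    entry.in_progress = false;
+    entry.completed = true;
+    entry.response = std::move(response);
+  }
+
+  // Returns the cached response for a COMPLETED request.
+  const std::string& CachedResponse(const std::string& client_id, int64_t seq_no) {
+    return entries_[client_id][seq_no].response;
+  }
+
+ private:
+  struct Entry {
+    bool in_progress = false;
+    bool completed = false;
+    std::string response;  // the serialized response protobuf
+  };
+  // client id -> sequence number -> completion record.
+  std::map<std::string, std::map<int64_t, Entry>> entries_;
+};
+```
+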
+## Important Details
+
+- The response for errors is not stored. This is for two reasons: errors are not supposed to have
+side effects, and errors might be transient; for instance, a write to a non-leader replica may fail but
+a retry might be successful if that replica is later elected leader.
+
+- The mechanism above, just by itself, does not handle replicated RPCs. If an RPC is
+replicated, i.e. if it must be executed exactly once across a set of nodes (e.g. Write()), its
+idempotency is not fully covered by this mechanism alone.
+
+- Responses are not stored persistently, so for replicated RPCs implementations must take care that
+the same request always has the same response, i.e. that the exact same response can be rebuilt
+from history.
+
+## Declaring RPCs as Exactly Once
+
+To make sure the results of an RPC are tracked **transiently** on the server side (persistence for
+replicated RPCs is covered in the next section), all that is required is that the service definition
+enables the appropriate option. For instance, in the case of writes this is done the following way:
+
+```
+service TabletServerService {
+
+  ...
+  rpc Write(WriteRequestPB) returns (WriteResponsePB)  {
+    option (kudu.rpc.track_rpc_result) = true;
+  }
+  ...
+}
+```
+
+# Exactly Once semantics for replicated, fault-tolerant RPCs
+
+RPCs that are replicated, for fault tolerance, require more than the mechanics above. In particular
+they have the following additional requirements:
+
+- Cross-replica rendezvous - The **ResultTracker** makes sure that, for a single server, all attempts
+of an RPC sent from a client directly to that server are serialized. However, this does not take into
+account attempts from the same client to other servers, which must also be serialized in some way.
+
+- Response rebuilding - When a server crashes it loses some in-memory state. Requests for
+completed operations are durably stored on-disk so that the volatile state can be rebuilt, but
+responses are not. When a request is replayed to rebuild lost state, its response must be stored
+again and it must be the same as the original response sent to the client.
+
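+The C++ sketch below illustrates the response rebuilding requirement only. It is not Kudu's actual
+tablet bootstrap code; the type and function names are hypothetical, and the apply callback stands
+in for whatever deterministic re-execution the implementation performs while replaying its durable
+log.
+
+```
+#include <cstdint>
+#include <functional>
+#include <string>
+#include <vector>
+
+// One durably logged, already-replicated operation.
+struct LoggedOp {
+  std::string client_id;
+  int64_t seq_no;
+  std::string payload;
+};
+
+// Replays the log at startup, rebuilding each original response and
+// re-caching it so that post-crash retries get the exact same answer.
+void ReplayLog(
+    const std::vector<LoggedOp>& log,
+    // Must be deterministic: the rebuilt response has to be byte-for-byte
+    // identical to the one originally sent to the client.
+    const std::function<std::string(const LoggedOp&)>& apply,
+    // Re-registers the rebuilt response under the original request id.
+    const std::function<void(const std::string& /*client_id*/, int64_t /*seq_no*/,
+                             std::string /*response*/)>& cache_response) {
+  for (const LoggedOp& op : log) {
+    cache_response(op.client_id, op.seq_no, apply(op));
+  }
+}
+```
+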
+## Concrete Example: Exactly Once semantics for writes
+
+Writes (any operation that mutates rows on a tablet server) are the primary use case for exactly
+once semantics and thus were implemented first (for reference, this landed in commit ```6d2679bd```
+and this is the corresponding [gerrit](https://gerrit.cloudera.org/#/c/3449/)). Examining this
+change is a good starting point for adding Exactly Once semantics to other replicated RPCs. A lot of
+the changes were mechanical; the relevant ones are the following:
+
+1. The cross-replica retry rendezvous is implemented in ```transaction_driver.{h,cc}```; its inner
+workings are detailed in the header. This basically makes sure that when an RPC is received from
+a client and a different attempt of the same RPC is received from another replica (a previous leader),
+we execute only one of those attempts (the replica one).
+
+2. Response rebuilding is implemented in ```tablet_bootstrap.cc```, which adds the ability
+to rebuild the original response when a request is replayed during tablet bootstrap.
+
+### References
+
+[1][Implementing Linearizability at Large Scale and Low Latency](http://web.stanford.edu/~ouster/cgi-bin/papers/rifl.pdf)


[2/3] kudu git commit: Add krb5 dependency to Java README

Posted by da...@apache.org.
Add krb5 dependency to Java README

I think it's a little premature to add it to installation.adoc, since it
isn't required for building or operating Kudu yet (it's a test-only
dependency).

Change-Id: I5e4ffe0bee242ddec3a9effd4c3404348f1b889c
Reviewed-on: http://gerrit.cloudera.org:8080/4840
Reviewed-by: Jean-Daniel Cryans <jd...@apache.org>
Tested-by: Dan Burkert <da...@apache.org>


Project: http://git-wip-us.apache.org/repos/asf/kudu/repo
Commit: http://git-wip-us.apache.org/repos/asf/kudu/commit/8972e9ef
Tree: http://git-wip-us.apache.org/repos/asf/kudu/tree/8972e9ef
Diff: http://git-wip-us.apache.org/repos/asf/kudu/diff/8972e9ef

Branch: refs/heads/master
Commit: 8972e9ef45edd9214f921b5b93a4c2cdf8fef8fc
Parents: a07c5ad
Author: Dan Burkert <da...@apache.org>
Authored: Tue Oct 25 11:52:30 2016 -0700
Committer: Dan Burkert <da...@apache.org>
Committed: Tue Oct 25 19:39:23 2016 +0000

----------------------------------------------------------------------
 README.adoc    | 1 -
 java/README.md | 3 +--
 2 files changed, 1 insertion(+), 3 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/kudu/blob/8972e9ef/README.adoc
----------------------------------------------------------------------
diff --git a/README.adoc b/README.adoc
index 295033c..7147382 100644
--- a/README.adoc
+++ b/README.adoc
@@ -197,7 +197,6 @@ or
 $ build-support/run-test.sh build/tsan/bin/foo-test [--test-arguments-here]
 ----
 
-
 ...and then view the logs in _build/tsan/test-logs/_
 
 ====

http://git-wip-us.apache.org/repos/asf/kudu/blob/8972e9ef/java/README.md
----------------------------------------------------------------------
diff --git a/java/README.md b/java/README.md
index f617314..345e030 100644
--- a/java/README.md
+++ b/java/README.md
@@ -19,7 +19,7 @@ System Requirements
 - Java 7
 - Maven 3
 - protobuf 2.6.1 (it needs to be the exact version)
-
+- MIT Kerberos (krb5)
 
 Building the Client
 ------------------------------------------------------------
@@ -28,7 +28,6 @@ $ mvn package -DskipTests
 
 The client jar will can then be found at kudu-client/target.
 
-
 Running the Tests
 ------------------------------------------------------------
 


[3/3] kudu git commit: Fix MiniKdc dtor crash

Posted by da...@apache.org.
Fix MiniKdc dtor crash

The C++ MiniKdc had a bug that caused a CHECK failure when the dtor ran
while stopped.

Change-Id: Iccfa5cdbfadd82d021e9ccede020e1d19e7054fc
Reviewed-on: http://gerrit.cloudera.org:8080/4839
Reviewed-by: Adar Dembo <ad...@cloudera.com>
Tested-by: Dan Burkert <da...@apache.org>


Project: http://git-wip-us.apache.org/repos/asf/kudu/repo
Commit: http://git-wip-us.apache.org/repos/asf/kudu/commit/6ed6989a
Tree: http://git-wip-us.apache.org/repos/asf/kudu/tree/6ed6989a
Diff: http://git-wip-us.apache.org/repos/asf/kudu/diff/6ed6989a

Branch: refs/heads/master
Commit: 6ed6989ab99a415b9442559261cdee719b2eff57
Parents: 8972e9e
Author: Dan Burkert <da...@apache.org>
Authored: Tue Oct 25 11:30:21 2016 -0700
Committer: Dan Burkert <da...@apache.org>
Committed: Tue Oct 25 19:39:32 2016 +0000

----------------------------------------------------------------------
 .../src/test/java/org/apache/kudu/client/TestMiniKdc.java      | 6 ++++++
 src/kudu/security/mini_kdc-test.cc                             | 6 ++++++
 src/kudu/security/mini_kdc.cc                                  | 4 +++-
 3 files changed, 15 insertions(+), 1 deletion(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/kudu/blob/6ed6989a/java/kudu-client/src/test/java/org/apache/kudu/client/TestMiniKdc.java
----------------------------------------------------------------------
diff --git a/java/kudu-client/src/test/java/org/apache/kudu/client/TestMiniKdc.java b/java/kudu-client/src/test/java/org/apache/kudu/client/TestMiniKdc.java
index e1628dd..609b1a0 100644
--- a/java/kudu-client/src/test/java/org/apache/kudu/client/TestMiniKdc.java
+++ b/java/kudu-client/src/test/java/org/apache/kudu/client/TestMiniKdc.java
@@ -45,4 +45,10 @@ public class TestMiniKdc {
       assertTrue(klist.contains("krbtgt/KRBTEST.COM@KRBTEST.COM"));
     }
   }
+
+  @Test
+  public void testStopClose() throws Exception {
+    // Test that closing a stopped KDC does not throw.
+    MiniKdc.withDefaults().close();
+  }
 }

http://git-wip-us.apache.org/repos/asf/kudu/blob/6ed6989a/src/kudu/security/mini_kdc-test.cc
----------------------------------------------------------------------
diff --git a/src/kudu/security/mini_kdc-test.cc b/src/kudu/security/mini_kdc-test.cc
index 26a3621..2b88990 100644
--- a/src/kudu/security/mini_kdc-test.cc
+++ b/src/kudu/security/mini_kdc-test.cc
@@ -48,4 +48,10 @@ TEST(MiniKdcTest, TestBasicOperation) {
   ASSERT_STR_CONTAINS(klist, "krbtgt/KRBTEST.COM@KRBTEST.COM");
 }
 
+// Regression test to ensure that dropping a stopped MiniKdc doesn't panic.
+TEST(MiniKdcTest, TestStopDrop) {
+  MiniKdcOptions options;
+  MiniKdc kdc(options);
+}
+
 } // namespace kudu

http://git-wip-us.apache.org/repos/asf/kudu/blob/6ed6989a/src/kudu/security/mini_kdc.cc
----------------------------------------------------------------------
diff --git a/src/kudu/security/mini_kdc.cc b/src/kudu/security/mini_kdc.cc
index 25373fd..9daa527 100644
--- a/src/kudu/security/mini_kdc.cc
+++ b/src/kudu/security/mini_kdc.cc
@@ -60,7 +60,9 @@ MiniKdc::MiniKdc(const MiniKdcOptions& options)
 }
 
 MiniKdc::~MiniKdc() {
-  WARN_NOT_OK(Stop(), "Unable to stop MiniKdc");
+  if (kdc_process_) {
+    WARN_NOT_OK(Stop(), "Unable to stop MiniKdc");
+  }
 }
 
 vector<string> MiniKdc::MakeArgv(const vector<string>& in_argv) {