Posted to commits@kudu.apache.org by ad...@apache.org on 2019/11/18 05:24:32 UTC

[kudu] 01/02: docs: add docs for maintenance mode

This is an automated email from the ASF dual-hosted git repository.

adar pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/kudu.git

commit 200509b497a028f7d700152ca4c92818ce5b9f0e
Author: Andrew Wong <aw...@apache.org>
AuthorDate: Fri Nov 15 12:33:05 2019 -0800

    docs: add docs for maintenance mode
    
    A staged version can be found here:
    https://github.com/andrwng/kudu/blob/docs_maintenance_mode/docs/administration.adoc#minimizing_cluster_disruption_during_temporary_single_ts_downtime
    
    Change-Id: I36b9eddc1d4d4a4e4cb149058fc6d6f438e47f1f
    Reviewed-on: http://gerrit.cloudera.org:8080/14718
    Reviewed-by: Alexey Serbin <as...@cloudera.com>
    Tested-by: Kudu Jenkins
---
 docs/administration.adoc | 45 +++++++++++++++++++++++++++++++++++++--------
 1 file changed, 37 insertions(+), 8 deletions(-)

diff --git a/docs/administration.adoc b/docs/administration.adoc
index 2827235..f8a6091 100644
--- a/docs/administration.adoc
+++ b/docs/administration.adoc
@@ -1461,14 +1461,43 @@ for more than `--follower_unavailable_considered_failed_sec` (default 300)
 seconds, the tablet replicas on the down tablet server will be replaced by new
 replicas on available tablet servers. This will cause stress on the cluster
 as tablets re-replicate and, if the downtime lasts long enough, significant
-reduction in the number of replicas on the down tablet server. This may require
-the rebalancer to fix.
+reduction in the number of replicas on the down tablet server, which would
+require the rebalancer to fix.
 
-To work around this, increase `--follower_unavailable_considered_failed_sec` on
-all tablet servers so the amount of time before re-replication will start is
-longer than the expected downtime of the tablet server, including the time it
-takes the tablet server to restart and bootstrap its tablet replicas. To do
-this, run the following command for each tablet server:
+To work around this, in Kudu versions from 1.11 onwards, the `kudu` CLI
+contains a tool to put tablet servers into maintenance mode. While in this
+state, the tablet server’s replicas are not re-replicated due to its downtime
+alone, though re-replication may still occur in the event that the server in
+maintenance suffers from a disk failure or if a follower replica on the tablet
+server falls too far behind its leader replica. Upon exiting maintenance,
+re-replication is triggered for any remaining under-replicated tablets.
+
+The `kudu tserver state enter_maintenance` and `kudu tserver state
+exit_maintenance` tools orchestrate tablet server maintenance. Run the
+following from a tablet server to put it into maintenance mode:
+
+[source,bash]
+----
+$ TS_UUID=$(sudo -u kudu kudu fs dump uuid --fs_wal_dir=<wal_dir> --fs_data_dirs=<data_dirs>)
+$ sudo -u kudu kudu tserver state enter_maintenance <master_addresses> "$TS_UUID"
+----
+
+A tablet server's maintenance state is shown on the "Tablet Servers" page of the
+Kudu leader master's web UI and in the output of `kudu cluster ksck`. To exit
+maintenance mode, run the following:
+
+[source,bash]
+----
+$ sudo -u kudu kudu tserver state exit_maintenance <master_addresses> "$TS_UUID"
+----
+
+In versions prior to 1.11, a different approach must be used to prevent
+unwanted re-replication. Increase
+`--follower_unavailable_considered_failed_sec` on all tablet servers so the
+amount of time before re-replication starts is longer than the expected
+downtime of the tablet server, including the time it takes the tablet server to
+restart and bootstrap its tablet replicas. To do this, run the following
+command for each tablet server:
 
 [source,bash]
 ----
@@ -1486,7 +1515,7 @@ WARNING: Be sure to reset the value of `--follower_unavailable_considered_failed
 to its original value.
 
 NOTE: On Kudu versions prior to 1.8, the `--force` flag must be provided in the above
-commands.
+`set_flag` commands.
 
 [[rebalancer_tool]]
 === Running the tablet rebalancing tool
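
The enter/exit maintenance steps documented in this change could be wrapped around a short outage roughly as follows. This is a minimal sketch, not part of the commit: the `with_maintenance` function and the `KUDU_CMD` variable are hypothetical names introduced for illustration, and only the `kudu tserver state enter_maintenance` and `kudu tserver state exit_maintenance` invocations are taken from the docs above. It assumes Kudu 1.11 or later and that the caller supplies the master addresses and tablet server UUID.

```shell
#!/usr/bin/env bash
# Sketch only: KUDU_CMD and with_maintenance are illustrative assumptions.

# How the kudu CLI is invoked. The docs run it as the kudu user; the
# variable can be overridden (for example with "echo kudu") for a dry run.
KUDU_CMD=${KUDU_CMD:-"sudo -u kudu kudu"}

# Usage: with_maintenance <master_addresses> <ts_uuid> <command> [args...]
# Runs <command> while the tablet server is in maintenance mode.
with_maintenance() {
  local masters="$1" ts_uuid="$2" status=0
  shift 2

  # Enter maintenance first; if this fails, leave the server untouched.
  # KUDU_CMD is intentionally unquoted so it splits into command and args.
  $KUDU_CMD tserver state enter_maintenance "$masters" "$ts_uuid" || return 1

  # Run the maintenance task (such as a restart) and remember its status.
  "$@" || status=$?

  # Always attempt to exit maintenance so that re-replication of any
  # remaining under-replicated tablets can be triggered again.
  $KUDU_CMD tserver state exit_maintenance "$masters" "$ts_uuid" || return 1

  return "$status"
}
```

A call would pass the same `<master_addresses>` and `"$TS_UUID"` values used in the documented commands, followed by whatever restart command the deployment uses (the restart command itself is site-specific and not covered by this change). Exiting maintenance even when the task fails is a design choice of this sketch, made so a failed restart does not leave re-replication suppressed indefinitely.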