Posted to commits@kudu.apache.org by aw...@apache.org on 2019/07/17 00:02:32 UTC

[kudu] branch master updated (c7bf122 -> bec75e5)

This is an automated email from the ASF dual-hosted git repository.

awong pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/kudu.git.


    from c7bf122  [maintenance] Add extra config for maintenance manager task priority
     new 899f7a1  [docs] update the upgrade documentation
     new bec75e5  KUDU-2635: ignore failures to delete orphaned blocks

The 2 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 docs/installation.adoc                 | 54 ++++++++++++++++++++++++++++++++--
 src/kudu/tablet/tablet_metadata.cc     |  7 +++--
 src/kudu/tserver/tablet_server-test.cc | 26 ++++++++++++++++
 3 files changed, 83 insertions(+), 4 deletions(-)

[kudu] 01/02: [docs] update the upgrade documentation

Posted by aw...@apache.org.

awong pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/kudu.git

commit 899f7a101d5713e8e010377d81c06a28e5095973
Author: helifu <hz...@corp.netease.com>
AuthorDate: Tue Jul 9 14:40:57 2019 +0800

    [docs] update the upgrade documentation
    
    Document the cluster upgrade process in
    docs/installation.adoc.
    
    Change-Id: I6b3e5c549dc05c3388c0b0dd628d205a356da344
    Reviewed-on: http://gerrit.cloudera.org:8080/13820
    Tested-by: Kudu Jenkins
    Reviewed-by: Andrew Wong <aw...@cloudera.com>
---
 docs/installation.adoc | 54 ++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 52 insertions(+), 2 deletions(-)
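
Note: the ordering of the documented steps (raise the flag everywhere,
restart one server at a time and raise the flag again, then restore the
default) can be made explicit with a small sketch. This is only an
illustration, not part of the patch: `BuildUpgradePlan` is a hypothetical
helper, the server names are placeholders, and nothing is executed. It
only assembles the same command strings that appear in the documentation
example.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not part of Kudu): returns the ordered steps of the
// rolling tablet server upgrade as plain strings. Nothing is executed.
std::vector<std::string> BuildUpgradePlan(const std::vector<std::string>& tservers,
                                          int raised_sec, int default_sec) {
  const std::string flag = "follower_unavailable_considered_failed_sec";
  auto set_flag = [&flag](const std::string& ts, int value) {
    return "./kudu tserver set_flag " + ts + " " + flag + " " + std::to_string(value);
  };

  std::vector<std::string> plan;
  // Step 1: raise the flag on every tablet server before any restart.
  for (const std::string& ts : tservers) {
    plan.push_back(set_flag(ts, raised_sec));
  }
  // Step 2: restart one server at a time. A restart resets the flag,
  // so it is raised again once the server is back online.
  for (const std::string& ts : tservers) {
    plan.push_back("<restart " + ts + " and wait until " + ts + " is online>");
    plan.push_back(set_flag(ts, raised_sec));
  }
  // Step 3: restore the default value on every tablet server.
  for (const std::string& ts : tservers) {
    plan.push_back(set_flag(ts, default_sec));
  }
  return plan;
}
```

Calling `BuildUpgradePlan({"A", "B", "C"}, 7200, 300)` yields twelve steps
in the same order as the three-server example in the diff.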

diff --git a/docs/installation.adoc b/docs/installation.adoc
index a65792b..13e8672 100644
--- a/docs/installation.adoc
+++ b/docs/installation.adoc
@@ -632,8 +632,58 @@ Before upgrading, you should read the link:release_notes.html[Release Notes] for
 the version of Kudu that you are about to install. Pay close attention to the
 incompatibilities, upgrade, and downgrade notes that are documented there.
 
-NOTE: Currently rolling upgrades are not supported. Please shut down all Kudu services before
-  upgrading the software.
+WARNING: The following upgrade process is only relevant when you have binaries available.
+
+. Prepare the software.
+  - Place the new `kudu-tserver`, `kudu-master`, and `kudu` binaries into the appropriate
+    Kudu binary directory.
+. Upgrade the tablet servers.
+  - Set the `follower_unavailable_considered_failed_sec` configuration to a high value
+    (conservatively, twice the expected restart time) to prevent tablet replicas hosted
+    on restarting tablet servers from being evicted and re-replicated.
++
+[source,bash]
+----
+$ ./kudu tserver set_flag <tserver> follower_unavailable_considered_failed_sec 7200
+----
+  - Restart one tablet server.
+  - Wait for all tablet replicas on the tablet server to finish bootstrapping by viewing
+    the `/tablets` page in the tablet server web UI.
+  - Restarting the tablet server will have reset the `follower_unavailable_considered_failed_sec`
+    configuration. Raise it again as needed.
+  - Repeat the previous 3 steps for the remaining tablet servers.
+  - Restore the original gflag value of every tablet server (the default is 5 minutes).
++
+[source,bash]
+----
+$ ./kudu tserver set_flag <tserver> follower_unavailable_considered_failed_sec 300
+----
++
+An example for a cluster with three tablet servers A, B, C:
++
+[source,bash]
+----
+# Step 1: Set the unavailable time for every tablet server to a large value
+$ ./kudu tserver set_flag A follower_unavailable_considered_failed_sec 7200
+$ ./kudu tserver set_flag B follower_unavailable_considered_failed_sec 7200
+$ ./kudu tserver set_flag C follower_unavailable_considered_failed_sec 7200
+
+# Step 2: Restart the tablet servers one by one, raising the gflag again after each restart
+<restart A and wait until A is online>
+$ ./kudu tserver set_flag A follower_unavailable_considered_failed_sec 7200
+<restart B and wait until B is online>
+$ ./kudu tserver set_flag B follower_unavailable_considered_failed_sec 7200
+<restart C and wait until C is online>
+$ ./kudu tserver set_flag C follower_unavailable_considered_failed_sec 7200
+
+# Step 3: Restore the default gflag value (5 minutes) for every tablet server
+$ ./kudu tserver set_flag A follower_unavailable_considered_failed_sec 300
+$ ./kudu tserver set_flag B follower_unavailable_considered_failed_sec 300
+$ ./kudu tserver set_flag C follower_unavailable_considered_failed_sec 300
+----
++
+. Upgrade the master servers.
+  - Restart the master servers one by one.
 
 [[next_steps]]
 == Next Steps

[kudu] 02/02: KUDU-2635: ignore failures to delete orphaned blocks

Posted by aw...@apache.org.

awong pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/kudu.git

commit bec75e5ac03eec74da8c091a99a4b9f9a27e2b2b
Author: Andrew Wong <aw...@cloudera.com>
AuthorDate: Fri Jul 12 19:53:29 2019 -0700

    KUDU-2635: ignore failures to delete orphaned blocks
    
    It was previously possible to leave some orphaned blocks in the
    in-memory orphaned blocks list by failing to delete some blocks due to a
    disk failure. Upon deleting a tablet with such orphaned blocks, Kudu
    could crash, as this breaks the assumption in the TabletMetadata that
    once we've made the call to delete orphaned blocks, then the orphaned
    blocks list will be empty.
    
    This patch addresses this by removing the blocks from the orphaned
    blocks list regardless of whether the blocks were deleted at the block
    manager level. At worst, this leaves us with untracked blocks, but it's
    better than crashing due to a bogus assumption.
    
    A test is added that, without the patch, passed 1/100 times when run
    on dist-test in debug mode with stress. With the patch, it passes
    100/100 times.
    
    Change-Id: Ice78f41d6d367d42ad31c2127ceb5fc57a244e34
    Reviewed-on: http://gerrit.cloudera.org:8080/13858
    Tested-by: Kudu Jenkins
    Reviewed-by: Adar Dembo <ad...@cloudera.com>
---
 src/kudu/tablet/tablet_metadata.cc     |  7 +++++--
 src/kudu/tserver/tablet_server-test.cc | 26 ++++++++++++++++++++++++++
 2 files changed, 31 insertions(+), 2 deletions(-)
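
Note: the behavioral change can be illustrated with a minimal,
self-contained model. This is a sketch under stated assumptions, not
Kudu's actual code: `OrphanModel`, the `bad_disk` set, and the
`erase_all_requested` switch are hypothetical stand-ins, and `BlockId` is
reduced to an `int`. The sketch contrasts erasing only the successfully
deleted blocks (the old loop over `deleted`) with erasing every requested
block (the new loop over `blocks`).

```cpp
#include <cstddef>
#include <set>
#include <vector>

using BlockId = int;  // Hypothetical stand-in for Kudu's BlockId.

// Hypothetical model of the in-memory orphaned blocks bookkeeping.
struct OrphanModel {
  std::set<BlockId> orphaned;  // Blocks currently tracked as orphaned.
  std::set<BlockId> bad_disk;  // Blocks whose deletion fails (simulated EIO).

  // Attempts to delete `blocks` and returns how many blocks are still
  // tracked as orphaned afterwards.
  std::size_t DeleteOrphanedBlocks(const std::vector<BlockId>& blocks,
                                   bool erase_all_requested) {
    // Simulate the block manager: deletion succeeds unless the block is
    // on the failed disk.
    std::vector<BlockId> deleted;
    for (BlockId b : blocks) {
      if (bad_disk.count(b) == 0) {
        deleted.push_back(b);
      }
    }
    // Old behavior: erase only `deleted`. New behavior: erase all of `blocks`.
    const std::vector<BlockId>& to_erase = erase_all_requested ? blocks : deleted;
    for (BlockId b : to_erase) {
      orphaned.erase(b);
    }
    return orphaned.size();
  }
};
```

With one block on the simulated bad disk, the old behavior leaves that
block in the tracked set, which is the state the commit message describes
as breaking the empty-list assumption; the new behavior leaves the set
empty at the cost of an untracked block on disk.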

diff --git a/src/kudu/tablet/tablet_metadata.cc b/src/kudu/tablet/tablet_metadata.cc
index 006c231..8189085 100644
--- a/src/kudu/tablet/tablet_metadata.cc
+++ b/src/kudu/tablet/tablet_metadata.cc
@@ -524,10 +524,13 @@ void TabletMetadata::DeleteOrphanedBlocks(const vector<BlockId>& blocks) {
   WARN_NOT_OK(transaction->CommitDeletedBlocks(&deleted),
               "not all orphaned blocks were deleted");
 
-  // Remove the successfully-deleted blocks from the set.
+  // Regardless of whether we deleted all the blocks or not, remove them from
+  // the orphaned blocks list. If we failed to delete the blocks due to
+  // hardware issues, there's not much we can do and we assume the disk isn't
+  // coming back. At worst, this leaves some untracked orphaned blocks.
   {
     std::lock_guard<LockType> l(data_lock_);
-    for (const BlockId& b : deleted) {
+    for (const BlockId& b : blocks) {
       orphaned_blocks_.erase(b);
     }
   }
diff --git a/src/kudu/tserver/tablet_server-test.cc b/src/kudu/tserver/tablet_server-test.cc
index 717aacb..9ef93ac 100644
--- a/src/kudu/tserver/tablet_server-test.cc
+++ b/src/kudu/tserver/tablet_server-test.cc
@@ -59,6 +59,7 @@
 #include "kudu/fs/fs-test-util.h"
 #include "kudu/fs/fs.pb.h"
 #include "kudu/fs/fs_manager.h"
+#include "kudu/gutil/basictypes.h"
 #include "kudu/gutil/callback.h"
 #include "kudu/gutil/casts.h"
 #include "kudu/gutil/gscoped_ptr.h"
@@ -697,6 +698,31 @@ TEST_P(TabletServerDiskErrorTest, TestRandomOpSequence) {
   LOG(INFO) << "Tablet was successfully failed";
 }
 
+// Regression test for KUDU-2635.
+TEST_F(TabletServerTest, TestEIODuringDelete) {
+  // Delete some blocks, but don't always delete them persistently so we're
+  // left with some orphaned blocks in the orphaned blocks list. We'll do this
+  // by injecting some EIOs.
+  NO_FATALS(InsertTestRowsRemote(1, 1));
+  ASSERT_OK(tablet_replica_->tablet()->Flush());
+  NO_FATALS(UpdateTestRowRemote(1, 2));
+  ASSERT_OK(tablet_replica_->tablet()->FlushAllDMSForTests());
+  FsManager* fs_manager = mini_server_->server()->fs_manager();
+  FLAGS_env_inject_eio_globs = JoinPathSegments(fs_manager->GetDataRootDirs()[0], "**");
+  FLAGS_env_inject_eio = 0.5;
+  ignore_result(tablet_replica_->tablet()->MajorCompactAllDeltaStoresForTests());
+
+  // Delete the tablet while still injecting failures. Even if we aren't
+  // successful in deleting our orphaned blocks list, we shouldn't crash.
+  DeleteTabletRequestPB req;
+  DeleteTabletResponsePB resp;
+  req.set_dest_uuid(fs_manager->uuid());
+  req.set_tablet_id(kTabletId);
+  req.set_delete_type(tablet::TABLET_DATA_DELETED);
+  RpcController rpc;
+  ASSERT_OK(admin_proxy_->DeleteTablet(req, &resp, &rpc));
+}
+
 TEST_F(TabletServerTest, TestInsert) {
   WriteRequestPB req;