You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@kudu.apache.org by mp...@apache.org on 2017/08/25 19:26:08 UTC

[2/2] kudu git commit: Fix flaky ts_recovery-itest TestChangeMaxCellSize

Fix flaky ts_recovery-itest TestChangeMaxCellSize

In TSTabletManager::CreateReportedTabletPB(), each replica's state is
reported, and if this state is FAILED, it additionally reports the error
state that caused it to fail. However, replica state is transient, and
can, on rare occasion, change between these two retrievals of replica
state.

This led to a DCHECK failure in ts_recovery-itest when parsing a
report, as reports with errors are expected to be FAILED. The failure
was consistent with the following sequence of events:
  1. The tablet is opened, hits an error, and starts shutting down
  2. The report notes the replica state (STOPPING)
  3. The tablet finishes shutting down and switches its state (FAILED)
  4. The report, seeing the replica is FAILED, notes the error that
     caused the shut down
  5. The catalog manager goes through the report, and notes the error
     tacked onto the report and the fact that the tablet is STOPPING
     instead of FAILED. This fails the DCHECK, as logged below:

BlockManagerType/TsRecoveryITest.TestChangeMaxCellSize/0: catalog_manager.cc:2489] Check failed: report.state() == tablet::FAILED (3 vs. 2)

This sequence may be more frequent since a9d17c0, which enforces the
tablet stops (i.e. go into the STOPPING state and shut down the
replica's internals) before going into the FAILED state, instead of
going directly to the FAILED state.

Validated by injecting a pause in the CreateReportedTabletPB between the
two calls and hitting the DCHECK failure. This is fixed by reporting the
error regardless of state and removing this DCHECK.

Change-Id: I6f0a6b19756777e5f4081ef6c8cb5af4ecc8a3d6
Reviewed-on: http://gerrit.cloudera.org:8080/7820
Tested-by: Kudu Jenkins
Reviewed-by: Mike Percy <mp...@apache.org>


Project: http://git-wip-us.apache.org/repos/asf/kudu/repo
Commit: http://git-wip-us.apache.org/repos/asf/kudu/commit/bd5cb8b4
Tree: http://git-wip-us.apache.org/repos/asf/kudu/tree/bd5cb8b4
Diff: http://git-wip-us.apache.org/repos/asf/kudu/diff/bd5cb8b4

Branch: refs/heads/master
Commit: bd5cb8b466e5b0f26e3c8c975baa4fe7504b7ce5
Parents: 1177c8e
Author: Andrew Wong <aw...@cloudera.com>
Authored: Thu Aug 24 16:36:20 2017 -0700
Committer: Mike Percy <mp...@apache.org>
Committed: Fri Aug 25 19:13:31 2017 +0000

----------------------------------------------------------------------
 src/kudu/master/catalog_manager.cc    | 3 +--
 src/kudu/tserver/ts_tablet_manager.cc | 6 +++---
 2 files changed, 4 insertions(+), 5 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/kudu/blob/bd5cb8b4/src/kudu/master/catalog_manager.cc
----------------------------------------------------------------------
diff --git a/src/kudu/master/catalog_manager.cc b/src/kudu/master/catalog_manager.cc
index f7c3b44..3936cda 100644
--- a/src/kudu/master/catalog_manager.cc
+++ b/src/kudu/master/catalog_manager.cc
@@ -2486,8 +2486,7 @@ Status CatalogManager::HandleReportedTablet(TSDescriptor* ts_desc,
   if (report.has_error()) {
     Status s = StatusFromPB(report.error());
     DCHECK(!s.ok());
-    DCHECK_EQ(report.state(), tablet::FAILED);
-    LOG(WARNING) << "Tablet " << tablet->ToString() << " has failed on TS "
+    LOG(WARNING) << "Tablet " << tablet->ToString() << " experienced an error on TS "
                  << ts_desc->ToString() << ": " << s.ToString();
     return Status::OK();
   }

http://git-wip-us.apache.org/repos/asf/kudu/blob/bd5cb8b4/src/kudu/tserver/ts_tablet_manager.cc
----------------------------------------------------------------------
diff --git a/src/kudu/tserver/ts_tablet_manager.cc b/src/kudu/tserver/ts_tablet_manager.cc
index 3bba569..b014aab 100644
--- a/src/kudu/tserver/ts_tablet_manager.cc
+++ b/src/kudu/tserver/ts_tablet_manager.cc
@@ -1011,9 +1011,9 @@ void TSTabletManager::CreateReportedTabletPB(const string& tablet_id,
   reported_tablet->set_tablet_id(tablet_id);
   reported_tablet->set_state(replica->state());
   reported_tablet->set_tablet_data_state(replica->tablet_metadata()->tablet_data_state());
-  if (replica->state() == tablet::FAILED) {
-    AppStatusPB* error_status = reported_tablet->mutable_error();
-    StatusToPB(replica->error(), error_status);
+  const Status& error = replica->error();
+  if (!error.ok()) {
+    StatusToPB(error, reported_tablet->mutable_error());
   }
   reported_tablet->set_schema_version(replica->tablet_metadata()->schema_version());