Posted to commits@kudu.apache.org by mp...@apache.org on 2017/05/25 20:23:32 UTC

kudu git commit: Fix flaky test TestRestartWithOrphanedReplicates

Repository: kudu
Updated Branches:
  refs/heads/master e3c5dd18c -> 5d4bf0c15


Fix flaky test TestRestartWithOrphanedReplicates

It looks like the test set the fault injection flag before
TestWorkload::Setup(), so the tablet server could crash during table
creation and cause Setup() to time out. Setting the flag after Setup()
but before Start() appears to fix this. We were seeing about 1 failure
per 1000 runs before this change and 0 per 1000 after.

Dist test job after patch:
http://dist-test.cloudera.org/job?job_id=efan.1495643022.19373

Dist test job before patch with failures:
http://dist-test.cloudera.org/job?job_id=efan.1495149936.30555
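
The ordering problem described above can be illustrated with a small
standalone sketch. This is not Kudu code: FakeServer, Setup, RunWorkload
and RunTest are hypothetical names, and the "crash" is only a flag on a
fake server object. It assumes that setup needs one successful commit
(standing in for table creation) and that the fault fires with a fixed
probability on every commit append.

```cpp
#include <cassert>
#include <random>

// Hypothetical stand-in for a tablet server with a probabilistic
// crash-before-commit fault. Not a Kudu API.
class FakeServer {
 public:
  explicit FakeServer(unsigned seed) : rng_(seed) {}

  // Analogous in spirit to setting a fault injection flag to a fraction.
  void SetCrashProbability(double p) {
    assert(p >= 0.0 && p <= 1.0);
    crash_probability_ = p;
  }

  bool alive() const { return alive_; }

  // Returns false if the server is already dead or "crashes" on this append.
  bool AppendCommit() {
    if (!alive_) return false;
    std::bernoulli_distribution crash(crash_probability_);
    if (crash(rng_)) {
      alive_ = false;
      return false;
    }
    return true;
  }

 private:
  std::mt19937 rng_;
  double crash_probability_ = 0.0;
  bool alive_ = true;
};

// Setup succeeds only if its single commit goes through.
bool Setup(FakeServer& server) { return server.AppendCommit(); }

// Writes until the server crashes or max_writes is reached.
// Returns the number of successful writes.
int RunWorkload(FakeServer& server, int max_writes) {
  int ok = 0;
  while (ok < max_writes && server.AppendCommit()) ++ok;
  return ok;
}

struct Outcome {
  bool setup_ok;
  bool crashed_during_workload;
};

// Runs the fake test with the fault enabled either before or after setup.
Outcome RunTest(bool flag_before_setup, double p, unsigned seed) {
  FakeServer server(seed);
  if (flag_before_setup) server.SetCrashProbability(p);
  if (!Setup(server)) return {false, false};  // the flaky failure mode
  if (!flag_before_setup) server.SetCrashProbability(p);
  RunWorkload(server, 10000);
  return {true, !server.alive()};
}
```

Calling RunTest(true, p, seed) models the old ordering, where setup
itself is exposed to the fault. RunTest(false, p, seed) models the new
ordering, where setup always completes and only the workload can hit
the injected crash.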

Example from a failure log:

I0518 23:26:00.447599  2346 master_service.cc:195] Signed X509 certificate for tserver {username='slave'} at 127.9.10.0:34667
W0518 23:26:00.448045  2529 fault_injection.cc:38] FAULT INJECTION ENABLED!
W0518 23:26:00.448065  2529 fault_injection.cc:39] THIS SERVER MAY CRASH!
E0518 23:26:00.448072  2529 fault_injection.cc:54] Injecting fault: FLAGS_fault_crash_before_append_commit (process will exit)
I0518 23:26:00.448150  2510 heartbeater.cc:380] Master 127.0.0.1:38357 was elected leader, sending a full tablet report...
W0518 23:26:00.450608  2331 connection.cc:462] server connection from 127.9.10.0:34667 recv error: Network error: failed to read from TLS socket: Connection reset by peer (error 104)
W0518 23:26:00.450630  2319 connection.cc:462] client connection to 127.9.10.0:51332 recv error: Network error: failed to read from TLS socket: Connection reset by peer (error 104)
W0518 23:26:00.450822  2331 connection.cc:462] client connection to 127.9.10.0:51332 recv error: Network error: failed to read from TLS socket: Connection reset by peer (error 104)
F0518 23:26:30.315480  2314 test_workload.cc:266] Timed out: Timed out waiting for Table Creation
*** Check failure stack trace: ***
    @     0x7fb089e3b2fd  google::LogMessage::Fail() at ??:0
    @     0x7fb089e3d1bd  google::LogMessage::SendToLog() at ??:0
    @     0x7fb089e3ae39  google::LogMessage::Flush() at ??:0
    @     0x7fb089e3dc5f  google::LogMessageFatal::~LogMessageFatal() at ??:0
    @     0x7fb0943b7492  kudu::TestWorkload::Setup() at ??:0
    @           0x40e68e  kudu::TsRecoveryITest_TestRestartWithOrphanedReplicates_Test::TestBody() at /home/efan/src/kudu/build/release/../../src/kudu/integration-tests/ts_recovery-itest.cc:107
    @     0x7fb08ad35af8  testing::internal::HandleExceptionsInMethodIfSupported<>() at ??:0
    @     0x7fb08ad29fc2  testing::Test::Run() at ??:0
    @     0x7fb08ad2a108  testing::TestInfo::Run() at ??:0
    @     0x7fb08ad2a1e5  testing::TestCase::Run() at ??:0
    @     0x7fb08ad2a4c8  testing::internal::UnitTestImpl::RunAllTests() at ??:0
    @     0x7fb08ad36008  testing::internal::HandleExceptionsInMethodIfSupported<>() at ??:0
    @     0x7fb08ad2a7ad  testing::UnitTest::Run() at ??:0
    @     0x7fb09416566c  main at ??:0
    @     0x7fb088083f45  __libc_start_main at ??:0
    @           0x40dfb9  (unknown) at ??:?

Change-Id: Ied9a55abd20841d350589ce56aa935ea1feece79
Reviewed-on: http://gerrit.cloudera.org:8080/6976
Tested-by: Kudu Jenkins
Reviewed-by: Alexey Serbin <as...@cloudera.com>
Reviewed-by: Mike Percy <mp...@apache.org>


Project: http://git-wip-us.apache.org/repos/asf/kudu/repo
Commit: http://git-wip-us.apache.org/repos/asf/kudu/commit/5d4bf0c1
Tree: http://git-wip-us.apache.org/repos/asf/kudu/tree/5d4bf0c1
Diff: http://git-wip-us.apache.org/repos/asf/kudu/diff/5d4bf0c1

Branch: refs/heads/master
Commit: 5d4bf0c1586b60a23aebe47daa67d4f8bf8874f4
Parents: e3c5dd1
Author: Edward Fancher <ef...@cloudera.com>
Authored: Wed May 24 11:44:56 2017 -0500
Committer: Mike Percy <mp...@apache.org>
Committed: Thu May 25 20:23:15 2017 +0000

----------------------------------------------------------------------
 src/kudu/integration-tests/ts_recovery-itest.cc | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/kudu/blob/5d4bf0c1/src/kudu/integration-tests/ts_recovery-itest.cc
----------------------------------------------------------------------
diff --git a/src/kudu/integration-tests/ts_recovery-itest.cc b/src/kudu/integration-tests/ts_recovery-itest.cc
index 7ef8d49..8913140 100644
--- a/src/kudu/integration-tests/ts_recovery-itest.cc
+++ b/src/kudu/integration-tests/ts_recovery-itest.cc
@@ -82,8 +82,6 @@ void TsRecoveryITest::StartClusterOneTs(const vector<string>& extra_tserver_flag
 // inserted before the crash are recovered.
 TEST_F(TsRecoveryITest, TestRestartWithOrphanedReplicates) {
   NO_FATALS(StartClusterOneTs());
-  cluster_->SetFlag(cluster_->tablet_server(0),
-                    "fault_crash_before_append_commit", "0.05");
 
   TestWorkload work(cluster_.get());
   work.set_num_replicas(1);
@@ -91,6 +89,10 @@ TEST_F(TsRecoveryITest, TestRestartWithOrphanedReplicates) {
   work.set_write_timeout_millis(100);
   work.set_timeout_allowed(true);
   work.Setup();
+
+  // Crash when the WAL contains a replicate message but no corresponding commit.
+  cluster_->SetFlag(cluster_->tablet_server(0),
+                    "fault_crash_before_append_commit", "0.05");
   work.Start();
 
   // Wait for the process to crash due to the injected fault.
@@ -101,6 +103,8 @@ TEST_F(TsRecoveryITest, TestRestartWithOrphanedReplicates) {
 
   // Restart the server, and it should recover.
   cluster_->tablet_server(0)->Shutdown();
+
+  // Restart the server and check to make sure that the change is eventually applied.
   ASSERT_OK(cluster_->tablet_server(0)->Restart());