You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Alexey Serbin (Code Review)" <ge...@cloudera.org> on 2017/08/30 01:08:44 UTC

[kudu-CR] WIP [raft consensus-itest] fix flake in TestSlowLeader

Alexey Serbin has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/7887

Change subject: WIP [raft_consensus-itest] fix flake in TestSlowLeader
......................................................................

WIP [raft_consensus-itest] fix flake in TestSlowLeader

Under rare conditions, a reader thread of the test workload of the
RaftConsensusITest.TestSlowLeader test might read data from a lagging
follower replica, while the writer threads just switched to a newly
elected leader replica.  When that happened, the test failed with
a stack trace like the following:

F0829 01:47:51.052803  1605 test_workload.cc:230] Check failed: \
  row_count >= expected_row_count (219550 vs. 249850)
*** Check failure stack trace: ***
    @           0x95e83d  google::LogMessage::Fail() \
                            at thirdparty/src/glog-0.3.5/src/logging.cc:1488
    @           0x9606fd  google::LogMessage::SendToLog() \
                            at thirdparty/src/glog-0.3.5/src/logging.cc:1442
    @           0x95e379  google::LogMessage::Flush() \
                            at thirdparty/src/glog-0.3.5/src/logging.cc:1312
    @           0x96119f  google::LogMessageFatal::~LogMessageFatal() \
                            at thirdparty/src/glog-0.3.5/src/logging.cc:2024
    @           0x955809  kudu::TestWorkload::ReadThread() \
                            at tr1/shared_ptr.h:340
    @     0x7fbdbc108a40  (unknown) at ??:0
    @     0x7fbdbd550184  start_thread at ??:0
    @     0x7fbdbbb7637d  clone at ??:0
    @              (nil)  (unknown)

This patch addresses the issue.  Basically, it works around the RYW
consistency issue which is seen here because of the absense of the
leader leases mechanism.

To test the modifications, I run the test 2K times multiple times,
none of those failed (RELEASE build):
  http://dist-test.cloudera.org//job?job_id=aserbin.1504051374.17423

To run the test without the work-around, I applied the patch below
and get 1 out of 2K failed:
  http://dist-test.cloudera.org//job?job_id=aserbin.1504053764.1127

WIP: because I'm not sure whether we want this workaround now
or we would better wait for the leader leases to be implemented.

------------------------------------------------------------------------
--- a/src/kudu/integration-tests/raft_consensus-itest.cc
+++ b/src/kudu/integration-tests/raft_consensus-itest.cc
@@ -2571,7 +2571,7 @@ TEST_F(RaftConsensusITest, TestSlowLeader) {
   if (!AllowSlowTests()) return;

   static const int kHbIntervalMs = 32;
-  static const int kMaxMissedHbPeriods = 3;
+  static const int kMaxMissedHbPeriods = 1;
   const vector<string> tserver_flags = {
     Substitute("--raft_heartbeat_interval_ms=$0", kHbIntervalMs),
     Substitute("--leader_failure_max_missed_heartbeat_periods=$0",
@@ -2586,9 +2586,9 @@ TEST_F(RaftConsensusITest, TestSlowLeader) {
   TestWorkload workload(cluster_.get());
   workload.set_table_name(kTableId);
   workload.set_num_read_threads(2);
-  workload.set_read_retry_enabled(true);
-  workload.set_read_retry_delay(
-      MonoDelta::FromMilliseconds(kHbIntervalMs * kMaxMissedHbPeriods));
+  //workload.set_read_retry_enabled(true);
+  //workload.set_read_retry_delay(
+  //    MonoDelta::FromMilliseconds(kHbIntervalMs * kMaxMissedHbPeriods));
   workload.Setup();
   workload.Start();
   SleepFor(MonoDelta::FromSeconds(60));
------------------------------------------------------------------------

Change-Id: Ie5ee6c5400c947f87b1da2e76d24dd837b1270ca
---
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/integration-tests/test_workload.cc
M src/kudu/integration-tests/test_workload.h
3 files changed, 43 insertions(+), 12 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/87/7887/1
-- 
To view, visit http://gerrit.cloudera.org:8080/7887
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Ie5ee6c5400c947f87b1da2e76d24dd837b1270ca
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>

[kudu-CR] WIP [raft consensus-itest] fix flake in TestSlowLeader

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/7887 )

Change subject: WIP [raft_consensus-itest] fix flake in TestSlowLeader
......................................................................


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7887/2/src/kudu/integration-tests/test_workload.cc
File src/kudu/integration-tests/test_workload.cc:

http://gerrit.cloudera.org:8080/#/c/7887/2/src/kudu/integration-tests/test_workload.cc@220
PS2, Line 220:     while (true) {
instead what if we just used READ_AT_SNAPSHOT scan mode and propagated timestamps back from the writer threads? that way we'd actually get coverage of our snapshot scan code paths.



-- 
To view, visit http://gerrit.cloudera.org:8080/7887
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie5ee6c5400c947f87b1da2e76d24dd837b1270ca
Gerrit-Change-Number: 7887
Gerrit-PatchSet: 2
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Fri, 06 Oct 2017 01:29:24 +0000
Gerrit-HasComments: Yes

[kudu-CR] WIP [raft consensus-itest] fix flake in TestSlowLeader

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change.

Change subject: WIP [raft_consensus-itest] fix flake in TestSlowLeader
......................................................................


Patch Set 2:

> can you explain a little better the exact scenario that happens?
 > could you cause it deterministically or semi-determinically? could
 > you write a disabled test for it?

Thank you for looking at the patch.  So far I have no better explanation than in the changelist description.  Right now the failure happens semi-deterministically if running this test (1 out of 2K fails).  Yes, I think it would be possible to create a targeted test for that -- while doing so, I could produce a better explanation for the failure path scenario.

-- 
To view, visit http://gerrit.cloudera.org:8080/7887
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ie5ee6c5400c947f87b1da2e76d24dd837b1270ca
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-HasComments: No

[kudu-CR] WIP [raft consensus-itest] fix flake in TestSlowLeader

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/7887 )

Change subject: WIP [raft_consensus-itest] fix flake in TestSlowLeader
......................................................................


Patch Set 2:

I think this change makes sense but I am wondering what the purpose of the read thread is in the test, anyway. The original test doesn't do a good job of justifying its existence or implementation in my opinion. I am honestly surprised it is not more flaky than reported here.


-- 
To view, visit http://gerrit.cloudera.org:8080/7887
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie5ee6c5400c947f87b1da2e76d24dd837b1270ca
Gerrit-Change-Number: 7887
Gerrit-PatchSet: 2
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Comment-Date: Mon, 02 Oct 2017 21:51:57 +0000
Gerrit-HasComments: No

[kudu-CR] WIP [raft consensus-itest] fix flake in TestSlowLeader

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7887

to look at the new patch set (#2).

Change subject: WIP [raft_consensus-itest] fix flake in TestSlowLeader
......................................................................

WIP [raft_consensus-itest] fix flake in TestSlowLeader

Under rare conditions, a reader thread of the test workload of the
RaftConsensusITest.TestSlowLeader test might read data from a lagging
follower replica, while the writer threads just switched to a newly
elected leader replica.  When that happened, the test failed with
a stack trace like the following:

F0829 01:47:51.052803  1605 test_workload.cc:230] Check failed: \
  row_count >= expected_row_count (219550 vs. 249850)
*** Check failure stack trace: ***
    @           0x95e83d  google::LogMessage::Fail() \
                            at thirdparty/src/glog-0.3.5/src/logging.cc:1488
    @           0x9606fd  google::LogMessage::SendToLog() \
                            at thirdparty/src/glog-0.3.5/src/logging.cc:1442
    @           0x95e379  google::LogMessage::Flush() \
                            at thirdparty/src/glog-0.3.5/src/logging.cc:1312
    @           0x96119f  google::LogMessageFatal::~LogMessageFatal() \
                            at thirdparty/src/glog-0.3.5/src/logging.cc:2024
    @           0x955809  kudu::TestWorkload::ReadThread() \
                            at tr1/shared_ptr.h:340
    @     0x7fbdbc108a40  (unknown) at ??:0
    @     0x7fbdbd550184  start_thread at ??:0
    @     0x7fbdbbb7637d  clone at ??:0
    @              (nil)  (unknown)

This patch addresses the issue.  Basically, it works around the RYW
consistency issue which is seen here because of the absense of the
leader leases mechanism.

To test the modifications, I run the test 2K times multiple times,
none of those failed (RELEASE build):
  http://dist-test.cloudera.org//job?job_id=aserbin.1504051374.17423

To run the test without the work-around, I applied the patch below
and get 1 out of 2K failed:
  http://dist-test.cloudera.org//job?job_id=aserbin.1504053764.1127

WIP: because I'm not sure whether we want this workaround now
or we would better wait for the leader leases to be implemented.

------------------------------------------------------------------------
--- a/src/kudu/integration-tests/raft_consensus-itest.cc
+++ b/src/kudu/integration-tests/raft_consensus-itest.cc
@@ -2571,7 +2571,7 @@ TEST_F(RaftConsensusITest, TestSlowLeader) {
   if (!AllowSlowTests()) return;

   static const int kHbIntervalMs = 32;
-  static const int kMaxMissedHbPeriods = 3;
+  static const int kMaxMissedHbPeriods = 1;
   const vector<string> tserver_flags = {
     Substitute("--raft_heartbeat_interval_ms=$0", kHbIntervalMs),
     Substitute("--leader_failure_max_missed_heartbeat_periods=$0",
@@ -2586,9 +2586,9 @@ TEST_F(RaftConsensusITest, TestSlowLeader) {
   TestWorkload workload(cluster_.get());
   workload.set_table_name(kTableId);
   workload.set_num_read_threads(2);
-  workload.set_read_retry_enabled(true);
-  workload.set_read_retry_delay(
-      MonoDelta::FromMilliseconds(kHbIntervalMs * kMaxMissedHbPeriods));
+  //workload.set_read_retry_enabled(true);
+  //workload.set_read_retry_delay(
+  //    MonoDelta::FromMilliseconds(kHbIntervalMs * kMaxMissedHbPeriods));
   workload.Setup();
   workload.Start();
   SleepFor(MonoDelta::FromSeconds(60));
------------------------------------------------------------------------

Change-Id: Ie5ee6c5400c947f87b1da2e76d24dd837b1270ca
---
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/integration-tests/test_workload.cc
M src/kudu/integration-tests/test_workload.h
3 files changed, 43 insertions(+), 12 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/87/7887/2
-- 
To view, visit http://gerrit.cloudera.org:8080/7887
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ie5ee6c5400c947f87b1da2e76d24dd837b1270ca
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot

[kudu-CR] WIP [raft consensus-itest] fix flake in TestSlowLeader

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: WIP [raft_consensus-itest] fix flake in TestSlowLeader
......................................................................


Patch Set 2:

can you explain a little better the exact scenario that happens? could you cause it deterministically or semi-determinically? could you write a disabled test for it?

-- 
To view, visit http://gerrit.cloudera.org:8080/7887
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ie5ee6c5400c947f87b1da2e76d24dd837b1270ca
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-HasComments: No

[kudu-CR] WIP [raft consensus-itest] fix flake in TestSlowLeader

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change.

Change subject: WIP [raft_consensus-itest] fix flake in TestSlowLeader
......................................................................


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/7887/1/src/kudu/integration-tests/raft_consensus-itest.cc
File src/kudu/integration-tests/raft_consensus-itest.cc:

Line 2591:       MonoDelta::FromMilliseconds(kHbIntervalMs * kMaxMissedHbPeriods));
> warning: either cast from 'int' to 'int64_t' (aka 'long') is ineffective, o
Done


http://gerrit.cloudera.org:8080/#/c/7887/1/src/kudu/integration-tests/test_workload.h
File src/kudu/integration-tests/test_workload.h:

Line 121:     read_retry_delay_ = std::move(delay);
> warning: std::move of the variable 'delay' of the trivially-copyable type '
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7887
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ie5ee6c5400c947f87b1da2e76d24dd837b1270ca
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-HasComments: Yes