Posted to reviews@kudu.apache.org by "Alexey Serbin (Code Review)" <ge...@cloudera.org> on 2018/03/24 03:40:04 UTC

[kudu-CR] [linked list-test] fix flake with the 3-4-3 scheme

Alexey Serbin has uploaded this change for review. ( http://gerrit.cloudera.org:8080/9795 )


Change subject: [linked_list-test] fix flake with the 3-4-3 scheme
......................................................................

[linked_list-test] fix flake with the 3-4-3 scheme

For the LinkedListTest.TestLoadWhileOneServerDownAndVerify scenario,
the logic of the test didn't account for the case when the replica
at the restarted tablet server was still a non-voter before shutting
down the other two tablet servers for the verification phase.  This
patch fixes that: the scenario is now more stable when running for
a long time.

Change-Id: I132206371e2935f1e0f39e9eacad866fde22c5b8
---
M src/kudu/integration-tests/linked_list-test.cc
1 file changed, 27 insertions(+), 15 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/95/9795/1
-- 
To view, visit http://gerrit.cloudera.org:8080/9795
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I132206371e2935f1e0f39e9eacad866fde22c5b8
Gerrit-Change-Number: 9795
Gerrit-PatchSet: 1
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>

[kudu-CR] [linked list-test] fix flake with the 3-4-3 scheme

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/9795 )

Change subject: [linked_list-test] fix flake with the 3-4-3 scheme
......................................................................


Patch Set 2: Verified+1

Unrelated flake in TabletServerTest.TestTombstonedTabletOnWebUI



Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I132206371e2935f1e0f39e9eacad866fde22c5b8
Gerrit-Change-Number: 9795
Gerrit-PatchSet: 2
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Sat, 24 Mar 2018 20:58:25 +0000
Gerrit-HasComments: No

[kudu-CR] [linked list-test] fix flake with the 3-4-3 scheme

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Hello Mike Percy, Jean-Daniel Cryans, Todd Lipcon, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/9795

to look at the new patch set (#3).

Change subject: [linked_list-test] fix flake with the 3-4-3 scheme
......................................................................

[linked_list-test] fix flake with the 3-4-3 scheme

The LinkedListTest.TestLoadWhileOneServerDownAndVerify scenario became
flaky when running with --seconds_to_run set to about 800 or more
once the 3-4-3 replica management scheme became the default one.

Those cases (--seconds_to_run=X, X >= 800) are special because the
stopped replica falls behind the WAL segment GC threshold when running with
the linked list input data, so the system automatically replaces
the failed replica.  In case of the 3-4-3 scheme, the newly added
replica is added as a non-voter.  WaitForServersToAgree() looks
only at the OpId indices, not distinguishing between voter and
non-voter replicas.  However, the verification phase of the scenario
assumes the only replica left alive is a voter replica.
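
To illustrate the mismatch described above, here is a minimal,
self-contained C++ sketch.  It does not use Kudu's test utilities:
ReplicaState, MemberType, IndicesAgree(), AllVoters() and
ExampleAfterReplacement() are hypothetical names invented for this
illustration, and the index value 1000 is arbitrary.  It only shows that
a check comparing OpId indices can pass while one replica is still a
non-voter.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration only; these are not Kudu types.
enum class MemberType { kVoter, kNonVoter };

struct ReplicaState {
  std::string ts_uuid;     // tablet server hosting the replica
  MemberType member_type;  // role of the replica in the Raft configuration
  int64_t last_op_index;   // index of the last operation in the replica's log
};

// Compares only the log indices, the way the commit message describes
// WaitForServersToAgree(): voters and non-voters are not distinguished.
bool IndicesAgree(const std::vector<ReplicaState>& replicas) {
  assert(!replicas.empty());
  const int64_t first = replicas.front().last_op_index;
  return std::all_of(replicas.begin(), replicas.end(),
                     [first](const ReplicaState& r) {
                       return r.last_op_index == first;
                     });
}

// The additional condition the verification phase relies on:
// every replica must be a voter.
bool AllVoters(const std::vector<ReplicaState>& replicas) {
  return std::all_of(replicas.begin(), replicas.end(),
                     [](const ReplicaState& r) {
                       return r.member_type == MemberType::kVoter;
                     });
}

// A replica set as it might look right after the failed replica has been
// replaced: the new replica has caught up but has not been promoted yet.
std::vector<ReplicaState> ExampleAfterReplacement() {
  return {
      {"ts-0", MemberType::kVoter, 1000},
      {"ts-1", MemberType::kVoter, 1000},
      {"ts-2", MemberType::kNonVoter, 1000},
  };
}
```

For the example set, IndicesAgree() is satisfied while AllVoters() is
not, which is the window in which shutting down the other two tablet
servers leaves only a non-voter replica alive.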

Prior to this fix, the scenario didn't account for the case when the
replica at the restarted tablet server was still a non-voter, and in
most cases the other 2 of the 3 tservers were shut down before the newly
added replica was promoted.  As a result, the latter replica was left
as a non-voter and the written data could not be read back from it.

This patch adds a step to verify that all 3 replicas are registered
as voters with the master(s) before shutting down the tservers hosting
the source 2 replicas.  The scenario is now stable when running with
--seconds_to_run=900.
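
The added step can be pictured as a poll-with-deadline loop: keep
checking a condition (here, "all 3 replicas are voters") until it holds
or a timeout expires, and only then proceed to shut down the other
tservers.  The review names the helper used in the test as
WaitForReplicasReportedToMaster; its signature is not shown in this
thread, so the sketch below uses a hypothetical WaitUntil() function
with an arbitrary default poll interval instead of any real Kudu API.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper, not part of Kudu.  Polls `condition` until it
// returns true or `timeout` elapses.  Returns true if the condition was
// met, false if the deadline passed first.
bool WaitUntil(const std::function<bool()>& condition,
               std::chrono::milliseconds timeout,
               std::chrono::milliseconds poll_interval =
                   std::chrono::milliseconds(10)) {
  assert(poll_interval.count() > 0);
  const auto deadline = std::chrono::steady_clock::now() + timeout;
  while (true) {
    // Check first, so a condition that already holds returns immediately.
    if (condition()) {
      return true;
    }
    if (std::chrono::steady_clock::now() >= deadline) {
      return false;
    }
    std::this_thread::sleep_for(poll_interval);
  }
}
```

In a test, the predicate would query the current replica roles and the
caller would fail the scenario if WaitUntil() returns false, rather than
continuing with a replica that cannot serve reads.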

Change-Id: I132206371e2935f1e0f39e9eacad866fde22c5b8
---
M src/kudu/integration-tests/linked_list-test.cc
1 file changed, 31 insertions(+), 14 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/95/9795/3

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I132206371e2935f1e0f39e9eacad866fde22c5b8
Gerrit-Change-Number: 9795
Gerrit-PatchSet: 3
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Jean-Daniel Cryans <jd...@apache.org>
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [linked list-test] fix flake with the 3-4-3 scheme

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/9795 )

Change subject: [linked_list-test] fix flake with the 3-4-3 scheme
......................................................................


Patch Set 2:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/9795/2/src/kudu/integration-tests/linked_list-test.cc
File src/kudu/integration-tests/linked_list-test.cc:

http://gerrit.cloudera.org:8080/#/c/9795/2/src/kudu/integration-tests/linked_list-test.cc@313
PS2, Line 313: WaitForReplicasReportedToMaster
I don't see how NON_VOTER can be related to this because we only have 3 tablet servers and we create replicas on each one with a config of 3 voters on line 298 in BuildAndStart().


http://gerrit.cloudera.org:8080/#/c/9795/2/src/kudu/integration-tests/linked_list-test.cc@314
PS2, Line 314: run_time
kWaitTime?


http://gerrit.cloudera.org:8080/#/c/9795/2/src/kudu/integration-tests/linked_list-test.cc@316
PS2, Line 316: run_time
I believe this should be kWaitTime, i.e. 30 seconds or so, and not related to the linked list test run time.




Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I132206371e2935f1e0f39e9eacad866fde22c5b8
Gerrit-Change-Number: 9795
Gerrit-PatchSet: 2
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Jean-Daniel Cryans <jd...@apache.org>
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Mon, 26 Mar 2018 19:06:18 +0000
Gerrit-HasComments: Yes

[kudu-CR] [linked list-test] fix flake with the 3-4-3 scheme

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/9795 )

Change subject: [linked_list-test] fix flake with the 3-4-3 scheme
......................................................................


Patch Set 2:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/9795/2/src/kudu/integration-tests/linked_list-test.cc
File src/kudu/integration-tests/linked_list-test.cc:

http://gerrit.cloudera.org:8080/#/c/9795/2/src/kudu/integration-tests/linked_list-test.cc@313
PS2, Line 313: WaitForReplicasReportedToMaster
> I don't see how NON_VOTER can be related to this because we only have 3 tab
I added some comments and updated the commit message.  Hopefully, that helps explain why that was a problem.


http://gerrit.cloudera.org:8080/#/c/9795/2/src/kudu/integration-tests/linked_list-test.cc@314
PS2, Line 314: run_time
> kWaitTime?
Per our discussion offline, I'm keeping it as run_time, but I just added a handy variable for better readability.


http://gerrit.cloudera.org:8080/#/c/9795/2/src/kudu/integration-tests/linked_list-test.cc@316
PS2, Line 316: run_time
> Oops, I missed where kWaitTime was defined. Since there was a comment for t
I ran ~250 iterations of the test at ve0518.halxg.cloudera.com (TSAN build, both slow and fast variants) and didn't find any flakiness with the current timeouts (I was using --stress_cpu_threads=32).  Also, I ran the test 1K times using dist_test without a single failure.

It seems something has changed and now we don't need those extra 5 seconds for TSAN builds.




Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I132206371e2935f1e0f39e9eacad866fde22c5b8
Gerrit-Change-Number: 9795
Gerrit-PatchSet: 2
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Jean-Daniel Cryans <jd...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Mon, 26 Mar 2018 22:34:55 +0000
Gerrit-HasComments: Yes

[kudu-CR] [linked list-test] fix flake with the 3-4-3 scheme

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/9795 )

Change subject: [linked_list-test] fix flake with the 3-4-3 scheme
......................................................................


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/9795/2/src/kudu/integration-tests/linked_list-test.cc
File src/kudu/integration-tests/linked_list-test.cc:

http://gerrit.cloudera.org:8080/#/c/9795/2/src/kudu/integration-tests/linked_list-test.cc@316
PS2, Line 316: run_time
> I believe this should be kWaitTime, i.e. 30 seconds or so, and not related 
Oops, I missed where kWaitTime was defined. Since there was a comment for that (particularly for TSAN builds) I wonder if we should retain that (additional 5 seconds) behavior.




Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I132206371e2935f1e0f39e9eacad866fde22c5b8
Gerrit-Change-Number: 9795
Gerrit-PatchSet: 2
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Jean-Daniel Cryans <jd...@apache.org>
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Mon, 26 Mar 2018 19:19:30 +0000
Gerrit-HasComments: Yes

[kudu-CR] [linked list-test] fix flake with the 3-4-3 scheme

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/9795 )

Change subject: [linked_list-test] fix flake with the 3-4-3 scheme
......................................................................


Patch Set 3: Code-Review+2



Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I132206371e2935f1e0f39e9eacad866fde22c5b8
Gerrit-Change-Number: 9795
Gerrit-PatchSet: 3
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Jean-Daniel Cryans <jd...@apache.org>
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Mon, 26 Mar 2018 23:08:39 +0000
Gerrit-HasComments: No

[kudu-CR] [linked list-test] fix flake with the 3-4-3 scheme

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/9795 )

Change subject: [linked_list-test] fix flake with the 3-4-3 scheme
......................................................................

[linked_list-test] fix flake with the 3-4-3 scheme

The LinkedListTest.TestLoadWhileOneServerDownAndVerify scenario became
flaky when running with --seconds_to_run set to about 800 or more
once the 3-4-3 replica management scheme became the default one.

Those cases (--seconds_to_run=X, X >= 800) are special because the
stopped replica falls behind the WAL segment GC threshold when running with
the linked list input data, so the system automatically replaces
the failed replica.  In case of the 3-4-3 scheme, the newly added
replica is added as a non-voter.  WaitForServersToAgree() looks
only at the OpId indices, not distinguishing between voter and
non-voter replicas.  However, the verification phase of the scenario
assumes the only replica left alive is a voter replica.

Prior to this fix, the scenario didn't account for the case when the
replica at the restarted tablet server was still a non-voter, and in
most cases the other 2 of the 3 tservers were shut down before the newly
added replica was promoted.  As a result, the latter replica was left
as a non-voter and the written data could not be read back from it.

This patch adds a step to verify that all 3 replicas are registered
as voters with the master(s) before shutting down the tservers hosting
the source 2 replicas.  The scenario is now stable when running with
--seconds_to_run=900.

Change-Id: I132206371e2935f1e0f39e9eacad866fde22c5b8
Reviewed-on: http://gerrit.cloudera.org:8080/9795
Tested-by: Alexey Serbin <as...@cloudera.com>
Reviewed-by: Mike Percy <mp...@apache.org>
---
M src/kudu/integration-tests/linked_list-test.cc
1 file changed, 31 insertions(+), 14 deletions(-)

Approvals:
  Alexey Serbin: Verified
  Mike Percy: Looks good to me, approved


Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I132206371e2935f1e0f39e9eacad866fde22c5b8
Gerrit-Change-Number: 9795
Gerrit-PatchSet: 4
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Jean-Daniel Cryans <jd...@apache.org>
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [linked list-test] fix flake with the 3-4-3 scheme

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Hello Mike Percy, Kudu Jenkins, Todd Lipcon, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/9795

to look at the new patch set (#2).

Change subject: [linked_list-test] fix flake with the 3-4-3 scheme
......................................................................

[linked_list-test] fix flake with the 3-4-3 scheme

For the LinkedListTest.TestLoadWhileOneServerDownAndVerify scenario,
the logic of the test didn't account for the case when the replica
at the restarted tablet server was still a non-voter before shutting
down the other two tablet servers at the verification phase.  This
patch fixes that: the scenario is now stable if running with the
3-4-3 replica management scheme.

Change-Id: I132206371e2935f1e0f39e9eacad866fde22c5b8
---
M src/kudu/integration-tests/linked_list-test.cc
1 file changed, 17 insertions(+), 14 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/95/9795/2

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I132206371e2935f1e0f39e9eacad866fde22c5b8
Gerrit-Change-Number: 9795
Gerrit-PatchSet: 2
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [linked list-test] fix flake with the 3-4-3 scheme

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has removed Kudu Jenkins from this change.  ( http://gerrit.cloudera.org:8080/9795 )

Change subject: [linked_list-test] fix flake with the 3-4-3 scheme
......................................................................


Removed reviewer Kudu Jenkins with the following votes:

* Verified-1 by Kudu Jenkins (120)

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: deleteReviewer
Gerrit-Change-Id: I132206371e2935f1e0f39e9eacad866fde22c5b8
Gerrit-Change-Number: 9795
Gerrit-PatchSet: 2
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [linked list-test] fix flake with the 3-4-3 scheme

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has removed Kudu Jenkins from this change.  ( http://gerrit.cloudera.org:8080/9795 )

Change subject: [linked_list-test] fix flake with the 3-4-3 scheme
......................................................................


Removed reviewer Kudu Jenkins with the following votes:

* Verified-1 by Kudu Jenkins (120)

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: deleteReviewer
Gerrit-Change-Id: I132206371e2935f1e0f39e9eacad866fde22c5b8
Gerrit-Change-Number: 9795
Gerrit-PatchSet: 3
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Jean-Daniel Cryans <jd...@apache.org>
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [linked list-test] fix flake with the 3-4-3 scheme

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/9795 )

Change subject: [linked_list-test] fix flake with the 3-4-3 scheme
......................................................................


Patch Set 3: Verified+1

Unrelated failures in the dist_test:

raft_consensus_election-itest.0:  gzip: stdout: Broken pipe
tablet_bootstrap-test.0:  gzip: stdout: Broken pipe

I think those are unrelated, so they are safe to ignore.



Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I132206371e2935f1e0f39e9eacad866fde22c5b8
Gerrit-Change-Number: 9795
Gerrit-PatchSet: 3
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Jean-Daniel Cryans <jd...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Mon, 26 Mar 2018 23:06:48 +0000
Gerrit-HasComments: No