You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Bankim Bhavsar (Code Review)" <ge...@cloudera.org> on 2021/02/19 19:08:43 UTC

[kudu-CR] [test] Fix flakiness with detecting a master in FAILED UNRECOVERABLE state

Bankim Bhavsar has uploaded this change for review. ( http://gerrit.cloudera.org:8080/17089


Change subject: [test] Fix flakiness with detecting a master in FAILED_UNRECOVERABLE state
......................................................................

[test] Fix flakiness with detecting a master in FAILED_UNRECOVERABLE state

Observed one test failure on the flakiness dashboard in the
dynamic_multi_master test where system catalog WAL is expected to be
GC'ed and hence new master can't be caught up from WAL which should
result in the new master going to FAILED_UNRECOVERABLE state.
However the new master gets caught up from WAL and is in HEALTHY
state.

Expected: consensus::HealthReportPB::FAILED_UNRECOVERABLE
  Which is: 2
To be equal to: peer.health_report().overall_health()
  Which is: 1

This is because the check for system catalog WAL was made against
master index 0 which may not be the leader master and that's what
happened in case of the test failure.

The fix makes the GC count check against all masters as master
leadership could change.

This change also includes fixes where master index 0 was used
and it'd be good to explicitly use the leader/follower index
though they are not strictly necessary.

Change-Id: Id6017f1601eaed22be28c8a5babd6e35e93b1d2e
---
M src/kudu/master/dynamic_multi_master-test.cc
1 file changed, 43 insertions(+), 18 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/89/17089/1
-- 
To view, visit http://gerrit.cloudera.org:8080/17089
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: Id6017f1601eaed22be28c8a5babd6e35e93b1d2e
Gerrit-Change-Number: 17089
Gerrit-PatchSet: 1
Gerrit-Owner: Bankim Bhavsar <ba...@cloudera.com>

[kudu-CR] [test] Fix flakiness with detecting a master in FAILED UNRECOVERABLE state

Posted by "Bankim Bhavsar (Code Review)" <ge...@cloudera.org>.
Bankim Bhavsar has posted comments on this change. ( http://gerrit.cloudera.org:8080/17089 )

Change subject: [test] Fix flakiness with detecting a master in FAILED_UNRECOVERABLE state
......................................................................


Patch Set 1: Verified+1

Unrelated test failure in RebalancerAndSingleReplicaTablets.SingleReplicasStayOrMove/1


-- 
To view, visit http://gerrit.cloudera.org:8080/17089
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id6017f1601eaed22be28c8a5babd6e35e93b1d2e
Gerrit-Change-Number: 17089
Gerrit-PatchSet: 1
Gerrit-Owner: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Fri, 19 Feb 2021 19:51:09 +0000
Gerrit-HasComments: No

[kudu-CR] [test] Fix flakiness with detecting a master in FAILED UNRECOVERABLE state

Posted by "Bankim Bhavsar (Code Review)" <ge...@cloudera.org>.
Bankim Bhavsar has posted comments on this change. ( http://gerrit.cloudera.org:8080/17089 )

Change subject: [test] Fix flakiness with detecting a master in FAILED_UNRECOVERABLE state
......................................................................


Patch Set 1:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/17089/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17089/1//COMMIT_MSG@28
PS1, Line 28: This change also includes fixes where master index 0 was used
            : and it'd be good to explicitly use the leader/follower index
            : though they are not strictly necessary.
> I'm not sure I agree with this. Doesn't this reduce test coverage of runnin
See comments below.


http://gerrit.cloudera.org:8080/#/c/17089/1/src/kudu/master/dynamic_multi_master-test.cc
File src/kudu/master/dynamic_multi_master-test.cc:

http://gerrit.cloudera.org:8080/#/c/17089/1/src/kudu/master/dynamic_multi_master-test.cc@171
PS1, Line 171: vector<int64_t> updated_gc_count(orig_num_masters_);
             :     ASSERT_EQ(orig_gc_count.size(), updated_gc_count.size());
> nit: why do we need to keep track of these? Why not compare get_sys_catalog
Good catch, not needed.


http://gerrit.cloudera.org:8080/#/c/17089/1/src/kudu/master/dynamic_multi_master-test.cc@778
PS1, Line 778:   ASSERT_OK(cluster_->GetLeaderMasterIndex(&leader_idx));
> Do we need to disable leadership elections here? Is it actually critical to
It's not necessary to copy from the leader, copying from follower will be fine as well. Reverting.


http://gerrit.cloudera.org:8080/#/c/17089/1/src/kudu/master/dynamic_multi_master-test.cc@1171
PS1, Line 1171:   ASSERT_OK(GetFollowerMasterIndex(&non_leader_idx));
With the current order of validation checks issuing this request to leader master would also result in the same error message. However for some reason if the order of validation for leader master is done first then error would be different resulting in test failure.

> Do we need to disable leadership elections here?

Done.


http://gerrit.cloudera.org:8080/#/c/17089/1/src/kudu/master/dynamic_multi_master-test.cc@1172
PS1, Line 1172:   auto master_to_remove = cluster_->master(non_leader_idx)->bound_rpc_hostport();
> warning: 1st function call argument is an uninitialized value [clang-analyz
This warning is not consistent but will fix it.



-- 
To view, visit http://gerrit.cloudera.org:8080/17089
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id6017f1601eaed22be28c8a5babd6e35e93b1d2e
Gerrit-Change-Number: 17089
Gerrit-PatchSet: 1
Gerrit-Owner: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Fri, 19 Feb 2021 23:20:45 +0000
Gerrit-HasComments: Yes

[kudu-CR] [test] Fix flakiness with detecting a master in FAILED UNRECOVERABLE state

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/17089 )

Change subject: [test] Fix flakiness with detecting a master in FAILED_UNRECOVERABLE state
......................................................................


Patch Set 2:

BTW have you run the test using dist-test? Did you get a sense for the flakiness before and after this change?


-- 
To view, visit http://gerrit.cloudera.org:8080/17089
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id6017f1601eaed22be28c8a5babd6e35e93b1d2e
Gerrit-Change-Number: 17089
Gerrit-PatchSet: 2
Gerrit-Owner: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Mon, 22 Feb 2021 06:41:40 +0000
Gerrit-HasComments: No

[kudu-CR] [test] Fix flakiness with detecting a master in FAILED UNRECOVERABLE state

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/17089 )

Change subject: [test] Fix flakiness with detecting a master in FAILED_UNRECOVERABLE state
......................................................................


Patch Set 1:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/17089/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17089/1//COMMIT_MSG@28
PS1, Line 28: This change also includes fixes where master index 0 was used
            : and it'd be good to explicitly use the leader/follower index
            : though they are not strictly necessary.
I'm not sure I agree with this. Doesn't this reduce test coverage of running against leader vs non-leaders?


http://gerrit.cloudera.org:8080/#/c/17089/1/src/kudu/master/dynamic_multi_master-test.cc
File src/kudu/master/dynamic_multi_master-test.cc:

http://gerrit.cloudera.org:8080/#/c/17089/1/src/kudu/master/dynamic_multi_master-test.cc@171
PS1, Line 171: vector<int64_t> updated_gc_count(orig_num_masters_);
             :     ASSERT_EQ(orig_gc_count.size(), updated_gc_count.size());
nit: why do we need to keep track of these? Why not compare get_sys_catalog_wal_gc_count(master_idx) directly?


http://gerrit.cloudera.org:8080/#/c/17089/1/src/kudu/master/dynamic_multi_master-test.cc@778
PS1, Line 778:   ASSERT_OK(cluster_->GetLeaderMasterIndex(&leader_idx));
Do we need to disable leadership elections here? Is it actually critical to be copying from the leader?


http://gerrit.cloudera.org:8080/#/c/17089/1/src/kudu/master/dynamic_multi_master-test.cc@1171
PS1, Line 1171:   ASSERT_OK(GetFollowerMasterIndex(&non_leader_idx));
Do we need to disable leadership elections here? Will this fail differently if we actually try to remove a leader?



-- 
To view, visit http://gerrit.cloudera.org:8080/17089
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id6017f1601eaed22be28c8a5babd6e35e93b1d2e
Gerrit-Change-Number: 17089
Gerrit-PatchSet: 1
Gerrit-Owner: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Fri, 19 Feb 2021 20:15:22 +0000
Gerrit-HasComments: Yes

[kudu-CR] [test] Fix flakiness with detecting a master in FAILED UNRECOVERABLE state

Posted by "Bankim Bhavsar (Code Review)" <ge...@cloudera.org>.
Bankim Bhavsar has posted comments on this change. ( http://gerrit.cloudera.org:8080/17089 )

Change subject: [test] Fix flakiness with detecting a master in FAILED_UNRECOVERABLE state
......................................................................


Patch Set 2:

> Patch Set 2:
> 
> BTW have you run the test using dist-test? Did you get a sense for the flakiness before and after this change?

I haven't been able to repro this issue simply by running dist-test before and after the fix.

I see same test failure today on the flakiness dashboard and it's due to the same reason.
http://dist-test.cloudera.org:8080/diagnose?key=c98bbfdc-74bb-11eb-b6d2-0242ac110002


-- 
To view, visit http://gerrit.cloudera.org:8080/17089
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id6017f1601eaed22be28c8a5babd6e35e93b1d2e
Gerrit-Change-Number: 17089
Gerrit-PatchSet: 2
Gerrit-Owner: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Mon, 22 Feb 2021 22:03:08 +0000
Gerrit-HasComments: No

[kudu-CR] [test] Fix flakiness with detecting a master in FAILED UNRECOVERABLE state

Posted by "Bankim Bhavsar (Code Review)" <ge...@cloudera.org>.
Bankim Bhavsar has removed a vote on this change.

Change subject: [test] Fix flakiness with detecting a master in FAILED_UNRECOVERABLE state
......................................................................


Removed Verified-1 by Kudu Jenkins (120)
-- 
To view, visit http://gerrit.cloudera.org:8080/17089
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: deleteVote
Gerrit-Change-Id: Id6017f1601eaed22be28c8a5babd6e35e93b1d2e
Gerrit-Change-Number: 17089
Gerrit-PatchSet: 1
Gerrit-Owner: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)

[kudu-CR] [test] Fix flakiness with detecting a master in FAILED UNRECOVERABLE state

Posted by "Bankim Bhavsar (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Kudu Jenkins, Andrew Wong, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/17089

to look at the new patch set (#2).

Change subject: [test] Fix flakiness with detecting a master in FAILED_UNRECOVERABLE state
......................................................................

[test] Fix flakiness with detecting a master in FAILED_UNRECOVERABLE state

Observed one test failure on the flakiness dashboard in the
dynamic_multi_master test where system catalog WAL is expected to be
GC'ed and hence new master can't be caught up from WAL which should
result in the new master going to FAILED_UNRECOVERABLE state.
However the new master gets caught up from WAL and is in HEALTHY
state.

Expected: consensus::HealthReportPB::FAILED_UNRECOVERABLE
  Which is: 2
To be equal to: peer.health_report().overall_health()
  Which is: 1

This is because the check for system catalog WAL was made against
master index 0 which may not be the leader master and that's what
happened in case of the test failure.

The fix makes the GC count check against all masters as master
leadership could change.

This change also includes another place where master index 0 was used
and it'd be good to explicitly use the leader index to future-proof
it in case order of validation is changed.

Change-Id: Id6017f1601eaed22be28c8a5babd6e35e93b1d2e
---
M src/kudu/master/dynamic_multi_master-test.cc
1 file changed, 40 insertions(+), 17 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/89/17089/2
-- 
To view, visit http://gerrit.cloudera.org:8080/17089
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id6017f1601eaed22be28c8a5babd6e35e93b1d2e
Gerrit-Change-Number: 17089
Gerrit-PatchSet: 2
Gerrit-Owner: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)

[kudu-CR] [test] Fix flakiness with detecting a master in FAILED UNRECOVERABLE state

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/17089 )

Change subject: [test] Fix flakiness with detecting a master in FAILED_UNRECOVERABLE state
......................................................................


Patch Set 2: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/17089
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Id6017f1601eaed22be28c8a5babd6e35e93b1d2e
Gerrit-Change-Number: 17089
Gerrit-PatchSet: 2
Gerrit-Owner: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Mon, 22 Feb 2021 06:41:03 +0000
Gerrit-HasComments: No

[kudu-CR] [test] Fix flakiness with detecting a master in FAILED UNRECOVERABLE state

Posted by "Bankim Bhavsar (Code Review)" <ge...@cloudera.org>.
Bankim Bhavsar has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/17089 )

Change subject: [test] Fix flakiness with detecting a master in FAILED_UNRECOVERABLE state
......................................................................

[test] Fix flakiness with detecting a master in FAILED_UNRECOVERABLE state

Observed one test failure on the flakiness dashboard in the
dynamic_multi_master test where system catalog WAL is expected to be
GC'ed and hence new master can't be caught up from WAL which should
result in the new master going to FAILED_UNRECOVERABLE state.
However the new master gets caught up from WAL and is in HEALTHY
state.

Expected: consensus::HealthReportPB::FAILED_UNRECOVERABLE
  Which is: 2
To be equal to: peer.health_report().overall_health()
  Which is: 1

This is because the check for system catalog WAL was made against
master index 0 which may not be the leader master and that's what
happened in case of the test failure.

The fix makes the GC count check against all masters as master
leadership could change.

This change also includes another place where master index 0 was used
and it'd be good to explicitly use the leader index to future-proof
it in case order of validation is changed.

Change-Id: Id6017f1601eaed22be28c8a5babd6e35e93b1d2e
Reviewed-on: http://gerrit.cloudera.org:8080/17089
Tested-by: Kudu Jenkins
Reviewed-by: Andrew Wong <aw...@cloudera.com>
---
M src/kudu/master/dynamic_multi_master-test.cc
1 file changed, 40 insertions(+), 17 deletions(-)

Approvals:
  Kudu Jenkins: Verified
  Andrew Wong: Looks good to me, approved

-- 
To view, visit http://gerrit.cloudera.org:8080/17089
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Id6017f1601eaed22be28c8a5babd6e35e93b1d2e
Gerrit-Change-Number: 17089
Gerrit-PatchSet: 3
Gerrit-Owner: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)