You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Todd Lipcon (Code Review)" <ge...@cloudera.org> on 2017/08/24 20:26:20 UTC

[kudu-CR] Fix flakiness of kudu-admin-test TestMoveTablet

Hello Adar Dembo, Will Berkeley,

I'd like you to do a code review.  Please visit

    http://gerrit.cloudera.org:8080/7808

to review the following change.

Change subject: Fix flakiness of kudu-admin-test TestMoveTablet
......................................................................

Fix flakiness of kudu-admin-test TestMoveTablet

This fixes an issue in the admin tool for moving a tablet that would occur if
the tablet changed terms multiple times before settling during an election.

Previously, we would wait for a term change, and as soon as we saw any term
change, we'd wait for an operation in exactly that term to be committed. There
were two issues with the code:

1) It used cstate.has_leader_uuid() to check if there was a leader in the new
term. However, the server side actually sets this field to an empty string
rather than leaving it unset if there is no leader. So this check always
evaluated to true, meaning that we would proceed to looking for a committed op
in that term as soon as the term advanced, rather than waiting for an actual
leader to be elected.

2) In the case that somehow the leader changed a second time, the term could
be increased again while we are waiting for a committed operation. In that case
we'd be waiting for a committed op in exactly the term of the first leader
change we saw, rather than potentially refreshing our notion of the "current
term".

The patch fixes both issues by restructuring the loop a bit.

I additionally improved some logging in the consensus service and
implementation which I found helpful while debugging the issue.

This test became significantly more flaky (~20% in ASAN) after the commit
of 21b0f3d5e255760535e281efe5879fe657df1f1c which adjusted the election
timeout behavior. Apparently the new election behavior made it more likely
for the elections to have conflicts before getting a steady leader, which
exposed the above bug in the tool.

With the patch, I was able to run 500/500 successful in an ASAN build (vs
20% fail rate before).

Change-Id: I475d4a44c52f2da7fc42c93e9cf2f38e01735177
---
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/tools/tool_action_tablet.cc
M src/kudu/tserver/tablet_service.cc
4 files changed, 54 insertions(+), 48 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/08/7808/1
-- 
To view, visit http://gerrit.cloudera.org:8080/7808
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I475d4a44c52f2da7fc42c93e9cf2f38e01735177
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>

[kudu-CR] Fix flakiness of kudu-admin-test TestMoveTablet

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change.

Change subject: Fix flakiness of kudu-admin-test TestMoveTablet
......................................................................


Patch Set 2:

> Going to ignore IWYU since I dont think I touched anything that
 > should affect includes.

You touched the raft_consensus.cc file, which was not IWYU compliant due to prior updates.  I fixed that in f9a6a1b2d15feb8e2c044f20248774105a3a62bf.

-- 
To view, visit http://gerrit.cloudera.org:8080/7808
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I475d4a44c52f2da7fc42c93e9cf2f38e01735177
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-HasComments: No

[kudu-CR] Fix flakiness of kudu-admin-test TestMoveTablet

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has submitted this change and it was merged.

Change subject: Fix flakiness of kudu-admin-test TestMoveTablet
......................................................................


Fix flakiness of kudu-admin-test TestMoveTablet

This fixes an issue in the admin tool for moving a tablet that would occur if
the tablet changed terms multiple times before settling during an election.

Previously, we would wait for a term change, and as soon as we saw any term
change, we'd wait for an operation in exactly that term to be committed. There
were two issues with the code:

1) It used cstate.has_leader_uuid() to check if there was a leader in the new
term. However, the server side actually sets this field to an empty string
rather than leaving it unset if there is no leader. So this check always
evaluated to true, meaning that we would proceed to looking for a committed op
in that term as soon as the term advanced, rather than waiting for an actual
leader to be elected.

2) In the case that somehow the leader changed a second time, the term could
be increased again while we are waiting for a committed operation. In that case
we'd be waiting for a committed op in exactly the term of the first leader
change we saw, rather than potentially refreshing our notion of the "current
term".

The patch fixes both issues by restructuring the loop a bit.

I additionally improved some logging in the consensus service and
implementation which I found helpful while debugging the issue.

This test became significantly more flaky (~20% in ASAN) after the commit
of 21b0f3d5e255760535e281efe5879fe657df1f1c which adjusted the election
timeout behavior. Apparently the new election behavior made it more likely
for the elections to have conflicts before getting a steady leader, which
exposed the above bug in the tool.

With the patch, I was able to run 500/500 successful in an ASAN build (vs
20% fail rate before).

Change-Id: I475d4a44c52f2da7fc42c93e9cf2f38e01735177
Reviewed-on: http://gerrit.cloudera.org:8080/7808
Reviewed-by: Adar Dembo <ad...@cloudera.com>
Tested-by: Todd Lipcon <to...@apache.org>
---
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/tools/tool_action_tablet.cc
M src/kudu/tserver/tablet_service.cc
4 files changed, 54 insertions(+), 48 deletions(-)

Approvals:
  Adar Dembo: Looks good to me, approved
  Todd Lipcon: Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/7808
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I475d4a44c52f2da7fc42c93e9cf2f38e01735177
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>

[kudu-CR] Fix flakiness of kudu-admin-test TestMoveTablet

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: Fix flakiness of kudu-admin-test TestMoveTablet
......................................................................


Patch Set 1: Verified+1

Going to ignore IWYU since I dont think I touched anything that should affect includes.

-- 
To view, visit http://gerrit.cloudera.org:8080/7808
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I475d4a44c52f2da7fc42c93e9cf2f38e01735177
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-HasComments: No

[kudu-CR] Fix flakiness of kudu-admin-test TestMoveTablet

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Hello Adar Dembo, Will Berkeley,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7808

to look at the new patch set (#2).

Change subject: Fix flakiness of kudu-admin-test TestMoveTablet
......................................................................

Fix flakiness of kudu-admin-test TestMoveTablet

This fixes an issue in the admin tool for moving a tablet that would occur if
the tablet changed terms multiple times before settling during an election.

Previously, we would wait for a term change, and as soon as we saw any term
change, we'd wait for an operation in exactly that term to be committed. There
were two issues with the code:

1) It used cstate.has_leader_uuid() to check if there was a leader in the new
term. However, the server side actually sets this field to an empty string
rather than leaving it unset if there is no leader. So this check always
evaluated to true, meaning that we would proceed to looking for a committed op
in that term as soon as the term advanced, rather than waiting for an actual
leader to be elected.

2) In the case that somehow the leader changed a second time, the term could
be increased again while we are waiting for a committed operation. In that case
we'd be waiting for a committed op in exactly the term of the first leader
change we saw, rather than potentially refreshing our notion of the "current
term".

The patch fixes both issues by restructuring the loop a bit.

I additionally improved some logging in the consensus service and
implementation which I found helpful while debugging the issue.

This test became significantly more flaky (~20% in ASAN) after the commit
of 21b0f3d5e255760535e281efe5879fe657df1f1c which adjusted the election
timeout behavior. Apparently the new election behavior made it more likely
for the elections to have conflicts before getting a steady leader, which
exposed the above bug in the tool.

With the patch, I was able to run 500/500 successful in an ASAN build (vs
20% fail rate before).

Change-Id: I475d4a44c52f2da7fc42c93e9cf2f38e01735177
---
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/tools/tool_action_tablet.cc
M src/kudu/tserver/tablet_service.cc
4 files changed, 54 insertions(+), 48 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/08/7808/2
-- 
To view, visit http://gerrit.cloudera.org:8080/7808
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I475d4a44c52f2da7fc42c93e9cf2f38e01735177
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>

[kudu-CR] Fix flakiness of kudu-admin-test TestMoveTablet

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change.

Change subject: Fix flakiness of kudu-admin-test TestMoveTablet
......................................................................


Patch Set 2: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/7808
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I475d4a44c52f2da7fc42c93e9cf2f38e01735177
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-HasComments: No

[kudu-CR] Fix flakiness of kudu-admin-test TestMoveTablet

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: Fix flakiness of kudu-admin-test TestMoveTablet
......................................................................


Patch Set 2: Verified+1

Unrelated spark test flakiness.

-- 
To view, visit http://gerrit.cloudera.org:8080/7808
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I475d4a44c52f2da7fc42c93e9cf2f38e01735177
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-HasComments: No

[kudu-CR] Fix flakiness of kudu-admin-test TestMoveTablet

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: Fix flakiness of kudu-admin-test TestMoveTablet
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7808/1//COMMIT_MSG
Commit Message:

PS1, Line 36: Apparently the new election behavior made it more likely
            : for the elections to have conflicts before getting a steady leader
> That seems like a problem too. From your debugging were you able to pinpoin
no, wasn't entirely clear. I figured it was the timing changes you made in your patch (bisect clearly pointed to the above-referenced commit as where it got worse).


-- 
To view, visit http://gerrit.cloudera.org:8080/7808
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I475d4a44c52f2da7fc42c93e9cf2f38e01735177
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-HasComments: Yes

[kudu-CR] Fix flakiness of kudu-admin-test TestMoveTablet

Posted by "Will Berkeley (Code Review)" <ge...@cloudera.org>.
Will Berkeley has posted comments on this change.

Change subject: Fix flakiness of kudu-admin-test TestMoveTablet
......................................................................


Patch Set 1: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/7808
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I475d4a44c52f2da7fc42c93e9cf2f38e01735177
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-HasComments: No

[kudu-CR] Fix flakiness of kudu-admin-test TestMoveTablet

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change.

Change subject: Fix flakiness of kudu-admin-test TestMoveTablet
......................................................................


Patch Set 1: Code-Review+1

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7808/1//COMMIT_MSG
Commit Message:

PS1, Line 36: Apparently the new election behavior made it more likely
            : for the elections to have conflicts before getting a steady leader
That seems like a problem too. From your debugging were you able to pinpoint why?


-- 
To view, visit http://gerrit.cloudera.org:8080/7808
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I475d4a44c52f2da7fc42c93e9cf2f38e01735177
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-HasComments: Yes