You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Will Berkeley (Code Review)" <ge...@cloudera.org> on 2017/07/28 15:23:22 UTC

[kudu-CR] [tools] Address potential flakiness of TestMoveTablet

Will Berkeley has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/7533

Change subject: [tools] Address potential flakiness of TestMoveTablet
......................................................................

[tools] Address potential flakiness of TestMoveTablet

When, by chance, the leader is the replica that is going to be
removed, the tool asks the leader to step down until a different
replica is elected leader, because it's not allowed to remove a
leader from a configuration. Since graceful leadership change
isn't implemented, it's possible for the original leader to be
reelected, over and over, until the test times out.

This patch bumps the wait-for-a-new-leader timeout from 10s to
30s to make this bad outcome less likely.

Change-Id: Ic3aa5d816b403818f69baa71cf40b35b82ff9096
---
M src/kudu/tools/tool_action_tablet.cc
1 file changed, 6 insertions(+), 6 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/33/7533/1
-- 
To view, visit http://gerrit.cloudera.org:8080/7533
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Ic3aa5d816b403818f69baa71cf40b35b82ff9096
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Will Berkeley <wd...@gmail.com>

[kudu-CR] [tools] Address potential flakiness of TestMoveTablet

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has submitted this change and it was merged.

Change subject: [tools] Address potential flakiness of TestMoveTablet
......................................................................


[tools] Address potential flakiness of TestMoveTablet

There's a few potential sources of flakiness in TestMoveTablet
and in the move tablet tool itself:
1. Leader step down may elect the same leader over and over,
preventing it from being removed from the config
2. With ongoing writes, the replicas may not all agree even though
they are all making progress
3. Sometimes leader changes, etc time out

This patch addresses these problems:
1. After step down, the former leader will snooze its failure
detector for 2x a normal election timeout
2. The tool waits for the new replica to see the new config
instead of for all servers to agree (since that's what we were
really waiting for anyway)
3. Timeouts have been bumped up to 30s, which is the norm for
similar tests.

Change-Id: Ic3aa5d816b403818f69baa71cf40b35b82ff9096
Reviewed-on: http://gerrit.cloudera.org:8080/7533
Tested-by: Kudu Jenkins
Reviewed-by: Todd Lipcon <to...@apache.org>
---
M src/kudu/consensus/raft_consensus.cc
M src/kudu/tools/kudu-admin-test.cc
M src/kudu/tools/tool_action_tablet.cc
3 files changed, 14 insertions(+), 8 deletions(-)

Approvals:
  Todd Lipcon: Looks good to me, approved
  Kudu Jenkins: Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/7533
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: Ic3aa5d816b403818f69baa71cf40b35b82ff9096
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>

[kudu-CR] [tools] Address potential flakiness of TestMoveTablet

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: [tools] Address potential flakiness of TestMoveTablet
......................................................................


Patch Set 3:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/7533/3//COMMIT_MSG
Commit Message:

Line 20: 2. The tool waits for the new replica to see the new config instead of for all servers to agree (since that's what we were
nit: wrap?


http://gerrit.cloudera.org:8080/#/c/7533/3/src/kudu/consensus/raft_consensus.cc
File src/kudu/consensus/raft_consensus.cc:

Line 507:               "unable snooze failure detector after stepping down");
unable *to* snooze...


-- 
To view, visit http://gerrit.cloudera.org:8080/7533
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ic3aa5d816b403818f69baa71cf40b35b82ff9096
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [tools] Address potential flakiness of TestMoveTablet

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: [tools] Address potential flakiness of TestMoveTablet
......................................................................


Patch Set 4: Code-Review+2

k, agreed those failures don't look related (adar mentioned seeing the same in his loop run of raft_consensus-itest and david hit them recently too)

-- 
To view, visit http://gerrit.cloudera.org:8080/7533
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ic3aa5d816b403818f69baa71cf40b35b82ff9096
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-HasComments: No

[kudu-CR] [tools] Address potential flakiness of TestMoveTablet

Posted by "Will Berkeley (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7533

to look at the new patch set (#2).

Change subject: [tools] Address potential flakiness of TestMoveTablet
......................................................................

[tools] Address potential flakiness of TestMoveTablet

When, by chance, the leader is the replica that is going to be
removed, the tool asks the leader to step down until a different
replica is elected leader, because it's not allowed to remove a
leader from a configuration. Since graceful leadership change
isn't implemented, it's possible for the original leader to be
reelected, over and over, until the test times out.

This patch makes a leader that is stepping down snooze its
failure detector for an additional election timeout, so that it
should not become leader again if there are other replicas. This
patch also bumps the wait-for-a-new-leader timeout from 10s to
30s, which is helpful if the tablet is under high load.

Change-Id: Ic3aa5d816b403818f69baa71cf40b35b82ff9096
---
M src/kudu/consensus/raft_consensus.cc
M src/kudu/tools/tool_action_tablet.cc
2 files changed, 11 insertions(+), 6 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/33/7533/2
-- 
To view, visit http://gerrit.cloudera.org:8080/7533
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic3aa5d816b403818f69baa71cf40b35b82ff9096
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [tools] Address potential flakiness of TestMoveTablet

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change.

Change subject: [tools] Address potential flakiness of TestMoveTablet
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7533/1//COMMIT_MSG
Commit Message:

Line 14: reelected, over and over, until the test times out.
> do you think it's possible to make LeaderStepDown reset the election timer 
This sounds like a good idea to me. This may be the first production use of StepDown() ... I think it's otherwise only used in tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/7533
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ic3aa5d816b403818f69baa71cf40b35b82ff9096
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [tools] Address potential flakiness of TestMoveTablet

Posted by "Will Berkeley (Code Review)" <ge...@cloudera.org>.
Will Berkeley has posted comments on this change.

Change subject: [tools] Address potential flakiness of TestMoveTablet
......................................................................


Patch Set 3:

(2 comments)

> btw can you also loop raft_consensus-itest and tablet_copy-itest a
 > bit since they seem to be the current users of StepDown?

kudu-admin-test and tablet_copy-itest were clean. raft_consensus-itest had some failures but I don't think they were related as the failed tests don't use step down (http://dist-test.cloudera.org/job?job_id=wdberkeley.1501522547.13691)

http://gerrit.cloudera.org:8080/#/c/7533/3//COMMIT_MSG
Commit Message:

Line 20: 2. The tool waits for the new replica to see the new config instead of for all servers to agree (since that's what we were
> nit: wrap?
Done


http://gerrit.cloudera.org:8080/#/c/7533/3/src/kudu/consensus/raft_consensus.cc
File src/kudu/consensus/raft_consensus.cc:

Line 507:               "unable snooze failure detector after stepping down");
> unable *to* snooze...
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/7533
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ic3aa5d816b403818f69baa71cf40b35b82ff9096
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-HasComments: Yes

[kudu-CR] [tools] Address potential flakiness of TestMoveTablet

Posted by "Will Berkeley (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7533

to look at the new patch set (#3).

Change subject: [tools] Address potential flakiness of TestMoveTablet
......................................................................

[tools] Address potential flakiness of TestMoveTablet

There's a few potential sources of flakiness in TestMoveTablet
and in the move tablet tool itself:
1. Leader step down may elect the same leader over and over,
preventing it from being removed from the config
2. With ongoing writes, the replicas may not all agree even though
they are all making progress
3. Sometimes leader changes, etc time out

This patch addresses these problems:
1. After step down, the former leader will snooze its failure
detector for 2x a normal election timeout
2. The tool waits for the new replica to see the new config instead of for all servers to agree (since that's what we were
really waiting for anyway)
3. Timeouts have been bumped up to 30s, which is the norm for
similar tests.

Change-Id: Ic3aa5d816b403818f69baa71cf40b35b82ff9096
---
M src/kudu/consensus/raft_consensus.cc
M src/kudu/tools/kudu-admin-test.cc
M src/kudu/tools/tool_action_tablet.cc
3 files changed, 14 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/33/7533/3
-- 
To view, visit http://gerrit.cloudera.org:8080/7533
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic3aa5d816b403818f69baa71cf40b35b82ff9096
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [tools] Address potential flakiness of TestMoveTablet

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: [tools] Address potential flakiness of TestMoveTablet
......................................................................


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/7533/1//COMMIT_MSG
Commit Message:

Line 14: reelected, over and over, until the test times out.
do you think it's possible to make LeaderStepDown reset the election timer to be twice the normal period or something? perhaps that would make it significantly more likely that it doesn't just get re-elected?


Line 17: 30s to make this bad outcome less likely.
did you loop the test after the change?


-- 
To view, visit http://gerrit.cloudera.org:8080/7533
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ic3aa5d816b403818f69baa71cf40b35b82ff9096
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [tools] Address potential flakiness of TestMoveTablet

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: [tools] Address potential flakiness of TestMoveTablet
......................................................................


Patch Set 3:

btw can you also loop raft_consensus-itest and tablet_copy-itest a bit since they seem to be the current users of StepDown?

-- 
To view, visit http://gerrit.cloudera.org:8080/7533
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ic3aa5d816b403818f69baa71cf40b35b82ff9096
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] [tools] Address potential flakiness of TestMoveTablet

Posted by "Will Berkeley (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7533

to look at the new patch set (#4).

Change subject: [tools] Address potential flakiness of TestMoveTablet
......................................................................

[tools] Address potential flakiness of TestMoveTablet

There's a few potential sources of flakiness in TestMoveTablet
and in the move tablet tool itself:
1. Leader step down may elect the same leader over and over,
preventing it from being removed from the config
2. With ongoing writes, the replicas may not all agree even though
they are all making progress
3. Sometimes leader changes, etc time out

This patch addresses these problems:
1. After step down, the former leader will snooze its failure
detector for 2x a normal election timeout
2. The tool waits for the new replica to see the new config
instead of for all servers to agree (since that's what we were
really waiting for anyway)
3. Timeouts have been bumped up to 30s, which is the norm for
similar tests.

Change-Id: Ic3aa5d816b403818f69baa71cf40b35b82ff9096
---
M src/kudu/consensus/raft_consensus.cc
M src/kudu/tools/kudu-admin-test.cc
M src/kudu/tools/tool_action_tablet.cc
3 files changed, 14 insertions(+), 8 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/33/7533/4
-- 
To view, visit http://gerrit.cloudera.org:8080/7533
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic3aa5d816b403818f69baa71cf40b35b82ff9096
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Will Berkeley <wd...@gmail.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>