You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Dinesh Bhat (Code Review)" <ge...@cloudera.org> on 2016/12/04 08:01:58 UTC

[kudu-CR] [consensus] KUDU-1407 replica is not evcited when TABLET NOT RUNNING

Hello David Ribeiro Alves, Mike Percy,

I'd like you to do a code review.  Please visit

    http://gerrit.cloudera.org:8080/5352

to review the following change.

Change subject: [consensus] KUDU-1407 replica is not evcited when TABLET_NOT_RUNNING
......................................................................

[consensus] KUDU-1407 replica is not evcited when TABLET_NOT_RUNNING

Fixes replica eviction failure where a follower returning
TABLET_NOT_RUNNING is not evicted from the consensus config.

Change-Id: I554ad61f25a7de78eda60fd50228e0f015c1b625
---
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/tserver/tablet_server-test-base.h
M src/kudu/tserver/tablet_server_test_util.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
8 files changed, 91 insertions(+), 31 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/52/5352/1
-- 
To view, visit http://gerrit.cloudera.org:8080/5352
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I554ad61f25a7de78eda60fd50228e0f015c1b625
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Mike Percy <mp...@apache.org>

[kudu-CR] [consensus] KUDU-1407 replica is not evcited when TABLET NOT RUNNING

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Hello Mike Percy, Adar Dembo, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5352

to look at the new patch set (#3).

Change subject: [consensus] KUDU-1407 replica is not evcited when TABLET_NOT_RUNNING
......................................................................

[consensus] KUDU-1407 replica is not evcited when TABLET_NOT_RUNNING

Fixes replica eviction failure where a follower returning
TABLET_NOT_RUNNING is not evicted from the consensus config.

Change-Id: I554ad61f25a7de78eda60fd50228e0f015c1b625
---
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/tserver/tablet_server-test-base.h
M src/kudu/tserver/tablet_server_test_util.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
8 files changed, 97 insertions(+), 34 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/52/5352/3
-- 
To view, visit http://gerrit.cloudera.org:8080/5352
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I554ad61f25a7de78eda60fd50228e0f015c1b625
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot

[kudu-CR] [consensus] KUDU-1407 replica is not evcited when TABLET NOT RUNNING

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change.

Change subject: [consensus] KUDU-1407 replica is not evcited when TABLET_NOT_RUNNING
......................................................................


Patch Set 3:

(3 comments)

The JIRA (KUDU-1407) talks about evicting a failed replica, but here you are tablet copying a failed replica, so having some discussion about the design of this in the commit message would help.

http://gerrit.cloudera.org:8080/#/c/5352/3//COMMIT_MSG
Commit Message:

PS3, Line 7: evcited
evicted


Line 9: Fixes replica eviction failure where a follower returning
Please provide a detailed description of how you are fixing this problem and what the different cases are


http://gerrit.cloudera.org:8080/#/c/5352/3/src/kudu/tserver/tablet_server_test_util.cc
File src/kudu/tserver/tablet_server_test_util.cc:

Line 34: const char* TabletServerTestBase::kTableId = "TestTable";
Shouldn't we put this in a new file tablet_server-test-base.cc ?


-- 
To view, visit http://gerrit.cloudera.org:8080/5352
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I554ad61f25a7de78eda60fd50228e0f015c1b625
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-HasComments: Yes

[kudu-CR] [consensus] KUDU-1407 replica is not evcited when TABLET NOT RUNNING

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Dinesh Bhat has posted comments on this change.

Change subject: [consensus] KUDU-1407 replica is not evcited when TABLET_NOT_RUNNING
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5352/3//COMMIT_MSG
Commit Message:

PS3, Line 7: evcited
> haven't looked at this code yet, but I can't tell from the commit message w
Todd, yeah this fix aimed to evict the replicas returning TABLET_NOT_RUNNING. I had looked at your comments on the JIRA before attempting a fix on this. I may have misunderstood the bug little bit here. My understanding was that tablet server would return TABLET_NOT_RUNNING error once after tablet has reached a "steady" state of FAILED or some other error state. It appears like we return TABLET_NOT_RUNNING when tablet is anything other than RUNNING, so bootstrapping would fall under this too. We definitely don't want to evict when tablet is bootstrapping (until we hit a timeout ? ).

As a solution, a) would it be better to introduce another response code here ? something like TABLET_BOOTSTRAPPING which is an indication for consensus to not to evict such replicas from config until some timeout, and treat TABLET_NOT_RUNNING as a fatal error ? 
b) Or else we could treat TABLET_NOT_RUNNING as a transient error code and after 300 secs (sufficient window to copy large tablets given the improvements in upcoming tablet copy workflows ?) we could evict the replica from config.


-- 
To view, visit http://gerrit.cloudera.org:8080/5352
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I554ad61f25a7de78eda60fd50228e0f015c1b625
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [consensus] KUDU-1407 replica is not evcited when TABLET NOT RUNNING

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Hello David Ribeiro Alves, Mike Percy,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5352

to look at the new patch set (#2).

Change subject: [consensus] KUDU-1407 replica is not evcited when TABLET_NOT_RUNNING
......................................................................

[consensus] KUDU-1407 replica is not evcited when TABLET_NOT_RUNNING

Fixes replica eviction failure where a follower returning
TABLET_NOT_RUNNING is not evicted from the consensus config.

Change-Id: I554ad61f25a7de78eda60fd50228e0f015c1b625
---
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/tserver/tablet_server-test-base.h
M src/kudu/tserver/tablet_server_test_util.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
8 files changed, 95 insertions(+), 33 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/52/5352/2
-- 
To view, visit http://gerrit.cloudera.org:8080/5352
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I554ad61f25a7de78eda60fd50228e0f015c1b625
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot

[kudu-CR] [consensus] KUDU-1407 replica is not evcited when TABLET NOT RUNNING

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change.

Change subject: [consensus] KUDU-1407 replica is not evcited when TABLET_NOT_RUNNING
......................................................................


Patch Set 3:

(4 comments)

Again, will defer to Mike and David.

http://gerrit.cloudera.org:8080/#/c/5352/3/src/kudu/consensus/consensus_queue-test.cc
File src/kudu/consensus/consensus_queue-test.cc:

Line 818: TEST_F(ConsensusQueueTest, TestTriggerTabletCopyIfTabletNotRunning) {
This looks like it's exactly the same test as the one below, but with a different code. Could you consolidate all of the shared code? If this means running both "tests" in a single TEST_F(), that's fine too.


http://gerrit.cloudera.org:8080/#/c/5352/3/src/kudu/integration-tests/raft_consensus-itest.cc
File src/kudu/integration-tests/raft_consensus-itest.cc:

Line 2772:   // A new good copy should get created because follower returns
Ah, this addresses one of my comments from your earlier patch. It looks like this hunk belonged in that patch. Can you move it?


http://gerrit.cloudera.org:8080/#/c/5352/3/src/kudu/tserver/ts_tablet_manager.cc
File src/kudu/tserver/ts_tablet_manager.cc:

Line 574: Status TSTabletManager::StartTabletStateTransitionUnlocked(
These changes affect the semantics of DeleteTablet() and StartTabletCopy() somewhat, right? AFAICT it's now possible for DeleteTablet() to return ALREADY_INPROGRESS, and for StartTabletCopy() to return TABLET_NOT_RUNNING.

I don't think that's necessarily wrong, I just want to make sure you've thought through the side effects of that. How will the RPC callers of DeleteTablet() and StartTabletCopy() react to those errors?


http://gerrit.cloudera.org:8080/#/c/5352/3/src/kudu/tserver/ts_tablet_manager.h
File src/kudu/tserver/ts_tablet_manager.h:

PS3, Line 207: replica
Nit: the rest of this comment refers to "tablet" when describing a "tablet replica", so please do the same for the new section you added.


-- 
To view, visit http://gerrit.cloudera.org:8080/5352
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I554ad61f25a7de78eda60fd50228e0f015c1b625
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-HasComments: Yes

[kudu-CR] [consensus] KUDU-1407 replica is not evcited when TABLET NOT RUNNING

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: [consensus] KUDU-1407 replica is not evcited when TABLET_NOT_RUNNING
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5352/3//COMMIT_MSG
Commit Message:

PS3, Line 7: evcited
> evicted
haven't looked at this code yet, but I can't tell from the commit message what the effect of this change is.

What's the change being made? You're fixing a bug such that now tablets _will_ be evicted for TABLET_NOT_RUNNING? Or making it so that they won't be?

I think not evicting is the correct behavior for the case when a tablet is in BOOTSTRAPPING state. Otherwise restarting a node (as in rolling restart) would cause all of its replicas to be evicted and cause big issues.


-- 
To view, visit http://gerrit.cloudera.org:8080/5352
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I554ad61f25a7de78eda60fd50228e0f015c1b625
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes