You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Dinesh Bhat (Code Review)" <ge...@cloudera.org> on 2016/11/16 21:19:24 UTC

[kudu-CR] [consensus] KUDU-1718: Fix few bugs around replica eviction failures

Hello David Ribeiro Alves, Adar Dembo, Todd Lipcon,

I'd like you to do a code review.  Please visit

    http://gerrit.cloudera.org:8080/5111

to review the following change.

Change subject: [consensus] KUDU-1718: Fix few bugs around replica eviction failures
......................................................................

[consensus] KUDU-1718: Fix few bugs around replica eviction failures

KUDU-1613: Tablet server returning WRONG_SERVER_UUID perhaps
due to a disks-wiped-out situation and the server was never evicted
from the consensus config.
KUDU-1407: Tablet under TABLET_NOT_RUNNING state due to a large
tablet eventually failed to bootstrap should be evicted and replicated
either on same node or somewhere else.
KUDU-1608: Catalog manager retries DeleteTablet RPC forever upon
replica eviction in some of the above scenarios. We can treat some of
these errors as fatal, and stop retrying on them.

Added bunch of tests to repro the bugs and verify these fixes.

Change-Id: I0d3f84fd297a8e4760208a213c1ee393e92499a3
---
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/integration-tests/mini_cluster.cc
M src/kudu/integration-tests/mini_cluster.h
M src/kudu/master/catalog_manager.cc
M src/kudu/tools/kudu-tool-test.cc
M src/kudu/tserver/mini_tablet_server.cc
M src/kudu/tserver/mini_tablet_server.h
M src/kudu/tserver/tablet_server-test-base.h
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
12 files changed, 295 insertions(+), 36 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/11/5111/1
-- 
To view, visit http://gerrit.cloudera.org:8080/5111
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I0d3f84fd297a8e4760208a213c1ee393e92499a3
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [consensus] KUDU-1613: Fix replica eviction failure for WRONG SERVER UUID

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change.

Change subject: [consensus] KUDU-1613: Fix replica eviction failure for WRONG_SERVER_UUID
......................................................................


Patch Set 5:

(5 comments)

I'm deferring to Mike/David since I'm less familiar with this code.

http://gerrit.cloudera.org:8080/#/c/5111/5/src/kudu/consensus/consensus_peers.cc
File src/kudu/consensus/consensus_peers.cc:

Line 238:   switch (ps.response) {
Maybe add a default case that calls LOG(FATAL) on the (unknown) response type?


http://gerrit.cloudera.org:8080/#/c/5111/5/src/kudu/consensus/consensus_peers.h
File src/kudu/consensus/consensus_peers.h:

PS5, Line 140: respond with appropriate response status.
Nit: technically, this method does not "respond" (in the sense of sending an RPC response).

Perhaps "and return the appropriate response status"?


http://gerrit.cloudera.org:8080/#/c/5111/5/src/kudu/integration-tests/raft_consensus-itest.cc
File src/kudu/integration-tests/raft_consensus-itest.cc:

Line 2770
Why was this comment removed? If it's no longer true, shouldn't the entire DeleteTablet() call be thrown out too?


PS5, Line 2782: // These tests exhibit the replica eviction behaviors when a follower
              : // returns WRONG_SERVER_UUID error code. Tests verify that followers
              : // returning these error codes are evicted from consensus after a
              : // specified --follower_unavailable_considered_failed_sec.
Nit: this is talking about tests in the plural, but it's just a single test, right?


Line 2795:   int leader_indx = -1;
Nit: conventionally, we use 'idx' as an infix for naming an index, not 'indx'.


-- 
To view, visit http://gerrit.cloudera.org:8080/5111
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0d3f84fd297a8e4760208a213c1ee393e92499a3
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [consensus] KUDU-1613: Fix replica eviction failure for WRONG SERVER UUID

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Hello Mike Percy, Adar Dembo, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5111

to look at the new patch set (#7).

Change subject: [consensus] KUDU-1613: Fix replica eviction failure for WRONG_SERVER_UUID
......................................................................

[consensus] KUDU-1613: Fix replica eviction failure for WRONG_SERVER_UUID

Fixes replica eviction from leader where a follower returning
WRONG_SERVER_UUID (perhaps due to a disks-removed situation) is never
evicted from the consensus config.
If a tablet server changes its uuid, it starts responding with
Status::InvalidArgument for most of the RPCs along with the error
code WRONG_SERVER_UUID in the response. Raft consensus can treat this
error code as fatal since this means that all the replicas on that
node has become unavailable and that situation is irrecoverable.
Consensus should evict such nodes from the tablet config and try
to re-replicate if needed. Same tablet server carrying a new uuid
could be one of the candidates considered for re-replication.

Change-Id: I0d3f84fd297a8e4760208a213c1ee393e92499a3
---
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_peers.h
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
5 files changed, 191 insertions(+), 46 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/11/5111/7
-- 
To view, visit http://gerrit.cloudera.org:8080/5111
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0d3f84fd297a8e4760208a213c1ee393e92499a3
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [consensus] KUDU-1613: Fix replica eviction failure for WRONG SERVER UUID

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Hello Mike Percy, Adar Dembo, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5111

to look at the new patch set (#6).

Change subject: [consensus] KUDU-1613: Fix replica eviction failure for WRONG_SERVER_UUID
......................................................................

[consensus] KUDU-1613: Fix replica eviction failure for WRONG_SERVER_UUID

Fixes replica eviction from leader where a follower returning
WRONG_SERVER_UUID (perhaps due to a disks-removed situation) is never
evicted from the consensus config.
If a tablet server changes its uuid, it starts responding with
Status::InvalidArgument for most of the RPCs along with the error
code WRONG_SERVER_UUID in the response. Raft consensus can treat this
error code as fatal since this means that all the replicas on that
node has become unavailable and that situation is irrecoverable.
Consensus should evict such nodes from the tablet config and try
to re-replicate if needed. Same tablet server carrying a new uuid
could be one of the candidates considered for re-replication.

Change-Id: I0d3f84fd297a8e4760208a213c1ee393e92499a3
---
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_peers.h
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
5 files changed, 189 insertions(+), 47 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/11/5111/6
-- 
To view, visit http://gerrit.cloudera.org:8080/5111
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0d3f84fd297a8e4760208a213c1ee393e92499a3
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [consensus] KUDU-1613: Fix replica eviction failure for WRONG SERVER UUID

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Dinesh Bhat has posted comments on this change.

Change subject: [consensus] KUDU-1613: Fix replica eviction failure for WRONG_SERVER_UUID
......................................................................


Patch Set 7:

(6 comments)

TFTR Mike, I will be adding one more test which changes just the UUID keeping the tablets intact (which means all tablets are reported behind a new server UUID but behind same RPC endpoints) after I figure out a anomaly with non_participant I observed in the new test.

http://gerrit.cloudera.org:8080/#/c/5111/7/src/kudu/consensus/consensus_peers.cc
File src/kudu/consensus/consensus_peers.cc:

Line 305:     if (response_.error().code() != TabletServerErrorPB::WRONG_SERVER_UUID &&
> I think we should consider WRONG_SERVER_UUID as unresponsive (ERROR) instea
Done, please note that as per our offline discussions last week, the tablet copy is driven due to this error code deemed as tolerable until heartbeat timeout kicks out the replica and re-replication is triggered at taht point.


http://gerrit.cloudera.org:8080/#/c/5111/7/src/kudu/consensus/consensus_queue-test.cc
File src/kudu/consensus/consensus_queue-test.cc:

Line 816: // Verify that tablet copy is triggered when peer responds with
> Is this test still supposed to be here?
Not after we removed the WRONG_SERVER_UUID from queue->ResponseFromPeer(). Updated.


http://gerrit.cloudera.org:8080/#/c/5111/7/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

Line 597:       peer->needs_tablet_copy = true;
> Why this for WRONG_SERVER_UUID?
I suppose this isn't needed if we are driving the tablet-copy via replica eviction and then re-replication, right ?


http://gerrit.cloudera.org:8080/#/c/5111/7/src/kudu/integration-tests/raft_consensus-itest.cc
File src/kudu/integration-tests/raft_consensus-itest.cc:

Line 2789:   MonoDelta kTimeout = MonoDelta::FromSeconds(30);
> nit: const
Done


Line 2795:   int leader_idx = -1;
> nit: This is written in a C-like way. It would be a bit more readable to de
Done


Line 2816: 
> Please add a comment in here to explain what's going on:
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/5111
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0d3f84fd297a8e4760208a213c1ee393e92499a3
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [consensus] KUDU-1718: Fix few bugs around replica eviction failures

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: [consensus] KUDU-1718: Fix few bugs around replica eviction failures
......................................................................


Patch Set 1:

(1 comment)

would it be possible to split this patch into several ones that take care of each ticker individually or would that be too hard?

http://gerrit.cloudera.org:8080/#/c/5111/1/src/kudu/consensus/consensus_peers.cc
File src/kudu/consensus/consensus_peers.cc:

PS1, Line 235:   if (!controller_.status().ok()) {
             :     if (controller_.status().IsRemoteError()) {
             :       // Most controller errors are caused by network issues or corner cases
             :       // like shutdown and failure to serialize a protobuf. Therefore, we
             :       // generally consider these errors to indicate an unreachable peer.
             :       // However, a RemoteError wraps some other error propagated from the
             :       // remote peer, so we know the remote is alive. Therefore, we will let
             :       // the queue know that the remote is responsive.
             :       queue_->NotifyPeerIsResponsiveDespiteError(peer_pb_.permanent_uuid());
             :     }
             :     ProcessResponseError(controller_.status());
             :     return;
             :   }
             :   // Again, let the queue know that the remote is still responsive, since we
             :   // will not be sending this error response through to the queue.
             :   // For certain error codes, we want the queue to treat the remote as
             :   // unresponsive and take necessary actions, hence bypassing the notification
             :   // for those error codes.
             :   if (response_.has_error()) {
             :     if (response_.error().code() != TabletServerErrorPB::TABLET_NOT_FOUND &&
             :         response_.error().code() != TabletServerErrorPB::WRONG_SERVER_UUID &&
             :         response_.error().code() != TabletServerErrorPB::TABLET_NOT_RUNNING) {
             :       queue_->NotifyPeerIsResponsiveDespiteError(peer_pb_.permanent_uuid());
             :       ProcessResponseError(StatusFromPB(response_.error().status()));
             :       return;
             :     }
             :   } else if (response_.status().has_error()) {
             :     if (response_.status().error().code() == consensus::ConsensusErrorPB::CANNOT_PREPARE) {
             :       queue_->NotifyPeerIsResponsiveDespiteError(peer_pb_.permanent_uuid());
             :       ProcessResponseError(StatusFromPB(response_.error().status()));
             :       return;
             :     }
             :   }
It would be good if we could have something like we have in the c++ client where we analize the respose and then get a directive back.

Something like:
PeerResponseStatus AnalyzeResponse() which would have the nightmarish string of ifs inside and return:
OK
ERROR
ERROR_BUT_RESPONSIVE

would that make sense?


-- 
To view, visit http://gerrit.cloudera.org:8080/5111
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0d3f84fd297a8e4760208a213c1ee393e92499a3
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [consensus] KUDU-1613: Fix replica eviction failure for WRONG SERVER UUID

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Dinesh Bhat has posted comments on this change.

Change subject: [consensus] KUDU-1613: Fix replica eviction failure for WRONG_SERVER_UUID
......................................................................


Patch Set 5:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/5111/5//COMMIT_MSG
Commit Message:

Line 9: Fixes replica eviction from leader where a follower returning
> Can you please explain your approach to fixing this issue in the commit mes
Done, could you see updated version would suffice ?


http://gerrit.cloudera.org:8080/#/c/5111/5/src/kudu/consensus/consensus_peers.cc
File src/kudu/consensus/consensus_peers.cc:

Line 238:   switch (ps.response) {
> Maybe add a default case that calls LOG(FATAL) on the (unknown) response ty
Done


http://gerrit.cloudera.org:8080/#/c/5111/5/src/kudu/consensus/consensus_peers.h
File src/kudu/consensus/consensus_peers.h:

PS5, Line 140: respond with appropriate response status.
> Nit: technically, this method does not "respond" (in the sense of sending a
yeah, good catch. fixed.


http://gerrit.cloudera.org:8080/#/c/5111/5/src/kudu/consensus/consensus_queue-test.cc
File src/kudu/consensus/consensus_queue-test.cc:

Line 816: // Verify that tablet copy is triggered when peer responds with
> Why do we want tablet copy to be triggered when the peer responds that it i
I copied this line-by-line from TestTriggerTabletCopyIfTabletNotFound above. In this case, WRONG_SERVER_UUID almost always means that for some reason tablet server went through reinitialization, which means  all the tablets become inaccessible due to uuid mismatch for all RPCs via CheckUuidMatchOrRespond after the reinitialization. In some way, this is the same situation as TABLET_NOT_FOUND but impact is more because it affected all replicas on that node. As for why tablet copy is necessary, I think it's probably more relevant when the tablet becomes under-replicated. I added this test really for that 1-line change in consensus_queue.cc.


http://gerrit.cloudera.org:8080/#/c/5111/5/src/kudu/integration-tests/raft_consensus-itest.cc
File src/kudu/integration-tests/raft_consensus-itest.cc:

Line 2770
> Why was this comment removed? If it's no longer true, shouldn't the entire 
yeah, this line of code(and comment was supposed to be removed in next patch). I removed both comment+code in this patch itself initially, but added only code back realizing that this code is testing TABLET_NOT_RUNNING situation(forgot to add comment back). I think I am going to leave it without comments for now because my my next patch laid on top of this is removing both anyways, and it's a somewhat messy rebase situation otherwise.


PS5, Line 2782: // These tests exhibit the replica eviction behaviors when a follower
              : // returns WRONG_SERVER_UUID error code. Tests verify that followers
              : // returning these error codes are evicted from consensus after a
              : // specified --follower_unavailable_considered_failed_sec.
> Nit: this is talking about tests in the plural, but it's just a single test
done, yeah I think I forgot to change comments when I split one patch into multiple patches.


Line 2795:   int leader_indx = -1;
> Nit: conventionally, we use 'idx' as an infix for naming an index, not 'ind
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/5111
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0d3f84fd297a8e4760208a213c1ee393e92499a3
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [consensus] KUDU-1613: Fix replica eviction failure for WRONG SERVER UUID

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change.

Change subject: [consensus] KUDU-1613: Fix replica eviction failure for WRONG_SERVER_UUID
......................................................................


Patch Set 7:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5111/7/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

Line 597:       peer->needs_tablet_copy = true;
Why this for WRONG_SERVER_UUID?


-- 
To view, visit http://gerrit.cloudera.org:8080/5111
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0d3f84fd297a8e4760208a213c1ee393e92499a3
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [consensus] KUDU-1718: Fix few bugs around replica eviction failures

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Hello David Ribeiro Alves, Adar Dembo, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5111

to look at the new patch set (#2).

Change subject: [consensus] KUDU-1718: Fix few bugs around replica eviction failures
......................................................................

[consensus] KUDU-1718: Fix few bugs around replica eviction failures

KUDU-1613: Tablet server returning WRONG_SERVER_UUID perhaps
due to a disks-wiped-out situation and the server was never evicted
from the consensus config.
KUDU-1407: Tablet under TABLET_NOT_RUNNING state due to a large
tablet eventually failed to bootstrap should be evicted and replicated
either on same node or somewhere else.
KUDU-1608: Catalog manager retries DeleteTablet RPC forever upon
replica eviction in some of the above scenarios. We can treat some of
these errors as fatal, and stop retrying on them.

Added bunch of tests to repro the bugs and verify these fixes.

Change-Id: I0d3f84fd297a8e4760208a213c1ee393e92499a3
---
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_peers.h
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/tserver/tablet_server-test-base.h
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
9 files changed, 271 insertions(+), 56 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/11/5111/2
-- 
To view, visit http://gerrit.cloudera.org:8080/5111
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0d3f84fd297a8e4760208a213c1ee393e92499a3
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [consensus] KUDU-1718: Fix few bugs around replica eviction failures

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: [consensus] KUDU-1718: Fix few bugs around replica eviction failures
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5111/3/src/kudu/consensus/consensus_peers.h
File src/kudu/consensus/consensus_peers.h:

Line 62:     ERROR,
> maybe this would be better named TIMED_OUT ?
IMO it shouldn't this should be a general umbrella status for when the peer returned a non-recoverable error which might not be TIMED_OUT


-- 
To view, visit http://gerrit.cloudera.org:8080/5111
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0d3f84fd297a8e4760208a213c1ee393e92499a3
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [consensus] KUDU-1613: Fix replica eviction failure for WRONG SERVER UUID

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Dinesh Bhat has posted comments on this change.

Change subject: [consensus] KUDU-1613: Fix replica eviction failure for WRONG_SERVER_UUID
......................................................................


Patch Set 3:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/5111/3/src/kudu/consensus/consensus_peers.cc
File src/kudu/consensus/consensus_peers.cc:

Line 198: 
> nit: spurious line
Done


Line 241:       // Fallthrough intended.
> use FALLTHROUGH_INTENDED macro instead
Done


Line 256:   }
> better to break at the end of the case
Done


http://gerrit.cloudera.org:8080/#/c/5111/3/src/kudu/tserver/tablet_server_test_util.cc
File src/kudu/tserver/tablet_server_test_util.cc:

Line 27: DEFINE_int32(num_updater_threads, 1, "Number of updating threads to launch");
> Why put this stuff in this file? Maybe you need to create a new tablet_serv
yeah that was my initial thought, but I decided to chuck a new file for few variable definitions which need linking from other modules. This source file seemed like a good candidate to hold them since it's part of same library against which other modules link.


-- 
To view, visit http://gerrit.cloudera.org:8080/5111
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0d3f84fd297a8e4760208a213c1ee393e92499a3
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [consensus] KUDU-1613: Fix replica eviction failure for WRONG SERVER UUID

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Hello David Ribeiro Alves, Mike Percy, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5111

to look at the new patch set (#4).

Change subject: [consensus] KUDU-1613: Fix replica eviction failure for WRONG_SERVER_UUID
......................................................................

[consensus] KUDU-1613: Fix replica eviction failure for WRONG_SERVER_UUID

Fixes replica eviction from leader where a follower returning
WRONG_SERVER_UUID perhaps due to a disks-removed situation and
the follower is never evicted from the consensus config.

Change-Id: I0d3f84fd297a8e4760208a213c1ee393e92499a3
---
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_peers.h
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
5 files changed, 182 insertions(+), 50 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/11/5111/4
-- 
To view, visit http://gerrit.cloudera.org:8080/5111
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0d3f84fd297a8e4760208a213c1ee393e92499a3
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [consensus] KUDU-1613: Fix replica eviction failure for WRONG SERVER UUID

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Hello David Ribeiro Alves, Mike Percy, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5111

to look at the new patch set (#5).

Change subject: [consensus] KUDU-1613: Fix replica eviction failure for WRONG_SERVER_UUID
......................................................................

[consensus] KUDU-1613: Fix replica eviction failure for WRONG_SERVER_UUID

Fixes replica eviction from leader where a follower returning
WRONG_SERVER_UUID perhaps due to a disks-removed situation and
the follower is never evicted from the consensus config.

Change-Id: I0d3f84fd297a8e4760208a213c1ee393e92499a3
---
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_peers.h
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
5 files changed, 182 insertions(+), 46 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/11/5111/5
-- 
To view, visit http://gerrit.cloudera.org:8080/5111
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0d3f84fd297a8e4760208a213c1ee393e92499a3
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [consensus] KUDU-1613: Fix replica eviction failure for WRONG SERVER UUID

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change.

Change subject: [consensus] KUDU-1613: Fix replica eviction failure for WRONG_SERVER_UUID
......................................................................


Patch Set 7:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5111/7/src/kudu/consensus/consensus_queue-test.cc
File src/kudu/consensus/consensus_queue-test.cc:

Line 816: // Verify that tablet copy is triggered when peer responds with
Is this test still supposed to be here?


-- 
To view, visit http://gerrit.cloudera.org:8080/5111
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0d3f84fd297a8e4760208a213c1ee393e92499a3
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [consensus] KUDU-1718: Fix few bugs around replica eviction failures

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Hello David Ribeiro Alves, Mike Percy, Adar Dembo, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5111

to look at the new patch set (#3).

Change subject: [consensus] KUDU-1718: Fix few bugs around replica eviction failures
......................................................................

[consensus] KUDU-1718: Fix few bugs around replica eviction failures

KUDU-1613: Tablet server returning WRONG_SERVER_UUID perhaps
due to a disks-wiped-out situation and the server was never evicted
from the consensus config.
KUDU-1407: Tablet under TABLET_NOT_RUNNING state due to a large
tablet eventually failed to bootstrap should be evicted and replicated
either on same node or somewhere else.
KUDU-1608: Catalog manager retries DeleteTablet RPC forever upon
replica eviction in some of the above scenarios. We can treat some of
these errors as fatal, and stop retrying on them.

Added bunch of tests to repro the bugs and verify these fixes.

Change-Id: I0d3f84fd297a8e4760208a213c1ee393e92499a3
---
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_peers.h
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/tserver/tablet_server-test-base.h
M src/kudu/tserver/tablet_server_test_util.cc
M src/kudu/tserver/ts_tablet_manager.cc
M src/kudu/tserver/ts_tablet_manager.h
10 files changed, 299 insertions(+), 74 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/11/5111/3
-- 
To view, visit http://gerrit.cloudera.org:8080/5111
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0d3f84fd297a8e4760208a213c1ee393e92499a3
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [consensus] KUDU-1718: Fix few bugs around replica eviction failures

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: [consensus] KUDU-1718: Fix few bugs around replica eviction failures
......................................................................


Patch Set 2:

(4 comments)

can you please address the tidy bot nits? I know most are not your fault but would be nice to fix them anyway

http://gerrit.cloudera.org:8080/#/c/5111/2/src/kudu/consensus/consensus_peers.cc
File src/kudu/consensus/consensus_peers.cc:

PS2, Line 235: // Analyze the response from peer.
             :   PeerResponseStatus ps = AnalyzePeerResponse();
             :   if (PREDICT_FALSE(ps.response != PeerResponseStatus::OK)) {
             :     VLOG_WITH_PREFIX_UNLOCKED(1) << "Error response from peer: " << ps.status.ToString()
             :                                  << ": " << response_.ShortDebugString();
             :     if (ps.response == PeerResponseStatus::ERROR_BUT_ALIVE) {
             :       queue_->NotifyPeerIsResponsiveDespiteError(peer_pb_.permanent_uuid());
             :     }
             :     ProcessResponseError(ps.status);
can you change this for a switch case where the error cases get a fall through?
something like:
switch (ps) {
case ERROR_BUT_ALIVE:
  queue_->NotifyPeerIsResponsiveDespiteError(peer_pb_.permanent_uuid());
  // Fallthrough intended.
case ERROR:
  ProcessResponseError(ps.status);
  return;
case OK:
  // The queue's handling of the peer response may generate IO (reads against
  // the WAL) and SendNextRequest() may do the same thing. So we run the rest
  // of the response handling logic on our thread pool and not on the reactor
  // thread.
  Status s = thread_pool_->SubmitClosure(Bind(&Peer::DoProcessResponse, Unretained(this)));
  if (PREDICT_FALSE(!s.ok())) {
    LOG_WITH_PREFIX_UNLOCKED(WARNING) << "Unable to process peer response: " << s.ToString()
        << ": " << response_.ShortDebugString();
    sem_.Release();
  }


http://gerrit.cloudera.org:8080/#/c/5111/2/src/kudu/consensus/consensus_peers.h
File src/kudu/consensus/consensus_peers.h:

PS2, Line 58: PeerResponseStatus
thanks for adding this.


http://gerrit.cloudera.org:8080/#/c/5111/2/src/kudu/integration-tests/raft_consensus-itest.cc
File src/kudu/integration-tests/raft_consensus-itest.cc:

PS2, Line 2695: // Tests KUDU-1613 and KUDU-1407 fixes when followers aren't evicted.
please be a little more discriptive


http://gerrit.cloudera.org:8080/#/c/5111/2/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

PS2, Line 2589:  VLOG(4) << "Error deleting the tablet " << tablet_id() << ": "
              :               << resp_.DebugString();
should we use a higher log level here or is there enough information in the statements below?


-- 
To view, visit http://gerrit.cloudera.org:8080/5111
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0d3f84fd297a8e4760208a213c1ee393e92499a3
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [consensus] KUDU-1613: Fix replica eviction failure for WRONG SERVER UUID

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Dinesh Bhat has posted comments on this change.

Change subject: [consensus] KUDU-1613: Fix replica eviction failure for WRONG_SERVER_UUID
......................................................................


Patch Set 5:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/5111/3/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

Line 2609:                      const scoped_refptr<TabletInfo>& tablet,
> Would be better to use the standard prefix T <tablet-id> P <peer-uuid> for 
This change is moved to different patch, addressing it there.


http://gerrit.cloudera.org:8080/#/c/5111/3/src/kudu/tserver/ts_tablet_manager.h
File src/kudu/tserver/ts_tablet_manager.h:

Line 213:   // Open a tablet meta from the local file system by loading its superblock.
> Why optional<>* ? New param needs doc
Since this change moved to a different patch, I have addressed it there.
http://gerrit.cloudera.org:8080/5352


-- 
To view, visit http://gerrit.cloudera.org:8080/5111
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0d3f84fd297a8e4760208a213c1ee393e92499a3
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [consensus] KUDU-1718: Fix few bugs around replica eviction failures

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Dinesh Bhat has posted comments on this change.

Change subject: [consensus] KUDU-1718: Fix few bugs around replica eviction failures
......................................................................


Patch Set 1:

(1 comment)

TFTR David,
 
> (1 comment)
 > 
 > would it be possible to split this patch into several ones that
 > take care of each ticker individually or would that be too hard?

I can do that if it helps for easier review, but clubbing them in one patch was intentional since they were all related to one set of problem space and it was easier to test them together. Also, the actual fixes are in few lines under consensus_peers.cc and catalog_manager.cc, majority of the diffs are in tests I think. 
PS: I need to move the kudu-tool-test diffs to a correct location, because I used kudu-tool-test as a placeholder and it is a terrible idea to place this test there.

http://gerrit.cloudera.org:8080/#/c/5111/1/src/kudu/consensus/consensus_peers.cc
File src/kudu/consensus/consensus_peers.cc:

PS1, Line 235:   if (!controller_.status().ok()) {
             :     if (controller_.status().IsRemoteError()) {
             :       // Most controller errors are caused by network issues or corner cases
             :       // like shutdown and failure to serialize a protobuf. Therefore, we
             :       // generally consider these errors to indicate an unreachable peer.
             :       // However, a RemoteError wraps some other error propagated from the
             :       // remote peer, so we know the remote is alive. Therefore, we will let
             :       // the queue know that the remote is responsive.
             :       queue_->NotifyPeerIsResponsiveDespiteError(peer_pb_.permanent_uuid());
             :     }
             :     ProcessResponseError(controller_.status());
             :     return;
             :   }
             :   // Again, let the queue know that the remote is still responsive, since we
             :   // will not be sending this error response through to the queue.
             :   // For certain error codes, we want the queue to treat the remote as
             :   // unresponsive and take necessary actions, hence bypassing the notification
             :   // for those error codes.
             :   if (response_.has_error()) {
             :     if (response_.error().code() != TabletServerErrorPB::TABLET_NOT_FOUND &&
             :         response_.error().code() != TabletServerErrorPB::WRONG_SERVER_UUID &&
             :         response_.error().code() != TabletServerErrorPB::TABLET_NOT_RUNNING) {
             :       queue_->NotifyPeerIsResponsiveDespiteError(peer_pb_.permanent_uuid());
             :       ProcessResponseError(StatusFromPB(response_.error().status()));
             :       return;
             :     }
             :   } else if (response_.status().has_error()) {
             :     if (response_.status().error().code() == consensus::ConsensusErrorPB::CANNOT_PREPARE) {
             :       queue_->NotifyPeerIsResponsiveDespiteError(peer_pb_.permanent_uuid());
             :       ProcessResponseError(StatusFromPB(response_.error().status()));
             :       return;
             :     }
             :   }
> It would be good if we could have something like we have in the c++ client 
Yeah, that's a good idea, I will look into that. Also slightly related to this topic: I wanted to group these errors into 2 buckets,  fatal and retriable. That way we can consolidate all of the error codes in one place and less window for programmers to leave some error codes behind. I will attempt that in a different patch though as it is not clear to me whether such grouping could work in all situations. For eg, TABLET_NOT_FOUND may be a fatal error for consensus, but may not be a fatal for some other workflow like DeleteTablet.


-- 
To view, visit http://gerrit.cloudera.org:8080/5111
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0d3f84fd297a8e4760208a213c1ee393e92499a3
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [consensus] KUDU-1613: Fix replica eviction failure for WRONG SERVER UUID

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change.

Change subject: [consensus] KUDU-1613: Fix replica eviction failure for WRONG_SERVER_UUID
......................................................................


Patch Set 7:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/5111/7/src/kudu/consensus/consensus_peers.cc
File src/kudu/consensus/consensus_peers.cc:

Line 305:     if (response_.error().code() != TabletServerErrorPB::WRONG_SERVER_UUID &&
I think we should consider WRONG_SERVER_UUID as unresponsive (ERROR) instead of ERROR_BUT_ALIVE, since it seems pointless to try to run tablet copy on it when that would never succeed due to it having a different UUID.


http://gerrit.cloudera.org:8080/#/c/5111/7/src/kudu/integration-tests/raft_consensus-itest.cc
File src/kudu/integration-tests/raft_consensus-itest.cc:

Line 2789:   MonoDelta kTimeout = MonoDelta::FromSeconds(30);
nit: const


Line 2795:   int leader_idx = -1;
nit: This is written in a C-like way. It would be a bit more readable to declare the variables at the site where they are first used. Also the nullptr initializations are not needed and the -1s are never used.


Line 2816: 
Please add a comment in here to explain what's going on:

// At this point, the leader will evict the old tablet due to unresponsiveness and the master will assign the replica to the tablet server with the new UUID. We wait for the leader to copy the tablet to the newly-wiped replica.


-- 
To view, visit http://gerrit.cloudera.org:8080/5111
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0d3f84fd297a8e4760208a213c1ee393e92499a3
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [consensus] KUDU-1718: Fix few bugs around replica eviction failures

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change.

Change subject: [consensus] KUDU-1718: Fix few bugs around replica eviction failures
......................................................................


Patch Set 3:

I would also prefer to see these split into several individual patches

-- 
To view, visit http://gerrit.cloudera.org:8080/5111
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0d3f84fd297a8e4760208a213c1ee393e92499a3
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] [consensus] KUDU-1613: Fix replica eviction failure for WRONG SERVER UUID

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change.

Change subject: [consensus] KUDU-1613: Fix replica eviction failure for WRONG_SERVER_UUID
......................................................................


Patch Set 5:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/5111/5//COMMIT_MSG
Commit Message:

Line 9: Fixes replica eviction from leader where a follower returning
Can you please explain your approach to fixing this issue in the commit message in detail?


http://gerrit.cloudera.org:8080/#/c/5111/5/src/kudu/consensus/consensus_queue-test.cc
File src/kudu/consensus/consensus_queue-test.cc:

Line 816: // Verify that tablet copy is triggered when peer responds with
Why do we want tablet copy to be triggered when the peer responds that it is no longer the expected peer?


-- 
To view, visit http://gerrit.cloudera.org:8080/5111
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0d3f84fd297a8e4760208a213c1ee393e92499a3
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [consensus] KUDU-1613: Fix replica eviction failure for WRONG SERVER UUID

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Hello David Ribeiro Alves, Mike Percy, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5111

to look at the new patch set (#8).

Change subject: [consensus] KUDU-1613: Fix replica eviction failure for WRONG_SERVER_UUID
......................................................................

[consensus] KUDU-1613: Fix replica eviction failure for WRONG_SERVER_UUID

Fixes replica eviction failure where a follower hosting the replica returns
WRONG_SERVER_UUID (perhaps due to all-disks-replaced situation) and that
follower is never evicted from the consensus config.

If a tablet server changes its uuid, it starts responding with
Status::InvalidArgument for most of the RPCs along with the error
code WRONG_SERVER_UUID in the response. Raft consensus can treat this
error code as fatal since this means that the server has changed its
identity (uuid) and the server is irrecoverable from that situation.
Tablet leader should identify this as a transient error and evict such nodes
from the tablet config after heartbeat timeout. Master will re-replicate
the tablet if the tablet is under-replicated after eviction.

It is important to note that same tablet server identified with new uuid
could be one of the candidates considered for re-replication because
the follower registered itself as a new tablet server. Added an integration
test which verifies this workflow.

Change-Id: I0d3f84fd297a8e4760208a213c1ee393e92499a3
---
M src/kudu/consensus/consensus_peers.cc
M src/kudu/consensus/consensus_peers.h
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/integration-tests/raft_consensus-itest.cc
5 files changed, 194 insertions(+), 50 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/11/5111/8
-- 
To view, visit http://gerrit.cloudera.org:8080/5111
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I0d3f84fd297a8e4760208a213c1ee393e92499a3
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [consensus] KUDU-1718: Fix few bugs around replica eviction failures

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Dinesh Bhat has posted comments on this change.

Change subject: [consensus] KUDU-1718: Fix few bugs around replica eviction failures
......................................................................


Patch Set 2:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/5111/2/src/kudu/consensus/consensus_peers.cc
File src/kudu/consensus/consensus_peers.cc:

PS2, Line 235: // Analyze the response from peer.
             :   PeerResponseStatus ps = AnalyzePeerResponse();
             :   if (PREDICT_FALSE(ps.response != PeerResponseStatus::OK)) {
             :     VLOG_WITH_PREFIX_UNLOCKED(1) << "Error response from peer: " << ps.status.ToString()
             :                                  << ": " << response_.ShortDebugString();
             :     if (ps.response == PeerResponseStatus::ERROR_BUT_ALIVE) {
             :       queue_->NotifyPeerIsResponsiveDespiteError(peer_pb_.permanent_uuid());
             :     }
             :     ProcessResponseError(ps.status);
> can you change this for a switch case where the error cases get a fall thro
Good idea, Done.


http://gerrit.cloudera.org:8080/#/c/5111/2/src/kudu/integration-tests/raft_consensus-itest.cc
File src/kudu/integration-tests/raft_consensus-itest.cc:

PS2, Line 2695: // Tests KUDU-1613 and KUDU-1407 fixes when followers aren't evicted.
> please be a little more discriptive
Done


http://gerrit.cloudera.org:8080/#/c/5111/2/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

Line 2586:   virtual void HandleResponse(int attempt) OVERRIDE {
> warning: parameter 'attempt' is unused [misc-unused-parameters]
We are better off tackling this in a different patch since this will make other interfaces inconsistent. This variable is being unused at other places too AFAICT.


PS2, Line 2589:  VLOG(4) << "Error deleting the tablet " << tablet_id() << ": "
              :               << resp_.DebugString();
> should we use a higher log level here or is there enough information in the
I think the warnings below carry enough info for logging purposes, I added this if we want to be more verbose with "vmodule" option. I reduced it down to 1 looking at other VLOG levels in this file.


http://gerrit.cloudera.org:8080/#/c/5111/2/src/kudu/tserver/tablet_server-test-base.h
File src/kudu/tserver/tablet_server-test-base.h:

Line 476: const char* TabletServerTestBase::kTableId = "TestTable";
> warning: variable 'kTableId' defined in a header file; variable definitions
Done


Line 477: const char* TabletServerTestBase::kTabletId = "TestTablet";
> warning: variable 'kTabletId' defined in a header file; variable definition
Done


Line 478: const string TabletServerTestBase::kTestDir = "TabletServerTest-fsroot";
> warning: variable 'kTestDir' defined in a header file; variable definitions
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/5111
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0d3f84fd297a8e4760208a213c1ee393e92499a3
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [consensus] KUDU-1718: Fix few bugs around replica eviction failures

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change.

Change subject: [consensus] KUDU-1718: Fix few bugs around replica eviction failures
......................................................................


Patch Set 3:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/5111/3/src/kudu/consensus/consensus_peers.cc
File src/kudu/consensus/consensus_peers.cc:

Line 198: 
nit: spurious line


Line 241:       // Fallthrough intended.
use FALLTHROUGH_INTENDED macro instead


Line 256:   }
better to break at the end of the case


http://gerrit.cloudera.org:8080/#/c/5111/3/src/kudu/consensus/consensus_peers.h
File src/kudu/consensus/consensus_peers.h:

Line 62:     ERROR,
maybe this would be better named TIMED_OUT ?


http://gerrit.cloudera.org:8080/#/c/5111/3/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

Line 2609:           LOG(WARNING) << "TS " << target_ts_desc_->ToString()
Would be better to use the standard prefix T <tablet-id> P <peer-uuid> for these log messages here and below


http://gerrit.cloudera.org:8080/#/c/5111/3/src/kudu/tserver/tablet_server_test_util.cc
File src/kudu/tserver/tablet_server_test_util.cc:

Line 27: DEFINE_int32(num_updater_threads, 1, "Number of updating threads to launch");
Why put this stuff in this file? Maybe you need to create a new tablet_server-test-base.cc file?


http://gerrit.cloudera.org:8080/#/c/5111/3/src/kudu/tserver/ts_tablet_manager.h
File src/kudu/tserver/ts_tablet_manager.h:

Line 213:       boost::optional<TabletServerErrorPB::Code>* error_code);
Why optional<>* ? New param needs doc


-- 
To view, visit http://gerrit.cloudera.org:8080/5111
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I0d3f84fd297a8e4760208a213c1ee393e92499a3
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes