You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Dinesh Bhat (Code Review)" <ge...@cloudera.org> on 2016/09/24 03:46:55 UTC

[kudu-CR] [tools] Implement a manual leader step down for a tablet

Hello David Ribeiro Alves, Todd Lipcon,

I'd like you to do a code review.  Please visit

    http://gerrit.cloudera.org:8080/4533

to review the following change.

Change subject: [tools] Implement a manual leader_step_down for a tablet
......................................................................

[tools] Implement a manual leader_step_down for a tablet

This change introduces a leader_step_down functionality
under 'kudu tablet'. This tool may be handy to recover from
situations when a single tablet server is overloaded and we want
to kick off  a new election to balance the load across the clusters.
Although it is not guaranteed that a different replica will be elected
as the leader, this is an optimistic effort to elect a new tablet
server as the leader for the given tablet in the cluster.

Also snuck in a small change to display host:port details with
'kudu tablet list <master_addresses> --list_tablets' command.

Change-Id: Ia046a28a2008f4f5d1e955f57752a32a1ddc5ab8
---
M src/kudu/tools/kudu-admin-test.cc
M src/kudu/tools/tool_action_table.cc
M src/kudu/tools/tool_action_tablet.cc
3 files changed, 185 insertions(+), 1 deletion(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/33/4533/1
-- 
To view, visit http://gerrit.cloudera.org:8080/4533
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Ia046a28a2008f4f5d1e955f57752a32a1ddc5ab8
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [tools] Implement a manual leader step down for a tablet

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Hello David Ribeiro Alves, Adar Dembo, Todd Lipcon, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/4533

to look at the new patch set (#8).

Change subject: [tools] Implement a manual leader_step_down for a tablet
......................................................................

[tools] Implement a manual leader_step_down for a tablet

This change introduces a leader_step_down functionality
under 'kudu tablet'. This tool may be handy to recover from
situations when a single tablet server is overloaded and we want
to kick off a new election to balance the load across the clusters.
Although it is not guaranteed that a different replica will be elected
as the leader, this is an optimistic effort to elect a new tablet
server as the leader for the given tablet in the cluster.

Test: Ran 8000 iterations of leader_step_down test on dist_test.

Also snuck in a small change to display host:port details with
'kudu table list <master_addresses> --list_tablets' command.

Change-Id: Ia046a28a2008f4f5d1e955f57752a32a1ddc5ab8
---
M src/kudu/tools/kudu-admin-test.cc
M src/kudu/tools/tool_action_cluster.cc
M src/kudu/tools/tool_action_local_replica.cc
M src/kudu/tools/tool_action_table.cc
M src/kudu/tools/tool_action_tablet.cc
5 files changed, 168 insertions(+), 44 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/33/4533/8
-- 
To view, visit http://gerrit.cloudera.org:8080/4533
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ia046a28a2008f4f5d1e955f57752a32a1ddc5ab8
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [tools] Implement a manual leader step down for a tablet

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Dinesh Bhat has posted comments on this change.

Change subject: [tools] Implement a manual leader_step_down for a tablet
......................................................................


Patch Set 7:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/4533/6/src/kudu/tools/kudu-admin-test.cc
File src/kudu/tools/kudu-admin-test.cc:

PS6, Line 219: --enable_leader_failure_detection=false
> Why do you need to override the default timeout value of 30s? timeout_ms is
Yeah, I am reverting to default timeout after bunch of tests I executed on dist_test this morning. Initial thought was to allow just enough room for consensus to change, but turned out to be a bad idea.


PS6, Line 222: pendValuesFromMap(tablet_servers_, &tservers);
> There's no difference in the outcome of the test if we retry 10 times witho
Bunch of things I narrowed down here this morning:
- The same leader getting elected was just an observation I made on local testbed, when moved to dist_test I was able to see different leader at times(and same leader as well bunch of times).

- Also spoke to David on few things, it turns out a leader may never get elected even after terms are advanced multiple times which makes this retry loop even more flaky-prone. This is consistent with my test on dist_test too where I observed couple of times an election resulting into no leader emerging on other side and subsequently failed with an error code from the actual kudu command L208 "Illegal state: not currently a leader". 

- I also kept just one routine grabbing just the consensus term instead of going to master for leader_uuid, as it turns out master could have a stale info as well. This is inline with what you suggested in one of your earlier comments.

Given your above suggestions and my test results, I am inclined to keep the code as following now:

if (SubProcess::Call('kudu leader_step_down')) {
  AssertEventually([&]() {
        GetINfoFromConsensus(leader_uuid, new_term)
        ASSERT_GT(new_term, current_term);
  }
}

- The reason I am hesitant to ASSERT on leader_step_down command is because as we know leader might have changed in the background not resulting in a new leader, and also I don't happen to be waiting for a leader election after cluster is setup, in which case it's perfectly possible that the command will fail. We can expect the term to have advanced in 30 secs however if the command succeeds, hence safe to assert on that.


-- 
To view, visit http://gerrit.cloudera.org:8080/4533
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ia046a28a2008f4f5d1e955f57752a32a1ddc5ab8
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [tools] Implement a manual leader step down for a tablet

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Hello David Ribeiro Alves, Adar Dembo, Todd Lipcon, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/4533

to look at the new patch set (#6).

Change subject: [tools] Implement a manual leader_step_down for a tablet
......................................................................

[tools] Implement a manual leader_step_down for a tablet

This change introduces a leader_step_down functionality
under 'kudu tablet'. This tool may be handy to recover from
situations when a single tablet server is overloaded and we want
to kick off a new election to balance the load across the clusters.
Although it is not guaranteed that a different replica will be elected
as the leader, this is an optimistic effort to elect a new tablet
server as the leader for the given tablet in the cluster.

Also snuck in a small change to display host:port details with
'kudu table list <master_addresses> --list_tablets' command.

Change-Id: Ia046a28a2008f4f5d1e955f57752a32a1ddc5ab8
---
M src/kudu/tools/kudu-admin-test.cc
M src/kudu/tools/tool_action_cluster.cc
M src/kudu/tools/tool_action_local_replica.cc
M src/kudu/tools/tool_action_table.cc
M src/kudu/tools/tool_action_tablet.cc
5 files changed, 187 insertions(+), 44 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/33/4533/6
-- 
To view, visit http://gerrit.cloudera.org:8080/4533
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ia046a28a2008f4f5d1e955f57752a32a1ddc5ab8
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [tools] Implement a manual leader step down for a tablet

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change.

Change subject: [tools] Implement a manual leader_step_down for a tablet
......................................................................


Patch Set 9: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/4533
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ia046a28a2008f4f5d1e955f57752a32a1ddc5ab8
Gerrit-PatchSet: 9
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] [tools] Implement a manual leader step down for a tablet

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change.

Change subject: [tools] Implement a manual leader_step_down for a tablet
......................................................................


Patch Set 8:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/4533/8/src/kudu/tools/kudu-admin-test.cc
File src/kudu/tools/kudu-admin-test.cc:

Line 227:   Status s = GetTermFromConsensus(tservers, tablet_id_,
Nit: since the variable 's' is not used anywhere else down the code, consider

ASSERT_TRUE(GetTermFromConsensus(...).IsNotFound());


http://gerrit.cloudera.org:8080/#/c/4533/8/src/kudu/tools/tool_action_local_replica.cc
File src/kudu/tools/tool_action_local_replica.cc:

PS8, Line 121: I
Is capitalization of 'i' intentional?


http://gerrit.cloudera.org:8080/#/c/4533/8/src/kudu/tools/tool_action_table.cc
File src/kudu/tools/tool_action_table.cc:

PS8, Line 53: const char* const kMasterAddressesArgDesc
It seems this is used across different set of tools.  Is it possible to define it only once, not repeating it over and over again across all those files which require this parameter?


http://gerrit.cloudera.org:8080/#/c/4533/8/src/kudu/tools/tool_action_tablet.cc
File src/kudu/tools/tool_action_tablet.cc:

PS8, Line 67: I
I saw this string is defined in different file as well.  Does it make sense to have it defined only in a single place?  E.g., putting it into kudu::tools::util namespace or alike.


PS8, Line 177: string
const string&  here and further when using FindOrDie() ?

The idea is that FindOrDie() returns const reference, and the result is not modified in the code below.


-- 
To view, visit http://gerrit.cloudera.org:8080/4533
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ia046a28a2008f4f5d1e955f57752a32a1ddc5ab8
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [tools] Implement a manual leader step down for a tablet

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Dinesh Bhat has posted comments on this change.

Change subject: [tools] Implement a manual leader_step_down for a tablet
......................................................................


Patch Set 9:

(5 comments)

> (5 comments)
 > 
 > There is one extra instance of "master_addresses" string literal
 > and corresponding description in tool_action_test.cc.  Could you
 > updated that to use the common constant as well?

TFTR, that instance was missing since I hadn't rebased on top of your latest tool change I think, rebased and updated now.

http://gerrit.cloudera.org:8080/#/c/4533/9/src/kudu/tools/tool_action_common.cc
File src/kudu/tools/tool_action_common.cc:

PS9, Line 102: hostname:port
> Nit: do you want to add that the 'port' part could be omitted if the master
I guess so, although I would like to punt this into another patch after making sure all the RPCs used by the toolset comply to that theory.


PS9, Line 104: I
> Nit: Is capitalization of 'i' intentional?
Sort of. I think we could be bit more consistent here; this was more in the spirit of 'Tablet Server, Kudu Master'.. and so on although comparison isn't really apple-to-apple. We have also used 'block identifier' to counter my argument here.


http://gerrit.cloudera.org:8080/#/c/4533/9/src/kudu/tools/tool_action_common.h
File src/kudu/tools/tool_action_common.h:

PS9, Line 81: extern
> Does 'extern' have to be present due to some compilation issues?  I would e
Fixed the ordering. About 'extern', I remember looking up cpp guideline for this, since I couldn't find anything, I followed the existing code's policy. Besides, explicit keyword provides a really strong readability.


http://gerrit.cloudera.org:8080/#/c/4533/9/src/kudu/tools/tool_action_tablet.cc
File src/kudu/tools/tool_action_tablet.cc:

PS9, Line 175: string tablet_id
> nit: const string& tablet_id ?
Done


Line 246:         "to step down")
> nit: parameter/line continuation shift is 4 spaces.
I am not clear about this. Guideline seems to say : "start the arguments on a new line indented by four spaces and continue at that 4 space indent". This isn't necessarily starting the argument on a newline. I remember following what other files have followed here, for eg take a look at tool_action_*.cc. If needed correction, I can fix them all in this change or a different change.


-- 
To view, visit http://gerrit.cloudera.org:8080/4533
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ia046a28a2008f4f5d1e955f57752a32a1ddc5ab8
Gerrit-PatchSet: 9
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [tools] Implement a manual leader step down for a tablet

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has submitted this change and it was merged.

Change subject: [tools] Implement a manual leader_step_down for a tablet
......................................................................


[tools] Implement a manual leader_step_down for a tablet

This change introduces a leader_step_down functionality
under 'kudu tablet'. This tool may be handy to recover from
situations when a single tablet server is overloaded and we want
to kick off a new election to balance the load across the clusters.
Although it is not guaranteed that a different replica will be elected
as the leader, this is an optimistic effort to elect a new tablet
server as the leader for the given tablet in the cluster.

Test: Ran 8000 iterations of leader_step_down test on dist_test.

Also snuck in a small change to display host:port details with
'kudu table list <master_addresses> --list_tablets' command.

Change-Id: Ia046a28a2008f4f5d1e955f57752a32a1ddc5ab8
Reviewed-on: http://gerrit.cloudera.org:8080/4533
Tested-by: Kudu Jenkins
Reviewed-by: Alexey Serbin <as...@cloudera.com>
---
M src/kudu/tools/kudu-admin-test.cc
M src/kudu/tools/kudu-tool-test.cc
M src/kudu/tools/tool_action_cluster.cc
M src/kudu/tools/tool_action_common.cc
M src/kudu/tools/tool_action_common.h
M src/kudu/tools/tool_action_local_replica.cc
M src/kudu/tools/tool_action_remote_replica.cc
M src/kudu/tools/tool_action_table.cc
M src/kudu/tools/tool_action_tablet.cc
M src/kudu/tools/tool_action_test.cc
10 files changed, 190 insertions(+), 66 deletions(-)

Approvals:
  Alexey Serbin: Looks good to me, approved
  Kudu Jenkins: Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/4533
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: Ia046a28a2008f4f5d1e955f57752a32a1ddc5ab8
Gerrit-PatchSet: 14
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [tools] Implement a manual leader step down for a tablet

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Hello Adar Dembo, Alexey Serbin, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/4533

to look at the new patch set (#12).

Change subject: [tools] Implement a manual leader_step_down for a tablet
......................................................................

[tools] Implement a manual leader_step_down for a tablet

This change introduces a leader_step_down functionality
under 'kudu tablet'. This tool may be handy to recover from
situations when a single tablet server is overloaded and we want
to kick off a new election to balance the load across the clusters.
Although it is not guaranteed that a different replica will be elected
as the leader, this is an optimistic effort to elect a new tablet
server as the leader for the given tablet in the cluster.

Test: Ran 8000 iterations of leader_step_down test on dist_test.

Also snuck in a small change to display host:port details with
'kudu table list <master_addresses> --list_tablets' command.

Change-Id: Ia046a28a2008f4f5d1e955f57752a32a1ddc5ab8
---
M src/kudu/tools/kudu-admin-test.cc
M src/kudu/tools/kudu-tool-test.cc
M src/kudu/tools/tool_action_cluster.cc
M src/kudu/tools/tool_action_common.cc
M src/kudu/tools/tool_action_common.h
M src/kudu/tools/tool_action_local_replica.cc
M src/kudu/tools/tool_action_remote_replica.cc
M src/kudu/tools/tool_action_table.cc
M src/kudu/tools/tool_action_tablet.cc
M src/kudu/tools/tool_action_test.cc
10 files changed, 191 insertions(+), 66 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/33/4533/12
-- 
To view, visit http://gerrit.cloudera.org:8080/4533
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ia046a28a2008f4f5d1e955f57752a32a1ddc5ab8
Gerrit-PatchSet: 12
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [tools] Implement a manual leader step down for a tablet

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Hello Adar Dembo, Alexey Serbin, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/4533

to look at the new patch set (#13).

Change subject: [tools] Implement a manual leader_step_down for a tablet
......................................................................

[tools] Implement a manual leader_step_down for a tablet

This change introduces a leader_step_down functionality
under 'kudu tablet'. This tool may be handy to recover from
situations when a single tablet server is overloaded and we want
to kick off a new election to balance the load across the clusters.
Although it is not guaranteed that a different replica will be elected
as the leader, this is an optimistic effort to elect a new tablet
server as the leader for the given tablet in the cluster.

Test: Ran 8000 iterations of leader_step_down test on dist_test.

Also snuck in a small change to display host:port details with
'kudu table list <master_addresses> --list_tablets' command.

Change-Id: Ia046a28a2008f4f5d1e955f57752a32a1ddc5ab8
---
M src/kudu/tools/kudu-admin-test.cc
M src/kudu/tools/kudu-tool-test.cc
M src/kudu/tools/tool_action_cluster.cc
M src/kudu/tools/tool_action_common.cc
M src/kudu/tools/tool_action_common.h
M src/kudu/tools/tool_action_local_replica.cc
M src/kudu/tools/tool_action_remote_replica.cc
M src/kudu/tools/tool_action_table.cc
M src/kudu/tools/tool_action_tablet.cc
M src/kudu/tools/tool_action_test.cc
10 files changed, 190 insertions(+), 66 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/33/4533/13
-- 
To view, visit http://gerrit.cloudera.org:8080/4533
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ia046a28a2008f4f5d1e955f57752a32a1ddc5ab8
Gerrit-PatchSet: 13
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [tools] Implement a manual leader step down for a tablet

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Hello David Ribeiro Alves, Adar Dembo, Todd Lipcon, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/4533

to look at the new patch set (#5).

Change subject: [tools] Implement a manual leader_step_down for a tablet
......................................................................

[tools] Implement a manual leader_step_down for a tablet

This change introduces a leader_step_down functionality
under 'kudu tablet'. This tool may be handy to recover from
situations when a single tablet server is overloaded and we want
to kick off a new election to balance the load across the clusters.
Although it is not guaranteed that a different replica will be elected
as the leader, this is an optimistic effort to elect a new tablet
server as the leader for the given tablet in the cluster.

Also snuck in a small change to display host:port details with
'kudu table list <master_addresses> --list_tablets' command.

Change-Id: Ia046a28a2008f4f5d1e955f57752a32a1ddc5ab8
---
M src/kudu/tools/kudu-admin-test.cc
M src/kudu/tools/tool_action_cluster.cc
M src/kudu/tools/tool_action_local_replica.cc
M src/kudu/tools/tool_action_table.cc
M src/kudu/tools/tool_action_tablet.cc
5 files changed, 187 insertions(+), 44 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/33/4533/5
-- 
To view, visit http://gerrit.cloudera.org:8080/4533
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ia046a28a2008f4f5d1e955f57752a32a1ddc5ab8
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [tools] Implement a manual leader step down for a tablet

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Hello David Ribeiro Alves, Adar Dembo, Todd Lipcon, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/4533

to look at the new patch set (#4).

Change subject: [tools] Implement a manual leader_step_down for a tablet
......................................................................

[tools] Implement a manual leader_step_down for a tablet

This change introduces a leader_step_down functionality
under 'kudu tablet'. This tool may be handy to recover from
situations when a single tablet server is overloaded and we want
to kick off a new election to balance the load across the clusters.
Although it is not guaranteed that a different replica will be elected
as the leader, this is an optimistic effort to elect a new tablet
server as the leader for the given tablet in the cluster.

Also snuck in a small change to display host:port details with
'kudu table list <master_addresses> --list_tablets' command.

Change-Id: Ia046a28a2008f4f5d1e955f57752a32a1ddc5ab8
---
M src/kudu/tools/kudu-admin-test.cc
M src/kudu/tools/tool_action_cluster.cc
M src/kudu/tools/tool_action_local_replica.cc
M src/kudu/tools/tool_action_table.cc
M src/kudu/tools/tool_action_tablet.cc
5 files changed, 186 insertions(+), 42 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/33/4533/4
-- 
To view, visit http://gerrit.cloudera.org:8080/4533
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ia046a28a2008f4f5d1e955f57752a32a1ddc5ab8
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [tools] Implement a manual leader step down for a tablet

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Hello David Ribeiro Alves, Adar Dembo, Todd Lipcon, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/4533

to look at the new patch set (#2).

Change subject: [tools] Implement a manual leader_step_down for a tablet
......................................................................

[tools] Implement a manual leader_step_down for a tablet

This change introduces a leader_step_down functionality
under 'kudu tablet'. This tool may be handy to recover from
situations when a single tablet server is overloaded and we want
to kick off a new election to balance the load across the clusters.
Although it is not guaranteed that a different replica will be elected
as the leader, this is an optimistic effort to elect a new tablet
server as the leader for the given tablet in the cluster.

Also snuck in a small change to display host:port details with
'kudu table list <master_addresses> --list_tablets' command.

Change-Id: Ia046a28a2008f4f5d1e955f57752a32a1ddc5ab8
---
M src/kudu/tools/kudu-admin-test.cc
M src/kudu/tools/tool_action_table.cc
M src/kudu/tools/tool_action_tablet.cc
3 files changed, 180 insertions(+), 9 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/33/4533/2
-- 
To view, visit http://gerrit.cloudera.org:8080/4533
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ia046a28a2008f4f5d1e955f57752a32a1ddc5ab8
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [tools] Implement a manual leader step down for a tablet

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change.

Change subject: [tools] Implement a manual leader_step_down for a tablet
......................................................................


Patch Set 13: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/4533
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ia046a28a2008f4f5d1e955f57752a32a1ddc5ab8
Gerrit-PatchSet: 13
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] [tools] Implement a manual leader step down for a tablet

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Dinesh Bhat has posted comments on this change.

Change subject: [tools] Implement a manual leader_step_down for a tablet
......................................................................


Patch Set 3:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/4533/1/src/kudu/tools/kudu-admin-test.cc
File src/kudu/tools/kudu-admin-test.cc:

Line 218:       tablet_id_
> But why not get the leader info from the consensus info too? Why go to two 
One reason I was hesitant to use the same routine for leadership info was because GetConsensus collects the details from all the tservers, and I wasn't sure whether one of them could end up returning a stale info depending on when the RPC hits them. In the code below at L234, probability of that RPC hitting when the leader failover is in transition is very likely. If anything, this was mainly due to not knowing the codebase well and staying on safer side(go to master for leader info).


PS1, Line 230:             "leader_failure_max_missed_heartbeat_periods",
             :             &max_missed_hb_periods_str));
> Now I see what you mean; the number of attempts in AssertEventually is neve
Well we don't want to include the <do leader step down> under AssertEventually, because it's perfectly possible that even after specified timeout, we never elected a leader which is different than current_leader. ASSERT_NE(new_leader, current_leader) could yield us incorrect results. I will still debug what I observed is just conincidence or there is more to it than just meet the eye. However, the ++attempts kinda ensures that eventually new_leader emerges on the other side.


http://gerrit.cloudera.org:8080/#/c/4533/3/src/kudu/tools/kudu-admin-test.cc
File src/kudu/tools/kudu-admin-test.cc:

PS3, Line 221:     // Grab the default settings for heartbeat timeouts.
             :     string hb_timeout_str;
             :     ASSERT_TRUE(google::GetCommandLineOption("raft_heartbeat_interval_ms",
             :                                              &hb_timeout_str));
             :     int32 hb_timeout;
             :     ASSERT_TRUE(safe_strto32(hb_timeout_str, &hb_timeout));
             :     string max_missed_hb_periods_str;
             :     ASSERT_TRUE(
             :         google::GetCommandLineOption(
             :             "leader_failure_max_missed_heartbeat_periods",
             :             &max_missed_hb_periods_str));
             :     double max_missed_hb_periods;
             :     ASSERT_TRUE(safe_strtod(max_missed_hb_periods_str,
             :                             &max_missed_hb_periods));
> You don't need to go through all this trouble (and you certainly don't need
Good point, forgot about DECLARE, done.


-- 
To view, visit http://gerrit.cloudera.org:8080/4533
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ia046a28a2008f4f5d1e955f57752a32a1ddc5ab8
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [tools] Implement a manual leader step down for a tablet

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Hello David Ribeiro Alves, Adar Dembo, Todd Lipcon, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/4533

to look at the new patch set (#9).

Change subject: [tools] Implement a manual leader_step_down for a tablet
......................................................................

[tools] Implement a manual leader_step_down for a tablet

This change introduces a leader_step_down functionality
under 'kudu tablet'. This tool may be handy to recover from
situations when a single tablet server is overloaded and we want
to kick off a new election to balance the load across the clusters.
Although it is not guaranteed that a different replica will be elected
as the leader, this is an optimistic effort to elect a new tablet
server as the leader for the given tablet in the cluster.

Test: Ran 8000 iterations of leader_step_down test on dist_test.

Also snuck in a small change to display host:port details with
'kudu table list <master_addresses> --list_tablets' command.

Change-Id: Ia046a28a2008f4f5d1e955f57752a32a1ddc5ab8
---
M src/kudu/tools/kudu-admin-test.cc
M src/kudu/tools/tool_action_cluster.cc
M src/kudu/tools/tool_action_common.cc
M src/kudu/tools/tool_action_common.h
M src/kudu/tools/tool_action_local_replica.cc
M src/kudu/tools/tool_action_remote_replica.cc
M src/kudu/tools/tool_action_table.cc
M src/kudu/tools/tool_action_tablet.cc
8 files changed, 186 insertions(+), 62 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/33/4533/9
-- 
To view, visit http://gerrit.cloudera.org:8080/4533
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ia046a28a2008f4f5d1e955f57752a32a1ddc5ab8
Gerrit-PatchSet: 9
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [tools] Implement a manual leader step down for a tablet

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Dinesh Bhat has posted comments on this change.

Change subject: [tools] Implement a manual leader_step_down for a tablet
......................................................................


Patch Set 1:

(13 comments)

http://gerrit.cloudera.org:8080/#/c/4533/1//COMMIT_MSG
Commit Message:

PS1, Line 12:   
> Nit: got an extra space here.
Done


PS1, Line 18: tablet
> You mean 'table' here, right?
yeah, thanks for catching.


http://gerrit.cloudera.org:8080/#/c/4533/1/src/kudu/tools/kudu-admin-test.cc
File src/kudu/tools/kudu-admin-test.cc:

PS1, Line 178:     Status s = itest::GetConsensusState(ts,
             :                                         tablet_id,
             :                                         consensus::CONSENSUS_CONFIG_COMMITTED,
             :                                         MonoDelta::FromSeconds(10), &cstate);
             :     if (!s.ok()) {
             :       return s;
             :     }
> Just use RETURN_NOT_OK(...).
Done


Line 198:   int num_retries = 0;
> Why is this defined way up here instead of closer to where it's used?
Done, was thinking of keeping the *retries* variables together initially, but I agree about moving it down just before the use. Also just made it one variable now.


PS1, Line 204: CHECK_EQ
> Use ASSERT_* instead of CHECK_* since all of these are on the main test thr
Sure, could you tell me what's the relationship between this being a main thread and these macros here ? I remember asking this to you earlier in one of earlier rev comments but forgot to follow up in-person. 
I am just curious if there are debug-only versions of these macros ?


Line 218:   CHECK_EQ(master_reported_leader, current_leader);
> The leader can change between the GetInfoFromConsensus calls and L213, righ
In reality yes, but given that we are operating in a deterministic test environment where the server failures are manually triggered by code, I thought this check is safe to have. However, I am not entirely sure what this check is buying us, I just happen to notice that there are more than one way we could find the leader replica, and I thought it would be good to compare them when we know that they don't change on the fly during the test. I can remove this if you think this might be a room for any flakiness.


PS1, Line 230:     // Wait for re-election to happen again.
             :     // 3 seconds picked to accommodate consensus heartbeat timeout(1.5s)
> Use AssertEventually() to make this more robust, and to minimize the amount
I tried this with an unsuccessful result, later realized AssertEventually() kinda doesn't seem to fit this scenario here; Reason being: a) We really want to check the consensus state after the heartbeat timeout as opposed to executing f() and then sleeping;  AssertEventually() aggressively(every ms) keeps executing the f() until the timeout hits. b) I won't be able to use assert from inside the lambda argument when we know that Consensus state is still under flux.
Side note: I found one minor bug in AssertEventually() here which I can fix in another patch:
    int attempts = 0;
    int sleep_ms = (attempts < 10) ? (1 << attempts) : 1000;
    SleepFor(sleep_ms);

Perhaps we can devise another routine which sleeps for N secs and then executes the function and returns the status ?


Line 264:   CHECK_EQ(master_reported_leader, "");
> Let's not check this. It's a common calling convention that if a function f
Done


PS1, Line 271:   CHECK_EQ(current_leader, "");
             :   CHECK_EQ(current_term, 0);
> Likewise here, which means you needn't initialize current_term.
Done


http://gerrit.cloudera.org:8080/#/c/4533/1/src/kudu/tools/tool_action_tablet.cc
File src/kudu/tools/tool_action_tablet.cc:

PS1, Line 194:   Status s = GetTabletLeader(client, tablet_id, &leader_uuid, &leader_hp);
> Consider consolidating from here to ResolveAddresses in a helper function, 
Done, although not sure if we need a toggle given that the routine anyways returns the Status and caller can decide if NotFound is FATAL or not ?


PS1, Line 198: else if (!s.ok()) {
             :     return s;
             :   }
> To pile on with Tidy Bot, just do RETURN_NOT_OK(s) after the not found chec
Done


PS1, Line 209:   ServerStatusPB status;
             :   RETURN_NOT_OK(GetServerStatus(leader_hp.host(), leader_hp.port(), &status));
> What's this for? I don't see status used anywhere.
I kinda misunderstood the combination of RETURN_NOT_OK(GetServerStatus... thinking that this will return failure if leader is not up; My goal here was to check if leader is up & running so that we need not send the RPC if the server is already down. It looks like this doesn't buy us anything since RPC will fail anyways if the leader is down. Hence removed.


PS1, Line 224:  return Status::RemoteError(resp.ShortDebugString());
> Why not return StatusFromPB(resp.error().status()) like ConfigChange?
Sounds good, just didn't know about this routine, thanks.


-- 
To view, visit http://gerrit.cloudera.org:8080/4533
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ia046a28a2008f4f5d1e955f57752a32a1ddc5ab8
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [tools] Implement a manual leader step down for a tablet

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change.

Change subject: [tools] Implement a manual leader_step_down for a tablet
......................................................................


Patch Set 6:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/4533/6/src/kudu/tools/kudu-admin-test.cc
File src/kudu/tools/kudu-admin-test.cc:

PS6, Line 219: MonoDelta::FromMilliseconds(timeout_ms)
Why do you need to override the default timeout value of 30s? timeout_ms is almost assuredly going to be less.


PS6, Line 222: while (current_leader == new_leader && --num_retries > 0);
There's no difference in the outcome of the test if we retry 10 times without a change in leader, or if the leader does change. Either way, the test will pass. Put another way, if there was a bug in leader_step_down that prevented the leader from actually  changing, this test would not catch that.

As such, I don't see the point of the retry loop. Just try the outer loop logic once, looping until the term changes, and then end the test.

The alternative would be to run the outer loop extensively and enforce that the leader really has changed (either by asserting that new_leader!=current_leader, or by asserting that there's no leader at all), but it didn't sound like you were comfortable doing that.


-- 
To view, visit http://gerrit.cloudera.org:8080/4533
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ia046a28a2008f4f5d1e955f57752a32a1ddc5ab8
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [tools] Implement a manual leader step down for a tablet

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Dinesh Bhat has posted comments on this change.

Change subject: [tools] Implement a manual leader_step_down for a tablet
......................................................................


Patch Set 13:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/4533/12/src/kudu/tools/tool_action_tablet.cc
File src/kudu/tools/tool_action_tablet.cc:

PS12, Line 245: to step down
> What if that '(if present)' is dropped?  If it do not provide essential inf
It kinda tells this command is effective only with leader present.  However initial inclination was not to compromise on the actual help description to satisfy a test case. But command returns gracefully with a string if leader not present, so I guess we can drop it.. Updated.


-- 
To view, visit http://gerrit.cloudera.org:8080/4533
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ia046a28a2008f4f5d1e955f57752a32a1ddc5ab8
Gerrit-PatchSet: 13
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [tools] Implement a manual leader step down for a tablet

Posted by "Dan Burkert (Code Review)" <ge...@cloudera.org>.
Dan Burkert has posted comments on this change.

Change subject: [tools] Implement a manual leader_step_down for a tablet
......................................................................


Patch Set 4:

Please rebase this patch before pushing another revision.

-- 
To view, visit http://gerrit.cloudera.org:8080/4533
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ia046a28a2008f4f5d1e955f57752a32a1ddc5ab8
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] [tools] Implement a manual leader step down for a tablet

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Dinesh Bhat has posted comments on this change.

Change subject: [tools] Implement a manual leader_step_down for a tablet
......................................................................


Patch Set 12:

> > Patch Set 11: Patch Set 10 was rebased
 > 
 > Could you also add coverage for the new sub-command into
 > ToolTest.TestModeHelp in kudu-tool-test.cc?

Good idea, added. Please note the "*leader*step*" since I had to workaround a small issue in how we validate the usage output when lines are split by '\n' , because it doesn't go well with all the output. For eg,
$ ./bin/kudu tablet
Usage: ./bin/kudu tablet <command> [<args>]
     change_config     Change a tablet's Raft configuration
  leader_step_down   Force the tablet's leader replica (if present) to step 
                                   down
We really want to split the output such that one sub-command becomes one vector element, instead of splitting them by '\n'. I still don't know how to fix this, but will try to tackle this problem in a different change. For now, we could just add another regex with just "*down", but that doesn't look good, hence doing away with 'down' in the kTabletModeRegexes at the moment.

-- 
To view, visit http://gerrit.cloudera.org:8080/4533
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ia046a28a2008f4f5d1e955f57752a32a1ddc5ab8
Gerrit-PatchSet: 12
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] [tools] Implement a manual leader step down for a tablet

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change.

Change subject: [tools] Implement a manual leader_step_down for a tablet
......................................................................


Patch Set 2:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/4533/1/src/kudu/tools/kudu-admin-test.cc
File src/kudu/tools/kudu-admin-test.cc:

PS1, Line 204: }
> Sure, could you tell me what's the relationship between this being a main t
The CHECK_* macros come from glog, and they all do the same thing regardless of the context in which they're called:
1. Evaluate an expression.
2. If evaluation fails, log something.
3. Dump a stack trace.
4. Call abort() to exit the process.

The DCHECK_* macros (also from glog) are debug-only variants that are otherwise equivalent, but compiled out in non-debug builds.

The ASSERT_* macros come from gtest, so they're specific to how gtest works. They too evaluate an expression, but on failure, they do some gtest stuff to capture the failure, end the run of the current test, and move to the next test. They have some foibles, and one of them is that ASSERT failures in the thread that isn't the main process thread don't seem to get noticed  by gtest. Maybe there's something we need to do for that to happen; if there is, I don't know what it is.

Oh, and EXPECT_* is like ASSERT_* except that the test keeps running. So ASSERT_* will only ever reveal the "first" failure, while EXPECT_* will reveal all of them.

So to summarize:
- Use CHECK_*/DCHECK_* macros in production code. Use DCHECK_* if you expect to get sufficient test coverage of the callsite, CHECK_* otherwise.
- Use ASSERT_*/EXPECT_* macros in test-only code, unless that code is expected to run on a separate test thread, in which case use CHECK_* too. Default to EXPECT_*, unless the operation in question must succeed for the rest of the test to function properly, in which case use ASSERT_*.

Hope this helps.


Line 218:     ASSERT_OK(Subprocess::Call({
> In reality yes, but given that we are operating in a deterministic test env
Unfortunately a test environment with leader failure detection isn't deterministic. The test can run on a machine with arbitrary load (perhaps from other tests), in which case one of the tserver processes can pause for arbitrary periods of time, causing a new leader to be elected. This is one of the leading causes of (low-grade) test flakiness for us.


PS1, Line 230:     CHECK_OK(GetInfoFromConsensus(tservers, tablet_id_,
             :                                   &new_leader, &new_term));
> I tried this with an unsuccessful result, later realized AssertEventually()
Couple things:

1. I don't understand why AssertEventually() doesn't work here. AFAICT the goal of this snippet is "tell the leader to step down, wait for an election to happen, then assert that the new term is higher than the one before". If you wrap L229-L233 in AssertEventually(), doesn't that mean you don't need the 3 second sleep (which could be flaky to begin with, since a cluster running with ASAN/TSAN could take a lot longer to run an election)? I don't see what the issue is; can you help me understand? Look at some of the existing usages of AssertEventually() if it's not clear.
2. I'm not seeing the bug in the code you pasted. What is it?


Line 264:   s = GetInfoFromConsensus(tservers, tablet_id_,
> Done
The actual check (not in the CHECK sense of the word, but the comparison and assertion) is still here, though.


http://gerrit.cloudera.org:8080/#/c/4533/2/src/kudu/tools/kudu-admin-test.cc
File src/kudu/tools/kudu-admin-test.cc:

PS2, Line 251: CHECK_OK
Still got some leftover CHECK_* statements that should be converted to their ASSERT_* equivalents.


http://gerrit.cloudera.org:8080/#/c/4533/1/src/kudu/tools/tool_action_tablet.cc
File src/kudu/tools/tool_action_tablet.cc:

PS1, Line 194: 
> Done, although not sure if we need a toggle given that the routine anyways 
On second thought, the ResolveAddresses output isn't even used, is it? So let's drop it altogether. Looks like address resolution happens within BuildProxy() so that extra call was redundant from day one (in ConfigChange() as well).


PS1, Line 274: 
             : 
> Nit: reformat as "Force the tablet's leader replica (if present) to step do
You missed these comments down here.


-- 
To view, visit http://gerrit.cloudera.org:8080/4533
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ia046a28a2008f4f5d1e955f57752a32a1ddc5ab8
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [tools] Implement a manual leader step down for a tablet

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change.

Change subject: [tools] Implement a manual leader_step_down for a tablet
......................................................................


Patch Set 3:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/4533/1/src/kudu/tools/kudu-admin-test.cc
File src/kudu/tools/kudu-admin-test.cc:

Line 218:       tablet_id_
> sure, removed the comparison check, relying on master to report leader info
But why not get the leader info from the consensus info too? Why go to two different places to get the combined set of information? That just opens the door for discrepancies between the two.


PS1, Line 230:             "leader_failure_max_missed_heartbeat_periods",
             :             &max_missed_hb_periods_str));
> Sorry for not being clear before. Lemme try to explain although I haven't r
Now I see what you mean; the number of attempts in AssertEventually is never actually incremented. That's a pretty serious problem; can you fix that, and rebase this patch on top of the fix? No unit test needed.

As for the correct use of AssertEventually, let's get to the bottom of it as it should work. I can imagine something like this working:

  string current_leader = ...
  int64_t current_term = ...
  AssertEventually([&]() {
    <do leader step down>
    GetInfoFromConsensus(&new_leader, &new_term)
    ASSERT_NE(new_leader, current_leader)
    ASSERT_GT(new_term, current_term)
  });

It's possible that the broken backoff behavior in AssertEventually is responsible, which is why I suggested you fix the bug and rebase this patch on top of it. If not, use gdb or temporary logging to figure out what's going on.


http://gerrit.cloudera.org:8080/#/c/4533/3/src/kudu/tools/kudu-admin-test.cc
File src/kudu/tools/kudu-admin-test.cc:

PS3, Line 221:     // Grab the default settings for heartbeat timeouts.
             :     string hb_timeout_str;
             :     ASSERT_TRUE(google::GetCommandLineOption("raft_heartbeat_interval_ms",
             :                                              &hb_timeout_str));
             :     int32 hb_timeout;
             :     ASSERT_TRUE(safe_strto32(hb_timeout_str, &hb_timeout));
             :     string max_missed_hb_periods_str;
             :     ASSERT_TRUE(
             :         google::GetCommandLineOption(
             :             "leader_failure_max_missed_heartbeat_periods",
             :             &max_missed_hb_periods_str));
             :     double max_missed_hb_periods;
             :     ASSERT_TRUE(safe_strtod(max_missed_hb_periods_str,
             :                             &max_missed_hb_periods));
You don't need to go through all this trouble (and you certainly don't need to do it in a loop; it won't change during the loop). Just access FLAGS_raft_heartbeat_interval_ms and FLAGS_leader_failure_max_missed_heartbeat_periods directly, and  DECLARE_* them at the top of the file.


-- 
To view, visit http://gerrit.cloudera.org:8080/4533
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ia046a28a2008f4f5d1e955f57752a32a1ddc5ab8
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [tools] Implement a manual leader step down for a tablet

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Hello David Ribeiro Alves, Adar Dembo, Todd Lipcon, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/4533

to look at the new patch set (#3).

Change subject: [tools] Implement a manual leader_step_down for a tablet
......................................................................

[tools] Implement a manual leader_step_down for a tablet

This change introduces a leader_step_down functionality
under 'kudu tablet'. This tool may be handy to recover from
situations when a single tablet server is overloaded and we want
to kick off a new election to balance the load across the clusters.
Although it is not guaranteed that a different replica will be elected
as the leader, this is an optimistic effort to elect a new tablet
server as the leader for the given tablet in the cluster.

Also snuck in a small change to display host:port details with
'kudu table list <master_addresses> --list_tablets' command.

Change-Id: Ia046a28a2008f4f5d1e955f57752a32a1ddc5ab8
---
M src/kudu/tools/kudu-admin-test.cc
M src/kudu/tools/tool_action_cluster.cc
M src/kudu/tools/tool_action_local_replica.cc
M src/kudu/tools/tool_action_table.cc
M src/kudu/tools/tool_action_tablet.cc
M src/kudu/util/test_util.cc
6 files changed, 198 insertions(+), 42 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/33/4533/3
-- 
To view, visit http://gerrit.cloudera.org:8080/4533
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ia046a28a2008f4f5d1e955f57752a32a1ddc5ab8
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [tools] Implement a manual leader step down for a tablet

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Dinesh Bhat has posted comments on this change.

Change subject: [tools] Implement a manual leader_step_down for a tablet
......................................................................


Patch Set 3:

(9 comments)

TFTR Adar, pls see responses inline, also pls see why my jenkins are still failing. I see that they are still using ...llvm-3.8.0.src/ as per the logs.

http://gerrit.cloudera.org:8080/#/c/4533/1/src/kudu/tools/kudu-admin-test.cc
File src/kudu/tools/kudu-admin-test.cc:

PS1, Line 204: string c
> The CHECK_* macros come from glog, and they all do the same thing regardles
Thank you for that awesome clarity on when to use them. This has gone in my favorite notes :)


Line 218:       tablet_id_
> Unfortunately a test environment with leader failure detection isn't determ
sure, removed the comparison check, relying on master to report leader info now. I have kept the getconsensus intact to grab the new term info.


PS1, Line 230:             "leader_failure_max_missed_heartbeat_periods",
             :             &max_missed_hb_periods_str));
> Couple things:
Sorry for not being clear before. Lemme try to explain although I haven't root caused the issue myself yet. When I used AssertEventually the GetConsensus comes out around 1.5 secs or so with a new_term. But what is not clear to me yet is, if I sleep for ~1.5 secs to allow the consensus to settle down, I see eventually a new_leader emerging which won't happen if I keep using GetConsensus via AssertEventually(despite 20 retries). Logs are confimring that GetConsensus indeed is detecting the new leader correctly, it's just that new_leader happens to be the current_leader everytime. 

I want to see why I see same leader gets elected again and again unless I allow some more time window before calling GetConsensus in which case I see new_leader coming up after few retries. With the fix of '++attempts' however, I eventually see another leader coming up so thats kinda hiding the problem. I think that ++attempts fix should have been there to keep the sleep time monotonically increasing.


PS1, Line 241:         ASSERT_OK(GetTermFromConsensus(tservers, tablet_id_,
             :                                        &new_ter
> Again, the leader can change whenever, so this doesn't seem safe.
Done


Line 264: 
> The actual check (not in the CHECK sense of the word, but the comparison an
yeah, missed out removing here i think. done.


http://gerrit.cloudera.org:8080/#/c/4533/2/src/kudu/tools/kudu-admin-test.cc
File src/kudu/tools/kudu-admin-test.cc:

PS2, Line 251: AGS_num_
> Still got some leftover CHECK_* statements that should be converted to thei
Done, thanks for explaining the distinctions earlier.


http://gerrit.cloudera.org:8080/#/c/4533/1/src/kudu/tools/tool_action_tablet.cc
File src/kudu/tools/tool_action_tablet.cc:

PS1, Line 274: 
             : 
> You missed these comments down here.
Ugh... sorry, for some reason I totally missed out scrolling down thinking that only change I had introduced was in the routine above :)


PS1, Line 278: 
             : 
> I admit, this one was my bad, but can you consolidate all of these descript
Done


PS1, Line 280: 
> Likewise for this, it's copied extensively.
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/4533
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ia046a28a2008f4f5d1e955f57752a32a1ddc5ab8
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [tools] Implement a manual leader step down for a tablet

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change.

Change subject: [tools] Implement a manual leader_step_down for a tablet
......................................................................


Patch Set 6:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/4533/6/src/kudu/tools/kudu-admin-test.cc
File src/kudu/tools/kudu-admin-test.cc:

PS6, Line 222: while (current_leader == new_leader && --num_retries > 0);
> Bunch of things I narrowed down here this morning:
Note that BuildAndStart() wait for all tablets to reach the state of RUNNING in the master. For that to happen, each tablet has to have elected a leader. So just before the call to leader_step_down we can guarantee that at least one leader election happened, though we can't guarantee that there's a leader _right now_.

I think your pseudocode is a little dangerous because if for some reason there's another hidden bug that causes leader_step_down to fail, we won't see it. What if you whitelisted the "Illegal state: not currently a leader" case and allowed the ::Call() result to be either that, or Status::OK()? Wouldn't that give you the best of both worlds?


-- 
To view, visit http://gerrit.cloudera.org:8080/4533
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ia046a28a2008f4f5d1e955f57752a32a1ddc5ab8
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [tools] Implement a manual leader step down for a tablet

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change.

Change subject: [tools] Implement a manual leader_step_down for a tablet
......................................................................


Patch Set 11:

> Patch Set 11: Patch Set 10 was rebased

Could you also add coverage for the new sub-command into ToolTest.TestModeHelp in kudu-tool-test.cc?

-- 
To view, visit http://gerrit.cloudera.org:8080/4533
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ia046a28a2008f4f5d1e955f57752a32a1ddc5ab8
Gerrit-PatchSet: 11
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] [tools] Implement a manual leader step down for a tablet

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Hello Adar Dembo, Alexey Serbin, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/4533

to look at the new patch set (#10).

Change subject: [tools] Implement a manual leader_step_down for a tablet
......................................................................

[tools] Implement a manual leader_step_down for a tablet

This change introduces a leader_step_down functionality
under 'kudu tablet'. This tool may be handy to recover from
situations when a single tablet server is overloaded and we want
to kick off a new election to balance the load across the clusters.
Although it is not guaranteed that a different replica will be elected
as the leader, this is an optimistic effort to elect a new tablet
server as the leader for the given tablet in the cluster.

Test: Ran 8000 iterations of leader_step_down test on dist_test.

Also snuck in a small change to display host:port details with
'kudu table list <master_addresses> --list_tablets' command.

Change-Id: Ia046a28a2008f4f5d1e955f57752a32a1ddc5ab8
---
M src/kudu/tools/kudu-admin-test.cc
M src/kudu/tools/tool_action_cluster.cc
M src/kudu/tools/tool_action_common.cc
M src/kudu/tools/tool_action_common.h
M src/kudu/tools/tool_action_local_replica.cc
M src/kudu/tools/tool_action_remote_replica.cc
M src/kudu/tools/tool_action_table.cc
M src/kudu/tools/tool_action_tablet.cc
M src/kudu/tools/tool_action_test.cc
9 files changed, 189 insertions(+), 65 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/33/4533/10
-- 
To view, visit http://gerrit.cloudera.org:8080/4533
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ia046a28a2008f4f5d1e955f57752a32a1ddc5ab8
Gerrit-PatchSet: 10
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [tools] Implement a manual leader step down for a tablet

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change.

Change subject: [tools] Implement a manual leader_step_down for a tablet
......................................................................


Patch Set 9:

(5 comments)

There is one extra instance of "master_addresses" string literal and corresponding description in tool_action_test.cc.  Could you updated that to use the common constant as well?

http://gerrit.cloudera.org:8080/#/c/4533/9/src/kudu/tools/tool_action_common.cc
File src/kudu/tools/tool_action_common.cc:

PS9, Line 102: hostname:port
Nit: do you want to add that the 'port' part could be omitted if the master listens at the default port?


PS9, Line 104: I
Nit: Is capitalization of 'i' intentional?


http://gerrit.cloudera.org:8080/#/c/4533/9/src/kudu/tools/tool_action_common.h
File src/kudu/tools/tool_action_common.h:

PS9, Line 81: extern
Does 'extern' have to be present due to some compilation issues?  I would expect it to compile without 'extern'.

Also, consider moving those up to precede the first function in the kudu::tools namespace.  This is to be more conformant with the declaration order specified by the style guide: https://google.github.io/styleguide/cppguide.html#Declaration_Order


http://gerrit.cloudera.org:8080/#/c/4533/9/src/kudu/tools/tool_action_tablet.cc
File src/kudu/tools/tool_action_tablet.cc:

PS9, Line 175: string tablet_id
nit: const string& tablet_id ?


Line 246:         "to step down")
nit: parameter/line continuation shift is 4 spaces.


-- 
To view, visit http://gerrit.cloudera.org:8080/4533
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ia046a28a2008f4f5d1e955f57752a32a1ddc5ab8
Gerrit-PatchSet: 9
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [tools] Implement a manual leader step down for a tablet

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Dinesh Bhat has posted comments on this change.

Change subject: [tools] Implement a manual leader_step_down for a tablet
......................................................................


Patch Set 9:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/4533/6/src/kudu/tools/kudu-admin-test.cc
File src/kudu/tools/kudu-admin-test.cc:

PS6, Line 222: ctor<TServerDetails*> tservers;
> Note that BuildAndStart() wait for all tablets to reach the state of RUNNIN
Hmmm, AFAICT the only way we could whitelist that string is by comparing with stderr output. Let me know if I misunderstood your comment. Done that change.


http://gerrit.cloudera.org:8080/#/c/4533/8/src/kudu/tools/kudu-admin-test.cc
File src/kudu/tools/kudu-admin-test.cc:

Line 227:                                             tablet_id_,
> Nit: since the variable 's' is not used anywhere else down the code, consid
Done


http://gerrit.cloudera.org:8080/#/c/4533/8/src/kudu/tools/tool_action_local_replica.cc
File src/kudu/tools/tool_action_local_replica.cc:

PS8, Line 121: 
> Is capitalization of 'i' intentional?
Yeah, wanted to be consistent with rest of the places.


http://gerrit.cloudera.org:8080/#/c/4533/8/src/kudu/tools/tool_action_table.cc
File src/kudu/tools/tool_action_table.cc:

PS8, Line 53: const char* const kTableNameArg = "table_
> It seems this is used across different set of tools.  Is it possible to def
Done


http://gerrit.cloudera.org:8080/#/c/4533/8/src/kudu/tools/tool_action_tablet.cc
File src/kudu/tools/tool_action_tablet.cc:

PS8, Line 67: 
> I saw this string is defined in different file as well.  Does it make sense
Done


PS8, Line 177: client
> const string&  here and further when using FindOrDie() ?
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/4533
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ia046a28a2008f4f5d1e955f57752a32a1ddc5ab8
Gerrit-PatchSet: 9
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [tools] Implement a manual leader step down for a tablet

Posted by "Dinesh Bhat (Code Review)" <ge...@cloudera.org>.
Hello David Ribeiro Alves, Adar Dembo, Todd Lipcon, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/4533

to look at the new patch set (#7).

Change subject: [tools] Implement a manual leader_step_down for a tablet
......................................................................

[tools] Implement a manual leader_step_down for a tablet

This change introduces a leader_step_down functionality
under 'kudu tablet'. This tool may be handy to recover from
situations when a single tablet server is overloaded and we want
to kick off a new election to balance the load across the clusters.
Although it is not guaranteed that a different replica will be elected
as the leader, this is an optimistic effort to elect a new tablet
server as the leader for the given tablet in the cluster.

Test: Ran 8000 iterations of leader_step_down test on dist_test.

Also snuck in a small change to display host:port details with
'kudu table list <master_addresses> --list_tablets' command.

Change-Id: Ia046a28a2008f4f5d1e955f57752a32a1ddc5ab8
---
M src/kudu/tools/kudu-admin-test.cc
M src/kudu/tools/tool_action_cluster.cc
M src/kudu/tools/tool_action_local_replica.cc
M src/kudu/tools/tool_action_table.cc
M src/kudu/tools/tool_action_tablet.cc
5 files changed, 172 insertions(+), 44 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/33/4533/7
-- 
To view, visit http://gerrit.cloudera.org:8080/4533
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ia046a28a2008f4f5d1e955f57752a32a1ddc5ab8
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [tools] Implement a manual leader step down for a tablet

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change.

Change subject: [tools] Implement a manual leader_step_down for a tablet
......................................................................


Patch Set 1:

(17 comments)

Please also update kudu-tool-test's help tests with the new action.

http://gerrit.cloudera.org:8080/#/c/4533/1//COMMIT_MSG
Commit Message:

PS1, Line 12:   
Nit: got an extra space here.


PS1, Line 18: tablet
You mean 'table' here, right?


http://gerrit.cloudera.org:8080/#/c/4533/1/src/kudu/tools/kudu-admin-test.cc
File src/kudu/tools/kudu-admin-test.cc:

PS1, Line 178:     Status s = itest::GetConsensusState(ts,
             :                                         tablet_id,
             :                                         consensus::CONSENSUS_CONFIG_COMMITTED,
             :                                         MonoDelta::FromSeconds(10), &cstate);
             :     if (!s.ok()) {
             :       return s;
             :     }
Just use RETURN_NOT_OK(...).


Line 198:   int num_retries = 0;
Why is this defined way up here instead of closer to where it's used?


PS1, Line 204: CHECK_EQ
Use ASSERT_* instead of CHECK_* since all of these are on the main test thread.


Line 218:   CHECK_EQ(master_reported_leader, current_leader);
The leader can change between the GetInfoFromConsensus calls and L213, right? If so, isn't this unsafe?


PS1, Line 230:     // Wait for re-election to happen again.
             :     // 3 seconds picked to accommodate consensus heartbeat timeout(1.5s)
Use AssertEventually() to make this more robust, and to minimize the amount of sleeping actually needed.


PS1, Line 241:   CHECK_OK(GetTabletLeaderUUIDFromMaster(tablet_id_, &master_reported_leader));
             :   CHECK_EQ(master_reported_leader, new_leader);
Again, the leader can change whenever, so this doesn't seem safe.


Line 264:   CHECK_EQ(master_reported_leader, "");
Let's not check this. It's a common calling convention that if a function fails, it doesn't set any OUT parameters, and we don't need to verify that ourselves.


PS1, Line 271:   CHECK_EQ(current_leader, "");
             :   CHECK_EQ(current_term, 0);
Likewise here, which means you needn't initialize current_term.


http://gerrit.cloudera.org:8080/#/c/4533/1/src/kudu/tools/tool_action_tablet.cc
File src/kudu/tools/tool_action_tablet.cc:

PS1, Line 194:   Status s = GetTabletLeader(client, tablet_id, &leader_uuid, &leader_hp);
Consider consolidating from here to ResolveAddresses in a helper function, used from ChangeConfig too. It'd need a toggle to specify whether NotFound is fatal or not.


PS1, Line 198: else if (!s.ok()) {
             :     return s;
             :   }
To pile on with Tidy Bot, just do RETURN_NOT_OK(s) after the not found check.


PS1, Line 209:   ServerStatusPB status;
             :   RETURN_NOT_OK(GetServerStatus(leader_hp.host(), leader_hp.port(), &status));
What's this for? I don't see status used anywhere.


PS1, Line 224:  return Status::RemoteError(resp.ShortDebugString());
Why not return StatusFromPB(resp.error().status()) like ConfigChange?


PS1, Line 274: "Request the current leader of the "
             :         "tablet(if present) to step down"
Nit: reformat as "Force the tablet's leader replica (if present) to step down".


PS1, Line 278: "Comma-separated list of Kudu Master addresses where each address is "
             :         "of form 'hostname:port'"
I admit, this one was my bad, but can you consolidate all of these descriptions in a global string constant like for kMasterAddressesArg?


PS1, Line 280: "Tablet Identifier"
Likewise for this, it's copied extensively.


-- 
To view, visit http://gerrit.cloudera.org:8080/4533
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ia046a28a2008f4f5d1e955f57752a32a1ddc5ab8
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [tools] Implement a manual leader step down for a tablet

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change.

Change subject: [tools] Implement a manual leader_step_down for a tablet
......................................................................


Patch Set 12:

(1 comment)

> > > Patch Set 11: Patch Set 10 was rebased
 > >
 > > Could you also add coverage for the new sub-command into
 > > ToolTest.TestModeHelp in kudu-tool-test.cc?
 > 
 > Good idea, added. Please note the "*leader*step*" since I had to
 > workaround a small issue in how we validate the usage output when
 > lines are split by '\n' , because it doesn't go well with all the
 > output. For eg,
 > $ ./bin/kudu tablet
 > Usage: ./bin/kudu tablet <command> [<args>]
 > change_config     Change a tablet's Raft configuration
 > leader_step_down   Force the tablet's leader replica (if present)
 > to step
 > down
 > We really want to split the output such that one sub-command
 > becomes one vector element, instead of splitting them by '\n'. I
 > still don't know how to fix this, but will try to tackle this
 > problem in a different change. For now, we could just add another
 > regex with just "*down", but that doesn't look good, hence doing
 > away with 'down' in the kTabletModeRegexes at the moment.

As an easier workaround, I would consider updating the help string for the new command.  I added corresponding comment regarding that.

Also, consider addressing Tidy Bot's warning about unused 'using' directive.

Otherwise, looks good.

http://gerrit.cloudera.org:8080/#/c/4533/12/src/kudu/tools/tool_action_tablet.cc
File src/kudu/tools/tool_action_tablet.cc:

PS12, Line 245: (if present)
What if that '(if present)' is dropped?  If it do not provide essential information you want to emphasize here, it would help with the test code command help coverage and would look better in 80-line terminal since no wrapping of lines.


-- 
To view, visit http://gerrit.cloudera.org:8080/4533
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ia046a28a2008f4f5d1e955f57752a32a1ddc5ab8
Gerrit-PatchSet: 12
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Dinesh Bhat <di...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes