You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Yifan Zhang (Code Review)" <ge...@cloudera.org> on 2022/01/06 14:36:24 UTC

[kudu-CR] KUDU-2915: add tool to remove a tablet server

Yifan Zhang has uploaded this change for review. ( http://gerrit.cloudera.org:8080/18124


Change subject: KUDU-2915: add tool to remove a tablet server
......................................................................

KUDU-2915: add tool to remove a tablet server

Add a 'kudu tserver remove' tool to unregister a tserver from the
master. This tool will be useful when we want to decommission a tserver
without restart masters.

It could remove the dead tserver from master's in-memory map and
persisted data(if any) by default. We could also force remove a live
tserver and keep the tserver's maintenance state by adding some
optional arguments.

Change-Id: If1f5c2979a8d14428f4bcc8e850c57ce228c793a
---
M src/kudu/master/catalog_manager.cc
M src/kudu/master/master.proto
M src/kudu/master/master_service.cc
M src/kudu/master/master_service.h
M src/kudu/master/ts_manager.cc
M src/kudu/master/ts_manager.h
M src/kudu/tools/kudu-tool-test.cc
M src/kudu/tools/tool_action_common.cc
M src/kudu/tools/tool_action_common.h
M src/kudu/tools/tool_action_master.cc
M src/kudu/tools/tool_action_tserver.cc
11 files changed, 270 insertions(+), 63 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/24/18124/1
-- 
To view, visit http://gerrit.cloudera.org:8080/18124
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: If1f5c2979a8d14428f4bcc8e850c57ce228c793a
Gerrit-Change-Number: 18124
Gerrit-PatchSet: 1
Gerrit-Owner: Yifan Zhang <ch...@163.com>

[kudu-CR] KUDU-2915: add tool to remove a tablet server

Posted by "Yifan Zhang (Code Review)" <ge...@cloudera.org>.
Yifan Zhang has posted comments on this change. ( http://gerrit.cloudera.org:8080/18124 )

Change subject: KUDU-2915: add tool to remove a tablet server
......................................................................


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/18124/1/src/kudu/master/ts_manager.cc
File src/kudu/master/ts_manager.cc:

http://gerrit.cloudera.org:8080/#/c/18124/1/src/kudu/master/ts_manager.cc@330
PS1, Line 330:   servers_by_id_.erase(ts_uuid);
> do you think it would be good to also remove the related entry from ts_stat
Yes, the tserver's state would be removed here: https://gerrit.cloudera.org/c/18124/1/src/kudu/master/master_service.cc#355.


http://gerrit.cloudera.org:8080/#/c/18124/1/src/kudu/tools/tool_action_common.cc
File src/kudu/tools/tool_action_common.cc:

http://gerrit.cloudera.org:8080/#/c/18124/1/src/kudu/tools/tool_action_common.cc@786
PS1, Line 786: >
> shouldn't this be >= ?
This is >1, for checking every master has same master_address.



-- 
To view, visit http://gerrit.cloudera.org:8080/18124
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If1f5c2979a8d14428f4bcc8e850c57ce228c793a
Gerrit-Change-Number: 18124
Gerrit-PatchSet: 1
Gerrit-Owner: Yifan Zhang <ch...@163.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yifan Zhang <ch...@163.com>
Gerrit-Reviewer: Zoltan Chovan <zc...@cloudera.com>
Gerrit-Comment-Date: Mon, 10 Jan 2022 03:43:23 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2915: add tool to unregister a tablet server

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/18124 )

Change subject: KUDU-2915: add tool to unregister a tablet server
......................................................................


Patch Set 8: Verified+1 Code-Review+2

Failed test seems unrelated: master_failover-itest


-- 
To view, visit http://gerrit.cloudera.org:8080/18124
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If1f5c2979a8d14428f4bcc8e850c57ce228c793a
Gerrit-Change-Number: 18124
Gerrit-PatchSet: 8
Gerrit-Owner: Yifan Zhang <ch...@163.com>
Gerrit-Reviewer: Abhishek Chennaka <ac...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Yifan Zhang <ch...@163.com>
Gerrit-Reviewer: Zoltan Chovan <zc...@cloudera.com>
Gerrit-Comment-Date: Fri, 21 Jan 2022 18:26:43 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2915: add tool to remove a tablet server

Posted by "Yifan Zhang (Code Review)" <ge...@cloudera.org>.
Yifan Zhang has posted comments on this change. ( http://gerrit.cloudera.org:8080/18124 )

Change subject: KUDU-2915: add tool to remove a tablet server
......................................................................


Patch Set 2:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/18124/1/src/kudu/tools/kudu-tool-test.cc
File src/kudu/tools/kudu-tool-test.cc:

http://gerrit.cloudera.org:8080/#/c/18124/1/src/kudu/tools/kudu-tool-test.cc@7490
PS1, Line 7490:     RunActionStdoutStderrString(Su
> nit: ts_uuid would be present in the ksck no matter what. Probably check fo
Done


http://gerrit.cloudera.org:8080/#/c/18124/1/src/kudu/tools/kudu-tool-test.cc@7536
PS1, Line 7536: SSERT_STR_CONTAINS(err, ts_uuid);
> Do we want to run the tool without the flag -force_remove_live_tserver and 
I think we should make sure the tserver has been brought down when using this tool, the error is to avoid removing a live tserver by mistake.



-- 
To view, visit http://gerrit.cloudera.org:8080/18124
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If1f5c2979a8d14428f4bcc8e850c57ce228c793a
Gerrit-Change-Number: 18124
Gerrit-PatchSet: 2
Gerrit-Owner: Yifan Zhang <ch...@163.com>
Gerrit-Reviewer: Abhishek Chennaka <ac...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yifan Zhang <ch...@163.com>
Gerrit-Reviewer: Zoltan Chovan <zc...@cloudera.com>
Gerrit-Comment-Date: Tue, 11 Jan 2022 09:51:36 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2915: add tool to remove a tablet server

Posted by "Yifan Zhang (Code Review)" <ge...@cloudera.org>.
Hello Zoltan Chovan, Kudu Jenkins, Abhishek Chennaka, Andrew Wong, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/18124

to look at the new patch set (#4).

Change subject: KUDU-2915: add tool to remove a tablet server
......................................................................

KUDU-2915: add tool to remove a tablet server

Add a 'kudu tserver remove' tool to unregister a tserver from the
master. This tool will be useful when we want to decommission a tserver
without restart masters.

It could remove the dead tserver from master's in-memory map and
persisted data (if any) by default. We could also force remove a live
tserver and keep the tserver's maintenance state by adding some
optional arguments.

Change-Id: If1f5c2979a8d14428f4bcc8e850c57ce228c793a
---
M src/kudu/master/catalog_manager.cc
M src/kudu/master/master.proto
M src/kudu/master/master_service.cc
M src/kudu/master/master_service.h
M src/kudu/master/ts_manager.cc
M src/kudu/master/ts_manager.h
M src/kudu/tools/kudu-tool-test.cc
M src/kudu/tools/tool_action_common.cc
M src/kudu/tools/tool_action_common.h
M src/kudu/tools/tool_action_master.cc
M src/kudu/tools/tool_action_tserver.cc
11 files changed, 332 insertions(+), 63 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/24/18124/4
-- 
To view, visit http://gerrit.cloudera.org:8080/18124
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: If1f5c2979a8d14428f4bcc8e850c57ce228c793a
Gerrit-Change-Number: 18124
Gerrit-PatchSet: 4
Gerrit-Owner: Yifan Zhang <ch...@163.com>
Gerrit-Reviewer: Abhishek Chennaka <ac...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yifan Zhang <ch...@163.com>
Gerrit-Reviewer: Zoltan Chovan <zc...@cloudera.com>

[kudu-CR] KUDU-2915: add tool to remove a tablet server

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/18124 )

Change subject: KUDU-2915: add tool to remove a tablet server
......................................................................


Patch Set 3: Code-Review+1

(7 comments)

http://gerrit.cloudera.org:8080/#/c/18124/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/18124/3//COMMIT_MSG@14
PS3, Line 14: a(i
nit: missing space


http://gerrit.cloudera.org:8080/#/c/18124/3/src/kudu/master/ts_manager.cc
File src/kudu/master/ts_manager.cc:

http://gerrit.cloudera.org:8080/#/c/18124/3/src/kudu/master/ts_manager.cc@321
PS3, Line 321:   if (!LookupTSByUUID(ts_uuid, &ts_desc)) {
I was a bit concerned that there'd be a TOCTOU here, since we take the lock once here, but only erase under a different critical section down below. That said, I don't think there are incorrect interleavings, given the call to 'erase' is idempotent.


http://gerrit.cloudera.org:8080/#/c/18124/3/src/kudu/tools/kudu-tool-test.cc
File src/kudu/tools/kudu-tool-test.cc:

http://gerrit.cloudera.org:8080/#/c/18124/3/src/kudu/tools/kudu-tool-test.cc@7561
PS3, Line 7561:   }
nit: after this, can we also verify that the ksck output still shows the tserver?


http://gerrit.cloudera.org:8080/#/c/18124/3/src/kudu/tools/tool_action_tserver.cc
File src/kudu/tools/tool_action_tserver.cc:

http://gerrit.cloudera.org:8080/#/c/18124/3/src/kudu/tools/tool_action_tserver.cc@61
PS3, Line 61: force remove the tserver even if "
            :     "it was considered alive by the master.
nit: how about "If true, force the removal of the tserver, even if it is considered live by the master"


http://gerrit.cloudera.org:8080/#/c/18124/3/src/kudu/tools/tool_action_tserver.cc@64
PS3, Line 64:     "in-memory map but keep its persisted state(if any). When the same tserver re-register 
nit: "If the same tserver re-registers on the..."


http://gerrit.cloudera.org:8080/#/c/18124/3/src/kudu/tools/tool_action_tserver.cc@64
PS3, Line 64: te(
nit: add a space


http://gerrit.cloudera.org:8080/#/c/18124/3/src/kudu/tools/tool_action_tserver.cc@323
PS3, Line 323: if (resp.has_error()) {
             :       return StatusFromPB(resp.error().status());
             :     }
Have you considered running the RPC on all masters, and then collecting errors at the end, maybe logging a warning for each failure? It might make it easier to reason about the behavior in the event that a master is down



-- 
To view, visit http://gerrit.cloudera.org:8080/18124
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If1f5c2979a8d14428f4bcc8e850c57ce228c793a
Gerrit-Change-Number: 18124
Gerrit-PatchSet: 3
Gerrit-Owner: Yifan Zhang <ch...@163.com>
Gerrit-Reviewer: Abhishek Chennaka <ac...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yifan Zhang <ch...@163.com>
Gerrit-Reviewer: Zoltan Chovan <zc...@cloudera.com>
Gerrit-Comment-Date: Tue, 18 Jan 2022 02:56:29 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2915: add tool to unregister a tablet server

Posted by "Yifan Zhang (Code Review)" <ge...@cloudera.org>.
Yifan Zhang has posted comments on this change. ( http://gerrit.cloudera.org:8080/18124 )

Change subject: KUDU-2915: add tool to unregister a tablet server
......................................................................


Patch Set 6:

(12 comments)

http://gerrit.cloudera.org:8080/#/c/18124/5//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/18124/5//COMMIT_MSG@9
PS5, Line 9: kudu tserver unregi
> Should we call this 'kudu tserver unregister' instead?  From what I can see
Done


http://gerrit.cloudera.org:8080/#/c/18124/5//COMMIT_MSG@11
PS5, Line 11: restart
> nit: restarting
Done


http://gerrit.cloudera.org:8080/#/c/18124/5//COMMIT_MSG@13
PS5, Line 13: removes the 
> nit: removes
Done


http://gerrit.cloudera.org:8080/#/c/18124/5//COMMIT_MSG@14
PS5, Line 14: le to unregister a tserver
> nit: It's also possible to remove ...
Done


http://gerrit.cloudera.org:8080/#/c/18124/5//COMMIT_MSG@15
PS5, Line 15: tserver', and keep
            : tserver's persiste
> nit: instead of referring those as 'some', maybe specify the exact names of
Done


http://gerrit.cloudera.org:8080/#/c/18124/5/src/kudu/master/master.proto
File src/kudu/master/master.proto:

http://gerrit.cloudera.org:8080/#/c/18124/5/src/kudu/master/master.proto@1074
PS5, Line 1074: persisted cata
> nit: persistent catalog
Done


http://gerrit.cloudera.org:8080/#/c/18124/5/src/kudu/master/master.proto@1076
PS5, Line 1076: // The tserver UUID to be
> nit: even if it seems self-obvious, it would be nice to add a comment for t
Done


http://gerrit.cloudera.org:8080/#/c/18124/5/src/kudu/master/master.proto@1080
PS5, Line 1080: // For this to take effect, the receiv
> I'm curious if it makes sense to enable this by default?  Could you add an 
Done


http://gerrit.cloudera.org:8080/#/c/18124/5/src/kudu/master/master_service.cc
File src/kudu/master/master_service.cc:

http://gerrit.cloudera.org:8080/#/c/18124/5/src/kudu/master/master_service.cc@341
PS5, Line 341: can c
> nit: could --> can (for simplicity, keep the same tense in the sentence: ht
Done


http://gerrit.cloudera.org:8080/#/c/18124/5/src/kudu/master/master_service.cc@351
PS5, Line 351:               
> If following the comment above, this should be RespondSuccess(), shouldn't 
This is in the condition where the status is not 'NotFound' so it should be RespondFailure.


http://gerrit.cloudera.org:8080/#/c/18124/5/src/kudu/master/master_service.cc@355
PS5, Line 355:       rpc->RespondFailure(s);
             :       return;
             :     }
             :   }
             : 
             :   Status s = server_->ts_manager()->UnregisterTServer(ts_uuid, force_unregister_live_tserver);
             :   if (PREDICT_FALSE(!s.ok() && !s.IsNotFound())) {
             :     // Ignore the NotFound error to make this RPC retriable and effectively idempotent.
             :     rpc->RespondFailure(s);
             :    
> Shouldn't we first try removing tserver's state record?  Are the system cat
Done


http://gerrit.cloudera.org:8080/#/c/18124/5/src/kudu/master/ts_manager.cc
File src/kudu/master/ts_manager.cc:

http://gerrit.cloudera.org:8080/#/c/18124/5/src/kudu/master/ts_manager.cc@328
PS5, Line 328: erver $0 is not presumed de
> I'm not sure this is exactly relevant in the error message from the server 
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/18124
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If1f5c2979a8d14428f4bcc8e850c57ce228c793a
Gerrit-Change-Number: 18124
Gerrit-PatchSet: 6
Gerrit-Owner: Yifan Zhang <ch...@163.com>
Gerrit-Reviewer: Abhishek Chennaka <ac...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yifan Zhang <ch...@163.com>
Gerrit-Reviewer: Zoltan Chovan <zc...@cloudera.com>
Gerrit-Comment-Date: Wed, 19 Jan 2022 11:05:38 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2915: add tool to remove a tablet server

Posted by "Yifan Zhang (Code Review)" <ge...@cloudera.org>.
Hello Zoltan Chovan, Kudu Jenkins, Abhishek Chennaka, Andrew Wong, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/18124

to look at the new patch set (#3).

Change subject: KUDU-2915: add tool to remove a tablet server
......................................................................

KUDU-2915: add tool to remove a tablet server

Add a 'kudu tserver remove' tool to unregister a tserver from the
master. This tool will be useful when we want to decommission a tserver
without restart masters.

It could remove the dead tserver from master's in-memory map and
persisted data(if any) by default. We could also force remove a live
tserver and keep the tserver's maintenance state by adding some
optional arguments.

Change-Id: If1f5c2979a8d14428f4bcc8e850c57ce228c793a
---
M src/kudu/master/catalog_manager.cc
M src/kudu/master/master.proto
M src/kudu/master/master_service.cc
M src/kudu/master/master_service.h
M src/kudu/master/ts_manager.cc
M src/kudu/master/ts_manager.h
M src/kudu/tools/kudu-tool-test.cc
M src/kudu/tools/tool_action_common.cc
M src/kudu/tools/tool_action_common.h
M src/kudu/tools/tool_action_master.cc
M src/kudu/tools/tool_action_tserver.cc
11 files changed, 314 insertions(+), 63 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/24/18124/3
-- 
To view, visit http://gerrit.cloudera.org:8080/18124
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: If1f5c2979a8d14428f4bcc8e850c57ce228c793a
Gerrit-Change-Number: 18124
Gerrit-PatchSet: 3
Gerrit-Owner: Yifan Zhang <ch...@163.com>
Gerrit-Reviewer: Abhishek Chennaka <ac...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yifan Zhang <ch...@163.com>
Gerrit-Reviewer: Zoltan Chovan <zc...@cloudera.com>

[kudu-CR] KUDU-2915: add tool to unregister a tablet server

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/18124 )

Change subject: KUDU-2915: add tool to unregister a tablet server
......................................................................

KUDU-2915: add tool to unregister a tablet server

Add a 'kudu tserver unregister' tool to unregister a tserver from the
master. This tool will be useful when we want to decommission a tserver
without restarting masters.

This tool unregisters the dead tserver from master's in-memory map and
removes its persisted state from catalog table by default. It's also
possible to unregister a tserver which is not presumed dead by adding
'-force_unregister_live_tserver', or keep tserver's persisted state
by adding '-remove_tserver_state=false'.

Change-Id: If1f5c2979a8d14428f4bcc8e850c57ce228c793a
Reviewed-on: http://gerrit.cloudera.org:8080/18124
Reviewed-by: Alexey Serbin <as...@cloudera.com>
Reviewed-by: Andrew Wong <aw...@cloudera.com>
Tested-by: Kudu Jenkins
---
M src/kudu/master/catalog_manager.cc
M src/kudu/master/master.proto
M src/kudu/master/master_service.cc
M src/kudu/master/master_service.h
M src/kudu/master/ts_manager.cc
M src/kudu/master/ts_manager.h
M src/kudu/tools/kudu-tool-test.cc
M src/kudu/tools/tool_action_common.cc
M src/kudu/tools/tool_action_common.h
M src/kudu/tools/tool_action_master.cc
M src/kudu/tools/tool_action_tserver.cc
11 files changed, 319 insertions(+), 63 deletions(-)

Approvals:
  Alexey Serbin: Looks good to me, approved
  Andrew Wong: Looks good to me, approved
  Kudu Jenkins: Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/18124
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: If1f5c2979a8d14428f4bcc8e850c57ce228c793a
Gerrit-Change-Number: 18124
Gerrit-PatchSet: 11
Gerrit-Owner: Yifan Zhang <ch...@163.com>
Gerrit-Reviewer: Abhishek Chennaka <ac...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Yifan Zhang <ch...@163.com>
Gerrit-Reviewer: Zoltan Chovan <zc...@cloudera.com>

[kudu-CR] KUDU-2915: add tool to unregister a tablet server

Posted by "Yifan Zhang (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Zoltan Chovan, Alexey Serbin, Kudu Jenkins, Abhishek Chennaka, Andrew Wong, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/18124

to look at the new patch set (#10).

Change subject: KUDU-2915: add tool to unregister a tablet server
......................................................................

KUDU-2915: add tool to unregister a tablet server

Add a 'kudu tserver unregister' tool to unregister a tserver from the
master. This tool will be useful when we want to decommission a tserver
without restarting masters.

This tool unregisters the dead tserver from master's in-memory map and
removes its persisted state from catalog table by default. It's also
possible to unregister a tserver which is not presumed dead by adding
'-force_unregister_live_tserver', or keep tserver's persisted state
by adding '-remove_tserver_state=false'.

Change-Id: If1f5c2979a8d14428f4bcc8e850c57ce228c793a
---
M src/kudu/master/catalog_manager.cc
M src/kudu/master/master.proto
M src/kudu/master/master_service.cc
M src/kudu/master/master_service.h
M src/kudu/master/ts_manager.cc
M src/kudu/master/ts_manager.h
M src/kudu/tools/kudu-tool-test.cc
M src/kudu/tools/tool_action_common.cc
M src/kudu/tools/tool_action_common.h
M src/kudu/tools/tool_action_master.cc
M src/kudu/tools/tool_action_tserver.cc
11 files changed, 319 insertions(+), 63 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/24/18124/10
-- 
To view, visit http://gerrit.cloudera.org:8080/18124
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: If1f5c2979a8d14428f4bcc8e850c57ce228c793a
Gerrit-Change-Number: 18124
Gerrit-PatchSet: 10
Gerrit-Owner: Yifan Zhang <ch...@163.com>
Gerrit-Reviewer: Abhishek Chennaka <ac...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Yifan Zhang <ch...@163.com>
Gerrit-Reviewer: Zoltan Chovan <zc...@cloudera.com>

[kudu-CR] KUDU-2915: add tool to unregister a tablet server

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/18124 )

Change subject: KUDU-2915: add tool to unregister a tablet server
......................................................................


Patch Set 6:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/18124/5/src/kudu/master/master_service.cc
File src/kudu/master/master_service.cc:

http://gerrit.cloudera.org:8080/#/c/18124/5/src/kudu/master/master_service.cc@355
PS5, Line 355:       rpc->RespondFailure(s);
             :       return;
             :     }
             :   }
             : 
             :   Status s = server_->ts_manager()->UnregisterTServer(ts_uuid, force_unregister_live_tserver);
             :   if (PREDICT_FALSE(!s.ok() && !s.IsNotFound())) {
             :     // Ignore the NotFound error to make this RPC retriable and effectively idempotent.
             :     rpc->RespondFailure(s);
             :    
> Done
I'm not sure the system catalog and the run-time state need to be consistent here since they refer to different things. The catalog tracks the persistent state, whereas the run-time state tracks registration.

If we fail to remove the persistent state (e.g. because leadership changed), it still seems reasonable to unconditionally unregister the run-time state. If anything, the tserver (if alive) will re-register itself anyway.



-- 
To view, visit http://gerrit.cloudera.org:8080/18124
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If1f5c2979a8d14428f4bcc8e850c57ce228c793a
Gerrit-Change-Number: 18124
Gerrit-PatchSet: 6
Gerrit-Owner: Yifan Zhang <ch...@163.com>
Gerrit-Reviewer: Abhishek Chennaka <ac...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yifan Zhang <ch...@163.com>
Gerrit-Reviewer: Zoltan Chovan <zc...@cloudera.com>
Gerrit-Comment-Date: Wed, 19 Jan 2022 18:44:42 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2915: add tool to remove a tablet server

Posted by "Yifan Zhang (Code Review)" <ge...@cloudera.org>.
Hello Zoltan Chovan, Kudu Jenkins, Abhishek Chennaka, Andrew Wong, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/18124

to look at the new patch set (#5).

Change subject: KUDU-2915: add tool to remove a tablet server
......................................................................

KUDU-2915: add tool to remove a tablet server

Add a 'kudu tserver remove' tool to unregister a tserver from the
master. This tool will be useful when we want to decommission a tserver
without restart masters.

It could remove the dead tserver from master's in-memory map and
persisted data (if any) by default. We could also force remove a live
tserver and keep the tserver's maintenance state by adding some
optional arguments.

Change-Id: If1f5c2979a8d14428f4bcc8e850c57ce228c793a
---
M src/kudu/master/catalog_manager.cc
M src/kudu/master/master.proto
M src/kudu/master/master_service.cc
M src/kudu/master/master_service.h
M src/kudu/master/ts_manager.cc
M src/kudu/master/ts_manager.h
M src/kudu/tools/kudu-tool-test.cc
M src/kudu/tools/tool_action_common.cc
M src/kudu/tools/tool_action_common.h
M src/kudu/tools/tool_action_master.cc
M src/kudu/tools/tool_action_tserver.cc
11 files changed, 332 insertions(+), 63 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/24/18124/5
-- 
To view, visit http://gerrit.cloudera.org:8080/18124
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: If1f5c2979a8d14428f4bcc8e850c57ce228c793a
Gerrit-Change-Number: 18124
Gerrit-PatchSet: 5
Gerrit-Owner: Yifan Zhang <ch...@163.com>
Gerrit-Reviewer: Abhishek Chennaka <ac...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yifan Zhang <ch...@163.com>
Gerrit-Reviewer: Zoltan Chovan <zc...@cloudera.com>

[kudu-CR] KUDU-2915: add tool to remove a tablet server

Posted by "Abhishek Chennaka (Code Review)" <ge...@cloudera.org>.
Abhishek Chennaka has posted comments on this change. ( http://gerrit.cloudera.org:8080/18124 )

Change subject: KUDU-2915: add tool to remove a tablet server
......................................................................


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/18124/1/src/kudu/tools/kudu-tool-test.cc
File src/kudu/tools/kudu-tool-test.cc:

http://gerrit.cloudera.org:8080/#/c/18124/1/src/kudu/tools/kudu-tool-test.cc@7490
PS1, Line 7490: ASSERT_STR_CONTAINS(out, ts_uuid);
nit: ts_uuid would be present in the ksck no matter what. Probably check for a string like 
 ts_uuid | ts_hostname:port | Unavailable
Same below


http://gerrit.cloudera.org:8080/#/c/18124/1/src/kudu/tools/kudu-tool-test.cc@7536
PS1, Line 7536: We could force remove the tserver
Do we want to run the tool without the flag -force_remove_live_tserver and ensure it errors out if the tserver is not marked as dead?



-- 
To view, visit http://gerrit.cloudera.org:8080/18124
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If1f5c2979a8d14428f4bcc8e850c57ce228c793a
Gerrit-Change-Number: 18124
Gerrit-PatchSet: 1
Gerrit-Owner: Yifan Zhang <ch...@163.com>
Gerrit-Reviewer: Abhishek Chennaka <ac...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yifan Zhang <ch...@163.com>
Gerrit-Reviewer: Zoltan Chovan <zc...@cloudera.com>
Gerrit-Comment-Date: Mon, 10 Jan 2022 16:44:37 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2915: add tool to unregister a tablet server

Posted by "Yifan Zhang (Code Review)" <ge...@cloudera.org>.
Hello Zoltan Chovan, Alexey Serbin, Kudu Jenkins, Abhishek Chennaka, Andrew Wong, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/18124

to look at the new patch set (#7).

Change subject: KUDU-2915: add tool to unregister a tablet server
......................................................................

KUDU-2915: add tool to unregister a tablet server

Add a 'kudu tserver unregister' tool to unregister a tserver from the
master. This tool will be useful when we want to decommission a tserver
without restarting masters.

It removes the dead tserver from master's in-memory map and persisted
catalog by default. It's also possible to unregister a tserver which is
not presumed dead by adding '-force_unregister_live_tserver', and keep
tserver's persisted state by adding '-remove_tserver_state=false'.

Change-Id: If1f5c2979a8d14428f4bcc8e850c57ce228c793a
---
M src/kudu/master/catalog_manager.cc
M src/kudu/master/master.proto
M src/kudu/master/master_service.cc
M src/kudu/master/master_service.h
M src/kudu/master/ts_manager.cc
M src/kudu/master/ts_manager.h
M src/kudu/tools/kudu-tool-test.cc
M src/kudu/tools/tool_action_common.cc
M src/kudu/tools/tool_action_common.h
M src/kudu/tools/tool_action_master.cc
M src/kudu/tools/tool_action_tserver.cc
11 files changed, 318 insertions(+), 63 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/24/18124/7
-- 
To view, visit http://gerrit.cloudera.org:8080/18124
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: If1f5c2979a8d14428f4bcc8e850c57ce228c793a
Gerrit-Change-Number: 18124
Gerrit-PatchSet: 7
Gerrit-Owner: Yifan Zhang <ch...@163.com>
Gerrit-Reviewer: Abhishek Chennaka <ac...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yifan Zhang <ch...@163.com>
Gerrit-Reviewer: Zoltan Chovan <zc...@cloudera.com>

[kudu-CR] KUDU-2915: add tool to remove a tablet server

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/18124 )

Change subject: KUDU-2915: add tool to remove a tablet server
......................................................................


Patch Set 5: Code-Review+1

(12 comments)

Thank you for adding this new CLI tool!

http://gerrit.cloudera.org:8080/#/c/18124/5//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/18124/5//COMMIT_MSG@9
PS5, Line 9: kudu tserver remove
Should we call this 'kudu tserver unregister' instead?  From what I can see, the server is simply unregistered from the system catalog, but if it checks in with the master, it will be simply re-registered back.


http://gerrit.cloudera.org:8080/#/c/18124/5//COMMIT_MSG@11
PS5, Line 11: restart
nit: restarting


http://gerrit.cloudera.org:8080/#/c/18124/5//COMMIT_MSG@13
PS5, Line 13: could remove
nit: removes


http://gerrit.cloudera.org:8080/#/c/18124/5//COMMIT_MSG@14
PS5, Line 14: We could also force remove
nit: It's also possible to remove ...


http://gerrit.cloudera.org:8080/#/c/18124/5//COMMIT_MSG@15
PS5, Line 15: adding some
            : optional arguments
nit: instead of referring those as 'some', maybe specify the exact names of the command-line arguments for achieving that?


http://gerrit.cloudera.org:8080/#/c/18124/5/src/kudu/master/master.proto
File src/kudu/master/master.proto:

http://gerrit.cloudera.org:8080/#/c/18124/5/src/kudu/master/master.proto@1074
PS5, Line 1074: persisted data
nit: persistent catalog


http://gerrit.cloudera.org:8080/#/c/18124/5/src/kudu/master/master.proto@1076
PS5, Line 1076: optional string uuid = 1;
nit: even if it seems self-obvious, it would be nice to add a comment for this field as well


http://gerrit.cloudera.org:8080/#/c/18124/5/src/kudu/master/master.proto@1080
PS5, Line 1080: optional bool remove_tserver_state = 2
I'm curious if it makes sense to enable this by default?  Could you add an extra sentence to reason about the default setting for this field?


http://gerrit.cloudera.org:8080/#/c/18124/5/src/kudu/master/master_service.cc
File src/kudu/master/master_service.cc:

http://gerrit.cloudera.org:8080/#/c/18124/5/src/kudu/master/master_service.cc@341
PS5, Line 341: could
nit: could --> can (for simplicity, keep the same tense in the sentence: https://justpublishingadvice.com/maintaining-good-tense-control-in-your-writing/)


http://gerrit.cloudera.org:8080/#/c/18124/5/src/kudu/master/master_service.cc@351
PS5, Line 351: RespondFailure
If following the comment above, this should be RespondSuccess(), shouldn't it?


http://gerrit.cloudera.org:8080/#/c/18124/5/src/kudu/master/master_service.cc@355
PS5, Line 355:   if (remove_tserver_state && l.leader_status().ok()) {
             :     s = server_->ts_manager()->SetTServerState(ts_uuid,
             :                                                TServerStatePB::NONE,
             :                                                ChangeTServerStateRequestPB::ALLOW_MISSING_TSERVER,
             :                                                server_->catalog_manager()->sys_catalog());
             :     if (PREDICT_FALSE(!s.ok())) {
             :       rpc->RespondFailure(s);
             :       return;
             :     }
             :   }
Shouldn't we first try removing tserver's state record?  Are the system catalog and the run-time state consistent if RemoveTServer() passes, but SetTServerState() fails?


http://gerrit.cloudera.org:8080/#/c/18124/5/src/kudu/master/ts_manager.cc
File src/kudu/master/ts_manager.cc:

http://gerrit.cloudera.org:8080/#/c/18124/5/src/kudu/master/ts_manager.cc@328
PS5, Line 328: --force_remove_live_tserver
I'm not sure this is exactly relevant in the error message from the server given that the API can be used from elsewhere, not only from the newly introduced kudu CLI tool.



-- 
To view, visit http://gerrit.cloudera.org:8080/18124
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If1f5c2979a8d14428f4bcc8e850c57ce228c793a
Gerrit-Change-Number: 18124
Gerrit-PatchSet: 5
Gerrit-Owner: Yifan Zhang <ch...@163.com>
Gerrit-Reviewer: Abhishek Chennaka <ac...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yifan Zhang <ch...@163.com>
Gerrit-Reviewer: Zoltan Chovan <zc...@cloudera.com>
Gerrit-Comment-Date: Wed, 19 Jan 2022 04:55:56 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2915: add tool to unregister a tablet server

Posted by "Yifan Zhang (Code Review)" <ge...@cloudera.org>.
Yifan Zhang has posted comments on this change. ( http://gerrit.cloudera.org:8080/18124 )

Change subject: KUDU-2915: add tool to unregister a tablet server
......................................................................


Patch Set 6:

(1 comment)

> Patch Set 6:
> 
> (1 comment)

http://gerrit.cloudera.org:8080/#/c/18124/5/src/kudu/master/master_service.cc
File src/kudu/master/master_service.cc:

http://gerrit.cloudera.org:8080/#/c/18124/5/src/kudu/master/master_service.cc@355
PS5, Line 355:       rpc->RespondFailure(s);
             :       return;
             :     }
             :   }
             : 
             :   Status s = server_->ts_manager()->UnregisterTServer(ts_uuid, force_unregister_live_tserver);
             :   if (PREDICT_FALSE(!s.ok() && !s.IsNotFound())) {
             :     // Ignore the NotFound error to make this RPC retriable and effectively idempotent.
             :     rpc->RespondFailure(s);
             :    
> Another thought to consider: perhaps we should consider separating this int
I think separating this into two RPCs should be a good way.



-- 
To view, visit http://gerrit.cloudera.org:8080/18124
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If1f5c2979a8d14428f4bcc8e850c57ce228c793a
Gerrit-Change-Number: 18124
Gerrit-PatchSet: 6
Gerrit-Owner: Yifan Zhang <ch...@163.com>
Gerrit-Reviewer: Abhishek Chennaka <ac...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yifan Zhang <ch...@163.com>
Gerrit-Reviewer: Zoltan Chovan <zc...@cloudera.com>
Gerrit-Comment-Date: Fri, 21 Jan 2022 03:31:59 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2915: add tool to remove a tablet server

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/18124 )

Change subject: KUDU-2915: add tool to remove a tablet server
......................................................................


Patch Set 2:

(12 comments)

http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/master/master.proto
File src/kudu/master/master.proto:

http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/master/master.proto@1076
PS2, Line 1076:   required string uuid = 1;
nit: we try to err away from using "required" in protobuf, given it's removed in proto3.


http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/master/master.proto@1077
PS2, Line 1077:   optional bool keep_tserver_state = 2;
              :   optional bool force_remove_live_tserver = 3;
nit: add a comment describing these args. Here's a start

// Whether or not to remove any persisted tserver state (e.g.
// MAINTENANCE_MODE). For this to take effect, the receiving master must be the
// leader.

// Whether to return an error in case the tserver is not presumed to be dead,
// per --tserver_unresponsive_timeout_ms.


http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/master/master_service.cc
File src/kudu/master/master_service.cc:

http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/master/master_service.cc@339
PS2, Line 339: keep_tserver_state
nit: maybe rename this to 'remove_tserver_state' so we don't have to keep dealing with the negative below? If so, could you also change the proto field name?


http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/master/master_service.cc@344
PS2, Line 344: if (!keep_tserver_state && !l.CheckIsInitializedOrRespond(resp, rpc)) {
             :     return;
             :   }
             : 
             :   Status s = server_->ts_manager()->RemoveTServer(ts_uuid, force_remove_live_tserver);
             :   if (PREDICT_FALSE(!s.ok())) {
             :     rpc->RespondFailure(s);
             :     return;
             :   }
             : 
             :   if (!keep_tserver_state && l.leader_status().ok()) {
             :     s = server_->ts_manager()->SetTServerState(
nit: may be easier to follow as:

if (!l.CheckIsInitializedOrRespond(...)) {
  return;
}
Status s = ts_manager->RemoveTServer(...);
if (!s.ok()) {
  ...
}
if (!keep_tserver_state) {
  if (l.leader_status().ok()) {
    ts_manager->SetTserverState(...);
  }
}


http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/master/master_service.cc@358
PS2, Line 358: DONT_ALLOW_MISSING_TSERVER
I'm curious what the rationale is behind not allowing a missing tserver. It seems like a nice property to have that calls to this RPC could be retriable and effectively idempotent.


http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/master/ts_manager.cc
File src/kudu/master/ts_manager.cc:

http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/master/ts_manager.cc@326
PS2, Line 326: Tablet Server
nit: let's try keeping usage of "tserver" vs "Tablet Server" vs "tablet server" homogenous at least without the same areas of code. Maybe just 'tserver $0 ..." here?


http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/tools/kudu-tool-test.cc
File src/kudu/tools/kudu-tool-test.cc:

http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/tools/kudu-tool-test.cc@7472
PS2, Line 7472:   opts.num_masters = 3;
              :   opts.extra_master_flags.push_back(
              :       Substitute("--tserver_unresponsive_timeout_ms=$0", kTserverUnresponsiveMs));
The --heartbeat_interval_ms default value is also 1s. Let's set it a bit higher on the tservers so we ensure the tserver becomes unresponsive


http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/tools/kudu-tool-test.cc@7480
PS2, Line 7480:   // Enter maintenance mode on the tserver, shutdown it and wait it become dead.
nit: "shut it down"


http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/tools/kudu-tool-test.cc@7483
PS2, Line 7483: SleepFor(MonoDelta::FromMilliseconds(kTserverUnresponsiveMs));
              : 
              :   {
              :     string out;
              :     string err;
              :     // Getting an error when running ksck and the output contains the dead tserver.
              :     Status s =
              :         RunActionStdoutStderrString(Substitute("cluster ksck $0", master_addrs_str), &out, &err);
              :     ASSERT_TRUE(s.IsRuntimeError());
              :     ASSERT_STR_CONTAINS(
              :         out, Substitute("$0 | $1 | UNAVAILABLE", ts_uuid, ts->bound_rpc_hostport().ToString()));
              :   }
nit: rather than sleeping and expecting the exact sleep to work, could we instead remove the sleep and wrap this in an ASSERT_EVENTUALLY()? Seems that would be less prone to flakiness


http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/tools/kudu-tool-test.cc@7509
PS2, Line 7509:     ASSERT_STR_CONTAINS(out,
              :                         Substitute("Tablet Server States\n"
              :                                    "              Server              |      State\n"
              :                                    "----------------------------------+------------------\n"
              :                                    " $0 | MAINTENANCE_MODE",
              :                                    ts_uuid));
Maybe after this, could we restart the tserver with a very short heartbeat interval such that it is guaranteed to be alive, and then run 'kudu tserver remove -force_remove_live_tserver -keep_tserver_state=false'? Just so we have fuller test coverage. I would expect that after that, we'd see no MAINTENANCE_MODE message, but we should be able to see the ts_uuid in the ksck output


http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/tools/kudu-tool-test.cc@7508
PS2, Line 7508:  if (keep_tserver_state) {
              :     ASSERT_STR_CONTAINS(out,
              :                         Substitute("Tablet Server States\n"
              :                                    "              Server              |      State\n"
              :                                    "----------------------------------+------------------\n"
              :                                    " $0 | MAINTENANCE_MODE",
              :                                    ts_uuid));
              :   }
In the else case, can we ASSERT_STR_NOT_CONTAINS("MAINTENANCE_MODE")? We should also be able to assert the ts_uuid isn't in ksck at all, right?


http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/tools/kudu-tool-test.cc@7541
PS2, Line 7541:       RunKuduTool({"tserver", "remove", master_addrs_str, ts_uuid, "-force_remove_live_tserver"}));
Could we also set a high tserver heartbeat interval, and then check afterwards that the tserver doesn't show up in ksck?



-- 
To view, visit http://gerrit.cloudera.org:8080/18124
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If1f5c2979a8d14428f4bcc8e850c57ce228c793a
Gerrit-Change-Number: 18124
Gerrit-PatchSet: 2
Gerrit-Owner: Yifan Zhang <ch...@163.com>
Gerrit-Reviewer: Abhishek Chennaka <ac...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yifan Zhang <ch...@163.com>
Gerrit-Reviewer: Zoltan Chovan <zc...@cloudera.com>
Gerrit-Comment-Date: Fri, 14 Jan 2022 00:35:40 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2915: add tool to unregister a tablet server

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/18124 )

Change subject: KUDU-2915: add tool to unregister a tablet server
......................................................................


Patch Set 10: Code-Review+2

Thank you for creating this new useful tool!


-- 
To view, visit http://gerrit.cloudera.org:8080/18124
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If1f5c2979a8d14428f4bcc8e850c57ce228c793a
Gerrit-Change-Number: 18124
Gerrit-PatchSet: 10
Gerrit-Owner: Yifan Zhang <ch...@163.com>
Gerrit-Reviewer: Abhishek Chennaka <ac...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Yifan Zhang <ch...@163.com>
Gerrit-Reviewer: Zoltan Chovan <zc...@cloudera.com>
Gerrit-Comment-Date: Sat, 22 Jan 2022 01:20:23 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2915: add tool to unregister a tablet server

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/18124 )

Change subject: KUDU-2915: add tool to unregister a tablet server
......................................................................


Patch Set 10: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/18124
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If1f5c2979a8d14428f4bcc8e850c57ce228c793a
Gerrit-Change-Number: 18124
Gerrit-PatchSet: 10
Gerrit-Owner: Yifan Zhang <ch...@163.com>
Gerrit-Reviewer: Abhishek Chennaka <ac...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Yifan Zhang <ch...@163.com>
Gerrit-Reviewer: Zoltan Chovan <zc...@cloudera.com>
Gerrit-Comment-Date: Sat, 22 Jan 2022 01:26:55 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2915: add tool to unregister a tablet server

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has removed a vote on this change.

Change subject: KUDU-2915: add tool to unregister a tablet server
......................................................................


Removed Verified-1 by Kudu Jenkins (120)
-- 
To view, visit http://gerrit.cloudera.org:8080/18124
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: deleteVote
Gerrit-Change-Id: If1f5c2979a8d14428f4bcc8e850c57ce228c793a
Gerrit-Change-Number: 18124
Gerrit-PatchSet: 8
Gerrit-Owner: Yifan Zhang <ch...@163.com>
Gerrit-Reviewer: Abhishek Chennaka <ac...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Yifan Zhang <ch...@163.com>
Gerrit-Reviewer: Zoltan Chovan <zc...@cloudera.com>

[kudu-CR] KUDU-2915: add tool to unregister a tablet server

Posted by "Yifan Zhang (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Zoltan Chovan, Alexey Serbin, Kudu Jenkins, Abhishek Chennaka, Andrew Wong, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/18124

to look at the new patch set (#9).

Change subject: KUDU-2915: add tool to unregister a tablet server
......................................................................

KUDU-2915: add tool to unregister a tablet server

Add a 'kudu tserver unregister' tool to unregister a tserver from the
master. This tool will be useful when we want to decommission a tserver
without restarting masters.

This tool unregister the dead tserver from master's in-memory map and
removes its persisted state from catalog table by default. It's also
possible to unregister a tserver which is not presumed dead by adding
'-force_unregister_live_tserver', or keep tserver's persisted state
by adding '-remove_tserver_state=false'.

Change-Id: If1f5c2979a8d14428f4bcc8e850c57ce228c793a
---
M src/kudu/master/catalog_manager.cc
M src/kudu/master/master.proto
M src/kudu/master/master_service.cc
M src/kudu/master/master_service.h
M src/kudu/master/ts_manager.cc
M src/kudu/master/ts_manager.h
M src/kudu/tools/kudu-tool-test.cc
M src/kudu/tools/tool_action_common.cc
M src/kudu/tools/tool_action_common.h
M src/kudu/tools/tool_action_master.cc
M src/kudu/tools/tool_action_tserver.cc
11 files changed, 319 insertions(+), 63 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/24/18124/9
-- 
To view, visit http://gerrit.cloudera.org:8080/18124
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: If1f5c2979a8d14428f4bcc8e850c57ce228c793a
Gerrit-Change-Number: 18124
Gerrit-PatchSet: 9
Gerrit-Owner: Yifan Zhang <ch...@163.com>
Gerrit-Reviewer: Abhishek Chennaka <ac...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Yifan Zhang <ch...@163.com>
Gerrit-Reviewer: Zoltan Chovan <zc...@cloudera.com>

[kudu-CR] KUDU-2915: add tool to unregister a tablet server

Posted by "Yifan Zhang (Code Review)" <ge...@cloudera.org>.
Yifan Zhang has posted comments on this change. ( http://gerrit.cloudera.org:8080/18124 )

Change subject: KUDU-2915: add tool to unregister a tablet server
......................................................................


Patch Set 10:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/18124/8/src/kudu/tools/tool_action_tserver.cc
File src/kudu/tools/tool_action_tserver.cc:

http://gerrit.cloudera.org:8080/#/c/18124/8/src/kudu/tools/tool_action_tserver.cc@332
PS8, Line 332: " Unable to unregister the tserver from master $0, status: $1
> nit: maybe add spacing between consecutive messages, so the result error me
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/18124
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If1f5c2979a8d14428f4bcc8e850c57ce228c793a
Gerrit-Change-Number: 18124
Gerrit-PatchSet: 10
Gerrit-Owner: Yifan Zhang <ch...@163.com>
Gerrit-Reviewer: Abhishek Chennaka <ac...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Yifan Zhang <ch...@163.com>
Gerrit-Reviewer: Zoltan Chovan <zc...@cloudera.com>
Gerrit-Comment-Date: Sat, 22 Jan 2022 01:18:51 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2915: add tool to unregister a tablet server

Posted by "Yifan Zhang (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Zoltan Chovan, Alexey Serbin, Kudu Jenkins, Abhishek Chennaka, Andrew Wong, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/18124

to look at the new patch set (#8).

Change subject: KUDU-2915: add tool to unregister a tablet server
......................................................................

KUDU-2915: add tool to unregister a tablet server

Add a 'kudu tserver unregister' tool to unregister a tserver from the
master. This tool will be useful when we want to decommission a tserver
without restarting masters.

It removes the dead tserver from master's in-memory map and persisted
catalog by default. It's also possible to unregister a tserver which is
not presumed dead by adding '-force_unregister_live_tserver', and keep
tserver's persisted state by adding '-remove_tserver_state=false'.

Change-Id: If1f5c2979a8d14428f4bcc8e850c57ce228c793a
---
M src/kudu/master/catalog_manager.cc
M src/kudu/master/master.proto
M src/kudu/master/master_service.cc
M src/kudu/master/master_service.h
M src/kudu/master/ts_manager.cc
M src/kudu/master/ts_manager.h
M src/kudu/tools/kudu-tool-test.cc
M src/kudu/tools/tool_action_common.cc
M src/kudu/tools/tool_action_common.h
M src/kudu/tools/tool_action_master.cc
M src/kudu/tools/tool_action_tserver.cc
11 files changed, 319 insertions(+), 63 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/24/18124/8
-- 
To view, visit http://gerrit.cloudera.org:8080/18124
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: If1f5c2979a8d14428f4bcc8e850c57ce228c793a
Gerrit-Change-Number: 18124
Gerrit-PatchSet: 8
Gerrit-Owner: Yifan Zhang <ch...@163.com>
Gerrit-Reviewer: Abhishek Chennaka <ac...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Yifan Zhang <ch...@163.com>
Gerrit-Reviewer: Zoltan Chovan <zc...@cloudera.com>

[kudu-CR] KUDU-2915: add tool to remove a tablet server

Posted by "Yifan Zhang (Code Review)" <ge...@cloudera.org>.
Hello Zoltan Chovan, Kudu Jenkins, Abhishek Chennaka, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/18124

to look at the new patch set (#2).

Change subject: KUDU-2915: add tool to remove a tablet server
......................................................................

KUDU-2915: add tool to remove a tablet server

Add a 'kudu tserver remove' tool to unregister a tserver from the
master. This tool will be useful when we want to decommission a tserver
without restart masters.

It could remove the dead tserver from master's in-memory map and
persisted data(if any) by default. We could also force remove a live
tserver and keep the tserver's maintenance state by adding some
optional arguments.

Change-Id: If1f5c2979a8d14428f4bcc8e850c57ce228c793a
---
M src/kudu/master/catalog_manager.cc
M src/kudu/master/master.proto
M src/kudu/master/master_service.cc
M src/kudu/master/master_service.h
M src/kudu/master/ts_manager.cc
M src/kudu/master/ts_manager.h
M src/kudu/tools/kudu-tool-test.cc
M src/kudu/tools/tool_action_common.cc
M src/kudu/tools/tool_action_common.h
M src/kudu/tools/tool_action_master.cc
M src/kudu/tools/tool_action_tserver.cc
11 files changed, 271 insertions(+), 63 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/24/18124/2
-- 
To view, visit http://gerrit.cloudera.org:8080/18124
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: If1f5c2979a8d14428f4bcc8e850c57ce228c793a
Gerrit-Change-Number: 18124
Gerrit-PatchSet: 2
Gerrit-Owner: Yifan Zhang <ch...@163.com>
Gerrit-Reviewer: Abhishek Chennaka <ac...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yifan Zhang <ch...@163.com>
Gerrit-Reviewer: Zoltan Chovan <zc...@cloudera.com>

[kudu-CR] KUDU-2915: add tool to remove a tablet server

Posted by "Yifan Zhang (Code Review)" <ge...@cloudera.org>.
Yifan Zhang has posted comments on this change. ( http://gerrit.cloudera.org:8080/18124 )

Change subject: KUDU-2915: add tool to remove a tablet server
......................................................................


Patch Set 3:

(12 comments)

http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/master/master.proto
File src/kudu/master/master.proto:

http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/master/master.proto@1076
PS2, Line 1076:   optional string uuid = 1;
> nit: we try to err away from using "required" in protobuf, given it's remov
Done


http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/master/master.proto@1077
PS2, Line 1077: 
              :   // Whether or not to remove any persisted ts
> nit: add a comment describing these args. Here's a start
Done


http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/master/master_service.cc
File src/kudu/master/master_service.cc:

http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/master/master_service.cc@339
PS2, Line 339: remove_tserver_sta
> nit: maybe rename this to 'remove_tserver_state' so we don't have to keep d
Done


http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/master/master_service.cc@344
PS2, Line 344: if (!l.CheckIsInitializedOrRespond(resp, rpc)) {
             :     return;
             :   }
             : 
             :   Status s = server_->ts_manager()->RemoveTServer(ts_uuid, force_remove_live_tserver);
             :   if (PREDICT_FALSE(!s.ok() && !s.IsNotFound())) {
             :     // Ignore the NotFound error to make this RPC retriable and effectively idempotent.
             :     rpc->RespondFailure(s);
             :     return;
             :   }
             : 
             :   if (remove_tserver_state && l.leader_status()
> nit: may be easier to follow as:
Done


http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/master/master_service.cc@358
PS2, Line 358:           ChangeTServerSta
> I'm curious what the rationale is behind not allowing a missing tserver. It
The original idea is we should not operate an unknown tserver, now I think here we must allow missing tserver because it has already been unregistered. But it seems this flag doesn't affect anything when set NONE for a tserver.


http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/master/ts_manager.cc
File src/kudu/master/ts_manager.cc:

http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/master/ts_manager.cc@326
PS2, Line 326: TServer $0 is
> nit: let's try keeping usage of "tserver" vs "Tablet Server" vs "tablet ser
Done


http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/tools/kudu-tool-test.cc
File src/kudu/tools/kudu-tool-test.cc:

http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/tools/kudu-tool-test.cc@7472
PS2, Line 7472: 
              : INSTANTIATE_TEST_SUITE_P(, RemoveTServerTest, ::testing::Bool());
              : 
> The --heartbeat_interval_ms default value is also 1s. Let's set it a bit hi
Done


http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/tools/kudu-tool-test.cc@7480
PS2, Line 7480:   NO_FATALS(StartCluster());
> nit: "shut it down"
Done


http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/tools/kudu-tool-test.cc@7483
PS2, Line 7483: const string ts_uuid = ts->uuid();
              :   const string ts_hostport = ts->bound_rpc_addr().ToString();
              : 
              :   // Enter maintenance mode on the tserver and shut it down.
              :   ASSERT_OK(RunKuduTool({"tserver", "state", "enter_maintenance", master_addrs_str, ts_uuid}));
              :   ts->Shutdown();
              : 
              :   {
              :     string out;
              :     string err;
              :     // Getting an error when running ksck and the output contains the dead tserver.
              :    
> nit: rather than sleeping and expecting the exact sleep to work, could we i
Done


http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/tools/kudu-tool-test.cc@7509
PS2, Line 7509:   {
              :     // Run ksck and get no error.
              :     string out;
              :     NO_FATALS(RunActionStdoutString(Substitute("cluster ksck $0", master_addrs_str), &out));
              :     if (remove_tserver_state) {
              :       // Both the persisted state and registr
> Maybe after this, could we restart the tserver with a very short heartbeat 
Done


http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/tools/kudu-tool-test.cc@7508
PS2, Line 7508:                         Substitute("-remove_tserver_state=$0", remove_tserver_state)}));
              :   {
              :     // Run ksck and get no error.
              :     string out;
              :     NO_FATALS(RunActionStdoutString(Substitute("cluster ksck $0", master_addrs_str), &out));
              :     if (remove_tserver_state) {
              :       // Both the persisted state and registration of the tserver was removed.
              :    
> In the else case, can we ASSERT_STR_NOT_CONTAINS("MAINTENANCE_MODE")? We sh
Done


http://gerrit.cloudera.org:8080/#/c/18124/2/src/kudu/tools/kudu-tool-test.cc@7541
PS2, Line 7541: }
> Could we also set a high tserver heartbeat interval, and then check afterwa
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/18124
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If1f5c2979a8d14428f4bcc8e850c57ce228c793a
Gerrit-Change-Number: 18124
Gerrit-PatchSet: 3
Gerrit-Owner: Yifan Zhang <ch...@163.com>
Gerrit-Reviewer: Abhishek Chennaka <ac...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yifan Zhang <ch...@163.com>
Gerrit-Reviewer: Zoltan Chovan <zc...@cloudera.com>
Gerrit-Comment-Date: Fri, 14 Jan 2022 11:38:44 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2915: add tool to unregister a tablet server

Posted by "Yifan Zhang (Code Review)" <ge...@cloudera.org>.
Hello Zoltan Chovan, Alexey Serbin, Kudu Jenkins, Abhishek Chennaka, Andrew Wong, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/18124

to look at the new patch set (#6).

Change subject: KUDU-2915: add tool to unregister a tablet server
......................................................................

KUDU-2915: add tool to unregister a tablet server

Add a 'kudu tserver unregister' tool to unregister a tserver from the
master. This tool will be useful when we want to decommission a tserver
without restarting masters.

It removes the dead tserver from master's in-memory map and persisted
catalog by default. It's also possible to unregister a tserver which is
not presumed dead by adding '-force_unregister_live_tserver', and keep
tserver's persisted state by adding '-remove_tserver_state=false'.

Change-Id: If1f5c2979a8d14428f4bcc8e850c57ce228c793a
---
M src/kudu/master/catalog_manager.cc
M src/kudu/master/master.proto
M src/kudu/master/master_service.cc
M src/kudu/master/master_service.h
M src/kudu/master/ts_manager.cc
M src/kudu/master/ts_manager.h
M src/kudu/tools/kudu-tool-test.cc
M src/kudu/tools/tool_action_common.cc
M src/kudu/tools/tool_action_common.h
M src/kudu/tools/tool_action_master.cc
M src/kudu/tools/tool_action_tserver.cc
11 files changed, 339 insertions(+), 63 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/24/18124/6
-- 
To view, visit http://gerrit.cloudera.org:8080/18124
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: If1f5c2979a8d14428f4bcc8e850c57ce228c793a
Gerrit-Change-Number: 18124
Gerrit-PatchSet: 6
Gerrit-Owner: Yifan Zhang <ch...@163.com>
Gerrit-Reviewer: Abhishek Chennaka <ac...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yifan Zhang <ch...@163.com>
Gerrit-Reviewer: Zoltan Chovan <zc...@cloudera.com>

[kudu-CR] KUDU-2915: add tool to remove a tablet server

Posted by "Zoltan Chovan (Code Review)" <ge...@cloudera.org>.
Zoltan Chovan has posted comments on this change. ( http://gerrit.cloudera.org:8080/18124 )

Change subject: KUDU-2915: add tool to remove a tablet server
......................................................................


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/18124/1/src/kudu/master/ts_manager.cc
File src/kudu/master/ts_manager.cc:

http://gerrit.cloudera.org:8080/#/c/18124/1/src/kudu/master/ts_manager.cc@330
PS1, Line 330:   servers_by_id_.erase(ts_uuid);
do you think it would be good to also remove the related entry from ts_state_by_uuid_ as well? so there are no orphaned entries in there


http://gerrit.cloudera.org:8080/#/c/18124/1/src/kudu/tools/tool_action_common.cc
File src/kudu/tools/tool_action_common.cc:

http://gerrit.cloudera.org:8080/#/c/18124/1/src/kudu/tools/tool_action_common.cc@786
PS1, Line 786: >
shouldn't this be >= ?



-- 
To view, visit http://gerrit.cloudera.org:8080/18124
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If1f5c2979a8d14428f4bcc8e850c57ce228c793a
Gerrit-Change-Number: 18124
Gerrit-PatchSet: 1
Gerrit-Owner: Yifan Zhang <ch...@163.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Zoltan Chovan <zc...@cloudera.com>
Gerrit-Comment-Date: Fri, 07 Jan 2022 17:45:25 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2915: add tool to remove a tablet server

Posted by "Yifan Zhang (Code Review)" <ge...@cloudera.org>.
Yifan Zhang has posted comments on this change. ( http://gerrit.cloudera.org:8080/18124 )

Change subject: KUDU-2915: add tool to remove a tablet server
......................................................................


Patch Set 5:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/18124/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/18124/3//COMMIT_MSG@14
PS3, Line 14: a (
> nit: missing space
Done


http://gerrit.cloudera.org:8080/#/c/18124/3/src/kudu/master/ts_manager.cc
File src/kudu/master/ts_manager.cc:

http://gerrit.cloudera.org:8080/#/c/18124/3/src/kudu/master/ts_manager.cc@321
PS3, Line 321:   shared_ptr<TSDescriptor> ts_desc;
> I was a bit concerned that there'd be a TOCTOU here, since we take the lock
Done


http://gerrit.cloudera.org:8080/#/c/18124/3/src/kudu/tools/kudu-tool-test.cc
File src/kudu/tools/kudu-tool-test.cc:

http://gerrit.cloudera.org:8080/#/c/18124/3/src/kudu/tools/kudu-tool-test.cc@7561
PS3, Line 7561:    
> nit: after this, can we also verify that the ksck output still shows the ts
Done


http://gerrit.cloudera.org:8080/#/c/18124/3/src/kudu/tools/tool_action_tserver.cc
File src/kudu/tools/tool_action_tserver.cc:

http://gerrit.cloudera.org:8080/#/c/18124/3/src/kudu/tools/tool_action_tserver.cc@61
PS3, Line 61: force the removal of the tserver "
            :             "even if it is considered live 
> nit: how about "If true, force the removal of the tserver, even if it is co
Done


http://gerrit.cloudera.org:8080/#/c/18124/3/src/kudu/tools/tool_action_tserver.cc@64
PS3, Line 64:             "in-memory map but keep its persisted state (if any). If the same tserver re-re
> nit: "If the same tserver re-registers on the..."
Done


http://gerrit.cloudera.org:8080/#/c/18124/3/src/kudu/tools/tool_action_tserver.cc@64
PS3, Line 64: ste
> nit: add a space
Done


http://gerrit.cloudera.org:8080/#/c/18124/3/src/kudu/tools/tool_action_tserver.cc@323
PS3, Line 323: RETURN_NOT_OK(proxy->RemoveTabletServer(req, &resp, &rpc));
             :     if (resp.has_error()) {
             :      
> Have you considered running the RPC on all masters, and then collecting err
Done. If a master is down, we get errors in previous `VerifyMasterAddressList`, but logging error messages is useful.



-- 
To view, visit http://gerrit.cloudera.org:8080/18124
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If1f5c2979a8d14428f4bcc8e850c57ce228c793a
Gerrit-Change-Number: 18124
Gerrit-PatchSet: 5
Gerrit-Owner: Yifan Zhang <ch...@163.com>
Gerrit-Reviewer: Abhishek Chennaka <ac...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yifan Zhang <ch...@163.com>
Gerrit-Reviewer: Zoltan Chovan <zc...@cloudera.com>
Gerrit-Comment-Date: Tue, 18 Jan 2022 14:03:55 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2915: add tool to unregister a tablet server

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/18124 )

Change subject: KUDU-2915: add tool to unregister a tablet server
......................................................................


Patch Set 6:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/18124/5/src/kudu/master/master_service.cc
File src/kudu/master/master_service.cc:

http://gerrit.cloudera.org:8080/#/c/18124/5/src/kudu/master/master_service.cc@355
PS5, Line 355:       rpc->RespondFailure(s);
             :       return;
             :     }
             :   }
             : 
             :   Status s = server_->ts_manager()->UnregisterTServer(ts_uuid, force_unregister_live_tserver);
             :   if (PREDICT_FALSE(!s.ok() && !s.IsNotFound())) {
             :     // Ignore the NotFound error to make this RPC retriable and effectively idempotent.
             :     rpc->RespondFailure(s);
             :    
> Maybe it's not necessary to keep the run-time state consistent with system 
Another thought to consider: perhaps we should consider separating this into multiple tools, or at least separate RPCs -- one that removes tserver states on the leader, the other that unregisters all masters.

I understand baking this all into a single RPC is convenient, but I think it makes the failure semantics confusing.

If we go this route, we can use a AsyncLeaderMasterRpc to handle leader retries automatically, and only upon successfully writing to the leader we could proceed with unregistering on all masters. What do you think?



-- 
To view, visit http://gerrit.cloudera.org:8080/18124
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If1f5c2979a8d14428f4bcc8e850c57ce228c793a
Gerrit-Change-Number: 18124
Gerrit-PatchSet: 6
Gerrit-Owner: Yifan Zhang <ch...@163.com>
Gerrit-Reviewer: Abhishek Chennaka <ac...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yifan Zhang <ch...@163.com>
Gerrit-Reviewer: Zoltan Chovan <zc...@cloudera.com>
Gerrit-Comment-Date: Fri, 21 Jan 2022 01:37:48 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2915: add tool to unregister a tablet server

Posted by "Yifan Zhang (Code Review)" <ge...@cloudera.org>.
Yifan Zhang has posted comments on this change. ( http://gerrit.cloudera.org:8080/18124 )

Change subject: KUDU-2915: add tool to unregister a tablet server
......................................................................


Patch Set 6:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/18124/5/src/kudu/master/master_service.cc
File src/kudu/master/master_service.cc:

http://gerrit.cloudera.org:8080/#/c/18124/5/src/kudu/master/master_service.cc@355
PS5, Line 355:       rpc->RespondFailure(s);
             :       return;
             :     }
             :   }
             : 
             :   Status s = server_->ts_manager()->UnregisterTServer(ts_uuid, force_unregister_live_tserver);
             :   if (PREDICT_FALSE(!s.ok() && !s.IsNotFound())) {
             :     // Ignore the NotFound error to make this RPC retriable and effectively idempotent.
             :     rpc->RespondFailure(s);
             :    
> I'm not sure the system catalog and the run-time state need to be consisten
Maybe it's not necessary to keep the run-time state consistent with system catalog, if one of these two operations fails, to rerun this tool is also acceptable.



-- 
To view, visit http://gerrit.cloudera.org:8080/18124
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If1f5c2979a8d14428f4bcc8e850c57ce228c793a
Gerrit-Change-Number: 18124
Gerrit-PatchSet: 6
Gerrit-Owner: Yifan Zhang <ch...@163.com>
Gerrit-Reviewer: Abhishek Chennaka <ac...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Yifan Zhang <ch...@163.com>
Gerrit-Reviewer: Zoltan Chovan <zc...@cloudera.com>
Gerrit-Comment-Date: Thu, 20 Jan 2022 03:33:05 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2915: add tool to unregister a tablet server

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/18124 )

Change subject: KUDU-2915: add tool to unregister a tablet server
......................................................................


Patch Set 8: Code-Review+1

(2 comments)

Overall LGTM, thanks!

Just a tiny nit with error message formatting.

http://gerrit.cloudera.org:8080/#/c/18124/5/src/kudu/master/master_service.cc
File src/kudu/master/master_service.cc:

http://gerrit.cloudera.org:8080/#/c/18124/5/src/kudu/master/master_service.cc@351
PS5, Line 351:               
> This is in the condition where the status is not 'NotFound' so it should be
Ah, that makes sense: it seems I misunderstood this code.


http://gerrit.cloudera.org:8080/#/c/18124/8/src/kudu/tools/tool_action_tserver.cc
File src/kudu/tools/tool_action_tserver.cc:

http://gerrit.cloudera.org:8080/#/c/18124/8/src/kudu/tools/tool_action_tserver.cc@332
PS8, Line 332: "Unable to unregister the tserver from master $0, status: $1"
nit: maybe add spacing between consecutive messages, so the result error message would be more readable?  Adding a space in the beginning of each message should be good enough, I guess.



-- 
To view, visit http://gerrit.cloudera.org:8080/18124
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If1f5c2979a8d14428f4bcc8e850c57ce228c793a
Gerrit-Change-Number: 18124
Gerrit-PatchSet: 8
Gerrit-Owner: Yifan Zhang <ch...@163.com>
Gerrit-Reviewer: Abhishek Chennaka <ac...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Yifan Zhang <ch...@163.com>
Gerrit-Reviewer: Zoltan Chovan <zc...@cloudera.com>
Gerrit-Comment-Date: Fri, 21 Jan 2022 22:03:14 +0000
Gerrit-HasComments: Yes