You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Bankim Bhavsar (Code Review)" <ge...@cloudera.org> on 2020/12/08 01:13:27 UTC

[kudu-CR] [master][consensus] Procedure for copying system catalog

Bankim Bhavsar has uploaded this change for review. ( http://gerrit.cloudera.org:8080/16830


Change subject: [master][consensus] Procedure for copying system catalog
......................................................................

[master][consensus] Procedure for copying system catalog

This change outlines procedure to copy system catalog for the newly added
master using existing CLI tools and the master ChangeConfig RPC.
- Flag --consensus_allow_status_msg_for_failed_peer must be turned on for
existing masters.
- Start the new master with
--master_address_add_new_master=<new-master-hostport> and --master_addresses
that contains itself and existing masters.
- Invoke ChangeConfig to add the master.
- If the new master is promoted to being a VOTER then following tablet copy
steps can be skipped.
- Shutdown the new master.
- Delete the system catalog on the new master.
- Copy the system catalog from the leader master to the new master.
- Bring up the new master.
- Verify the new master is promoted as VOTER.

Bulk of the change is in the test code and involves refactoring to use common
parts of the earlier postive test case.

One functional change is in Raft consensus and adds the ability for the leader
to send status only messages to the peer even if it's in FAILED_UNRECOVERABLE
state. Without this change when the system catalog is copied externally the new
master remains in FAILED_UNRECOVERABLE state and doesn't get promoted to being
a VOTER despite the system catalog being up to date. This behavior is currently
disabled by default and hidden under the
--consensus_allow_status_msg_for_failed_peer flag.

Change-Id: I142c1dec442ec72c38c5be9d62cdf270e441d6e3
(cherry picked from commit 0d97614c04a072391a643b54220ee1ac5a52a7d4)
---
M src/kudu/consensus/consensus_queue.cc
M src/kudu/master/dynamic_multi_master-test.cc
2 files changed, 262 insertions(+), 130 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/30/16830/1
-- 
To view, visit http://gerrit.cloudera.org:8080/16830
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I142c1dec442ec72c38c5be9d62cdf270e441d6e3
Gerrit-Change-Number: 16830
Gerrit-PatchSet: 1
Gerrit-Owner: Bankim Bhavsar <ba...@cloudera.com>

[kudu-CR] [master] KUDU-2181 Procedure for copying sys catalog on adding master

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16830 )

Change subject: [master] KUDU-2181 Procedure for copying sys catalog on adding master
......................................................................


Patch Set 9:

(1 comment)

Typically we reserve carrying forward +2s if there are no meaningful changes in the diff (e.g. if it's just a rebase), in case the diff get comments. Since this was already merged, feel free to ignore; I don't expect the extra check costs too much.

http://gerrit.cloudera.org:8080/#/c/16830/9/src/kudu/master/dynamic_multi_master-test.cc
File src/kudu/master/dynamic_multi_master-test.cc:

http://gerrit.cloudera.org:8080/#/c/16830/9/src/kudu/master/dynamic_multi_master-test.cc@172
PS9, Line 172: break
nit: we should be able to return here, rather than checking again below



-- 
To view, visit http://gerrit.cloudera.org:8080/16830
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I142c1dec442ec72c38c5be9d62cdf270e441d6e3
Gerrit-Change-Number: 16830
Gerrit-PatchSet: 9
Gerrit-Owner: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Thu, 07 Jan 2021 18:41:06 +0000
Gerrit-HasComments: Yes

[kudu-CR] [master] KUDU-2181 Procedure for copying sys catalog on adding master

Posted by "Bankim Bhavsar (Code Review)" <ge...@cloudera.org>.
Bankim Bhavsar has posted comments on this change. ( http://gerrit.cloudera.org:8080/16830 )

Change subject: [master] KUDU-2181 Procedure for copying sys catalog on adding master
......................................................................


Patch Set 7:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/16830/6//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16830/6//COMMIT_MSG@26
PS6, Line 26:  If not,
            : above steps will have to be repeated.
> I see. It probably makes sense to build this verification into a small tool
CLI tool to add master that invokes ChangeConfig can definitely include this verification step.


http://gerrit.cloudera.org:8080/#/c/16830/7/src/kudu/master/dynamic_multi_master-test.cc
File src/kudu/master/dynamic_multi_master-test.cc:

http://gerrit.cloudera.org:8080/#/c/16830/7/src/kudu/master/dynamic_multi_master-test.cc@173
PS7, Line 173:     int64_t updated_gc_count;
             :     while (MonoTime::Now() < deadline) {
             :       updated_gc_count = get_sys_catalog_wal_gc_count();
             :       if (updated_gc_count > orig_gc_count) {
             :         break;
             :       }
> Could we check this metric before creating each table? Then we wouldn't nee
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/16830
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I142c1dec442ec72c38c5be9d62cdf270e441d6e3
Gerrit-Change-Number: 16830
Gerrit-PatchSet: 7
Gerrit-Owner: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Wed, 06 Jan 2021 17:55:16 +0000
Gerrit-HasComments: Yes

[kudu-CR] [consensus] KUDU-2181 Allow sending status msgs to FAILED peers

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16830 )

Change subject: [consensus] KUDU-2181 Allow sending status msgs to FAILED peers
......................................................................


Patch Set 5:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/16830/4//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16830/4//COMMIT_MSG@28
PS4, Line 28: - Shutdown the new ma
> My earlier plan was to split the Raft change and the test procedure. But th
I haven't looked in much depth, but I grepped around for FAILED_UNRECOVERABLE and some some tests in consensus_queue-test that test this state. Did you look into testing it there? Were those unit tests not a good fit for the targeted consensus-related functionality we're introducing here?


http://gerrit.cloudera.org:8080/#/c/16830/4/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

http://gerrit.cloudera.org:8080/#/c/16830/4/src/kudu/consensus/consensus_queue.cc@74
PS4, Line 74: 
            : DEFINE_bool(consensus_allow_status_msg_for_failed_peer, false,
            :             "Allows status only Raft messages to be sent to a peer in FAILED_UNRECOVERABLE state.");
            : TAG_FLAG(consensus_allow_status_msg_for_failed_peer, experimental);
            : TAG_FLAG(consensus_allow_status_msg_for_failed_peer, runtime)
> Is the intention to restrict turning on this flag only for masters?

Yes, or more strongly, don't expose it as a flag at all, and instead pass it as an argument to RaftConsensus, and have only masters set that argument.


http://gerrit.cloudera.org:8080/#/c/16830/4/src/kudu/master/dynamic_multi_master-test.cc
File src/kudu/master/dynamic_multi_master-test.cc:

http://gerrit.cloudera.org:8080/#/c/16830/4/src/kudu/master/dynamic_multi_master-test.cc@139
PS4, Line 139:   // has been GC'ed.
> With external mini cluster the test will be very similar to the actual proc
Ah I see. Yeah I think the difficulty with IMC is providing different flags to each process. I do think it's possible in that both mini cluster types have Shutdown() and Restart() methods, but the flags may be more difficult to wrangle -- especially the runtime ones.


http://gerrit.cloudera.org:8080/#/c/16830/4/src/kudu/master/dynamic_multi_master-test.cc@154
PS4, Line 154: 
> Can't use ASSERT_OK() with a lambda that returns a non-void data type.
Ah, indeed I overlooked that this was a lambda.


http://gerrit.cloudera.org:8080/#/c/16830/4/src/kudu/master/dynamic_multi_master-test.cc@416
PS4, Line 416:  new_master_->Shutdown();
             :     cluster_.reset();
             : 
             :     LOG(INFO) << "Bringing up the migrated cluster";
             :     opts_.num_masters = orig_num_masters_ + 1;
             :     opts_.master_rpc_addresses = master_hps;
             :     ExternalMiniCluster migrated_cluster(opts_);
             :     ASSERT_OK(migrated_cluster.Start());
             :     for (int i = 0; i < migrated_cluster.num_masters(); i++) {
             :       ASSERT_OK(migrated_cluster.master(i)->WaitForCatalogManager());
             :     }
             : 
             :     // Verify the cluster still has the same 3 masters.
             :     {
             :       ListMastersResponsePB resp;
             :       NO_FATALS(RunListMaste
> This verification function is called after verifying that the new master ha
Right, though this iteration of the test isn't checking any non-system tablets either. Maybe I missed it -- why not use the ClusterVerifier? This code seems like it could be replaced with ClusterVerifier::CheckRowCount().



-- 
To view, visit http://gerrit.cloudera.org:8080/16830
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I142c1dec442ec72c38c5be9d62cdf270e441d6e3
Gerrit-Change-Number: 16830
Gerrit-PatchSet: 5
Gerrit-Owner: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Tue, 15 Dec 2020 21:56:15 +0000
Gerrit-HasComments: Yes

[kudu-CR] [master] KUDU-2181 Procedure for copying sys catalog on adding master

Posted by "Bankim Bhavsar (Code Review)" <ge...@cloudera.org>.
Hello Alexey Serbin, Kudu Jenkins, Andrew Wong, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16830

to look at the new patch set (#6).

Change subject: [master] KUDU-2181 Procedure for copying sys catalog on adding master
......................................................................

[master] KUDU-2181 Procedure for copying sys catalog on adding master

This change outlines procedure to copy system catalog for the newly
added master using existing CLI tools and the master ChangeConfig RPC.

Only functional change is to hookup the master runtime flag
--master_consensus_allow_status_msg_for_failed_peer to Raft consensus.
New master could go into a FAILED_RECOVERABLE state if the leader
master's system catalog WAL has been GC'ed. This change allows the
new master to be promoted after copying the system catalog externally.

Outline of the test procedure:
1) Runtime flag --master_consensus_allow_status_msg_for_failed_peer
must be turned on for existing masters.
2) Start the new master with
--master_address_add_new_master=<new-master-hostport> and
--master_addresses that contains itself and existing masters.
3) Invoke ChangeConfig to add the master.
4) Verify the new master is part of the Raft config even if it's a
LEARNER/NON_VOTER or goes into FAILED_RECOVERABLE state. If not,
above steps will have to be repeated.
5) If the new master is promoted to being a VOTER then following tablet
copy steps can be skipped.
6) Shutdown the new master.
7) Delete the system catalog on the new master.
8) Copy the system catalog from the leader master to the new master.
9) Bring up the new master.
10) Verify the new master is promoted as VOTER.
  If the new master doesn't get promoted to a VOTER then double check
  whether the new master is part of the Raft config for masters by
  running "kudu master list".
    - If yes, repeat procedure from step 6.
    - Else repeat the entire procedure

Change-Id: I142c1dec442ec72c38c5be9d62cdf270e441d6e3
---
M src/kudu/master/dynamic_multi_master-test.cc
M src/kudu/master/sys_catalog.cc
2 files changed, 294 insertions(+), 141 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/30/16830/6
-- 
To view, visit http://gerrit.cloudera.org:8080/16830
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I142c1dec442ec72c38c5be9d62cdf270e441d6e3
Gerrit-Change-Number: 16830
Gerrit-PatchSet: 6
Gerrit-Owner: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)

[kudu-CR] [consensus] KUDU-2181 Allow sending status msgs to FAILED peers

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/16830 )

Change subject: [consensus] KUDU-2181 Allow sending status msgs to FAILED peers
......................................................................


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16830/5/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

http://gerrit.cloudera.org:8080/#/c/16830/5/src/kudu/consensus/consensus_queue.cc@707
PS5, Line 707: peer_copy.wal_catchup_possible
Is this part is not exactly related to the changes in the context of multi-master automation tasks?  If so, maybe it's worth separating it into its own changelist for easier tracking and visibility?



-- 
To view, visit http://gerrit.cloudera.org:8080/16830
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I142c1dec442ec72c38c5be9d62cdf270e441d6e3
Gerrit-Change-Number: 16830
Gerrit-PatchSet: 5
Gerrit-Owner: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Tue, 15 Dec 2020 22:00:34 +0000
Gerrit-HasComments: Yes

[kudu-CR] [master] KUDU-2181 Procedure for copying sys catalog on adding master

Posted by "Bankim Bhavsar (Code Review)" <ge...@cloudera.org>.
Bankim Bhavsar has posted comments on this change. ( http://gerrit.cloudera.org:8080/16830 )

Change subject: [master] KUDU-2181 Procedure for copying sys catalog on adding master
......................................................................


Patch Set 6:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/16830/6//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16830/6//COMMIT_MSG@26
PS6, Line 26:  If not,
            : above steps will have to be repeated.
> Curious, do we actually expect this to happen? How could an operator get in
Validation errors while invoking ChangeConfig which were ignored could be one reason.
Apart from that I don't really expect this to happen.

ChangeConfig step is idempotent so repeating it should help fix the issue.


http://gerrit.cloudera.org:8080/#/c/16830/6/src/kudu/master/dynamic_multi_master-test.cc
File src/kudu/master/dynamic_multi_master-test.cc:

http://gerrit.cloudera.org:8080/#/c/16830/6/src/kudu/master/dynamic_multi_master-test.cc@172
PS6, Line 172:     MonoTime deadline = MonoTime::Now() + MonoDelta::FromSeconds(2);
> If you haven't already, it'd be good to run this test suite a couple thousa
I'm running into some issues running dist-test. For now, I ran locally asan & tsan with --stress_cpu_threads=4 10 times and test passed successfully.


http://gerrit.cloudera.org:8080/#/c/16830/6/src/kudu/master/sys_catalog.cc
File src/kudu/master/sys_catalog.cc:

http://gerrit.cloudera.org:8080/#/c/16830/6/src/kudu/master/sys_catalog.cc@101
PS6, Line 101: unsafe
> nit: probably should be advanced instead of unsafe, given we want this to b
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/16830
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I142c1dec442ec72c38c5be9d62cdf270e441d6e3
Gerrit-Change-Number: 16830
Gerrit-PatchSet: 6
Gerrit-Owner: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Wed, 06 Jan 2021 00:35:33 +0000
Gerrit-HasComments: Yes

[kudu-CR] [master] KUDU-2181 Procedure for copying sys catalog on adding master

Posted by "Bankim Bhavsar (Code Review)" <ge...@cloudera.org>.
Bankim Bhavsar has posted comments on this change. ( http://gerrit.cloudera.org:8080/16830 )

Change subject: [master] KUDU-2181 Procedure for copying sys catalog on adding master
......................................................................


Patch Set 9:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16830/9/src/kudu/master/dynamic_multi_master-test.cc
File src/kudu/master/dynamic_multi_master-test.cc:

http://gerrit.cloudera.org:8080/#/c/16830/9/src/kudu/master/dynamic_multi_master-test.cc@172
PS9, Line 172: break
> nit: we should be able to return here, rather than checking again below
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/16830
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I142c1dec442ec72c38c5be9d62cdf270e441d6e3
Gerrit-Change-Number: 16830
Gerrit-PatchSet: 9
Gerrit-Owner: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Fri, 08 Jan 2021 17:29:20 +0000
Gerrit-HasComments: Yes

[kudu-CR] [master][consensus] Procedure for copying system catalog

Posted by "Bankim Bhavsar (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16830

to look at the new patch set (#2).

Change subject: [master][consensus] Procedure for copying system catalog
......................................................................

[master][consensus] Procedure for copying system catalog

This change outlines procedure to copy system catalog for the newlyadded
master using existing CLI tools and the master ChangeConfig RPC.
- Flag --consensus_allow_status_msg_for_failed_peer must be turned on
for existing masters.
- Start the new master with
--master_address_add_new_master=<new-master-hostport> and
--master_addresses that contains itself and existing masters.
- Invoke ChangeConfig to add the master.
- If the new master is promoted to being a VOTER then following tablet
copy steps can be skipped.
- Shutdown the new master.
- Delete the system catalog on the new master.
- Copy the system catalog from the leader master to the new master.
- Bring up the new master.
- Verify the new master is promoted as VOTER.

Bulk of the change is in the test code and involves refactoring to use
common parts of the earlier postive test case.

One functional change is in Raft consensus and adds the ability for the
leader to send status only messages to the peer even if it's in
FAILED_UNRECOVERABLE state. Without this change when the system catalog
is copied externally the new master remains in FAILED_UNRECOVERABLE
state and doesn't get promoted to being a VOTER despite the system
catalog being up to date. This behavior is currently disabled by default
and hidden under the --consensus_allow_status_msg_for_failed_peer flag.

Change-Id: I142c1dec442ec72c38c5be9d62cdf270e441d6e3
---
M src/kudu/consensus/consensus_queue.cc
M src/kudu/master/dynamic_multi_master-test.cc
2 files changed, 262 insertions(+), 130 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/30/16830/2
-- 
To view, visit http://gerrit.cloudera.org:8080/16830
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I142c1dec442ec72c38c5be9d62cdf270e441d6e3
Gerrit-Change-Number: 16830
Gerrit-PatchSet: 2
Gerrit-Owner: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)

[kudu-CR] [master][consensus] Procedure for copying system catalog

Posted by "Bankim Bhavsar (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16830

to look at the new patch set (#3).

Change subject: [master][consensus] Procedure for copying system catalog
......................................................................

[master][consensus] Procedure for copying system catalog

This change outlines procedure to copy system catalog for the newly
added master using existing CLI tools and the master ChangeConfig RPC.
- Flag --consensus_allow_status_msg_for_failed_peer must be turned on
for existing masters.
- Start the new master with
--master_address_add_new_master=<new-master-hostport> and
--master_addresses that contains itself and existing masters.
- Invoke ChangeConfig to add the master.
- If the new master is promoted to being a VOTER then following tablet
copy steps can be skipped.
- Shutdown the new master.
- Delete the system catalog on the new master.
- Copy the system catalog from the leader master to the new master.
- Bring up the new master.
- Verify the new master is promoted as VOTER.

Bulk of the change is in the test code and involves refactoring to use
common parts of the earlier postive test case.

One functional change is in Raft consensus and adds the ability for the
leader to send status only messages to the peer even if it's in
FAILED_UNRECOVERABLE state. Without this change when the system catalog
is copied externally the new master remains in FAILED_UNRECOVERABLE
state and doesn't get promoted to being a VOTER despite the system
catalog being up to date. This behavior is currently disabled by default
and hidden under the --consensus_allow_status_msg_for_failed_peer flag.

Change-Id: I142c1dec442ec72c38c5be9d62cdf270e441d6e3
---
M src/kudu/consensus/consensus_queue.cc
M src/kudu/master/dynamic_multi_master-test.cc
2 files changed, 275 insertions(+), 130 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/30/16830/3
-- 
To view, visit http://gerrit.cloudera.org:8080/16830
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I142c1dec442ec72c38c5be9d62cdf270e441d6e3
Gerrit-Change-Number: 16830
Gerrit-PatchSet: 3
Gerrit-Owner: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)

[kudu-CR] [master] KUDU-2181 Procedure for copying sys catalog on adding master

Posted by "Bankim Bhavsar (Code Review)" <ge...@cloudera.org>.
Bankim Bhavsar has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/16830 )

Change subject: [master] KUDU-2181 Procedure for copying sys catalog on adding master
......................................................................

[master] KUDU-2181 Procedure for copying sys catalog on adding master

This change outlines procedure to copy system catalog for the newly
added master using existing CLI tools and the master ChangeConfig RPC.

Only functional change is to hookup the master runtime flag
--master_consensus_allow_status_msg_for_failed_peer to Raft consensus.
New master could go into a FAILED_RECOVERABLE state if the leader
master's system catalog WAL has been GC'ed. This change allows the
new master to be promoted after copying the system catalog externally.

Outline of the test procedure:
1) Runtime flag --master_consensus_allow_status_msg_for_failed_peer
must be turned on for existing masters.
2) Start the new master with
--master_address_add_new_master=<new-master-hostport> and
--master_addresses that contains itself and existing masters.
3) Invoke ChangeConfig to add the master.
4) Verify the new master is part of the Raft config even if it's a
LEARNER/NON_VOTER or goes into FAILED_RECOVERABLE state. If not,
above steps will have to be repeated.
5) If the new master is promoted to being a VOTER then following tablet
copy steps can be skipped.
6) Shutdown the new master.
7) Delete the system catalog on the new master.
8) Copy the system catalog from the leader master to the new master.
9) Bring up the new master.
10) Verify the new master is promoted as VOTER.
  If the new master doesn't get promoted to a VOTER then double check
  whether the new master is part of the Raft config for masters by
  running "kudu master list".
    - If yes, repeat procedure from step 6.
    - Else repeat the entire procedure

Change-Id: I142c1dec442ec72c38c5be9d62cdf270e441d6e3
Reviewed-on: http://gerrit.cloudera.org:8080/16830
Tested-by: Kudu Jenkins
Reviewed-by: Bankim Bhavsar <ba...@cloudera.com>
---
M src/kudu/master/dynamic_multi_master-test.cc
M src/kudu/master/sys_catalog.cc
2 files changed, 300 insertions(+), 141 deletions(-)

Approvals:
  Kudu Jenkins: Verified
  Bankim Bhavsar: Looks good to me, approved

-- 
To view, visit http://gerrit.cloudera.org:8080/16830
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I142c1dec442ec72c38c5be9d62cdf270e441d6e3
Gerrit-Change-Number: 16830
Gerrit-PatchSet: 9
Gerrit-Owner: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)

[kudu-CR] [consensus] KUDU-2181 Allow sending status msgs to FAILED peers

Posted by "Bankim Bhavsar (Code Review)" <ge...@cloudera.org>.
Bankim Bhavsar has posted comments on this change. ( http://gerrit.cloudera.org:8080/16830 )

Change subject: [consensus] KUDU-2181 Allow sending status msgs to FAILED peers
......................................................................


Patch Set 5:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/16830/5//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16830/5//COMMIT_MSG@32
PS5, Line 32: Verify the new master is promoted as VOTER
> What if it's not promoted to a voter role?  What's the recovery scenario?
Assuming the Raft ChangeConfig went through fine, if the new master doesn't get promoted to a VOTER then leader master logs should have information about why catchup/promotion is not possible. In such a case, tablet copy part of the process will need to be repeated. I'll add that to the commit message.


http://gerrit.cloudera.org:8080/#/c/16830/5/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

http://gerrit.cloudera.org:8080/#/c/16830/5/src/kudu/consensus/consensus_queue.cc@707
PS5, Line 707: peer_copy.wal_catchup_possible
> Is this part is not exactly related to the changes in the context of multi-
Done


http://gerrit.cloudera.org:8080/#/c/16830/4/src/kudu/master/dynamic_multi_master-test.cc
File src/kudu/master/dynamic_multi_master-test.cc:

http://gerrit.cloudera.org:8080/#/c/16830/4/src/kudu/master/dynamic_multi_master-test.cc@416
PS4, Line 416:  new_master_->Shutdown();
             :     cluster_.reset();
             : 
             :     LOG(INFO) << "Bringing up the migrated cluster";
             :     opts_.num_masters = orig_num_masters_ + 1;
             :     opts_.master_rpc_addresses = master_hps;
             :     ExternalMiniCluster migrated_cluster(opts_);
             :     ASSERT_OK(migrated_cluster.Start());
             :     for (int i = 0; i < migrated_cluster.num_masters(); i++) {
             :       ASSERT_OK(migrated_cluster.master(i)->WaitForCatalogManager());
             :     }
             : 
             :     // Verify the cluster still has the same 3 masters.
             :     {
             :       ListMastersResponsePB resp;
             :       NO_FATALS(RunListMaste
> Right, though this iteration of the test isn't checking any non-system tabl
Done


http://gerrit.cloudera.org:8080/#/c/16830/5/src/kudu/master/dynamic_multi_master-test.cc
File src/kudu/master/dynamic_multi_master-test.cc:

http://gerrit.cloudera.org:8080/#/c/16830/5/src/kudu/master/dynamic_multi_master-test.cc@168
PS5, Line 168: Need to create around 1k tables even with lowest flush threshold and log segment size.
> Might it be helpful to add dimension_label for a table to decrease the numb
I tried adding a long dimension label on creating tables but that didn't help bring down the the number of tables required to trigger GC.


http://gerrit.cloudera.org:8080/#/c/16830/5/src/kudu/master/dynamic_multi_master-test.cc@424
PS5, Line 424:     for (int i = 0; i < migrated_cluster.num_masters(); i++) {
             :       ASSERT_OK(migrated_cluster.master(i)->WaitForCatalogManager());
             :     }
> nit: does it makes sense to ensure that new_master_ is now a part of master
Done


http://gerrit.cloudera.org:8080/#/c/16830/5/src/kudu/master/dynamic_multi_master-test.cc@438
PS5, Line 438:         ASSERT_EQ(1, master.registration().rpc_addresses_size());
             :         HostPort actual_hp = HostPortFromPB(master.registration().rpc_addresses(0));
             :         ASSERT_TRUE(std::find(master_hps.begin(), master_hps.end(), actual_hp) != master_hps.end());
             :  
> Does it make sense to check there aren't any duplicates among the registere
Done


http://gerrit.cloudera.org:8080/#/c/16830/5/src/kudu/master/dynamic_multi_master-test.cc@467
PS5, Line 467: master one
> nit: one master
Done


http://gerrit.cloudera.org:8080/#/c/16830/5/src/kudu/master/dynamic_multi_master-test.cc@475
PS5, Line 475: kTableName
> To harden the use case, does it make sense to try opening a table that was 
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/16830
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I142c1dec442ec72c38c5be9d62cdf270e441d6e3
Gerrit-Change-Number: 16830
Gerrit-PatchSet: 5
Gerrit-Owner: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Tue, 22 Dec 2020 20:50:00 +0000
Gerrit-HasComments: Yes

[kudu-CR] [master] KUDU-2181 Procedure for copying sys catalog on adding master

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16830 )

Change subject: [master] KUDU-2181 Procedure for copying sys catalog on adding master
......................................................................


Patch Set 7:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16830/7/src/kudu/master/dynamic_multi_master-test.cc
File src/kudu/master/dynamic_multi_master-test.cc:

http://gerrit.cloudera.org:8080/#/c/16830/7/src/kudu/master/dynamic_multi_master-test.cc@173
PS7, Line 173:     int64_t updated_gc_count;
             :     while (MonoTime::Now() < deadline) {
             :       updated_gc_count = get_sys_catalog_wal_gc_count();
             :       if (updated_gc_count > orig_gc_count) {
             :         break;
             :       }
Could we check this metric before creating each table? Then we wouldn't need 1000 tables exactly every time.



-- 
To view, visit http://gerrit.cloudera.org:8080/16830
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I142c1dec442ec72c38c5be9d62cdf270e441d6e3
Gerrit-Change-Number: 16830
Gerrit-PatchSet: 7
Gerrit-Owner: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Wed, 06 Jan 2021 02:24:16 +0000
Gerrit-HasComments: Yes

[kudu-CR] [master] KUDU-2181 Procedure for copying sys catalog on adding master

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16830 )

Change subject: [master] KUDU-2181 Procedure for copying sys catalog on adding master
......................................................................


Patch Set 6:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/16830/6//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16830/6//COMMIT_MSG@26
PS6, Line 26:  If not,
            : above steps will have to be repeated.
Curious, do we actually expect this to happen? How could an operator get into this scenario?


http://gerrit.cloudera.org:8080/#/c/16830/6/src/kudu/master/dynamic_multi_master-test.cc
File src/kudu/master/dynamic_multi_master-test.cc:

http://gerrit.cloudera.org:8080/#/c/16830/6/src/kudu/master/dynamic_multi_master-test.cc@172
PS6, Line 172:     MonoTime deadline = MonoTime::Now() + MonoDelta::FromSeconds(2);
If you haven't already, it'd be good to run this test suite a couple thousand times, maybe in ASAN or TSAN mode, maybe with --stress_cpu_threads=4 to make sure this won't be flaky.


http://gerrit.cloudera.org:8080/#/c/16830/6/src/kudu/master/sys_catalog.cc
File src/kudu/master/sys_catalog.cc:

http://gerrit.cloudera.org:8080/#/c/16830/6/src/kudu/master/sys_catalog.cc@101
PS6, Line 101: unsafe
nit: probably should be advanced instead of unsafe, given we want this to be used as a part of documented workflow.



-- 
To view, visit http://gerrit.cloudera.org:8080/16830
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I142c1dec442ec72c38c5be9d62cdf270e441d6e3
Gerrit-Change-Number: 16830
Gerrit-PatchSet: 6
Gerrit-Owner: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Mon, 04 Jan 2021 07:04:41 +0000
Gerrit-HasComments: Yes

[kudu-CR] [master] KUDU-2181 Procedure for copying sys catalog on adding master

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16830 )

Change subject: [master] KUDU-2181 Procedure for copying sys catalog on adding master
......................................................................


Patch Set 7:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/16830/6//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16830/6//COMMIT_MSG@26
PS6, Line 26:  If not,
            : above steps will have to be repeated.
> Validation errors while invoking ChangeConfig which were ignored could be o
I see. It probably makes sense to build this verification into a small tool. Otherwise it seems like it might be a hassle to verify this. Or is this straightforward from a run of ksck?



-- 
To view, visit http://gerrit.cloudera.org:8080/16830
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I142c1dec442ec72c38c5be9d62cdf270e441d6e3
Gerrit-Change-Number: 16830
Gerrit-PatchSet: 7
Gerrit-Owner: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Wed, 06 Jan 2021 01:21:13 +0000
Gerrit-HasComments: Yes

[kudu-CR] [master] KUDU-2181 Procedure for copying sys catalog on adding master

Posted by "Bankim Bhavsar (Code Review)" <ge...@cloudera.org>.
Bankim Bhavsar has posted comments on this change. ( http://gerrit.cloudera.org:8080/16830 )

Change subject: [master] KUDU-2181 Procedure for copying sys catalog on adding master
......................................................................


Patch Set 8:

Re-instating the +2 from Andrew.


-- 
To view, visit http://gerrit.cloudera.org:8080/16830
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I142c1dec442ec72c38c5be9d62cdf270e441d6e3
Gerrit-Change-Number: 16830
Gerrit-PatchSet: 8
Gerrit-Owner: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Thu, 07 Jan 2021 13:42:18 +0000
Gerrit-HasComments: No

[kudu-CR] [master] KUDU-2181 Procedure for copying sys catalog on adding master

Posted by "Bankim Bhavsar (Code Review)" <ge...@cloudera.org>.
Bankim Bhavsar has posted comments on this change. ( http://gerrit.cloudera.org:8080/16830 )

Change subject: [master] KUDU-2181 Procedure for copying sys catalog on adding master
......................................................................


Patch Set 8: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/16830
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I142c1dec442ec72c38c5be9d62cdf270e441d6e3
Gerrit-Change-Number: 16830
Gerrit-PatchSet: 8
Gerrit-Owner: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Thu, 07 Jan 2021 13:41:42 +0000
Gerrit-HasComments: No

[kudu-CR] [master] KUDU-2181 Procedure for copying sys catalog on adding master

Posted by "Bankim Bhavsar (Code Review)" <ge...@cloudera.org>.
Bankim Bhavsar has posted comments on this change. ( http://gerrit.cloudera.org:8080/16830 )

Change subject: [master] KUDU-2181 Procedure for copying sys catalog on adding master
......................................................................


Patch Set 9:

> Patch Set 9:
> 
> (1 comment)
> 
> Typically we reserve carrying forward +2s if there are no meaningful changes in the diff (e.g. if it's just a rebase), in case the diff get comments. Since this was already merged, feel free to ignore; I don't expect the extra check costs too much.

Ack.
Addressed in upcoming change.


-- 
To view, visit http://gerrit.cloudera.org:8080/16830
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I142c1dec442ec72c38c5be9d62cdf270e441d6e3
Gerrit-Change-Number: 16830
Gerrit-PatchSet: 9
Gerrit-Owner: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Fri, 08 Jan 2021 17:28:57 +0000
Gerrit-HasComments: No

[kudu-CR] [master] KUDU-2181 Procedure for copying sys catalog on adding master

Posted by "Bankim Bhavsar (Code Review)" <ge...@cloudera.org>.
Hello Alexey Serbin, Kudu Jenkins, Andrew Wong, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16830

to look at the new patch set (#8).

Change subject: [master] KUDU-2181 Procedure for copying sys catalog on adding master
......................................................................

[master] KUDU-2181 Procedure for copying sys catalog on adding master

This change outlines procedure to copy system catalog for the newly
added master using existing CLI tools and the master ChangeConfig RPC.

Only functional change is to hookup the master runtime flag
--master_consensus_allow_status_msg_for_failed_peer to Raft consensus.
New master could go into a FAILED_RECOVERABLE state if the leader
master's system catalog WAL has been GC'ed. This change allows the
new master to be promoted after copying the system catalog externally.

Outline of the test procedure:
1) Runtime flag --master_consensus_allow_status_msg_for_failed_peer
must be turned on for existing masters.
2) Start the new master with
--master_address_add_new_master=<new-master-hostport> and
--master_addresses that contains itself and existing masters.
3) Invoke ChangeConfig to add the master.
4) Verify the new master is part of the Raft config even if it's a
LEARNER/NON_VOTER or goes into FAILED_RECOVERABLE state. If not,
above steps will have to be repeated.
5) If the new master is promoted to being a VOTER then following tablet
copy steps can be skipped.
6) Shutdown the new master.
7) Delete the system catalog on the new master.
8) Copy the system catalog from the leader master to the new master.
9) Bring up the new master.
10) Verify the new master is promoted as VOTER.
  If the new master doesn't get promoted to a VOTER then double check
  whether the new master is part of the Raft config for masters by
  running "kudu master list".
    - If yes, repeat procedure from step 6.
    - Else repeat the entire procedure

Change-Id: I142c1dec442ec72c38c5be9d62cdf270e441d6e3
---
M src/kudu/master/dynamic_multi_master-test.cc
M src/kudu/master/sys_catalog.cc
2 files changed, 300 insertions(+), 141 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/30/16830/8
-- 
To view, visit http://gerrit.cloudera.org:8080/16830
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I142c1dec442ec72c38c5be9d62cdf270e441d6e3
Gerrit-Change-Number: 16830
Gerrit-PatchSet: 8
Gerrit-Owner: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)

[kudu-CR] [master][consensus] Procedure for copying system catalog

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16830 )

Change subject: [master][consensus] Procedure for copying system catalog
......................................................................


Patch Set 4:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/16830/4//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16830/4//COMMIT_MSG@28
PS4, Line 28: One functional change
nit: it seems this patch should focus more on this aspect, rather than being touted as a test change, since this won't a test-only thing if we expect to use the new flag in a multi-master recovery. Or is the idea to get the test working and then iron out the approach for status messages?


http://gerrit.cloudera.org:8080/#/c/16830/4/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

http://gerrit.cloudera.org:8080/#/c/16830/4/src/kudu/consensus/consensus_queue.cc@74
PS4, Line 74: 
            : DEFINE_bool(consensus_allow_status_msg_for_failed_peer, false,
            :             "Allows status only Raft messages to be sent to a peer in FAILED_UNRECOVERABLE state.");
            : TAG_FLAG(consensus_allow_status_msg_for_failed_peer, hidden);
            : TAG_FLAG(consensus_allow_status_msg_for_failed_peer, unsafe);
Do we actually intend on using this flag for our documented procedure to recover masters? If so, it shouldn't be unsafe. Also mark it runtime?

Alternatively, would it make sense to pass some option to the Raft implementation from the master that enables this?


http://gerrit.cloudera.org:8080/#/c/16830/4/src/kudu/master/dynamic_multi_master-test.cc
File src/kudu/master/dynamic_multi_master-test.cc:

http://gerrit.cloudera.org:8080/#/c/16830/4/src/kudu/master/dynamic_multi_master-test.cc@139
PS4, Line 139:   void StartClusterWithSysCatalogGCed(vector<HostPort>* master_hps,
I seem to recall discussing the need to GC in order to have the replica _not_ be catch-up-able through the WALs. But I don't recall why not just use an internal mini cluster and call the GC functions directly? I suppose there might be issues with --master_support_change_config, but it isn't an obvious blocker to using an IMC?

I don't feel strongly about it to request a rewrite, but I'm curious what the rationale is since it seems like it might simplify the tests.


http://gerrit.cloudera.org:8080/#/c/16830/4/src/kudu/master/dynamic_multi_master-test.cc@154
PS4, Line 154: CHECK_OK
nit: ASSERT_OK?


http://gerrit.cloudera.org:8080/#/c/16830/4/src/kudu/master/dynamic_multi_master-test.cc@171
PS4, Line 171:     int64_t time_left_to_sleep_msecs = 2000;
nit: more idiomatic to use a MonoTime as a deadline and use MonoTime::Now()


http://gerrit.cloudera.org:8080/#/c/16830/4/src/kudu/master/dynamic_multi_master-test.cc@177
PS4, Line 177: ASSERT_GT(time_left_to_sleep_msecs, 0)
             :       << "Timed out waiting for system catalog WAL to be GC'ed";
nit: how about assert on the GC count? Since, as long as we GCed, it's not a big deal if we just missed the timeout?


http://gerrit.cloudera.org:8080/#/c/16830/4/src/kudu/master/dynamic_multi_master-test.cc@343
PS4, Line 343:     ASSERT_EVENTUALLY([&] {
nit: I typically find it cleaner to have callers do the ASSERT_EVENTUALLY() and have test methods like this be single-try functions. We could even reuse the single-try variant if we do have stronger guarantees. As is, it imposes some assumptions about how the method will be used (i.e. that it's going to likely take multiple tries).


http://gerrit.cloudera.org:8080/#/c/16830/4/src/kudu/master/dynamic_multi_master-test.cc@416
PS4, Line 416:  LOG(INFO) << "Verifying the first table";
             :     // Verify the table created before adding the new master.
             :     shared_ptr<KuduTable> table;
             :     ASSERT_OK(client->OpenTable(kTableName, &table));
             :     KuduScanner scanner1(table.get());
             :     size_t row_count = 0;
             :     ASSERT_OK(client::CountRowsWithRetries(&scanner1, &row_count));
             :     ASSERT_EQ(0, row_count);
             : 
             :     LOG(INFO) << "Creating and verifying the second table";
             :     // Perform an operation that requires replication to masters.
             :     ASSERT_OK(CreateTable(&migrated_cluster, "second_table"));
             :     ASSERT_OK(client->OpenTable("second_table", &table));
             :     KuduScanner scanner2(table.get());
             :     ASSERT_OK(client::CountRowsWithRetries(&scanner2, &row_count));
             :     ASSERT_EQ(0, row_count);
We're still not necessarily testing that the new master is caught up with the rest of the masters here, since the OpenTable() request and GetTableLocations() requests go through whichever master is elected leader, which may or may not be the new master.

Another thought is to use a ClusterVerifier, which runs ksck and runs some scans, which seems like a superset of what's done here.


http://gerrit.cloudera.org:8080/#/c/16830/4/src/kudu/master/dynamic_multi_master-test.cc@433
PS4, Line 433:     LOG(INFO) << "Pausing and resuming individual masters";
nit: move this into the if block?


http://gerrit.cloudera.org:8080/#/c/16830/4/src/kudu/master/dynamic_multi_master-test.cc@588
PS4, Line 588:   SKIP_IF_SLOW_NOT_ALLOWED();
             : 
             :   vector<HostPort> master_hps;
             :   NO_FATALS(StartClusterWithSysCatalogGCed(&master_hps,
             :                                            {"--consensus_allow_status_msg_for_failed_peer"}));
             :   ASSERT_OK(CreateTable(cluster_.get(), kTableName));
             :   // Bring up the new master and add to the cluster.
             :   master_hps.emplace_back(reserved_hp_);
             :   NO_FATALS(StartNewMaster(master_hps));
             :   ASSERT_OK(AddMasterToCluster(reserved_hp_));
             : 
             :   // Newly added master will be added to the master Raft config but won't be caught up
             :   // from the WAL and hence remain as a NON_VOTER.
             :   ListMastersResponsePB list_resp;
             :   NO_FATALS(RunListMasters(&list_resp));
             :   ASSERT_EQ(orig_num_masters_ + 1, list_resp.masters_size());
             : 
             :   for (const auto& master : list_resp.masters()) {
             :     ASSERT_EQ(1, master.registration().rpc_addresses_size());
             :     auto hp = HostPortFromPB(master.registration().rpc_addresses(0));
             :     if (hp == reserved_hp_) {
             :       ASSERT_EQ(consensus::RaftPeerPB::NON_VOTER, master.member_type());
             :       ASSERT_TRUE(master.role() == consensus::RaftPeerPB::LEARNER);
             :     }
             :   }
             : 
             :   // Double check by directly contacting the new master.
             :   NO_FATALS(VerifyNewMasterDirectly({ consensus::RaftPeerPB::LEARNER },
             :                                     consensus::RaftPeerPB::NON_VOTER));
             : 
             :   // Verify new master is in FAILED_UNRECOVERABLE state.
             :   NO_FATALS(VerifyNewMasterInFailedUnrecoverableState());
If this is for the most part a copy TestAddMasterCatchupFromWALNotPossible, why not just have this test supersede it? We could even set --consensus_allow_status_msg_for_failed_peer at runtime if the idea is to test when --consensus_allow_status_msg_for_failed_peer=false.



-- 
To view, visit http://gerrit.cloudera.org:8080/16830
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I142c1dec442ec72c38c5be9d62cdf270e441d6e3
Gerrit-Change-Number: 16830
Gerrit-PatchSet: 4
Gerrit-Owner: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Sat, 12 Dec 2020 01:28:35 +0000
Gerrit-HasComments: Yes

[kudu-CR] [master][consensus] Procedure for copying system catalog

Posted by "Bankim Bhavsar (Code Review)" <ge...@cloudera.org>.
Bankim Bhavsar has posted comments on this change. ( http://gerrit.cloudera.org:8080/16830 )

Change subject: [master][consensus] Procedure for copying system catalog
......................................................................


Patch Set 4:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/16830/4//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16830/4//COMMIT_MSG@28
PS4, Line 28: One functional change
> nit: it seems this patch should focus more on this aspect, rather than bein
My earlier plan was to split the Raft change and the test procedure. But the test procedure looks like the best way to test the Raft change.
I'll update the summary and update the description to highlight the Raft change.


http://gerrit.cloudera.org:8080/#/c/16830/4/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

http://gerrit.cloudera.org:8080/#/c/16830/4/src/kudu/consensus/consensus_queue.cc@74
PS4, Line 74: 
            : DEFINE_bool(consensus_allow_status_msg_for_failed_peer, false,
            :             "Allows status only Raft messages to be sent to a peer in FAILED_UNRECOVERABLE state.");
            : TAG_FLAG(consensus_allow_status_msg_for_failed_peer, hidden);
            : TAG_FLAG(consensus_allow_status_msg_for_failed_peer, unsafe);
> Do we actually intend on using this flag for our documented procedure to recover masters? If so, it shouldn't be unsafe. Also mark it runtime?

Ack.

> Alternatively, would it make sense to pass some option to the Raft implementation from the master that enables this?

I didn't understand this. Is the intention to restrict turning on this flag only for masters?
One way I can think about doing is honoring this flag only if running in master or clobbering it to false if not running in a master.


http://gerrit.cloudera.org:8080/#/c/16830/4/src/kudu/master/dynamic_multi_master-test.cc
File src/kudu/master/dynamic_multi_master-test.cc:

http://gerrit.cloudera.org:8080/#/c/16830/4/src/kudu/master/dynamic_multi_master-test.cc@139
PS4, Line 139:   void StartClusterWithSysCatalogGCed(vector<HostPort>* master_hps,
> I seem to recall discussing the need to GC in order to have the replica _no
With external mini cluster the test will be very similar to the actual procedure of adding/removing masters. Also I remember when starting to work on the test, things like adding a new master once the IMC is up, restarting a subset of processes didn't look straightforward with IMC.

So apart from the roundabout way of forcing GC, overall external mini cluster does look like a better option.


http://gerrit.cloudera.org:8080/#/c/16830/4/src/kudu/master/dynamic_multi_master-test.cc@154
PS4, Line 154: CHECK_OK
> nit: ASSERT_OK?
Can't use ASSERT_OK() with a lambda that returns a non-void data type.


http://gerrit.cloudera.org:8080/#/c/16830/4/src/kudu/master/dynamic_multi_master-test.cc@171
PS4, Line 171:     int64_t time_left_to_sleep_msecs = 2000;
> nit: more idiomatic to use a MonoTime as a deadline and use MonoTime::Now()
Done


http://gerrit.cloudera.org:8080/#/c/16830/4/src/kudu/master/dynamic_multi_master-test.cc@177
PS4, Line 177: ASSERT_GT(time_left_to_sleep_msecs, 0)
             :       << "Timed out waiting for system catalog WAL to be GC'ed";
> nit: how about assert on the GC count? Since, as long as we GCed, it's not 
Done


http://gerrit.cloudera.org:8080/#/c/16830/4/src/kudu/master/dynamic_multi_master-test.cc@343
PS4, Line 343:     ASSERT_EVENTUALLY([&] {
> nit: I typically find it cleaner to have callers do the ASSERT_EVENTUALLY()
Done


http://gerrit.cloudera.org:8080/#/c/16830/4/src/kudu/master/dynamic_multi_master-test.cc@416
PS4, Line 416:  LOG(INFO) << "Verifying the first table";
             :     // Verify the table created before adding the new master.
             :     shared_ptr<KuduTable> table;
             :     ASSERT_OK(client->OpenTable(kTableName, &table));
             :     KuduScanner scanner1(table.get());
             :     size_t row_count = 0;
             :     ASSERT_OK(client::CountRowsWithRetries(&scanner1, &row_count));
             :     ASSERT_EQ(0, row_count);
             : 
             :     LOG(INFO) << "Creating and verifying the second table";
             :     // Perform an operation that requires replication to masters.
             :     ASSERT_OK(CreateTable(&migrated_cluster, "second_table"));
             :     ASSERT_OK(client->OpenTable("second_table", &table));
             :     KuduScanner scanner2(table.get());
             :     ASSERT_OK(client::CountRowsWithRetries(&scanner2, &row_count));
             :     ASSERT_EQ(0, row_count);
> We're still not necessarily testing that the new master is caught up with t
This verification function is called after verifying that the new master has been promoted to a VOTER. So yeah perhaps not completely caught but good enough from Raft point of view.

I took a look at ClusterVerifier the log verifier step currently only checks for non-system tablets on tablet servers. So could potentially be extended.

From the user point of view it's important that the new master can become leader, so instead opted for transferring leadership to the new master.


http://gerrit.cloudera.org:8080/#/c/16830/4/src/kudu/master/dynamic_multi_master-test.cc@433
PS4, Line 433:     LOG(INFO) << "Pausing and resuming individual masters";
> nit: move this into the if block?
Done


http://gerrit.cloudera.org:8080/#/c/16830/4/src/kudu/master/dynamic_multi_master-test.cc@588
PS4, Line 588:   SKIP_IF_SLOW_NOT_ALLOWED();
             : 
             :   vector<HostPort> master_hps;
             :   NO_FATALS(StartClusterWithSysCatalogGCed(&master_hps,
             :                                            {"--consensus_allow_status_msg_for_failed_peer"}));
             :   ASSERT_OK(CreateTable(cluster_.get(), kTableName));
             :   // Bring up the new master and add to the cluster.
             :   master_hps.emplace_back(reserved_hp_);
             :   NO_FATALS(StartNewMaster(master_hps));
             :   ASSERT_OK(AddMasterToCluster(reserved_hp_));
             : 
             :   // Newly added master will be added to the master Raft config but won't be caught up
             :   // from the WAL and hence remain as a NON_VOTER.
             :   ListMastersResponsePB list_resp;
             :   NO_FATALS(RunListMasters(&list_resp));
             :   ASSERT_EQ(orig_num_masters_ + 1, list_resp.masters_size());
             : 
             :   for (const auto& master : list_resp.masters()) {
             :     ASSERT_EQ(1, master.registration().rpc_addresses_size());
             :     auto hp = HostPortFromPB(master.registration().rpc_addresses(0));
             :     if (hp == reserved_hp_) {
             :       ASSERT_EQ(consensus::RaftPeerPB::NON_VOTER, master.member_type());
             :       ASSERT_TRUE(master.role() == consensus::RaftPeerPB::LEARNER);
             :     }
             :   }
             : 
             :   // Double check by directly contacting the new master.
             :   NO_FATALS(VerifyNewMasterDirectly({ consensus::RaftPeerPB::LEARNER },
             :                                     consensus::RaftPeerPB::NON_VOTER));
             : 
             :   // Verify new master is in FAILED_UNRECOVERABLE state.
             :   NO_FATALS(VerifyNewMasterInFailedUnrecoverableState());
> If this is for the most part a copy TestAddMasterCatchupFromWALNotPossible,
That's a good point, the TestAddMasterCatchupFromWALNotPossible can now be removed.



-- 
To view, visit http://gerrit.cloudera.org:8080/16830
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I142c1dec442ec72c38c5be9d62cdf270e441d6e3
Gerrit-Change-Number: 16830
Gerrit-PatchSet: 4
Gerrit-Owner: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Tue, 15 Dec 2020 21:11:46 +0000
Gerrit-HasComments: Yes

[kudu-CR] [master] KUDU-2181 Procedure for copying sys catalog on adding master

Posted by "Bankim Bhavsar (Code Review)" <ge...@cloudera.org>.
Hello Alexey Serbin, Kudu Jenkins, Andrew Wong, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16830

to look at the new patch set (#7).

Change subject: [master] KUDU-2181 Procedure for copying sys catalog on adding master
......................................................................

[master] KUDU-2181 Procedure for copying sys catalog on adding master

This change outlines procedure to copy system catalog for the newly
added master using existing CLI tools and the master ChangeConfig RPC.

Only functional change is to hookup the master runtime flag
--master_consensus_allow_status_msg_for_failed_peer to Raft consensus.
New master could go into a FAILED_RECOVERABLE state if the leader
master's system catalog WAL has been GC'ed. This change allows the
new master to be promoted after copying the system catalog externally.

Outline of the test procedure:
1) Runtime flag --master_consensus_allow_status_msg_for_failed_peer
must be turned on for existing masters.
2) Start the new master with
--master_address_add_new_master=<new-master-hostport> and
--master_addresses that contains itself and existing masters.
3) Invoke ChangeConfig to add the master.
4) Verify the new master is part of the Raft config even if it's a
LEARNER/NON_VOTER or goes into FAILED_RECOVERABLE state. If not,
above steps will have to be repeated.
5) If the new master is promoted to being a VOTER then following tablet
copy steps can be skipped.
6) Shutdown the new master.
7) Delete the system catalog on the new master.
8) Copy the system catalog from the leader master to the new master.
9) Bring up the new master.
10) Verify the new master is promoted as VOTER.
  If the new master doesn't get promoted to a VOTER then double check
  whether the new master is part of the Raft config for masters by
  running "kudu master list".
    - If yes, repeat procedure from step 6.
    - Else repeat the entire procedure

Change-Id: I142c1dec442ec72c38c5be9d62cdf270e441d6e3
---
M src/kudu/master/dynamic_multi_master-test.cc
M src/kudu/master/sys_catalog.cc
2 files changed, 294 insertions(+), 141 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/30/16830/7
-- 
To view, visit http://gerrit.cloudera.org:8080/16830
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I142c1dec442ec72c38c5be9d62cdf270e441d6e3
Gerrit-Change-Number: 16830
Gerrit-PatchSet: 7
Gerrit-Owner: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)

[kudu-CR] [consensus] KUDU-2181 Allow sending status msgs to FAILED peers

Posted by "Bankim Bhavsar (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins, Andrew Wong, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/16830

to look at the new patch set (#5).

Change subject: [consensus] KUDU-2181 Allow sending status msgs to FAILED peers
......................................................................

[consensus] KUDU-2181 Allow sending status msgs to FAILED peers

This change adds the ability for the leader to send status only messages
to the peer even if it's in FAILED_UNRECOVERABLE state. Without this
change when the system catalog is copied externally the new master
remains in FAILED_UNRECOVERABLE state and doesn't get promoted to being
a VOTER despite the system catalog being up to date. This behavior is
currently disabled by default under the
--consensus_allow_status_msg_for_failed_peer flag.

Bulk of the change is a test that outlines the procedure to copy system
catalog for the newly added master using existing CLI tools and
the master ChangeConfig RPC.
- Flag --consensus_allow_status_msg_for_failed_peer must be turned on
for existing masters.
- Start the new master with
--master_address_add_new_master=<new-master-hostport> and
--master_addresses that contains itself and existing masters.
- Invoke ChangeConfig to add the master.
- If the new master is promoted to being a VOTER then following tablet
copy steps can be skipped.
- Shutdown the new master.
- Delete the system catalog on the new master.
- Copy the system catalog from the leader master to the new master.
- Bring up the new master.
- Verify the new master is promoted as VOTER.

Change-Id: I142c1dec442ec72c38c5be9d62cdf270e441d6e3
---
M src/kudu/consensus/consensus_queue.cc
M src/kudu/master/dynamic_multi_master-test.cc
2 files changed, 279 insertions(+), 137 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/30/16830/5
-- 
To view, visit http://gerrit.cloudera.org:8080/16830
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I142c1dec442ec72c38c5be9d62cdf270e441d6e3
Gerrit-Change-Number: 16830
Gerrit-PatchSet: 5
Gerrit-Owner: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)

[kudu-CR] [consensus] KUDU-2181 Allow sending status msgs to FAILED peers

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/16830 )

Change subject: [consensus] KUDU-2181 Allow sending status msgs to FAILED peers
......................................................................


Patch Set 5:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/16830/5//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/16830/5//COMMIT_MSG@32
PS5, Line 32: Verify the new master is promoted as VOTER
What if it's not promoted to a voter role?  What's the recovery scenario?


http://gerrit.cloudera.org:8080/#/c/16830/5/src/kudu/master/dynamic_multi_master-test.cc
File src/kudu/master/dynamic_multi_master-test.cc:

http://gerrit.cloudera.org:8080/#/c/16830/5/src/kudu/master/dynamic_multi_master-test.cc@168
PS5, Line 168: Need to create around 1k tables even with lowest flush threshold and log segment size.
Might it be helpful to add dimension_label for a table to decrease the number of tables to create?  I recall that helped with a test scenario to reproduce KUDU-3036: https://github.com/apache/kudu/commit/b4c317299f39ae0530cb10702e540cd22393656c

There are also options to use extra_config and comment for a column if thinking to increase per-table amount of data in the system catalog.



-- 
To view, visit http://gerrit.cloudera.org:8080/16830
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I142c1dec442ec72c38c5be9d62cdf270e441d6e3
Gerrit-Change-Number: 16830
Gerrit-PatchSet: 5
Gerrit-Owner: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Tue, 15 Dec 2020 22:14:48 +0000
Gerrit-HasComments: Yes

[kudu-CR] [master] KUDU-2181 Procedure for copying sys catalog on adding master

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/16830 )

Change subject: [master] KUDU-2181 Procedure for copying sys catalog on adding master
......................................................................


Patch Set 9: Code-Review+2

(1 comment)

It's already merged, but just in case :)

http://gerrit.cloudera.org:8080/#/c/16830/5/src/kudu/master/dynamic_multi_master-test.cc
File src/kudu/master/dynamic_multi_master-test.cc:

http://gerrit.cloudera.org:8080/#/c/16830/5/src/kudu/master/dynamic_multi_master-test.cc@168
PS5, Line 168:  i;
> I tried adding a long dimension label on creating tables but that didn't he
ack



-- 
To view, visit http://gerrit.cloudera.org:8080/16830
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I142c1dec442ec72c38c5be9d62cdf270e441d6e3
Gerrit-Change-Number: 16830
Gerrit-PatchSet: 9
Gerrit-Owner: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Thu, 07 Jan 2021 20:06:41 +0000
Gerrit-HasComments: Yes

[kudu-CR] [master] KUDU-2181 Procedure for copying sys catalog on adding master

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/16830 )

Change subject: [master] KUDU-2181 Procedure for copying sys catalog on adding master
......................................................................


Patch Set 7: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/16830
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I142c1dec442ec72c38c5be9d62cdf270e441d6e3
Gerrit-Change-Number: 16830
Gerrit-PatchSet: 7
Gerrit-Owner: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Wed, 06 Jan 2021 01:17:59 +0000
Gerrit-HasComments: No

[kudu-CR] [consensus] KUDU-2181 Allow sending status msgs to FAILED peers

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/16830 )

Change subject: [consensus] KUDU-2181 Allow sending status msgs to FAILED peers
......................................................................


Patch Set 5:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/16830/5/src/kudu/master/dynamic_multi_master-test.cc
File src/kudu/master/dynamic_multi_master-test.cc:

http://gerrit.cloudera.org:8080/#/c/16830/5/src/kudu/master/dynamic_multi_master-test.cc@424
PS5, Line 424:     for (int i = 0; i < migrated_cluster.num_masters(); i++) {
             :       ASSERT_OK(migrated_cluster.master(i)->WaitForCatalogManager());
             :     }
nit: does it makes sense to ensure that new_master_ is now a part of masters returned by migrated_cluster.master(i), i.e.  that UUID returned by new_master_->uuid() is among those which are in migrated_cluster?


http://gerrit.cloudera.org:8080/#/c/16830/5/src/kudu/master/dynamic_multi_master-test.cc@438
PS5, Line 438:         ASSERT_EQ(1, master.registration().rpc_addresses_size());
             :         HostPort actual_hp = HostPortFromPB(master.registration().rpc_addresses(0));
             :         ASSERT_TRUE(std::find(master_hps.begin(), master_hps.end(), actual_hp) != master_hps.end());
             :  
Does it make sense to check there aren't any duplicates among the registered RPC master addresses in the migrated cluster?


http://gerrit.cloudera.org:8080/#/c/16830/5/src/kudu/master/dynamic_multi_master-test.cc@467
PS5, Line 467: master one
nit: one master


http://gerrit.cloudera.org:8080/#/c/16830/5/src/kudu/master/dynamic_multi_master-test.cc@475
PS5, Line 475: kTableName
To harden the use case, does it make sense to try opening a table that was created at the previous iteration of this cycle, if any (i.e. start opening "table-0" at iteration 1 instead of "first_table", and so on)?



-- 
To view, visit http://gerrit.cloudera.org:8080/16830
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I142c1dec442ec72c38c5be9d62cdf270e441d6e3
Gerrit-Change-Number: 16830
Gerrit-PatchSet: 5
Gerrit-Owner: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Bankim Bhavsar <ba...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Tue, 15 Dec 2020 22:46:48 +0000
Gerrit-HasComments: Yes