You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Hannah Nguyen (Code Review)" <ge...@cloudera.org> on 2019/10/02 02:10:25 UTC

[kudu-CR] WIP KUDU-2780: create thread for auto-rebalancing

Hannah Nguyen has posted comments on this change. ( http://gerrit.cloudera.org:8080/14177 )

Change subject: WIP KUDU-2780: create thread for auto-rebalancing
......................................................................


Patch Set 5:

(64 comments)

http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer-test.cc
File src/kudu/master/auto_rebalancer-test.cc:

http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer-test.cc@50
PS5, Line 50: FLAGS_auto_rebalancing_stop_flag
> You probably meant to set the interval here.
Done


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer-test.cc@87
PS5, Line 87: 
            :   if (!AllowSlowTests()) {
            :     LOG(WARNING) << "test is skipped; set KUDU_ALLOW_SLOW_TESTS=1 to run";
            :     return;
            :   }
> nit: you can use SKIP_IF_SLOW_NOT_ALLOWED() for this. Same elsewhere.
Done


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer-test.cc@111
PS5, Line 111:     }
             :     else {
> Join the lines for better readability.
Done


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer-test.cc@104
PS5, Line 104:   SleepFor(MonoDelta::FromSeconds(10));
             : 
             :   int leader_idx;
             :   ASSERT_OK(cluster_->GetLeaderMasterIndex(&leader_idx));
             :   for (int i=0; i<kNumMasters; i++) {
             :     if (i==leader_idx) {
             :       ASSERT_NE(0, number_of_loop_iterations(i));
             :     }
             :     else {
             :       ASSERT_EQ(0, moves_scheduled_this_round(i));
             :     }
             :   }
             : 
> Rather than sleeping this long, how about asserting with ASSERT_EVENTUALLY 
Done


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer-test.cc@117
PS5, Line 117:   cluster_->Shutdown();
> nit: don't need this since it gets called in TearDown()
Done


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer-test.cc@144
PS5, Line 144:   auto iterations = number_of_loop_iterations(new_leader_idx);
             :   ASSERT_LT(0, iterations);
> Should this be wrapped in ASSERT_EVENTUALLY? What if the thread hasn't star
Done


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer-test.cc@149
PS5, Line 149: TEST_F(AutoRebalancerTest, NoReplicaMovesIfBalanced) {
> Consider breaking this into smaller tests so the cases are parallelizable.
Done


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer-test.cc@200
PS5, Line 200:     num_tservers = 2;
             :     num_tablets = 1;
> It doesn't seem particularly useful to keep defining these separately, and 
Done


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer-test.cc@224
PS5, Line 224: SleepFor(MonoDelta::FromSeconds(10));
             :     ASSERT_OK(cluster_->GetLeaderMasterIndex(&leader_idx));
             :     ASSERT_EQ(moves_scheduled_this_round(leader_idx), 0);
> I wonder if it's sufficient to just define "no moves" as having the loop it
Done


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer-test.cc@263
PS5, Line 263: ASSERT_GE
> ASSERT_EQ()?
Done


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.h
File src/kudu/master/auto_rebalancer.h:

http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.h@63
PS5, Line 63:   // the catalog manager must be in the process of initializing.
> nit: Why does this need to be the case? The only invariant I see right now 
Done


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.h@71
PS5, Line 71: int number_of_loop_iterations;
            :   int moves_scheduled_this_round;
> Nit: could you suffix these with _for_test so it's obvious that their acces
Done


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.h@70
PS5, Line 70:   // Variables for testing.
            :   int number_of_loop_iterations;
            :   int moves_scheduled_this_round;
> Is it possible to make this variables private and declare the test that acc
Done


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.h@85
PS5, Line 85: const boost::optional<std::string>& location
> Please document the semantics of the 'location'  parameter -- it's not self
Done


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.h@88
PS5, Line 88:   Status GetMovesUsingRebalancingAlgo(
            :       const rebalance::ClusterRawInfo& raw_info,
            :       std::vector<rebalance::Rebalancer::ReplicaMove>* replica_moves,
            :       rebalance::RebalancingAlgo* algo,
            :       bool cross_location = false) const;
            : 
            :   Status GetMoves(std::vector<rebalance::Rebalancer::ReplicaMove>* replica_moves,
            :                   bool* policy_fixing) const;
            : 
            :   Status GetTabletLeader(
            :       const std::string& tablet_id,
            :       std::string* leader_uuid,
            :       HostPort* leader_hp) const;
            : 
            :   Status ExecuteMoves(
            :       const std::vector<rebalance::Rebalancer::ReplicaMove>& replica_moves,
            :       const bool& policy_fixing);
> nit: add docs. What do these and their arguments do?
Done


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.h@122
PS5, Line 122: rebalancer_
> Does this _need_ to be heap allocated?
Done


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.h@133
PS5, Line 133: 
> nit: an  extra line
Done


http://gerrit.cloudera.org:8080/#/c/14177/3/src/kudu/master/auto_rebalancer.cc
File src/kudu/master/auto_rebalancer.cc:

http://gerrit.cloudera.org:8080/#/c/14177/3/src/kudu/master/auto_rebalancer.cc@377
PS3, Line 377:   unordered_set<string> tablets_in_move;
             :   for (const auto& move : moves) {
             :     vector<string> tablet_ids;
             :     RETURN_NOT_OK(rebalancer_->FindReplicas(move, raw_info, &tablet_ids));
             :     if (cross_location) {
             :       // In case of cross-location (a.k.a. inter-location) rebalancing it is
             :       // necessary to make sure the majority of replicas would not end up
             :       // at the same location after the move. If so, remove those tablets
             :       // from the list of candidates.
             :       RETURN_NOT_OK(rebalancer_->FilterCrossLocationTabletCandidates(
             :           cluster_info.locality.location_by_ts_id, tpi, move, &tablet_ids));
             :     }
             :     // Shuffle the set of the tablet identifiers: that's to achieve even spread
             :     // of moves across tables with the same skew.
             :     std::shuffle(tablet_ids.begin(), tablet_ids.end(), random_device());
             :     string move_tablet_id;
             :     for (const auto& tablet_id : tablet_ids) {
             :       if (tablets_in_move.find(tablet_id) == tablets_in_move.end()) {
             :         // For now, choose the very first tablet that does not have replicas
             :         // in move. Later on, additional logic might be added to find
             :         // the best candidate.
             :         move_tablet_id = tablet_id;
             :         break;
             :       }
             :     }
> As a reviewer, my main gripe with this is that in reviewing this patch, it'
Done


http://gerrit.cloudera.org:8080/#/c/14177/3/src/kudu/master/auto_rebalancer.cc@637
PS3, Line 637: !location) {
> nit: this is much less readable than specifying the actual type
Done


http://gerrit.cloudera.org:8080/#/c/14177/4/src/kudu/master/auto_rebalancer.cc
File src/kudu/master/auto_rebalancer.cc:

http://gerrit.cloudera.org:8080/#/c/14177/4/src/kudu/master/auto_rebalancer.cc@570
PS4, Line 570:   ts_manager->GetDescriptorsAvailableForPlacement(&descriptors);
> error: no member named 'GetAllLiveDescriptors' in 'kudu::master::TSManager'
Done


http://gerrit.cloudera.org:8080/#/c/14177/4/src/kudu/master/auto_rebalancer.cc@693
PS4, Line 693:   bool table_skew = max_table_skew-min_table_skew > 1;
> warning: redundant boolean literal in conditional return statement [readabi
Done


http://gerrit.cloudera.org:8080/#/c/14177/4/src/kudu/master/auto_rebalancer.cc@711
PS4, Line 711:     return !IsClusterOrTablesSkewed(cluster_info_);
> warning: redundant boolean literal in conditional return statement [readabi
Done


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc
File src/kudu/master/auto_rebalancer.cc:

http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@85
PS5, Line 85: using kudu::rpc::Messenger;
> warning: using decl 'Messenger' is unused [misc-unused-using-decls]
Done


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@101
PS5, Line 101: auto_rebalancing_stop_flag
> Yep, that sounds good to me.  And if I remember correctly, that was one the
Done


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@139
PS5, Line 139:   
> nit: should be 4 spaces for this sort of 'continuation indent'
Done


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@147
PS5, Line 147: 300
> Do we want to make this parameter configurable or the hard-coded value is g
This value isn't actually used in this implementation of the auto-rebalancer, just in RebalancerTool::RunWith().
I filled in the default value of the flag here to build the Config struct. What's the best practice here for this field?


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@177
PS5, Line 177: DCHECK
> What's the reason of using DCHECK() here but using CHECK() for thread_ vali
Done


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@196
PS5, Line 196:       CatalogManager::ScopedLeaderSharedLock l(catalog_manager_);
> Taking a lock and then sleeping at line 200 doesn't look good.
Done


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@200
PS5, Line 200:         SleepFor(MonoDelta::FromSeconds(FLAGS_auto_rebalancing_interval_seconds));
> Should we just sleep at the top of the loop so we don't have to repeat this
Done


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@209
PS5, Line 209: cluster_info_
> Nit: since we're blowing these away every iteration, how about just making 
Done


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@212
PS5, Line 212:     Status s;
             :     s = BuildClusterRawInfo(boost::none, &raw_info_);
> nit: why not just
Done


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@215
PS5, Line 215: s.ToString() << ": could not retrieve cluster info";
> nit: consider using Substitute() from gutil/strings/substitute.h here, e.g.
Done


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@221
PS5, Line 221: (s.ok())
> nit: drop the extra parens; same elsewhere
Done


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@238
PS5, Line 238: this->I
> nit: drop this->?
Done


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@245
PS5, Line 245:     if (!FLAGS_auto_rebalancing_stop_flag) {
> We should probably move this check to the top of the loop so we don't go th
Done


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@274
PS5, Line 274: bool* policy_fixing
> nit: it wasn't immediately obvious to me from "policy fixing" that this is 
Done


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@286
PS5, Line 286: replica_moves
> Nit: space before
Done


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@334
PS5, Line 334: vector<Rebalancer::ReplicaMove>* replica_moves,
             :   rebalance::RebalancingAlgo* algo,
             :   bool cross_location
> nit: mind reordering these so the pure out param (replica_moves) comes afte
Done


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@336
PS5, Line 336: bool cross_location
> nit: for readability, consider using an enum like CrossLocations::{YES, NO}
Done


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@433
PS5, Line 433: 
> nit: extra line
Done


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@460
PS5, Line 460: 
> nit: extra line
Done


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@462
PS5, Line 462:   if (policy_fixing)
> What is the actual difference between policy_fixing clause and the !policy_
Done


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@474
PS5, Line 474:       unique_ptr<ConsensusServiceProxy> proxy;
> Does this have to be heap-allocated? Same elsewhere.
making it a shared_ptr to be able to reuse the TSdescriptor's GetConsensusProxy()


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@491
PS5, Line 491: MonoDelta::FromSeconds(60)
> Where does this timeout comes from?  Should it be configurable?
Made it a configurable gflag!


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@500
PS5, Line 500: 
> nit: extra line
Done


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@513
PS5, Line 513:       HostPort hp;
             :       RETURN_NOT_OK(hp.ParseString(leader_uuid, leader_hp.port()));
             :       vector<Sockaddr> resolved;
             :       RETURN_NOT_OK(hp.ResolveAddresses(&resolved));
             :       proxy.reset(new ConsensusServiceProxy(messenger_, resolved[0], hp.host()));
> At lines 528-533 TSDescriptor for dst_ts_uuid is fetched.  TSDescriptor has
Done


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@564
PS5, Line 564:   vector<ServerHealthSummary> tserver_summaries;
             :   vector<TableSummary> table_summaries;
             :   vector<TabletSummary> tablet_summaries;
> Since the code below uses emplace_back()/push_back() to fill this vectors, 
Done


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@577
PS5, Line 577: (ts->last_address())
> Are the outer parentheses really necessary?
Done


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@577
PS5, Line 577:     summary.ts_location = (ts->last_address()).ToString();
> How come address (i.e. HostPort) became a location for a tablet server?  Th
Done


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@594
PS5, Line 594: SysTablesEntryPB table_data = table->metadata().state().pb;
> Is it possible to get a constant reference to the SysTablesEntryPB here?
Done


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@601
PS5, Line 601: 
> What about TableMetadataLock when retrieving information on all its tablets
Done


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@615
PS5, Line 615: ReplicaTypeFilter::ANY_REPLICA
> Are we really interested in non-voters as well while rebalancing?
Done


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@622
PS5, Line 622: TSInfoPB ts_info
> Why not to use a reference here?
Done


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@626
PS5, Line 626:         rep.ts_address = Substitute("$0:$1", addr.host(), addr.port());
> What about filling in 'is_leader' field for ReplicaSummary?
Done


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@627
PS5, Line 627:         replicas.push_back(rep);
> What about filling in the 'consensus_state' field for ReplicaSummary?  As I
Done


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@641
PS5, Line 641:     raw_info->table_summaries = std::move(table_summaries);
> nit: for readability, add return Status::OK() and move the code out of the 
Done


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@680
PS5, Line 680: namespace {
> nit: add a line before 'namespace'
Done


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@681
PS5, Line 681: Is
> nit: sounds a bit awkward. Maybe remove? or "AreCluster.."?
Done


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@682
PS5, Line 682: 
> nit: extra line
Done


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@687
PS5, Line 687: max_replica_count-min_replica_count
> nit: add spaces, same below
Done


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@691
PS5, Line 691: const auto min_table_skew = table_skew_info.begin()->first;
             :   const auto m
> const auto&?
Done


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/auto_rebalancer.cc@709
PS5, Line 709:   // One location: table and cluster skew?
> nit: complete sentences in comments are preferred. Same elsewhere.
Done


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/master/catalog_manager.cc@5202
PS5, Line 5202: 
              : TSManager* CatalogManager::ts_manager() const {
              :   return master_->ts_manager();
              : }
              : 
> nit: Rather than exposing this from the catalog manager, how about passing 
Done


http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/rebalance/rebalancer.h
File src/kudu/rebalance/rebalancer.h:

http://gerrit.cloudera.org:8080/#/c/14177/5/src/kudu/rebalance/rebalancer.h@147
PS5, Line 147:   // Public accessors for fields of Rebalancer's config object.
             :   bool ShouldRunPolicyFixer() const;
             :   bool ShouldRunCrossLocationRebalancing() const;
             :   bool ShouldRunIntraLocationRebalancing()  const;
> Don't need these anymore
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/14177
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ifca25d1063c07047cf2123e6792b3c7395be20e4
Gerrit-Change-Number: 14177
Gerrit-PatchSet: 5
Gerrit-Owner: Hannah Nguyen <ha...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Hannah Nguyen <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Comment-Date: Wed, 02 Oct 2019 02:10:25 +0000
Gerrit-HasComments: Yes