You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org> on 2017/05/04 22:40:19 UTC

[kudu-CR] [catalog manager] categorization of rw operation failures

David Ribeiro Alves has posted comments on this change.

Change subject: [catalog_manager] categorization of rw operation failures
......................................................................


Patch Set 26:

(25 comments)

http://gerrit.cloudera.org:8080/#/c/6170/26/src/kudu/integration-tests/catalog_manager_tsk-itest.cc
File src/kudu/integration-tests/catalog_manager_tsk-itest.cc:

PS26, Line 82:  We want this to happen as often as possible, but
             :       // due to truncation issues (double --> int64_t) two TSK keys might be
             :       // generated with interval less than 1 second, which is not desirable.
not sure what's the intent of this sentence? do you do something to mitigate the "undesirability"?


PS26, Line 88: master_tsk_propagated_by_non_leaders
this flag is a bit cryptic, at first I thought this was referring to tablet servers. not sure it's defined in this patch but master_non_leader_masters_propagate_tsk would be clearer, IMO


PS26, Line 114: builder
              :         .default_admin_operation_timeout(timeout)
no need to wrap, in fact could you do all of this in the decl line?


Line 126:     // signed by the key which hasn't yet reached the tserver. This can
with a key which hasn't yet reached the tserver or non-leader master (or somesuch)


PS26, Line 132: IPKI
can you TODO yourself?


PS26, Line 134: Ideally, (hb_interval_ms_ * 2) would be enough, but due
              :     //   to network latency and multi-process OS, the factor should be greater.
how about: We allow for extra slop beyond the minimum (hb_interval_ms_ * 2) to account for network latency and thread preemption.


PS26, Line 136: SleepFor(MonoDelta::FromMilliseconds(hb_interval_ms_ * 10));
these timeouts always come back to bite us in the form of flakyness. could we sleep the minimum and keep retrying for a while (always making sure that we get the expected error and none other)


PS26, Line 137: //    std::min<int64_t>(run_time_seconds_ * 1000 / 2, hb_interval_ms_ * 50)));
did you mean to leave this?


PS26, Line 185: //SleepFor(MonoDelta::FromMilliseconds(hb_interval_ms_));
left over garbage?


http://gerrit.cloudera.org:8080/#/c/6170/26/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

PS26, Line 471: const
we never declare status const. I know you can here, but I haven't seen that anywhere, so I'd prefer to follow the same style


PS26, Line 484: const
same


PS26, Line 492: LOG(WARNING) << err_msg << "will try next time";
this is not very informative


PS26, Line 494: has left
where did it go? :)


PS26, Line 762: const
same


PS26, Line 777:  leadership change in the middle:
              :     // if the master server loses its leadership role, then there will be an
              :     // error if trying to write into the system table. If that happens, we do
              :     // not activate the newly generated CA information since it is no longer
              :     // relevant. If it was leadership change indeed, the system table should
              :     // contain the CA certificate information when this master is re-elected
              :     // next.
numbered bullets or some better way to lay this out. it's hard to parse as it is


PS26, Line 794:  
a


PS26, Line 857: LOG(INFO) << "Generated new certificate authority record";
can you provide more info? it'd be good to print a seq no or some identifier


PS26, Line 914: using std::bind;
using declarations go on the top of the file


PS26, Line 919:     auto wrapper = [this](std::function<Status()> func,
              :                           const Consensus& consensus,
              :                           int64_t start_term,
              :                           const char* op_description) {
              :       leader_lock_.AssertAcquiredForWriting();
              :       const Status s = func();
not really sure what's happening here. can you simplify?


PS26, Line 961: static const char* const kLoadMetaOpDescription =
              :         "Loading table and tablet metadata into memory";
can you declare this const on the top of the file?


PS26, Line 3231:     LOG_WITH_PREFIX(INFO)
               :         << "Latest config for has opid_index of " << latest_index
               :         << " while this task has opid_index of "
               :         << cstate_.config().opid_index() << ". Aborting task.";
spurious reformat


PS26, Line 3285:       LOG_WITH_PREFIX(WARNING)
               :           << "ChangeConfig() failed with leader "
               :           << target_ts_desc_->ToString()
               :           << " due to CAS failure. No further retry: "
               :           << status.ToString();
               :       MarkFailed();
               :       break;
               :     default:
               :       LOG_WITH_PREFIX(INFO)
               :           << "ChangeConfig() failed with leader "
               :           << target_ts_desc_->ToString()
               :           << " due to error "
               :           << TabletServerErrorPB::Code_Name(resp_.error().code())
               :           << ". This operation will be retried. Error detail: "
               :           << status.ToString();
               :       break;
spurious reformat


http://gerrit.cloudera.org:8080/#/c/6170/26/src/kudu/master/catalog_manager.h
File src/kudu/master/catalog_manager.h:

PS26, Line 358: changed from the initial.
what's the initial


PS26, Line 568: of the object
which object?


PS26, Line 607: load TSK records from the system table and import them into the
              :   // TokenSigner. After doing that check whether it's time to generate new
              :   // token signing key; if so, then generate and store the new one in the system
              :   // catalog table. Besides, purge the expired TSKs from the system table.
can you split this into numbered steps (1, 1.1, 1.2, 2) etc


-- 
To view, visit http://gerrit.cloudera.org:8080/6170
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I826826049e3c08a6c8345949690cbbedaea32ff8
Gerrit-PatchSet: 26
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes