You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Adar Dembo (Code Review)" <ge...@cloudera.org> on 2017/09/18 00:09:10 UTC

[kudu-CR] KUDU-1125: issue one catalog write per tablet report

Hello Dan Burkert, Mike Percy, Todd Lipcon,

I'd like you to do a code review.  Please visit

    http://gerrit.cloudera.org:8080/8090

to review the following change.

Change subject: KUDU-1125: issue one catalog write per tablet report
......................................................................

KUDU-1125: issue one catalog write per tablet report

This commit addresses a long-standing issue in the catalog manager where an
independent catalog write is performed for each reported tablet. When the
master is configured to fsync WAL writes, this can add a lot of load during
election storms and when the master is restarted.

To tackle this, I fully rewrote ProcessTabletReport and friends. I had
higher hopes for the final product, but all of the dependent control flow
complicates decomposition. Still, I managed to make some improvements. For
example, all RPCs are now sent at the end in one go rather than piecemeal. I
also rewrote all of the comments so that they can serve as a map to the
function and to emphasize the actions performed by the corresponding code.

Here are the actual semantic changes being made:
- Table and tablet locks are now acquired en masse. For tablets, this is
  required for correctness; I've documented why I did the same for tables.
- We no longer check for non-running tables. AFAICT this was a useless check
  because a non-running table must be a deleted table, and there's a check
  for that just before.
- We no longer wake the BgTasks thread upon completion. This is because:
  1. WakeIfHasPendingUpdates() was semi-broken; pending_updates_ is never
     set in ProcessTabletReport.
  2. Regardless, there's no work for the BgTasks thread to do.

Change-Id: Ie6f5cf0e4b1cd1160b3b310d89c6dbf3dd62e43b
---
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
2 files changed, 380 insertions(+), 408 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/90/8090/1
-- 
To view, visit http://gerrit.cloudera.org:8080/8090
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Ie6f5cf0e4b1cd1160b3b310d89c6dbf3dd62e43b
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-1125: issue one catalog write per tablet report

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has uploaded a new patch set (#3).

Change subject: KUDU-1125: issue one catalog write per tablet report
......................................................................

KUDU-1125: issue one catalog write per tablet report

This commit addresses a long-standing issue in the catalog manager where an
independent catalog write is performed for each reported tablet. When the
master is configured to fsync WAL writes, this can add a lot of load during
election storms and when the master is restarted.

To tackle this, I fully rewrote ProcessTabletReport and friends. I had
higher hopes for the final product, but all of the dependent control flow
complicates decomposition. Still, I managed to make some improvements. For
example, all RPCs are now sent at the end in one go rather than piecemeal. I
also rewrote all of the comments so that they can serve as a map to the
function and to emphasize the actions performed by the corresponding code.

Here are the actual semantic changes being made:
- Table and tablet locks are now acquired en masse. For tablets, this is
  required for correctness; I've documented why I did the same for tables.
- We no longer check for non-running tables. AFAICT this was a useless check
  because a non-running table must be a deleted table, and there's a check
  for that just before.
- We no longer wake the BgTasks thread upon completion. This is because:
  1. WakeIfHasPendingUpdates() was semi-broken; pending_updates_ is never
     set in ProcessTabletReport.
  2. Regardless, there's no work for the BgTasks thread to do.

Change-Id: Ie6f5cf0e4b1cd1160b3b310d89c6dbf3dd62e43b
---
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
2 files changed, 380 insertions(+), 408 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/90/8090/3
-- 
To view, visit http://gerrit.cloudera.org:8080/8090
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ie6f5cf0e4b1cd1160b3b310d89c6dbf3dd62e43b
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-1125: issue one catalog write per tablet report

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Hello Mike Percy, Dan Burkert, Kudu Jenkins, Todd Lipcon, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8090

to look at the new patch set (#6).

Change subject: KUDU-1125: issue one catalog write per tablet report
......................................................................

KUDU-1125: issue one catalog write per tablet report

This commit addresses a long-standing issue in the catalog manager where an
independent catalog write is performed for each reported tablet. When the
master is configured to fsync WAL writes, this can add a lot of load during
election storms and when the master is restarted.

To tackle this, I fully rewrote ProcessTabletReport and friends. I had
higher hopes for the final product, but all of the dependent control flow
complicates decomposition. Still, I managed to make some improvements. For
example, all RPCs are now sent at the end in one go rather than piecemeal. I
also rewrote all of the comments so that they can serve as a map to the
function and to emphasize the actions performed by the corresponding code.

Here are the actual semantic changes being made:
- Table and tablet locks are now acquired en masse. For tablets, this is
  required for correctness; I've documented why I did the same for tables.
- We no longer check for non-running tables. AFAICT this was a useless check
  because a non-running table must be a deleted table, and there's a check
  for that just before.
- We no longer wake the BgTasks thread upon completion. This is because:
  1. WakeIfHasPendingUpdates() was semi-broken; pending_updates_ is never
     set in ProcessTabletReport.
  2. Regardless, there's no work for the BgTasks thread to do.

Change-Id: Ie6f5cf0e4b1cd1160b3b310d89c6dbf3dd62e43b
---
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
2 files changed, 383 insertions(+), 402 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/90/8090/6
-- 
To view, visit http://gerrit.cloudera.org:8080/8090
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ie6f5cf0e4b1cd1160b3b310d89c6dbf3dd62e43b
Gerrit-Change-Number: 8090
Gerrit-PatchSet: 6
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-1125: issue one catalog write per tablet report

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/8090 )

Change subject: KUDU-1125: issue one catalog write per tablet report
......................................................................


Patch Set 6:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/8090/6//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/8090/6//COMMIT_MSG@10
PS6, Line 10: When the
            : master is configured to fsync WAL writes, this can add a lot of load during
            : election storms and when the master is restarted.
> any chance you did a before/after test here? eg start a small cluster with 
Haven't done any perf testing yet. I was planning to do it post-merge, but I'll tackle it now instead.

I'll also try to experiment to see whether your estimate for max tablets is correct. It's tough because although it's pretty easy to trigger a full report (restart the master or a tserver), the underlying write performed by that report is likely to be sparse because 3b6ab64 will filter out all tablets that haven't actually changed. Meanwhile, events that do yield real changes (such as an election storm or the creation of a table) tend to be staggered over time. In short, it's hard to create a full report with nothing but real changes.


http://gerrit.cloudera.org:8080/#/c/8090/6/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

http://gerrit.cloudera.org:8080/#/c/8090/6/src/kudu/master/catalog_manager.cc@3106
PS6, Line 3106:                "num_tablets", full_report.updated_tablets_size());
> would be nice to also use TRACE_COUNTER_INCREMENT("reported_tablets", updat
Done


http://gerrit.cloudera.org:8080/#/c/8090/6/src/kudu/master/catalog_manager.cc@3122
PS6, Line 3122:   unordered_map<string, ReportedTabletUpdatesPB*> updates;
> worth adding to the comment explaining ownership of these raw pointers
Done


http://gerrit.cloudera.org:8080/#/c/8090/6/src/kudu/master/catalog_manager.cc@3151
PS6, Line 3151:     ReportedTabletUpdatesPB* update = full_report_update->add_tablets();
> to cut down on small allocations here, consider doing full_report_update->m
Done


http://gerrit.cloudera.org:8080/#/c/8090/6/src/kudu/master/catalog_manager.cc@3157
PS6, Line 3157:       shared_lock<LockType> l(lock_);
> I think this would likely perform better if you can acquire this once for t
I think I can do that without much refactoring, just need a new scope here.


http://gerrit.cloudera.org:8080/#/c/8090/6/src/kudu/master/catalog_manager.cc@3227
PS6, Line 3227: bootstrapping
> should this now say "copying"?
Yeah. The original wording was "This prevents us from spuriously deleting replicas that have just been added to a pending config and are in the process of catching up to the log entry where they were added to the config."

Dan asked me whether it should say "just been added to the committed config and are in the process of bootstrapping" and I misinterpreted his suggestion as "change the wording to this", though that wasn't his intent.


http://gerrit.cloudera.org:8080/#/c/8090/6/src/kudu/master/catalog_manager.cc@3249
PS6, Line 3249: with an error
> nit: rephrase as "non-deleted tablet which is reporting an error" or somesu
Done


http://gerrit.cloudera.org:8080/#/c/8090/6/src/kudu/master/catalog_manager.cc@3268
PS6, Line 3268:       if (!cstate.committed_config().has_opid_index()) {
> nit: would it be possible to move this if statement up a few lines before c
Yeah, I think that'd be safe.


http://gerrit.cloudera.org:8080/#/c/8090/6/src/kudu/master/catalog_manager.cc@3285
PS6, Line 3285:       // TODO(adar): ShouldTransitionTabletToRunning doesn't take step 7b into
              :       // account. Should it?
> not sure I understand this question -- isn't the condition from ShouldTrans
This TODO is trying to draw attention to the fact that ShouldTransitionTabletToRunning when examining the replica's ConsensusStatePB, goes straight to report rather than looking at the local 'cstate' (which may have had its leader_uuid() cleared).

I am comfortable changing it to use 'cstate', but I wasn't sure whether that'd be correct.


http://gerrit.cloudera.org:8080/#/c/8090/6/src/kudu/master/catalog_manager.cc@3344
PS6, Line 3344:         ConsensusStatePB prev_cstate = tablet->metadata().state().pb.consensus_state();
> this is a little confusing - this is shadowing a variable with the same nam
Ugh, yes. Given the COW nature of the persistent data, it should be safe to reuse the existing prev_cstate even though we're mutating the cstate.



-- 
To view, visit http://gerrit.cloudera.org:8080/8090
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie6f5cf0e4b1cd1160b3b310d89c6dbf3dd62e43b
Gerrit-Change-Number: 8090
Gerrit-PatchSet: 6
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Tue, 03 Oct 2017 01:08:42 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-1125: issue one catalog write per tablet report

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/8090 )

Change subject: KUDU-1125: issue one catalog write per tablet report
......................................................................


Patch Set 8: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/8090
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie6f5cf0e4b1cd1160b3b310d89c6dbf3dd62e43b
Gerrit-Change-Number: 8090
Gerrit-PatchSet: 8
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Sat, 07 Oct 2017 02:58:49 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-1125: issue one catalog write per tablet report

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/8090 )

Change subject: KUDU-1125: issue one catalog write per tablet report
......................................................................


Patch Set 5:

(12 comments)

http://gerrit.cloudera.org:8080/#/c/8090/5/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

http://gerrit.cloudera.org:8080/#/c/8090/5/src/kudu/master/catalog_manager.cc@3227
PS5, Line 3227: just been added to a pending config and are in the process of catching
              :     // up to the log entry where they were added to the config
> Should this say 'just been added to the committed config and are in the pro
I copied this part of the comment verbatim since I don't pretend to understand the nuances of catch-up, but I'll use your wording.


http://gerrit.cloudera.org:8080/#/c/8090/5/src/kudu/master/catalog_manager.cc@3274
PS5, Line 3274:       // of the committed config,
> s/,/.
I'm not sure. I think it might be possible if the tablet report is coming from a tombstone rather than a live replica. Tombstones might have a very stale notion of who is in the replication group as I don't believe they receive config changes.


http://gerrit.cloudera.org:8080/#/c/8090/5/src/kudu/master/catalog_manager.cc@3294
PS5, Line 3294:         // TODO(matteo): we could batch the IO onto a background thread, or at
> I thought this TODO is what the commit is addressing?
I left it in because of the "background thread" part, but I agree that this change is more than enough to merit.


http://gerrit.cloudera.org:8080/#/c/8090/5/src/kudu/master/catalog_manager.cc@3303
PS5, Line 3303:       // - There was an opid_index change.
> For my own edification, why do we update the on-disk cstate on opid change?
Yes, your edit sums it up well. I'll reword this.


http://gerrit.cloudera.org:8080/#/c/8090/5/src/kudu/master/catalog_manager.cc@3309
PS5, Line 3309: cstate.current_term() > prev_cstate.current_term()
> I'm would have expected that we'd update the on-disk cstate for every term 
Hmm. Given long enough elections, every term change will first be reported as having no leader, and only later will report a leader. What's the use of the master persisting that intermediate state?

AFAICT the master only cares about persisting leadership so that it can use that information immediately after a reboot, before all tablets report in. Even that is a sketchy argument; leadership could probably be in-memory only and no one would be the wiser.


http://gerrit.cloudera.org:8080/#/c/8090/5/src/kudu/master/catalog_manager.cc@3311
PS5, Line 3311: 7e
> Shouldn't this be 7d(i)?
If you think a third level of nesting is more clear, sure I'll do that.


http://gerrit.cloudera.org:8080/#/c/8090/5/src/kudu/master/catalog_manager.cc@3370
PS5, Line 3370:         // 7h. Add a server to the config if it is under-replicated.
> Is this idempotent?
Yes, and it uses a CAS on cstate's opid_index so that if the index changes, the RPC is discarded.

I'll update the comment to reflect that.


http://gerrit.cloudera.org:8080/#/c/8090/5/src/kudu/master/catalog_manager.cc@3386
PS5, Line 3386:             report.schema_version());
> Still send the AsyncAlterTable?
The old code did that, so I continued doing it here.


http://gerrit.cloudera.org:8080/#/c/8090/5/src/kudu/master/catalog_manager.cc@3396
PS5, Line 3396: AsyncAlterTable
> Would have expected this to be called AsyncAlterTablet, but I guess that's 
Agreed.


http://gerrit.cloudera.org:8080/#/c/8090/5/src/kudu/master/catalog_manager.cc@3408
PS5, Line 3408:   tables_lock.Unlock();
> Makes me nervous that this acquires tables before tablets, and releases tab
While "table before tablets when acquiring and the reverse when releasing" rule is a requirement to avoid deadlocks for WRITE locks, it doesn't matter for READ locks. We could hold the table locks for longer, but it just creates more contention w.r.t. table writes than needed.

If it makes you feel better, the old code used this ordering too (albeit on a single table/tablet at a time).


http://gerrit.cloudera.org:8080/#/c/8090/5/src/kudu/master/catalog_manager.cc@3418
PS5, Line 3418:     LOG(ERROR) << Substitute("Error updating tablets: $0. Tablet report was: $1",
> Include the source tserver ID.
Done


http://gerrit.cloudera.org:8080/#/c/8090/5/src/kudu/master/catalog_manager.cc@3448
PS5, Line 3448:       // response to a tablet report.
> Which RPC types does this apply to?  AFAICT they all have an associated tab
DeleteReplica() for an unknown tablet.



-- 
To view, visit http://gerrit.cloudera.org:8080/8090
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie6f5cf0e4b1cd1160b3b310d89c6dbf3dd62e43b
Gerrit-Change-Number: 8090
Gerrit-PatchSet: 5
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Fri, 22 Sep 2017 22:24:47 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-1125: issue one catalog write per tablet report

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/8090 )

Change subject: KUDU-1125: issue one catalog write per tablet report
......................................................................


Patch Set 8: Verified+1

Overriding Jenkins, a few TSAN tests failed due to unsynchronized clocks.


-- 
To view, visit http://gerrit.cloudera.org:8080/8090
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie6f5cf0e4b1cd1160b3b310d89c6dbf3dd62e43b
Gerrit-Change-Number: 8090
Gerrit-PatchSet: 8
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Fri, 06 Oct 2017 04:30:07 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-1125: issue one catalog write per tablet report

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Hello Mike Percy, Dan Burkert, Kudu Jenkins, Todd Lipcon, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8090

to look at the new patch set (#7).

Change subject: KUDU-1125: issue one catalog write per tablet report
......................................................................

KUDU-1125: issue one catalog write per tablet report

This commit addresses a long-standing issue in the catalog manager where an
independent catalog write is performed for each reported tablet. When the
master is configured to fsync WAL writes, this can add a lot of load during
election storms and when the master is restarted.

To tackle this, I fully rewrote ProcessTabletReport and friends. I had
higher hopes for the final product, but all of the dependent control flow
complicates decomposition. Still, I managed to make some improvements. For
example, all RPCs are now sent at the end in one go rather than piecemeal. I
also rewrote all of the comments so that they can serve as a map to the
function and to emphasize the actions performed by the corresponding code.

Here are the actual semantic changes being made:
- Table and tablet locks are now acquired en masse. For tablets, this is
  required for correctness; I've documented why I did the same for tables.
- We no longer check for non-running tables. AFAICT this was a useless check
  because a non-running table must be a deleted table, and there's a check
  for that just before.
- We no longer wake the BgTasks thread upon completion. This is because:
  1. WakeIfHasPendingUpdates() was semi-broken; pending_updates_ is never
     set in ProcessTabletReport.
  2. Regardless, there's no work for the BgTasks thread to do.

Change-Id: Ie6f5cf0e4b1cd1160b3b310d89c6dbf3dd62e43b
---
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
2 files changed, 387 insertions(+), 402 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/90/8090/7
-- 
To view, visit http://gerrit.cloudera.org:8080/8090
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ie6f5cf0e4b1cd1160b3b310d89c6dbf3dd62e43b
Gerrit-Change-Number: 8090
Gerrit-PatchSet: 7
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-1125: issue one catalog write per tablet report

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/8090

to look at the new patch set (#5).

Change subject: KUDU-1125: issue one catalog write per tablet report
......................................................................

KUDU-1125: issue one catalog write per tablet report

This commit addresses a long-standing issue in the catalog manager where an
independent catalog write is performed for each reported tablet. When the
master is configured to fsync WAL writes, this can add a lot of load during
election storms and when the master is restarted.

To tackle this, I fully rewrote ProcessTabletReport and friends. I had
higher hopes for the final product, but all of the dependent control flow
complicates decomposition. Still, I managed to make some improvements. For
example, all RPCs are now sent at the end in one go rather than piecemeal. I
also rewrote all of the comments so that they can serve as a map to the
function and to emphasize the actions performed by the corresponding code.

Here are the actual semantic changes being made:
- Table and tablet locks are now acquired en masse. For tablets, this is
  required for correctness; I've documented why I did the same for tables.
- We no longer check for non-running tables. AFAICT this was a useless check
  because a non-running table must be a deleted table, and there's a check
  for that just before.
- We no longer wake the BgTasks thread upon completion. This is because:
  1. WakeIfHasPendingUpdates() was semi-broken; pending_updates_ is never
     set in ProcessTabletReport.
  2. Regardless, there's no work for the BgTasks thread to do.

Change-Id: Ie6f5cf0e4b1cd1160b3b310d89c6dbf3dd62e43b
---
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
2 files changed, 382 insertions(+), 402 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/90/8090/5
-- 
To view, visit http://gerrit.cloudera.org:8080/8090
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ie6f5cf0e4b1cd1160b3b310d89c6dbf3dd62e43b
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-1125: issue one catalog write per tablet report

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/8090 )

Change subject: KUDU-1125: issue one catalog write per tablet report
......................................................................


Patch Set 8:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8090/6//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/8090/6//COMMIT_MSG@10
PS6, Line 10: When the
            : master is configured to fsync WAL writes, this can add a lot of load during
            : election storms and when the master is restarted.
> That's fair. If it's going to take a lot of effort I don't think it's worth
On our internal flash cluster I ran the following experiment:
1. Reconfigure the master to use log_force_fsync_all=true.
2. Stop the master.
3. Restart a tserver.
4. Start the master.
5. Look at the TSHeartbeat samples in the master's /rpcz page.

In this experiment not only does the master receive a full heartbeat from every tserver, but it also has to update a bunch of persistent tablet state for the tablets whose leaders were on the restarted tserver but are now elsewhere.

- Without this change, I saw a "fat" heartbeat sample which updated 140 tablets and took 10304 ms.
- With this change, an equivalent sample updated 308 tablets and took 157 ms.

As for measuring when the number of tablets may run up against the maximum RPC size, I modified the master to LOG during a tablet report the number of tablets and the size of the overall report. I restarted the master and calculated an average across the full tablet reports sent by all of the tservers in the cluster. It worked out to about ~283 bytes per tablet. With our 50MB maximum RPC size, we'll be in danger of blowing that out at around 185,000 tablets. Note that this applies to single-master deployments too as the maximum RPC size applies to the TSHeartbeat itself.



-- 
To view, visit http://gerrit.cloudera.org:8080/8090
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie6f5cf0e4b1cd1160b3b310d89c6dbf3dd62e43b
Gerrit-Change-Number: 8090
Gerrit-PatchSet: 8
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Sat, 07 Oct 2017 01:23:27 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-1125: issue one catalog write per tablet report

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/8090 )

Change subject: KUDU-1125: issue one catalog write per tablet report
......................................................................


Patch Set 6:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/8090/6//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/8090/6//COMMIT_MSG@10
PS6, Line 10: When the
            : master is configured to fsync WAL writes, this can add a lot of load during
            : election storms and when the master is restarted.
any chance you did a before/after test here? eg start a small cluster with 1000 tablets per node and grab some metric which shows the improvement?

Also did you consider if there is any bound beyond which we end up trying to write single batches which are 50MB+ and thus might blow out the max RPC size when using multi-master? Some back-of-the-envelope math says that a pretty loose upper bound on SysTabletsEntryPB is probably 1KB, so we'd need a batch of 50,000 tablet writes before we overrun the 50MB limit, right? That shoudl be safe given we'd blow out a lot of per-TS limits long before this one, but would be good if you can sanity-check my thinking here.


http://gerrit.cloudera.org:8080/#/c/8090/6/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

http://gerrit.cloudera.org:8080/#/c/8090/6/src/kudu/master/catalog_manager.cc@3106
PS6, Line 3106:                "num_tablets", full_report.updated_tablets_size());
would be nice to also use TRACE_COUNTER_INCREMENT("reported_tablets", updated_tablets.size()) or somesuch here so we can see the metric in /rpcz if we get any latency outliers


http://gerrit.cloudera.org:8080/#/c/8090/6/src/kudu/master/catalog_manager.cc@3122
PS6, Line 3122:   unordered_map<string, ReportedTabletUpdatesPB*> updates;
worth adding to the comment explaining ownership of these raw pointers


http://gerrit.cloudera.org:8080/#/c/8090/6/src/kudu/master/catalog_manager.cc@3151
PS6, Line 3151:     ReportedTabletUpdatesPB* update = full_report_update->add_tablets();
to cut down on small allocations here, consider doing full_report_update->mutable_tablets()->Reserve(full_report.update_tablets_size()) up top before the loop?


http://gerrit.cloudera.org:8080/#/c/8090/6/src/kudu/master/catalog_manager.cc@3157
PS6, Line 3157:       shared_lock<LockType> l(lock_);
I think this would likely perform better if you can acquire this once for the whole report instead of acquire/release for each tablet. i.e a single "long" acquisition is probably better than N "short" acquisitions, considering that 'lock_' is only rarely accessed for write

If such a restructuring is a pain, just add a TODO?


http://gerrit.cloudera.org:8080/#/c/8090/6/src/kudu/master/catalog_manager.cc@3227
PS6, Line 3227: bootstrapping
should this now say "copying"?


http://gerrit.cloudera.org:8080/#/c/8090/6/src/kudu/master/catalog_manager.cc@3249
PS6, Line 3249: with an error
nit: rephrase as "non-deleted tablet which is reporting an error" or somesuch? Initially I wasn't clear whether "with an error" was referring to "Skip" or "tablet"


http://gerrit.cloudera.org:8080/#/c/8090/6/src/kudu/master/catalog_manager.cc@3268
PS6, Line 3268:       if (!cstate.committed_config().has_opid_index()) {
nit: would it be possible to move this if statement up a few lines before copying 'cstate' since I think this case is relatively common (servers may have thousands of tombstoned replicas) and because copying the cstate PB is probably expensive? (lots of embedded strings, etc)


http://gerrit.cloudera.org:8080/#/c/8090/6/src/kudu/master/catalog_manager.cc@3285
PS6, Line 3285:       // TODO(adar): ShouldTransitionTabletToRunning doesn't take step 7b into
              :       // account. Should it?
not sure I understand this question -- isn't the condition from ShouldTransitiontabletToRunning equivalent to the condition in 7b? is it actually redundant now?

Maybe it's worth inlining that function here since it seems like, given the redundancy above, it is really just one or two boolean expressions?


http://gerrit.cloudera.org:8080/#/c/8090/6/src/kudu/master/catalog_manager.cc@3344
PS6, Line 3344:         ConsensusStatePB prev_cstate = tablet->metadata().state().pb.consensus_state();
this is a little confusing - this is shadowing a variable with the same name in outer scope?



-- 
To view, visit http://gerrit.cloudera.org:8080/8090
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie6f5cf0e4b1cd1160b3b310d89c6dbf3dd62e43b
Gerrit-Change-Number: 8090
Gerrit-PatchSet: 6
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Fri, 29 Sep 2017 18:38:54 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-1125: issue one catalog write per tablet report

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/8090 )

Change subject: KUDU-1125: issue one catalog write per tablet report
......................................................................


Patch Set 8:

(8 comments)

http://gerrit.cloudera.org:8080/#/c/8090/6/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

http://gerrit.cloudera.org:8080/#/c/8090/6/src/kudu/master/catalog_manager.cc@3113
PS6, Line 3113:              
> TODO(todd) per git blame
Done


http://gerrit.cloudera.org:8080/#/c/8090/7/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

http://gerrit.cloudera.org:8080/#/c/8090/7/src/kudu/master/catalog_manager.cc@3152
PS7, Line 3152: acquire
> acquire
Done


http://gerrit.cloudera.org:8080/#/c/8090/7/src/kudu/master/catalog_manager.cc@3173
PS7, Line 3173:         // reasonable to ignore it and wait for an operator fix the situation.
> I had a similar thought but seems out of scope for this commit. Perhaps add
So, I added this behavior in commit 42b65b0. The idea was to prevent the master from automatically deleting truly unknown tablets, but for the gflag to serve as a "safety valve" of sorts in case the admin knew that these tablets were no longer useful and wanted the master to delete them.

That said, our tooling has improved since then, and it's easier now to manually remove unwanted tablets. Plus I doubt anyone has actually used this gflag. So I'll remove it.


http://gerrit.cloudera.org:8080/#/c/8090/7/src/kudu/master/catalog_manager.cc@3236
PS7, Line 3236: index) {
> I think this tertiary if condition needs to be: (report.has_consensus_state
Done


http://gerrit.cloudera.org:8080/#/c/8090/7/src/kudu/master/catalog_manager.cc@3291
PS7, Line 3291: 
> Interesting question. I think it technically should, because the purpose of
I'll make the change so this doesn't confuse anyone else in the future.


http://gerrit.cloudera.org:8080/#/c/8090/7/src/kudu/master/catalog_manager.cc@3319
PS7, Line 3319:  than 
> cstate; a config doesn't really have a term
Done


http://gerrit.cloudera.org:8080/#/c/8090/7/src/kudu/master/catalog_manager.cc@3320
PS7, Line 3320: vious 
> cstate
Done


http://gerrit.cloudera.org:8080/#/c/8090/7/src/kudu/master/catalog_manager.cc@3452
PS7, Line 3452: 
> in
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/8090
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie6f5cf0e4b1cd1160b3b310d89c6dbf3dd62e43b
Gerrit-Change-Number: 8090
Gerrit-PatchSet: 8
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Fri, 06 Oct 2017 01:42:08 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-1125: issue one catalog write per tablet report

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/8090 )

Change subject: KUDU-1125: issue one catalog write per tablet report
......................................................................


Patch Set 6:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8090/6/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

http://gerrit.cloudera.org:8080/#/c/8090/6/src/kudu/master/catalog_manager.cc@3227
PS6, Line 3227: bootstrapping
> I thought bootstrapping and copying were synonymous?
We used to have a "remote bootstrap" process and we renamed it to "tablet copy" to differentiate it from "tablet bootstrap", the process of WAL replay followed by tablet startup. The latter still bears the name "bootstrap".



-- 
To view, visit http://gerrit.cloudera.org:8080/8090
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie6f5cf0e4b1cd1160b3b310d89c6dbf3dd62e43b
Gerrit-Change-Number: 8090
Gerrit-PatchSet: 6
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Tue, 03 Oct 2017 22:04:09 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-1125: issue one catalog write per tablet report

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/8090 )

Change subject: KUDU-1125: issue one catalog write per tablet report
......................................................................

KUDU-1125: issue one catalog write per tablet report

This commit addresses a long-standing issue in the catalog manager where an
independent catalog write is performed for each reported tablet. When the
master is configured to fsync WAL writes, this can add a lot of load during
election storms and when the master is restarted.

To tackle this, I fully rewrote ProcessTabletReport and friends. I had
higher hopes for the final product, but all of the dependent control flow
complicates decomposition. Still, I managed to make some improvements. For
example, all RPCs are now sent at the end in one go rather than piecemeal. I
also rewrote all of the comments so that they can serve as a map to the
function and to emphasize the actions performed by the corresponding code.

Here are the actual semantic changes being made:
- Table and tablet locks are now acquired en masse. For tablets, this is
  required for correctness; I've documented why I did the same for tables.
- We no longer check for non-running tables. AFAICT this was a useless check
  because a non-running table must be a deleted table, and there's a check
  for that just before.
- We no longer wake the BgTasks thread upon completion. This is because:
  1. WakeIfHasPendingUpdates() was semi-broken; pending_updates_ is never
     set in ProcessTabletReport.
  2. Regardless, there's no work for the BgTasks thread to do.

Change-Id: Ie6f5cf0e4b1cd1160b3b310d89c6dbf3dd62e43b
Reviewed-on: http://gerrit.cloudera.org:8080/8090
Tested-by: Adar Dembo <ad...@cloudera.com>
Reviewed-by: Mike Percy <mp...@apache.org>
---
M src/kudu/integration-tests/delete_table-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
3 files changed, 391 insertions(+), 444 deletions(-)

Approvals:
  Adar Dembo: Verified
  Mike Percy: Looks good to me, approved

-- 
To view, visit http://gerrit.cloudera.org:8080/8090
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Ie6f5cf0e4b1cd1160b3b310d89c6dbf3dd62e43b
Gerrit-Change-Number: 8090
Gerrit-PatchSet: 9
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-1125: issue one catalog write per tablet report

Posted by "Dan Burkert (Code Review)" <ge...@cloudera.org>.
Dan Burkert has posted comments on this change. ( http://gerrit.cloudera.org:8080/8090 )

Change subject: KUDU-1125: issue one catalog write per tablet report
......................................................................


Patch Set 5:

(12 comments)

http://gerrit.cloudera.org:8080/#/c/8090/5/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

http://gerrit.cloudera.org:8080/#/c/8090/5/src/kudu/master/catalog_manager.cc@3227
PS5, Line 3227: just been added to a pending config and are in the process of catching
              :     // up to the log entry where they were added to the config
Should this say 'just been added to the committed config and are in the process of bootstrapping'?


http://gerrit.cloudera.org:8080/#/c/8090/5/src/kudu/master/catalog_manager.cc@3274
PS5, Line 3274:       // of the committed config,
s/,/.

For my own edification, do you know off the top of your head if this can this happen in normal circumstances?  I would not expect the tserver to think a tserver was a leader who wasn't in the config.


http://gerrit.cloudera.org:8080/#/c/8090/5/src/kudu/master/catalog_manager.cc@3294
PS5, Line 3294:         // TODO(matteo): we could batch the IO onto a background thread, or at
I thought this TODO is what the commit is addressing?


http://gerrit.cloudera.org:8080/#/c/8090/5/src/kudu/master/catalog_manager.cc@3303
PS5, Line 3303:       // - There was an opid_index change.
For my own edification, why do we update the on-disk cstate on opid change?  Doesn't that happen for every write batch (i.e. a lot)?  I would have expected it to be updated on term change || new knowledge of leader || config change.

Edit: going to leave my original question because it shows what I think is a natural question for someone new to this code.  However, I think the code _is_ doing what I originally expected, but the comment could be more clear.  The cstate isn't changing for every opid increment; it's changing if the committed config changes (the config's opid_index just happens to be how this is recognized).


http://gerrit.cloudera.org:8080/#/c/8090/5/src/kudu/master/catalog_manager.cc@3309
PS5, Line 3309: cstate.current_term() > prev_cstate.current_term()
I'm would have expected that we'd update the on-disk cstate for every term change, even if we don't know the leader of the new term.


http://gerrit.cloudera.org:8080/#/c/8090/5/src/kudu/master/catalog_manager.cc@3311
PS5, Line 3311: 7e
Shouldn't this be 7d(i)?


http://gerrit.cloudera.org:8080/#/c/8090/5/src/kudu/master/catalog_manager.cc@3370
PS5, Line 3370:         // 7h. Add a server to the config if it is under-replicated.
Is this idempotent?


http://gerrit.cloudera.org:8080/#/c/8090/5/src/kudu/master/catalog_manager.cc@3386
PS5, Line 3386:             report.schema_version());
Still send the AsyncAlterTable?


http://gerrit.cloudera.org:8080/#/c/8090/5/src/kudu/master/catalog_manager.cc@3396
PS5, Line 3396: AsyncAlterTable
Would have expected this to be called AsyncAlterTablet, but I guess that's probably outside the scope here.


http://gerrit.cloudera.org:8080/#/c/8090/5/src/kudu/master/catalog_manager.cc@3408
PS5, Line 3408:   tables_lock.Unlock();
Makes me nervous that this acquires tables before tablets, and releases tables before tablets as well.


http://gerrit.cloudera.org:8080/#/c/8090/5/src/kudu/master/catalog_manager.cc@3418
PS5, Line 3418:     LOG(ERROR) << Substitute("Error updating tablets: $0. Tablet report was: $1",
Include the source tserver ID.


http://gerrit.cloudera.org:8080/#/c/8090/5/src/kudu/master/catalog_manager.cc@3448
PS5, Line 3448:       // response to a tablet report.
Which RPC types does this apply to?  AFAICT they all have an associated tablet.



-- 
To view, visit http://gerrit.cloudera.org:8080/8090
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie6f5cf0e4b1cd1160b3b310d89c6dbf3dd62e43b
Gerrit-Change-Number: 8090
Gerrit-PatchSet: 5
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Thu, 21 Sep 2017 22:44:09 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-1125: issue one catalog write per tablet report

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Hello Mike Percy, Dan Burkert, Kudu Jenkins, Todd Lipcon, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/8090

to look at the new patch set (#8).

Change subject: KUDU-1125: issue one catalog write per tablet report
......................................................................

KUDU-1125: issue one catalog write per tablet report

This commit addresses a long-standing issue in the catalog manager where an
independent catalog write is performed for each reported tablet. When the
master is configured to fsync WAL writes, this can add a lot of load during
election storms and when the master is restarted.

To tackle this, I fully rewrote ProcessTabletReport and friends. I had
higher hopes for the final product, but all of the dependent control flow
complicates decomposition. Still, I managed to make some improvements. For
example, all RPCs are now sent at the end in one go rather than piecemeal. I
also rewrote all of the comments so that they can serve as a map to the
function and to emphasize the actions performed by the corresponding code.

Here are the actual semantic changes being made:
- Table and tablet locks are now acquired en masse. For tablets, this is
  required for correctness; I've documented why I did the same for tables.
- We no longer check for non-running tables. AFAICT this was a useless check
  because a non-running table must be a deleted table, and there's a check
  for that just before.
- We no longer wake the BgTasks thread upon completion. This is because:
  1. WakeIfHasPendingUpdates() was semi-broken; pending_updates_ is never
     set in ProcessTabletReport.
  2. Regardless, there's no work for the BgTasks thread to do.

Change-Id: Ie6f5cf0e4b1cd1160b3b310d89c6dbf3dd62e43b
---
M src/kudu/integration-tests/delete_table-itest.cc
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
3 files changed, 391 insertions(+), 444 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/90/8090/8
-- 
To view, visit http://gerrit.cloudera.org:8080/8090
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ie6f5cf0e4b1cd1160b3b310d89c6dbf3dd62e43b
Gerrit-Change-Number: 8090
Gerrit-PatchSet: 8
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-1125: issue one catalog write per tablet report

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/8090 )

Change subject: KUDU-1125: issue one catalog write per tablet report
......................................................................


Patch Set 7: Code-Review+1

(4 comments)

lgtm after you address Mike's comments

http://gerrit.cloudera.org:8080/#/c/8090/6//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/8090/6//COMMIT_MSG@10
PS6, Line 10: When the
            : master is configured to fsync WAL writes, this can add a lot of load during
            : election storms and when the master is restarted.
> Haven't done any perf testing yet. I was planning to do it post-merge, but 
That's fair. If it's going to take a lot of effort I don't think it's worth trying to hit this issue. It might also take a lot of machines since we won't generate writes for TOMBSTONED tablets, and the non-tombstoned tablet count is limited by the thread scalability issues on the TS side. I'd skip doing anything fancy for now and just try to enable fsync on a cluster with a reasonable number of tablets and see if there's a noticeable difference in the TSHeartbeat latency profile (there should be an enormous one)


http://gerrit.cloudera.org:8080/#/c/8090/6/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

http://gerrit.cloudera.org:8080/#/c/8090/6/src/kudu/master/catalog_manager.cc@3106
PS6, Line 3106:                "requestor", rpc->requestor_string(),
> Todd, I'm not sure how your example can help measure latency. Did you mean 
well, /rpcz already keeps samples based on the overall latency of the request handling. For each sampling bucket (eg >1sec) it'll keep a single sampled trace, which includes the metrics. So the idea is that if we go to /rpcz and look at the ">1sec TSHeartbeat" sample, we'll have a useful metric to see whether the report was particularly large or something.


http://gerrit.cloudera.org:8080/#/c/8090/6/src/kudu/master/catalog_manager.cc@3285
PS6, Line 3285:       // leader elected. We need to wait for a leader before marking a tablet
              :       // as RUNNING, or else
> This TODO is trying to draw attention to the fact that ShouldTransitionTabl
Ack


http://gerrit.cloudera.org:8080/#/c/8090/7/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

http://gerrit.cloudera.org:8080/#/c/8090/7/src/kudu/master/catalog_manager.cc@3173
PS7, Line 3173:         // reasonable to ignore it and wait for an operator fix the situation.
> The end of this comment only makes sense after going to the flag definition
I had a similar thought but seems out of scope for this commit. Perhaps add a TODO to consider removing this flag? Seems like it's only used by one test, and that test's goal is just to see that the flag works :)



-- 
To view, visit http://gerrit.cloudera.org:8080/8090
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie6f5cf0e4b1cd1160b3b310d89c6dbf3dd62e43b
Gerrit-Change-Number: 8090
Gerrit-PatchSet: 7
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Fri, 06 Oct 2017 00:36:57 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-1125: issue one catalog write per tablet report

Posted by "Dan Burkert (Code Review)" <ge...@cloudera.org>.
Dan Burkert has posted comments on this change. ( http://gerrit.cloudera.org:8080/8090 )

Change subject: KUDU-1125: issue one catalog write per tablet report
......................................................................


Patch Set 6:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8090/6/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

http://gerrit.cloudera.org:8080/#/c/8090/6/src/kudu/master/catalog_manager.cc@3227
PS6, Line 3227: bootstrapping
> Yeah. The original wording was "This prevents us from spuriously deleting r
I thought bootstrapping and copying were synonymous?



-- 
To view, visit http://gerrit.cloudera.org:8080/8090
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie6f5cf0e4b1cd1160b3b310d89c6dbf3dd62e43b
Gerrit-Change-Number: 8090
Gerrit-PatchSet: 6
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Tue, 03 Oct 2017 14:48:50 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-1125: issue one catalog write per tablet report

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has removed Kudu Jenkins from this change.  ( http://gerrit.cloudera.org:8080/8090 )

Change subject: KUDU-1125: issue one catalog write per tablet report
......................................................................


Removed reviewer Kudu Jenkins with the following votes:

* Verified-1 by Kudu Jenkins (120)
-- 
To view, visit http://gerrit.cloudera.org:8080/8090
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: deleteReviewer
Gerrit-Change-Id: Ie6f5cf0e4b1cd1160b3b310d89c6dbf3dd62e43b
Gerrit-Change-Number: 8090
Gerrit-PatchSet: 8
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-1125: issue one catalog write per tablet report

Posted by "Dan Burkert (Code Review)" <ge...@cloudera.org>.
Dan Burkert has posted comments on this change. ( http://gerrit.cloudera.org:8080/8090 )

Change subject: KUDU-1125: issue one catalog write per tablet report
......................................................................


Patch Set 6: Code-Review+1

(2 comments)

I think Mike should do a pass.

http://gerrit.cloudera.org:8080/#/c/8090/5/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

http://gerrit.cloudera.org:8080/#/c/8090/5/src/kudu/master/catalog_manager.cc@3227
PS5, Line 3227: just been added to the committed config and are in the process of bootstrapping.
              :     const ConsensusStatePB& prev_cstate = tablet->metadata().s
> I copied this part of the comment verbatim since I don't pretend to underst
No, it was a real question.  I'm not sure the two are equivalent.


http://gerrit.cloudera.org:8080/#/c/8090/5/src/kudu/master/catalog_manager.cc@3408
PS5, Line 3408:   tables_lock.Unlock();
> While "table before tablets when acquiring and the reverse when releasing" 
sounds good



-- 
To view, visit http://gerrit.cloudera.org:8080/8090
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie6f5cf0e4b1cd1160b3b310d89c6dbf3dd62e43b
Gerrit-Change-Number: 8090
Gerrit-PatchSet: 6
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Sat, 23 Sep 2017 00:43:46 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-1125: issue one catalog write per tablet report

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change.

Change subject: KUDU-1125: issue one catalog write per tablet report
......................................................................


Patch Set 2: Verified+1

Overriding Jenkins, the only failure was in setting up one of the TSAN tests.

-- 
To view, visit http://gerrit.cloudera.org:8080/8090
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ie6f5cf0e4b1cd1160b3b310d89c6dbf3dd62e43b
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] KUDU-1125: issue one catalog write per tablet report

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/8090 )

Change subject: KUDU-1125: issue one catalog write per tablet report
......................................................................


Patch Set 7:

(9 comments)

This is a big improvement, Adar.

http://gerrit.cloudera.org:8080/#/c/8090/6/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

http://gerrit.cloudera.org:8080/#/c/8090/6/src/kudu/master/catalog_manager.cc@3106
PS6, Line 3106:                "requestor", rpc->requestor_string(),
> would be nice to also use TRACE_COUNTER_INCREMENT("reported_tablets", updat
Todd, I'm not sure how your example can help measure latency. Did you mean to suggest using something like TRACE_COUNTER_SCOPE_LATENCY_US() ?


http://gerrit.cloudera.org:8080/#/c/8090/6/src/kudu/master/catalog_manager.cc@3113
PS6, Line 3113:              
TODO(todd) per git blame


http://gerrit.cloudera.org:8080/#/c/8090/7/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

http://gerrit.cloudera.org:8080/#/c/8090/7/src/kudu/master/catalog_manager.cc@3152
PS7, Line 3152: acqurie
acquire


http://gerrit.cloudera.org:8080/#/c/8090/7/src/kudu/master/catalog_manager.cc@3173
PS7, Line 3173:         // reasonable to ignore it and wait for an operator fix the situation.
The end of this comment only makes sense after going to the flag definition and seeing that it defaults to false. I wonder, if we are going to default it to false, if we shouldn't just remove it and this by-default-disabled logic altogether and simply emit the warning on L3181. When was the last time we enabled this? I actually thought the default behavior of the master was to delete unknown tablets.


http://gerrit.cloudera.org:8080/#/c/8090/7/src/kudu/master/catalog_manager.cc@3236
PS7, Line 3236: report.has_consensus_state()
I think this tertiary if condition needs to be: (report.has_consensus_state() && report.consensus_state().committed_config().has_opid_index()) because of the issue brought up in the comment on L3269.


http://gerrit.cloudera.org:8080/#/c/8090/7/src/kudu/master/catalog_manager.cc@3291
PS7, Line 3291: Should it?
Interesting question. I think it technically should, because the purpose of disregarding the non-member leaders is to wait until a leader is part of the committed config before publishing its existence to clients. OTOH I can't easily come up with a series of events where this would be a problem because to change the config there has to be an initial leader that coordinates the config change, and this logic is about waiting for that first leader when creating a new tablet from scratch for the first time.


http://gerrit.cloudera.org:8080/#/c/8090/7/src/kudu/master/catalog_manager.cc@3319
PS7, Line 3319: config
cstate; a config doesn't really have a term


http://gerrit.cloudera.org:8080/#/c/8090/7/src/kudu/master/catalog_manager.cc@3320
PS7, Line 3320: config
cstate


http://gerrit.cloudera.org:8080/#/c/8090/7/src/kudu/master/catalog_manager.cc@3452
PS7, Line 3452: as
in



-- 
To view, visit http://gerrit.cloudera.org:8080/8090
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie6f5cf0e4b1cd1160b3b310d89c6dbf3dd62e43b
Gerrit-Change-Number: 8090
Gerrit-PatchSet: 7
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Tue, 03 Oct 2017 02:55:41 +0000
Gerrit-HasComments: Yes