You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Adar Dembo (Code Review)" <ge...@cloudera.org> on 2017/09/05 09:13:11 UTC

[kudu-CR] KUDU-1807: improve IsCreateInProgress performance

Hello Dan Burkert, Todd Lipcon,

I'd like you to do a code review.  Please visit

    http://gerrit.cloudera.org:8080/7957

to review the following change.

Change subject: KUDU-1807: improve IsCreateInProgress performance
......................................................................

KUDU-1807: improve IsCreateInProgress performance

For some inexplicable reason, IsCreateInProgress is invoked when servicing a
GetTableSchema RPC. While it's attractive to change that, it's also
untenable without affecting backwards compatibility. So instead, here's a
patch that adds in-memory caching to the master so that IsCreateInProgress
needn't acquire N tablet locks.

Just like schema_version_counts_, the cached state is a map of all tablet
states to the number of tablets currently in that state. For this particular
problem a single count of "number of creating tablets" might suffice, but I
liked the consistency with schema_version_counts_ and I found it easier to
handle the various corner cases when accounting for all states.

Because the map contents reflect persistent state, the locking semantics
differ somewhat from schema_version_counts_:
1. Tablet state changes are usually wrapped in a tablet metadata lock, so
   care must be taken to only update the state if the lock commits.
2. Further, the count must be updated with the tablet metadata lock held.
   Why? So that count updates are serialized in the same order as state
   changes themselves. Consider an example where two threads are trying to
   update a tablet's state (currently X) to Y and Z respectively. The count
   of state X begins at 1 (C_X=1) and the counts of states Y and Z are 0
   (C_Y=C_Z=0). There's no defined order so whichever state change happens
   second "wins". However, if the count updates were allowed outside the
   metadata locks, it would be possible for the order of operations to be
   X->Y, Y->Z, (C_Y-- C_Z++), (C_X-- C_Y++), causing C_Y to momentarily go
   negative. If count updates are restricted to inside the metadata locks, a
   state change order of X->Y, Y->Z forces the count update order to be
   (C_X-- C_Y++), (C_Y--, C_Z++).

To accommodate these semantics, the ScopedTabletInfoCommitter was modified
to support tracking state changes. There are two notable (and confusing)
exceptions which are documented in the code.

Change-Id: Idca379bb1249e54c0b7fbb0df408ed829ab9fbd2
---
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
2 files changed, 224 insertions(+), 68 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/57/7957/1
-- 
To view, visit http://gerrit.cloudera.org:8080/7957
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Idca379bb1249e54c0b7fbb0df408ed829ab9fbd2
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-1807: improve IsCreateInProgress performance

Posted by "Dan Burkert (Code Review)" <ge...@cloudera.org>.
Dan Burkert has posted comments on this change.

Change subject: KUDU-1807: improve IsCreateInProgress performance
......................................................................


Patch Set 3:

(3 comments)

overall I agree with Todd, I'm finding it very difficult to understand the intricacies of the locking in this patch (as opposed to the simpler patch earlier).

http://gerrit.cloudera.org:8080/#/c/7957/3/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

Line 4605:   DCHECK_EQ(calculated, tablet_state_counts_);
Would it be good to verify the overall count against the total tablet count, or is that implicitly done already?


http://gerrit.cloudera.org:8080/#/c/7957/3/src/kudu/master/catalog_manager.h
File src/kudu/master/catalog_manager.h:

Line 800:                                   ScopedTabletInfoCommitter* committer_in,
Please add docs about what committer_in and comitter_out do.


http://gerrit.cloudera.org:8080/#/c/7957/3/src/kudu/util/cow_object.h
File src/kudu/util/cow_object.h:

Line 77:     if (commit_cb) {
how does this work?  What is this testing?


-- 
To view, visit http://gerrit.cloudera.org:8080/7957
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Idca379bb1249e54c0b7fbb0df408ed829ab9fbd2
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-1807: improve IsCreateInProgress performance

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: KUDU-1807: improve IsCreateInProgress performance
......................................................................


Patch Set 3:

(11 comments)

http://gerrit.cloudera.org:8080/#/c/7957/3/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

Line 554
hrm, why'd you remove the anonymous namespace here? actually it appears it should probably be higher up to encompass the visitors as well?


PS3, Line 634: the same order as tablet
             :       // lock acquisition
I thought we usually acquire the table lock first, but here we are acquiring the tablet lock in commit mode before we are trying to acquire some table-level lock in ApplyTabletStateChange, no?


PS3, Line 639:         const auto* change = FindOrNull(state_changes_, t);
             :         if (change) {
             :           t->mutable_metadata()->CommitMutation([&]() {
             :             t->table()->ApplyTabletStateChange(change->first, change->second);
             :           });
             :         } else {
             :           t->mutable_metadata()->CommitMutation();
             :         }
hrm, I think this might be easier to read as:

t->mutable_metadata()->CommitMutation([&]() {
  if (const auto* change = FindOrNull(state_changes_, t)) {
    t->table()->ApplyTabletStateChange(change->first, change->second);
  }
});

i.e fold the branch inside the lambda rather than outside, so you get rid of the multiple call sites to CommitMutation


PS3, Line 3747: unlocker_out
update this name


Line 4573:   VLOG(3) << Substitute("$0: incrementing tablet state $1",
this log might be more useful if it included either the old or new value of the count


Line 4582:   VLOG(3) << Substitute("$0: decrementing tablet state $1",
same


PS3, Line 4589: DCHECK
I think this is worth a CHECK because otherwise we will be decrementing some random memory


Line 4592:   if (it->second == 0) {
maybe DCHECK_GE(it->second, 0)?


PS3, Line 4597: VerifyTabletStateCountsUnlocked
maybe expand this to also verify teh schema version counts since it's a similar use case?


http://gerrit.cloudera.org:8080/#/c/7957/3/src/kudu/master/catalog_manager.h
File src/kudu/master/catalog_manager.h:

PS3, Line 298: ApplyTabletStateChange
how about AccountTabletStateChange since it's only accounting and not actually changing any states?


PS3, Line 739: tablet_lock
change param name


-- 
To view, visit http://gerrit.cloudera.org:8080/7957
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Idca379bb1249e54c0b7fbb0df408ed829ab9fbd2
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-1807: improve IsCreateInProgress performance

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7957

to look at the new patch set (#2).

Change subject: KUDU-1807: improve IsCreateInProgress performance
......................................................................

KUDU-1807: improve IsCreateInProgress performance

For some inexplicable reason, IsCreateInProgress is invoked when servicing a
GetTableSchema RPC. While it's attractive to change that, it's also
untenable without affecting backwards compatibility. So instead, here's a
patch that adds in-memory caching to the master so that IsCreateInProgress
needn't acquire N tablet locks.

Just like schema_version_counts_, the cached state is a map of all tablet
states to the number of tablets currently in that state. For this particular
problem a single count of "number of creating tablets" might suffice, but I
liked the consistency with schema_version_counts_ and I found it easier to
handle the various corner cases when accounting for all states.

Because the map contents reflect persistent state, the locking semantics
differ somewhat from schema_version_counts_:
1. Tablet state changes are usually wrapped in a tablet metadata lock, so
   care must be taken to only update the state if the lock commits.
2. Further, the count must be updated with the tablet metadata lock held.
   Why? So that count updates are serialized in the same order as state
   changes themselves. Consider an example where two threads are trying to
   update a tablet's state (currently X) to Y and Z respectively. The count
   of state X begins at 1 (C_X=1) and the counts of states Y and Z are 0
   (C_Y=C_Z=0). There's no defined order so whichever state change happens
   second "wins". However, if the count updates were allowed outside the
   metadata locks, it would be possible for the order of operations to be
   X->Y, Y->Z, (C_Y-- C_Z++), (C_X-- C_Y++), causing C_Y to momentarily go
   negative. If count updates are restricted to inside the metadata locks, a
   state change order of X->Y, Y->Z forces the count update order to be
   (C_X-- C_Y++), (C_Y--, C_Z++).

To accommodate these semantics, the ScopedTabletInfoCommitter was modified
to support tracking state changes. There are two notable (and confusing)
exceptions which are documented in the code.

Change-Id: Idca379bb1249e54c0b7fbb0df408ed829ab9fbd2
---
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
2 files changed, 225 insertions(+), 68 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/57/7957/2
-- 
To view, visit http://gerrit.cloudera.org:8080/7957
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Idca379bb1249e54c0b7fbb0df408ed829ab9fbd2
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-1807: improve IsCreateInProgress performance

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has abandoned this change.

Change subject: KUDU-1807: improve IsCreateInProgress performance
......................................................................


Abandoned

This turned out to be quite complicated, so I'm exploring a client-side fix instead: http://gerrit.cloudera.org:8080/8026

-- 
To view, visit http://gerrit.cloudera.org:8080/7957
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: abandon
Gerrit-Change-Id: Idca379bb1249e54c0b7fbb0df408ed829ab9fbd2
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-1807: improve IsCreateInProgress performance

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: KUDU-1807: improve IsCreateInProgress performance
......................................................................


Patch Set 2:

(11 comments)

I only partially followed this patch... it's one of those ones that's hard to verify by looking at it visually.

Is there any way we can add a debug-mode consistency check of some sort that verifies that the map of counts maps the calculated iteration? even if it's correct right now it seems like it may regress and be really hard to debug at some point later. (iirc the Java client just loops forever trying to fetch locations if it thinks the table is in-progress)

http://gerrit.cloudera.org:8080/#/c/7957/2/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

Line 637:       // Then, commit the mutations.
is there some race window here where a caller ends up asking for GetTableLocations and it says that create is not in-progress, but in fact some of the tablets still haven't gotten to the Commit() step below, and hence are not returned?


PS2, Line 663: set_state
perhaps this should be named swap_state to indicate its return value better


Line 665:     state_changes_[tablet->table()][old_state]--;
can we DCHECK here that this never goes negative?


Line 672:   const PersistentTabletInfo& data(
nit: is this wrapping necessary?


Line 681:   PersistentTabletInfo* mutable_data(
nit: same (wrapping)?


PS2, Line 3697: unlocker_in
maybe 'committer' is a better name
(same below)


PS2, Line 4468: SysTabletsEntryPB::RUNNING
don't we have to worry about REPLACED and DELETED? or do those actually get entirely removed?


PS2, Line 4591:   DCHECK_GT(it->second, 0);
can you DCHECK_GE() vs 'delta'?


http://gerrit.cloudera.org:8080/#/c/7957/2/src/kudu/master/catalog_manager.h
File src/kudu/master/catalog_manager.h:

PS2, Line 123: get_state
unusual naming for a getter


PS2, Line 244: TabletStateMap
how about TabletStateCountMap? otherwise I would expect this type to be keyed by tablet id like the TabletInfo one.


PS2, Line 318:   void IncrementTabletStateCountUnlocked(SysTabletsEntryPB::State state,
             :                                          int64_t delta = 1);
             :   void DecrementTabletStateCountUnlocked(SysTabletsEntryPB::State state,
             :                                          int64_t delta = 1);
if you are taking counts here, it seems a little odd to have the two separate methods instead of just having one, and allowing a negative 'delta'


-- 
To view, visit http://gerrit.cloudera.org:8080/7957
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Idca379bb1249e54c0b7fbb0df408ed829ab9fbd2
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-1807: improve IsCreateInProgress performance

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change.

Change subject: KUDU-1807: improve IsCreateInProgress performance
......................................................................


Patch Set 2:

(11 comments)

> Is there any way we can add a debug-mode consistency check of some
 > sort that verifies that the map of counts maps the calculated
 > iteration? even if it's correct right now it seems like it may
 > regress and be really hard to debug at some point later. (iirc the
 > Java client just loops forever trying to fetch locations if it
 > thinks the table is in-progress)

I added a function to do this verification, and added a call to it in IsCreateInProgress. Let me know if you think I should call it elsewhere, or add something else.

http://gerrit.cloudera.org:8080/#/c/7957/2/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

Line 637:       // Then, commit the mutations.
> is there some race window here where a caller ends up asking for GetTableLo
You're right that there's a race window here, and that it can cause GetTableLocations to think that a tablet isn't yet running. However, all of our client implementations will treat this as a transient state and retry with some backoff.

Given the above, I wasn't sure whether this is worth fixing. What ultimately convinced me was that it's needed for the correctness of a "compare the tablet state count map contents against the actual state counts" function.


PS2, Line 663: set_state
> perhaps this should be named swap_state to indicate its return value better
Done


Line 665:     state_changes_[tablet->table()][old_state]--;
> can we DCHECK here that this never goes negative?
On the contrary: it probably will be negative, because state_changes_ represents the deltas that will be applied rather than the absolute counts.

In any case, this is moot now because we're no longer aggregating state changes for tables.


Line 672:   const PersistentTabletInfo& data(
> nit: is this wrapping necessary?
I like to wrap at 80, and these two lines were 84 and 86 respectively. But I'll unwrap since you found it distracting.


Line 681:   PersistentTabletInfo* mutable_data(
> nit: same (wrapping)?
Done


PS2, Line 3697: unlocker_in
> maybe 'committer' is a better name
Done, though I preserved the 'in' and 'out' suffixes to help navigate between these functions and ProcessPendingAssignments.


PS2, Line 4468: SysTabletsEntryPB::RUNNING
> don't we have to worry about REPLACED and DELETED? or do those actually get
I tried to maintain the same semantics as the old implementation, which returned true when it found any non-RUNNING tablet.

A tablet switches to REPLACED in ProcessPendingAssignments but is implicitly removed from the map when AddRemoveTablets is used to add the actual replacement tablet. It's interesting that here the mutation is committed before AddRemoveTablets is called, so there's a very brief window where a REPLACED tablet could be visible. I'm guessing this is harmless because it also implies a CREATING tablet, which means the table is definitely still being created.

A tablet switches to DELETED either in DeleteTable or in AlterTable (when a range partition is dropped). In the former, the entire table switches to DELETED, which causes IsCreateInProgress callers to short-circuit. In the latter, a DELETED tablet is explicitly dropped via AddRemoveTablets. And unlike ProcessPendingAssignments, the mutation is committed after the tablet is dropped, so the DELETED tablet is never visible.


PS2, Line 4591:   DCHECK_GT(it->second, 0);
> can you DCHECK_GE() vs 'delta'?
Moot, Increment/Decrement now always add/remove just one.


http://gerrit.cloudera.org:8080/#/c/7957/2/src/kudu/master/catalog_manager.h
File src/kudu/master/catalog_manager.h:

PS2, Line 123: get_state
> unusual naming for a getter
It's because the three callers are all tablet->metadata().state().get_state(). So if this were just state(), the calls would be tablet->metadata().state().state(), which I found confusing.

I can still change this if you prefer.


PS2, Line 244: TabletStateMap
> how about TabletStateCountMap? otherwise I would expect this type to be key
Done


PS2, Line 318:   void IncrementTabletStateCountUnlocked(SysTabletsEntryPB::State state,
             :                                          int64_t delta = 1);
             :   void DecrementTabletStateCountUnlocked(SysTabletsEntryPB::State state,
             :                                          int64_t delta = 1);
> if you are taking counts here, it seems a little odd to have the two separa
My original implementation used a single method which allowed negative deltas, but then I saw that the implementation was pretty different for negative deltas (due to additional invariant enforcement), and the symmetry with the SchemaVersionCount methods looked nice.

Anyway, this is moot now because per-table state change batching is no longer a thing, so we always increment/decrement one at a time.


-- 
To view, visit http://gerrit.cloudera.org:8080/7957
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Idca379bb1249e54c0b7fbb0df408ed829ab9fbd2
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-1807: improve IsCreateInProgress performance

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7957

to look at the new patch set (#3).

Change subject: KUDU-1807: improve IsCreateInProgress performance
......................................................................

KUDU-1807: improve IsCreateInProgress performance

For some inexplicable reason, IsCreateInProgress is invoked when servicing a
GetTableSchema RPC. While it's attractive to change that, it's also
untenable without affecting backwards compatibility. So instead, here's a
patch that adds in-memory caching to the master so that IsCreateInProgress
needn't acquire N tablet locks.

Just like schema_version_counts_, the cached state is a map of all tablet
states to the number of tablets currently in that state. For this particular
problem a single count of "number of creating tablets" might suffice, but I
liked the consistency with schema_version_counts_ and I found it easier to
handle the various corner cases when accounting for all states.

Because the map contents reflect persistent state, the locking semantics
differ somewhat from schema_version_counts_:
1. Tablet state changes are usually wrapped in a tablet metadata lock, so
   care must be taken to only update the state if the lock commits.
2. Further, the count must be updated with the tablet metadata lock held.
   Why? So that count updates are serialized in the same order as state
   changes themselves. Consider an example where two threads are trying to
   update a tablet's state (currently X) to Y and Z respectively. The count
   of state X begins at 1 (C_X=1) and the counts of states Y and Z are 0
   (C_Y=C_Z=0). There's no defined order so whichever state change happens
   second "wins". However, if the count updates were allowed outside the
   metadata locks, it would be possible for the order of operations to be
   X->Y, Y->Z, (C_Y-- C_Z++), (C_X-- C_Y++), causing C_Y to momentarily go
   negative. If count updates are restricted to inside the metadata locks, a
   state change order of X->Y, Y->Z forces the count update order to be
   (C_X-- C_Y++), (C_Y--, C_Z++).
3. Lastly, the count must also be updated with the tablet metadata lock in
   commit mode. Otherwise it's possible for the state change to "leak" to
   another thread via the cached map (and thus affect IsCreateInProgress)
   but not via the actual tablet state.

To accommodate these semantics, the ScopedTabletInfoCommitter was modified
to support tracking state changes. There are two notable (and confusing)
exceptions which are documented in the code.

Change-Id: Idca379bb1249e54c0b7fbb0df408ed829ab9fbd2
---
M src/kudu/master/catalog_manager.cc
M src/kudu/master/catalog_manager.h
M src/kudu/master/sys_catalog-test.cc
M src/kudu/util/cow_object.h
4 files changed, 261 insertions(+), 80 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/57/7957/3
-- 
To view, visit http://gerrit.cloudera.org:8080/7957
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Idca379bb1249e54c0b7fbb0df408ed829ab9fbd2
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Dan Burkert <da...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>