You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org> on 2016/11/27 23:16:10 UTC

[kudu-CR] KUDU-798 (part 3) Remove the clock from MvccManager

David Ribeiro Alves has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/5240

Change subject: KUDU-798 (part 3) Remove the clock from MvccManager
......................................................................

KUDU-798 (part 3) Remove the clock from MvccManager

All mvcc transactions are now started at a timestamp so there is no
need for MvccManager to assign timestamps. Without this there
was a single place where we used the clock, outside of checks:
We used the clock in WaitForCleanSnapshotAtTimestamp() to make sure
the clock was past the timestamp before waiting on the snapshot to
be clean.

While this didn't make much sense, since the clean timestamp
shouldn't ever move past the current clock value, this is currently
a proxy to waiting for "safe time" on leader side transactions.
Until we properly handle that in follow up patches this check was
moved to where we handle snapshot scans in TabletService.

Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
---
M src/kudu/tablet/compaction-test.cc
M src/kudu/tablet/deltamemstore-test.cc
M src/kudu/tablet/diskrowset-test-base.h
M src/kudu/tablet/major_delta_compaction-test.cc
M src/kudu/tablet/memrowset-test.cc
M src/kudu/tablet/mvcc-test.cc
M src/kudu/tablet/mvcc.cc
M src/kudu/tablet/mvcc.h
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/transactions/write_transaction.cc
M src/kudu/tablet/transactions/write_transaction.h
M src/kudu/tserver/tablet_service.cc
12 files changed, 86 insertions(+), 106 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/40/5240/1
-- 
To view, visit http://gerrit.cloudera.org:8080/5240
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>

[kudu-CR] WIP: KUDU-798 (part 5) Safe time advancement in the absense of writes

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: WIP: KUDU-798 (part 5) Safe time advancement in the absense of writes
......................................................................


Patch Set 21:

(29 comments)

http://gerrit.cloudera.org:8080/#/c/5240/21//COMMIT_MSG
Commit Message:

Line 7: WIP: KUDU-798 (part 5) Safe time advancement in the absense of writes
> nit: absence
Done


PS21, Line 19: if there aren't any writes in-flight that
             :   it doesn't know about yet
> I think this merits further explanation
this is more of a time manager detail. explained it further there.


PS21, Line 22: Mvcc's safe time is a more conservative version of "global"
             :   safe time
> should we consider renaming this to 'trimmable time' or 'safetime_lower_bou
not sure, with leader leases it can actually be the same as the time manager's safe time..


http://gerrit.cloudera.org:8080/#/c/5240/21/java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java
File java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java:

PS21, Line 58: cluster 
> nit: comma here (http://grammar.ccc.commnet.edu/GRAMMAR/commas_intro.htm ha
thanks. done


PS21, Line 59: se
> absence
Done


PS21, Line 59: we expect precise clock values in the test
> not understanding this exactly. What does one thing have to do with the oth
Done


PS21, Line 61: --disable_
> I think 'negative' gflags are kind of confusing, because you get these kind
Done


http://gerrit.cloudera.org:8080/#/c/5240/21/src/kudu/consensus/consensus.h
File src/kudu/consensus/consensus.h:

Line 28: #include "kudu/consensus/time_manager.h"
> can you forward decl?
Done


http://gerrit.cloudera.org:8080/#/c/5240/21/src/kudu/consensus/consensus_queue-test.cc
File src/kudu/consensus/consensus_queue-test.cc:

Line 78:     scoped_refptr<TimeManager> time_manager(new TimeManager(clock, Timestamp::kMin));
> missing include for this? (I guess it's getting it transitively)
Done


http://gerrit.cloudera.org:8080/#/c/5240/21/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

Line 105:                                    const scoped_refptr<TimeManager>& time_manager,
> pass by value and then use std::move below
Done


Line 275:     RETURN_NOT_OK(time_manager_->AdvanceSafeTimeWithMessage(*msg->get()));
> above comment should be updated to reference this.
Done. It was not possible before, but with recent changes to the TimeManager it now is. Also we should only do this in leader mode (until we have leader leases we only advance safe time in replicas on commit to reduce opportunity for it to move back) so added that condition


PS21, Line 442:  
> comma
Done


PS21, Line 443: else {
              :     if
> collapse
Done


http://gerrit.cloudera.org:8080/#/c/5240/21/src/kudu/consensus/consensus_queue.h
File src/kudu/consensus/consensus_queue.h:

Line 34: #include "kudu/consensus/time_manager.h"
> is forward decl sufficient?
Done


http://gerrit.cloudera.org:8080/#/c/5240/21/src/kudu/consensus/raft_consensus.cc
File src/kudu/consensus/raft_consensus.cc:

Line 533:   RETURN_NOT_OK(time_manager_->AssignTimestamp(replicate));
> PREPEND
I changed this to a check ok. in this case we just changed the queue/time_manager to leader mode under the consensus lock so this shouldn't fail


PS21, Line 1198: *(*iter)->get()
> is **iter sufficient?
good point. I changed this to a CHECK_OK. Currently it only fails if we can't update the clock (in which case we should crash). If this changes in the future we'll have to update this. Left a TODO for myself.


PS21, Line 1202: d 
> comma
Done


Line 1205:       RETURN_NOT_OK(time_manager_->AdvanceSafeTime(Timestamp(request->safe_timestamp())));
> same, is this safe to return or should it be a CHECK?
this returns void now and does a check internally.


http://gerrit.cloudera.org:8080/#/c/5240/21/src/kudu/consensus/raft_consensus_state.h
File src/kudu/consensus/raft_consensus_state.h:

Line 33: #include "kudu/consensus/time_manager.h"
> fwd decl?
Done


PS21, Line 256: explicit
> dont need explicit
Done


http://gerrit.cloudera.org:8080/#/c/5240/21/src/kudu/integration-tests/raft_consensus-itest.cc
File src/kudu/integration-tests/raft_consensus-itest.cc:

PS21, Line 1195: the clock values are unexpected.
> not quite following this
The above is required  because, in the tests that call this, we're using logical values for the timestamps.

This is not required anymore.


http://gerrit.cloudera.org:8080/#/c/5240/21/src/kudu/tablet/mvcc-test.cc
File src/kudu/tablet/mvcc-test.cc:

Line 508:   // Commit tx 2 - thread should still wai.
> typo
Done


Line 510:   mgr.CommitTransaction(tx2);
> probably want an ASSERT_FALSE(HasResultSnapshot()) here.
Done


http://gerrit.cloudera.org:8080/#/c/5240/21/src/kudu/tablet/mvcc.cc
File src/kudu/tablet/mvcc.cc:

Line 285: bool MvccManager::AnyInFlightAtOrBeforeUnlocked(Timestamp ts) const {
> hrm, could this just be implemented by looking at earliest_in_flight_?
It could, but I thought you said the current implementation was a bug? (i.e. how is this different from the method above)
I reverted the rename and left a TODO(todd)


http://gerrit.cloudera.org:8080/#/c/5240/21/src/kudu/tablet/mvcc.h
File src/kudu/tablet/mvcc.h:

Line 224:   // time received from the leader (which can go back without leader leases).
> perhaps elaborate that this would crash if you tried to move it backwards? 
this won't crash as it doesn't affect correctness (submitting a committed timestamp after moving clean time would though). I tried a crash but we have a bunch of test-only code like LocalTabletWriter that might set this back. Because of this I left a LOG_EVERY_N(ERROR) but didn't make it crash. I couldn't see any crashed on the full stack itests though.


Line 235:   // them to complete before returning.
> can you clarify in the docs whether this has any bearing on safetime? i.e t
Added a note that 'timestamp' must be marked as safe first.


http://gerrit.cloudera.org:8080/#/c/5240/21/src/kudu/tablet/tablet_peer.cc
File src/kudu/tablet/tablet_peer.cc:

PS21, Line 168: GetCleanTimestamp(
> is this ever relevant at this point? isn't the tablet not bootstrapped yet?
yeah, but the time manager doesn't know what is "safe" yet. this way even if the replica is partitioned out it can still serve scans for ts's that come before the last know clean time (which is a strict lower bound for safe time)


http://gerrit.cloudera.org:8080/#/c/5240/21/src/kudu/tablet/transactions/write_transaction.h
File src/kudu/tablet/transactions/write_transaction.h:

PS21, Line 103:  This must be called exactly once, during the PREPARE phase just
              :   // after the MvccManager has assigned a timestamp.
              :   // This also copies the timestamp from the MVCC transaction into the
              :   // WriteTransactionState object.
> needs an update
Done


http://gerrit.cloudera.org:8080/#/c/5240/21/src/kudu/tserver/tablet_service.cc
File src/kudu/tserver/tablet_service.cc:

PS21, Line 85: unsafe
> is this really unsafe? or just experimental?
increasing this would increase the likelyhood of hanging scanner threads for a long time. It was a hard clamp before. I stand by the unsafe.


-- 
To view, visit http://gerrit.cloudera.org:8080/5240
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
Gerrit-PatchSet: 21
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] WIP: KUDU-798 (part 5) Safe time advancement in the absense of writes

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5240

to look at the new patch set (#19).

Change subject: WIP: KUDU-798 (part 5) Safe time advancement in the absense of writes
......................................................................

WIP: KUDU-798 (part 5) Safe time advancement in the absense of writes

This plugs the new TimeManager through consensus and wherever else
it's needed and integrates with with transactions, tablet service
and consensus.

Noteworthy:

- Safe time advances as before _except_ in the absense of writes.
  That is safe time is still advanced in Mvcc by the transaction
  driver and only up to the committed transaction's timestamp.
  When there are no writes to send the leader sends either its
  current clock value (if there aren't any writes in-flight that
  it doesn't know about yet) or the last safe timestamp.

- Mvcc's safe time is a more conservative version of "global"
  safe time. Mvcc does not get updated with safe time heartbeats
  from the leader. The clean snapshot must stay "clean" otherwise
  a lot of things would break.

- Mvcc's WaitForCleanSnapshot() method was removed. There is no
  longer a guarantee that the "clean" timestamp will move to
  the present. It might not, in the absense of writes. This
  patch introduces a new method that allows to wait for all
  "known" transactions before a timestamp to be committed.

WIP as still missing a directed test but follow up patches
include integration tests (like an improved version of
linked_list-test that performs snapshot scans in the present)
that have been looped on dist-test.

This should be good to be reviewed.

Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
---
M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java
M src/kudu/consensus/consensus.h
M src/kudu/consensus/consensus.proto
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/raft_consensus_quorum-test.cc
M src/kudu/consensus/raft_consensus_state.cc
M src/kudu/consensus/raft_consensus_state.h
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/tablet/mvcc-test.cc
M src/kudu/tablet/mvcc.cc
M src/kudu/tablet/mvcc.h
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet_peer.cc
M src/kudu/tablet/tablet_peer.h
M src/kudu/tablet/transactions/alter_schema_transaction.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tablet/transactions/transaction_driver.h
M src/kudu/tablet/transactions/transaction_tracker-test.cc
M src/kudu/tablet/transactions/write_transaction.cc
M src/kudu/tablet/transactions/write_transaction.h
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/tablet_service.h
27 files changed, 160 insertions(+), 116 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/40/5240/19
-- 
To view, visit http://gerrit.cloudera.org:8080/5240
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
Gerrit-PatchSet: 19
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-798 (part 5) Correct safe time advancement

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: KUDU-798 (part 5) Correct safe time advancement
......................................................................


Patch Set 32:

(35 comments)

http://gerrit.cloudera.org:8080/#/c/5240/32//COMMIT_MSG
Commit Message:

PS32, Line 31: whole
hole


PS32, Line 56: it
is


PS32, Line 52: 
             : - TimeManager now does a pre-flight check before waiting on safe time.
             :   In particular it checks that: i) it has heard from the leader within
             :   a configurable amount of time (that safe time _can_ make progress).
             :   ii) it checks that the safe time it not more that a configurable
             :   amount of time in the past, 30 seconds by default (that safe time
             :   is likely to make progress to the required timestamp).
I haven't looked at the code yet, but one question here: these checks should be only relevant for 'now' or 'recent past' type snapshot queries. If you queried 30 seconds in the past, for example, then the preflight check is not necessary, right?

Put another way, we should only do this check if the requested snapshot is not safe yet (ie if we'd actually have to wait at all).

Ignore this comment if you're already doing this :)


PS32, Line 65: unique
any particular reason why not the one with dup keys too?


PS32, Line 71: workloaf
typo


http://gerrit.cloudera.org:8080/#/c/5240/32/src/kudu/client/client-test-util.cc
File src/kudu/client/client-test-util.cc:

Line 58:   ASSERT_OK(scanner.SetSelection(KuduClient::LEADER_ONLY));
is LEADER_ONLY still relevant here? or should this be changed to set it to SNAPSHOT at current time? (feel free to add TODO if it's not a trivial change)


http://gerrit.cloudera.org:8080/#/c/5240/32/src/kudu/client/client-test.cc
File src/kudu/client/client-test.cc:

Line 1147: TEST_F(ClientTest, TestScanFaultTolerance) {
did you loop this test too?


http://gerrit.cloudera.org:8080/#/c/5240/32/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

Line 456:       KLOG_EVERY_N_SECS(WARNING, 10) << "Safe time advancement without writes is disabled. "
hrm, maybe FIRST_K is better here? not sure seeing it once every 10 seconds is that useful. (or maybe once every 5 minutes or something?)


http://gerrit.cloudera.org:8080/#/c/5240/32/src/kudu/consensus/raft_consensus.cc
File src/kudu/consensus/raft_consensus.cc:

Line 1238:     if (request->has_safe_timestamp()) {
I'm still not sure this is in the right spot if the tail of the messages in the request failed to prepare. Agree right now it's not relevant, but once the leader starts sending this alongside ops, this behavior isn't right


http://gerrit.cloudera.org:8080/#/c/5240/32/src/kudu/consensus/time_manager.cc
File src/kudu/consensus/time_manager.cc:

PS32, Line 31: safe_time_max_last_advanced_safe_time
not a big fan of this name. How about: missed_heartbeats_before_rejecting_snapshot_scans or something of that nature?


PS32, Line 33: snapshot scans
scans at snapshots that aren't yet safe


Line 34: TAG_FLAG(safe_time_max_last_advanced_safe_time, advanced);
let's mark this unstable for now too


PS32, Line 37: lag
lag behind the requested timestamp?


Line 39: TAG_FLAG(safe_time_max_lag_ms, advanced);
same


PS32, Line 151:  $0 secs, (max is $1 sec
'secs' is redundant here since MonoDelta::ToString already has a 's' suffix


Line 166:                                 safe_time_diff.ToMilliseconds(),
wrong units here (message says secs)


Line 167:                                 FLAGS_safe_time_max_lag_ms);
same


Line 173: void TimeManager::MakeWaiterTimeoutMessageUnlocked(Timestamp timestamp, string* error_message) {
probably easier to just return string from this method instead of the out-param. Or just inline it at the call site (I only see it once)


Line 188:     if (mode_ == NON_LEADER && timestamp > last_safe_ts_) {
ah ok, so you are doing the thing that I asked about in the commit message. nice.


http://gerrit.cloudera.org:8080/#/c/5240/32/src/kudu/consensus/time_manager.h
File src/kudu/consensus/time_manager.h:

PS32, Line 117: is 
if


http://gerrit.cloudera.org:8080/#/c/5240/32/src/kudu/integration-tests/raft_consensus-itest.cc
File src/kudu/integration-tests/raft_consensus-itest.cc:

Line 2345:       "--maintenance_manager_polling_interval_ms=300",
typo?


Line 2664:   //workload.set_num_read_threads(2);
hrm?


http://gerrit.cloudera.org:8080/#/c/5240/32/src/kudu/integration-tests/test_workload.cc
File src/kudu/integration-tests/test_workload.cc:

Line 270:   start_latch_.Reset((uint64_t)num_write_threads_ + (uint64_t)num_read_threads_);
nit: why these casts?


http://gerrit.cloudera.org:8080/#/c/5240/32/src/kudu/tools/tool_action_remote_replica.cc
File src/kudu/tools/tool_action_remote_replica.cc:

Line 100:     // and thus cannot do snapshot scans.
I think ORDERED is still useful here.


http://gerrit.cloudera.org:8080/#/c/5240/32/src/kudu/tserver/tablet_service.cc
File src/kudu/tserver/tablet_service.cc:

PS32, Line 84: time
'of time (in milliseconds)' and remove 'in milliseconds' from the end of the sentence


PS32, Line 85: allows to
'allows us to' or 'allows waiting'


PS32, Line 85: it's 
its


Line 1329:               ": has no lower or upper bound.");
lots of reindentation in this file. mind reverting that part of the patch?


PS32, Line 1387: Retunrs
typo


PS32, Line 1388: CheckIfAncientHistory
rename to 'VerifyNotAncientHistory' or 'VerifyAfterAHM' or something. Otherwise it's not clear from the method name what the return status indicates


Line 1759: MonoTime ClampDeadline(const MonoTime& deadline, bool* was_clamped) {
rename to ClampScanDeadlineForWait or something so it's clearer where this is used.


PS32, Line 1816: ahm
nit: AHM


PS32, Line 1842: allows to
allows us to


PS32, Line 1850: s.ok() 
!s.ok() is redundant here with s.IsTimedout()


Line 1861:   if (!s.ok() && s.IsTimedOut() && was_clamped) {
same

Could you combine these two cases by just making the above be:

if (s.ok()) {
  s = mvcc_manager->waitForSnapshotWithAllCommitted...
}

?


-- 
To view, visit http://gerrit.cloudera.org:8080/5240
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
Gerrit-PatchSet: 32
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] WIP: KUDU-798 (part 5) Safe time advancement in the absence of writes

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5240

to look at the new patch set (#23).

Change subject: WIP: KUDU-798 (part 5) Safe time advancement in the absence of writes
......................................................................

WIP: KUDU-798 (part 5) Safe time advancement in the absence of writes

This plugs the new TimeManager through consensus and wherever else
it's needed and integrates with with transactions, tablet service
and consensus.

Noteworthy:

- Safe time advances as before _except_ in the absense of writes.
  That is safe time is still advanced in Mvcc by the transaction
  driver and only up to the committed transaction's timestamp.
  When there are no writes to send the leader sends its last
  safe time to the replica, which updates itself and unblocks
  waiters.

- Mvcc's safe time is a more conservative version of "global"
  safe time. Mvcc does not get updated with safe time heartbeats
  from the leader. The clean snapshot must stay "clean" otherwise
  a lot of things would break.

- Mvcc's WaitForCleanSnapshot() method was removed. There is no
  longer a guarantee that the "clean" timestamp will move to
  the present. It might not, in the absense of writes. This
  patch introduces a new method that allows to wait for all
  "known" transactions before a timestamp to be committed.

WIP workign on a thorough integration test to be merged with
this.

This should be good to be reviewed.

Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
---
M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java
M src/kudu/client/client-test-util.cc
M src/kudu/client/client-test.cc
M src/kudu/consensus/consensus.h
M src/kudu/consensus/consensus.proto
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/raft_consensus_quorum-test.cc
M src/kudu/consensus/raft_consensus_state.cc
M src/kudu/consensus/raft_consensus_state.h
M src/kudu/integration-tests/alter_table-test.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/tablet/mvcc-test.cc
M src/kudu/tablet/mvcc.cc
M src/kudu/tablet/mvcc.h
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet_bootstrap.cc
M src/kudu/tablet/tablet_peer.cc
M src/kudu/tablet/tablet_peer.h
M src/kudu/tablet/tablet_peer_mm_ops.cc
M src/kudu/tablet/transactions/alter_schema_transaction.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tablet/transactions/transaction_driver.h
M src/kudu/tablet/transactions/transaction_tracker-test.cc
M src/kudu/tablet/transactions/transaction_tracker.cc
M src/kudu/tablet/transactions/write_transaction.cc
M src/kudu/tablet/transactions/write_transaction.h
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/tablet_service.h
33 files changed, 218 insertions(+), 149 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/40/5240/23
-- 
To view, visit http://gerrit.cloudera.org:8080/5240
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
Gerrit-PatchSet: 23
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [NOT FOR REVIEW] KUDU-798 (part 3) Remove the clock from MvccManager

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: [NOT FOR REVIEW] KUDU-798 (part 3) Remove the clock from MvccManager
......................................................................


Patch Set 2:

nah, this is broken without the rest. will push soon.

-- 
To view, visit http://gerrit.cloudera.org:8080/5240
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] WIP: Support safe time advancement on replicas in the absense of writes

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5240

to look at the new patch set (#9).

Change subject: WIP: Support safe time advancement on replicas in the absense of writes
......................................................................

WIP: Support safe time advancement on replicas in the absense of writes

Mostly getting some jenkins mileage but the design should be close
to the final one.

This is a WIP that needs to be broken down further but that summarily
adds the following:

- Add a TimeManager class: This class is responsible for tracking
  safe time advancement on leaders and replicas.
- Removes the clock from MvccManager - Since safe time is now managed
  by the TimeManager there is no need for Mvcc to have one internally.
- Changes how we wait for "all committed" in MvccManager. Previously
  we would wait for a clean snapshot. Now we instead wait for all
  transactions before the timestamp to be committed even if the
  clean timestamp doesn't advance. This is because without leader
  leases the safe time can go back, but moving the clean time back
  would break a lot of things internally.
- Plugs the TimeManager in consensus/queue and calls its methods
  appropriately.

Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
---
M src/kudu/consensus/CMakeLists.txt
M src/kudu/consensus/consensus.h
M src/kudu/consensus/consensus.proto
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/raft_consensus_quorum-test.cc
A src/kudu/consensus/time_manager-test.cc
A src/kudu/consensus/time_manager.cc
A src/kudu/consensus/time_manager.h
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/tablet/compaction-test.cc
M src/kudu/tablet/deltamemstore-test.cc
M src/kudu/tablet/diskrowset-test-base.h
M src/kudu/tablet/major_delta_compaction-test.cc
M src/kudu/tablet/memrowset-test.cc
M src/kudu/tablet/mvcc-test.cc
M src/kudu/tablet/mvcc.cc
M src/kudu/tablet/mvcc.h
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet_peer.cc
M src/kudu/tablet/tablet_peer.h
M src/kudu/tablet/transactions/alter_schema_transaction.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tablet/transactions/write_transaction.cc
M src/kudu/tablet/transactions/write_transaction.h
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/tablet_service.h
31 files changed, 626 insertions(+), 214 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/40/5240/9
-- 
To view, visit http://gerrit.cloudera.org:8080/5240
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
Gerrit-PatchSet: 9
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-798 (part 5) Correct safe time advancement

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: KUDU-798 (part 5) Correct safe time advancement
......................................................................


Patch Set 28:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/5240/27/src/kudu/consensus/raft_consensus.cc
File src/kudu/consensus/raft_consensus.cc:

PS27, Line 1198: fail.
               :       // Once we have that functionality we'll have to revisit this.
> nit: two sentences
Done


Line 1204:     // If we stopped before reaching the end we failed to prepare some message(s) and need
> but in the meantime don't we want to do this only if we prepared all ops? i
sure, makes sense and it's defensive though I'm not sure it matters in the current implementation.


http://gerrit.cloudera.org:8080/#/c/5240/28/src/kudu/consensus/time_manager.cc
File src/kudu/consensus/time_manager.cc:

PS28, Line 184:  // Pre-flight checks, make sure we've heard from the leader recently and that safe time
              :   // isn't lagging too much.
maybe we should do a GetSafeTime() call here so that if we're leader we 're less likely to trip the leader check (though we do call it in every heartbeat).


-- 
To view, visit http://gerrit.cloudera.org:8080/5240
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
Gerrit-PatchSet: 28
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] WIP: KUDU-798 (part 5) Safe time advancement in the absense of writes

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: WIP: KUDU-798 (part 5) Safe time advancement in the absense of writes
......................................................................


Patch Set 21:

(29 comments)

mostly small stuff

http://gerrit.cloudera.org:8080/#/c/5240/21//COMMIT_MSG
Commit Message:

Line 7: WIP: KUDU-798 (part 5) Safe time advancement in the absense of writes
nit: absence


PS21, Line 19: if there aren't any writes in-flight that
             :   it doesn't know about yet
I think this merits further explanation


PS21, Line 22: Mvcc's safe time is a more conservative version of "global"
             :   safe time
should we consider renaming this to 'trimmable time' or 'safetime_lower_bound' or something?


http://gerrit.cloudera.org:8080/#/c/5240/21/java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java
File java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java:

PS21, Line 58: cluster 
nit: comma here (http://grammar.ccc.commnet.edu/GRAMMAR/commas_intro.htm has some good examples)


PS21, Line 59: se
absence


PS21, Line 59: we expect precise clock values in the test
not understanding this exactly. What does one thing have to do with the other? safetime advancements end up pushing the clock value ahead or something? Please elaborate, especially since people editing the Java client tests may not be as familiar with the internal details.


PS21, Line 61: --disable_
I think 'negative' gflags are kind of confusing, because you get these kind of double-negatives. why not have it be an 'enable' flag, defaulted to true, and set it to false to disable it?


http://gerrit.cloudera.org:8080/#/c/5240/21/src/kudu/consensus/consensus.h
File src/kudu/consensus/consensus.h:

Line 28: #include "kudu/consensus/time_manager.h"
can you forward decl?


http://gerrit.cloudera.org:8080/#/c/5240/21/src/kudu/consensus/consensus_queue-test.cc
File src/kudu/consensus/consensus_queue-test.cc:

Line 78:     scoped_refptr<TimeManager> time_manager(new TimeManager(clock, Timestamp::kMin));
missing include for this? (I guess it's getting it transitively)


http://gerrit.cloudera.org:8080/#/c/5240/21/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

Line 105:                                    const scoped_refptr<TimeManager>& time_manager,
pass by value and then use std::move below


Line 275:     RETURN_NOT_OK(time_manager_->AdvanceSafeTimeWithMessage(*msg->get()));
above comment should be updated to reference this.

would it be more efficient (and just as useful) to just do it once at the end for msgs.back()?


PS21, Line 442:  
comma


PS21, Line 443: else {
              :     if
collapse


http://gerrit.cloudera.org:8080/#/c/5240/21/src/kudu/consensus/consensus_queue.h
File src/kudu/consensus/consensus_queue.h:

Line 34: #include "kudu/consensus/time_manager.h"
is forward decl sufficient?


http://gerrit.cloudera.org:8080/#/c/5240/21/src/kudu/consensus/raft_consensus.cc
File src/kudu/consensus/raft_consensus.cc:

Line 533:   RETURN_NOT_OK(time_manager_->AssignTimestamp(replicate));
PREPEND


PS21, Line 1198: *(*iter)->get()
is **iter sufficient?

Can't remember under what circumstances this returns non-OK, but are we actually safe to return here rather than break?


PS21, Line 1202: d 
comma


Line 1205:       RETURN_NOT_OK(time_manager_->AdvanceSafeTime(Timestamp(request->safe_timestamp())));
same, is this safe to return or should it be a CHECK?


http://gerrit.cloudera.org:8080/#/c/5240/21/src/kudu/consensus/raft_consensus_state.h
File src/kudu/consensus/raft_consensus_state.h:

Line 33: #include "kudu/consensus/time_manager.h"
fwd decl?


PS21, Line 256: explicit
dont need explicit


http://gerrit.cloudera.org:8080/#/c/5240/21/src/kudu/integration-tests/raft_consensus-itest.cc
File src/kudu/integration-tests/raft_consensus-itest.cc:

PS21, Line 1195: the clock values are unexpected.
not quite following this


http://gerrit.cloudera.org:8080/#/c/5240/21/src/kudu/tablet/mvcc-test.cc
File src/kudu/tablet/mvcc-test.cc:

Line 508:   // Commit tx 2 - thread should still wai.
typo


Line 510:   mgr.CommitTransaction(tx2);
probably want an ASSERT_FALSE(HasResultSnapshot()) here.


http://gerrit.cloudera.org:8080/#/c/5240/21/src/kudu/tablet/mvcc.cc
File src/kudu/tablet/mvcc.cc:

Line 285: bool MvccManager::AnyInFlightAtOrBeforeUnlocked(Timestamp ts) const {
hrm, could this just be implemented by looking at earliest_in_flight_?


http://gerrit.cloudera.org:8080/#/c/5240/21/src/kudu/tablet/mvcc.h
File src/kudu/tablet/mvcc.h:

Line 224:   // time received from the leader (which can go back without leader leases).
perhaps elaborate that this would crash if you tried to move it backwards? (would it?)

Maybe it should be called something like SetNewTransactionLowerBound? per your commit message, the MvccManager doesn't really track safetime anymore, it just tracks a lower bound on new transactions (which itself is a lower bound on safe time). Or am I misunderstanding?


Line 235:   // them to complete before returning.
can you clarify in the docs whether this has any bearing on safetime? i.e this does (or doesnt) guarantee that a new transaction could still start with a timestamp < timestamp


http://gerrit.cloudera.org:8080/#/c/5240/21/src/kudu/tablet/tablet_peer.cc
File src/kudu/tablet/tablet_peer.cc:

PS21, Line 168: GetCleanTimestamp(
is this ever relevant at this point? isn't the tablet not bootstrapped yet?


http://gerrit.cloudera.org:8080/#/c/5240/21/src/kudu/tablet/transactions/write_transaction.h
File src/kudu/tablet/transactions/write_transaction.h:

PS21, Line 103:  This must be called exactly once, during the PREPARE phase just
              :   // after the MvccManager has assigned a timestamp.
              :   // This also copies the timestamp from the MVCC transaction into the
              :   // WriteTransactionState object.
needs an update


http://gerrit.cloudera.org:8080/#/c/5240/21/src/kudu/tserver/tablet_service.cc
File src/kudu/tserver/tablet_service.cc:

PS21, Line 85: unsafe
is this really unsafe? or just experimental?


-- 
To view, visit http://gerrit.cloudera.org:8080/5240
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
Gerrit-PatchSet: 21
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-798 (part 5) Correct safe time advancement

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5240

to look at the new patch set (#32).

Change subject: KUDU-798 (part 5) Correct safe time advancement
......................................................................

KUDU-798 (part 5) Correct safe time advancement

This patch fixes safe time advancement in general and allows for
safe time advancement in the absence of writes in particular.
The core of the patch is to plug in the new TimeManager class wherever
needed. But there is also a major cleanup of the waiting story in
TabletService (KUDU-1127) and a few new integration test features.

There is an instance of a TimeManager per tablet. It's used for:

- When replicating messages to other replicas a leader uses the
  TimeManager to assign timestamps and to obtain a safe time to send,
  even when there are no writes.

- When receiving messages from a leader consensus uses the TimeManager
  to update the clock and to unblock any waiters that might waiting
  for a particular timestamp to be safe.

- Before a snapshot scan proceeds to scan, it must first wait for the
  TimeManager to deem whatever timestamp it has safe. Then it proceeds
  to wait for the snapshot at timestamp to have all its transactions
  committed and, finally, proceeds with the scan.

Put together, these changes allow to make sure that snapshot scans are
repeatable in the large majority of cases. The one "whole" in safe time
is solved by leader leases. Until we have those this patch takes a
conservative approach to safe time progress.

Fixing safe time broke a bunch of our tests that were expecting broken
snapshot scans. In particular we would return broken snapshots all the
time instead of waiting for the snapshot to be correct. Of course
when these errors were fixed the tests started failing.

In order to address these test failures I cleaned up our snapshot scan
waiting story in TabletServer::HandleScanAtSnapshot(). In particular:

- The client's deadline in no longer taken into account when deciding
  how long to wait for a snapshot to be repeatable. There is a hard
  (configurable) max that the server will wait for, "clamping" the
  client's deadline. The key here is that, when the client deadline
  is clamped, we return Status::ServiceUnavailable on time out
  instead of Status::TimeOut(). Since HandleScanAtSnapshot() is called
  on KuduScanner::Open() and ServiceUnavailable is a retryable status
  this enables the client to try somewhere else, perhaps where it won't
  have to wait as long.

- TimeManager now does a pre-flight check before waiting on safe time.
  In particular it checks that: i) it has heard from the leader within
  a configurable amount of time (that safe time _can_ make progress).
  ii) it checks that the safe time it not more that a configurable
  amount of time in the past, 30 seconds by default (that safe time
  is likely to make progress to the required timestamp).

Finally, this patch adds two new integration test workloads that
prove that it works. It adds a new read workload to TestWorkload
that performs snapshot scans in the present, while writes are
happening. This can be enabled anywhere but this patch enables
it for a bunch of tests in RaftConsensusItest, in particular the
*Churny* and *Crashy* ones with unique keys. This patch also enables
linked_list-test to perform snapshot scans in the present after
the verification.

Results:

I ran raft_consensus-itest with the new snapshot read workloaf
on dist-test, asan, with the following config:

KUDU_ALLOW_SLOW_TESTS=1 ../../build-support/dist_test.py \
--disable-sharding loop -n 1000 bin/raft_consensus-itest \
--gtest_filter=*Churny*:*Crashy*-*Duplicate*

I pulled the test to before this patch. It failed 1000/1000 on
master. With this patch it passed 1000/1000:

http://dist-test.cloudera.org//job?job_id=david.alves.1481097865.18287

I ran linked_list-test with the new snapshot scans in dist-test,
asan, with the following config:

KUDU_ALLOW_SLOW_TESTS=1 ../../build-support/dist_test.py \
loop -n 1000 bin/linked_list-test --stress_cpu_threads=2 \
--gtest_filter=*TestLoadAndVerify*

The test passed 1000/1000 whereas before it would fail 427/1000.
Results:

http://dist-test.cloudera.org//job?job_id=david.alves.1481104106.24665

Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
---
M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java
M src/kudu/client/client-test-util.cc
M src/kudu/client/client-test.cc
M src/kudu/consensus/consensus.h
M src/kudu/consensus/consensus.proto
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/raft_consensus_quorum-test.cc
M src/kudu/consensus/raft_consensus_state.cc
M src/kudu/consensus/raft_consensus_state.h
M src/kudu/consensus/time_manager.cc
M src/kudu/consensus/time_manager.h
M src/kudu/integration-tests/alter_table-test.cc
M src/kudu/integration-tests/cluster_verifier.cc
M src/kudu/integration-tests/linked_list-test-util.h
M src/kudu/integration-tests/linked_list-test.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/integration-tests/tablet_history_gc-itest.cc
M src/kudu/integration-tests/test_workload.cc
M src/kudu/integration-tests/test_workload.h
M src/kudu/tablet/mvcc-test.cc
M src/kudu/tablet/mvcc.cc
M src/kudu/tablet/mvcc.h
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet_bootstrap.cc
M src/kudu/tablet/tablet_peer.cc
M src/kudu/tablet/tablet_peer.h
M src/kudu/tablet/transactions/alter_schema_transaction.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tablet/transactions/transaction_driver.h
M src/kudu/tablet/transactions/transaction_tracker-test.cc
M src/kudu/tablet/transactions/transaction_tracker.cc
M src/kudu/tablet/transactions/write_transaction.cc
M src/kudu/tablet/transactions/write_transaction.h
M src/kudu/tools/tool_action_remote_replica.cc
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/tablet_service.h
41 files changed, 638 insertions(+), 327 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/40/5240/32
-- 
To view, visit http://gerrit.cloudera.org:8080/5240
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
Gerrit-PatchSet: 32
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-798 (part 5) Correct safe time advancement

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5240

to look at the new patch set (#29).

Change subject: KUDU-798 (part 5) Correct safe time advancement
......................................................................

KUDU-798 (part 5) Correct safe time advancement

This patch fixes safe time advancement in general and allows for
safe time advancement in the absence of writes in particular.
The core of the patch is to plug in the new TimeManager class wherever
needed. But there is also a major cleanup of the waiting story in
TabletService (KUDU-1127) and a few new integration test features.

There is an instance of a TimeManager per tablet. It's used for:

- When replicating messages to other replicas a leader uses the
  TimeManager to assign timestamps and to obtain a safe time to send,
  even when there are no writes.

- When receiving messages from a leader consensus uses the TimeManager
  to update the clock and to unblock any waiters that might waiting
  for a particular timestamp to be safe.

- Before a snapshot scan proceeds to scan, it must first wait for the
  TimeManager to deem whatever timestamp it has safe. Then it proceeds
  to wait for the snapshot at timestamp to have all its transactions
  committed and, finally, proceeds with the scan.

Put together, these changes allow to make sure that snapshot scans are
repeatable in the large majority of cases. The one "whole" in safe time
is solved by leader leases. Until we have those this patch takes a
conservative approach to safe time progress.

Fixing safe time broke a bunch of our tests that were expecting broken
snapshot scans. In particular we would return broken snapshots all the
time instead of waiting for the snapshot to be correct. Of course
when these errors were fixed the tests started failing.

In order to address these test failures I cleaned up our snapshot scan
waiting story in TabletServer::HandleScanAtSnapshot(). In particular:

- The client's deadline in no longer taken into account when deciding
  how long to wait for a snapshot to be repeatable. There is a hard
  (configurable) max that the server will wait for, "clamping" the
  client's deadline. The key here is that, when the client deadline
  is clamped, we return Status::ServiceUnavailable on time out
  instead of Status::TimeOut(). Since HandleScanAtSnapshot() is called
  on KuduScanner::Open() and ServiceUnavailable is a retryable status
  this enables the client to try somewhere else, perhaps where it won't
  have to wait as long.

- TimeManager now does a pre-flight check before waiting on safe time.
  In particular it checks that: i) it has heard from the leader within
  a configurable amount of time (that safe time _can_ make progress).
  ii) it checks that the safe time it not more that a configurable
  amount of time in the past, 30 seconds by default (that safe time
  is likely to make progress to the required timestamp).

Finally, this patch adds two new integration test workloads that
prove that it works. It adds a new read workload to TestWorkload
that performs snapshot scans in the present, while writes are
happening. This can be enabled anywhere but this patch enables
it for a bunch of tests in RaftConsensusItest, in particular the
*Churny* and *Crashy* ones with unique keys. This patch also enables
linked_list-test to perform snapshot scans in the present after
the verification.

Results:

I ran raft_consensus-itest with the new snapshot read workloaf
on dist-test, asan, with the following config:

KUDU_ALLOW_SLOW_TESTS=1 ../../build-support/dist_test.py \
--disable-sharding loop -n 1000 bin/raft_consensus-itest \
--gtest_filter=*Churny*:*Crashy*-*Duplicate*

I pulled the test to before this patch. It failed 1000/1000 on
master. With this patch it passed 1000/1000:

http://dist-test.cloudera.org//job?job_id=david.alves.1481097865.18287

I ran linked_list-test with the new snapshot scans in dist-test,
asan, with the following config:

KUDU_ALLOW_SLOW_TESTS=1 ../../build-support/dist_test.py \
loop -n 1000 bin/linked_list-test --stress_cpu_threads=2 \
--gtest_filter=*TestLoadAndVerify*

The test passed 1000/1000 whereas before it would fail 427/1000.
Results:

http://dist-test.cloudera.org//job?job_id=david.alves.1481104106.24665

Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
---
M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java
M src/kudu/client/client-test-util.cc
M src/kudu/client/client-test.cc
M src/kudu/consensus/consensus.h
M src/kudu/consensus/consensus.proto
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/raft_consensus_quorum-test.cc
M src/kudu/consensus/raft_consensus_state.cc
M src/kudu/consensus/raft_consensus_state.h
M src/kudu/consensus/time_manager.cc
M src/kudu/consensus/time_manager.h
M src/kudu/integration-tests/alter_table-test.cc
M src/kudu/integration-tests/cluster_verifier.cc
M src/kudu/integration-tests/linked_list-test-util.h
M src/kudu/integration-tests/linked_list-test.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/integration-tests/tablet_history_gc-itest.cc
M src/kudu/integration-tests/test_workload.cc
M src/kudu/integration-tests/test_workload.h
M src/kudu/tablet/mvcc-test.cc
M src/kudu/tablet/mvcc.cc
M src/kudu/tablet/mvcc.h
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet_bootstrap.cc
M src/kudu/tablet/tablet_peer.cc
M src/kudu/tablet/tablet_peer.h
M src/kudu/tablet/tablet_peer_mm_ops.cc
M src/kudu/tablet/transactions/alter_schema_transaction.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tablet/transactions/transaction_driver.h
M src/kudu/tablet/transactions/transaction_tracker-test.cc
M src/kudu/tablet/transactions/transaction_tracker.cc
M src/kudu/tablet/transactions/write_transaction.cc
M src/kudu/tablet/transactions/write_transaction.h
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/tablet_service.h
41 files changed, 640 insertions(+), 317 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/40/5240/29
-- 
To view, visit http://gerrit.cloudera.org:8080/5240
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
Gerrit-PatchSet: 29
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] WIP: KUDU-798 (part 5) Safe time advancement in the absense of writes

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5240

to look at the new patch set (#21).

Change subject: WIP: KUDU-798 (part 5) Safe time advancement in the absense of writes
......................................................................

WIP: KUDU-798 (part 5) Safe time advancement in the absense of writes

This plugs the new TimeManager through consensus and wherever else
it's needed and integrates with with transactions, tablet service
and consensus.

Noteworthy:

- Safe time advances as before _except_ in the absense of writes.
  That is safe time is still advanced in Mvcc by the transaction
  driver and only up to the committed transaction's timestamp.
  When there are no writes to send the leader sends either its
  current clock value (if there aren't any writes in-flight that
  it doesn't know about yet) or the last safe timestamp.

- Mvcc's safe time is a more conservative version of "global"
  safe time. Mvcc does not get updated with safe time heartbeats
  from the leader. The clean snapshot must stay "clean" otherwise
  a lot of things would break.

- Mvcc's WaitForCleanSnapshot() method was removed. There is no
  longer a guarantee that the "clean" timestamp will move to
  the present. It might not, in the absense of writes. This
  patch introduces a new method that allows to wait for all
  "known" transactions before a timestamp to be committed.

WIP as still missing a directed test but follow up patches
include integration tests (like an improved version of
linked_list-test that performs snapshot scans in the present)
that have been looped on dist-test.

This should be good to be reviewed.

Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
---
M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java
M src/kudu/client/client-test.cc
M src/kudu/consensus/consensus.h
M src/kudu/consensus/consensus.proto
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/raft_consensus_quorum-test.cc
M src/kudu/consensus/raft_consensus_state.cc
M src/kudu/consensus/raft_consensus_state.h
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/tablet/mvcc-test.cc
M src/kudu/tablet/mvcc.cc
M src/kudu/tablet/mvcc.h
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet_peer.cc
M src/kudu/tablet/tablet_peer.h
M src/kudu/tablet/transactions/alter_schema_transaction.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tablet/transactions/transaction_driver.h
M src/kudu/tablet/transactions/transaction_tracker-test.cc
M src/kudu/tablet/transactions/write_transaction.cc
M src/kudu/tablet/transactions/write_transaction.h
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/tablet_service.h
28 files changed, 179 insertions(+), 139 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/40/5240/21
-- 
To view, visit http://gerrit.cloudera.org:8080/5240
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
Gerrit-PatchSet: 21
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-798 (part 5) Correct safe time advancement

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: KUDU-798 (part 5) Correct safe time advancement
......................................................................


Patch Set 34: Verified+1

overriding jenkins, all tests passed but debug build failed due to: 05:06:31 All tests passed, yet some left behind their test output:
05:06:31 krb5kdc-1481173108926

-- 
To view, visit http://gerrit.cloudera.org:8080/5240
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
Gerrit-PatchSet: 34
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] WIP: KUDU-798 (part 5) Safe time advancement in the absense of writes

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5240

to look at the new patch set (#18).

Change subject: WIP: KUDU-798 (part 5) Safe time advancement in the absense of writes
......................................................................

WIP: KUDU-798 (part 5) Safe time advancement in the absense of writes

This plugs the new TimeManager through consensus and wherever else
it's needed and integrates with with transactions, tablet service
and consensus.

Noteworthy:

- Safe time advances as before _except_ in the absense of writes.
  That is safe time is still advanced in Mvcc by the transaction
  driver and only up to the committed transaction's timestamp.
  When there are no writes to send the leader sends either its
  current clock value (if there aren't any writes in-flight that
  it doesn't know about yet) or the last safe timestamp.

- Mvcc's safe time is a more conservative version of "global"
  safe time. Mvcc does not get updated with safe time heartbeats
  from the leader. The clean snapshot must stay "clean" otherwise
  a lot of things would break.

- Mvcc's WaitForCleanSnapshot() method was removed. There is no
  longer a guarantee that the "clean" timestamp will move to
  the present. It might not, in the absense of writes. This
  patch introduces a new method that allows to wait for all
  "known" transactions before a timestamp to be committed.

WIP as still missing a directed test but follow up patches
include integration tests (like an improved version of
linked_list-test that performs snapshot scans in the present)
that have been looped on dist-test.

This should be good to be reviewed.

Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
---
M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java
M src/kudu/consensus/consensus.h
M src/kudu/consensus/consensus.proto
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/raft_consensus_quorum-test.cc
M src/kudu/consensus/raft_consensus_state.cc
M src/kudu/consensus/raft_consensus_state.h
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/tablet/mvcc-test.cc
M src/kudu/tablet/mvcc.cc
M src/kudu/tablet/mvcc.h
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet_peer.cc
M src/kudu/tablet/tablet_peer.h
M src/kudu/tablet/transactions/alter_schema_transaction.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tablet/transactions/transaction_driver.h
M src/kudu/tablet/transactions/transaction_tracker-test.cc
M src/kudu/tablet/transactions/write_transaction.cc
M src/kudu/tablet/transactions/write_transaction.h
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/tablet_service.h
27 files changed, 153 insertions(+), 111 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/40/5240/18
-- 
To view, visit http://gerrit.cloudera.org:8080/5240
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
Gerrit-PatchSet: 18
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-798 (part 5) Correct safe time advancement

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change.

Change subject: KUDU-798 (part 5) Correct safe time advancement
......................................................................


Patch Set 28:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/5240/27//COMMIT_MSG
Commit Message:

PS27, Line 15: n
nit: add comma or period?


http://gerrit.cloudera.org:8080/#/c/5240/28/src/kudu/integration-tests/consistency-itest.cc
File src/kudu/integration-tests/consistency-itest.cc:

PS28, Line 639: 3
What is the reason to have 3 rows inserted instead of 1?  It would be nice to know.


-- 
To view, visit http://gerrit.cloudera.org:8080/5240
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
Gerrit-PatchSet: 28
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-798 (part 5) Correct safe time advancement

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5240

to look at the new patch set (#27).

Change subject: KUDU-798 (part 5) Correct safe time advancement
......................................................................

KUDU-798 (part 5) Correct safe time advancement

This patch fixes safe time advancement in general and
allows for safe time advancement in the absence of writes
in particular.

The core of the patch is to plug in the new TimeManager
class wherever needed. There is an instance of a TimeManager
per tablet it's main uses are:

- When replicating messages to other replicas a leader
  uses the TimeManager to assign timestamps and to obtain
  a safe time to send, even when there are no writes.

- When receiving messages from a leader consensus uses
  the TimeManager to update the clock and to unblock
  any waiters that might waiting for a particular
  timestamp to be safe.

- Before a snapshot scan proceeds to scan, it must
  first wait for the TimeManager to deem whatever
  timestamp it has safe. Then it proceeds to wait for
  the snapshot at timestamp to have all its transactions
  committed and, finally, proceeds with the scan.

Put together, these changes allow to make sure that
snapshot scans are repeatable in the large majority
of cases.

Note: There is a high likely hood that https://gerrit.cloudera.org/#/c/5375/
will be merged with this patch. However I'm currently running
some experiments with it pre/post this patch.

Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
---
M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java
M src/kudu/client/client-test-util.cc
M src/kudu/client/client-test.cc
M src/kudu/consensus/consensus.h
M src/kudu/consensus/consensus.proto
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/raft_consensus_quorum-test.cc
M src/kudu/consensus/raft_consensus_state.cc
M src/kudu/consensus/raft_consensus_state.h
M src/kudu/integration-tests/alter_table-test.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/tablet/mvcc-test.cc
M src/kudu/tablet/mvcc.cc
M src/kudu/tablet/mvcc.h
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet_bootstrap.cc
M src/kudu/tablet/tablet_peer.cc
M src/kudu/tablet/tablet_peer.h
M src/kudu/tablet/tablet_peer_mm_ops.cc
M src/kudu/tablet/transactions/alter_schema_transaction.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tablet/transactions/transaction_driver.h
M src/kudu/tablet/transactions/transaction_tracker-test.cc
M src/kudu/tablet/transactions/transaction_tracker.cc
M src/kudu/tablet/transactions/write_transaction.cc
M src/kudu/tablet/transactions/write_transaction.h
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/tablet_service.h
33 files changed, 228 insertions(+), 150 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/40/5240/27
-- 
To view, visit http://gerrit.cloudera.org:8080/5240
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
Gerrit-PatchSet: 27
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] WIP: KUDU-798 (part 5) Safe time advancement in the absense of writes

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5240

to look at the new patch set (#14).

Change subject: WIP: KUDU-798 (part 5) Safe time advancement in the absense of writes
......................................................................

WIP: KUDU-798 (part 5) Safe time advancement in the absense of writes

Mostly getting some jenkins mileage but the design should be close
to the final one.

This is a WIP that needs to be broken down further but that summarily
adds the following:

- Add a TimeManager class: This class is responsible for tracking
  safe time advancement on leaders and replicas.
- Removes the clock from MvccManager - Since safe time is now managed
  by the TimeManager there is no need for Mvcc to have one internally.
- Changes how we wait for "all committed" in MvccManager. Previously
  we would wait for a clean snapshot. Now we instead wait for all
  transactions before the timestamp to be committed even if the
  clean timestamp doesn't advance. This is because without leader
  leases the safe time can go back, but moving the clean time back
  would break a lot of things internally.
- Plugs the TimeManager in consensus/queue and calls its methods
  appropriately.

Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
---
M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java
M src/kudu/consensus/CMakeLists.txt
M src/kudu/consensus/consensus.h
M src/kudu/consensus/consensus.proto
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/raft_consensus_quorum-test.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/tablet/compaction-test.cc
M src/kudu/tablet/deltamemstore-test.cc
M src/kudu/tablet/diskrowset-test-base.h
M src/kudu/tablet/major_delta_compaction-test.cc
M src/kudu/tablet/memrowset-test.cc
M src/kudu/tablet/mvcc-test.cc
M src/kudu/tablet/mvcc.cc
M src/kudu/tablet/mvcc.h
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet_peer.cc
M src/kudu/tablet/tablet_peer.h
M src/kudu/tablet/transactions/alter_schema_transaction.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tablet/transactions/write_transaction.cc
M src/kudu/tablet/transactions/write_transaction.h
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/tablet_service.h
29 files changed, 231 insertions(+), 173 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/40/5240/14
-- 
To view, visit http://gerrit.cloudera.org:8080/5240
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
Gerrit-PatchSet: 14
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] WIP: Support safe time advancement on replicas in the absense of writes

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5240

to look at the new patch set (#4).

Change subject: WIP: Support safe time advancement on replicas in the absense of writes
......................................................................

WIP: Support safe time advancement on replicas in the absense of writes

Mostly getting some jenkins mileage but the design should be close
to the final one.

This is a WIP that needs to be broken down further but that summarily
adds the following:

- Add a TimeManager class: This class is responsible for tracking
  safe time advancement on leaders and replicas.
- Removes the clock from MvccManager - Since safe time is now managed
  by the TimeManager there is no need for Mvcc to have one internally.
- Changes how we wait for "all committed" in MvccManager. Previously
  we would wait for a clean snapshot. Now we instead wait for all
  transactions before the timestamp to be committed even if the
  clean timestamp doesn't advance. This is because without leader
  leases the safe time can go back, but moving the clean time back
  would break a lot of things internally.
- Plugs the TimeManager in consensus/queue and calls its methods
  appropriately.

Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
---
M src/kudu/consensus/CMakeLists.txt
M src/kudu/consensus/consensus.h
M src/kudu/consensus/consensus.proto
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/raft_consensus_quorum-test.cc
A src/kudu/consensus/time_manager-test.cc
A src/kudu/consensus/time_manager.cc
A src/kudu/consensus/time_manager.h
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/tablet/compaction-test.cc
M src/kudu/tablet/deltamemstore-test.cc
M src/kudu/tablet/diskrowset-test-base.h
M src/kudu/tablet/major_delta_compaction-test.cc
M src/kudu/tablet/memrowset-test.cc
M src/kudu/tablet/mvcc-test.cc
M src/kudu/tablet/mvcc.cc
M src/kudu/tablet/mvcc.h
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet_peer.cc
M src/kudu/tablet/tablet_peer.h
M src/kudu/tablet/transactions/alter_schema_transaction.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tablet/transactions/write_transaction.cc
M src/kudu/tablet/transactions/write_transaction.h
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/tablet_service.h
31 files changed, 584 insertions(+), 174 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/40/5240/4
-- 
To view, visit http://gerrit.cloudera.org:8080/5240
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] WIP: KUDU-798 (part 5) Safe time advancement in the absense of writes

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5240

to look at the new patch set (#15).

Change subject: WIP: KUDU-798 (part 5) Safe time advancement in the absense of writes
......................................................................

WIP: KUDU-798 (part 5) Safe time advancement in the absense of writes

Mostly getting some jenkins mileage but the design should be close
to the final one.

This is a WIP that needs to be broken down further but that summarily
adds the following:

- Add a TimeManager class: This class is responsible for tracking
  safe time advancement on leaders and replicas.
- Removes the clock from MvccManager - Since safe time is now managed
  by the TimeManager there is no need for Mvcc to have one internally.
- Changes how we wait for "all committed" in MvccManager. Previously
  we would wait for a clean snapshot. Now we instead wait for all
  transactions before the timestamp to be committed even if the
  clean timestamp doesn't advance. This is because without leader
  leases the safe time can go back, but moving the clean time back
  would break a lot of things internally.
- Plugs the TimeManager in consensus/queue and calls its methods
  appropriately.

Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
---
M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java
M src/kudu/consensus/consensus.h
M src/kudu/consensus/consensus.proto
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/raft_consensus_quorum-test.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/tablet/compaction-test.cc
M src/kudu/tablet/deltamemstore-test.cc
M src/kudu/tablet/diskrowset-test-base.h
M src/kudu/tablet/major_delta_compaction-test.cc
M src/kudu/tablet/memrowset-test.cc
M src/kudu/tablet/mvcc-test.cc
M src/kudu/tablet/mvcc.cc
M src/kudu/tablet/mvcc.h
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet_peer.cc
M src/kudu/tablet/tablet_peer.h
M src/kudu/tablet/transactions/alter_schema_transaction.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tablet/transactions/write_transaction.cc
M src/kudu/tablet/transactions/write_transaction.h
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/tablet_service.h
28 files changed, 229 insertions(+), 173 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/40/5240/15
-- 
To view, visit http://gerrit.cloudera.org:8080/5240
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
Gerrit-PatchSet: 15
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] WIP: Support safe time advancement on replicas in the absense of writes

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5240

to look at the new patch set (#8).

Change subject: WIP: Support safe time advancement on replicas in the absense of writes
......................................................................

WIP: Support safe time advancement on replicas in the absense of writes

Mostly getting some jenkins mileage but the design should be close
to the final one.

This is a WIP that needs to be broken down further but that summarily
adds the following:

- Add a TimeManager class: This class is responsible for tracking
  safe time advancement on leaders and replicas.
- Removes the clock from MvccManager - Since safe time is now managed
  by the TimeManager there is no need for Mvcc to have one internally.
- Changes how we wait for "all committed" in MvccManager. Previously
  we would wait for a clean snapshot. Now we instead wait for all
  transactions before the timestamp to be committed even if the
  clean timestamp doesn't advance. This is because without leader
  leases the safe time can go back, but moving the clean time back
  would break a lot of things internally.
- Plugs the TimeManager in consensus/queue and calls its methods
  appropriately.

Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
---
M src/kudu/consensus/CMakeLists.txt
M src/kudu/consensus/consensus.h
M src/kudu/consensus/consensus.proto
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/raft_consensus_quorum-test.cc
A src/kudu/consensus/time_manager-test.cc
A src/kudu/consensus/time_manager.cc
A src/kudu/consensus/time_manager.h
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/tablet/compaction-test.cc
M src/kudu/tablet/deltamemstore-test.cc
M src/kudu/tablet/diskrowset-test-base.h
M src/kudu/tablet/major_delta_compaction-test.cc
M src/kudu/tablet/memrowset-test.cc
M src/kudu/tablet/mvcc-test.cc
M src/kudu/tablet/mvcc.cc
M src/kudu/tablet/mvcc.h
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet_peer.cc
M src/kudu/tablet/tablet_peer.h
M src/kudu/tablet/transactions/alter_schema_transaction.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tablet/transactions/write_transaction.cc
M src/kudu/tablet/transactions/write_transaction.h
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/tablet_service.h
31 files changed, 600 insertions(+), 176 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/40/5240/8
-- 
To view, visit http://gerrit.cloudera.org:8080/5240
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] WIP: Support safe time advancement on replicas in the absense of writes

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5240

to look at the new patch set (#13).

Change subject: WIP: Support safe time advancement on replicas in the absense of writes
......................................................................

WIP: Support safe time advancement on replicas in the absense of writes

Mostly getting some jenkins mileage but the design should be close
to the final one.

This is a WIP that needs to be broken down further but that summarily
adds the following:

- Add a TimeManager class: This class is responsible for tracking
  safe time advancement on leaders and replicas.
- Removes the clock from MvccManager - Since safe time is now managed
  by the TimeManager there is no need for Mvcc to have one internally.
- Changes how we wait for "all committed" in MvccManager. Previously
  we would wait for a clean snapshot. Now we instead wait for all
  transactions before the timestamp to be committed even if the
  clean timestamp doesn't advance. This is because without leader
  leases the safe time can go back, but moving the clean time back
  would break a lot of things internally.
- Plugs the TimeManager in consensus/queue and calls its methods
  appropriately.

Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
---
M java/kudu-client/src/test/java/org/apache/kudu/client/BaseKuduTest.java
M java/kudu-client/src/test/java/org/apache/kudu/client/MiniKuduCluster.java
M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java
M src/kudu/consensus/CMakeLists.txt
M src/kudu/consensus/consensus.h
M src/kudu/consensus/consensus.proto
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/raft_consensus_quorum-test.cc
A src/kudu/consensus/time_manager-test.cc
A src/kudu/consensus/time_manager.cc
A src/kudu/consensus/time_manager.h
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/tablet/compaction-test.cc
M src/kudu/tablet/deltamemstore-test.cc
M src/kudu/tablet/diskrowset-test-base.h
M src/kudu/tablet/major_delta_compaction-test.cc
M src/kudu/tablet/memrowset-test.cc
M src/kudu/tablet/mvcc-test.cc
M src/kudu/tablet/mvcc.cc
M src/kudu/tablet/mvcc.h
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet_peer.cc
M src/kudu/tablet/tablet_peer.h
M src/kudu/tablet/transactions/alter_schema_transaction.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tablet/transactions/write_transaction.cc
M src/kudu/tablet/transactions/write_transaction.h
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/tablet_service.h
34 files changed, 643 insertions(+), 176 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/40/5240/13
-- 
To view, visit http://gerrit.cloudera.org:8080/5240
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
Gerrit-PatchSet: 13
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] WIP: KUDU-798 (part 5) Safe time advancement in the absence of writes

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5240

to look at the new patch set (#22).

Change subject: WIP: KUDU-798 (part 5) Safe time advancement in the absence of writes
......................................................................

WIP: KUDU-798 (part 5) Safe time advancement in the absence of writes

This plugs the new TimeManager through consensus and wherever else
it's needed and integrates with with transactions, tablet service
and consensus.

Noteworthy:

- Safe time advances as before _except_ in the absense of writes.
  That is safe time is still advanced in Mvcc by the transaction
  driver and only up to the committed transaction's timestamp.
  When there are no writes to send the leader sends its last
  safe time to the replica, which updates itself and unblocks
  waiters.

- Mvcc's safe time is a more conservative version of "global"
  safe time. Mvcc does not get updated with safe time heartbeats
  from the leader. The clean snapshot must stay "clean" otherwise
  a lot of things would break.

- Mvcc's WaitForCleanSnapshot() method was removed. There is no
  longer a guarantee that the "clean" timestamp will move to
  the present. It might not, in the absense of writes. This
  patch introduces a new method that allows to wait for all
  "known" transactions before a timestamp to be committed.

WIP workign on a thorough integration test to be merged with
this.

This should be good to be reviewed.

Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
---
M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java
M src/kudu/client/client-test.cc
M src/kudu/consensus/consensus.h
M src/kudu/consensus/consensus.proto
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/raft_consensus_quorum-test.cc
M src/kudu/consensus/raft_consensus_state.cc
M src/kudu/consensus/raft_consensus_state.h
M src/kudu/integration-tests/alter_table-test.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/tablet/mvcc-test.cc
M src/kudu/tablet/mvcc.cc
M src/kudu/tablet/mvcc.h
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet_bootstrap.cc
M src/kudu/tablet/tablet_peer.cc
M src/kudu/tablet/tablet_peer.h
M src/kudu/tablet/tablet_peer_mm_ops.cc
M src/kudu/tablet/transactions/alter_schema_transaction.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tablet/transactions/transaction_driver.h
M src/kudu/tablet/transactions/transaction_tracker-test.cc
M src/kudu/tablet/transactions/transaction_tracker.cc
M src/kudu/tablet/transactions/write_transaction.cc
M src/kudu/tablet/transactions/write_transaction.h
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/tablet_service.h
32 files changed, 214 insertions(+), 148 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/40/5240/22
-- 
To view, visit http://gerrit.cloudera.org:8080/5240
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
Gerrit-PatchSet: 22
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] WIP: Support safe time advancement on replicas in the absense of writes

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5240

to look at the new patch set (#5).

Change subject: WIP: Support safe time advancement on replicas in the absense of writes
......................................................................

WIP: Support safe time advancement on replicas in the absense of writes

Mostly getting some jenkins mileage but the design should be close
to the final one.

This is a WIP that needs to be broken down further but that summarily
adds the following:

- Add a TimeManager class: This class is responsible for tracking
  safe time advancement on leaders and replicas.
- Removes the clock from MvccManager - Since safe time is now managed
  by the TimeManager there is no need for Mvcc to have one internally.
- Changes how we wait for "all committed" in MvccManager. Previously
  we would wait for a clean snapshot. Now we instead wait for all
  transactions before the timestamp to be committed even if the
  clean timestamp doesn't advance. This is because without leader
  leases the safe time can go back, but moving the clean time back
  would break a lot of things internally.
- Plugs the TimeManager in consensus/queue and calls its methods
  appropriately.

Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
---
M src/kudu/consensus/CMakeLists.txt
M src/kudu/consensus/consensus.h
M src/kudu/consensus/consensus.proto
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/raft_consensus_quorum-test.cc
A src/kudu/consensus/time_manager-test.cc
A src/kudu/consensus/time_manager.cc
A src/kudu/consensus/time_manager.h
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/tablet/compaction-test.cc
M src/kudu/tablet/deltamemstore-test.cc
M src/kudu/tablet/diskrowset-test-base.h
M src/kudu/tablet/major_delta_compaction-test.cc
M src/kudu/tablet/memrowset-test.cc
M src/kudu/tablet/mvcc-test.cc
M src/kudu/tablet/mvcc.cc
M src/kudu/tablet/mvcc.h
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet_peer.cc
M src/kudu/tablet/tablet_peer.h
M src/kudu/tablet/transactions/alter_schema_transaction.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tablet/transactions/write_transaction.cc
M src/kudu/tablet/transactions/write_transaction.h
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/tablet_service.h
31 files changed, 586 insertions(+), 176 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/40/5240/5
-- 
To view, visit http://gerrit.cloudera.org:8080/5240
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-798 (part 5) Correct safe time advancement

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: KUDU-798 (part 5) Correct safe time advancement
......................................................................


Patch Set 30:

I won't address any more tidy bot nits that aren't part of the changes in this patch. Its big as it is without spurious changes.

-- 
To view, visit http://gerrit.cloudera.org:8080/5240
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
Gerrit-PatchSet: 30
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] KUDU-798 (part 5) Correct safe time advancement

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has submitted this change and it was merged.

Change subject: KUDU-798 (part 5) Correct safe time advancement
......................................................................


KUDU-798 (part 5) Correct safe time advancement

This patch fixes safe time advancement in general and allows for
safe time advancement in the absence of writes in particular.
The core of the patch is to plug in the new TimeManager class wherever
needed. But there is also a major cleanup of the waiting story in
TabletService (KUDU-1127) and a few new integration test features.

There is an instance of a TimeManager per tablet. It's used for:

- When replicating messages to other replicas a leader uses the
  TimeManager to assign timestamps and to obtain a safe time to send,
  even when there are no writes.

- When receiving messages from a leader consensus uses the TimeManager
  to update the clock and to unblock any waiters that might waiting
  for a particular timestamp to be safe.

- Before a snapshot scan proceeds to scan, it must first wait for the
  TimeManager to deem whatever timestamp it has safe. Then it proceeds
  to wait for the snapshot at timestamp to have all its transactions
  committed and, finally, proceeds with the scan.

Put together, these changes allow to make sure that snapshot scans are
repeatable in the large majority of cases. The one "hole" in safe time
is solved by leader leases. Until we have those this patch takes a
conservative approach to safe time progress.

Fixing safe time broke a bunch of our tests that were expecting broken
snapshot scans. In particular we would return broken snapshots all the
time instead of waiting for the snapshot to be correct. Of course
when these errors were fixed the tests started failing.

In order to address these test failures I cleaned up our snapshot scan
waiting story in TabletServer::HandleScanAtSnapshot(). In particular:

- The client's deadline in no longer taken into account when deciding
  how long to wait for a snapshot to be repeatable. There is a hard
  (configurable) max that the server will wait for, "clamping" the
  client's deadline. The key here is that, when the client deadline
  is clamped, we return Status::ServiceUnavailable on time out
  instead of Status::TimeOut(). Since HandleScanAtSnapshot() is called
  on KuduScanner::Open() and ServiceUnavailable is a retryable status
  this enables the client to try somewhere else, perhaps where it won't
  have to wait as long.

- TimeManager now does a pre-flight check before waiting on safe time.
  In particular it checks that: i) it has heard from the leader within
  a configurable amount of time (that safe time _can_ make progress).
  ii) it checks that the safe time is not more that a configurable
  amount of time in the past, 30 seconds by default (that safe time
  is likely to make progress to the required timestamp).

Finally, this patch adds two new integration test workloads that
prove that it works. It adds a new read workload to TestWorkload
that performs snapshot scans in the present, while writes are
happening. This can be enabled anywhere but this patch enables
it for a bunch of tests in RaftConsensusItest, in particular the
*Churny* and *Crashy* ones with unique keys. This patch also enables
linked_list-test to perform snapshot scans in the present after
the verification.

Results:

I ran raft_consensus-itest with the new snapshot read workload
on dist-test, asan, with the following config:

KUDU_ALLOW_SLOW_TESTS=1 ../../build-support/dist_test.py \
--disable-sharding loop -n 1000 bin/raft_consensus-itest \
--gtest_filter=*Churny*:*Crashy*-*Duplicate*

I pulled the test to before this patch. It failed 1000/1000 on
master. With this patch it passed 1000/1000:

http://dist-test.cloudera.org//job?job_id=david.alves.1481097865.18287

I ran linked_list-test with the new snapshot scans in dist-test,
asan, with the following config:

KUDU_ALLOW_SLOW_TESTS=1 ../../build-support/dist_test.py \
loop -n 1000 bin/linked_list-test --stress_cpu_threads=2 \
--gtest_filter=*TestLoadAndVerify*

The test passed 1000/1000 whereas before it would fail 427/1000.
Results:

http://dist-test.cloudera.org//job?job_id=david.alves.1481104106.24665

I also ran the test in client-test that tests fault tolerance.
Run config:

KUDU_ALLOW_SLOW_TESTS=1 ../../build-support/dist_test.py \
--disable-sharding loop -n 1000 -- bin/client-test \
--gtest_filter=*ScanFaultTolerance* --stress_cpu_threads=2

The test passed 1000/1000 times. Results:

http://dist-test.cloudera.org//job?job_id=david.alves.1481171410.4460

Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
Reviewed-on: http://gerrit.cloudera.org:8080/5240
Reviewed-by: Todd Lipcon <to...@apache.org>
Tested-by: David Ribeiro Alves <dr...@apache.org>
---
M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java
M src/kudu/client/client-test-util.cc
M src/kudu/client/client-test.cc
M src/kudu/consensus/consensus.h
M src/kudu/consensus/consensus.proto
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/raft_consensus_quorum-test.cc
M src/kudu/consensus/raft_consensus_state.cc
M src/kudu/consensus/raft_consensus_state.h
M src/kudu/consensus/time_manager.cc
M src/kudu/consensus/time_manager.h
M src/kudu/integration-tests/alter_table-test.cc
M src/kudu/integration-tests/cluster_verifier.cc
M src/kudu/integration-tests/linked_list-test-util.h
M src/kudu/integration-tests/linked_list-test.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/integration-tests/tablet_history_gc-itest.cc
M src/kudu/integration-tests/test_workload.cc
M src/kudu/integration-tests/test_workload.h
M src/kudu/tablet/mvcc-test.cc
M src/kudu/tablet/mvcc.cc
M src/kudu/tablet/mvcc.h
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet_bootstrap.cc
M src/kudu/tablet/tablet_peer.cc
M src/kudu/tablet/tablet_peer.h
M src/kudu/tablet/transactions/alter_schema_transaction.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tablet/transactions/transaction_driver.h
M src/kudu/tablet/transactions/transaction_tracker-test.cc
M src/kudu/tablet/transactions/transaction_tracker.cc
M src/kudu/tablet/transactions/write_transaction.cc
M src/kudu/tablet/transactions/write_transaction.h
M src/kudu/tools/tool_action_remote_replica.cc
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/tablet_service.h
41 files changed, 591 insertions(+), 274 deletions(-)

Approvals:
  David Ribeiro Alves: Verified
  Todd Lipcon: Looks good to me, approved



-- 
To view, visit http://gerrit.cloudera.org:8080/5240
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
Gerrit-PatchSet: 35
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-798 (part 5) Correct safe time advancement

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: KUDU-798 (part 5) Correct safe time advancement
......................................................................


Patch Set 34: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/5240
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
Gerrit-PatchSet: 34
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] WIP: KUDU-798 (part 5) Safe time advancement in the absense of writes

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5240

to look at the new patch set (#17).

Change subject: WIP: KUDU-798 (part 5) Safe time advancement in the absense of writes
......................................................................

WIP: KUDU-798 (part 5) Safe time advancement in the absense of writes

Mostly getting some jenkins mileage but the design should be close
to the final one.

This is a WIP that needs to be broken down further but that summarily
adds the following:

- Removes the clock from MvccManager - Since safe time is now managed
  by the TimeManager there is no need for Mvcc to have one internally.
- Changes how we wait for "all committed" in MvccManager. Previously
  we would wait for a clean snapshot. Now we instead wait for all
  transactions before the timestamp to be committed even if the
  clean timestamp doesn't advance. This is because without leader
  leases the safe time can go back, but moving the clean time back
  would break a lot of things internally.
- Plugs the TimeManager in consensus/queue and calls its methods
  appropriately.

Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
---
M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java
M src/kudu/consensus/consensus.h
M src/kudu/consensus/consensus.proto
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/raft_consensus_quorum-test.cc
M src/kudu/consensus/raft_consensus_state.cc
M src/kudu/consensus/raft_consensus_state.h
M src/kudu/integration-tests/linked_list-test-util.h
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/tablet/compaction-test.cc
M src/kudu/tablet/deltamemstore-test.cc
M src/kudu/tablet/diskrowset-test-base.h
M src/kudu/tablet/major_delta_compaction-test.cc
M src/kudu/tablet/memrowset-test.cc
M src/kudu/tablet/mvcc-test.cc
M src/kudu/tablet/mvcc.cc
M src/kudu/tablet/mvcc.h
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet_peer.cc
M src/kudu/tablet/tablet_peer.h
M src/kudu/tablet/transactions/alter_schema_transaction.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tablet/transactions/transaction_driver.h
M src/kudu/tablet/transactions/transaction_tracker-test.cc
M src/kudu/tablet/transactions/write_transaction.cc
M src/kudu/tablet/transactions/write_transaction.h
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/tablet_service.h
33 files changed, 264 insertions(+), 227 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/40/5240/17
-- 
To view, visit http://gerrit.cloudera.org:8080/5240
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
Gerrit-PatchSet: 17
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-798 (part 5) Correct safe time advancement

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: KUDU-798 (part 5) Correct safe time advancement
......................................................................


Patch Set 32:

(35 comments)

http://gerrit.cloudera.org:8080/#/c/5240/32//COMMIT_MSG
Commit Message:

PS32, Line 31: whole
> hole
doh


PS32, Line 56: it
> is
Done


PS32, Line 52: 
             : - TimeManager now does a pre-flight check before waiting on safe time.
             :   In particular it checks that: i) it has heard from the leader within
             :   a configurable amount of time (that safe time _can_ make progress).
             :   ii) it checks that the safe time it not more that a configurable
             :   amount of time in the past, 30 seconds by default (that safe time
             :   is likely to make progress to the required timestamp).
> I haven't looked at the code yet, but one question here: these checks shoul
doing this already, yeah :)


PS32, Line 65: unique
> any particular reason why not the one with dup keys too?
because the duplicate keys one only has 20 different rows and the test works with row counts.


PS32, Line 71: workloaf
> typo
Done


http://gerrit.cloudera.org:8080/#/c/5240/32/src/kudu/client/client-test-util.cc
File src/kudu/client/client-test-util.cc:

Line 58:   ASSERT_OK(scanner.SetSelection(KuduClient::LEADER_ONLY));
> is LEADER_ONLY still relevant here? or should this be changed to set it to 
left a TODO think we should cleanup this retry code.


http://gerrit.cloudera.org:8080/#/c/5240/32/src/kudu/client/client-test.cc
File src/kudu/client/client-test.cc:

Line 1147: TEST_F(ClientTest, TestScanFaultTolerance) {
> did you loop this test too?
I had looped it quite a bit cuz it was flaky, but forgot to keep the results. Just looped it 1000 more. 1000 passes :) Updated the commit message.


http://gerrit.cloudera.org:8080/#/c/5240/32/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

Line 456:       KLOG_EVERY_N_SECS(WARNING, 10) << "Safe time advancement without writes is disabled. "
> hrm, maybe FIRST_K is better here? not sure seeing it once every 10 seconds
made it minutes Done


http://gerrit.cloudera.org:8080/#/c/5240/32/src/kudu/consensus/raft_consensus.cc
File src/kudu/consensus/raft_consensus.cc:

Line 1238:     if (request->has_safe_timestamp()) {
> I'm still not sure this is in the right spot if the tail of the messages in
yeah I agree that we shouldn't take the safe timestamp into account if we failed to prepare. left a todo.


http://gerrit.cloudera.org:8080/#/c/5240/32/src/kudu/consensus/time_manager.cc
File src/kudu/consensus/time_manager.cc:

PS32, Line 31: safe_time_max_last_advanced_safe_time
> not a big fan of this name. How about: missed_heartbeats_before_rejecting_s
Done


PS32, Line 33: snapshot scans
> scans at snapshots that aren't yet safe
Done


Line 34: TAG_FLAG(safe_time_max_last_advanced_safe_time, advanced);
> let's mark this unstable for now too
Done


PS32, Line 37: lag
> lag behind the requested timestamp?
Done


Line 39: TAG_FLAG(safe_time_max_lag_ms, advanced);
> same
Done


PS32, Line 151:  $0 secs, (max is $1 sec
> 'secs' is redundant here since MonoDelta::ToString already has a 's' suffix
Done


Line 166:                                 safe_time_diff.ToMilliseconds(),
> wrong units here (message says secs)
Done


Line 167:                                 FLAGS_safe_time_max_lag_ms);
> same
Done


Line 173: void TimeManager::MakeWaiterTimeoutMessageUnlocked(Timestamp timestamp, string* error_message) {
> probably easier to just return string from this method instead of the out-p
yeah, it's ugly so I tried to hide it. moved back.


Line 188:     if (mode_ == NON_LEADER && timestamp > last_safe_ts_) {
> ah ok, so you are doing the thing that I asked about in the commit message.
Done


http://gerrit.cloudera.org:8080/#/c/5240/32/src/kudu/consensus/time_manager.h
File src/kudu/consensus/time_manager.h:

PS32, Line 117: is 
> if
Done


http://gerrit.cloudera.org:8080/#/c/5240/32/src/kudu/integration-tests/raft_consensus-itest.cc
File src/kudu/integration-tests/raft_consensus-itest.cc:

Line 2345:       "--maintenance_manager_polling_interval_ms=300",
> typo?
Done


Line 2664:   //workload.set_num_read_threads(2);
> hrm?
Done


http://gerrit.cloudera.org:8080/#/c/5240/32/src/kudu/integration-tests/test_workload.cc
File src/kudu/integration-tests/test_workload.cc:

Line 270:   start_latch_.Reset((uint64_t)num_write_threads_ + (uint64_t)num_read_threads_);
> nit: why these casts?
cuz tidy bot complains otherwise. removed


http://gerrit.cloudera.org:8080/#/c/5240/32/src/kudu/tools/tool_action_remote_replica.cc
File src/kudu/tools/tool_action_remote_replica.cc:

Line 100:     // and thus cannot do snapshot scans.
> I think ORDERED is still useful here.
the two were mashed together when we added (or after) fault tolerance. now the  the tablet service will return an error if ordered is set but not snapshot.
I agree that order is important but not being able to dump a server data if its alone is worse imo.
Left a TODO for when we have another read mode.


http://gerrit.cloudera.org:8080/#/c/5240/32/src/kudu/tserver/tablet_service.cc
File src/kudu/tserver/tablet_service.cc:

PS32, Line 84: time
> 'of time (in milliseconds)' and remove 'in milliseconds' from the end of th
Done


PS32, Line 85: it's 
> its
Done


PS32, Line 85: allows to
> 'allows us to' or 'allows waiting'
Done


Line 1329:               ": has no lower or upper bound.");
> lots of reindentation in this file. mind reverting that part of the patch?
must have pressed the indent file key by mistake. Done


PS32, Line 1387: Retunrs
> typo
Done


PS32, Line 1388: CheckIfAncientHistory
> rename to 'VerifyNotAncientHistory' or 'VerifyAfterAHM' or something. Other
Done


Line 1759: MonoTime ClampDeadline(const MonoTime& deadline, bool* was_clamped) {
> rename to ClampScanDeadlineForWait or something so it's clearer where this 
Done


PS32, Line 1816: ahm
> nit: AHM
Done


PS32, Line 1842: allows to
> allows us to
Done


PS32, Line 1850: s.ok() 
> !s.ok() is redundant here with s.IsTimedout()
Done


Line 1861:   if (!s.ok() && s.IsTimedOut() && was_clamped) {
> same
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/5240
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
Gerrit-PatchSet: 32
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-798 (part 3) Remove the clock from MvccManager

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: KUDU-798 (part 3) Remove the clock from MvccManager
......................................................................


Patch Set 1:

hum turns out this is a bad idea without the rest of the safe time advancement stuff. merging this into follow up patches.

-- 
To view, visit http://gerrit.cloudera.org:8080/5240
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-HasComments: No

[kudu-CR] WIP: Support safe time advancement on replicas in the absense of writes

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: WIP: Support safe time advancement on replicas in the absense of writes
......................................................................


Patch Set 3:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/5240/3/src/kudu/consensus/time_manager.cc
File src/kudu/consensus/time_manager.cc:

Line 211: } // consensus
> warning: namespace 'consensus' ends with an unrecognized comment [google-re
Done


Line 212: } // kudu
> warning: namespace 'kudu' ends with an unrecognized comment [google-readabi
Done


http://gerrit.cloudera.org:8080/#/c/5240/3/src/kudu/consensus/time_manager.h
File src/kudu/consensus/time_manager.h:

Line 34:   TimeManager(const scoped_refptr<server::Clock>& clock);
> warning: single-argument constructors must be marked explicit to avoid unin
Done


http://gerrit.cloudera.org:8080/#/c/5240/3/src/kudu/tablet/mvcc.h
File src/kudu/tablet/mvcc.h:

Line 240:   Status WaitForSnapshotWithAllCommitted(Timestamp timestamp,
> warning: function 'kudu::tablet::MvccManager::WaitForSnapshotWithAllCommitt
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/5240
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] WIP: Support safe time advancement on replicas in the absense of writes

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5240

to look at the new patch set (#6).

Change subject: WIP: Support safe time advancement on replicas in the absense of writes
......................................................................

WIP: Support safe time advancement on replicas in the absense of writes

Mostly getting some jenkins mileage but the design should be close
to the final one.

This is a WIP that needs to be broken down further but that summarily
adds the following:

- Add a TimeManager class: This class is responsible for tracking
  safe time advancement on leaders and replicas.
- Removes the clock from MvccManager - Since safe time is now managed
  by the TimeManager there is no need for Mvcc to have one internally.
- Changes how we wait for "all committed" in MvccManager. Previously
  we would wait for a clean snapshot. Now we instead wait for all
  transactions before the timestamp to be committed even if the
  clean timestamp doesn't advance. This is because without leader
  leases the safe time can go back, but moving the clean time back
  would break a lot of things internally.
- Plugs the TimeManager in consensus/queue and calls its methods
  appropriately.

Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
---
M src/kudu/consensus/CMakeLists.txt
M src/kudu/consensus/consensus.h
M src/kudu/consensus/consensus.proto
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/raft_consensus_quorum-test.cc
A src/kudu/consensus/time_manager-test.cc
A src/kudu/consensus/time_manager.cc
A src/kudu/consensus/time_manager.h
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/tablet/compaction-test.cc
M src/kudu/tablet/deltamemstore-test.cc
M src/kudu/tablet/diskrowset-test-base.h
M src/kudu/tablet/major_delta_compaction-test.cc
M src/kudu/tablet/memrowset-test.cc
M src/kudu/tablet/mvcc-test.cc
M src/kudu/tablet/mvcc.cc
M src/kudu/tablet/mvcc.h
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet_peer.cc
M src/kudu/tablet/tablet_peer.h
M src/kudu/tablet/transactions/alter_schema_transaction.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tablet/transactions/write_transaction.cc
M src/kudu/tablet/transactions/write_transaction.h
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/tablet_service.h
31 files changed, 587 insertions(+), 176 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/40/5240/6
-- 
To view, visit http://gerrit.cloudera.org:8080/5240
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] WIP: Support safe time advancement on replicas in the absense of writes

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5240

to look at the new patch set (#12).

Change subject: WIP: Support safe time advancement on replicas in the absense of writes
......................................................................

WIP: Support safe time advancement on replicas in the absense of writes

Mostly getting some jenkins mileage but the design should be close
to the final one.

This is a WIP that needs to be broken down further but that summarily
adds the following:

- Add a TimeManager class: This class is responsible for tracking
  safe time advancement on leaders and replicas.
- Removes the clock from MvccManager - Since safe time is now managed
  by the TimeManager there is no need for Mvcc to have one internally.
- Changes how we wait for "all committed" in MvccManager. Previously
  we would wait for a clean snapshot. Now we instead wait for all
  transactions before the timestamp to be committed even if the
  clean timestamp doesn't advance. This is because without leader
  leases the safe time can go back, but moving the clean time back
  would break a lot of things internally.
- Plugs the TimeManager in consensus/queue and calls its methods
  appropriately.

Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
---
M src/kudu/consensus/CMakeLists.txt
M src/kudu/consensus/consensus.h
M src/kudu/consensus/consensus.proto
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/raft_consensus_quorum-test.cc
A src/kudu/consensus/time_manager-test.cc
A src/kudu/consensus/time_manager.cc
A src/kudu/consensus/time_manager.h
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/tablet/compaction-test.cc
M src/kudu/tablet/deltamemstore-test.cc
M src/kudu/tablet/diskrowset-test-base.h
M src/kudu/tablet/major_delta_compaction-test.cc
M src/kudu/tablet/memrowset-test.cc
M src/kudu/tablet/mvcc-test.cc
M src/kudu/tablet/mvcc.cc
M src/kudu/tablet/mvcc.h
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet_peer.cc
M src/kudu/tablet/tablet_peer.h
M src/kudu/tablet/transactions/alter_schema_transaction.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tablet/transactions/write_transaction.cc
M src/kudu/tablet/transactions/write_transaction.h
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/tablet_service.h
31 files changed, 593 insertions(+), 173 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/40/5240/12
-- 
To view, visit http://gerrit.cloudera.org:8080/5240
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
Gerrit-PatchSet: 12
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] WIP: KUDU-798 (part 5) Safe time advancement in the absence of writes

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5240

to look at the new patch set (#25).

Change subject: WIP: KUDU-798 (part 5) Safe time advancement in the absence of writes
......................................................................

WIP: KUDU-798 (part 5) Safe time advancement in the absence of writes

This plugs the new TimeManager through consensus and wherever else
it's needed and integrates with with transactions, tablet service
and consensus.

Noteworthy:

- Safe time advances as before _except_ in the absense of writes.
  That is safe time is still advanced in Mvcc by the transaction
  driver and only up to the committed transaction's timestamp.
  When there are no writes to send the leader sends its last
  safe time to the replica, which updates itself and unblocks
  waiters.

- Mvcc's safe time is a more conservative version of "global"
  safe time. Mvcc does not get updated with safe time heartbeats
  from the leader. The clean snapshot must stay "clean" otherwise
  a lot of things would break.

- Mvcc's WaitForCleanSnapshot() method was removed. There is no
  longer a guarantee that the "clean" timestamp will move to
  the present. It might not, in the absense of writes. This
  patch introduces a new method that allows to wait for all
  "known" transactions before a timestamp to be committed.

WIP workign on a thorough integration test to be merged with
this.

This should be good to be reviewed.

Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
---
M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java
M src/kudu/client/client-test-util.cc
M src/kudu/client/client-test.cc
M src/kudu/consensus/consensus.h
M src/kudu/consensus/consensus.proto
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/raft_consensus_quorum-test.cc
M src/kudu/consensus/raft_consensus_state.cc
M src/kudu/consensus/raft_consensus_state.h
M src/kudu/integration-tests/alter_table-test.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/tablet/mvcc-test.cc
M src/kudu/tablet/mvcc.cc
M src/kudu/tablet/mvcc.h
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet_bootstrap.cc
M src/kudu/tablet/tablet_peer.cc
M src/kudu/tablet/tablet_peer.h
M src/kudu/tablet/tablet_peer_mm_ops.cc
M src/kudu/tablet/transactions/alter_schema_transaction.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tablet/transactions/transaction_driver.h
M src/kudu/tablet/transactions/transaction_tracker-test.cc
M src/kudu/tablet/transactions/transaction_tracker.cc
M src/kudu/tablet/transactions/write_transaction.cc
M src/kudu/tablet/transactions/write_transaction.h
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/tablet_service.h
33 files changed, 220 insertions(+), 150 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/40/5240/25
-- 
To view, visit http://gerrit.cloudera.org:8080/5240
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
Gerrit-PatchSet: 25
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-798 (part 5) Correct safe time advancement

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: KUDU-798 (part 5) Correct safe time advancement
......................................................................


Patch Set 30:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5240/30/src/kudu/tserver/tablet_service.cc
File src/kudu/tserver/tablet_service.cc:

Line 1521:         } else if (!s.ok()) {
> warning: don't use else after return [readability-else-after-return]
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/5240
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
Gerrit-PatchSet: 30
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-798 (part 5) Correct safe time advancement

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5240

to look at the new patch set (#33).

Change subject: KUDU-798 (part 5) Correct safe time advancement
......................................................................

KUDU-798 (part 5) Correct safe time advancement

This patch fixes safe time advancement in general and allows for
safe time advancement in the absence of writes in particular.
The core of the patch is to plug in the new TimeManager class wherever
needed. But there is also a major cleanup of the waiting story in
TabletService (KUDU-1127) and a few new integration test features.

There is an instance of a TimeManager per tablet. It's used for:

- When replicating messages to other replicas a leader uses the
  TimeManager to assign timestamps and to obtain a safe time to send,
  even when there are no writes.

- When receiving messages from a leader consensus uses the TimeManager
  to update the clock and to unblock any waiters that might waiting
  for a particular timestamp to be safe.

- Before a snapshot scan proceeds to scan, it must first wait for the
  TimeManager to deem whatever timestamp it has safe. Then it proceeds
  to wait for the snapshot at timestamp to have all its transactions
  committed and, finally, proceeds with the scan.

Put together, these changes allow to make sure that snapshot scans are
repeatable in the large majority of cases. The one "hole" in safe time
is solved by leader leases. Until we have those this patch takes a
conservative approach to safe time progress.

Fixing safe time broke a bunch of our tests that were expecting broken
snapshot scans. In particular we would return broken snapshots all the
time instead of waiting for the snapshot to be correct. Of course
when these errors were fixed the tests started failing.

In order to address these test failures I cleaned up our snapshot scan
waiting story in TabletServer::HandleScanAtSnapshot(). In particular:

- The client's deadline in no longer taken into account when deciding
  how long to wait for a snapshot to be repeatable. There is a hard
  (configurable) max that the server will wait for, "clamping" the
  client's deadline. The key here is that, when the client deadline
  is clamped, we return Status::ServiceUnavailable on time out
  instead of Status::TimeOut(). Since HandleScanAtSnapshot() is called
  on KuduScanner::Open() and ServiceUnavailable is a retryable status
  this enables the client to try somewhere else, perhaps where it won't
  have to wait as long.

- TimeManager now does a pre-flight check before waiting on safe time.
  In particular it checks that: i) it has heard from the leader within
  a configurable amount of time (that safe time _can_ make progress).
  ii) it checks that the safe time is not more that a configurable
  amount of time in the past, 30 seconds by default (that safe time
  is likely to make progress to the required timestamp).

Finally, this patch adds two new integration test workloads that
prove that it works. It adds a new read workload to TestWorkload
that performs snapshot scans in the present, while writes are
happening. This can be enabled anywhere but this patch enables
it for a bunch of tests in RaftConsensusItest, in particular the
*Churny* and *Crashy* ones with unique keys. This patch also enables
linked_list-test to perform snapshot scans in the present after
the verification.

Results:

I ran raft_consensus-itest with the new snapshot read workload
on dist-test, asan, with the following config:

KUDU_ALLOW_SLOW_TESTS=1 ../../build-support/dist_test.py \
--disable-sharding loop -n 1000 bin/raft_consensus-itest \
--gtest_filter=*Churny*:*Crashy*-*Duplicate*

I pulled the test to before this patch. It failed 1000/1000 on
master. With this patch it passed 1000/1000:

http://dist-test.cloudera.org//job?job_id=david.alves.1481097865.18287

I ran linked_list-test with the new snapshot scans in dist-test,
asan, with the following config:

KUDU_ALLOW_SLOW_TESTS=1 ../../build-support/dist_test.py \
loop -n 1000 bin/linked_list-test --stress_cpu_threads=2 \
--gtest_filter=*TestLoadAndVerify*

The test passed 1000/1000 whereas before it would fail 427/1000.
Results:

http://dist-test.cloudera.org//job?job_id=david.alves.1481104106.24665

I also ran the test in client-test that tests fault tolerance.
Run config:

KUDU_ALLOW_SLOW_TESTS=1 ../../build-support/dist_test.py \
--disable-sharding loop -n 1000 -- bin/client-test \
--gtest_filter=*ScanFaultTolerance* --stress_cpu_threads=2

The test passed 1000/1000 times. Results:

http://dist-test.cloudera.org//job?job_id=david.alves.1481171410.4460

Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
---
M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java
M src/kudu/client/client-test-util.cc
M src/kudu/client/client-test.cc
M src/kudu/consensus/consensus.h
M src/kudu/consensus/consensus.proto
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/raft_consensus_quorum-test.cc
M src/kudu/consensus/raft_consensus_state.cc
M src/kudu/consensus/raft_consensus_state.h
M src/kudu/consensus/time_manager.cc
M src/kudu/consensus/time_manager.h
M src/kudu/integration-tests/alter_table-test.cc
M src/kudu/integration-tests/cluster_verifier.cc
M src/kudu/integration-tests/linked_list-test-util.h
M src/kudu/integration-tests/linked_list-test.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/integration-tests/tablet_history_gc-itest.cc
M src/kudu/integration-tests/test_workload.cc
M src/kudu/integration-tests/test_workload.h
M src/kudu/tablet/mvcc-test.cc
M src/kudu/tablet/mvcc.cc
M src/kudu/tablet/mvcc.h
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet_bootstrap.cc
M src/kudu/tablet/tablet_peer.cc
M src/kudu/tablet/tablet_peer.h
M src/kudu/tablet/transactions/alter_schema_transaction.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tablet/transactions/transaction_driver.h
M src/kudu/tablet/transactions/transaction_tracker-test.cc
M src/kudu/tablet/transactions/transaction_tracker.cc
M src/kudu/tablet/transactions/write_transaction.cc
M src/kudu/tablet/transactions/write_transaction.h
M src/kudu/tools/tool_action_remote_replica.cc
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/tablet_service.h
41 files changed, 591 insertions(+), 274 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/40/5240/33
-- 
To view, visit http://gerrit.cloudera.org:8080/5240
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
Gerrit-PatchSet: 33
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-798 (part 5) Correct safe time advancement

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5240

to look at the new patch set (#30).

Change subject: KUDU-798 (part 5) Correct safe time advancement
......................................................................

KUDU-798 (part 5) Correct safe time advancement

This patch fixes safe time advancement in general and allows for
safe time advancement in the absence of writes in particular.
The core of the patch is to plug in the new TimeManager class wherever
needed. But there is also a major cleanup of the waiting story in
TabletService (KUDU-1127) and a few new integration test features.

There is an instance of a TimeManager per tablet. It's used for:

- When replicating messages to other replicas a leader uses the
  TimeManager to assign timestamps and to obtain a safe time to send,
  even when there are no writes.

- When receiving messages from a leader consensus uses the TimeManager
  to update the clock and to unblock any waiters that might waiting
  for a particular timestamp to be safe.

- Before a snapshot scan proceeds to scan, it must first wait for the
  TimeManager to deem whatever timestamp it has safe. Then it proceeds
  to wait for the snapshot at timestamp to have all its transactions
  committed and, finally, proceeds with the scan.

Put together, these changes allow to make sure that snapshot scans are
repeatable in the large majority of cases. The one "whole" in safe time
is solved by leader leases. Until we have those this patch takes a
conservative approach to safe time progress.

Fixing safe time broke a bunch of our tests that were expecting broken
snapshot scans. In particular we would return broken snapshots all the
time instead of waiting for the snapshot to be correct. Of course
when these errors were fixed the tests started failing.

In order to address these test failures I cleaned up our snapshot scan
waiting story in TabletServer::HandleScanAtSnapshot(). In particular:

- The client's deadline in no longer taken into account when deciding
  how long to wait for a snapshot to be repeatable. There is a hard
  (configurable) max that the server will wait for, "clamping" the
  client's deadline. The key here is that, when the client deadline
  is clamped, we return Status::ServiceUnavailable on time out
  instead of Status::TimeOut(). Since HandleScanAtSnapshot() is called
  on KuduScanner::Open() and ServiceUnavailable is a retryable status
  this enables the client to try somewhere else, perhaps where it won't
  have to wait as long.

- TimeManager now does a pre-flight check before waiting on safe time.
  In particular it checks that: i) it has heard from the leader within
  a configurable amount of time (that safe time _can_ make progress).
  ii) it checks that the safe time it not more that a configurable
  amount of time in the past, 30 seconds by default (that safe time
  is likely to make progress to the required timestamp).

Finally, this patch adds two new integration test workloads that
prove that it works. It adds a new read workload to TestWorkload
that performs snapshot scans in the present, while writes are
happening. This can be enabled anywhere but this patch enables
it for a bunch of tests in RaftConsensusItest, in particular the
*Churny* and *Crashy* ones with unique keys. This patch also enables
linked_list-test to perform snapshot scans in the present after
the verification.

Results:

I ran raft_consensus-itest with the new snapshot read workloaf
on dist-test, asan, with the following config:

KUDU_ALLOW_SLOW_TESTS=1 ../../build-support/dist_test.py \
--disable-sharding loop -n 1000 bin/raft_consensus-itest \
--gtest_filter=*Churny*:*Crashy*-*Duplicate*

I pulled the test to before this patch. It failed 1000/1000 on
master. With this patch it passed 1000/1000:

http://dist-test.cloudera.org//job?job_id=david.alves.1481097865.18287

I ran linked_list-test with the new snapshot scans in dist-test,
asan, with the following config:

KUDU_ALLOW_SLOW_TESTS=1 ../../build-support/dist_test.py \
loop -n 1000 bin/linked_list-test --stress_cpu_threads=2 \
--gtest_filter=*TestLoadAndVerify*

The test passed 1000/1000 whereas before it would fail 427/1000.
Results:

http://dist-test.cloudera.org//job?job_id=david.alves.1481104106.24665

Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
---
M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java
M src/kudu/client/client-test-util.cc
M src/kudu/client/client-test.cc
M src/kudu/consensus/consensus.h
M src/kudu/consensus/consensus.proto
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/raft_consensus_quorum-test.cc
M src/kudu/consensus/raft_consensus_state.cc
M src/kudu/consensus/raft_consensus_state.h
M src/kudu/consensus/time_manager.cc
M src/kudu/consensus/time_manager.h
M src/kudu/integration-tests/alter_table-test.cc
M src/kudu/integration-tests/cluster_verifier.cc
M src/kudu/integration-tests/linked_list-test-util.h
M src/kudu/integration-tests/linked_list-test.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/integration-tests/tablet_history_gc-itest.cc
M src/kudu/integration-tests/test_workload.cc
M src/kudu/integration-tests/test_workload.h
M src/kudu/tablet/mvcc-test.cc
M src/kudu/tablet/mvcc.cc
M src/kudu/tablet/mvcc.h
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet_bootstrap.cc
M src/kudu/tablet/tablet_peer.cc
M src/kudu/tablet/tablet_peer.h
M src/kudu/tablet/tablet_peer_mm_ops.cc
M src/kudu/tablet/transactions/alter_schema_transaction.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tablet/transactions/transaction_driver.h
M src/kudu/tablet/transactions/transaction_tracker-test.cc
M src/kudu/tablet/transactions/transaction_tracker.cc
M src/kudu/tablet/transactions/write_transaction.cc
M src/kudu/tablet/transactions/write_transaction.h
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/tablet_service.h
41 files changed, 639 insertions(+), 325 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/40/5240/30
-- 
To view, visit http://gerrit.cloudera.org:8080/5240
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
Gerrit-PatchSet: 30
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-798 (part 5) Correct safe time advancement

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Todd Lipcon, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5240

to look at the new patch set (#34).

Change subject: KUDU-798 (part 5) Correct safe time advancement
......................................................................

KUDU-798 (part 5) Correct safe time advancement

This patch fixes safe time advancement in general and allows for
safe time advancement in the absence of writes in particular.
The core of the patch is to plug in the new TimeManager class wherever
needed. But there is also a major cleanup of the waiting story in
TabletService (KUDU-1127) and a few new integration test features.

There is an instance of a TimeManager per tablet. It's used for:

- When replicating messages to other replicas a leader uses the
  TimeManager to assign timestamps and to obtain a safe time to send,
  even when there are no writes.

- When receiving messages from a leader consensus uses the TimeManager
  to update the clock and to unblock any waiters that might waiting
  for a particular timestamp to be safe.

- Before a snapshot scan proceeds to scan, it must first wait for the
  TimeManager to deem whatever timestamp it has safe. Then it proceeds
  to wait for the snapshot at timestamp to have all its transactions
  committed and, finally, proceeds with the scan.

Put together, these changes allow to make sure that snapshot scans are
repeatable in the large majority of cases. The one "hole" in safe time
is solved by leader leases. Until we have those this patch takes a
conservative approach to safe time progress.

Fixing safe time broke a bunch of our tests that were expecting broken
snapshot scans. In particular we would return broken snapshots all the
time instead of waiting for the snapshot to be correct. Of course
when these errors were fixed the tests started failing.

In order to address these test failures I cleaned up our snapshot scan
waiting story in TabletServer::HandleScanAtSnapshot(). In particular:

- The client's deadline in no longer taken into account when deciding
  how long to wait for a snapshot to be repeatable. There is a hard
  (configurable) max that the server will wait for, "clamping" the
  client's deadline. The key here is that, when the client deadline
  is clamped, we return Status::ServiceUnavailable on time out
  instead of Status::TimeOut(). Since HandleScanAtSnapshot() is called
  on KuduScanner::Open() and ServiceUnavailable is a retryable status
  this enables the client to try somewhere else, perhaps where it won't
  have to wait as long.

- TimeManager now does a pre-flight check before waiting on safe time.
  In particular it checks that: i) it has heard from the leader within
  a configurable amount of time (that safe time _can_ make progress).
  ii) it checks that the safe time is not more that a configurable
  amount of time in the past, 30 seconds by default (that safe time
  is likely to make progress to the required timestamp).

Finally, this patch adds two new integration test workloads that
prove that it works. It adds a new read workload to TestWorkload
that performs snapshot scans in the present, while writes are
happening. This can be enabled anywhere but this patch enables
it for a bunch of tests in RaftConsensusItest, in particular the
*Churny* and *Crashy* ones with unique keys. This patch also enables
linked_list-test to perform snapshot scans in the present after
the verification.

Results:

I ran raft_consensus-itest with the new snapshot read workload
on dist-test, asan, with the following config:

KUDU_ALLOW_SLOW_TESTS=1 ../../build-support/dist_test.py \
--disable-sharding loop -n 1000 bin/raft_consensus-itest \
--gtest_filter=*Churny*:*Crashy*-*Duplicate*

I pulled the test to before this patch. It failed 1000/1000 on
master. With this patch it passed 1000/1000:

http://dist-test.cloudera.org//job?job_id=david.alves.1481097865.18287

I ran linked_list-test with the new snapshot scans in dist-test,
asan, with the following config:

KUDU_ALLOW_SLOW_TESTS=1 ../../build-support/dist_test.py \
loop -n 1000 bin/linked_list-test --stress_cpu_threads=2 \
--gtest_filter=*TestLoadAndVerify*

The test passed 1000/1000 whereas before it would fail 427/1000.
Results:

http://dist-test.cloudera.org//job?job_id=david.alves.1481104106.24665

I also ran the test in client-test that tests fault tolerance.
Run config:

KUDU_ALLOW_SLOW_TESTS=1 ../../build-support/dist_test.py \
--disable-sharding loop -n 1000 -- bin/client-test \
--gtest_filter=*ScanFaultTolerance* --stress_cpu_threads=2

The test passed 1000/1000 times. Results:

http://dist-test.cloudera.org//job?job_id=david.alves.1481171410.4460

Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
---
M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java
M src/kudu/client/client-test-util.cc
M src/kudu/client/client-test.cc
M src/kudu/consensus/consensus.h
M src/kudu/consensus/consensus.proto
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/raft_consensus_quorum-test.cc
M src/kudu/consensus/raft_consensus_state.cc
M src/kudu/consensus/raft_consensus_state.h
M src/kudu/consensus/time_manager.cc
M src/kudu/consensus/time_manager.h
M src/kudu/integration-tests/alter_table-test.cc
M src/kudu/integration-tests/cluster_verifier.cc
M src/kudu/integration-tests/linked_list-test-util.h
M src/kudu/integration-tests/linked_list-test.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/integration-tests/tablet_history_gc-itest.cc
M src/kudu/integration-tests/test_workload.cc
M src/kudu/integration-tests/test_workload.h
M src/kudu/tablet/mvcc-test.cc
M src/kudu/tablet/mvcc.cc
M src/kudu/tablet/mvcc.h
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet_bootstrap.cc
M src/kudu/tablet/tablet_peer.cc
M src/kudu/tablet/tablet_peer.h
M src/kudu/tablet/transactions/alter_schema_transaction.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tablet/transactions/transaction_driver.h
M src/kudu/tablet/transactions/transaction_tracker-test.cc
M src/kudu/tablet/transactions/transaction_tracker.cc
M src/kudu/tablet/transactions/write_transaction.cc
M src/kudu/tablet/transactions/write_transaction.h
M src/kudu/tools/tool_action_remote_replica.cc
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/tablet_service.h
41 files changed, 591 insertions(+), 274 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/40/5240/34
-- 
To view, visit http://gerrit.cloudera.org:8080/5240
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
Gerrit-PatchSet: 34
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-798 (part 5) Correct safe time advancement

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: KUDU-798 (part 5) Correct safe time advancement
......................................................................


Patch Set 27:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5240/27/src/kudu/consensus/raft_consensus.cc
File src/kudu/consensus/raft_consensus.cc:

Line 1204:     // All transactions that are going to be prepared were started, advance the safe timestamp.
> yeah, the comment is misleading likely left over from the leader leases pat
but in the meantime don't we want to do this only if we prepared all ops? i.e move this down to line 1214?


-- 
To view, visit http://gerrit.cloudera.org:8080/5240
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
Gerrit-PatchSet: 27
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] KUDU-798 (part 5) Correct safe time advancement

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: KUDU-798 (part 5) Correct safe time advancement
......................................................................


Patch Set 33: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/5240
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
Gerrit-PatchSet: 33
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] KUDU-798 (part 5) Correct safe time advancement

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: KUDU-798 (part 5) Correct safe time advancement
......................................................................


Patch Set 28:

(12 comments)

http://gerrit.cloudera.org:8080/#/c/5240/27//COMMIT_MSG
Commit Message:

PS27, Line 15: n
> nit: add comma or period?
Done


PS27, Line 15:  ins
> its
Done


http://gerrit.cloudera.org:8080/#/c/5240/27/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

Line 450:   // If we're not sending ops to the follower, set the safe time on the request.
> worth a TODO here that we could also set it if we are sending our latest op
Done


http://gerrit.cloudera.org:8080/#/c/5240/27/src/kudu/consensus/consensus_queue.h
File src/kudu/consensus/consensus_queue.h:

Line 34: //#include "kudu/consensus/time_manager.h"
> nit
Done


http://gerrit.cloudera.org:8080/#/c/5240/28/src/kudu/consensus/time_manager.cc
File src/kudu/consensus/time_manager.cc:

PS28, Line 184:  // Pre-flight checks, make sure we've heard from the leader recently and that safe time
              :   // isn't lagging too much.
> maybe we should do a GetSafeTime() call here so that if we're leader we 're
Done


http://gerrit.cloudera.org:8080/#/c/5240/28/src/kudu/consensus/time_manager.h
File src/kudu/consensus/time_manager.h:

Line 145:   void MakeWaiterTimeoutMessageUnlocked(Timestamp timestmap, std::string* error_message);
> warning: function 'kudu::consensus::TimeManager::MakeWaiterTimeoutMessageUn
Done


http://gerrit.cloudera.org:8080/#/c/5240/28/src/kudu/integration-tests/consistency-itest.cc
File src/kudu/integration-tests/consistency-itest.cc:

PS28, Line 639: 3
> What is the reason to have 3 rows inserted instead of 1?  It would be nice 
Think was leftover from a previous patch. removed


http://gerrit.cloudera.org:8080/#/c/5240/27/src/kudu/tablet/tablet_bootstrap.cc
File src/kudu/tablet/tablet_bootstrap.cc:

Line 37: #include "kudu/consensus/time_manager.h"
> needed?
Done


http://gerrit.cloudera.org:8080/#/c/5240/27/src/kudu/tablet/tablet_peer_mm_ops.cc
File src/kudu/tablet/tablet_peer_mm_ops.cc:

Line 26: #include "kudu/consensus/time_manager.h"
> unnecessary
Done


http://gerrit.cloudera.org:8080/#/c/5240/27/src/kudu/tablet/transactions/alter_schema_transaction.cc
File src/kudu/tablet/transactions/alter_schema_transaction.cc:

Line 23: #include "kudu/consensus/time_manager.h"
> unnecessary
Done


http://gerrit.cloudera.org:8080/#/c/5240/27/src/kudu/tablet/transactions/transaction_tracker.cc
File src/kudu/tablet/transactions/transaction_tracker.cc:

Line 24: #include "kudu/consensus/time_manager.h"
> unnecessary
Done


http://gerrit.cloudera.org:8080/#/c/5240/27/src/kudu/tablet/transactions/write_transaction.cc
File src/kudu/tablet/transactions/write_transaction.cc:

Line 25: #include "kudu/consensus/time_manager.h"
> same
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/5240
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
Gerrit-PatchSet: 28
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] WIP: KUDU-798 (part 5) Safe time advancement in the absence of writes

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5240

to look at the new patch set (#24).

Change subject: WIP: KUDU-798 (part 5) Safe time advancement in the absence of writes
......................................................................

WIP: KUDU-798 (part 5) Safe time advancement in the absence of writes

This plugs the new TimeManager through consensus and wherever else
it's needed and integrates with with transactions, tablet service
and consensus.

Noteworthy:

- Safe time advances as before _except_ in the absense of writes.
  That is safe time is still advanced in Mvcc by the transaction
  driver and only up to the committed transaction's timestamp.
  When there are no writes to send the leader sends its last
  safe time to the replica, which updates itself and unblocks
  waiters.

- Mvcc's safe time is a more conservative version of "global"
  safe time. Mvcc does not get updated with safe time heartbeats
  from the leader. The clean snapshot must stay "clean" otherwise
  a lot of things would break.

- Mvcc's WaitForCleanSnapshot() method was removed. There is no
  longer a guarantee that the "clean" timestamp will move to
  the present. It might not, in the absense of writes. This
  patch introduces a new method that allows to wait for all
  "known" transactions before a timestamp to be committed.

WIP workign on a thorough integration test to be merged with
this.

This should be good to be reviewed.

Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
---
M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java
M src/kudu/client/client-test-util.cc
M src/kudu/client/client-test.cc
M src/kudu/consensus/consensus.h
M src/kudu/consensus/consensus.proto
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/raft_consensus_quorum-test.cc
M src/kudu/consensus/raft_consensus_state.cc
M src/kudu/consensus/raft_consensus_state.h
M src/kudu/integration-tests/alter_table-test.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/tablet/mvcc-test.cc
M src/kudu/tablet/mvcc.cc
M src/kudu/tablet/mvcc.h
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet_bootstrap.cc
M src/kudu/tablet/tablet_peer.cc
M src/kudu/tablet/tablet_peer.h
M src/kudu/tablet/tablet_peer_mm_ops.cc
M src/kudu/tablet/transactions/alter_schema_transaction.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tablet/transactions/transaction_driver.h
M src/kudu/tablet/transactions/transaction_tracker-test.cc
M src/kudu/tablet/transactions/transaction_tracker.cc
M src/kudu/tablet/transactions/write_transaction.cc
M src/kudu/tablet/transactions/write_transaction.h
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/tablet_service.h
33 files changed, 219 insertions(+), 150 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/40/5240/24
-- 
To view, visit http://gerrit.cloudera.org:8080/5240
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
Gerrit-PatchSet: 24
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [NOT FOR REVIEW] KUDU-798 (part 3) Remove the clock from MvccManager

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5240

to look at the new patch set (#2).

Change subject: [NOT FOR REVIEW] KUDU-798 (part 3) Remove the clock from MvccManager
......................................................................

[NOT FOR REVIEW] KUDU-798 (part 3) Remove the clock from MvccManager

All mvcc transactions are now started at a timestamp so there is no
need for MvccManager to assign timestamps. Without this there
was a single place where we used the clock, outside of checks:
We used the clock in WaitForCleanSnapshotAtTimestamp() to make sure
the clock was past the timestamp before waiting on the snapshot to
be clean.

While this didn't make much sense, since the clean timestamp
shouldn't ever move past the current clock value, this is currently
a proxy to waiting for "safe time" on leader side transactions.
Until we properly handle that in follow up patches this check was
moved to where we handle snapshot scans in TabletService.

Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
---
M src/kudu/tablet/compaction-test.cc
M src/kudu/tablet/deltamemstore-test.cc
M src/kudu/tablet/diskrowset-test-base.h
M src/kudu/tablet/major_delta_compaction-test.cc
M src/kudu/tablet/memrowset-test.cc
M src/kudu/tablet/mvcc-test.cc
M src/kudu/tablet/mvcc.cc
M src/kudu/tablet/mvcc.h
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/transactions/write_transaction.cc
M src/kudu/tablet/transactions/write_transaction.h
M src/kudu/tserver/tablet_service.cc
12 files changed, 86 insertions(+), 106 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/40/5240/2
-- 
To view, visit http://gerrit.cloudera.org:8080/5240
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-798 (part 5) Correct safe time advancement

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5240

to look at the new patch set (#28).

Change subject: KUDU-798 (part 5) Correct safe time advancement
......................................................................

KUDU-798 (part 5) Correct safe time advancement

This patch fixes safe time advancement in general and allows for
safe time advancement in the absence of writes in particular.
The core of the patch is to plug in the new TimeManager class wherever
needed. But there is also a major cleanup of the waiting story in
TabletService (KUDU-1127) and a few new integration test features.

There is an instance of a TimeManager per tablet. It's used for:

- When replicating messages to other replicas a leader uses the
  TimeManager to assign timestamps and to obtain a safe time to send,
  even when there are no writes.

- When receiving messages from a leader consensus uses the TimeManager
  to update the clock and to unblock any waiters that might waiting
  for a particular timestamp to be safe.

- Before a snapshot scan proceeds to scan, it must first wait for the
  TimeManager to deem whatever timestamp it has safe. Then it proceeds
  to wait for the snapshot at timestamp to have all its transactions
  committed and, finally, proceeds with the scan.

Put together, these changes allow to make sure that snapshot scans are
repeatable in the large majority of cases. The one "whole" in safe time
is solved by leader leases. Until we have those this patch takes a
conservative approach to safe time progress.

Fixing safe time broke a bunch of our tests that were expecting broken
snapshot scans. In particular we would return broken snapshots all the
time instead of waiting for the snapshot to be correct. Of course
when these errors were fixed the tests started failing.

In order to address these test failures I cleaned up our snapshot scan
waiting story in TabletServer::HandleScanAtSnapshot(). In particular:

- The client's deadline in no longer taken into account when deciding
  how long to wait for a snapshot to be repeatable. There is a hard
  (configurable) max that the server will wait for, "clamping" the
  client's deadline. The key here is that, when the client deadline
  is clamped, we return Status::ServiceUnavailable on time out
  instead of Status::TimeOut(). Since HandleScanAtSnapshot() is called
  on KuduScanner::Open() and ServiceUnavailable is a retryable status
  this enables the client to try somewhere else, perhaps where it won't
  have to wait as long.

- TimeManager now does a pre-flight check before waiting on safe time.
  In particular it checks that: i) it has heard from the leader within
  a configurable amount of time (that safe time _can_ make progress).
  ii) it checks that the safe time it not more that a configurable
  amount of time in the past, 30 seconds by default (that safe time
  is likely to make progress to the required timestamp).

Finally, this patch adds two new integration test workloads that
prove that it works. It adds a new read workload to TestWorkload
that performs snapshot scans in the present, while writes are
happening. This can be enabled anywhere but this patch enables
it for a bunch of tests in RaftConsensusItest, in particular the
*Churny* and *Crashy* ones with unique keys. This patch also enables
linked_list-test to perform snapshot scans in the present after
the verification.

Results:

I ran raft_consensus-itest with the new snapshot read workloaf
on dist-test, asan, with the following config:

KUDU_ALLOW_SLOW_TESTS=1 ../../build-support/dist_test.py \
--disable-sharding loop -n 1000 bin/raft_consensus-itest \
--gtest_filter=*Churny*:*Crashy*-*Duplicate*

I pulled the test to before this patch. It failed 1000/1000 on
master. With this patch it passed 1000/1000:

http://dist-test.cloudera.org//job?job_id=david.alves.1481097865.18287

I ran linked_list-test with the new snapshot scans in dist-test,
asan, with the following config:

KUDU_ALLOW_SLOW_TESTS=1 ../../build-support/dist_test.py \
loop -n 1000 bin/linked_list-test --stress_cpu_threads=2 \
--gtest_filter=*TestLoadAndVerify*

The test passed 1000/1000 whereas before it would fail 427/1000.
Results:

http://dist-test.cloudera.org//job?job_id=david.alves.1481104106.24665

Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
---
M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java
M src/kudu/client/client-test-util.cc
M src/kudu/client/client-test.cc
M src/kudu/consensus/consensus.h
M src/kudu/consensus/consensus.proto
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/raft_consensus_quorum-test.cc
M src/kudu/consensus/raft_consensus_state.cc
M src/kudu/consensus/raft_consensus_state.h
M src/kudu/consensus/time_manager.cc
M src/kudu/consensus/time_manager.h
M src/kudu/integration-tests/alter_table-test.cc
M src/kudu/integration-tests/cluster_verifier.cc
M src/kudu/integration-tests/consistency-itest.cc
M src/kudu/integration-tests/linked_list-test-util.h
M src/kudu/integration-tests/linked_list-test.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/integration-tests/tablet_history_gc-itest.cc
M src/kudu/integration-tests/test_workload.cc
M src/kudu/integration-tests/test_workload.h
M src/kudu/tablet/mvcc-test.cc
M src/kudu/tablet/mvcc.cc
M src/kudu/tablet/mvcc.h
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet_bootstrap.cc
M src/kudu/tablet/tablet_peer.cc
M src/kudu/tablet/tablet_peer.h
M src/kudu/tablet/tablet_peer_mm_ops.cc
M src/kudu/tablet/transactions/alter_schema_transaction.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tablet/transactions/transaction_driver.h
M src/kudu/tablet/transactions/transaction_tracker-test.cc
M src/kudu/tablet/transactions/transaction_tracker.cc
M src/kudu/tablet/transactions/write_transaction.cc
M src/kudu/tablet/transactions/write_transaction.h
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/tablet_service.h
42 files changed, 629 insertions(+), 309 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/40/5240/28
-- 
To view, visit http://gerrit.cloudera.org:8080/5240
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
Gerrit-PatchSet: 28
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] WIP: Support safe time advancement on replicas in the absense of writes

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5240

to look at the new patch set (#11).

Change subject: WIP: Support safe time advancement on replicas in the absense of writes
......................................................................

WIP: Support safe time advancement on replicas in the absense of writes

Mostly getting some jenkins mileage but the design should be close
to the final one.

This is a WIP that needs to be broken down further but that summarily
adds the following:

- Add a TimeManager class: This class is responsible for tracking
  safe time advancement on leaders and replicas.
- Removes the clock from MvccManager - Since safe time is now managed
  by the TimeManager there is no need for Mvcc to have one internally.
- Changes how we wait for "all committed" in MvccManager. Previously
  we would wait for a clean snapshot. Now we instead wait for all
  transactions before the timestamp to be committed even if the
  clean timestamp doesn't advance. This is because without leader
  leases the safe time can go back, but moving the clean time back
  would break a lot of things internally.
- Plugs the TimeManager in consensus/queue and calls its methods
  appropriately.

Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
---
M src/kudu/consensus/CMakeLists.txt
M src/kudu/consensus/consensus.h
M src/kudu/consensus/consensus.proto
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/raft_consensus_quorum-test.cc
A src/kudu/consensus/time_manager-test.cc
A src/kudu/consensus/time_manager.cc
A src/kudu/consensus/time_manager.h
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/tablet/compaction-test.cc
M src/kudu/tablet/deltamemstore-test.cc
M src/kudu/tablet/diskrowset-test-base.h
M src/kudu/tablet/major_delta_compaction-test.cc
M src/kudu/tablet/memrowset-test.cc
M src/kudu/tablet/mvcc-test.cc
M src/kudu/tablet/mvcc.cc
M src/kudu/tablet/mvcc.h
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet_peer.cc
M src/kudu/tablet/tablet_peer.h
M src/kudu/tablet/transactions/alter_schema_transaction.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tablet/transactions/transaction_driver.h
M src/kudu/tablet/transactions/write_transaction.cc
M src/kudu/tablet/transactions/write_transaction.h
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/tablet_service.h
32 files changed, 623 insertions(+), 178 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/40/5240/11
-- 
To view, visit http://gerrit.cloudera.org:8080/5240
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
Gerrit-PatchSet: 11
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] WIP: Support safe time advancement on replicas in the absense of writes

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5240

to look at the new patch set (#10).

Change subject: WIP: Support safe time advancement on replicas in the absense of writes
......................................................................

WIP: Support safe time advancement on replicas in the absense of writes

Mostly getting some jenkins mileage but the design should be close
to the final one.

This is a WIP that needs to be broken down further but that summarily
adds the following:

- Add a TimeManager class: This class is responsible for tracking
  safe time advancement on leaders and replicas.
- Removes the clock from MvccManager - Since safe time is now managed
  by the TimeManager there is no need for Mvcc to have one internally.
- Changes how we wait for "all committed" in MvccManager. Previously
  we would wait for a clean snapshot. Now we instead wait for all
  transactions before the timestamp to be committed even if the
  clean timestamp doesn't advance. This is because without leader
  leases the safe time can go back, but moving the clean time back
  would break a lot of things internally.
- Plugs the TimeManager in consensus/queue and calls its methods
  appropriately.

Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
---
M src/kudu/consensus/CMakeLists.txt
M src/kudu/consensus/consensus.h
M src/kudu/consensus/consensus.proto
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/raft_consensus_quorum-test.cc
A src/kudu/consensus/time_manager-test.cc
A src/kudu/consensus/time_manager.cc
A src/kudu/consensus/time_manager.h
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/tablet/compaction-test.cc
M src/kudu/tablet/deltamemstore-test.cc
M src/kudu/tablet/diskrowset-test-base.h
M src/kudu/tablet/major_delta_compaction-test.cc
M src/kudu/tablet/memrowset-test.cc
M src/kudu/tablet/mvcc-test.cc
M src/kudu/tablet/mvcc.cc
M src/kudu/tablet/mvcc.h
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet_peer.cc
M src/kudu/tablet/tablet_peer.h
M src/kudu/tablet/transactions/alter_schema_transaction.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tablet/transactions/write_transaction.cc
M src/kudu/tablet/transactions/write_transaction.h
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/tablet_service.h
31 files changed, 634 insertions(+), 216 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/40/5240/10
-- 
To view, visit http://gerrit.cloudera.org:8080/5240
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
Gerrit-PatchSet: 10
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] WIP: Support safe time advancement on replicas in the absense of writes

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5240

to look at the new patch set (#7).

Change subject: WIP: Support safe time advancement on replicas in the absense of writes
......................................................................

WIP: Support safe time advancement on replicas in the absense of writes

Mostly getting some jenkins mileage but the design should be close
to the final one.

This is a WIP that needs to be broken down further but that summarily
adds the following:

- Add a TimeManager class: This class is responsible for tracking
  safe time advancement on leaders and replicas.
- Removes the clock from MvccManager - Since safe time is now managed
  by the TimeManager there is no need for Mvcc to have one internally.
- Changes how we wait for "all committed" in MvccManager. Previously
  we would wait for a clean snapshot. Now we instead wait for all
  transactions before the timestamp to be committed even if the
  clean timestamp doesn't advance. This is because without leader
  leases the safe time can go back, but moving the clean time back
  would break a lot of things internally.
- Plugs the TimeManager in consensus/queue and calls its methods
  appropriately.

Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
---
M src/kudu/consensus/CMakeLists.txt
M src/kudu/consensus/consensus.h
M src/kudu/consensus/consensus.proto
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/raft_consensus_quorum-test.cc
A src/kudu/consensus/time_manager-test.cc
A src/kudu/consensus/time_manager.cc
A src/kudu/consensus/time_manager.h
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/tablet/compaction-test.cc
M src/kudu/tablet/deltamemstore-test.cc
M src/kudu/tablet/diskrowset-test-base.h
M src/kudu/tablet/major_delta_compaction-test.cc
M src/kudu/tablet/memrowset-test.cc
M src/kudu/tablet/mvcc-test.cc
M src/kudu/tablet/mvcc.cc
M src/kudu/tablet/mvcc.h
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet_peer.cc
M src/kudu/tablet/tablet_peer.h
M src/kudu/tablet/transactions/alter_schema_transaction.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tablet/transactions/write_transaction.cc
M src/kudu/tablet/transactions/write_transaction.h
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/tablet_service.h
31 files changed, 589 insertions(+), 176 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/40/5240/7
-- 
To view, visit http://gerrit.cloudera.org:8080/5240
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-798 (part 5) Correct safe time advancement

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5240

to look at the new patch set (#31).

Change subject: KUDU-798 (part 5) Correct safe time advancement
......................................................................

KUDU-798 (part 5) Correct safe time advancement

This patch fixes safe time advancement in general and allows for
safe time advancement in the absence of writes in particular.
The core of the patch is to plug in the new TimeManager class wherever
needed. But there is also a major cleanup of the waiting story in
TabletService (KUDU-1127) and a few new integration test features.

There is an instance of a TimeManager per tablet. It's used for:

- When replicating messages to other replicas a leader uses the
  TimeManager to assign timestamps and to obtain a safe time to send,
  even when there are no writes.

- When receiving messages from a leader consensus uses the TimeManager
  to update the clock and to unblock any waiters that might waiting
  for a particular timestamp to be safe.

- Before a snapshot scan proceeds to scan, it must first wait for the
  TimeManager to deem whatever timestamp it has safe. Then it proceeds
  to wait for the snapshot at timestamp to have all its transactions
  committed and, finally, proceeds with the scan.

Put together, these changes allow to make sure that snapshot scans are
repeatable in the large majority of cases. The one "whole" in safe time
is solved by leader leases. Until we have those this patch takes a
conservative approach to safe time progress.

Fixing safe time broke a bunch of our tests that were expecting broken
snapshot scans. In particular we would return broken snapshots all the
time instead of waiting for the snapshot to be correct. Of course
when these errors were fixed the tests started failing.

In order to address these test failures I cleaned up our snapshot scan
waiting story in TabletServer::HandleScanAtSnapshot(). In particular:

- The client's deadline in no longer taken into account when deciding
  how long to wait for a snapshot to be repeatable. There is a hard
  (configurable) max that the server will wait for, "clamping" the
  client's deadline. The key here is that, when the client deadline
  is clamped, we return Status::ServiceUnavailable on time out
  instead of Status::TimeOut(). Since HandleScanAtSnapshot() is called
  on KuduScanner::Open() and ServiceUnavailable is a retryable status
  this enables the client to try somewhere else, perhaps where it won't
  have to wait as long.

- TimeManager now does a pre-flight check before waiting on safe time.
  In particular it checks that: i) it has heard from the leader within
  a configurable amount of time (that safe time _can_ make progress).
  ii) it checks that the safe time it not more that a configurable
  amount of time in the past, 30 seconds by default (that safe time
  is likely to make progress to the required timestamp).

Finally, this patch adds two new integration test workloads that
prove that it works. It adds a new read workload to TestWorkload
that performs snapshot scans in the present, while writes are
happening. This can be enabled anywhere but this patch enables
it for a bunch of tests in RaftConsensusItest, in particular the
*Churny* and *Crashy* ones with unique keys. This patch also enables
linked_list-test to perform snapshot scans in the present after
the verification.

Results:

I ran raft_consensus-itest with the new snapshot read workloaf
on dist-test, asan, with the following config:

KUDU_ALLOW_SLOW_TESTS=1 ../../build-support/dist_test.py \
--disable-sharding loop -n 1000 bin/raft_consensus-itest \
--gtest_filter=*Churny*:*Crashy*-*Duplicate*

I pulled the test to before this patch. It failed 1000/1000 on
master. With this patch it passed 1000/1000:

http://dist-test.cloudera.org//job?job_id=david.alves.1481097865.18287

I ran linked_list-test with the new snapshot scans in dist-test,
asan, with the following config:

KUDU_ALLOW_SLOW_TESTS=1 ../../build-support/dist_test.py \
loop -n 1000 bin/linked_list-test --stress_cpu_threads=2 \
--gtest_filter=*TestLoadAndVerify*

The test passed 1000/1000 whereas before it would fail 427/1000.
Results:

http://dist-test.cloudera.org//job?job_id=david.alves.1481104106.24665

Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
---
M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java
M src/kudu/client/client-test-util.cc
M src/kudu/client/client-test.cc
M src/kudu/consensus/consensus.h
M src/kudu/consensus/consensus.proto
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/raft_consensus_quorum-test.cc
M src/kudu/consensus/raft_consensus_state.cc
M src/kudu/consensus/raft_consensus_state.h
M src/kudu/consensus/time_manager.cc
M src/kudu/consensus/time_manager.h
M src/kudu/integration-tests/alter_table-test.cc
M src/kudu/integration-tests/cluster_verifier.cc
M src/kudu/integration-tests/linked_list-test-util.h
M src/kudu/integration-tests/linked_list-test.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/integration-tests/tablet_history_gc-itest.cc
M src/kudu/integration-tests/test_workload.cc
M src/kudu/integration-tests/test_workload.h
M src/kudu/tablet/mvcc-test.cc
M src/kudu/tablet/mvcc.cc
M src/kudu/tablet/mvcc.h
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet_bootstrap.cc
M src/kudu/tablet/tablet_peer.cc
M src/kudu/tablet/tablet_peer.h
M src/kudu/tablet/transactions/alter_schema_transaction.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tablet/transactions/transaction_driver.h
M src/kudu/tablet/transactions/transaction_tracker-test.cc
M src/kudu/tablet/transactions/transaction_tracker.cc
M src/kudu/tablet/transactions/write_transaction.cc
M src/kudu/tablet/transactions/write_transaction.h
M src/kudu/tools/tool_action_remote_replica.cc
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/tablet_service.h
41 files changed, 638 insertions(+), 327 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/40/5240/31
-- 
To view, visit http://gerrit.cloudera.org:8080/5240
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
Gerrit-PatchSet: 31
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-798 (part 3) Remove the clock from MvccManager

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: KUDU-798 (part 3) Remove the clock from MvccManager
......................................................................


Patch Set 1:

still worth looking at this or not really given the test failures?

-- 
To view, visit http://gerrit.cloudera.org:8080/5240
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] KUDU-798 (part 5) Correct safe time advancement

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: KUDU-798 (part 5) Correct safe time advancement
......................................................................


Patch Set 27:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/5240/27//COMMIT_MSG
Commit Message:

PS27, Line 15: it's
its


http://gerrit.cloudera.org:8080/#/c/5240/27/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

Line 450:   // If we're not sending ops to the follower, set the safe time on the request.
worth a TODO here that we could also set it if we are sending our latest op to the follower?


http://gerrit.cloudera.org:8080/#/c/5240/27/src/kudu/consensus/consensus_queue.h
File src/kudu/consensus/consensus_queue.h:

Line 34: //#include "kudu/consensus/time_manager.h"
nit


http://gerrit.cloudera.org:8080/#/c/5240/27/src/kudu/consensus/raft_consensus.cc
File src/kudu/consensus/raft_consensus.cc:

PS27, Line 1198: fail
               :       // once we have that functionality we'll have to revisit this.
nit: two sentences


Line 1204:     // All transactions that are going to be prepared were started, advance the safe timestamp.
this doesn't smell quite right: imagine the leader sent:

1.1 @ time 10
1.2 @ time 20
1.3 @ time 30
safetime = 40

and then we failed to prepare 1.3.. in that case, we can't go ahead and set safetime to 40


http://gerrit.cloudera.org:8080/#/c/5240/27/src/kudu/tablet/tablet_bootstrap.cc
File src/kudu/tablet/tablet_bootstrap.cc:

Line 37: #include "kudu/consensus/time_manager.h"
needed?


http://gerrit.cloudera.org:8080/#/c/5240/27/src/kudu/tablet/tablet_peer_mm_ops.cc
File src/kudu/tablet/tablet_peer_mm_ops.cc:

Line 26: #include "kudu/consensus/time_manager.h"
unnecessary


http://gerrit.cloudera.org:8080/#/c/5240/27/src/kudu/tablet/transactions/alter_schema_transaction.cc
File src/kudu/tablet/transactions/alter_schema_transaction.cc:

Line 23: #include "kudu/consensus/time_manager.h"
unnecessary


http://gerrit.cloudera.org:8080/#/c/5240/27/src/kudu/tablet/transactions/transaction_tracker.cc
File src/kudu/tablet/transactions/transaction_tracker.cc:

Line 24: #include "kudu/consensus/time_manager.h"
unnecessary


http://gerrit.cloudera.org:8080/#/c/5240/27/src/kudu/tablet/transactions/write_transaction.cc
File src/kudu/tablet/transactions/write_transaction.cc:

Line 25: #include "kudu/consensus/time_manager.h"
same


-- 
To view, visit http://gerrit.cloudera.org:8080/5240
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
Gerrit-PatchSet: 27
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] WIP: Support safe time advancement on replicas in the absense of writes

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5240

to look at the new patch set (#3).

Change subject: WIP: Support safe time advancement on replicas in the absense of writes
......................................................................

WIP: Support safe time advancement on replicas in the absense of writes

Mostly getting some jenkins mileage but the design should be close
to the final one.

This is a WIP that needs to be broken down further but that summarily
adds the following:

- Add a TimeManager class: This class is responsible for tracking
  safe time advancement on leaders and replicas.
- Removes the clock from MvccManager - Since safe time is now managed
  by the TimeManager there is no need for Mvcc to have one internally.
- Changes how we wait for "all committed" in MvccManager. Previously
  we would wait for a clean snapshot. Now we instead wait for all
  transactions before the timestamp to be committed even if the
  clean timestamp doesn't advance. This is because without leader
  leases the safe time can go back, but moving the clean time back
  would break a lot of things internally.
- Plugs the TimeManager in consensus/queue and calls its methods
  appropriately.

Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
---
M src/kudu/consensus/CMakeLists.txt
M src/kudu/consensus/consensus.h
M src/kudu/consensus/consensus.proto
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/raft_consensus_quorum-test.cc
A src/kudu/consensus/time_manager-test.cc
A src/kudu/consensus/time_manager.cc
A src/kudu/consensus/time_manager.h
M src/kudu/tablet/compaction-test.cc
M src/kudu/tablet/deltamemstore-test.cc
M src/kudu/tablet/diskrowset-test-base.h
M src/kudu/tablet/major_delta_compaction-test.cc
M src/kudu/tablet/memrowset-test.cc
M src/kudu/tablet/mvcc-test.cc
M src/kudu/tablet/mvcc.cc
M src/kudu/tablet/mvcc.h
M src/kudu/tablet/tablet.cc
M src/kudu/tablet/tablet_peer.cc
M src/kudu/tablet/tablet_peer.h
M src/kudu/tablet/transactions/alter_schema_transaction.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tablet/transactions/write_transaction.cc
M src/kudu/tablet/transactions/write_transaction.h
M src/kudu/tserver/tablet_service.cc
M src/kudu/tserver/tablet_service.h
30 files changed, 576 insertions(+), 170 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/40/5240/3
-- 
To view, visit http://gerrit.cloudera.org:8080/5240
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] KUDU-798 (part 5) Correct safe time advancement

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: KUDU-798 (part 5) Correct safe time advancement
......................................................................


Patch Set 27:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5240/27/src/kudu/consensus/raft_consensus.cc
File src/kudu/consensus/raft_consensus.cc:

Line 1204:     // All transactions that are going to be prepared were started, advance the safe timestamp.
> this doesn't smell quite right: imagine the leader sent:
yeah, the comment is misleading likely left over from the leader leases patch. see, when we have those, the leader is free to send safe time with actual uncommitted messages, but until then the leader only sets this timestamp when there are no messages to send. this largely reduces the opportunity for anomalies, at the cost of some extra latency. its a bit conservative, but can be helped with "update me" rpcs which I still have some hope to put up as it should be simple


-- 
To view, visit http://gerrit.cloudera.org:8080/5240
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I8532fdb069c8bee7f3e08ffe74cab0273885cc8e
Gerrit-PatchSet: 27
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes