You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org> on 2017/06/19 15:22:42 UTC

[kudu-CR] Simplify OpId/Timestamp assignment

David Ribeiro Alves has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/7221

Change subject: Simplify OpId/Timestamp assignment
......................................................................

Simplify OpId/Timestamp assignment

There are a couple of constraints that are tricky to enforce and were
requiring us to have convoluted interactions on OpId/Timestamp assignment
and registration between the TimeManager, MvccManager and ConsensusQueue.

These convoluted interactions are currently preventing a clean and simple
leader leases implementation, particularly the single timestamp outstanding
in TimeManager.

The "tricky" constraints are:

i) Before clean time is advanced in mvcc, all transactions with a
timestamp lower than the current time must be registered with mvcc.
This allows to wait for a certain timestamp t to be "safe" with TimeManager
and then wait for all transactions with a timestamp t1 < t to commit, thus
making sure that a scan at t is repeatable.

ii) The leader can't advance follower safe timestamp to t before
all the operations with a timestamp lower than it have also been sent.
This is hard since timestamps are assigned outside of consensus.

In general it's problematic to generate OpIds non-atomically with
timestamps, since these are both supposed to be monotonically increasing
in unison. To address this we had some complex state and interactions between
the components. For instance this was requiring us to have TimeManager
stop safe time advancement until a transaction being prepared was
submitted to the queue, after assigning a timestamp in the
TransactionDriver. This was problematic because with would only allow at most
one assigned timestamp in-flight, meaning we were not able to extract the
current safe time at will.

This patch completely addresses ii), Timestamps and OpIds are now assigned
by PeerMessageQueue, atomically. This makes it trivial to make sure that the
safe time sent to replicas is inline with the sent messages. This also allows
some simplifications and refactorings of the PeerMessageQueue and TimeManager.

As for i) the way we enforce this constraint changed. Instead of registering
a transaction before submitting it to consensus (which we now can't since
consensus submission is also where we assign the timestamp) we register
another, dummy transaction on the MvccManager before submission and abort it
afterwards. This allows to "pin" safe time advancement down to a timestamp
that is mandatorily smaller than then one that will be assigned.

Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
---
M src/kudu/consensus/consensus-test-util.h
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/log_cache-test.cc
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/time_manager-test.cc
M src/kudu/consensus/time_manager.cc
M src/kudu/consensus/time_manager.h
M src/kudu/tablet/transactions/transaction_driver.cc
10 files changed, 167 insertions(+), 135 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/21/7221/1
-- 
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <da...@gmail.com>

[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/7221 )

Change subject: Simplify OpId/Timestamp assignment and make it atomic
......................................................................


Patch Set 24:

(5 comments)

Just posting some nits on the commit message.  I'll try to look at the after two weeks, if it's still around.

http://gerrit.cloudera.org:8080/#/c/7221/24//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/7221/24//COMMIT_MSG@26
PS24, Line 26: also 
drop 'also'?


http://gerrit.cloudera.org:8080/#/c/7221/24//COMMIT_MSG@26
PS24, Line 26: it
t  ?


http://gerrit.cloudera.org:8080/#/c/7221/24//COMMIT_MSG@29
PS24, Line 29: non-atomically with
separately from


http://gerrit.cloudera.org:8080/#/c/7221/24//COMMIT_MSG@35
PS24, Line 35: with
it ?


http://gerrit.cloudera.org:8080/#/c/7221/24//COMMIT_MSG@39
PS24, Line 39: ,
replace with stop/dot ?



-- 
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-Change-Number: 7221
Gerrit-PatchSet: 24
Gerrit-Owner: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Tue, 27 Mar 2018 06:08:35 +0000
Gerrit-HasComments: Yes

[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: Simplify OpId/Timestamp assignment and make it atomic
......................................................................


Patch Set 8:

(2 comments)

need to look in more detail at the changes in consensus/ but some quicker questions about the MvccManager design

http://gerrit.cloudera.org:8080/#/c/7221/8/src/kudu/tablet/transactions/transaction_driver.cc
File src/kudu/tablet/transactions/transaction_driver.cc:

Line 267:       prepare_state_ = PREPARED;
it's sketching me out that the state transition to PREPARED happens under this lock acquisition for followers but not until down below for leaders. Why can't we keep it as it was?


Separate note: I think the big block comment in the header needs to be tweaked a bit.


PS8, Line 295:       // Register a dummy transaction with mvcc manager that we'll abort later.
instead of using a dummy transaction, would it be possible to register it early with the current clock time, as you've done here, and add a new API like MvccManager::PushTransactionTimestamp(scoped_txn, new_timestamp) which changes the timestamp of the existing entry?

This would avoid an allocation at least and can include an assertion that you can only PushTransactionTimestamp to a later timestamp


-- 
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Kudu Jenkins, Todd Lipcon, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/7221

to look at the new patch set (#10).

Change subject: Simplify OpId/Timestamp assignment and make it atomic
......................................................................

Simplify OpId/Timestamp assignment and make it atomic

There are a couple of constraints that are tricky to enforce and were
requiring us to have convoluted interactions on OpId/Timestamp assignment
and registration between the TimeManager, MvccManager and ConsensusQueue.

These convoluted interactions are currently preventing a clean and simple
leader leases implementation, particularly the single timestamp outstanding
in TimeManager.

The "tricky" constraints are:

i) Before clean time is advanced in mvcc, all transactions with a
timestamp lower than the current time must be registered with mvcc.
This allows to wait for a certain timestamp t to be "safe" with TimeManager
and then wait for all transactions with a timestamp t1 < t to commit, thus
making sure that a scan at t is repeatable.

ii) The leader can't advance follower safe timestamp to t before
all the operations with a timestamp lower than it have also been sent.
This is hard since timestamps are assigned outside of consensus.

In general it's problematic to generate OpIds non-atomically with
timestamps, since these are both supposed to be monotonically increasing
in unison. To address this we had some complex state and interactions between
the components. For instance this was requiring us to have TimeManager
stop safe time advancement until a transaction being prepared was
submitted to the queue, after assigning a timestamp in the
TransactionDriver. This was problematic because with would only allow at most
one assigned timestamp in-flight, meaning we were not able to move safe
time at will.

This patch completely addresses ii), Timestamps and OpIds are now assigned
by PeerMessageQueue, atomically. This makes it trivial to make sure that the
safe time sent to replicas is inline with the sent messages. This also allows
some simplifications and refactorings of the PeerMessageQueue and TimeManager.

As for i) the way we enforce this constraint changed. Instead of registering
a transaction before submitting it to consensus (which we now can't since
consensus submission is also where we assign the timestamp) we register
another, dummy transaction on the MvccManager before submission and abort it
afterwards. This allows to "pin" safe time advancement down to a timestamp
that is mandatorily smaller than then one that will be assigned thus enforcing
the constraint with a simpler implementation at the cost of an additional
(short-lived) transaction in mvcc. Note in this regard that there will be
at most one aditional transaction in mvcc.

One side effect of this change is that, for new config changes, we won't
know the OpId until after the queue submission, which needs to happen
after actually setting the pending config. This is largely inconsequential
since, by design, we can have at most one pending config, but did require
some refactoring.

Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
---
M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java
M src/kudu/consensus/consensus-test-util.h
M src/kudu/consensus/consensus_meta.cc
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/log_cache-test.cc
M src/kudu/consensus/pending_rounds.cc
M src/kudu/consensus/quorum_util.cc
M src/kudu/consensus/quorum_util.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/time_manager-test.cc
M src/kudu/consensus/time_manager.cc
M src/kudu/consensus/time_manager.h
M src/kudu/integration-tests/alter_table-test.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/master/sys_catalog.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tserver/tablet_server-test.cc
21 files changed, 335 insertions(+), 270 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/21/7221/10
-- 
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-Change-Number: 7221
Gerrit-PatchSet: 10
Gerrit-Owner: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Mike Percy, Alexey Serbin, Kudu Jenkins, Todd Lipcon, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/7221

to look at the new patch set (#23).

Change subject: Simplify OpId/Timestamp assignment and make it atomic
......................................................................

Simplify OpId/Timestamp assignment and make it atomic

There are a couple of constraints that are tricky to enforce and were
requiring us to have convoluted interactions on OpId/Timestamp assignment
and registration between the TimeManager, MvccManager and ConsensusQueue.

These convoluted interactions are currently preventing a clean and simple
leader leases implementation, particularly the single timestamp outstanding
in TimeManager.

The "tricky" constraints are:

i) Before clean time is advanced in mvcc, all transactions with a
timestamp lower than the current time must be registered with mvcc.
This allows to wait for a certain timestamp t to be "safe" with TimeManager
and then wait for all transactions with a timestamp t1 < t to commit, thus
making sure that a scan at t is repeatable.

ii) The leader can't advance follower safe timestamp to t before
all the operations with a timestamp lower than it have also been sent.
This is hard since timestamps are assigned outside of consensus.

In general it's problematic to generate OpIds non-atomically with
timestamps, since these are both supposed to be monotonically increasing
in unison. To address this we had some complex state and interactions between
the components. For instance this was requiring us to have TimeManager
stop safe time advancement until a transaction being prepared was
submitted to the queue, after assigning a timestamp in the
TransactionDriver. This was problematic because with would only allow at most
one assigned timestamp in-flight, meaning we were not able to move safe
time at will.

This patch completely addresses ii), Timestamps and OpIds are now assigned
by PeerMessageQueue, atomically. This makes it trivial to make sure that the
safe time sent to replicas is inline with the sent messages. This also allows
some simplifications and refactorings of the PeerMessageQueue and TimeManager.

As for i) the way we enforce this constraint changed. Instead of registering
a transaction before submitting it to consensus (which we now can't since
consensus submission is also where we assign the timestamp) we register
another, dummy transaction on the MvccManager before submission and abort it
afterwards. This allows to "pin" safe time advancement down to a timestamp
that is mandatorily smaller than then one that will be assigned thus enforcing
the constraint with a simpler implementation at the cost of an additional
(short-lived) transaction in mvcc. Note in this regard that there will be
at most one aditional transaction in mvcc.

One side effect of this change is that, for new config changes, we won't
know the OpId until after the queue submission, which needs to happen
after actually setting the pending config. This is largely inconsequential
since, by design, we can have at most one pending config, but did require
some refactoring.

Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
---
M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java
M src/kudu/consensus/consensus-test-util.h
M src/kudu/consensus/consensus_meta-test.cc
M src/kudu/consensus/consensus_meta.cc
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/log_cache-test.cc
M src/kudu/consensus/pending_rounds.cc
M src/kudu/consensus/quorum_util.cc
M src/kudu/consensus/quorum_util.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/time_manager-test.cc
M src/kudu/consensus/time_manager.cc
M src/kudu/consensus/time_manager.h
M src/kudu/integration-tests/alter_table-test.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/master/sys_catalog.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tserver/tablet_server-test.cc
M src/kudu/tserver/ts_tablet_manager.cc
23 files changed, 356 insertions(+), 284 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/21/7221/23
-- 
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-Change-Number: 7221
Gerrit-PatchSet: 23
Gerrit-Owner: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Mike Percy, Alexey Serbin, Kudu Jenkins, Todd Lipcon, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/7221

to look at the new patch set (#18).

Change subject: Simplify OpId/Timestamp assignment and make it atomic
......................................................................

Simplify OpId/Timestamp assignment and make it atomic

There are a couple of constraints that are tricky to enforce and were
requiring us to have convoluted interactions on OpId/Timestamp assignment
and registration between the TimeManager, MvccManager and ConsensusQueue.

These convoluted interactions are currently preventing a clean and simple
leader leases implementation, particularly the single timestamp outstanding
in TimeManager.

The "tricky" constraints are:

i) Before clean time is advanced in mvcc, all transactions with a
timestamp lower than the current time must be registered with mvcc.
This allows to wait for a certain timestamp t to be "safe" with TimeManager
and then wait for all transactions with a timestamp t1 < t to commit, thus
making sure that a scan at t is repeatable.

ii) The leader can't advance follower safe timestamp to t before
all the operations with a timestamp lower than it have also been sent.
This is hard since timestamps are assigned outside of consensus.

In general it's problematic to generate OpIds non-atomically with
timestamps, since these are both supposed to be monotonically increasing
in unison. To address this we had some complex state and interactions between
the components. For instance this was requiring us to have TimeManager
stop safe time advancement until a transaction being prepared was
submitted to the queue, after assigning a timestamp in the
TransactionDriver. This was problematic because with would only allow at most
one assigned timestamp in-flight, meaning we were not able to move safe
time at will.

This patch completely addresses ii), Timestamps and OpIds are now assigned
by PeerMessageQueue, atomically. This makes it trivial to make sure that the
safe time sent to replicas is inline with the sent messages. This also allows
some simplifications and refactorings of the PeerMessageQueue and TimeManager.

As for i) the way we enforce this constraint changed. Instead of registering
a transaction before submitting it to consensus (which we now can't since
consensus submission is also where we assign the timestamp) we register
another, dummy transaction on the MvccManager before submission and abort it
afterwards. This allows to "pin" safe time advancement down to a timestamp
that is mandatorily smaller than then one that will be assigned thus enforcing
the constraint with a simpler implementation at the cost of an additional
(short-lived) transaction in mvcc. Note in this regard that there will be
at most one aditional transaction in mvcc.

One side effect of this change is that, for new config changes, we won't
know the OpId until after the queue submission, which needs to happen
after actually setting the pending config. This is largely inconsequential
since, by design, we can have at most one pending config, but did require
some refactoring.

Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
---
M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java
M src/kudu/consensus/consensus-test-util.h
M src/kudu/consensus/consensus_meta-test.cc
M src/kudu/consensus/consensus_meta.cc
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/log_cache-test.cc
M src/kudu/consensus/pending_rounds.cc
M src/kudu/consensus/quorum_util.cc
M src/kudu/consensus/quorum_util.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/time_manager-test.cc
M src/kudu/consensus/time_manager.cc
M src/kudu/consensus/time_manager.h
M src/kudu/integration-tests/alter_table-test.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/master/sys_catalog.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tserver/tablet_server-test.cc
M src/kudu/tserver/ts_tablet_manager.cc
23 files changed, 348 insertions(+), 282 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/21/7221/18
-- 
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-Change-Number: 7221
Gerrit-PatchSet: 18
Gerrit-Owner: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7221

to look at the new patch set (#6).

Change subject: Simplify OpId/Timestamp assignment and make it atomic
......................................................................

Simplify OpId/Timestamp assignment and make it atomic

There are a couple of constraints that are tricky to enforce and were
requiring us to have convoluted interactions on OpId/Timestamp assignment
and registration between the TimeManager, MvccManager and ConsensusQueue.

These convoluted interactions are currently preventing a clean and simple
leader leases implementation, particularly the single timestamp outstanding
in TimeManager.

The "tricky" constraints are:

i) Before clean time is advanced in mvcc, all transactions with a
timestamp lower than the current time must be registered with mvcc.
This allows to wait for a certain timestamp t to be "safe" with TimeManager
and then wait for all transactions with a timestamp t1 < t to commit, thus
making sure that a scan at t is repeatable.

ii) The leader can't advance follower safe timestamp to t before
all the operations with a timestamp lower than it have also been sent.
This is hard since timestamps are assigned outside of consensus.

In general it's problematic to generate OpIds non-atomically with
timestamps, since these are both supposed to be monotonically increasing
in unison. To address this we had some complex state and interactions between
the components. For instance this was requiring us to have TimeManager
stop safe time advancement until a transaction being prepared was
submitted to the queue, after assigning a timestamp in the
TransactionDriver. This was problematic because with would only allow at most
one assigned timestamp in-flight, meaning we were not able to move safe
time at will.

This patch completely addresses ii), Timestamps and OpIds are now assigned
by PeerMessageQueue, atomically. This makes it trivial to make sure that the
safe time sent to replicas is inline with the sent messages. This also allows
some simplifications and refactorings of the PeerMessageQueue and TimeManager.

As for i) the way we enforce this constraint changed. Instead of registering
a transaction before submitting it to consensus (which we now can't since
consensus submission is also where we assign the timestamp) we register
another, dummy transaction on the MvccManager before submission and abort it
afterwards. This allows to "pin" safe time advancement down to a timestamp
that is mandatorily smaller than then one that will be assigned thus enforcing
the constraint with a simpler implementation at the cost of an additional
(short-lived) transaction in mvcc. Note in this regard that there will be
at most one aditional transaction in mvcc.

One side effect of this change is that, for new config changes, we won't
know the OpId until after the queue submission, which needs to happen
after actually setting the pending config. This is largely inconsequential
since, by design, we can have at most one pending config, but did require
some refactoring.

Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
---
M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java
M src/kudu/consensus/consensus-test-util.h
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/log_cache-test.cc
M src/kudu/consensus/pending_rounds.cc
M src/kudu/consensus/quorum_util.cc
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/time_manager-test.cc
M src/kudu/consensus/time_manager.cc
M src/kudu/consensus/time_manager.h
M src/kudu/integration-tests/alter_table-test.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tserver/tablet_server-test.cc
18 files changed, 303 insertions(+), 254 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/21/7221/6
-- 
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot

[kudu-CR] WIP: Simplify OpId/Timestamp assignment and make it atomic

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Kudu Jenkins, Todd Lipcon, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/7221

to look at the new patch set (#14).

Change subject: WIP: Simplify OpId/Timestamp assignment and make it atomic
......................................................................

WIP: Simplify OpId/Timestamp assignment and make it atomic

Note: Wip simply because one test has been temporarily disabled, the large
majority of the patch should be gtg.

Context:

Each message serialized through consensus is assigned a Timestamp and an OpId.
These indexes are used in different contexts to establish ordering and, for the
most part, they are monotonically increasing. In particular, for write messages,
successive OpIds are guaranteed to have increasing Timestamps. For non-write
messages though, like NO_OP and CONFIG_CHANGE, this is not the case. Because
timestamps are assigned to messages in different places (for write messages they
are assigned in the prepare phase, by the TransactionDriver, while for NO_OP and
CONFIG_CHANGE they are assigned inside consensus) there was no way, until now,
to make sure that timestamps are monotonically increasing across message types,
like OpIds are.

Not being able to trust that timestamps are monotonically increasing across
all message types, means that we can't use the timestamps from non-write
messages to advance time. This is a requirement for leader leases, as we
need a single source of truth regarding time advancement and need config
changes in particular to have been assigned timestamps we can trust, but
pehaps more importantly it's also a requirement to be able to advance time
on bootstrap when there are no write messages in the WAL. Not being able
to do this is the underlying cause of KUDU-2233.

Implementation:

There are a couple of constraints that are tricky to enforce and were
requiring us to have convoluted interactions on OpId/Timestamp assignment
and registration between the TimeManager, MvccManager and ConsensusQueue.
These convoluted interactions make it so that OpId and Timestamp are
not assigned atomically.

The "tricky" constraints are:

i) Before clean time is advanced in mvcc, all transactions with a
timestamp lower than the current time must be registered with mvcc.
This allows to wait for a certain timestamp t to be "safe" with TimeManager
and then wait for all transactions with a timestamp t1 < t to commit, thus
making sure that a scan at t is repeatable.

ii) The leader can't advance follower safe timestamp to t before
all the operations with a timestamp lower than it have also been sent.
This is hard since timestamps are assigned outside of consensus.

In general it's problematic to generate OpIds non-atomically with
timestamps, since these are both supposed to be monotonically increasing
in unison. To address this we had some complex state and interactions between
the components. For instance this was requiring us to have TimeManager
stop safe time advancement until a transaction being prepared was
submitted to the queue, after assigning a timestamp in the
TransactionDriver. This was problematic because with would only allow at most
one assigned timestamp in-flight, meaning we were not able to move safe
time at will.

This patch completely addresses ii), Timestamps and OpIds are now assigned
by PeerMessageQueue, atomically. This makes it trivial to make sure that the
safe time sent to replicas is inline with the sent messages. This also allows
some simplifications and refactorings of the PeerMessageQueue and TimeManager.

As for i) the way we enforce this constraint changed. Instead of registering
a transaction before submitting it to consensus (which we now can't since
consensus submission is also where we assign the timestamp) we register
another, dummy transaction on the MvccManager before submission and abort it
afterwards. This allows to "pin" safe time advancement down to a timestamp
that is mandatorily smaller than then one that will be assigned thus enforcing
the constraint with a simpler implementation at the cost of an additional
(short-lived) transaction in mvcc. Note in this regard that there will be
at most one aditional transaction in mvcc.

One side effect of this change is that, for new config changes, we won't
know the OpId until after the queue submission, which needs to happen
after actually setting the pending config. This is largely inconsequential
since, by design, we can have at most one pending config, but did require
some refactoring.

Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
---
M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java
M src/kudu/consensus/consensus-test-util.h
M src/kudu/consensus/consensus_meta-test.cc
M src/kudu/consensus/consensus_meta.cc
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/log_cache-test.cc
M src/kudu/consensus/pending_rounds.cc
M src/kudu/consensus/quorum_util.cc
M src/kudu/consensus/quorum_util.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/time_manager-test.cc
M src/kudu/consensus/time_manager.cc
M src/kudu/consensus/time_manager.h
M src/kudu/integration-tests/alter_table-test.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/master/sys_catalog.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tserver/tablet_server-test.cc
M src/kudu/tserver/ts_tablet_manager.cc
23 files changed, 342 insertions(+), 273 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/21/7221/14
-- 
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-Change-Number: 7221
Gerrit-PatchSet: 14
Gerrit-Owner: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/7221 )

Change subject: Simplify OpId/Timestamp assignment and make it atomic
......................................................................


Patch Set 17:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7221/8/src/kudu/tablet/transactions/transaction_driver.cc
File src/kudu/tablet/transactions/transaction_driver.cc:

http://gerrit.cloudera.org:8080/#/c/7221/8/src/kudu/tablet/transactions/transaction_driver.cc@295
PS8, Line 295:     CHECK_EQ(prepare_state_, NOT_PREPARED);
> instead of using a dummy transaction, would it be possible to register it e
Did you see this comment and the one above?



-- 
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-Change-Number: 7221
Gerrit-PatchSet: 17
Gerrit-Owner: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Sat, 10 Feb 2018 01:22:52 +0000
Gerrit-HasComments: Yes

[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/7221 )

Change subject: Simplify OpId/Timestamp assignment and make it atomic
......................................................................


Patch Set 17:

(18 comments)

http://gerrit.cloudera.org:8080/#/c/7221/16//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/7221/16//COMMIT_MSG@27
PS16, Line 27: pehaps
perhaps


http://gerrit.cloudera.org:8080/#/c/7221/16//COMMIT_MSG@43
PS16, Line 43: t
readability nit: T


http://gerrit.cloudera.org:8080/#/c/7221/16//COMMIT_MSG@44
PS16, Line 44: t1 < t
readability nit: T1 < T


http://gerrit.cloudera.org:8080/#/c/7221/16//COMMIT_MSG@45
PS16, Line 45: t
readability nit: T


http://gerrit.cloudera.org:8080/#/c/7221/16//COMMIT_MSG@74
PS16, Line 74: aditional
additional


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus-test-util.h
File src/kudu/consensus/consensus-test-util.h:

http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus-test-util.h@64
PS17, Line 64: using log::Log;
             : using strings::Substitute;
nit: usually it's not a good idea to have 'using' in header files, unless that's inside function body or other locally enclosed scope.


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus-test-util.h@67
PS17, Line 67: inline gscoped_ptr<ReplicateMsg> CreateDummyReplicateWithOpId
As I see, the pattern is always CreateDummyReplicateWithOpId().release().

Why to return a gscoped_ptr at all?  Maybe, just return a raw pointer instead.


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus-test-util.h@82
PS17, Line 82: inline gscoped_ptr<ReplicateMsg> CreateDummyReplicateWithoutOpId(int64_t payload_size) {
ditto: maybe, just return a raw pointer.


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus-test-util.h@115
PS17, Line 115:     AppendMode mode = APPEND_NON_LEADER,
Instead, is it possible to rely on the mode of the object pointed to by the queue parameter?


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus-test-util.h@133
PS17, Line 133:  
an extra space


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus_meta.cc
File src/kudu/consensus/consensus_meta.cc:

http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus_meta.cc@153
PS17, Line 153: CHECK(config.has_opid_index());
Maybe, just use DCHECK() here?  As I understand, the committed configuration is verified by VerifyRaftConfig() upon flushing the data into the disk.


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus_queue.h
File src/kudu/consensus/consensus_queue.h:

http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus_queue.h@238
PS17, Line 238:   Status LeaderAppendOperation(const ReplicateRefPtr& msg);
              : 
              :   Status NonLeaderAppendOperation(const ReplicateRefPtr& msg);
Is it possible just to rely on the status of the queue and have a single method here?  Or you wanted it exactly like this because this way it's easier to enforce some constraints?


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.cc
File src/kudu/consensus/raft_consensus.cc:

PS17: 
Does it make sense to add some specific test for this new behavior or it's already enough coverage?


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.cc@2622
PS17, Line 2622: RaftConfigPB
This could be const reference as well.


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.cc@2623
PS17, Line 2623:       CHECK(differencer.Equals(new_config, pending_config)) << "The pending config and"
nit: add a space after 'and'


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.cc@2624
PS17, Line 2624: .
nit: replace with colon?


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.cc@2844
PS17, Line 2844:     // We didn't know the op id index at the time we set the config as pending so set it
               :     // so that we can still compare the configs.
               :     pending_config.set_opid_index(config_to_commit.opid_index());
Since this is just for comparing configs in the if() closure below, move it in there?


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/time_manager-test.cc
File src/kudu/consensus/time_manager-test.cc:

http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/time_manager-test.cc@a176
PS17, Line 176: 
Maybe, add an assertion instead:

ASSERT_GT(time_manager_->GetSafeTime(), safe_before);



-- 
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-Change-Number: 7221
Gerrit-PatchSet: 17
Gerrit-Owner: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Tue, 13 Feb 2018 16:49:01 +0000
Gerrit-HasComments: Yes

[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Mike Percy, Alexey Serbin, Kudu Jenkins, Todd Lipcon, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/7221

to look at the new patch set (#22).

Change subject: Simplify OpId/Timestamp assignment and make it atomic
......................................................................

Simplify OpId/Timestamp assignment and make it atomic

There are a couple of constraints that are tricky to enforce and were
requiring us to have convoluted interactions on OpId/Timestamp assignment
and registration between the TimeManager, MvccManager and ConsensusQueue.

These convoluted interactions are currently preventing a clean and simple
leader leases implementation, particularly the single timestamp outstanding
in TimeManager.

The "tricky" constraints are:

i) Before clean time is advanced in mvcc, all transactions with a
timestamp lower than the current time must be registered with mvcc.
This allows to wait for a certain timestamp t to be "safe" with TimeManager
and then wait for all transactions with a timestamp t1 < t to commit, thus
making sure that a scan at t is repeatable.

ii) The leader can't advance follower safe timestamp to t before
all the operations with a timestamp lower than it have also been sent.
This is hard since timestamps are assigned outside of consensus.

In general it's problematic to generate OpIds non-atomically with
timestamps, since these are both supposed to be monotonically increasing
in unison. To address this we had some complex state and interactions between
the components. For instance this was requiring us to have TimeManager
stop safe time advancement until a transaction being prepared was
submitted to the queue, after assigning a timestamp in the
TransactionDriver. This was problematic because with would only allow at most
one assigned timestamp in-flight, meaning we were not able to move safe
time at will.

This patch completely addresses ii), Timestamps and OpIds are now assigned
by PeerMessageQueue, atomically. This makes it trivial to make sure that the
safe time sent to replicas is inline with the sent messages. This also allows
some simplifications and refactorings of the PeerMessageQueue and TimeManager.

As for i) the way we enforce this constraint changed. Instead of registering
a transaction before submitting it to consensus (which we now can't since
consensus submission is also where we assign the timestamp) we register
another, dummy transaction on the MvccManager before submission and abort it
afterwards. This allows to "pin" safe time advancement down to a timestamp
that is mandatorily smaller than then one that will be assigned thus enforcing
the constraint with a simpler implementation at the cost of an additional
(short-lived) transaction in mvcc. Note in this regard that there will be
at most one aditional transaction in mvcc.

One side effect of this change is that, for new config changes, we won't
know the OpId until after the queue submission, which needs to happen
after actually setting the pending config. This is largely inconsequential
since, by design, we can have at most one pending config, but did require
some refactoring.

Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
---
M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java
M src/kudu/consensus/consensus-test-util.h
M src/kudu/consensus/consensus_meta-test.cc
M src/kudu/consensus/consensus_meta.cc
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/log_cache-test.cc
M src/kudu/consensus/pending_rounds.cc
M src/kudu/consensus/quorum_util.cc
M src/kudu/consensus/quorum_util.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/time_manager-test.cc
M src/kudu/consensus/time_manager.cc
M src/kudu/consensus/time_manager.h
M src/kudu/integration-tests/alter_table-test.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/master/sys_catalog.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tserver/tablet_server-test.cc
M src/kudu/tserver/ts_tablet_manager.cc
23 files changed, 353 insertions(+), 280 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/21/7221/22
-- 
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-Change-Number: 7221
Gerrit-PatchSet: 22
Gerrit-Owner: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: Simplify OpId/Timestamp assignment and make it atomic
......................................................................


Patch Set 8:

Gonna pause this patch series until mike's consensus metadata changes land

-- 
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-HasComments: No

[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Kudu Jenkins, Todd Lipcon, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/7221

to look at the new patch set (#12).

Change subject: Simplify OpId/Timestamp assignment and make it atomic
......................................................................

Simplify OpId/Timestamp assignment and make it atomic

Context:

Each message serialized through consensus is assigned a Timestamp and an OpId.
These indexes are used in different contexts to establish ordering and, for the
most part, they are monotonically increasing. In particular, for write messages,
successive OpIds are guaranteed to have increasing Timestamps. For non-write
messages though, like NO_OP and CONFIG_CHANGE, this is not the case. Because
timestamps are assigned to messages in different places (for write messages they
are assigned in the prepare phase, by the TransactionDriver, while for NO_OP and
CONFIG_CHANGE they are assigned inside consensus) there was no way, until now,
to make sure that timestamps are monotonically increasing across message types,
like OpIds are.

Not being able to trust that timestamps are monotonically increasing across
all message types, means that we can't use the timestamps from non-write
messages to advance time. This is a requirement for leader leases, as we
need a single source of truth regarding time advancement and need config
changes in particular to have been assigned timestamps we can trust, but
pehaps more importantly it's also a requirement to be able to advance time
on bootstrap when there are no write messages in the WAL. Not being able
to do this is the underlying cause of KUDU-2233.

Implementation:

There are a couple of constraints that are tricky to enforce and were
requiring us to have convoluted interactions on OpId/Timestamp assignment
and registration between the TimeManager, MvccManager and ConsensusQueue.
These convoluted interactions make it so that OpId and Timestamp are
not assigned atomically.

The "tricky" constraints are:

i) Before clean time is advanced in mvcc, all transactions with a
timestamp lower than the current time must be registered with mvcc.
This allows to wait for a certain timestamp t to be "safe" with TimeManager
and then wait for all transactions with a timestamp t1 < t to commit, thus
making sure that a scan at t is repeatable.

ii) The leader can't advance follower safe timestamp to t before
all the operations with a timestamp lower than it have also been sent.
This is hard since timestamps are assigned outside of consensus.

In general it's problematic to generate OpIds non-atomically with
timestamps, since these are both supposed to be monotonically increasing
in unison. To address this we had some complex state and interactions between
the components. For instance this was requiring us to have TimeManager
stop safe time advancement until a transaction being prepared was
submitted to the queue, after assigning a timestamp in the
TransactionDriver. This was problematic because with would only allow at most
one assigned timestamp in-flight, meaning we were not able to move safe
time at will.

This patch completely addresses ii), Timestamps and OpIds are now assigned
by PeerMessageQueue, atomically. This makes it trivial to make sure that the
safe time sent to replicas is inline with the sent messages. This also allows
some simplifications and refactorings of the PeerMessageQueue and TimeManager.

As for i) the way we enforce this constraint changed. Instead of registering
a transaction before submitting it to consensus (which we now can't since
consensus submission is also where we assign the timestamp) we register
another, dummy transaction on the MvccManager before submission and abort it
afterwards. This allows to "pin" safe time advancement down to a timestamp
that is mandatorily smaller than then one that will be assigned thus enforcing
the constraint with a simpler implementation at the cost of an additional
(short-lived) transaction in mvcc. Note in this regard that there will be
at most one aditional transaction in mvcc.

One side effect of this change is that, for new config changes, we won't
know the OpId until after the queue submission, which needs to happen
after actually setting the pending config. This is largely inconsequential
since, by design, we can have at most one pending config, but did require
some refactoring.

Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
---
M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java
M src/kudu/consensus/consensus-test-util.h
M src/kudu/consensus/consensus_meta.cc
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/log_cache-test.cc
M src/kudu/consensus/pending_rounds.cc
M src/kudu/consensus/quorum_util.cc
M src/kudu/consensus/quorum_util.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/time_manager-test.cc
M src/kudu/consensus/time_manager.cc
M src/kudu/consensus/time_manager.h
M src/kudu/integration-tests/alter_table-test.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/master/sys_catalog.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tserver/tablet_server-test.cc
21 files changed, 337 insertions(+), 270 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/21/7221/12
-- 
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-Change-Number: 7221
Gerrit-PatchSet: 12
Gerrit-Owner: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Mike Percy, Alexey Serbin, Kudu Jenkins, Todd Lipcon, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/7221

to look at the new patch set (#17).

Change subject: Simplify OpId/Timestamp assignment and make it atomic
......................................................................

Simplify OpId/Timestamp assignment and make it atomic

Context:

Each message serialized through consensus is assigned a Timestamp and an OpId.
These indexes are used in different contexts to establish ordering and, for the
most part, they are monotonically increasing. In particular, for write messages,
successive OpIds are guaranteed to have increasing Timestamps. For non-write
messages though, like NO_OP and CONFIG_CHANGE, this is not the case. Because
timestamps are assigned to messages in different places (for write messages they
are assigned in the prepare phase, by the TransactionDriver, while for NO_OP and
CONFIG_CHANGE they are assigned inside consensus) there was no way, until now,
to make sure that timestamps are monotonically increasing across message types,
like OpIds are.

Not being able to trust that timestamps are monotonically increasing across
all message types, means that we can't use the timestamps from non-write
messages to advance time. This is a requirement for leader leases, as we
need a single source of truth regarding time advancement and need config
changes in particular to have been assigned timestamps we can trust, but
pehaps more importantly it's also a requirement to be able to advance time
on bootstrap when there are no write messages in the WAL. Not being able
to do this is the underlying cause of KUDU-2233.

Implementation:

There are a couple of constraints that are tricky to enforce and were
requiring us to have convoluted interactions on OpId/Timestamp assignment
and registration between the TimeManager, MvccManager and ConsensusQueue.
These convoluted interactions make it so that OpId and Timestamp are
not assigned atomically.

The "tricky" constraints are:

i) Before clean time is advanced in mvcc, all transactions with a
timestamp lower than the current time must be registered with mvcc.
This allows to wait for a certain timestamp t to be "safe" with TimeManager
and then wait for all transactions with a timestamp t1 < t to commit, thus
making sure that a scan at t is repeatable.

ii) The leader can't advance follower safe timestamp to t before
all the operations with a timestamp lower than it have also been sent.
This is hard since timestamps are assigned outside of consensus.

In general it's problematic to generate OpIds non-atomically with
timestamps, since these are both supposed to be monotonically increasing
in unison. To address this we had some complex state and interactions between
the components. For instance this was requiring us to have TimeManager
stop safe time advancement until a transaction being prepared was
submitted to the queue, after assigning a timestamp in the
TransactionDriver. This was problematic because with would only allow at most
one assigned timestamp in-flight, meaning we were not able to move safe
time at will.

This patch completely addresses ii), Timestamps and OpIds are now assigned
by PeerMessageQueue, atomically. This makes it trivial to make sure that the
safe time sent to replicas is inline with the sent messages. This also allows
some simplifications and refactorings of the PeerMessageQueue and TimeManager.

As for i) the way we enforce this constraint changed. Instead of registering
a transaction before submitting it to consensus (which we now can't since
consensus submission is also where we assign the timestamp) we register
another, dummy transaction on the MvccManager before submission and abort it
afterwards. This allows to "pin" safe time advancement down to a timestamp
that is mandatorily smaller than then one that will be assigned thus enforcing
the constraint with a simpler implementation at the cost of an additional
(short-lived) transaction in mvcc. Note in this regard that there will be
at most one aditional transaction in mvcc.

One side effect of this change is that, for new config changes, we won't
know the OpId until after the queue submission, which needs to happen
after actually setting the pending config. This is largely inconsequential
since, by design, we can have at most one pending config, but did require
some refactoring.

Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
---
M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java
M src/kudu/consensus/consensus-test-util.h
M src/kudu/consensus/consensus_meta-test.cc
M src/kudu/consensus/consensus_meta.cc
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/log_cache-test.cc
M src/kudu/consensus/pending_rounds.cc
M src/kudu/consensus/quorum_util.cc
M src/kudu/consensus/quorum_util.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/time_manager-test.cc
M src/kudu/consensus/time_manager.cc
M src/kudu/consensus/time_manager.h
M src/kudu/integration-tests/alter_table-test.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/master/sys_catalog.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tserver/tablet_server-test.cc
M src/kudu/tserver/ts_tablet_manager.cc
23 files changed, 350 insertions(+), 275 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/21/7221/17
-- 
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-Change-Number: 7221
Gerrit-PatchSet: 17
Gerrit-Owner: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] Simplify OpId/Timestamp assignment

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7221

to look at the new patch set (#4).

Change subject: Simplify OpId/Timestamp assignment
......................................................................

Simplify OpId/Timestamp assignment

There are a couple of constraints that are tricky to enforce and were
requiring us to have convoluted interactions on OpId/Timestamp assignment
and registration between the TimeManager, MvccManager and ConsensusQueue.

These convoluted interactions are currently preventing a clean and simple
leader leases implementation, particularly the single timestamp outstanding
in TimeManager.

The "tricky" constraints are:

i) Before clean time is advanced in mvcc, all transactions with a
timestamp lower than the current time must be registered with mvcc.
This allows to wait for a certain timestamp t to be "safe" with TimeManager
and then wait for all transactions with a timestamp t1 < t to commit, thus
making sure that a scan at t is repeatable.

ii) The leader can't advance follower safe timestamp to t before
all the operations with a timestamp lower than it have also been sent.
This is hard since timestamps are assigned outside of consensus.

In general it's problematic to generate OpIds non-atomically with
timestamps, since these are both supposed to be monotonically increasing
in unison. To address this we had some complex state and interactions between
the components. For instance this was requiring us to have TimeManager
stop safe time advancement until a transaction being prepared was
submitted to the queue, after assigning a timestamp in the
TransactionDriver. This was problematic because with would only allow at most
one assigned timestamp in-flight, meaning we were not able to extract the
current safe time at will.

This patch completely addresses ii), Timestamps and OpIds are now assigned
by PeerMessageQueue, atomically. This makes it trivial to make sure that the
safe time sent to replicas is inline with the sent messages. This also allows
some simplifications and refactorings of the PeerMessageQueue and TimeManager.

As for i) the way we enforce this constraint changed. Instead of registering
a transaction before submitting it to consensus (which we now can't since
consensus submission is also where we assign the timestamp) we register
another, dummy transaction on the MvccManager before submission and abort it
afterwards. This allows to "pin" safe time advancement down to a timestamp
that is mandatorily smaller than then one that will be assigned.

One side effect of this change is that, for new config changes, we won't
know the OpId until after the queue submission, which needs to happen
after actually setting the pending config. This is largely inconsequential
but did require some refactoring.

Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
---
M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java
M src/kudu/consensus/consensus-test-util.h
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/log_cache-test.cc
M src/kudu/consensus/pending_rounds.cc
M src/kudu/consensus/quorum_util.cc
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/time_manager-test.cc
M src/kudu/consensus/time_manager.cc
M src/kudu/consensus/time_manager.h
M src/kudu/integration-tests/alter_table-test.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tserver/tablet_server-test.cc
18 files changed, 301 insertions(+), 252 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/21/7221/4
-- 
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot

[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change. ( http://gerrit.cloudera.org:8080/7221 )

Change subject: Simplify OpId/Timestamp assignment and make it atomic
......................................................................


Patch Set 19:

> Patch Set 19: Verified-1
> 
> Build Failed 
> 
> http://jenkins.kudu.apache.org/job/kudu-gerrit/12511/ : FAILURE

Hum, something weird is going on with a couple of tests. Should still be worth reviewing though


-- 
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-Change-Number: 7221
Gerrit-PatchSet: 19
Gerrit-Owner: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Thu, 15 Mar 2018 23:41:35 +0000
Gerrit-HasComments: No

[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/7221 )

Change subject: Simplify OpId/Timestamp assignment and make it atomic
......................................................................


Patch Set 17:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/quorum_util.cc
File src/kudu/consensus/quorum_util.cc:

http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/quorum_util.cc@158
PS17, Line 158:   std::set<string> uuids;
add: DCHECK(type == COMMITTED_CONFIG || type == PENDING_CONFIG)


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.h
File src/kudu/consensus/raft_consensus.h:

http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.h@449
PS17, Line 449: mus
must


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.cc
File src/kudu/consensus/raft_consensus.cc:

http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.cc@648
PS17, Line 648:   RETURN_NOT_OK(HandlePendingConfigChangeUnlocked(round));
> although this is right, it looks confusing because round probably isn't a c
or just: if (round->replicate_msg()->op_type() == CHANGE_CONFIG_OP) { ... }


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.cc@662
PS17, Line 662:   if (PREDICT_TRUE(round->replicate_msg()->op_type() != CHANGE_CONFIG_OP)) return Status::OK();
Due to the method name, seems less surprising to add: DCHECK_EQ(CHANGE_CONFIG_OP, round->replicate_msg()->op_type());


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.cc@673
PS17, Line 673: 
nit: extra line


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.cc@691
PS17, Line 691: (
nit: no need for inner parentheses


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/tablet/transactions/transaction_driver.cc
File src/kudu/tablet/transactions/transaction_driver.cc:

http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/tablet/transactions/transaction_driver.cc@336
PS17, Line 336: rule
s/rule/invariant/


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/tablet/transactions/transaction_driver.cc@360
PS17, Line 360: constraint
s/constraint/above invariant/

By the way: is there a DCHECK somewhere in mvcc to enforce the invariant that clean time < any tx timestamp?


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/tablet/transactions/transaction_driver.cc@371
PS17, Line 371: repl_state_copy
This variable masks a variable of the same name in an outer scope


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/tablet/transactions/transaction_driver.cc@384
PS17, Line 384: case REPLICATING
can we still get to this case?



-- 
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-Change-Number: 7221
Gerrit-PatchSet: 17
Gerrit-Owner: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Tue, 20 Feb 2018 22:20:34 +0000
Gerrit-HasComments: Yes

[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Mike Percy, Alexey Serbin, Kudu Jenkins, Todd Lipcon, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/7221

to look at the new patch set (#24).

Change subject: Simplify OpId/Timestamp assignment and make it atomic
......................................................................

Simplify OpId/Timestamp assignment and make it atomic

There are a couple of constraints that are tricky to enforce and were
requiring us to have convoluted interactions on OpId/Timestamp assignment
and registration between the TimeManager, MvccManager and ConsensusQueue.

These convoluted interactions are currently preventing a clean and simple
leader leases implementation, particularly the single timestamp outstanding
in TimeManager.

The "tricky" constraints are:

i) Before clean time is advanced in mvcc, all transactions with a
timestamp lower than the current time must be registered with mvcc.
This allows to wait for a certain timestamp t to be "safe" with TimeManager
and then wait for all transactions with a timestamp t1 < t to commit, thus
making sure that a scan at t is repeatable.

ii) The leader can't advance follower safe timestamp to t before
all the operations with a timestamp lower than it have also been sent.
This is hard since timestamps are assigned outside of consensus.

In general it's problematic to generate OpIds non-atomically with
timestamps, since these are both supposed to be monotonically increasing
in unison. To address this we had some complex state and interactions between
the components. For instance this was requiring us to have TimeManager
stop safe time advancement until a transaction being prepared was
submitted to the queue, after assigning a timestamp in the
TransactionDriver. This was problematic because with would only allow at most
one assigned timestamp in-flight, meaning we were not able to move safe
time at will.

This patch completely addresses ii), Timestamps and OpIds are now assigned
by PeerMessageQueue, atomically. This makes it trivial to make sure that the
safe time sent to replicas is inline with the sent messages. This also allows
some simplifications and refactorings of the PeerMessageQueue and TimeManager.

As for i) the way we enforce this constraint changed. Instead of registering
a transaction before submitting it to consensus (which we now can't since
consensus submission is also where we assign the timestamp) we register
another, dummy transaction on the MvccManager before submission and abort it
afterwards. This allows to "pin" safe time advancement down to a timestamp
that is mandatorily smaller than then one that will be assigned thus enforcing
the constraint with a simpler implementation at the cost of an additional
(short-lived) transaction in mvcc. Note in this regard that there will be
at most one aditional transaction in mvcc.

One side effect of this change is that, for new config changes, we won't
know the OpId until after the queue submission, which needs to happen
after actually setting the pending config. This is largely inconsequential
since, by design, we can have at most one pending config, but did require
some refactoring.

Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
---
M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java
M src/kudu/consensus/consensus-test-util.h
M src/kudu/consensus/consensus_meta-test.cc
M src/kudu/consensus/consensus_meta.cc
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/log_cache-test.cc
M src/kudu/consensus/pending_rounds.cc
M src/kudu/consensus/quorum_util.cc
M src/kudu/consensus/quorum_util.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/time_manager-test.cc
M src/kudu/consensus/time_manager.cc
M src/kudu/consensus/time_manager.h
M src/kudu/integration-tests/alter_table-test.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/integration-tests/test_workload.cc
M src/kudu/master/sys_catalog.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tserver/tablet_server-test.cc
M src/kudu/tserver/ts_tablet_manager.cc
24 files changed, 358 insertions(+), 284 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/21/7221/24
-- 
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-Change-Number: 7221
Gerrit-PatchSet: 24
Gerrit-Owner: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Mike Percy, Alexey Serbin, Kudu Jenkins, Todd Lipcon, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/7221

to look at the new patch set (#19).

Change subject: Simplify OpId/Timestamp assignment and make it atomic
......................................................................

Simplify OpId/Timestamp assignment and make it atomic

There are a couple of constraints that are tricky to enforce and were
requiring us to have convoluted interactions on OpId/Timestamp assignment
and registration between the TimeManager, MvccManager and ConsensusQueue.

These convoluted interactions are currently preventing a clean and simple
leader leases implementation, particularly the single timestamp outstanding
in TimeManager.

The "tricky" constraints are:

i) Before clean time is advanced in mvcc, all transactions with a
timestamp lower than the current time must be registered with mvcc.
This allows to wait for a certain timestamp t to be "safe" with TimeManager
and then wait for all transactions with a timestamp t1 < t to commit, thus
making sure that a scan at t is repeatable.

ii) The leader can't advance follower safe timestamp to t before
all the operations with a timestamp lower than it have also been sent.
This is hard since timestamps are assigned outside of consensus.

In general it's problematic to generate OpIds non-atomically with
timestamps, since these are both supposed to be monotonically increasing
in unison. To address this we had some complex state and interactions between
the components. For instance this was requiring us to have TimeManager
stop safe time advancement until a transaction being prepared was
submitted to the queue, after assigning a timestamp in the
TransactionDriver. This was problematic because with would only allow at most
one assigned timestamp in-flight, meaning we were not able to move safe
time at will.

This patch completely addresses ii), Timestamps and OpIds are now assigned
by PeerMessageQueue, atomically. This makes it trivial to make sure that the
safe time sent to replicas is inline with the sent messages. This also allows
some simplifications and refactorings of the PeerMessageQueue and TimeManager.

As for i) the way we enforce this constraint changed. Instead of registering
a transaction before submitting it to consensus (which we now can't since
consensus submission is also where we assign the timestamp) we register
another, dummy transaction on the MvccManager before submission and abort it
afterwards. This allows to "pin" safe time advancement down to a timestamp
that is mandatorily smaller than then one that will be assigned thus enforcing
the constraint with a simpler implementation at the cost of an additional
(short-lived) transaction in mvcc. Note in this regard that there will be
at most one aditional transaction in mvcc.

One side effect of this change is that, for new config changes, we won't
know the OpId until after the queue submission, which needs to happen
after actually setting the pending config. This is largely inconsequential
since, by design, we can have at most one pending config, but did require
some refactoring.

Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
---
M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java
M src/kudu/consensus/consensus-test-util.h
M src/kudu/consensus/consensus_meta-test.cc
M src/kudu/consensus/consensus_meta.cc
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/log_cache-test.cc
M src/kudu/consensus/pending_rounds.cc
M src/kudu/consensus/quorum_util.cc
M src/kudu/consensus/quorum_util.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/time_manager-test.cc
M src/kudu/consensus/time_manager.cc
M src/kudu/consensus/time_manager.h
M src/kudu/integration-tests/alter_table-test.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/master/sys_catalog.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tserver/tablet_server-test.cc
M src/kudu/tserver/ts_tablet_manager.cc
23 files changed, 347 insertions(+), 282 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/21/7221/19
-- 
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-Change-Number: 7221
Gerrit-PatchSet: 19
Gerrit-Owner: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change. ( http://gerrit.cloudera.org:8080/7221 )

Change subject: Simplify OpId/Timestamp assignment and make it atomic
......................................................................


Patch Set 8:

(36 comments)

http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus-test-util.h
File src/kudu/consensus/consensus-test-util.h:

http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus-test-util.h@64
PS17, Line 64:   OpId* id = msg->mutable_id();
             :   id->set_term(term);
> nit: usually it's not a good idea to have 'using' in header files, unless t
Done


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus-test-util.h@67
PS17, Line 67: 
> As I see, the pattern is always CreateDummyReplicateWithOpId().release().
I essentially split CreateDummyReplicate in 2 versions one that sets an opid and one that doesn't. I can refactor the pointer ownership in a follow up if you feel strongly, but would like to keep that out of this straightforward change.


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus-test-util.h@82
PS17, Line 82: inline RaftPeerPB FakeRaftPeerPB(const std::string& uuid) {
> ditto: maybe, just return a raw pointer.
same


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus-test-util.h@115
PS17, Line 115:         continue;
> Instead, is it possible to rely on the mode of the object pointed to by the
not sure I follow...


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus-test-util.h@133
PS17, Line 133: _
> an extra space
Done


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus_meta.cc
File src/kudu/consensus/consensus_meta.cc:

http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus_meta.cc@153
PS17, Line 153: return leader_uuid_;
> Maybe, just use DCHECK() here?  As I understand, the committed configuratio
tbh honest I don't see a problem in having this here and being extra defensive. the point of this patch is that the pending config is no longer assigned an op_id before being marked as pendign which must be set before marking it as commited. this is here to make sure we always enforce that last bit.
added a comment to that regard.


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus_queue.h
File src/kudu/consensus/consensus_queue.h:

http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus_queue.h@231
PS17, Line 231:   // it is alive and making progress.
> update this to indicate it also does assignment of OpId and Timestamp
Done


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus_queue.h@238
PS17, Line 238:                         bool* more_pending);
              : 
              :   // Called by the consensus implementation to update the queu
> Is it possible just to rely on the status of the queue and have a single me
the latter. I pondered the "uglyness" of having two methods/extra params versus just relying on the state of the queue, but ultimately decided on this to make it clear/explicit and have extra checks.


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus_queue.cc@361
PS17, Line 361:   // Maintain a thread-safe copy of necessary members.
              :   OpId preceding_id;
> per comment below about GetNextOpId having a side effect, here's an example
Done


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus_queue.cc@421
PS17, Line 421:     // We try to get the follower's next_index from our log.
> nit: indent
Done


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus_queue.cc@440
PS17, Line 440:                                         << "while preparing peer request: "
              :                                         << s.ToString() << ". Destination peer: "
              :    
> hrm, this seems a little surprising that GetNextOpId has this side effect r
Done


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/quorum_util.cc
File src/kudu/consensus/quorum_util.cc:

http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/quorum_util.cc@158
PS17, Line 158: 
> add: DCHECK(type == COMMITTED_CONFIG || type == PENDING_CONFIG)
Done


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.h
File src/kudu/consensus/raft_consensus.h:

http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.h@449
PS17, Line 449: t, 
> must
Done


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.cc
File src/kudu/consensus/raft_consensus.cc:

PS17: 
> Does it make sense to add some specific test for this new behavior or it's 
for writes the behavior is undistinguishable. for non-writes the behavior is now "correct". there's a patch further down the series that makes sure this is actually what happens.


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.cc@648
PS17, Line 648: 
> or just: if (round->replicate_msg()->op_type() == CHANGE_CONFIG_OP) { ... }
went with todds suggestion


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.cc@648
PS17, Line 648: 
> although this is right, it looks confusing because round probably isn't a c
Done


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.cc@662
PS17, Line 662:   // Check if the pending Raft config has an OpId and whether the index is less than the
> Due to the method name, seems less surprising to add: DCHECK_EQ(CHANGE_CONF
went with todds naming suggestion


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.cc@673
PS17, Line 673:     return Status::OK();
> nit: extra line
Done


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.cc@691
PS17, Line 691: h
> nit: no need for inner parentheses
Done


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.cc@2609
PS17, Line 2609: ClearLeaderU
> this and new_config can be const refs
new_config can't we set the opid below before we set it as the committed config


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.cc@2622
PS17, Line 2622: 
> This could be const reference as well.
Done


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.cc@2623
PS17, Line 2623: bool RaftConsensus::HasLeaderUnlocked() const {
> nit: add a space after 'and'
Done


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.cc@2624
PS17, Line 2624: 
> nit: replace with colon?
Done


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.cc@2633
PS17, Line 2633: const bool RaftConsensus::HasVotedCurrentTermUnlocked() const {
               :   DCHECK(lock_.is_locked());
> it should still have an opid right? ie if we get to this path it should hav
Done


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.cc@2844
PS17, Line 2844: 
               : 
               : 
> Since this is just for comparing configs in the if() closure below, move it
Done


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/time_manager-test.cc
File src/kudu/consensus/time_manager-test.cc:

http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/time_manager-test.cc@a176
PS17, Line 176: 
> Maybe, add an assertion instead:
this is testing pinning safetime advancement, which no longer happens


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/time_manager.cc
File src/kudu/consensus/time_manager.cc:

http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/time_manager.cc@110
PS17, Line 110:   //
              :   //       The only opportunity for unrepeatable reads in this setup is if an old leader
> hrm, is this method still even very useful? Maybe it would be clearer to ju
would prefer to keep this, even if now it's just really checking the stage. as the docs say it's necessary for leader leases, also consensus doesn't directly interact with the clock anymore.


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/time_manager.cc@294
PS17, Line 294: 
              : 
> this looks wrong. you are falling through to the next case which breaks any
you're right that this looked confusing. fixed that. the point is to either get the current clock value or the last safe ts. hopefully it's clear now.


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/time_manager.cc@307
PS17, Line 307: 
> can you DCHECK_EQ(mode_, LEADER) here?
Done


http://gerrit.cloudera.org:8080/#/c/7221/8/src/kudu/tablet/transactions/transaction_driver.cc
File src/kudu/tablet/transactions/transaction_driver.cc:

http://gerrit.cloudera.org:8080/#/c/7221/8/src/kudu/tablet/transactions/transaction_driver.cc@267
PS8, Line 267:       prepare_state_ = PREPARED;
> it's sketching me out that the state transition to PREPARED happens under t
Left a comment related to that in LOC 330. The point is that the transaction is now started after being submitted to consensus, and consensus replication and its callback will race with this thread, so we need to make sure we've started the txn before marking it PREPARED signaling it's ready to move to apply.
Done.


http://gerrit.cloudera.org:8080/#/c/7221/8/src/kudu/tablet/transactions/transaction_driver.cc@295
PS8, Line 295:       // Register a dummy transaction with mvcc manager that we'll abort later.
> Did you see this comment and the one above?
Done


http://gerrit.cloudera.org:8080/#/c/7221/8/src/kudu/tablet/transactions/transaction_driver.cc@295
PS8, Line 295:       // Register a dummy transaction with mvcc manager that we'll abort later.
> instead of using a dummy transaction, would it be possible to register it e
hum, that might work, but for us to have different timestamps for the same txn we'd likely have to add a timestamp state or mode so that we don't believe the timestamp elsewhere if its not definitive.

I'll just bite the bullet and add a pinning mechanism to mvcc (this was a lazy way of doing that).

Update: Added a scoped clean time advancement pin mechanism.


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/tablet/transactions/transaction_driver.cc
File src/kudu/tablet/transactions/transaction_driver.cc:

http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/tablet/transactions/transaction_driver.cc@336
PS17, Line 336: 
> s/rule/invariant/
Done


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/tablet/transactions/transaction_driver.cc@360
PS17, Line 360: 
> s/constraint/above invariant/
Done
mvcc can't know of all transactions, but if does crash if we try to either submit a transaction with a ts lower than clean time or if we commit a transaction with a ts lower than clean time


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/tablet/transactions/transaction_driver.cc@371
PS17, Line 371: $0)", s.ToStrin
> This variable masks a variable of the same name in an outer scope
ah, good catch. done


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/tablet/transactions/transaction_driver.cc@384
PS17, Line 384: {
> can we still get to this case?
I don't see why not



-- 
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-Change-Number: 7221
Gerrit-PatchSet: 8
Gerrit-Owner: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Wed, 14 Mar 2018 21:40:07 +0000
Gerrit-HasComments: Yes

[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change. ( http://gerrit.cloudera.org:8080/7221 )

Change subject: Simplify OpId/Timestamp assignment and make it atomic
......................................................................


Patch Set 24:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/7221/24//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/7221/24//COMMIT_MSG@26
PS24, Line 26: it
> t  ?
Done


http://gerrit.cloudera.org:8080/#/c/7221/24//COMMIT_MSG@26
PS24, Line 26: also 
> drop 'also'?
Done


http://gerrit.cloudera.org:8080/#/c/7221/24//COMMIT_MSG@29
PS24, Line 29: non-atomically with
> separately from
I think non-atomically here is better. after all non-separately doesn't necessarily mean atomically


http://gerrit.cloudera.org:8080/#/c/7221/24//COMMIT_MSG@35
PS24, Line 35: with
> it ?
Done


http://gerrit.cloudera.org:8080/#/c/7221/24//COMMIT_MSG@39
PS24, Line 39: ,
> replace with stop/dot ?
Done


http://gerrit.cloudera.org:8080/#/c/7221/21/src/kudu/tablet/transactions/transaction_driver.cc
File src/kudu/tablet/transactions/transaction_driver.cc:

http://gerrit.cloudera.org:8080/#/c/7221/21/src/kudu/tablet/transactions/transaction_driver.cc@324
PS21, Line 324:       DCHECK_EQ(driver_type, consensus::LEADER);
> would it be correct here to add a DCHECK_EQ(transaction_->type(), consensus
Done


http://gerrit.cloudera.org:8080/#/c/7221/24/src/kudu/tablet/transactions/transaction_driver.cc
File src/kudu/tablet/transactions/transaction_driver.cc:

http://gerrit.cloudera.org:8080/#/c/7221/24/src/kudu/tablet/transactions/transaction_driver.cc@69
PS24, Line 69: using std::unique_ptr;
> warning: using decl 'unique_ptr' is unused [misc-unused-using-decls]
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-Change-Number: 7221
Gerrit-PatchSet: 24
Gerrit-Owner: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Mon, 23 Apr 2018 21:15:49 +0000
Gerrit-HasComments: Yes

[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Mike Percy, Alexey Serbin, Kudu Jenkins, Todd Lipcon, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/7221

to look at the new patch set (#28).

Change subject: Simplify OpId/Timestamp assignment and make it atomic
......................................................................

Simplify OpId/Timestamp assignment and make it atomic

There are a couple of constraints that are tricky to enforce and were
requiring us to have convoluted interactions on OpId/Timestamp assignment
and registration between the TimeManager, MvccManager and ConsensusQueue.

These convoluted interactions are currently preventing a clean and simple
leader leases implementation, particularly the single timestamp outstanding
in TimeManager.

The "tricky" constraints are:

i) Before clean time is advanced in mvcc, all transactions with a
timestamp lower than the current time must be registered with mvcc.
This allows to wait for a certain timestamp t to be "safe" with TimeManager
and then wait for all transactions with a timestamp t1 < t to commit, thus
making sure that a scan at t is repeatable.

ii) The leader can't advance follower safe timestamp to t before
all the operations with a timestamp lower than it have also been sent.
This is hard since timestamps are assigned outside of consensus.

In general it's problematic to generate OpIds non-atomically with
timestamps, since these are both supposed to be monotonically increasing
in unison. To address this we had some complex state and interactions between
the components. For instance this was requiring us to have TimeManager
stop safe time advancement until a transaction being prepared was
submitted to the queue, after assigning a timestamp in the
TransactionDriver. This was problematic because with would only allow at most
one assigned timestamp in-flight, meaning we were not able to move safe
time at will.

This patch completely addresses ii), Timestamps and OpIds are now assigned
by PeerMessageQueue, atomically. This makes it trivial to make sure that the
safe time sent to replicas is inline with the sent messages. This also allows
some simplifications and refactorings of the PeerMessageQueue and TimeManager.

As for i) the way we enforce this constraint changed. Instead of registering
a transaction before submitting it to consensus (which we now can't since
consensus submission is also where we assign the timestamp) we register
another, dummy transaction on the MvccManager before submission and abort it
afterwards. This allows to "pin" safe time advancement down to a timestamp
that is mandatorily smaller than then one that will be assigned thus enforcing
the constraint with a simpler implementation at the cost of an additional
(short-lived) transaction in mvcc. Note in this regard that there will be
at most one aditional transaction in mvcc.

One side effect of this change is that, for new config changes, we won't
know the OpId until after the queue submission, which needs to happen
after actually setting the pending config. This is largely inconsequential
since, by design, we can have at most one pending config, but did require
some refactoring.

Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
---
M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java
M src/kudu/consensus/consensus-test-util.h
M src/kudu/consensus/consensus_meta-test.cc
M src/kudu/consensus/consensus_meta.cc
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/log_cache-test.cc
M src/kudu/consensus/pending_rounds.cc
M src/kudu/consensus/quorum_util.cc
M src/kudu/consensus/quorum_util.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/time_manager-test.cc
M src/kudu/consensus/time_manager.cc
M src/kudu/consensus/time_manager.h
M src/kudu/integration-tests/alter_table-test.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/integration-tests/test_workload.cc
M src/kudu/master/sys_catalog.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tserver/tablet_server-test.cc
M src/kudu/tserver/ts_tablet_manager.cc
24 files changed, 358 insertions(+), 284 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/21/7221/28
-- 
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-Change-Number: 7221
Gerrit-PatchSet: 28
Gerrit-Owner: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] Simplify OpId/Timestamp assignment

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7221

to look at the new patch set (#2).

Change subject: Simplify OpId/Timestamp assignment
......................................................................

Simplify OpId/Timestamp assignment

There are a couple of constraints that are tricky to enforce and were
requiring us to have convoluted interactions on OpId/Timestamp assignment
and registration between the TimeManager, MvccManager and ConsensusQueue.

These convoluted interactions are currently preventing a clean and simple
leader leases implementation, particularly the single timestamp outstanding
in TimeManager.

The "tricky" constraints are:

i) Before clean time is advanced in mvcc, all transactions with a
timestamp lower than the current time must be registered with mvcc.
This allows to wait for a certain timestamp t to be "safe" with TimeManager
and then wait for all transactions with a timestamp t1 < t to commit, thus
making sure that a scan at t is repeatable.

ii) The leader can't advance follower safe timestamp to t before
all the operations with a timestamp lower than it have also been sent.
This is hard since timestamps are assigned outside of consensus.

In general it's problematic to generate OpIds non-atomically with
timestamps, since these are both supposed to be monotonically increasing
in unison. To address this we had some complex state and interactions between
the components. For instance this was requiring us to have TimeManager
stop safe time advancement until a transaction being prepared was
submitted to the queue, after assigning a timestamp in the
TransactionDriver. This was problematic because with would only allow at most
one assigned timestamp in-flight, meaning we were not able to extract the
current safe time at will.

This patch completely addresses ii), Timestamps and OpIds are now assigned
by PeerMessageQueue, atomically. This makes it trivial to make sure that the
safe time sent to replicas is inline with the sent messages. This also allows
some simplifications and refactorings of the PeerMessageQueue and TimeManager.

As for i) the way we enforce this constraint changed. Instead of registering
a transaction before submitting it to consensus (which we now can't since
consensus submission is also where we assign the timestamp) we register
another, dummy transaction on the MvccManager before submission and abort it
afterwards. This allows to "pin" safe time advancement down to a timestamp
that is mandatorily smaller than then one that will be assigned.

One side effect of this change is that, for new config changes, we won't
know the OpId until after the queue submission, which needs to happen
after actually setting the pending config. This is largely inconsequential
but did require some refactoring.

Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
---
M src/kudu/consensus/consensus-test-util.h
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/log_cache-test.cc
M src/kudu/consensus/pending_rounds.cc
M src/kudu/consensus/quorum_util.cc
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/time_manager-test.cc
M src/kudu/consensus/time_manager.cc
M src/kudu/consensus/time_manager.h
M src/kudu/integration-tests/alter_table-test.cc
M src/kudu/tablet/transactions/transaction_driver.cc
15 files changed, 247 insertions(+), 206 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/21/7221/2
-- 
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot

[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change. ( http://gerrit.cloudera.org:8080/7221 )

Change subject: Simplify OpId/Timestamp assignment and make it atomic
......................................................................


Patch Set 17:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/time_manager.cc
File src/kudu/consensus/time_manager.cc:

http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/time_manager.cc@307
PS17, Line 307:   DCHECK(lock_.is_locked());
> can you DCHECK_EQ(mode_, LEADER) here?
tried this, can't do it because of unsafe config changes.



-- 
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-Change-Number: 7221
Gerrit-PatchSet: 17
Gerrit-Owner: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Wed, 14 Mar 2018 23:34:21 +0000
Gerrit-HasComments: Yes

[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/7221 )

Change subject: Simplify OpId/Timestamp assignment and make it atomic
......................................................................


Patch Set 21:

(2 comments)

Just looked at TransactionDriver first since David asked for a quick review of that part

http://gerrit.cloudera.org:8080/#/c/7221/21/src/kudu/tablet/transactions/transaction_driver.cc
File src/kudu/tablet/transactions/transaction_driver.cc:

http://gerrit.cloudera.org:8080/#/c/7221/21/src/kudu/tablet/transactions/transaction_driver.cc@324
PS21, Line 324:         std::lock_guard<simple_spinlock> lock(lock_);
would it be correct here to add a DCHECK_EQ(transaction_->type(), consensus::LEADER)? I think so. Maybe would be good doc value.


http://gerrit.cloudera.org:8080/#/c/7221/21/src/kudu/tablet/transactions/transaction_driver.cc@341
PS21, Line 341:  MvccManager::ScopedCleanTimeAdvancePin pin(replica->tablet()->mvcc_manager(),
              :                                                    replica->clock()->Now())
I'm surprised that the clock()->Now() call doesn't need to be made inside of the MVCC lock. Otherwise isn't it racey?

T1: call clock->Now()
T2: advance clean time to a higher value
T1: call Pin(older timestamp than current clean time)

.. or are we relying on the fact that there can be no T2 starting a transaction at a higher 'Now' because the only thread that starts transactions is the prepare thread? Maybe needs some comment on that if it's the case.



-- 
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-Change-Number: 7221
Gerrit-PatchSet: 21
Gerrit-Owner: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Wed, 21 Mar 2018 19:53:16 +0000
Gerrit-HasComments: Yes

[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change. ( http://gerrit.cloudera.org:8080/7221 )

Change subject: Simplify OpId/Timestamp assignment and make it atomic
......................................................................


Patch Set 9:

this last push is just rebase over the current master, that fixes the hard conflicts that arose because of all the changes due to 3-4-3. Still need to do some follow up cleanup and address comments


-- 
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-Change-Number: 7221
Gerrit-PatchSet: 9
Gerrit-Owner: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Sat, 20 Jan 2018 00:22:46 +0000
Gerrit-HasComments: No

[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Kudu Jenkins, Todd Lipcon, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/7221

to look at the new patch set (#15).

Change subject: Simplify OpId/Timestamp assignment and make it atomic
......................................................................

Simplify OpId/Timestamp assignment and make it atomic

Context:

Each message serialized through consensus is assigned a Timestamp and an OpId.
These indexes are used in different contexts to establish ordering and, for the
most part, they are monotonically increasing. In particular, for write messages,
successive OpIds are guaranteed to have increasing Timestamps. For non-write
messages though, like NO_OP and CONFIG_CHANGE, this is not the case. Because
timestamps are assigned to messages in different places (for write messages they
are assigned in the prepare phase, by the TransactionDriver, while for NO_OP and
CONFIG_CHANGE they are assigned inside consensus) there was no way, until now,
to make sure that timestamps are monotonically increasing across message types,
like OpIds are.

Not being able to trust that timestamps are monotonically increasing across
all message types, means that we can't use the timestamps from non-write
messages to advance time. This is a requirement for leader leases, as we
need a single source of truth regarding time advancement and need config
changes in particular to have been assigned timestamps we can trust, but
pehaps more importantly it's also a requirement to be able to advance time
on bootstrap when there are no write messages in the WAL. Not being able
to do this is the underlying cause of KUDU-2233.

Implementation:

There are a couple of constraints that are tricky to enforce and were
requiring us to have convoluted interactions on OpId/Timestamp assignment
and registration between the TimeManager, MvccManager and ConsensusQueue.
These convoluted interactions make it so that OpId and Timestamp are
not assigned atomically.

The "tricky" constraints are:

i) Before clean time is advanced in mvcc, all transactions with a
timestamp lower than the current time must be registered with mvcc.
This allows to wait for a certain timestamp t to be "safe" with TimeManager
and then wait for all transactions with a timestamp t1 < t to commit, thus
making sure that a scan at t is repeatable.

ii) The leader can't advance follower safe timestamp to t before
all the operations with a timestamp lower than it have also been sent.
This is hard since timestamps are assigned outside of consensus.

In general it's problematic to generate OpIds non-atomically with
timestamps, since these are both supposed to be monotonically increasing
in unison. To address this we had some complex state and interactions between
the components. For instance this was requiring us to have TimeManager
stop safe time advancement until a transaction being prepared was
submitted to the queue, after assigning a timestamp in the
TransactionDriver. This was problematic because with would only allow at most
one assigned timestamp in-flight, meaning we were not able to move safe
time at will.

This patch completely addresses ii), Timestamps and OpIds are now assigned
by PeerMessageQueue, atomically. This makes it trivial to make sure that the
safe time sent to replicas is inline with the sent messages. This also allows
some simplifications and refactorings of the PeerMessageQueue and TimeManager.

As for i) the way we enforce this constraint changed. Instead of registering
a transaction before submitting it to consensus (which we now can't since
consensus submission is also where we assign the timestamp) we register
another, dummy transaction on the MvccManager before submission and abort it
afterwards. This allows to "pin" safe time advancement down to a timestamp
that is mandatorily smaller than then one that will be assigned thus enforcing
the constraint with a simpler implementation at the cost of an additional
(short-lived) transaction in mvcc. Note in this regard that there will be
at most one aditional transaction in mvcc.

One side effect of this change is that, for new config changes, we won't
know the OpId until after the queue submission, which needs to happen
after actually setting the pending config. This is largely inconsequential
since, by design, we can have at most one pending config, but did require
some refactoring.

Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
---
M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java
M src/kudu/consensus/consensus-test-util.h
M src/kudu/consensus/consensus_meta-test.cc
M src/kudu/consensus/consensus_meta.cc
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/log_cache-test.cc
M src/kudu/consensus/pending_rounds.cc
M src/kudu/consensus/quorum_util.cc
M src/kudu/consensus/quorum_util.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/time_manager-test.cc
M src/kudu/consensus/time_manager.cc
M src/kudu/consensus/time_manager.h
M src/kudu/integration-tests/alter_table-test.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/master/sys_catalog.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tserver/tablet_server-test.cc
M src/kudu/tserver/ts_tablet_manager.cc
23 files changed, 348 insertions(+), 273 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/21/7221/15
-- 
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-Change-Number: 7221
Gerrit-PatchSet: 15
Gerrit-Owner: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] Simplify OpId/Timestamp assignment

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7221

to look at the new patch set (#5).

Change subject: Simplify OpId/Timestamp assignment
......................................................................

Simplify OpId/Timestamp assignment

There are a couple of constraints that are tricky to enforce and were
requiring us to have convoluted interactions on OpId/Timestamp assignment
and registration between the TimeManager, MvccManager and ConsensusQueue.

These convoluted interactions are currently preventing a clean and simple
leader leases implementation, particularly the single timestamp outstanding
in TimeManager.

The "tricky" constraints are:

i) Before clean time is advanced in mvcc, all transactions with a
timestamp lower than the current time must be registered with mvcc.
This allows to wait for a certain timestamp t to be "safe" with TimeManager
and then wait for all transactions with a timestamp t1 < t to commit, thus
making sure that a scan at t is repeatable.

ii) The leader can't advance follower safe timestamp to t before
all the operations with a timestamp lower than it have also been sent.
This is hard since timestamps are assigned outside of consensus.

In general it's problematic to generate OpIds non-atomically with
timestamps, since these are both supposed to be monotonically increasing
in unison. To address this we had some complex state and interactions between
the components. For instance this was requiring us to have TimeManager
stop safe time advancement until a transaction being prepared was
submitted to the queue, after assigning a timestamp in the
TransactionDriver. This was problematic because with would only allow at most
one assigned timestamp in-flight, meaning we were not able to extract the
current safe time at will.

This patch completely addresses ii), Timestamps and OpIds are now assigned
by PeerMessageQueue, atomically. This makes it trivial to make sure that the
safe time sent to replicas is inline with the sent messages. This also allows
some simplifications and refactorings of the PeerMessageQueue and TimeManager.

As for i) the way we enforce this constraint changed. Instead of registering
a transaction before submitting it to consensus (which we now can't since
consensus submission is also where we assign the timestamp) we register
another, dummy transaction on the MvccManager before submission and abort it
afterwards. This allows to "pin" safe time advancement down to a timestamp
that is mandatorily smaller than then one that will be assigned.

One side effect of this change is that, for new config changes, we won't
know the OpId until after the queue submission, which needs to happen
after actually setting the pending config. This is largely inconsequential
but did require some refactoring.

Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
---
M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java
M src/kudu/consensus/consensus-test-util.h
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/log_cache-test.cc
M src/kudu/consensus/pending_rounds.cc
M src/kudu/consensus/quorum_util.cc
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/time_manager-test.cc
M src/kudu/consensus/time_manager.cc
M src/kudu/consensus/time_manager.h
M src/kudu/integration-tests/alter_table-test.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tserver/tablet_server-test.cc
18 files changed, 301 insertions(+), 252 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/21/7221/5
-- 
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot

[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Kudu Jenkins, Todd Lipcon, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/7221

to look at the new patch set (#9).

Change subject: Simplify OpId/Timestamp assignment and make it atomic
......................................................................

Simplify OpId/Timestamp assignment and make it atomic

There are a couple of constraints that are tricky to enforce and were
requiring us to have convoluted interactions on OpId/Timestamp assignment
and registration between the TimeManager, MvccManager and ConsensusQueue.

These convoluted interactions are currently preventing a clean and simple
leader leases implementation, particularly the single timestamp outstanding
in TimeManager.

The "tricky" constraints are:

i) Before clean time is advanced in mvcc, all transactions with a
timestamp lower than the current time must be registered with mvcc.
This allows to wait for a certain timestamp t to be "safe" with TimeManager
and then wait for all transactions with a timestamp t1 < t to commit, thus
making sure that a scan at t is repeatable.

ii) The leader can't advance follower safe timestamp to t before
all the operations with a timestamp lower than it have also been sent.
This is hard since timestamps are assigned outside of consensus.

In general it's problematic to generate OpIds non-atomically with
timestamps, since these are both supposed to be monotonically increasing
in unison. To address this we had some complex state and interactions between
the components. For instance this was requiring us to have TimeManager
stop safe time advancement until a transaction being prepared was
submitted to the queue, after assigning a timestamp in the
TransactionDriver. This was problematic because with would only allow at most
one assigned timestamp in-flight, meaning we were not able to move safe
time at will.

This patch completely addresses ii), Timestamps and OpIds are now assigned
by PeerMessageQueue, atomically. This makes it trivial to make sure that the
safe time sent to replicas is inline with the sent messages. This also allows
some simplifications and refactorings of the PeerMessageQueue and TimeManager.

As for i) the way we enforce this constraint changed. Instead of registering
a transaction before submitting it to consensus (which we now can't since
consensus submission is also where we assign the timestamp) we register
another, dummy transaction on the MvccManager before submission and abort it
afterwards. This allows to "pin" safe time advancement down to a timestamp
that is mandatorily smaller than then one that will be assigned thus enforcing
the constraint with a simpler implementation at the cost of an additional
(short-lived) transaction in mvcc. Note in this regard that there will be
at most one aditional transaction in mvcc.

One side effect of this change is that, for new config changes, we won't
know the OpId until after the queue submission, which needs to happen
after actually setting the pending config. This is largely inconsequential
since, by design, we can have at most one pending config, but did require
some refactoring.

Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
---
M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java
M src/kudu/consensus/consensus-test-util.h
M src/kudu/consensus/consensus_meta.cc
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/log_cache-test.cc
M src/kudu/consensus/pending_rounds.cc
M src/kudu/consensus/quorum_util.cc
M src/kudu/consensus/quorum_util.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/time_manager-test.cc
M src/kudu/consensus/time_manager.cc
M src/kudu/consensus/time_manager.h
M src/kudu/integration-tests/alter_table-test.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/master/sys_catalog.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tserver/tablet_server-test.cc
21 files changed, 324 insertions(+), 264 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/21/7221/9
-- 
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-Change-Number: 7221
Gerrit-PatchSet: 9
Gerrit-Owner: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Mike Percy, Alexey Serbin, Kudu Jenkins, Todd Lipcon, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/7221

to look at the new patch set (#21).

Change subject: Simplify OpId/Timestamp assignment and make it atomic
......................................................................

Simplify OpId/Timestamp assignment and make it atomic

There are a couple of constraints that are tricky to enforce and were
requiring us to have convoluted interactions on OpId/Timestamp assignment
and registration between the TimeManager, MvccManager and ConsensusQueue.

These convoluted interactions are currently preventing a clean and simple
leader leases implementation, particularly the single timestamp outstanding
in TimeManager.

The "tricky" constraints are:

i) Before clean time is advanced in mvcc, all transactions with a
timestamp lower than the current time must be registered with mvcc.
This allows to wait for a certain timestamp t to be "safe" with TimeManager
and then wait for all transactions with a timestamp t1 < t to commit, thus
making sure that a scan at t is repeatable.

ii) The leader can't advance follower safe timestamp to t before
all the operations with a timestamp lower than it have also been sent.
This is hard since timestamps are assigned outside of consensus.

In general it's problematic to generate OpIds non-atomically with
timestamps, since these are both supposed to be monotonically increasing
in unison. To address this we had some complex state and interactions between
the components. For instance this was requiring us to have TimeManager
stop safe time advancement until a transaction being prepared was
submitted to the queue, after assigning a timestamp in the
TransactionDriver. This was problematic because with would only allow at most
one assigned timestamp in-flight, meaning we were not able to move safe
time at will.

This patch completely addresses ii), Timestamps and OpIds are now assigned
by PeerMessageQueue, atomically. This makes it trivial to make sure that the
safe time sent to replicas is inline with the sent messages. This also allows
some simplifications and refactorings of the PeerMessageQueue and TimeManager.

As for i) the way we enforce this constraint changed. Instead of registering
a transaction before submitting it to consensus (which we now can't since
consensus submission is also where we assign the timestamp) we register
another, dummy transaction on the MvccManager before submission and abort it
afterwards. This allows to "pin" safe time advancement down to a timestamp
that is mandatorily smaller than then one that will be assigned thus enforcing
the constraint with a simpler implementation at the cost of an additional
(short-lived) transaction in mvcc. Note in this regard that there will be
at most one aditional transaction in mvcc.

One side effect of this change is that, for new config changes, we won't
know the OpId until after the queue submission, which needs to happen
after actually setting the pending config. This is largely inconsequential
since, by design, we can have at most one pending config, but did require
some refactoring.

Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
---
M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java
M src/kudu/consensus/consensus-test-util.h
M src/kudu/consensus/consensus_meta-test.cc
M src/kudu/consensus/consensus_meta.cc
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/log_cache-test.cc
M src/kudu/consensus/pending_rounds.cc
M src/kudu/consensus/quorum_util.cc
M src/kudu/consensus/quorum_util.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/time_manager-test.cc
M src/kudu/consensus/time_manager.cc
M src/kudu/consensus/time_manager.h
M src/kudu/integration-tests/alter_table-test.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/master/sys_catalog.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tserver/tablet_server-test.cc
M src/kudu/tserver/ts_tablet_manager.cc
23 files changed, 349 insertions(+), 280 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/21/7221/21
-- 
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-Change-Number: 7221
Gerrit-PatchSet: 21
Gerrit-Owner: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has abandoned this change. ( http://gerrit.cloudera.org:8080/7221 )

Change subject: Simplify OpId/Timestamp assignment and make it atomic
......................................................................


Abandoned

superseded by https://gerrit.cloudera.org/c/11428/
-- 
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: abandon
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-Change-Number: 7221
Gerrit-PatchSet: 28
Gerrit-Owner: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] Simplify OpId/Timestamp assignment

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/7221

to look at the new patch set (#3).

Change subject: Simplify OpId/Timestamp assignment
......................................................................

Simplify OpId/Timestamp assignment

There are a couple of constraints that are tricky to enforce and were
requiring us to have convoluted interactions on OpId/Timestamp assignment
and registration between the TimeManager, MvccManager and ConsensusQueue.

These convoluted interactions are currently preventing a clean and simple
leader leases implementation, particularly the single timestamp outstanding
in TimeManager.

The "tricky" constraints are:

i) Before clean time is advanced in mvcc, all transactions with a
timestamp lower than the current time must be registered with mvcc.
This allows to wait for a certain timestamp t to be "safe" with TimeManager
and then wait for all transactions with a timestamp t1 < t to commit, thus
making sure that a scan at t is repeatable.

ii) The leader can't advance follower safe timestamp to t before
all the operations with a timestamp lower than it have also been sent.
This is hard since timestamps are assigned outside of consensus.

In general it's problematic to generate OpIds non-atomically with
timestamps, since these are both supposed to be monotonically increasing
in unison. To address this we had some complex state and interactions between
the components. For instance this was requiring us to have TimeManager
stop safe time advancement until a transaction being prepared was
submitted to the queue, after assigning a timestamp in the
TransactionDriver. This was problematic because with would only allow at most
one assigned timestamp in-flight, meaning we were not able to extract the
current safe time at will.

This patch completely addresses ii), Timestamps and OpIds are now assigned
by PeerMessageQueue, atomically. This makes it trivial to make sure that the
safe time sent to replicas is inline with the sent messages. This also allows
some simplifications and refactorings of the PeerMessageQueue and TimeManager.

As for i) the way we enforce this constraint changed. Instead of registering
a transaction before submitting it to consensus (which we now can't since
consensus submission is also where we assign the timestamp) we register
another, dummy transaction on the MvccManager before submission and abort it
afterwards. This allows to "pin" safe time advancement down to a timestamp
that is mandatorily smaller than then one that will be assigned.

One side effect of this change is that, for new config changes, we won't
know the OpId until after the queue submission, which needs to happen
after actually setting the pending config. This is largely inconsequential
but did require some refactoring.

Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
---
M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java
M src/kudu/consensus/consensus-test-util.h
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/log_cache-test.cc
M src/kudu/consensus/pending_rounds.cc
M src/kudu/consensus/quorum_util.cc
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/time_manager-test.cc
M src/kudu/consensus/time_manager.cc
M src/kudu/consensus/time_manager.h
M src/kudu/integration-tests/alter_table-test.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tserver/tablet_server-test.cc
18 files changed, 301 insertions(+), 245 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/21/7221/3
-- 
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot

[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Mike Percy, Alexey Serbin, Kudu Jenkins, Todd Lipcon, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/7221

to look at the new patch set (#16).

Change subject: Simplify OpId/Timestamp assignment and make it atomic
......................................................................

Simplify OpId/Timestamp assignment and make it atomic

Context:

Each message serialized through consensus is assigned a Timestamp and an OpId.
These indexes are used in different contexts to establish ordering and, for the
most part, they are monotonically increasing. In particular, for write messages,
successive OpIds are guaranteed to have increasing Timestamps. For non-write
messages though, like NO_OP and CONFIG_CHANGE, this is not the case. Because
timestamps are assigned to messages in different places (for write messages they
are assigned in the prepare phase, by the TransactionDriver, while for NO_OP and
CONFIG_CHANGE they are assigned inside consensus) there was no way, until now,
to make sure that timestamps are monotonically increasing across message types,
like OpIds are.

Not being able to trust that timestamps are monotonically increasing across
all message types, means that we can't use the timestamps from non-write
messages to advance time. This is a requirement for leader leases, as we
need a single source of truth regarding time advancement and need config
changes in particular to have been assigned timestamps we can trust, but
pehaps more importantly it's also a requirement to be able to advance time
on bootstrap when there are no write messages in the WAL. Not being able
to do this is the underlying cause of KUDU-2233.

Implementation:

There are a couple of constraints that are tricky to enforce and were
requiring us to have convoluted interactions on OpId/Timestamp assignment
and registration between the TimeManager, MvccManager and ConsensusQueue.
These convoluted interactions make it so that OpId and Timestamp are
not assigned atomically.

The "tricky" constraints are:

i) Before clean time is advanced in mvcc, all transactions with a
timestamp lower than the current time must be registered with mvcc.
This allows to wait for a certain timestamp t to be "safe" with TimeManager
and then wait for all transactions with a timestamp t1 < t to commit, thus
making sure that a scan at t is repeatable.

ii) The leader can't advance follower safe timestamp to t before
all the operations with a timestamp lower than it have also been sent.
This is hard since timestamps are assigned outside of consensus.

In general it's problematic to generate OpIds non-atomically with
timestamps, since these are both supposed to be monotonically increasing
in unison. To address this we had some complex state and interactions between
the components. For instance this was requiring us to have TimeManager
stop safe time advancement until a transaction being prepared was
submitted to the queue, after assigning a timestamp in the
TransactionDriver. This was problematic because with would only allow at most
one assigned timestamp in-flight, meaning we were not able to move safe
time at will.

This patch completely addresses ii), Timestamps and OpIds are now assigned
by PeerMessageQueue, atomically. This makes it trivial to make sure that the
safe time sent to replicas is inline with the sent messages. This also allows
some simplifications and refactorings of the PeerMessageQueue and TimeManager.

As for i) the way we enforce this constraint changed. Instead of registering
a transaction before submitting it to consensus (which we now can't since
consensus submission is also where we assign the timestamp) we register
another, dummy transaction on the MvccManager before submission and abort it
afterwards. This allows to "pin" safe time advancement down to a timestamp
that is mandatorily smaller than then one that will be assigned thus enforcing
the constraint with a simpler implementation at the cost of an additional
(short-lived) transaction in mvcc. Note in this regard that there will be
at most one aditional transaction in mvcc.

One side effect of this change is that, for new config changes, we won't
know the OpId until after the queue submission, which needs to happen
after actually setting the pending config. This is largely inconsequential
since, by design, we can have at most one pending config, but did require
some refactoring.

Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
---
M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java
M src/kudu/consensus/consensus-test-util.h
M src/kudu/consensus/consensus_meta-test.cc
M src/kudu/consensus/consensus_meta.cc
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/log_cache-test.cc
M src/kudu/consensus/pending_rounds.cc
M src/kudu/consensus/quorum_util.cc
M src/kudu/consensus/quorum_util.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/time_manager-test.cc
M src/kudu/consensus/time_manager.cc
M src/kudu/consensus/time_manager.h
M src/kudu/integration-tests/alter_table-test.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/master/sys_catalog.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tserver/tablet_server-test.cc
M src/kudu/tserver/ts_tablet_manager.cc
23 files changed, 350 insertions(+), 275 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/21/7221/16
-- 
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-Change-Number: 7221
Gerrit-PatchSet: 16
Gerrit-Owner: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change. ( http://gerrit.cloudera.org:8080/7221 )

Change subject: Simplify OpId/Timestamp assignment and make it atomic
......................................................................


Patch Set 17:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus_queue.h
File src/kudu/consensus/consensus_queue.h:

http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus_queue.h@231
PS17, Line 231:   // Appends a single message to be replicated to the peers.
update this to indicate it also does assignment of OpId and Timestamp


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus_queue.cc
File src/kudu/consensus/consensus_queue.cc:

http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus_queue.cc@361
PS17, Line 361:     op_id->CopyFrom(GetNextOpIdUnlocked());
              :     RETURN_NOT_OK(time_manager_->AssignTimestamp(msg->get()));
per comment below about GetNextOpId having a side effect, here's an example of why I think it's dangerous. Here AssignTimestamp might fail, but then the side effect of GetNextOpId still took effect.


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus_queue.cc@421
PS17, Line 421:  {
nit: indent


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/consensus_queue.cc@440
PS17, Line 440:   if (PREDICT_FALSE(queue_state_.first_index_in_current_term == boost::none)) {
              :     queue_state_.first_index_in_current_term = id.index();
              :   }
hrm, this seems a little surprising that GetNextOpId has this side effect rather than setting it on append.


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.cc
File src/kudu/consensus/raft_consensus.cc:

http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.cc@648
PS17, Line 648:   RETURN_NOT_OK(HandlePendingConfigChangeUnlocked(round));
although this is right, it looks confusing because round probably isn't a config change. Maybe if we name it something like 'SnoopForConfigChangeUnlocked' or 'HandleConfigChangeIfNecessaryUnlocked' it would be clearer?


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.cc@2609
PS17, Line 2609: RaftConfigPB
this and new_config can be const refs


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/raft_consensus.cc@2633
PS17, Line 2633:       LOG_WITH_PREFIX_UNLOCKED(INFO) << "Skipping abort of non-pending config change. Status:  "
               :                                      << status.ToString();
it should still have an opid right? ie if we get to this path it should have been in the queue and been assigned an op?


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/time_manager.cc
File src/kudu/consensus/time_manager.cc:

http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/time_manager.cc@110
PS17, Line 110:   // NOTE: Currently this method just updates the clock and stores the message's timestamp.
              :   //       It always returns Status::OK() if the clock returns an OK status on Update().
hrm, is this method still even very useful? Maybe it would be clearer to just use clock->Update?


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/time_manager.cc@294
PS17, Line 294:  case LEADER: GetSerialTimestampUnlocked();
              :     FALLTHROUGH_INTENDED;
this looks wrong. you are falling through to the next case which breaks anyway.

Also seems odd to be calling GetSerialTimestamp without doing anything with the result.

Wouldn't this function be equivalent to just say 'return GetSerialTimestampUnlocked'?


http://gerrit.cloudera.org:8080/#/c/7221/17/src/kudu/consensus/time_manager.cc@307
PS17, Line 307:   DCHECK(lock_.is_locked());
can you DCHECK_EQ(mode_, LEADER) here?



-- 
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-Change-Number: 7221
Gerrit-PatchSet: 17
Gerrit-Owner: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Sat, 10 Feb 2018 02:14:38 +0000
Gerrit-HasComments: Yes

[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Mike Percy, Alexey Serbin, Kudu Jenkins, Todd Lipcon, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/7221

to look at the new patch set (#25).

Change subject: Simplify OpId/Timestamp assignment and make it atomic
......................................................................

Simplify OpId/Timestamp assignment and make it atomic

There are a couple of constraints that are tricky to enforce and were
requiring us to have convoluted interactions on OpId/Timestamp assignment
and registration between the TimeManager, MvccManager and ConsensusQueue.

These convoluted interactions are currently preventing a clean and simple
leader leases implementation, particularly the single timestamp outstanding
in TimeManager.

The "tricky" constraints are:

i) Before clean time is advanced in mvcc, all transactions with a
timestamp lower than the current time must be registered with mvcc.
This allows to wait for a certain timestamp t to be "safe" with TimeManager
and then wait for all transactions with a timestamp t1 < t to commit, thus
making sure that a scan at t is repeatable.

ii) The leader can't advance follower safe timestamp to t, before
all the operations with a timestamp lower than t have been sent.
This is hard since timestamps are assigned outside of consensus.

In general it's problematic to generate OpIds non-atomically with
timestamps, since these are both supposed to be monotonically increasing
in unison. To address this we had some complex state and interactions between
the components. For instance this was requiring us to have TimeManager
stop safe time advancement until a transaction being prepared was
submitted to the queue, after assigning a timestamp in the
TransactionDriver. This was problematic because with it would only allow at
most one assigned timestamp in-flight, meaning we were not able to move safe
time at will.

This patch completely addresses ii) Timestamps and OpIds are now assigned
by PeerMessageQueue, atomically. This makes it trivial to make sure that the
safe time sent to replicas is inline with the sent messages. This also allows
some simplifications and refactorings of the PeerMessageQueue and TimeManager.

As for i) the way we enforce this constraint changed. Instead of registering
a transaction before submitting it to consensus (which we now can't since
consensus submission is also where we assign the timestamp) we register
another, dummy transaction on the MvccManager before submission and abort it
afterwards. This allows to "pin" safe time advancement down to a timestamp
that is mandatorily smaller than then one that will be assigned thus enforcing
the constraint with a simpler implementation at the cost of an additional
(short-lived) transaction in mvcc. Note in this regard that there will be
at most one aditional transaction in mvcc.

One side effect of this change is that, for new config changes, we won't
know the OpId until after the queue submission, which needs to happen
after actually setting the pending config. This is largely inconsequential
since, by design, we can have at most one pending config, but did require
some refactoring.

Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
---
M java/kudu-client/src/test/java/org/apache/kudu/client/TestHybridTime.java
M src/kudu/consensus/consensus-test-util.h
M src/kudu/consensus/consensus_meta-test.cc
M src/kudu/consensus/consensus_meta.cc
M src/kudu/consensus/consensus_peers-test.cc
M src/kudu/consensus/consensus_queue-test.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/log_cache-test.cc
M src/kudu/consensus/pending_rounds.cc
M src/kudu/consensus/quorum_util.cc
M src/kudu/consensus/quorum_util.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/consensus/raft_consensus.h
M src/kudu/consensus/time_manager-test.cc
M src/kudu/consensus/time_manager.cc
M src/kudu/consensus/time_manager.h
M src/kudu/integration-tests/alter_table-test.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/integration-tests/test_workload.cc
M src/kudu/master/sys_catalog.cc
M src/kudu/tablet/transactions/transaction_driver.cc
M src/kudu/tserver/tablet_server-test.cc
M src/kudu/tserver/ts_tablet_manager.cc
24 files changed, 357 insertions(+), 284 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/21/7221/25
-- 
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-Change-Number: 7221
Gerrit-PatchSet: 25
Gerrit-Owner: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] Simplify OpId/Timestamp assignment and make it atomic

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/7221 )

Change subject: Simplify OpId/Timestamp assignment and make it atomic
......................................................................


Patch Set 28:

I think we can abandon this approach now that we went with another approach to KUDU-2463


-- 
To view, visit http://gerrit.cloudera.org:8080/7221
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2bedcc5c3708d06f48131f34ea14c835867b4800
Gerrit-Change-Number: 7221
Gerrit-PatchSet: 28
Gerrit-Owner: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Fri, 12 Oct 2018 21:40:58 +0000
Gerrit-HasComments: No