You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Andrew Wong (Code Review)" <ge...@cloudera.org> on 2018/08/07 16:23:16 UTC

[kudu-CR] tablet bootstrap: adjust mvcc safetime with no-ops

Andrew Wong has uploaded this change for review. ( http://gerrit.cloudera.org:8080/11142


Change subject: tablet_bootstrap: adjust mvcc safetime with no-ops
......................................................................

tablet_bootstrap: adjust mvcc safetime with no-ops

Previously, during tablet bootstrap, a tablet would only update its MVCC
safetime based on write messages, as the timstamps in the write messages
are guaranteed to be serialized with respect to one another, by virtue
of being assigned in a single thread (the prepare thread) on the leader
replica.

From this, we conclude that timestamps for write operations are
monotonically increasing in unison with opid. The same cannot
necessarily be said for timestamps of no-ops and change configs.

This is a conservative conclusion about assigned timestamps, and this
patch hinges on the fact that our Raft implementation ensures the
following sequence of events:

1. replica A becomes leader of Term N
2. leader A assigns a timestamp t1 to its no-op
3. leader A replicates the no-op to replicas B and C, asserting its
   leadership for Term N
4. leader A prepares a write and assigns it a timestamp t2. A assigns a
   higher timestamp than t1, as this step happens after Step 2
5. leader A replicates the write to replicas B and C, checking that it
   is leader for the current term

Given the above series of operations, within the same term, the no-op
used to assert leadership is always assigned a timestamp that must be
lower than any writes in that term. As such, the timestamps assigned to
no-ops can and should be used to bump safetime.

This patch updates tablet bootstrap to adjust MVCC safetime based on
no-ops seen in the WALs. A test is added asserting that this is true of
no-ops with respect to writes. I.e. all replicate messages must have
monotonically increasing OpIds and monotonically increasing timestamps.

Change-Id: I26deff32da8c990cb8a2ba220bb81858ddd6d73f
---
M src/kudu/integration-tests/CMakeLists.txt
M src/kudu/integration-tests/log_verifier.cc
M src/kudu/integration-tests/log_verifier.h
A src/kudu/integration-tests/timestamp_serialization-itest.cc
M src/kudu/tablet/tablet_bootstrap.cc
5 files changed, 216 insertions(+), 13 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/42/11142/1
-- 
To view, visit http://gerrit.cloudera.org:8080/11142
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I26deff32da8c990cb8a2ba220bb81858ddd6d73f
Gerrit-Change-Number: 11142
Gerrit-PatchSet: 1
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>

[kudu-CR] KUDU-2463 pt 1: adjust MVCC when replaying no-ops

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/11142 )

Change subject: KUDU-2463 pt 1: adjust MVCC when replaying no-ops
......................................................................


Patch Set 8: Verified+1

Unrelated flake of trace test (posted logs to KUDU-2554)


-- 
To view, visit http://gerrit.cloudera.org:8080/11142
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I26deff32da8c990cb8a2ba220bb81858ddd6d73f
Gerrit-Change-Number: 11142
Gerrit-PatchSet: 8
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Comment-Date: Thu, 20 Sep 2018 00:40:27 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2463 pt 1: adjust MVCC when replaying no-ops

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/11142 )

Change subject: KUDU-2463 pt 1: adjust MVCC when replaying no-ops
......................................................................


Patch Set 7:

> Patch Set 7:
> 
> (32 comments)

Here's a dist test run: http://dist-test.cloudera.org/job?job_id=awong.1537249774.103150. Seems stable, 10-20 seconds with TSAN


-- 
To view, visit http://gerrit.cloudera.org:8080/11142
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I26deff32da8c990cb8a2ba220bb81858ddd6d73f
Gerrit-Change-Number: 11142
Gerrit-PatchSet: 7
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Comment-Date: Tue, 18 Sep 2018 15:14:05 +0000
Gerrit-HasComments: No

[kudu-CR] tablet bootstrap: adjust mvcc safetime with no-ops

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/11142

to look at the new patch set (#2).

Change subject: tablet_bootstrap: adjust mvcc safetime with no-ops
......................................................................

tablet_bootstrap: adjust mvcc safetime with no-ops

Previously, during tablet bootstrap, a tablet would only update its MVCC
safetime based on write messages, as the timstamps in the write messages
are guaranteed to be serialized with respect to one another, by virtue
of being assigned in a single thread (the prepare thread) on the leader
replica.

From this, we conclude that timestamps for write operations are
monotonically increasing in unison with opid. The same cannot
necessarily be said for timestamps of no-ops and change configs.

This is a conservative conclusion about assigned timestamps, and this
patch hinges on the fact that our Raft implementation ensures the
following sequence of events:

1. replica A becomes leader of Term N
2. leader A assigns a timestamp t1 to its no-op
3. leader A replicates the no-op to replicas B and C, asserting its
   leadership for Term N
4. leader A prepares a write and assigns it a timestamp t2. A assigns a
   higher timestamp than t1, as this step happens after Step 2
5. leader A replicates the write to replicas B and C, checking that it
   is leader for the current term

Given the above series of operations, within the same term, the no-op
used to assert leadership is always assigned a timestamp that must be
lower than any writes in that term. As such, the timestamps assigned to
no-ops can and should be used to bump safetime.

This patch updates tablet bootstrap to adjust MVCC safetime based on
no-ops seen in the WALs. A test is added asserting that this is true of
no-ops with respect to writes. I.e. all replicate messages must have
monotonically increasing OpIds and monotonically increasing timestamps.

A case in tablet_bootstrap-test depends on the ability for no-ops to
written out of timestamp order. To maintain this, and to keep this
functionality around (which may be useful for general timestamp
assignment testing), a flag has been added to the NoOpRequestPB
indicating whether or not its timestamp should be trusted to advance
safetime.

Additionally, I tweaked the artificial timestamps used in
raft_consensus-itest. These timestamps were previously very low and
would overlap with the real timestamp used by the leadership no-op.

Change-Id: I26deff32da8c990cb8a2ba220bb81858ddd6d73f
---
M src/kudu/consensus/consensus.proto
M src/kudu/integration-tests/CMakeLists.txt
M src/kudu/integration-tests/log_verifier.cc
M src/kudu/integration-tests/log_verifier.h
M src/kudu/integration-tests/raft_consensus-itest.cc
A src/kudu/integration-tests/timestamp_serialization-itest.cc
M src/kudu/tablet/tablet_bootstrap-test.cc
M src/kudu/tablet/tablet_bootstrap.cc
8 files changed, 241 insertions(+), 14 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/42/11142/2
-- 
To view, visit http://gerrit.cloudera.org:8080/11142
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I26deff32da8c990cb8a2ba220bb81858ddd6d73f
Gerrit-Change-Number: 11142
Gerrit-PatchSet: 2
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins

[kudu-CR] KUDU-2463 pt 1: adjust MVCC when replaying no-ops

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has removed a vote on this change.

Change subject: KUDU-2463 pt 1: adjust MVCC when replaying no-ops
......................................................................


Removed Verified-1 by Kudu Jenkins (120)
-- 
To view, visit http://gerrit.cloudera.org:8080/11142
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: deleteVote
Gerrit-Change-Id: I26deff32da8c990cb8a2ba220bb81858ddd6d73f
Gerrit-Change-Number: 11142
Gerrit-PatchSet: 8
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>

[kudu-CR] KUDU-2463 pt 1: adjust MVCC when replaying no-ops

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/11142 )

Change subject: KUDU-2463 pt 1: adjust MVCC when replaying no-ops
......................................................................


Patch Set 3:

For simplicity of the patch, I've stripped away the "non-serialized" flag from earlier PSs.


-- 
To view, visit http://gerrit.cloudera.org:8080/11142
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I26deff32da8c990cb8a2ba220bb81858ddd6d73f
Gerrit-Change-Number: 11142
Gerrit-PatchSet: 3
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Comment-Date: Tue, 11 Sep 2018 23:15:33 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2463 pt 1: adjust MVCC when replaying no-ops

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Mike Percy, Kudu Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/11142

to look at the new patch set (#4).

Change subject: KUDU-2463 pt 1: adjust MVCC when replaying no-ops
......................................................................

KUDU-2463 pt 1: adjust MVCC when replaying no-ops

Previously, during tablet bootstrap, a tablet would only update its MVCC
safetime based on write messages, as the timstamps in the write messages
are guaranteed to be serialized with respect to one another, by virtue
of being assigned in a single thread (the prepare thread) on the leader
replica.

From this, we conclude that timestamps for write operations are
monotonically increasing in unison with opid. The same cannot
necessarily be said for timestamps of no-ops and change configs.

This is a conservative conclusion about assigned timestamps, and this
patch hinges on the fact that our Raft implementation ensures the
following sequence of events:

1. replica A becomes leader of Term N
2. leader A assigns a timestamp t1 to its no-op
3. leader A replicates the no-op to replicas B and C, asserting its
   leadership for Term N
4. leader A prepares a write and assigns it a timestamp t2 > t1
5. leader A replicates the write to replicas B and C, checking that it
   is leader for the Term N

Given the above series of operations, within the same term, the no-op
used to assert leadership is always assigned a timestamp that must be
lower than any writes in that term. As such, the timestamps assigned to
no-ops can and should be used to bump safetime. This patch does so at
bootstrap time with committed no-ops found in the WAL.

This alone isn't enough to prevent KUDU-2463, but it reduces the window
during which incorrect results can be returned.

Change-Id: I26deff32da8c990cb8a2ba220bb81858ddd6d73f
---
M src/kudu/consensus/consensus.proto
M src/kudu/integration-tests/CMakeLists.txt
M src/kudu/integration-tests/raft_consensus-itest.cc
A src/kudu/integration-tests/timestamp_advancement-itest.cc
M src/kudu/tablet/tablet_bootstrap-test.cc
M src/kudu/tablet/tablet_bootstrap.cc
6 files changed, 332 insertions(+), 9 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/42/11142/4
-- 
To view, visit http://gerrit.cloudera.org:8080/11142
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I26deff32da8c990cb8a2ba220bb81858ddd6d73f
Gerrit-Change-Number: 11142
Gerrit-PatchSet: 4
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>

[kudu-CR] KUDU-2463 pt 1: adjust MVCC when replaying no-ops

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/11142 )

Change subject: KUDU-2463 pt 1: adjust MVCC when replaying no-ops
......................................................................


Patch Set 9:

(12 comments)

http://gerrit.cloudera.org:8080/#/c/11142/8//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/11142/8//COMMIT_MSG@16
PS8, Line 16: Because
            : no-op and change config operations are not serialized through the
> Because no-op and change config operations are not serialized through the p
Done


http://gerrit.cloudera.org:8080/#/c/11142/8//COMMIT_MSG@16
PS8, Line 16: Because
            : no-op and change config operations are not serialized through the
> Because no-op and change config operations are not serialized through the p
Done


http://gerrit.cloudera.org:8080/#/c/11142/8//COMMIT_MSG@19
PS8, Line 19: monotonicity guarantees.
            : 
> However in the case of no-ops replicated to assert leadership, we get the s
Done


http://gerrit.cloudera.org:8080/#/c/11142/8//COMMIT_MSG@19
PS8, Line 19: monotonicity guarantees.
            : 
> However in the case of no-ops replicated to assert leadership, we get the s
Done


http://gerrit.cloudera.org:8080/#/c/11142/8//COMMIT_MSG@31
PS8, Line 31: plicas B and C, chec
> for a given term
Done


http://gerrit.cloudera.org:8080/#/c/11142/8//COMMIT_MSG@31
PS8, Line 31: plicas B and C, chec
> for a given term
Done


http://gerrit.cloudera.org:8080/#/c/11142/8//COMMIT_MSG@34
PS8, Line 34:  a given
> MVCC safe time
Done


http://gerrit.cloudera.org:8080/#/c/11142/8//COMMIT_MSG@34
PS8, Line 34:  a given
> MVCC safe time
Done


http://gerrit.cloudera.org:8080/#/c/11142/8//COMMIT_MSG@37
PS8, Line 37: ed to b
> always prevent
Done


http://gerrit.cloudera.org:8080/#/c/11142/8//COMMIT_MSG@37
PS8, Line 37: ed to b
> always prevent
Done


http://gerrit.cloudera.org:8080/#/c/11142/8//COMMIT_MSG@37
PS8, Line 37:  This patch does so
            : at bootstrap time with committed no-ops found 
> dramatically reduces the likelihood of it occurring
Done


http://gerrit.cloudera.org:8080/#/c/11142/8//COMMIT_MSG@37
PS8, Line 37:  This patch does so
            : at bootstrap time with committed no-ops found 
> dramatically reduces the likelihood of it occurring
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/11142
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I26deff32da8c990cb8a2ba220bb81858ddd6d73f
Gerrit-Change-Number: 11142
Gerrit-PatchSet: 9
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Comment-Date: Thu, 20 Sep 2018 23:59:42 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2463 pt 1: adjust MVCC when replaying no-ops

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Mike Percy, Kudu Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/11142

to look at the new patch set (#3).

Change subject: KUDU-2463 pt 1: adjust MVCC when replaying no-ops
......................................................................

KUDU-2463 pt 1: adjust MVCC when replaying no-ops

Previously, during tablet bootstrap, a tablet would only update its MVCC
safetime based on write messages, as the timstamps in the write messages
are guaranteed to be serialized with respect to one another, by virtue
of being assigned in a single thread (the prepare thread) on the leader
replica.

From this, we conclude that timestamps for write operations are
monotonically increasing in unison with opid. The same cannot
necessarily be said for timestamps of no-ops and change configs.

This is a conservative conclusion about assigned timestamps, and this
patch hinges on the fact that our Raft implementation ensures the
following sequence of events:

1. replica A becomes leader of Term N
2. leader A assigns a timestamp t1 to its no-op
3. leader A replicates the no-op to replicas B and C, asserting its
   leadership for Term N
4. leader A prepares a write and assigns it a timestamp t2 > t1
5. leader A replicates the write to replicas B and C, checking that it
   is leader for the Term N

Given the above series of operations, within the same term, the no-op
used to assert leadership is always assigned a timestamp that must be
lower than any writes in that term. As such, the timestamps assigned to
no-ops can and should be used to bump safetime. This patch does so at
bootstrap time with committed no-ops found in the WAL.

This alone isn't enough to prevent KUDU-2463, but it reduces the window
during which incorrect results can be returned.

Change-Id: I26deff32da8c990cb8a2ba220bb81858ddd6d73f
---
M src/kudu/integration-tests/CMakeLists.txt
A src/kudu/integration-tests/timestamp_advancement-itest.cc
M src/kudu/tablet/tablet_bootstrap.cc
3 files changed, 304 insertions(+), 7 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/42/11142/3
-- 
To view, visit http://gerrit.cloudera.org:8080/11142
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I26deff32da8c990cb8a2ba220bb81858ddd6d73f
Gerrit-Change-Number: 11142
Gerrit-PatchSet: 3
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>

[kudu-CR] KUDU-2463 pt 1: adjust MVCC when replaying no-ops

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/11142 )

Change subject: KUDU-2463 pt 1: adjust MVCC when replaying no-ops
......................................................................

KUDU-2463 pt 1: adjust MVCC when replaying no-ops

Previously, during tablet bootstrap, a tablet replica would only update
its MVCC safetime based on write and alter schema messages, as the
timestamps in these messages are guaranteed to be serialized with
respect to one another, by virtue of being assigned in a single thread
(the prepare thread) on the leader replica.

From this, we conclude that timestamps for write and alter schema
operations are monotonically increasing in unison with opid. Because
no-op and change config operations are not serialized through the
prepare thread, they don't necessarily provide the same timestamp
monotonicity guarantees.

However, in the case of no-ops replicated to assert leadership, we get
the same timestamp monotonicity guarantee due to a different mechanism.
This patch takes advantage of the fact that our Raft implementation
ensures the following sequence of events:

1. replica A becomes leader of Term N
2. leader A assigns a timestamp t1 to its no-op
3. leader A replicates the no-op to replicas B and C, asserting its
   leadership for Term N
4. leader A prepares a write and assigns it a timestamp t2 > t1
5. leader A replicates the write to replicas B and C, checking that it
   is leader for the Term N

Given the above series of operations, for a given term, the no-op used
to assert leadership is always assigned a timestamp that must be lower
than any other ops in that term. As such, the timestamps assigned to
no-ops can and should be used to bump MVCC safe time. This patch does so
at bootstrap time with committed no-ops found in the WAL.

This alone isn't enough to always prevent KUDU-2463, but it dramatically
reduces the likelihood of it occuring.

Change-Id: I26deff32da8c990cb8a2ba220bb81858ddd6d73f
Reviewed-on: http://gerrit.cloudera.org:8080/11142
Tested-by: Kudu Jenkins
Reviewed-by: Mike Percy <mp...@apache.org>
---
M src/kudu/consensus/consensus.proto
M src/kudu/integration-tests/CMakeLists.txt
M src/kudu/integration-tests/raft_consensus-itest.cc
A src/kudu/integration-tests/timestamp_advancement-itest.cc
M src/kudu/tablet/tablet_bootstrap-test.cc
M src/kudu/tablet/tablet_bootstrap.cc
6 files changed, 276 insertions(+), 9 deletions(-)

Approvals:
  Kudu Jenkins: Verified
  Mike Percy: Looks good to me, approved

-- 
To view, visit http://gerrit.cloudera.org:8080/11142
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I26deff32da8c990cb8a2ba220bb81858ddd6d73f
Gerrit-Change-Number: 11142
Gerrit-PatchSet: 11
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>

[kudu-CR] KUDU-2463 pt 1: adjust MVCC when replaying no-ops

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/11142 )

Change subject: KUDU-2463 pt 1: adjust MVCC when replaying no-ops
......................................................................


Patch Set 10: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/11142
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I26deff32da8c990cb8a2ba220bb81858ddd6d73f
Gerrit-Change-Number: 11142
Gerrit-PatchSet: 10
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Comment-Date: Fri, 21 Sep 2018 19:55:16 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2463 pt 1: adjust MVCC when replaying no-ops

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Mike Percy, Kudu Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/11142

to look at the new patch set (#10).

Change subject: KUDU-2463 pt 1: adjust MVCC when replaying no-ops
......................................................................

KUDU-2463 pt 1: adjust MVCC when replaying no-ops

Previously, during tablet bootstrap, a tablet replica would only update
its MVCC safetime based on write and alter schema messages, as the
timestamps in these messages are guaranteed to be serialized with
respect to one another, by virtue of being assigned in a single thread
(the prepare thread) on the leader replica.

From this, we conclude that timestamps for write and alter schema
operations are monotonically increasing in unison with opid. Because
no-op and change config operations are not serialized through the
prepare thread, they don't necessarily provide the same timestamp
monotonicity guarantees.

However, in the case of no-ops replicated to assert leadership, we get
the same timestamp monotonicity guarantee due to a different mechanism.
This patch takes advantage of the fact that our Raft implementation
ensures the following sequence of events:

1. replica A becomes leader of Term N
2. leader A assigns a timestamp t1 to its no-op
3. leader A replicates the no-op to replicas B and C, asserting its
   leadership for Term N
4. leader A prepares a write and assigns it a timestamp t2 > t1
5. leader A replicates the write to replicas B and C, checking that it
   is leader for the Term N

Given the above series of operations, for a given term, the no-op used
to assert leadership is always assigned a timestamp that must be lower
than any other ops in that term. As such, the timestamps assigned to
no-ops can and should be used to bump MVCC safe time. This patch does so
at bootstrap time with committed no-ops found in the WAL.

This alone isn't enough to always prevent KUDU-2463, but it dramatically
reduces the likelihood of it occuring.

Change-Id: I26deff32da8c990cb8a2ba220bb81858ddd6d73f
---
M src/kudu/consensus/consensus.proto
M src/kudu/integration-tests/CMakeLists.txt
M src/kudu/integration-tests/raft_consensus-itest.cc
A src/kudu/integration-tests/timestamp_advancement-itest.cc
M src/kudu/tablet/tablet_bootstrap-test.cc
M src/kudu/tablet/tablet_bootstrap.cc
6 files changed, 276 insertions(+), 9 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/42/11142/10
-- 
To view, visit http://gerrit.cloudera.org:8080/11142
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I26deff32da8c990cb8a2ba220bb81858ddd6d73f
Gerrit-Change-Number: 11142
Gerrit-PatchSet: 10
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>

[kudu-CR] KUDU-2463 pt 1: adjust MVCC when replaying no-ops

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Mike Percy, Kudu Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/11142

to look at the new patch set (#7).

Change subject: KUDU-2463 pt 1: adjust MVCC when replaying no-ops
......................................................................

KUDU-2463 pt 1: adjust MVCC when replaying no-ops

Previously, during tablet bootstrap, a tablet replica would only update
its MVCC safetime based on write and alter schema messages, as the
timestamps in these messages are guaranteed to be serialized with
respect to one another, by virtue of being assigned in a single thread
(the prepare thread) on the leader replica.

From this, we conclude that timestamps for write and alter schema
operations are monotonically increasing in unison with opid. The same
cannot necessarily be said for timestamps of no-ops and change configs.

This is a conservative conclusion about assigned timestamps, and this
patch hinges on the fact that our Raft implementation ensures the
following sequence of events:

1. replica A becomes leader of Term N
2. leader A assigns a timestamp t1 to its no-op
3. leader A replicates the no-op to replicas B and C, asserting its
   leadership for Term N
4. leader A prepares a write and assigns it a timestamp t2 > t1
5. leader A replicates the write to replicas B and C, checking that it
   is leader for the Term N

Given the above series of operations, within the same term, the no-op
used to assert leadership is always assigned a timestamp that must be
lower than any other ops in that term. As such, the timestamps assigned
to no-ops can and should be used to bump safetime. This patch does so at
bootstrap time with committed no-ops found in the WAL.

This alone isn't enough to prevent KUDU-2463, but it reduces the window
during which incorrect results can be returned.

Change-Id: I26deff32da8c990cb8a2ba220bb81858ddd6d73f
---
M src/kudu/consensus/consensus.proto
M src/kudu/integration-tests/CMakeLists.txt
M src/kudu/integration-tests/raft_consensus-itest.cc
A src/kudu/integration-tests/timestamp_advancement-itest.cc
M src/kudu/tablet/tablet_bootstrap-test.cc
M src/kudu/tablet/tablet_bootstrap.cc
6 files changed, 276 insertions(+), 9 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/42/11142/7
-- 
To view, visit http://gerrit.cloudera.org:8080/11142
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I26deff32da8c990cb8a2ba220bb81858ddd6d73f
Gerrit-Change-Number: 11142
Gerrit-PatchSet: 7
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>

[kudu-CR] KUDU-2463 pt 1: adjust MVCC when replaying no-ops

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/11142 )

Change subject: KUDU-2463 pt 1: adjust MVCC when replaying no-ops
......................................................................


Patch Set 5:

@mpercy Currently dist testing some seemingly test-only failures, but pushed the major changes in case you wanted to review the changes/tests, had comments/suggestions, etc.


-- 
To view, visit http://gerrit.cloudera.org:8080/11142
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I26deff32da8c990cb8a2ba220bb81858ddd6d73f
Gerrit-Change-Number: 11142
Gerrit-PatchSet: 5
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Comment-Date: Wed, 12 Sep 2018 01:54:38 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2463 pt 1: adjust MVCC when replaying no-ops

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Mike Percy, Kudu Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/11142

to look at the new patch set (#5).

Change subject: KUDU-2463 pt 1: adjust MVCC when replaying no-ops
......................................................................

KUDU-2463 pt 1: adjust MVCC when replaying no-ops

Previously, during tablet bootstrap, a tablet would only update its MVCC
safetime based on write messages, as the timstamps in the write messages
are guaranteed to be serialized with respect to one another, by virtue
of being assigned in a single thread (the prepare thread) on the leader
replica.

From this, we conclude that timestamps for write operations are
monotonically increasing in unison with opid. The same cannot
necessarily be said for timestamps of no-ops and change configs.

This is a conservative conclusion about assigned timestamps, and this
patch hinges on the fact that our Raft implementation ensures the
following sequence of events:

1. replica A becomes leader of Term N
2. leader A assigns a timestamp t1 to its no-op
3. leader A replicates the no-op to replicas B and C, asserting its
   leadership for Term N
4. leader A prepares a write and assigns it a timestamp t2 > t1
5. leader A replicates the write to replicas B and C, checking that it
   is leader for the Term N

Given the above series of operations, within the same term, the no-op
used to assert leadership is always assigned a timestamp that must be
lower than any writes in that term. As such, the timestamps assigned to
no-ops can and should be used to bump safetime. This patch does so at
bootstrap time with committed no-ops found in the WAL.

This alone isn't enough to prevent KUDU-2463, but it reduces the window
during which incorrect results can be returned.

Change-Id: I26deff32da8c990cb8a2ba220bb81858ddd6d73f
---
M src/kudu/consensus/consensus.proto
M src/kudu/integration-tests/CMakeLists.txt
M src/kudu/integration-tests/raft_consensus-itest.cc
A src/kudu/integration-tests/timestamp_advancement-itest.cc
M src/kudu/tablet/tablet_bootstrap-test.cc
M src/kudu/tablet/tablet_bootstrap.cc
6 files changed, 336 insertions(+), 9 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/42/11142/5
-- 
To view, visit http://gerrit.cloudera.org:8080/11142
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I26deff32da8c990cb8a2ba220bb81858ddd6d73f
Gerrit-Change-Number: 11142
Gerrit-PatchSet: 5
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>

[kudu-CR] KUDU-2463 pt 1: adjust MVCC when replaying no-ops

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Mike Percy, Kudu Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/11142

to look at the new patch set (#6).

Change subject: KUDU-2463 pt 1: adjust MVCC when replaying no-ops
......................................................................

KUDU-2463 pt 1: adjust MVCC when replaying no-ops

Previously, during tablet bootstrap, a tablet would only update its MVCC
safetime based on write messages, as the timstamps in the write messages
are guaranteed to be serialized with respect to one another, by virtue
of being assigned in a single thread (the prepare thread) on the leader
replica.

From this, we conclude that timestamps for write operations are
monotonically increasing in unison with opid. The same cannot
necessarily be said for timestamps of no-ops and change configs.

This is a conservative conclusion about assigned timestamps, and this
patch hinges on the fact that our Raft implementation ensures the
following sequence of events:

1. replica A becomes leader of Term N
2. leader A assigns a timestamp t1 to its no-op
3. leader A replicates the no-op to replicas B and C, asserting its
   leadership for Term N
4. leader A prepares a write and assigns it a timestamp t2 > t1
5. leader A replicates the write to replicas B and C, checking that it
   is leader for the Term N

Given the above series of operations, within the same term, the no-op
used to assert leadership is always assigned a timestamp that must be
lower than any writes in that term. As such, the timestamps assigned to
no-ops can and should be used to bump safetime. This patch does so at
bootstrap time with committed no-ops found in the WAL.

This alone isn't enough to prevent KUDU-2463, but it reduces the window
during which incorrect results can be returned.

Change-Id: I26deff32da8c990cb8a2ba220bb81858ddd6d73f
---
M src/kudu/consensus/consensus.proto
M src/kudu/integration-tests/CMakeLists.txt
M src/kudu/integration-tests/raft_consensus-itest.cc
A src/kudu/integration-tests/timestamp_advancement-itest.cc
M src/kudu/tablet/tablet_bootstrap-test.cc
M src/kudu/tablet/tablet_bootstrap.cc
6 files changed, 337 insertions(+), 9 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/42/11142/6
-- 
To view, visit http://gerrit.cloudera.org:8080/11142
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I26deff32da8c990cb8a2ba220bb81858ddd6d73f
Gerrit-Change-Number: 11142
Gerrit-PatchSet: 6
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>

[kudu-CR] KUDU-2463 pt 1: adjust MVCC when replaying no-ops

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/11142 )

Change subject: KUDU-2463 pt 1: adjust MVCC when replaying no-ops
......................................................................


Patch Set 6:

(32 comments)

Approach looks good, test looks good. How long does the test take to run? Maybe should be considered a slow test? Is it stable on dist-test?

I mostly have just left nitpick feedback.

http://gerrit.cloudera.org:8080/#/c/11142/6//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/11142/6//COMMIT_MSG@10
PS6, Line 10: timstamps
timestamps


http://gerrit.cloudera.org:8080/#/c/11142/6//COMMIT_MSG@17
PS6, Line 17: necessarily be said for timestamps of no-ops and change configs.
also schema changes (I think)


http://gerrit.cloudera.org:8080/#/c/11142/6//COMMIT_MSG@37
PS6, Line 37: This alone isn't enough to prevent KUDU-2463, but it reduces the window
The cases where it doesn't prevent the problem are super edge case, right? The only scenario I can think of is where the WAL is full of config change ops or alter table ops and not enough replicas come up to get a leader elected.


http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/consensus/consensus.proto
File src/kudu/consensus/consensus.proto:

http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/consensus/consensus.proto@235
PS6, Line 235: It
If


http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/consensus/consensus.proto@236
PS6, Line 236: that the timestamp is assigned serially
to be true


http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/integration-tests/raft_consensus-itest.cc
File src/kudu/integration-tests/raft_consensus-itest.cc:

http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/integration-tests/raft_consensus-itest.cc@259
PS6, Line 259: 10000
is this needed for this test to pass? or just for debuggability upon test failure? could use a small comment


http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/integration-tests/timestamp_advancement-itest.cc
File src/kudu/integration-tests/timestamp_advancement-itest.cc:

PS6: 
Did clang-tidy run on this file?


http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/integration-tests/timestamp_advancement-itest.cc@78
PS6, Line 78: using std::string;
nit: hrm, is there a reason you put these outside the namespace? seems easier to maintain with one "using" list


http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/integration-tests/timestamp_advancement-itest.cc@79
PS6, Line 79: using std::shared_ptr;
nit: out of alpha order


http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/integration-tests/timestamp_advancement-itest.cc@86
PS6, Line 86: class RaftPeerPB;
not used?


http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/integration-tests/timestamp_advancement-itest.cc@90
PS6, Line 90: class RpcController;
not used?


http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/integration-tests/timestamp_advancement-itest.cc@94
PS6, Line 94: class TabletServerServiceProxy;
not used


http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/integration-tests/timestamp_advancement-itest.cc@107
PS6, Line 107: using tserver::TabletServerServiceProxy;
nit: not used


http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/integration-tests/timestamp_advancement-itest.cc@108
PS6, Line 108: using rpc::RpcController;
nit: out of alpha order


http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/integration-tests/timestamp_advancement-itest.cc@112
PS6, Line 112: NonAdvancedTimestampsITest
nit: TimestampAdvancementITest ?


http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/integration-tests/timestamp_advancement-itest.cc@121
PS6, Line 121: write.set_num_tablets(1);
this is the default


http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/integration-tests/timestamp_advancement-itest.cc@122
PS6, Line 122:     write.set_payload_bytes(10);
             :     write.set_num_write_threads(1);
             :     write.set_write_batch_size(1);
Why change this stuff? If no compelling reason, let's remove it so we don't have to wonder about it.


http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/integration-tests/timestamp_advancement-itest.cc@135
PS6, Line 135:     const int64_t rows_inserted = write.rows_inserted();
             :     for (int i = 0; i < cluster_->num_tablet_servers(); i++) {
             :       HostPort hp;
             :       hp.set_host(cluster_->mini_tablet_server(i)->bound_http_addr().host());
             :       hp.set_port(cluster_->mini_tablet_server(i)->bound_http_addr().port());
             :       ASSERT_EVENTUALLY([&] {
             :         int64_t rows_in_replica;
             :         WARN_NOT_OK(itest::GetInt64Metric(hp,
             :             &METRIC_ENTITY_tablet, nullptr, &METRIC_rows_inserted, "value", &rows_in_replica),
             :             "Couldn't get metric... will retry");
             :         ASSERT_EQ(rows_inserted, rows_in_replica);
             :       });
             :     }
you should be able to replace this with:

    ASSERT_OK(WaitForServersToAgree(kTimeout, ts_map_, tablet_id, workload.batches_completed()));


http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/integration-tests/timestamp_advancement-itest.cc@149
PS6, Line 149: segement
segment


http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/integration-tests/timestamp_advancement-itest.cc@150
PS6, Line 150: onto
nit: to


http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/integration-tests/timestamp_advancement-itest.cc@151
PS6, Line 151: assert eventually here
not seeing this


http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/integration-tests/timestamp_advancement-itest.cc@181
PS6, Line 181:     return replicas[0];
nit: maybe add a CHECK_EQ(1, replicas.size());


http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/integration-tests/timestamp_advancement-itest.cc@186
PS6, Line 186: HasNoWriteOps
nit: Rename to CheckForWriteOpsInLog() ?


http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/integration-tests/timestamp_advancement-itest.cc@199
PS6, Line 199: s.IsCorruption()
nit: we don't expect a truncated log file in this test, do we?


http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/integration-tests/timestamp_advancement-itest.cc@202
PS6, Line 202:           case log::COMMIT:
do we care about commit messages? if they don't have a matching replicate message then they are dropped during tablet bootstrap


http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/integration-tests/timestamp_advancement-itest.cc@223
PS6, Line 223: leave
s/leave//


http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/integration-tests/timestamp_advancement-itest.cc@224
PS6, Line 224: timestamps
safe time timestamp


http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/integration-tests/timestamp_advancement-itest.cc@225
PS6, Line 225: TestTimestampsAdvancedFromBootstrapNoOp
nit: TestNoOpAdvancesMvccSafeTimeOnBootstrap ?


http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/integration-tests/timestamp_advancement-itest.cc@226
PS6, Line 226:   // Set a low Raft heartbeat and encourage frequent elections.
... so that we can fill up the WAL with no-op entries naturally


http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/integration-tests/timestamp_advancement-itest.cc@272
PS6, Line 272: 
nit: extra blank line


http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/tablet/tablet_bootstrap.cc
File src/kudu/tablet/tablet_bootstrap.cc:

http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/tablet/tablet_bootstrap.cc@1097
PS6, Line 1097: assigned serially with respect to
              :   // one another.
suggestion: guaranteed to be monotonic with respect to the write-ahead log.


http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/tablet/tablet_bootstrap.cc@1113
PS6, Line 1113:     default:
Hmm. I wonder if we had a bug excluding alter schema here. Do you know whether alter schema provides the same guarantees that writes provide?

I think I would be more comfortable writing this with the default set to false, something like:

timestamp_assigned_in_opid_order = false;
switch (op_type) {
  case WRITE_OP:
    timestamp_assigned_in_opid_order = true;
    break;
  case NO_OP: {
    const auto& req = replicate->noop_request();
    if (req.has_timestamp_in_opid_order()) {
      timestamp_assigned_in_opid_order = req.timestamp_in_opid_order();
      break;
    }
    timestamp_assigned_in_opid_order = true;
    break;
  }
  default:
    break;
}



-- 
To view, visit http://gerrit.cloudera.org:8080/11142
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I26deff32da8c990cb8a2ba220bb81858ddd6d73f
Gerrit-Change-Number: 11142
Gerrit-PatchSet: 6
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Comment-Date: Sat, 15 Sep 2018 00:15:20 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2463 pt 1: adjust MVCC when replaying no-ops

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/11142 )

Change subject: KUDU-2463 pt 1: adjust MVCC when replaying no-ops
......................................................................


Patch Set 10:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/11142/9//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/11142/9//COMMIT_MSG@23
PS9, Line 23: takes advantag
> takes advantage
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/11142
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I26deff32da8c990cb8a2ba220bb81858ddd6d73f
Gerrit-Change-Number: 11142
Gerrit-PatchSet: 10
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Comment-Date: Fri, 21 Sep 2018 19:23:27 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2463 pt 1: adjust MVCC when replaying no-ops

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Mike Percy, Kudu Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/11142

to look at the new patch set (#9).

Change subject: KUDU-2463 pt 1: adjust MVCC when replaying no-ops
......................................................................

KUDU-2463 pt 1: adjust MVCC when replaying no-ops

Previously, during tablet bootstrap, a tablet replica would only update
its MVCC safetime based on write and alter schema messages, as the
timestamps in these messages are guaranteed to be serialized with
respect to one another, by virtue of being assigned in a single thread
(the prepare thread) on the leader replica.

From this, we conclude that timestamps for write and alter schema
operations are monotonically increasing in unison with opid. Because
no-op and change config operations are not serialized through the
prepare thread, they don't necessarily provide the same timestamp
monotonicity guarantees.

However, in the case of no-ops replicated to assert leadership, we get
the same timestamp monotonicity guarantee due to a different mechanism.
This patch tackes advance of the fact that our Raft implementation
ensures the following sequence of events:

1. replica A becomes leader of Term N
2. leader A assigns a timestamp t1 to its no-op
3. leader A replicates the no-op to replicas B and C, asserting its
   leadership for Term N
4. leader A prepares a write and assigns it a timestamp t2 > t1
5. leader A replicates the write to replicas B and C, checking that it
   is leader for the Term N

Given the above series of operations, for a given term, the no-op used
to assert leadership is always assigned a timestamp that must be lower
than any other ops in that term. As such, the timestamps assigned to
no-ops can and should be used to bump MVCC safe time. This patch does so
at bootstrap time with committed no-ops found in the WAL.

This alone isn't enough to always prevent KUDU-2463, but it dramatically
reduces the likelihood of it occuring.

Change-Id: I26deff32da8c990cb8a2ba220bb81858ddd6d73f
---
M src/kudu/consensus/consensus.proto
M src/kudu/integration-tests/CMakeLists.txt
M src/kudu/integration-tests/raft_consensus-itest.cc
A src/kudu/integration-tests/timestamp_advancement-itest.cc
M src/kudu/tablet/tablet_bootstrap-test.cc
M src/kudu/tablet/tablet_bootstrap.cc
6 files changed, 276 insertions(+), 9 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/42/11142/9
-- 
To view, visit http://gerrit.cloudera.org:8080/11142
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I26deff32da8c990cb8a2ba220bb81858ddd6d73f
Gerrit-Change-Number: 11142
Gerrit-PatchSet: 9
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>

[kudu-CR] KUDU-2463 pt 1: adjust MVCC when replaying no-ops

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/11142 )

Change subject: KUDU-2463 pt 1: adjust MVCC when replaying no-ops
......................................................................


Patch Set 7:

(32 comments)

http://gerrit.cloudera.org:8080/#/c/11142/6//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/11142/6//COMMIT_MSG@10
PS6, Line 10: r schema 
> timestamps
Done


http://gerrit.cloudera.org:8080/#/c/11142/6//COMMIT_MSG@17
PS6, Line 17: cannot necessarily be said for timestamps of no-ops and change configs.
> also schema changes (I think)
Good callout on schema changes, though I think they fall into the same bucket as write ops. They both use the TransactionDriver and get their timestamps assigned on the prepare thread.


http://gerrit.cloudera.org:8080/#/c/11142/6//COMMIT_MSG@37
PS6, Line 37: This alone isn't enough to prevent KUDU-2463, but it reduces the window
> The cases where it doesn't prevent the problem are super edge case, right? 
Right, with Part 2, this is a pretty narrow edge case.


http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/consensus/consensus.proto
File src/kudu/consensus/consensus.proto:

http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/consensus/consensus.proto@235
PS6, Line 235: If
> If
Done


http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/consensus/consensus.proto@236
PS6, Line 236: to be true.
> to be true
Done


http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/integration-tests/raft_consensus-itest.cc
File src/kudu/integration-tests/raft_consensus-itest.cc:

http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/integration-tests/raft_consensus-itest.cc@259
PS6, Line 259: stamp
> is this needed for this test to pass? or just for debuggability upon test f
Yep, it's needed. Right now, the first no-op is assigned a timestamp of 1, so we end up committing timestamp 1. With it, we update the clean time to 1. Then, we try to replay WAL entries with timestamp 1 and crash because timestamp 1 has already been committed.


http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/integration-tests/timestamp_advancement-itest.cc
File src/kudu/integration-tests/timestamp_advancement-itest.cc:

PS6: 
> Did clang-tidy run on this file?
Seems Adar found the issue with clang tidy
https://gerrit.cloudera.org/c/11457/


http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/integration-tests/timestamp_advancement-itest.cc@78
PS6, Line 78: namespace itest {
> nit: hrm, is there a reason you put these outside the namespace? seems easi
Done


http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/integration-tests/timestamp_advancement-itest.cc@79
PS6, Line 79: 
> nit: out of alpha order
Done


http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/integration-tests/timestamp_advancement-itest.cc@86
PS6, Line 86: 
> not used?
Done


http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/integration-tests/timestamp_advancement-itest.cc@90
PS6, Line 90:     // Set a low batch size so we have finer-grained control over flushing of
> not used?
Done


http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/integration-tests/timestamp_advancement-itest.cc@94
PS6, Line 94:     write.Start();
> not used
Done


http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/integration-tests/timestamp_advancement-itest.cc@107
PS6, Line 107:     // Flush the current log batch and roll over to get a fresh WAL segment.
> nit: not used
Done


http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/integration-tests/timestamp_advancement-itest.cc@108
PS6, Line 108:     ASSERT_OK(tablet_replica->log()->WaitUntilAllFlushed());
> nit: out of alpha order
Done


http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/integration-tests/timestamp_advancement-itest.cc@112
PS6, Line 112: SERT_OK(tablet_replica->ta
> nit: TimestampAdvancementITest ?
Done


http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/integration-tests/timestamp_advancement-itest.cc@121
PS6, Line 121: 
> this is the default
Done


http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/integration-tests/timestamp_advancement-itest.cc@122
PS6, Line 122:   // Get the tablet replica on the tablet server 'ts'.
             :   scoped_refptr<TabletReplica> tablet_replica_on_ts(int ts) const {
             :     vector<scoped_refptr<TabletRep
> Why change this stuff? If no compelling reason, let's remove it so we don't
Done


http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/integration-tests/timestamp_advancement-itest.cc@135
PS6, Line 135:     RETURN_NOT_OK(LogReader::Open(env_,
             :                   ts->server()->fs_manager()->GetTabletWalDir(tablet_id),
             :                   scoped_refptr<log::LogIndex>(), tablet_id,
             :                   scoped_refptr<MetricEntity>(), &reader));
             :     log::SegmentSequence segs;
             :     RETURN_NOT_OK(reader->GetSegmentsSnapshot(&segs));
             :     unique_ptr<log::LogEntryPB> entry;
             :     for (const auto& seg : segs) {
             :       log::LogEntryReader reader(seg.get());
             :       while (true) {
             :         Status s = reader.ReadNextEntry(&entry);
             :         if (s.IsEndOfFile()) break;
             :      
> you should be able to replace this with:
Done


http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/integration-tests/timestamp_advancement-itest.cc@149
PS6, Line 149: 
> segment
Done


http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/integration-tests/timestamp_advancement-itest.cc@150
PS6, Line 150: repl
> nit: to
Done


http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/integration-tests/timestamp_advancement-itest.cc@151
PS6, Line 151: urn Status::OK();
> not seeing this
Done


http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/integration-tests/timestamp_advancement-itest.cc@181
PS6, Line 181:   MiniTabletServer* ts = tserver(kTserver);
> nit: maybe add a CHECK_EQ(1, replicas.size());
Done


http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/integration-tests/timestamp_advancement-itest.cc@186
PS6, Line 186: _leader_failu
> nit: Rename to CheckForWriteOpsInLog() ?
Done


http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/integration-tests/timestamp_advancement-itest.cc@199
PS6, Line 199: y_ms_in_notifica
> nit: we don't expect a truncated log file in this test, do we?
Done


http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/integration-tests/timestamp_advancement-itest.cc@202
PS6, Line 202:     int64_t gcable_size;
> do we care about commit messages? if they don't have a matching replicate m
Done


http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/integration-tests/timestamp_advancement-itest.cc@223
PS6, Line 223: 
> s/leave//
Done


http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/integration-tests/timestamp_advancement-itest.cc@224
PS6, Line 224: start());
> safe time timestamp
Done


http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/integration-tests/timestamp_advancement-itest.cc@225
PS6, Line 225: OrDie(ts_map_, ts->uuid());
> nit: TestNoOpAdvancesMvccSafeTimeOnBootstrap ?
Done


http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/integration-tests/timestamp_advancement-itest.cc@226
PS6, Line 226: 
> ... so that we can fill up the WAL with no-op entries naturally
Done


http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/integration-tests/timestamp_advancement-itest.cc@272
PS6, Line 272: 
> nit: extra blank line
Done


http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/tablet/tablet_bootstrap.cc
File src/kudu/tablet/tablet_bootstrap.cc:

http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/tablet/tablet_bootstrap.cc@1097
PS6, Line 1097: guaranteed to be monotonically
              :   // increasing w
> suggestion: guaranteed to be monotonic with respect to the write-ahead log.
Done


http://gerrit.cloudera.org:8080/#/c/11142/6/src/kudu/tablet/tablet_bootstrap.cc@1113
PS6, Line 1113:   }
> Hmm. I wonder if we had a bug excluding alter schema here. Do you know whet
Chatted briefly about this on Slack: alter schema requests go through the prepare thread, same as writes, so they are serialized. Given that, I left this defaulted to true to match the implementation before this patch.



-- 
To view, visit http://gerrit.cloudera.org:8080/11142
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I26deff32da8c990cb8a2ba220bb81858ddd6d73f
Gerrit-Change-Number: 11142
Gerrit-PatchSet: 7
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Comment-Date: Tue, 18 Sep 2018 07:30:30 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2463 pt 1: adjust MVCC when replaying no-ops

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/11142 )

Change subject: KUDU-2463 pt 1: adjust MVCC when replaying no-ops
......................................................................


Patch Set 8: Code-Review+1

(6 comments)

looks ready to go, just a few more wording nits

http://gerrit.cloudera.org:8080/#/c/11142/8//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/11142/8//COMMIT_MSG@16
PS8, Line 16: The same
            : cannot necessarily be said for timestamps of no-ops and change configs.
Because no-op and change config operations are not serialized through the prepare thread, they don't necessarily provide the same timestamp monotonicity guarantees.


http://gerrit.cloudera.org:8080/#/c/11142/8//COMMIT_MSG@19
PS8, Line 19: This is a conservative conclusion about assigned timestamps, and this
            : patch hinges on the fact
However in the case of no-ops replicated to assert leadership, we get the same timestamp monotonicity guarantee due to a different mechanism. This patch takes advantage of the fact that...


http://gerrit.cloudera.org:8080/#/c/11142/8//COMMIT_MSG@31
PS8, Line 31: within the same term
for a given term


http://gerrit.cloudera.org:8080/#/c/11142/8//COMMIT_MSG@34
PS8, Line 34: safetime
MVCC safe time


http://gerrit.cloudera.org:8080/#/c/11142/8//COMMIT_MSG@37
PS8, Line 37: prevent
always prevent


http://gerrit.cloudera.org:8080/#/c/11142/8//COMMIT_MSG@37
PS8, Line 37: reduces the window
            : during which incorrect results can be returned
dramatically reduces the likelihood of it occurring



-- 
To view, visit http://gerrit.cloudera.org:8080/11142
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I26deff32da8c990cb8a2ba220bb81858ddd6d73f
Gerrit-Change-Number: 11142
Gerrit-PatchSet: 8
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Comment-Date: Thu, 20 Sep 2018 21:35:19 +0000
Gerrit-HasComments: Yes

[kudu-CR] tablet bootstrap: adjust mvcc safetime with no-ops

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/11142 )

Change subject: tablet_bootstrap: adjust mvcc safetime with no-ops
......................................................................


Patch Set 1:

Ah, I omitted the part where we add the timestamp_in_opid_order to the NoOpRequestPB, hence the broken tests. Let me update this a bit.


-- 
To view, visit http://gerrit.cloudera.org:8080/11142
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I26deff32da8c990cb8a2ba220bb81858ddd6d73f
Gerrit-Change-Number: 11142
Gerrit-PatchSet: 1
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Comment-Date: Tue, 07 Aug 2018 18:46:53 +0000
Gerrit-HasComments: No

[kudu-CR] KUDU-2463 pt 1: adjust MVCC when replaying no-ops

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/11142 )

Change subject: KUDU-2463 pt 1: adjust MVCC when replaying no-ops
......................................................................


Patch Set 9: Code-Review+2

(1 comment)

http://gerrit.cloudera.org:8080/#/c/11142/9//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/11142/9//COMMIT_MSG@23
PS9, Line 23: tackes advance
takes advantage



-- 
To view, visit http://gerrit.cloudera.org:8080/11142
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I26deff32da8c990cb8a2ba220bb81858ddd6d73f
Gerrit-Change-Number: 11142
Gerrit-PatchSet: 9
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Comment-Date: Fri, 21 Sep 2018 18:58:47 +0000
Gerrit-HasComments: Yes