You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org> on 2016/12/03 17:10:03 UTC

[kudu-CR] Add integration tests for duplicate keys

David Ribeiro Alves has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/5349

Change subject: Add integration tests for duplicate keys
......................................................................

Add integration tests for duplicate keys

This splits the "crashy nodes" and "churny elections" test
of raft_consensus-itest into unique and duplicate key variants.
This change is meant to stress any possible deadlock scenarios
related to transaction commit/abort and 2-phase locking for
which we didn't have much coverage.

Additionally this also disallows timeouts on writes and requires
an exact count of the rows at the end. This is now possible
due to exactly once semantics.

Finally this changes the cluster verifier to use snapshot scans
and changes the timeout of another scan in that test. These
two changes deflaked the test from 27/1000 to 3/1000 with
asan, slow mode, and 1 stress thread (any more and the test
becomes much more flaky, as before).

Change-Id: I6d018281d412ae034bd7b70c8311077a52b2795d
---
M src/kudu/integration-tests/cluster_verifier.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
2 files changed, 122 insertions(+), 65 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/49/5349/1
-- 
To view, visit http://gerrit.cloudera.org:8080/5349
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I6d018281d412ae034bd7b70c8311077a52b2795d
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>

[kudu-CR] Add integration tests for duplicate keys

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5349

to look at the new patch set (#5).

Change subject: Add integration tests for duplicate keys
......................................................................

Add integration tests for duplicate keys

This splits the "crashy nodes" and "churny elections" test
of raft_consensus-itest into unique and duplicate key variants.
This change is meant to stress any possible deadlock scenarios
related to transaction commit/abort and 2-phase locking for
which we didn't have much coverage.

Additionally this also disallows timeouts on writes and requires
an exact count of the rows at the end. This is now possible
due to exactly once semantics.

Finally this changes the cluster verifier to use snapshot scans
and changes the timeout of another scan in that test. These
two changes deflaked the test from 27/1000 to 3/1000 with
asan, slow mode, and 1 stress thread (any more and the test
becomes much more flaky, as before). Of the 3 failures two
are unrelated (inability to start the webserver and a timeout
on another, unrelated test). The one failure that is related
to this patch is a snapshot scan anomaly and should be solved
by the safe time patches.

The coverage of these kinds of workloads is now much better.
An example of that is that revision 12 of [1] caused a deadlock
while aborting transactions out of order.  Looping
raft_consensus-itest in slow mode, asan, 1 stress thread would
previously not fail with the buggy code and now it fails
127/1000.

Results:
With master (3/1000 failures):
http://dist-test.cloudera.org//job?job_id=david.alves.1480767369.16205
With buggy patch [1] (127/1000 failures):
http://dist-test.cloudera.org//job?job_id=david.alves.1480805849.13039

[1] - https://gerrit.cloudera.org/#/c/5294/12

Change-Id: I6d018281d412ae034bd7b70c8311077a52b2795d
---
M src/kudu/integration-tests/cluster_verifier.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
2 files changed, 122 insertions(+), 65 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/49/5349/5
-- 
To view, visit http://gerrit.cloudera.org:8080/5349
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6d018281d412ae034bd7b70c8311077a52b2795d
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] Add integration tests for duplicate keys

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: Add integration tests for duplicate keys
......................................................................


Patch Set 8:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5349/8/src/kudu/integration-tests/raft_consensus-itest.cc
File src/kudu/integration-tests/raft_consensus-itest.cc:

Line 1396:   std::sort(results.begin(), results.end());
> I saw a failure of this test where it showed the second row first. I assume
OK, interesting. I did look this test 500 times when committing it, but not with stress threads. We'll watch the flaky dashboard for it.

One plausible thing (just a guess) is that when we commit the new ops, they are still in-flight in the "apply" queue, and could get reordered if the upsert of key 2 happens faster than the upsert of key 1 (which had a prior version). So if we aren't using SCAN_AT_SNAPSHOT we might see an order inversion here unless we put the verification in a loop.


-- 
To view, visit http://gerrit.cloudera.org:8080/5349
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6d018281d412ae034bd7b70c8311077a52b2795d
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Jean-Daniel Cryans <jd...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] Add integration tests for duplicate keys

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has submitted this change and it was merged.

Change subject: Add integration tests for duplicate keys
......................................................................


Add integration tests for duplicate keys

This splits the "crashy nodes" and "churny elections" test
of raft_consensus-itest into unique and duplicate key variants.
This change is meant to stress any possible deadlock scenarios
related to transaction commit/abort and 2-phase locking for
which we didn't have much coverage.

Additionally this also disallows timeouts on writes and requires
an exact count of the rows at the end. This is now possible
due to exactly once semantics.

Finally this changes the cluster verifier to use snapshot scans
and changes the timeout of another scan in that test. These
two changes deflaked the test from 27/1000 to 3/1000 with
asan, slow mode, and 1 stress thread (any more and the test
becomes much more flaky, as before). Of the 3 failures two
are unrelated (inability to start the webserver and a timeout
on another, unrelated test). The one failure that is related
to this patch is a snapshot scan anomaly and should be solved
by the safe time patches.

The coverage of these kinds of workloads is now much better.
An example of that is that revision 12 of [1] caused a deadlock
while aborting transactions out of order.  Looping
raft_consensus-itest in slow mode, asan, 1 stress thread would
previously not fail with the buggy code and now it fails
127/1000.

Results:
With master (3/1000 failures):
http://dist-test.cloudera.org//job?job_id=david.alves.1480767369.16205
With buggy patch[1] (127/1000 failures):
http://dist-test.cloudera.org//job?job_id=david.alves.1480805849.13039

[1] - https://gerrit.cloudera.org/#/c/5294/12

Change-Id: I6d018281d412ae034bd7b70c8311077a52b2795d
Reviewed-on: http://gerrit.cloudera.org:8080/5349
Reviewed-by: Todd Lipcon <to...@apache.org>
Tested-by: Kudu Jenkins
---
M src/kudu/integration-tests/cluster_verifier.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
2 files changed, 126 insertions(+), 75 deletions(-)

Approvals:
  Todd Lipcon: Looks good to me, approved
  Kudu Jenkins: Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/5349
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I6d018281d412ae034bd7b70c8311077a52b2795d
Gerrit-PatchSet: 10
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Jean-Daniel Cryans <jd...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] Add integration tests for duplicate keys

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5349

to look at the new patch set (#9).

Change subject: Add integration tests for duplicate keys
......................................................................

Add integration tests for duplicate keys

This splits the "crashy nodes" and "churny elections" test
of raft_consensus-itest into unique and duplicate key variants.
This change is meant to stress any possible deadlock scenarios
related to transaction commit/abort and 2-phase locking for
which we didn't have much coverage.

Additionally this also disallows timeouts on writes and requires
an exact count of the rows at the end. This is now possible
due to exactly once semantics.

Finally this changes the cluster verifier to use snapshot scans
and changes the timeout of another scan in that test. These
two changes deflaked the test from 27/1000 to 3/1000 with
asan, slow mode, and 1 stress thread (any more and the test
becomes much more flaky, as before). Of the 3 failures two
are unrelated (inability to start the webserver and a timeout
on another, unrelated test). The one failure that is related
to this patch is a snapshot scan anomaly and should be solved
by the safe time patches.

The coverage of these kinds of workloads is now much better.
An example of that is that revision 12 of [1] caused a deadlock
while aborting transactions out of order.  Looping
raft_consensus-itest in slow mode, asan, 1 stress thread would
previously not fail with the buggy code and now it fails
127/1000.

Results:
With master (3/1000 failures):
http://dist-test.cloudera.org//job?job_id=david.alves.1480767369.16205
With buggy patch[1] (127/1000 failures):
http://dist-test.cloudera.org//job?job_id=david.alves.1480805849.13039

[1] - https://gerrit.cloudera.org/#/c/5294/12

Change-Id: I6d018281d412ae034bd7b70c8311077a52b2795d
---
M src/kudu/integration-tests/cluster_verifier.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
2 files changed, 126 insertions(+), 75 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/49/5349/9
-- 
To view, visit http://gerrit.cloudera.org:8080/5349
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6d018281d412ae034bd7b70c8311077a52b2795d
Gerrit-PatchSet: 9
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Jean-Daniel Cryans <jd...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] Add integration tests for duplicate keys

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5349

to look at the new patch set (#8).

Change subject: Add integration tests for duplicate keys
......................................................................

Add integration tests for duplicate keys

This splits the "crashy nodes" and "churny elections" test
of raft_consensus-itest into unique and duplicate key variants.
This change is meant to stress any possible deadlock scenarios
related to transaction commit/abort and 2-phase locking for
which we didn't have much coverage.

Additionally this also disallows timeouts on writes and requires
an exact count of the rows at the end. This is now possible
due to exactly once semantics.

Finally this changes the cluster verifier to use snapshot scans
and changes the timeout of another scan in that test. These
two changes deflaked the test from 27/1000 to 3/1000 with
asan, slow mode, and 1 stress thread (any more and the test
becomes much more flaky, as before). Of the 3 failures two
are unrelated (inability to start the webserver and a timeout
on another, unrelated test). The one failure that is related
to this patch is a snapshot scan anomaly and should be solved
by the safe time patches.

The coverage of these kinds of workloads is now much better.
An example of that is that revision 12 of [1] caused a deadlock
while aborting transactions out of order.  Looping
raft_consensus-itest in slow mode, asan, 1 stress thread would
previously not fail with the buggy code and now it fails
127/1000.

Results:
With master (3/1000 failures):
http://dist-test.cloudera.org//job?job_id=david.alves.1480767369.16205
With buggy patch[1] (127/1000 failures):
http://dist-test.cloudera.org//job?job_id=david.alves.1480805849.13039

[1] - https://gerrit.cloudera.org/#/c/5294/12

Change-Id: I6d018281d412ae034bd7b70c8311077a52b2795d
---
M src/kudu/integration-tests/cluster_verifier.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
2 files changed, 126 insertions(+), 72 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/49/5349/8
-- 
To view, visit http://gerrit.cloudera.org:8080/5349
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6d018281d412ae034bd7b70c8311077a52b2795d
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Jean-Daniel Cryans <jd...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] Add integration tests for duplicate keys

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change.

Change subject: Add integration tests for duplicate keys
......................................................................


Patch Set 7:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5349/7/src/kudu/integration-tests/cluster_verifier.cc
File src/kudu/integration-tests/cluster_verifier.cc:

PS7, Line 126:   CHECK_OK(scanner.SetReadMode(client::KuduScanner::READ_AT_SNAPSHOT));
             :   CHECK_OK(scanner.SetFaultTolerant());
             :   CHECK_OK(scanner.SetTimeoutMillis(5000));
             :   CHECK_OK(scanner.SetProjectedColumns(vector<string>()));
> We were already using this pattern. We RETURN_NOT_OK() on things that may f
From the caller's point of view, what is the difference if this method/function would return an error either from client->OpenTable() or scanner.SetReadMode()?

I think calling abort() does not make sense when it's possible to handle the error gracefully and staying consistent.  As I see it, the only case when you would want to call abort() is when it's clear the consistency of the code is compromised and it's impossible to continue safely.  This is not the case, IMO.


-- 
To view, visit http://gerrit.cloudera.org:8080/5349
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6d018281d412ae034bd7b70c8311077a52b2795d
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Jean-Daniel Cryans <jd...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] Add integration tests for duplicate keys

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5349

to look at the new patch set (#3).

Change subject: Add integration tests for duplicate keys
......................................................................

Add integration tests for duplicate keys

This splits the "crashy nodes" and "churny elections" test
of raft_consensus-itest into unique and duplicate key variants.
This change is meant to stress any possible deadlock scenarios
related to transaction commit/abort and 2-phase locking for
which we didn't have much coverage.

Additionally this also disallows timeouts on writes and requires
an exact count of the rows at the end. This is now possible
due to exactly once semantics.

Finally this changes the cluster verifier to use snapshot scans
and changes the timeout of another scan in that test. These
two changes deflaked the test from 27/1000 to 3/1000 with
asan, slow mode, and 1 stress thread (any more and the test
becomes much more flaky, as before).

The coverage of these kinds of workloads is now much better.
An example of that is that revision 12 of [1] caused a deadlock
while aborting transactions out of order.  Looping
raft_consensus-itest in slow mode, asan, 1 stress thread would
previously not fail with the buggy code and now it fails
127/1000.

Results:
With master (3/1000 failures):
http://dist-test.cloudera.org//job?job_id=david.alves.1480767369.16205
With buggy patch [1] (127/1000):
http://dist-test.cloudera.org//job?job_id=david.alves.1480805849.13039

[1] - https://gerrit.cloudera.org/#/c/5294/12

Change-Id: I6d018281d412ae034bd7b70c8311077a52b2795d
---
M src/kudu/integration-tests/cluster_verifier.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
2 files changed, 122 insertions(+), 65 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/49/5349/3
-- 
To view, visit http://gerrit.cloudera.org:8080/5349
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6d018281d412ae034bd7b70c8311077a52b2795d
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] Add integration tests for duplicate keys

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: Add integration tests for duplicate keys
......................................................................


Patch Set 8:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5349/8/src/kudu/integration-tests/raft_consensus-itest.cc
File src/kudu/integration-tests/raft_consensus-itest.cc:

PS8, Line 991:   // Ensure that the replicas converge.
             :   // We don't know exactly how many rows got inserted, since the writer
             :   // probably saw many errors which left inserts in indeterminate state.
             :   // But, we should have at least as many as we got confirmation for.
> need to update this comment
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/5349
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6d018281d412ae034bd7b70c8311077a52b2795d
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Jean-Daniel Cryans <jd...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] Add integration tests for duplicate keys

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: Add integration tests for duplicate keys
......................................................................


Patch Set 8:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/5349/8/src/kudu/integration-tests/raft_consensus-itest.cc
File src/kudu/integration-tests/raft_consensus-itest.cc:

PS8, Line 991:   // Ensure that the replicas converge.
             :   // We don't know exactly how many rows got inserted, since the writer
             :   // probably saw many errors which left inserts in indeterminate state.
             :   // But, we should have at least as many as we got confirmation for.
need to update this comment


Line 1396:   std::sort(results.begin(), results.end());
hrm, is this necessary? WaitForRowCount() uses ScanReplica() which has the same sort at the end, unless I'm missing something.


-- 
To view, visit http://gerrit.cloudera.org:8080/5349
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6d018281d412ae034bd7b70c8311077a52b2795d
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Jean-Daniel Cryans <jd...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] Add integration tests for duplicate keys

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change.

Change subject: Add integration tests for duplicate keys
......................................................................


Patch Set 7:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/5349/7/src/kudu/integration-tests/cluster_verifier.cc
File src/kudu/integration-tests/cluster_verifier.cc:

PS7, Line 126:   CHECK_OK(scanner.SetReadMode(client::KuduScanner::READ_AT_SNAPSHOT));
             :   CHECK_OK(scanner.SetFaultTolerant());
             :   CHECK_OK(scanner.SetTimeoutMillis(5000));
             :   CHECK_OK(scanner.SetProjectedColumns(vector<string>()));
Is it so crucial to abort if something goes non-OK among those?  Looking at that without knowing much of context here, I those non-OK returns would not break consistency, so I would expect that simple RETURN_NOT_OK() or RETURN_NOT_OK_PREPEND() would work here as well.


PS7, Line 129: vector<string>()
nit: consider using '{}' instead of 'vector<string>()' for brevity.


http://gerrit.cloudera.org:8080/#/c/5349/7/src/kudu/integration-tests/raft_consensus-itest.cc
File src/kudu/integration-tests/raft_consensus-itest.cc:

PS7, Line 359:   static const bool WITH_NOTIFICATION_LATENCY = true;
             :   static const bool WITHOUT_NOTIFICATION_LATENCY = false;
Are these needed in the new code?


PS7, Line 870: kCrashesToCause
nit: since it's not static or global, consider renaming into 'crashes_to_cause'


PS7, Line 907: CHECK_OK
nit: ASSERT_OK() ?


PS7, Line 947: 3)
Is batch of size 3 is enough to get higher chance of collision?  I wonder what would happen if this set to, say, 50.


PS7, Line 954: master_flags
It seems this stays empty, consider removing this variable and using '{}' for corresponding argument.


PS7, Line 959:   ts_flags.push_back("--raft_heartbeat_interval_ms=5");
             : #else
             :   ts_flags.push_back("--raft_heartbeat_interval_ms=1");
             : #endif
             :   ts_flags.push_back("--leader_failure_monitor_check_mean_ms=1");
             :   ts_flags.push_back("--leader_failure_monitor_check_stddev_ms=1");
             :   ts_flags.push_back("--never_fsync");
nit: consider replacing the construction of the ts_flags via push_back() with list initializer instead:

vector<string> ts_flags = {
  "--raft_heartbeat_interval_ms=5",
  ...
};


PS7, Line 968: master_flags
As described above: replace with {}.


PS7, Line 1009: TEST_F(RaftConsensusITest, TestChurnyElections) {
              :   const int kNumWrites = AllowSlowTests() ? 10000 : 1000;
              :   CreateClusterForChurnyElectionsTests({});
              :   TestWorkload workload(cluster_.get());
              :   workload.set_write_batch_size(1);
              :   DoTestChurnyElections(&workload, kNumWrites);
              : }
              : 
              : // The same test, except inject artificial latency when propagating notifications
              : // from the queue back to consensus. This previously reproduced bugs like KUDU-1078 which
              : // normally only appear under high load.
              : TEST_F(RaftConsensusITest, TestChurnyElections_WithNotificationLatency) {
              :   CreateClusterForChurnyElectionsTests({"--consensus_inject_latency_ms_in_notifications=50"});
              :   const int kNumWrites = AllowSlowTests() ? 10000 : 1000;
              :   TestWorkload workload(cluster_.get());
              :   workload.set_write_batch_size(1);
              :   DoTestChurnyElections(&workload, kNumWrites);
              : }
              : 
              : // The same as TestChurnyElections except insert many duplicated rows.
              : // This emulates cases where there are many duplicate keys which, due to two phase
              : // locking, may cause deadlocks and other anomalies that cannot be observed when
              : // keys are unique.
              : TEST_F(RaftConsensusITest, TestChurnyElections_WithDuplicateKeys) {
              :   CreateClusterForChurnyElectionsTests({});
              :   const int kNumWrites = AllowSlowTests() ? 10000 : 1000;
              :   TestWorkload workload(cluster_.get());
              :   workload.set_write_pattern(TestWorkload::INSERT_WITH_MANY_DUP_KEYS);
              :   // Increase the number of rows per batch to get a higher chance of key collision.
              :   workload.set_write_batch_size(3);
              :   DoTestChurnyElections(&workload, kNumWrites);
These look like candidates for extracting a common function and then re-using it with different set of parameters correspondingly.


-- 
To view, visit http://gerrit.cloudera.org:8080/5349
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6d018281d412ae034bd7b70c8311077a52b2795d
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Jean-Daniel Cryans <jd...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] Add integration tests for duplicate keys

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5349

to look at the new patch set (#7).

Change subject: Add integration tests for duplicate keys
......................................................................

Add integration tests for duplicate keys

This splits the "crashy nodes" and "churny elections" test
of raft_consensus-itest into unique and duplicate key variants.
This change is meant to stress any possible deadlock scenarios
related to transaction commit/abort and 2-phase locking for
which we didn't have much coverage.

Additionally this also disallows timeouts on writes and requires
an exact count of the rows at the end. This is now possible
due to exactly once semantics.

Finally this changes the cluster verifier to use snapshot scans
and changes the timeout of another scan in that test. These
two changes deflaked the test from 27/1000 to 3/1000 with
asan, slow mode, and 1 stress thread (any more and the test
becomes much more flaky, as before). Of the 3 failures two
are unrelated (inability to start the webserver and a timeout
on another, unrelated test). The one failure that is related
to this patch is a snapshot scan anomaly and should be solved
by the safe time patches.

The coverage of these kinds of workloads is now much better.
An example of that is that revision 12 of [1] caused a deadlock
while aborting transactions out of order.  Looping
raft_consensus-itest in slow mode, asan, 1 stress thread would
previously not fail with the buggy code and now it fails
127/1000.

Results:
With master (3/1000 failures):
http://dist-test.cloudera.org//job?job_id=david.alves.1480767369.16205
With buggy patch[1] (127/1000 failures):
http://dist-test.cloudera.org//job?job_id=david.alves.1480805849.13039

[1] - https://gerrit.cloudera.org/#/c/5294/12

Change-Id: I6d018281d412ae034bd7b70c8311077a52b2795d
---
M src/kudu/integration-tests/cluster_verifier.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
2 files changed, 120 insertions(+), 64 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/49/5349/7
-- 
To view, visit http://gerrit.cloudera.org:8080/5349
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6d018281d412ae034bd7b70c8311077a52b2795d
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Jean-Daniel Cryans <jd...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] Add integration tests for duplicate keys

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: Add integration tests for duplicate keys
......................................................................


Patch Set 9:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5349/8/src/kudu/integration-tests/raft_consensus-itest.cc
File src/kudu/integration-tests/raft_consensus-itest.cc:

Line 1396: }
> hrm, is this necessary? WaitForRowCount() uses ScanReplica() which has the 
I saw a failure of this test where it showed the second row first. I assumed since the WaitForRowCount didn't fail it was misordered. if you're saying it can't be then something else is wrong. I removed this. When it fails again we can look more.


-- 
To view, visit http://gerrit.cloudera.org:8080/5349
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6d018281d412ae034bd7b70c8311077a52b2795d
Gerrit-PatchSet: 9
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Jean-Daniel Cryans <jd...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] Add integration tests for duplicate keys

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: Add integration tests for duplicate keys
......................................................................


Patch Set 9: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/5349
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6d018281d412ae034bd7b70c8311077a52b2795d
Gerrit-PatchSet: 9
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Jean-Daniel Cryans <jd...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] Add raft integration tests with duplicate keys

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5349

to look at the new patch set (#2).

Change subject: Add raft integration tests with duplicate keys
......................................................................

Add raft integration tests with duplicate keys

This splits the "crashy nodes" and "churny elections" tests
of raft_consensus-itest into unique and duplicate key variants.
This change is meant to stress any possible deadlock scenarios
related to transaction commit/abort and 2-phase locking for
which we didn't have much coverage.

Additionally this also disallows timeouts on writes and requires
an exact count of the rows at the end. This is now possible
due to exactly once semantics.

Finally this changes the cluster verifier to use snapshot scans
and changes the timeout of another scan in that test. These
two changes deflaked the test from 27/1000 to 3/1000 with
asan, slow mode, and 1 stress thread (any more and the test
becomes much more flaky, as before).

Change-Id: I6d018281d412ae034bd7b70c8311077a52b2795d
---
M src/kudu/integration-tests/cluster_verifier.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
2 files changed, 120 insertions(+), 64 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/49/5349/2
-- 
To view, visit http://gerrit.cloudera.org:8080/5349
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6d018281d412ae034bd7b70c8311077a52b2795d
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins

[kudu-CR] Add integration tests for duplicate keys

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: Add integration tests for duplicate keys
......................................................................


Patch Set 7:

(10 comments)

http://gerrit.cloudera.org:8080/#/c/5349/7/src/kudu/integration-tests/cluster_verifier.cc
File src/kudu/integration-tests/cluster_verifier.cc:

PS7, Line 126:   CHECK_OK(scanner.SetReadMode(client::KuduScanner::READ_AT_SNAPSHOT));
             :   CHECK_OK(scanner.SetFaultTolerant());
             :   CHECK_OK(scanner.SetTimeoutMillis(5000));
             :   CHECK_OK(scanner.SetProjectedColumns(vector<string>()));
> Is it so crucial to abort if something goes non-OK among those?  Looking at
We were already using this pattern. We RETURN_NOT_OK() on things that may fail in the test or the things we're testing and we CHECK_OK() on the invariants. You should RETURN_NOT_OK() when there is a chance that the caller can do something about it (e.g. as in this case in CheckRowCount vs CheckRowCountWithRetries) otherwise ASSERT_OK/CHECK_OK is better. We can't use ASSERT here, so CHECK will have to do


PS7, Line 129: vector<string>()
> nit: consider using '{}' instead of 'vector<string>()' for brevity.
Done


http://gerrit.cloudera.org:8080/#/c/5349/7/src/kudu/integration-tests/raft_consensus-itest.cc
File src/kudu/integration-tests/raft_consensus-itest.cc:

PS7, Line 359:   static const bool WITH_NOTIFICATION_LATENCY = true;
             :   static const bool WITHOUT_NOTIFICATION_LATENCY = false;
> Are these needed in the new code?
Done


PS7, Line 870: kCrashesToCause
> nit: since it's not static or global, consider renaming into 'crashes_to_ca
Done


PS7, Line 907: CHECK_OK
> nit: ASSERT_OK() ?
Done


PS7, Line 947: 3)
> Is batch of size 3 is enough to get higher chance of collision?  I wonder w
were only inserting 300 rows so this would be finished super fast. IMO 3 is a good tradeoff between speed and coverage. There are only 20 dup keys. Note that even with 1 there is a chance of collision.


PS7, Line 954: master_flags
> It seems this stays empty, consider removing this variable and using '{}' f
Done


PS7, Line 959:   ts_flags.push_back("--raft_heartbeat_interval_ms=5");
             : #else
             :   ts_flags.push_back("--raft_heartbeat_interval_ms=1");
             : #endif
             :   ts_flags.push_back("--leader_failure_monitor_check_mean_ms=1");
             :   ts_flags.push_back("--leader_failure_monitor_check_stddev_ms=1");
             :   ts_flags.push_back("--never_fsync");
> nit: consider replacing the construction of the ts_flags via push_back() wi
this was not changed in the patch. would rather not stray too much into random backlog fixes. Besides even though I do prefer initialized lists, we'de have to have an ifdef in the middle (or resort to push_back again) which I think is worse


PS7, Line 968: master_flags
> As described above: replace with {}.
Done


PS7, Line 1009: TEST_F(RaftConsensusITest, TestChurnyElections) {
              :   const int kNumWrites = AllowSlowTests() ? 10000 : 1000;
              :   CreateClusterForChurnyElectionsTests({});
              :   TestWorkload workload(cluster_.get());
              :   workload.set_write_batch_size(1);
              :   DoTestChurnyElections(&workload, kNumWrites);
              : }
              : 
              : // The same test, except inject artificial latency when propagating notifications
              : // from the queue back to consensus. This previously reproduced bugs like KUDU-1078 which
              : // normally only appear under high load.
              : TEST_F(RaftConsensusITest, TestChurnyElections_WithNotificationLatency) {
              :   CreateClusterForChurnyElectionsTests({"--consensus_inject_latency_ms_in_notifications=50"});
              :   const int kNumWrites = AllowSlowTests() ? 10000 : 1000;
              :   TestWorkload workload(cluster_.get());
              :   workload.set_write_batch_size(1);
              :   DoTestChurnyElections(&workload, kNumWrites);
              : }
              : 
              : // The same as TestChurnyElections except insert many duplicated rows.
              : // This emulates cases where there are many duplicate keys which, due to two phase
              : // locking, may cause deadlocks and other anomalies that cannot be observed when
              : // keys are unique.
              : TEST_F(RaftConsensusITest, TestChurnyElections_WithDuplicateKeys) {
              :   CreateClusterForChurnyElectionsTests({});
              :   const int kNumWrites = AllowSlowTests() ? 10000 : 1000;
              :   TestWorkload workload(cluster_.get());
              :   workload.set_write_pattern(TestWorkload::INSERT_WITH_MANY_DUP_KEYS);
              :   // Increase the number of rows per batch to get a higher chance of key collision.
              :   workload.set_write_batch_size(3);
              :   DoTestChurnyElections(&workload, kNumWrites);
> These look like candidates for extracting a common function and then re-usi
that's what this patch does :) we need common code to create the cluster, then setup a custom workload then common code to run it. all common workload params are set on the common methods
I also like that we can tweak the number of writes and workload params per test to be able to address flakyness individually


-- 
To view, visit http://gerrit.cloudera.org:8080/5349
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6d018281d412ae034bd7b70c8311077a52b2795d
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Jean-Daniel Cryans <jd...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] Add integration tests for duplicate keys

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5349

to look at the new patch set (#6).

Change subject: Add integration tests for duplicate keys
......................................................................

Add integration tests for duplicate keys

This splits the "crashy nodes" and "churny elections" test
of raft_consensus-itest into unique and duplicate key variants.
This change is meant to stress any possible deadlock scenarios
related to transaction commit/abort and 2-phase locking for
which we didn't have much coverage.

Additionally this also disallows timeouts on writes and requires
an exact count of the rows at the end. This is now possible
due to exactly once semantics.

Finally this changes the cluster verifier to use snapshot scans
and changes the timeout of another scan in that test. These
two changes deflaked the test from 27/1000 to 3/1000 with
asan, slow mode, and 1 stress thread (any more and the test
becomes much more flaky, as before). Of the 3 failures two
are unrelated (inability to start the webserver and a timeout
on another, unrelated test). The one failure that is related
to this patch is a snapshot scan anomaly and should be solved
by the safe time patches.

The coverage of these kinds of workloads is now much better.
An example of that is that revision 12 of [1] caused a deadlock
while aborting transactions out of order.  Looping
raft_consensus-itest in slow mode, asan, 1 stress thread would
previously not fail with the buggy code and now it fails
127/1000.

Results:
With master (3/1000 failures):
http://dist-test.cloudera.org//job?job_id=david.alves.1480767369.16205
With buggy patch [1] (127/1000 failures):
http://dist-test.cloudera.org//job?job_id=david.alves.1480805849.13039

[1] - https://gerrit.cloudera.org/#/c/5294/12

Change-Id: I6d018281d412ae034bd7b70c8311077a52b2795d
---
M src/kudu/integration-tests/cluster_verifier.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
2 files changed, 120 insertions(+), 64 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/49/5349/6
-- 
To view, visit http://gerrit.cloudera.org:8080/5349
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6d018281d412ae034bd7b70c8311077a52b2795d
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Jean-Daniel Cryans <jd...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] Add integration tests for duplicate keys

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5349

to look at the new patch set (#4).

Change subject: Add integration tests for duplicate keys
......................................................................

Add integration tests for duplicate keys

This splits the "crashy nodes" and "churny elections" test
of raft_consensus-itest into unique and duplicate key variants.
This change is meant to stress any possible deadlock scenarios
related to transaction commit/abort and 2-phase locking for
which we didn't have much coverage.

Additionally this also disallows timeouts on writes and requires
an exact count of the rows at the end. This is now possible
due to exactly once semantics.

Finally this changes the cluster verifier to use snapshot scans
and changes the timeout of another scan in that test. These
two changes deflaked the test from 27/1000 to 3/1000 with
asan, slow mode, and 1 stress thread (any more and the test
becomes much more flaky, as before). Of the 3 failures two
are unrelated (inability to start the webserver and a timeout
on another, unrelated test). The one failure that is related
to this patch is a snapshot scan anomaly and should be solved
by the safe time patches.

The coverage of these kinds of workloads is now much better.
An example of that is that revision 12 of [1] caused a deadlock
while aborting transactions out of order.  Looping
raft_consensus-itest in slow mode, asan, 1 stress thread would
previously not fail with the buggy code and now it fails
127/1000.

Results:
With master (3/1000 failures):
http://dist-test.cloudera.org//job?job_id=david.alves.1480767369.16205
With buggy patc [1] (127/1000 failures):
http://dist-test.cloudera.org//job?job_id=david.alves.1480805849.13039

[1] - https://gerrit.cloudera.org/#/c/5294/12

Change-Id: I6d018281d412ae034bd7b70c8311077a52b2795d
---
M src/kudu/integration-tests/cluster_verifier.cc
M src/kudu/integration-tests/raft_consensus-itest.cc
2 files changed, 122 insertions(+), 65 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/49/5349/4
-- 
To view, visit http://gerrit.cloudera.org:8080/5349
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6d018281d412ae034bd7b70c8311077a52b2795d
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>