You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org> on 2016/12/02 05:38:46 UTC

[kudu-CR] [flaky tests] Fix "Already present" failures on raft consensus-itest

David Ribeiro Alves has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/5319

Change subject: [flaky tests] Fix "Already present" failures on raft_consensus-itest
......................................................................

[flaky tests] Fix "Already present" failures on raft_consensus-itest

In the flaky tests dashboarf TestSlowLeader fails with:

"Check failed: e->status().ok() Unexpected status: Already present: key already present"

This happens because it's possible for TestWorload to generate
identical random numbers on different threads, even though we use
a multiplicative linear congruential PRNG that is supposed to
generate all unique numbers within a single period of the PRNG.

This patch changes TestWorkload to use a ThreadSafeRandom. We
could also change the key type to int64 and do something like
int64 key = r.Next32() << 32 | thread_index, however changing
the type of the key is very invasive as a bunch of tests
depend on it.

This also increses the timeout of the snapshot scan in
TestReplicaBehaviorViaRPC as this would spuriously fail and
increases the time we wain on TestCommitIndexFarBehindAfterLeaderElection
which would cause spurious failures.

Results of running this on dist-test:
http://dist-test.cloudera.org//job?job_id=david.alves.1480656349.3518

To be fair I've only seen the failure "Already present" failure
outside of the flaky dashboard once, it's probably rarer than
1000 loops would allow to assert. However it's apparently difficult
to mimick the exact same conditions as the flaky dashboard tests:
running raft_consensus-itest with the stress option makes it incredibly
flaky, with different failures than the ones seen on the dashboard.
Not running it with stress makes it pass the large majority of
the time.

Change-Id: I35faf53cb9bb8585ec1c01d038b1cd64a0bb533e
---
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/integration-tests/test_workload.cc
M src/kudu/integration-tests/test_workload.h
3 files changed, 23 insertions(+), 12 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/19/5319/1
-- 
To view, visit http://gerrit.cloudera.org:8080/5319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I35faf53cb9bb8585ec1c01d038b1cd64a0bb533e
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>

[kudu-CR] [flaky tests] Fix "Already present" failures on raft consensus-itest

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: [flaky tests] Fix "Already present" failures on raft_consensus-itest
......................................................................


Patch Set 3: Code-Review+2

(2 comments)

lgtm modulo typos

http://gerrit.cloudera.org:8080/#/c/5319/3//COMMIT_MSG
Commit Message:

Line 9: In the flaky tests dashboarf TestSlowLeader fails with:
typo


Line 26: increases the time we wain on TestCommitIndexFarBehindAfterLeaderElection
wait


-- 
To view, visit http://gerrit.cloudera.org:8080/5319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I35faf53cb9bb8585ec1c01d038b1cd64a0bb533e
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [flaky tests] Fix "Already present" failures on raft consensus-itest

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has submitted this change and it was merged.

Change subject: [flaky tests] Fix "Already present" failures on raft_consensus-itest
......................................................................


[flaky tests] Fix "Already present" failures on raft_consensus-itest

In the flaky tests dashboard TestSlowLeader fails with:

"Check failed: e->status().ok() Unexpected status: Already present: key already present"

This happens because it's possible for TestWorkload to generate
identical random numbers on different threads, even though we use
a multiplicative linear congruential PRNG that is supposed to
generate all unique numbers within a single period of the PRNG.

This patch changes TestWorkload to use a ThreadSafeRandom. We
could also change the key type to int64 and do something like
int64 key = r.Next32() << 32 | thread_index, however changing
the type of the key is very invasive as a bunch of tests
depend on it.

This also increases the timeout of the snapshot scan in
TestReplicaBehaviorViaRPC as this would spuriously fail and
increases the time we wait on TestCommitIndexFarBehindAfterLeaderElection
which would cause spurious failures.

Results of running this on dist-test:
http://dist-test.cloudera.org//job?job_id=david.alves.1480656349.3518

To be fair I've only seen the failure "Already present" failure
outside of the flaky dashboard once, it's probably rarer than
1000 loops would allow to assert. However it's apparently difficult
to mimic the exact same conditions as the flaky dashboard tests:
running raft_consensus-itest with the stress option makes it incredibly
flaky, with different failures than the ones seen on the dashboard.
Not running it with stress makes it pass the large majority of
the time.

Change-Id: I35faf53cb9bb8585ec1c01d038b1cd64a0bb533e
Reviewed-on: http://gerrit.cloudera.org:8080/5319
Reviewed-by: David Ribeiro Alves <dr...@apache.org>
Tested-by: Kudu Jenkins
---
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/integration-tests/test_workload.cc
M src/kudu/integration-tests/test_workload.h
3 files changed, 23 insertions(+), 12 deletions(-)

Approvals:
  David Ribeiro Alves: Looks good to me, approved
  Kudu Jenkins: Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/5319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I35faf53cb9bb8585ec1c01d038b1cd64a0bb533e
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [flaky tests] Fix "Already present" failures on raft consensus-itest

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Todd Lipcon, Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5319

to look at the new patch set (#4).

Change subject: [flaky tests] Fix "Already present" failures on raft_consensus-itest
......................................................................

[flaky tests] Fix "Already present" failures on raft_consensus-itest

In the flaky tests dashboard TestSlowLeader fails with:

"Check failed: e->status().ok() Unexpected status: Already present: key already present"

This happens because it's possible for TestWorkload to generate
identical random numbers on different threads, even though we use
a multiplicative linear congruential PRNG that is supposed to
generate all unique numbers within a single period of the PRNG.

This patch changes TestWorkload to use a ThreadSafeRandom. We
could also change the key type to int64 and do something like
int64 key = r.Next32() << 32 | thread_index, however changing
the type of the key is very invasive as a bunch of tests
depend on it.

This also increases the timeout of the snapshot scan in
TestReplicaBehaviorViaRPC as this would spuriously fail and
increases the time we wait on TestCommitIndexFarBehindAfterLeaderElection
which would cause spurious failures.

Results of running this on dist-test:
http://dist-test.cloudera.org//job?job_id=david.alves.1480656349.3518

To be fair I've only seen the failure "Already present" failure
outside of the flaky dashboard once, it's probably rarer than
1000 loops would allow to assert. However it's apparently difficult
to mimic the exact same conditions as the flaky dashboard tests:
running raft_consensus-itest with the stress option makes it incredibly
flaky, with different failures than the ones seen on the dashboard.
Not running it with stress makes it pass the large majority of
the time.

Change-Id: I35faf53cb9bb8585ec1c01d038b1cd64a0bb533e
---
M src/kudu/integration-tests/raft_consensus-itest.cc
M src/kudu/integration-tests/test_workload.cc
M src/kudu/integration-tests/test_workload.h
3 files changed, 23 insertions(+), 12 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/19/5319/4
-- 
To view, visit http://gerrit.cloudera.org:8080/5319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I35faf53cb9bb8585ec1c01d038b1cd64a0bb533e
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [flaky tests] Fix "Already present" failures on raft consensus-itest

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: [flaky tests] Fix "Already present" failures on raft_consensus-itest
......................................................................


Patch Set 4: Code-Review+2

(5 comments)

keeping Todd's +2

http://gerrit.cloudera.org:8080/#/c/5319/3//COMMIT_MSG
Commit Message:

Line 9: In the flaky tests dashboard TestSlowLeader fails with:
> typo
Done


PS3, Line 13: TestWorkloa
> TestWorkload
Done


PS3, Line 24: increase
> increases
Done


Line 26: increases the time we wait on TestCommitIndexFarBehindAfterLeaderElection
> wait
Done


PS3, Line 35: mimic 
> mimic
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/5319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I35faf53cb9bb8585ec1c01d038b1cd64a0bb533e
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [flaky tests] Fix "Already present" failures on raft consensus-itest

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change.

Change subject: [flaky tests] Fix "Already present" failures on raft_consensus-itest
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5319/3//COMMIT_MSG
Commit Message:

PS3, Line 18: This patch changes TestWorkload to use a ThreadSafeRandom. We
            : could also change the key type to int64 and do something like
            : int64 key = r.Next32() << 32 | thread_index, however changing
            : the type of the key is very invasive as a bunch of tests
            : depend on it.
> Instead of relying on the implementation of PRNG, why not to give those thr
woops, that would not help since it also depends on the PRNG behavior pattern.  Disregard this.


-- 
To view, visit http://gerrit.cloudera.org:8080/5319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I35faf53cb9bb8585ec1c01d038b1cd64a0bb533e
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] [flaky tests] Fix "Already present" failures on raft consensus-itest

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change.

Change subject: [flaky tests] Fix "Already present" failures on raft_consensus-itest
......................................................................


Patch Set 3:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/5319/3//COMMIT_MSG
Commit Message:

PS3, Line 13: TestWorload
TestWorkload


PS3, Line 18: This patch changes TestWorkload to use a ThreadSafeRandom. We
            : could also change the key type to int64 and do something like
            : int64 key = r.Next32() << 32 | thread_index, however changing
            : the type of the key is very invasive as a bunch of tests
            : depend on it.
Instead of relying on the implementation of PRNG, why not to give those threads non-intersecting ranges of the target key space (-2^31 to 2^31 - 1 in this case) and then, if those thread, simply do

next_key = range_begin + random.Next() % (range_end - range_begin)

instead?


PS3, Line 24: increses
increases


PS3, Line 35: mimick
mimic


-- 
To view, visit http://gerrit.cloudera.org:8080/5319
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I35faf53cb9bb8585ec1c01d038b1cd64a0bb533e
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes