Posted to reviews@kudu.apache.org by "Alexey Serbin (Code Review)" <ge...@cloudera.org> on 2018/02/10 04:35:07 UTC

[kudu-CR] WIP [tests] scenario to repro off-by-one error in TestWorkload

Alexey Serbin has uploaded this change for review. ( http://gerrit.cloudera.org:8080/9277


Change subject: WIP [tests] scenario to repro off-by-one error in TestWorkload
......................................................................

WIP [tests] scenario to repro off-by-one error in TestWorkload

This patch is basically the patch from https://gerrit.cloudera.org/#/c/9255/
with additional updates.

To repro, compile in the RELEASE configuration and run locally or
via dist-test:

KUDU_ALLOW_SLOW_TESTS=1 ./bin/raft_consensus_nonvoter-itest \
  --gtest_filter='*RemoveReplaceInCycle*'

The failure rate at dist-test is about 1/16.  The relevant failures
look like the following (at least two distinct cases):

--------

src/kudu/integration-tests/cluster_verifier.cc:131: Failure
Failed
Bad status: Corruption: row count 31950 is not exactly expected value 31949
src/kudu/integration-tests/raft_consensus_nonvoter-itest.cc:2045: Failure
Expected: v.CheckRowCount(workload.table_name(), ClusterVerifier::EXACTLY, workload.rows_inserted()) doesn't generate new fatal failures in the current thread.
Actual: it does.

--------

F0210 04:28:10.156322  5051 test_workload.cc:236] Check failed: row_count >= expected_row_count (31049 vs. 31050)
*** Check failure stack trace: ***
    @           0x9426bd  google::LogMessage::Fail() at thirdparty/src/glog-0.3.5/src/logging.cc:1488
    @           0x94457d  google::LogMessage::SendToLog() at thirdparty/src/glog-0.3.5/src/logging.cc:1442
    @           0x9421f9  google::LogMessage::Flush() at thirdparty/src/glog-0.3.5/src/logging.cc:1312
    @           0x94501f  google::LogMessageFatal::~LogMessageFatal() at thirdparty/src/glog-0.3.5/src/logging.cc:2024
    @           0x930339  kudu::TestWorkload::ReadThread() at /opt/rh/devtoolset-3/root/usr/lib/gcc/x86_64-redhat-linux/4.9.2/../../../../include/c++/4.9.2/tr1/shared_ptr.h:340
    @     0x7f22ba943a60  (unknown) at ??:0
    @     0x7f22bbd8b184  start_thread at ??:0
    @     0x7f22ba3b0ffd  clone at ??:0
    @              (nil)  (unknown)

--------

An example run at dist-test:
  http://dist-test.cloudera.org//job?job_id=aserbin.1518236014.105889

Change-Id: I60666b8b05dce8dd13fcdee6408c0930915ba0c1
---
M src/kudu/integration-tests/cluster_verifier.h
M src/kudu/integration-tests/raft_consensus_nonvoter-itest.cc
M src/kudu/tablet/tablet.cc
M src/kudu/tserver/ts_tablet_manager.cc
4 files changed, 177 insertions(+), 11 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/77/9277/1
-- 
To view, visit http://gerrit.cloudera.org:8080/9277
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I60666b8b05dce8dd13fcdee6408c0930915ba0c1
Gerrit-Change-Number: 9277
Gerrit-PatchSet: 1
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>

[kudu-CR] WIP [tests] scenario to repro off-by-one error in TestWorkload

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change. ( http://gerrit.cloudera.org:8080/9277 )

Change subject: WIP [tests] scenario to repro off-by-one error in TestWorkload
......................................................................


Patch Set 1:

I briefly tried this with my WIP implementation of leader leases, and the problem seemed to go away. While not definitive, this is encouraging evidence that the problem indeed stems from the lack of leader leases.
Conceptually this seems plausible too. Scenario:
- Two replicas (a,b) dispute leadership of tablet A, i.e. both think they're leaders, but "b" is the actual most recent leader.
- Client writes the last row to replica "b".
- Client then performs a scan at snapshot for all rows on replica "a", sending the timestamp of the write.
- Replica "a" thinking it's leader simply considers the current timestamp "safe" and executes a (non-repeatable) scan that doesn't include the last row written to b.

Possible follow-up steps:
- Retry the scan. If the problem goes away, then it's more likely that the problem stems from the lack of leader leases.
- Even better, hack it so that we retry the scan _at the same snapshot timestamp_ (see the sketch after this list).
- Add logging events (test only) that register the timestamps/safe time/chosen replica on the server side (hopefully showing the smoking gun).
- Implement leader leases and put this test on top :)
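
A minimal sketch of what that "retry at the same snapshot" step could look
like, assuming the standard Kudu C++ client API (READ_AT_SNAPSHOT plus the
test-oriented KuduScanner::SetSnapshotRaw()); the helper name
CountRowsAtSnapshot and its wiring are illustrative, not part of this patch:

  // Sketch only: count the rows visible at a fixed snapshot timestamp so the
  // exact same logical snapshot can be re-read on a retry.
  #include <cstdint>
  #include <string>

  #include "kudu/client/client.h"
  #include "kudu/client/scan_batch.h"
  #include "kudu/util/status.h"

  using kudu::client::KuduClient;
  using kudu::client::KuduScanBatch;
  using kudu::client::KuduScanner;
  using kudu::client::KuduTable;
  using kudu::client::sp::shared_ptr;

  kudu::Status CountRowsAtSnapshot(const shared_ptr<KuduClient>& client,
                                   const std::string& table_name,
                                   uint64_t snapshot_ts,
                                   int64_t* row_count) {
    shared_ptr<KuduTable> table;
    RETURN_NOT_OK(client->OpenTable(table_name, &table));

    KuduScanner scanner(table.get());
    // Pin the scan to the given timestamp instead of letting the server pick
    // "now"; this is what lets a retry observe the same snapshot.
    RETURN_NOT_OK(scanner.SetReadMode(KuduScanner::READ_AT_SNAPSHOT));
    RETURN_NOT_OK(scanner.SetSnapshotRaw(snapshot_ts));
    RETURN_NOT_OK(scanner.Open());

    int64_t count = 0;
    KuduScanBatch batch;
    while (scanner.HasMoreRows()) {
      RETURN_NOT_OK(scanner.NextBatch(&batch));
      count += batch.NumRows();
    }
    *row_count = count;
    return kudu::Status::OK();
  }

If a second pass of such a helper at the same raw timestamp does return the
missing row, that points at a transient safe-time decision on the serving
replica rather than at lost data.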


-- 
To view, visit http://gerrit.cloudera.org:8080/9277
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I60666b8b05dce8dd13fcdee6408c0930915ba0c1
Gerrit-Change-Number: 9277
Gerrit-PatchSet: 1
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Wed, 21 Feb 2018 00:54:04 +0000
Gerrit-HasComments: No

[kudu-CR] WIP [tests] scenario to repro off-by-one error in TestWorkload

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/9277 )

Change subject: WIP [tests] scenario to repro off-by-one error in TestWorkload
......................................................................


Patch Set 1:

> Patch Set 1:
> 
> I briefly tried this with my WIP implementation of leader leases, and the problem seemed to go away. While not definitive, this is encouraging evidence that the problem indeed stems from the lack of leader leases.
> Conceptually this seems plausible too. Scenario:
> - Two replicas (a,b) dispute leadership of tablet A, i.e. both think they're leaders, but "b" is the actual most recent leader.
> - Client writes the last row to replica "b".
> - Client then performs a scan at snapshot for all rows on replica "a", sending the timestamp of the write.
> - Replica "a" thinking it's leader simply considers the current timestamp "safe" and executes a (non-repeatable) scan that doesn't include the last row written to b.
> 
> Possible follow-up steps:
> - Retry the scan. If the problem goes away, then it's more likely that the problem stems from the lack of leader leases.
> - Even better, hack it so that we retry the scan _at the same snapshot timestamp_.
> - Add logging events (test only) that register the timestamps/safe time/chosen replica on the server side (hopefully showing the smoking gun).
> - Implement leader leases and put this test on top :)

Thank you, David, for looking at this!  I'm glad to know your in-progress leader leases patch fixed the problem.

I think I can start hacking on this to retry the scan at the same snapshot, as recommended.
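
Roughly, the hack could plug into the verification path like this (a sketch
only; CountRowsAtSnapshot is the helper sketched earlier in this thread, the
retry count and logging are illustrative, and 'client', 'table_name', and
'expected_row_count' plus the usual kudu/util headers come from the
surrounding test context):

  // Sketch only: re-run the row count at the exact same raw snapshot
  // timestamp before declaring a mismatch fatal.
  int64_t count = 0;
  const uint64_t snapshot_ts = client->GetLatestObservedTimestamp();
  for (int attempt = 0; attempt < 5; ++attempt) {
    CHECK_OK(CountRowsAtSnapshot(client, table_name, snapshot_ts, &count));
    if (count >= expected_row_count) break;
    LOG(WARNING) << "attempt " << attempt << ": saw " << count
                 << " rows, expected at least " << expected_row_count
                 << "; retrying at the same snapshot";
    SleepFor(kudu::MonoDelta::FromMilliseconds(100));
  }
  CHECK_GE(count, expected_row_count);

If the count heals on a retry at the same snapshot, the original miss was a
stale safe-time decision, consistent with the missing leader leases.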


-- 
To view, visit http://gerrit.cloudera.org:8080/9277
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I60666b8b05dce8dd13fcdee6408c0930915ba0c1
Gerrit-Change-Number: 9277
Gerrit-PatchSet: 1
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: David Ribeiro Alves <da...@gmail.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Wed, 21 Feb 2018 18:52:24 +0000
Gerrit-HasComments: No