You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Alexey Serbin (Code Review)" <ge...@cloudera.org> on 2018/02/08 07:56:01 UTC

[kudu-CR] WIP [non voter-itest] stress test for the 3-4-3 replacement scheme

Alexey Serbin has uploaded this change for review. ( http://gerrit.cloudera.org:8080/9255


Change subject: WIP [non_voter-itest] stress test for the 3-4-3 replacement scheme
......................................................................

WIP [non_voter-itest] stress test for the 3-4-3 replacement scheme

Using this test, I was able to hit the following issue:

F0208 05:35:22.152496  2721 ts_tablet_manager.cc:563] T 5384471d823e46929029f9ff6ce212a3 P c713ac498df040caa897d3229214baa3: Tablet Copy: Found tablet in TABLET_DATA_COPYING state during StartTabletCopy()
*** Check failure stack trace: ***
    @     0x7f22d7fa92fd  google::LogMessage::Fail() at ??:0
    @     0x7f22d7fab1bd  google::LogMessage::SendToLog() at ??:0
    @     0x7f22d7fa8e39  google::LogMessage::Flush() at ??:0
    @     0x7f22d7fabc5f  google::LogMessageFatal::~LogMessageFatal() at ??:0
    @     0x7f22e1fa6ed7  kudu::tserver::TSTabletManager::RunTabletCopy() at ??:0
    @     0x7f22e1fb296f  kudu::tserver::TabletCopyRunnable::Run() at ??:0
    @     0x7f22d9c9a1f5  kudu::ThreadPool::DispatchThread() at ??:0
    @     0x7f22d9ca9069  boost::_mfi::mf0<>::operator()() at ??:0
    @     0x7f22d9ca8fd0  boost::_bi::list1<>::operator()<>() at ??:0
    @     0x7f22d9ca8f7a  boost::_bi::bind_t<>::operator()() at ??:0
    @     0x7f22d9ca8d5d  boost::detail::function::void_function_obj_invoker0<>::invoke() at ??:0
    @     0x7f22db98d018  boost::function0<>::operator()() at ??:0
    @     0x7f22d9c8bdfd  kudu::Thread::SuperviseThread() at ??:0
    @     0x7f22ddc68184  start_thread at ??:0

WIP: Fix the issue and verify the fix.  Probably, the dist-test machines
filled up because I ran this test: both the tablet servers and the master
were very chatty before I added the --minloglevel=2 option.

Change-Id: I036a882f7e9132a5c26013227c50c0699b59ed6e
---
M src/kudu/consensus/consensus_meta.cc
M src/kudu/consensus/consensus_queue.cc
M src/kudu/consensus/consensus_queue.h
M src/kudu/consensus/raft_consensus.cc
M src/kudu/integration-tests/cluster_verifier.h
M src/kudu/integration-tests/raft_consensus_nonvoter-itest.cc
6 files changed, 202 insertions(+), 1 deletion(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/55/9255/1
-- 
To view, visit http://gerrit.cloudera.org:8080/9255
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I036a882f7e9132a5c26013227c50c0699b59ed6e
Gerrit-Change-Number: 9255
Gerrit-PatchSet: 1
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>

[kudu-CR] WIP [non voter-itest] stress test for the 3-4-3 replacement scheme

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/9255 )

Change subject: WIP [non_voter-itest] stress test for the 3-4-3 replacement scheme
......................................................................


Patch Set 1:

(4 comments)

Nice torture test

http://gerrit.cloudera.org:8080/#/c/9255/1/src/kudu/integration-tests/cluster_verifier.h
File src/kudu/integration-tests/cluster_verifier.h:

http://gerrit.cloudera.org:8080/#/c/9255/1/src/kudu/integration-tests/cluster_verifier.h@79
PS1, Line 79:  DoKsck
nit: maybe rename to RunKsck() since DoKsck() sounds like the name of an internal method


http://gerrit.cloudera.org:8080/#/c/9255/1/src/kudu/integration-tests/raft_consensus_nonvoter-itest.cc
File src/kudu/integration-tests/raft_consensus_nonvoter-itest.cc:

http://gerrit.cloudera.org:8080/#/c/9255/1/src/kudu/integration-tests/raft_consensus_nonvoter-itest.cc@1915
PS1, Line 1915: too chatty, so logging error and fatal messages only
nit: comment punctuation here and below


http://gerrit.cloudera.org:8080/#/c/9255/1/src/kudu/integration-tests/raft_consensus_nonvoter-itest.cc@1916
PS1, Line 1916:     "--raft_prepare_replacement_before_eviction=true",
consider adding --tablet_copy_early_session_timeout_prob=.2 to create more tablet copy failures and chaos


http://gerrit.cloudera.org:8080/#/c/9255/1/src/kudu/integration-tests/raft_consensus_nonvoter-itest.cc@1988
PS1, Line 1988: tablet_id_
always the same tablet?



-- 
To view, visit http://gerrit.cloudera.org:8080/9255
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I036a882f7e9132a5c26013227c50c0699b59ed6e
Gerrit-Change-Number: 9255
Gerrit-PatchSet: 1
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Comment-Date: Sat, 10 Feb 2018 03:08:42 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2274 [itest] stress test for replica replacement

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/9255 )

Change subject: KUDU-2274 [itest] stress test for replica replacement
......................................................................


Patch Set 4:

(16 comments)

http://gerrit.cloudera.org:8080/#/c/9255/3/src/kudu/integration-tests/raft_consensus_stress-itest.cc
File src/kudu/integration-tests/raft_consensus_stress-itest.cc:

http://gerrit.cloudera.org:8080/#/c/9255/3/src/kudu/integration-tests/raft_consensus_stress-itest.cc@66
PS3, Line 66: 2
magic number, can we add something like /* ERROR */ or ideally use the constant from glog if possible? Also an explanation as to why we chose this as the default would be helpful for posterity.


http://gerrit.cloudera.org:8080/#/c/9255/3/src/kudu/integration-tests/raft_consensus_stress-itest.cc@79
PS3, Line 79: 2
20 for ASAN and 5 for everybody else?


http://gerrit.cloudera.org:8080/#/c/9255/3/src/kudu/integration-tests/raft_consensus_stress-itest.cc@113
PS3, Line 113: RaftConsensusITestBase
Since we are not using tablet_id_ etc, can we instead use ExternalMiniClusterITestBase as the base class?


http://gerrit.cloudera.org:8080/#/c/9255/3/src/kudu/integration-tests/raft_consensus_stress-itest.cc@117
PS3, Line 117: being
nit: s/being/are/


http://gerrit.cloudera.org:8080/#/c/9255/3/src/kudu/integration-tests/raft_consensus_stress-itest.cc@122
PS3, Line 122: this test is not fit to run under TSAN
nit: how about:

  this test is not reliable under TSAN; skipping test


http://gerrit.cloudera.org:8080/#/c/9255/3/src/kudu/integration-tests/raft_consensus_stress-itest.cc@170
PS3, Line 170: 2
kNumReplicas ?


http://gerrit.cloudera.org:8080/#/c/9255/3/src/kudu/integration-tests/raft_consensus_stress-itest.cc@175
PS3, Line 175: TODO(aserbin)
TODO(KUDU-1188)


http://gerrit.cloudera.org:8080/#/c/9255/3/src/kudu/integration-tests/raft_consensus_stress-itest.cc@212
PS3, Line 212: if (do_pauses) {
this loop will run hot if do_pauses is set to false


http://gerrit.cloudera.org:8080/#/c/9255/3/src/kudu/integration-tests/raft_consensus_stress-itest.cc@241
PS3, Line 241:     do_elections = false;
             :     do_pauses = false;
are these lines needed?


http://gerrit.cloudera.org:8080/#/c/9255/3/src/kudu/integration-tests/raft_consensus_stress-itest.cc@265
PS3, Line 265: 0
false


http://gerrit.cloudera.org:8080/#/c/9255/3/src/kudu/integration-tests/raft_consensus_stress-itest.cc@266
PS3, Line 266: 0
false


http://gerrit.cloudera.org:8080/#/c/9255/3/src/kudu/integration-tests/raft_consensus_stress-itest.cc@276
PS3, Line 276: 1
true


http://gerrit.cloudera.org:8080/#/c/9255/3/src/kudu/integration-tests/raft_consensus_stress-itest.cc@277
PS3, Line 277: 1
true


http://gerrit.cloudera.org:8080/#/c/9255/3/src/kudu/integration-tests/raft_consensus_stress-itest.cc@284
PS3, Line 284:   do_elections = false;
             :   do_pauses = false;
remove these lines?


http://gerrit.cloudera.org:8080/#/c/9255/3/src/kudu/integration-tests/raft_consensus_stress-itest.cc@293
PS3, Line 293: onst auto& s
Consider saving this in a Status last_ksck_status defined next to ksck_failures_in_a_row (outside the loop) and saved on line 273, since it would be confusing to print a success message on failure if the final attempt succeeds.


http://gerrit.cloudera.org:8080/#/c/9255/3/src/kudu/integration-tests/raft_consensus_stress-itest.cc@305
PS3, Line 305:   //NO_FATALS(v.CheckRowCount(workload.table_name(),
Doesn't CheckRowCount() retry? Leader leases won't help here, I don't think. If we allow writes to time out, then we can actually end up writing more rows than got acknowledged. Since we are using workload.set_timeout_allowed(true), we should be using (ClusterVerifier::AT_LEAST, workload.rows_inserted()), not ClusterVerifier::EXACTLY.



-- 
To view, visit http://gerrit.cloudera.org:8080/9255
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I036a882f7e9132a5c26013227c50c0699b59ed6e
Gerrit-Change-Number: 9255
Gerrit-PatchSet: 4
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Comment-Date: Wed, 14 Feb 2018 02:19:16 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2274 [itest] stress test for replica replacement

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/9255 )

Change subject: KUDU-2274 [itest] stress test for replica replacement
......................................................................

KUDU-2274 [itest] stress test for replica replacement

This is a stress test for the automatic replica replacement in Kudu.

Parameters of the test are configurable via run-time gflags, so it's
possible to run it in a 'standalone' mode, targeting it to be a sort
of an endurance test.

This test reproduces the race described in KUDU-2274: it triggers
DCHECK() assertions added in 194fd8b169f29aafbd78a47709ac51d2e8354a1a
before the relevant fixes for KUDU-2274 were checked in.

Change-Id: I036a882f7e9132a5c26013227c50c0699b59ed6e
Reviewed-on: http://gerrit.cloudera.org:8080/9255
Tested-by: Kudu Jenkins
Reviewed-by: Mike Percy <mp...@apache.org>
---
M src/kudu/integration-tests/CMakeLists.txt
M src/kudu/integration-tests/cluster_verifier.cc
M src/kudu/integration-tests/cluster_verifier.h
A src/kudu/integration-tests/raft_consensus_stress-itest.cc
4 files changed, 305 insertions(+), 4 deletions(-)

Approvals:
  Kudu Jenkins: Verified
  Mike Percy: Looks good to me, approved

-- 
To view, visit http://gerrit.cloudera.org:8080/9255
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I036a882f7e9132a5c26013227c50c0699b59ed6e
Gerrit-Change-Number: 9255
Gerrit-PatchSet: 7
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot

[kudu-CR] [itest] stress test for replica replacement

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/9255 )

Change subject: [itest] stress test for replica replacement
......................................................................


Patch Set 2:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/9255/1/src/kudu/integration-tests/cluster_verifier.h
File src/kudu/integration-tests/cluster_verifier.h:

http://gerrit.cloudera.org:8080/#/c/9255/1/src/kudu/integration-tests/cluster_verifier.h@79
PS1, Line 79:  the ks
> nit: maybe rename to RunKsck() since DoKsck() sounds like the name of an in
Done


http://gerrit.cloudera.org:8080/#/c/9255/1/src/kudu/integration-tests/raft_consensus_nonvoter-itest.cc
File src/kudu/integration-tests/raft_consensus_nonvoter-itest.cc:

http://gerrit.cloudera.org:8080/#/c/9255/1/src/kudu/integration-tests/raft_consensus_nonvoter-itest.cc@1915
PS1, Line 1915: 
> nit: comment punctuation here and below
Done


http://gerrit.cloudera.org:8080/#/c/9255/1/src/kudu/integration-tests/raft_consensus_nonvoter-itest.cc@1916
PS1, Line 1916: 
> consider adding --tablet_copy_early_session_timeout_prob=.2 to create more 
That's a good idea!

Done.


http://gerrit.cloudera.org:8080/#/c/9255/1/src/kudu/integration-tests/raft_consensus_nonvoter-itest.cc@1939
PS1, Line 1939: 
> warning: either cast from 'int' to 'long' is ineffective, or there is loss 
Done


http://gerrit.cloudera.org:8080/#/c/9255/1/src/kudu/integration-tests/raft_consensus_nonvoter-itest.cc@1988
PS1, Line 1988: 
> always the same tablet?
woops, indeed.  It will make sense to do that for different tablets.



-- 
To view, visit http://gerrit.cloudera.org:8080/9255
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I036a882f7e9132a5c26013227c50c0699b59ed6e
Gerrit-Change-Number: 9255
Gerrit-PatchSet: 2
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Comment-Date: Mon, 12 Feb 2018 07:23:45 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2274 [itest] stress test for replica replacement

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Mike Percy, Kudu Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/9255

to look at the new patch set (#3).

Change subject: KUDU-2274 [itest] stress test for replica replacement
......................................................................

KUDU-2274 [itest] stress test for replica replacement

This is a stress test for the automatic replica replacement in Kudu.

Parameters of the test is configurable via run-time gflags, so it's
possible to run it in a 'standalone' mode, targeting it to be a sort
of an endurance test.

This test reproduces the race described in KUDU-2274: it triggers
DCHECK() assertions added in 194fd8b169f29aafbd78a47709ac51d2e8354a1a
before the relevant fixes for KUDU-2274 were checked in.

Change-Id: I036a882f7e9132a5c26013227c50c0699b59ed6e
---
M src/kudu/integration-tests/CMakeLists.txt
M src/kudu/integration-tests/cluster_verifier.cc
M src/kudu/integration-tests/cluster_verifier.h
A src/kudu/integration-tests/raft_consensus_stress-itest.cc
4 files changed, 317 insertions(+), 4 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/55/9255/3
-- 
To view, visit http://gerrit.cloudera.org:8080/9255
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I036a882f7e9132a5c26013227c50c0699b59ed6e
Gerrit-Change-Number: 9255
Gerrit-PatchSet: 3
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot

[kudu-CR] KUDU-2274 [itest] stress test for replica replacement

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Mike Percy, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/9255

to look at the new patch set (#6).

Change subject: KUDU-2274 [itest] stress test for replica replacement
......................................................................

KUDU-2274 [itest] stress test for replica replacement

This is a stress test for the automatic replica replacement in Kudu.

Parameters of the test are configurable via run-time gflags, so it's
possible to run it in a 'standalone' mode, targeting it to be a sort
of an endurance test.

This test reproduces the race described in KUDU-2274: it triggers
DCHECK() assertions added in 194fd8b169f29aafbd78a47709ac51d2e8354a1a
before the relevant fixes for KUDU-2274 were checked in.

Change-Id: I036a882f7e9132a5c26013227c50c0699b59ed6e
---
M src/kudu/integration-tests/CMakeLists.txt
M src/kudu/integration-tests/cluster_verifier.cc
M src/kudu/integration-tests/cluster_verifier.h
A src/kudu/integration-tests/raft_consensus_stress-itest.cc
4 files changed, 305 insertions(+), 4 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/55/9255/6
-- 
To view, visit http://gerrit.cloudera.org:8080/9255
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I036a882f7e9132a5c26013227c50c0699b59ed6e
Gerrit-Change-Number: 9255
Gerrit-PatchSet: 6
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot

[kudu-CR] KUDU-2274 [itest] stress test for replica replacement

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/9255 )

Change subject: KUDU-2274 [itest] stress test for replica replacement
......................................................................


Patch Set 5:

(15 comments)

http://gerrit.cloudera.org:8080/#/c/9255/3/src/kudu/integration-tests/raft_consensus_stress-itest.cc
File src/kudu/integration-tests/raft_consensus_stress-itest.cc:

http://gerrit.cloudera.org:8080/#/c/9255/3/src/kudu/integration-tests/raft_consensus_stress-itest.cc@66
PS3, Line 66: t
> magic number, can we add something like /* ERROR */ or ideally use the cons
Done


http://gerrit.cloudera.org:8080/#/c/9255/3/src/kudu/integration-tests/raft_consensus_stress-itest.cc@79
PS3, Line 79: "
> 20 for ASAN and 5 for everybody else?
Yep, that's the ultimate typo.  It should be inversed/swapped.


http://gerrit.cloudera.org:8080/#/c/9255/3/src/kudu/integration-tests/raft_consensus_stress-itest.cc@117
PS3, Line 117:  int 
> nit: s/being/are/
Done


http://gerrit.cloudera.org:8080/#/c/9255/3/src/kudu/integration-tests/raft_consensus_stress-itest.cc@122
PS3, Line 122: ShortTimeout = MonoDelta::FromSeconds(
> nit: how about:
well, as it turned out, after fixing the ultimate typo it works for TSAN even without patch for TestWorkload.


http://gerrit.cloudera.org:8080/#/c/9255/3/src/kudu/integration-tests/raft_consensus_stress-itest.cc@170
PS3, Line 170: 
> kNumReplicas ?
By my understanding (after looking into catalog_manager.cc), FLAGS_max_create_tablets_per_ts limits the number of partitions/tablets, not the number of total replicas.  My original idea was to have FLAGS_test_num_tablets_per_server as a proxy for FLAGS_max_create_tablets_per_ts.

However, I updated the signature and now the test speaks in number of replicas per tablet server: I agree it's more convenient.


http://gerrit.cloudera.org:8080/#/c/9255/3/src/kudu/integration-tests/raft_consensus_stress-itest.cc@175
PS3, Line 175: le (workload.
> TODO(KUDU-1188)
Done


http://gerrit.cloudera.org:8080/#/c/9255/3/src/kudu/integration-tests/raft_consensus_stress-itest.cc@212
PS3, Line 212:   const auto ts_
> this loop will run hot if do_pauses is set to false
Right.  I moved SleepFor(MonoDelta::FromMilliseconds(250)) out of the if() closure.


http://gerrit.cloudera.org:8080/#/c/9255/3/src/kudu/integration-tests/raft_consensus_stress-itest.cc@241
PS3, Line 241:     workload.StopAndJoin();
             :     rows_inserted = wo
> are these lines needed?
Yep, now I think we can get rid of them safely.  Maybe, in one of the former revisions they were useful.


http://gerrit.cloudera.org:8080/#/c/9255/3/src/kudu/integration-tests/raft_consensus_stress-itest.cc@265
PS3, Line 265: 
> false
Done


http://gerrit.cloudera.org:8080/#/c/9255/3/src/kudu/integration-tests/raft_consensus_stress-itest.cc@266
PS3, Line 266: e
> false
Done


http://gerrit.cloudera.org:8080/#/c/9255/3/src/kudu/integration-tests/raft_consensus_stress-itest.cc@276
PS3, Line 276: T
> true
Done


http://gerrit.cloudera.org:8080/#/c/9255/3/src/kudu/integration-tests/raft_consensus_stress-itest.cc@277
PS3, Line 277: r
> true
Done


http://gerrit.cloudera.org:8080/#/c/9255/3/src/kudu/integration-tests/raft_consensus_stress-itest.cc@284
PS3, Line 284:       FAIL() << Substitute("$0: tablet replicas haven't converged", s.ToString());
             :     }
> remove these lines?
Done


http://gerrit.cloudera.org:8080/#/c/9255/3/src/kudu/integration-tests/raft_consensus_stress-itest.cc@293
PS3, Line 293: FATALS(v.Che
> Consider saving this in a Status last_ksck_status defined next to ksck_fail
Good catch.  It seems like just another lost update due to multiple revisions of this test.  I corrected the code as necessary.


http://gerrit.cloudera.org:8080/#/c/9255/3/src/kudu/integration-tests/raft_consensus_stress-itest.cc@305
PS3, Line 305: 
> Doesn't CheckRowCount() retry? Leader leases won't help here, I don't think
Indeed, thank you for pointing at the right direction.  I thought the message was about reading less than expected, but as it turned out it was about reading more than expected.



-- 
To view, visit http://gerrit.cloudera.org:8080/9255
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I036a882f7e9132a5c26013227c50c0699b59ed6e
Gerrit-Change-Number: 9255
Gerrit-PatchSet: 5
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Comment-Date: Wed, 14 Feb 2018 06:56:13 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2274 [itest] stress test for replica replacement

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has removed Kudu Jenkins from this change.  ( http://gerrit.cloudera.org:8080/9255 )

Change subject: KUDU-2274 [itest] stress test for replica replacement
......................................................................


Removed reviewer Kudu Jenkins with the following votes:

* Verified-1 by Kudu Jenkins (120)
-- 
To view, visit http://gerrit.cloudera.org:8080/9255
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: deleteReviewer
Gerrit-Change-Id: I036a882f7e9132a5c26013227c50c0699b59ed6e
Gerrit-Change-Number: 9255
Gerrit-PatchSet: 5
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot

[kudu-CR] KUDU-2274 [itest] stress test for replica replacement

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Mike Percy, Kudu Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/9255

to look at the new patch set (#4).

Change subject: KUDU-2274 [itest] stress test for replica replacement
......................................................................

KUDU-2274 [itest] stress test for replica replacement

This is a stress test for the automatic replica replacement in Kudu.

Parameters of the test is configurable via run-time gflags, so it's
possible to run it in a 'standalone' mode, targeting it to be a sort
of an endurance test.

This test reproduces the race described in KUDU-2274: it triggers
DCHECK() assertions added in 194fd8b169f29aafbd78a47709ac51d2e8354a1a
before the relevant fixes for KUDU-2274 were checked in.

Change-Id: I036a882f7e9132a5c26013227c50c0699b59ed6e
---
M src/kudu/integration-tests/CMakeLists.txt
M src/kudu/integration-tests/cluster_verifier.cc
M src/kudu/integration-tests/cluster_verifier.h
A src/kudu/integration-tests/raft_consensus_stress-itest.cc
4 files changed, 317 insertions(+), 4 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/55/9255/4
-- 
To view, visit http://gerrit.cloudera.org:8080/9255
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I036a882f7e9132a5c26013227c50c0699b59ed6e
Gerrit-Change-Number: 9255
Gerrit-PatchSet: 4
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot

[kudu-CR] KUDU-2274 [itest] stress test for replica replacement

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Mike Percy, Kudu Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/9255

to look at the new patch set (#5).

Change subject: KUDU-2274 [itest] stress test for replica replacement
......................................................................

KUDU-2274 [itest] stress test for replica replacement

This is a stress test for the automatic replica replacement in Kudu.

Parameters of the test is configurable via run-time gflags, so it's
possible to run it in a 'standalone' mode, targeting it to be a sort
of an endurance test.

This test reproduces the race described in KUDU-2274: it triggers
DCHECK() assertions added in 194fd8b169f29aafbd78a47709ac51d2e8354a1a
before the relevant fixes for KUDU-2274 were checked in.

Change-Id: I036a882f7e9132a5c26013227c50c0699b59ed6e
---
M src/kudu/integration-tests/CMakeLists.txt
M src/kudu/integration-tests/cluster_verifier.cc
M src/kudu/integration-tests/cluster_verifier.h
A src/kudu/integration-tests/raft_consensus_stress-itest.cc
4 files changed, 305 insertions(+), 4 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/55/9255/5
-- 
To view, visit http://gerrit.cloudera.org:8080/9255
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I036a882f7e9132a5c26013227c50c0699b59ed6e
Gerrit-Change-Number: 9255
Gerrit-PatchSet: 5
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot

[kudu-CR] KUDU-2274 [itest] stress test for replica replacement

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/9255 )

Change subject: KUDU-2274 [itest] stress test for replica replacement
......................................................................


Patch Set 5: Verified+1

(1 comment)

Unreleated flakes in

  DiskReservationITest.TestFillMultipleDisks
  DiskReservationITest.TestWalWriteToFullDiskAborts

due to NTP failure

F0214 06:53:26.686384 28842 master_main.cc:74] Check failed: _s.ok() Bad status: Service unavailable: Cannot initialize clock: Error reading clock. Clock considered unsynchronized

http://gerrit.cloudera.org:8080/#/c/9255/3/src/kudu/integration-tests/raft_consensus_stress-itest.cc
File src/kudu/integration-tests/raft_consensus_stress-itest.cc:

http://gerrit.cloudera.org:8080/#/c/9255/3/src/kudu/integration-tests/raft_consensus_stress-itest.cc@113
PS3, Line 113: 
> Since we are not using tablet_id_ etc, can we instead use ExternalMiniClust
This test uses tablet_servers_ and the code to populate that.  I'm also thinking that we might expand functionality of this test in the future, so I would prefer to keep RaftConsensusITestBase here.

However, if you feel strongly about that, I'll update this to use ExternalMiniClusterITestBase.



-- 
To view, visit http://gerrit.cloudera.org:8080/9255
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I036a882f7e9132a5c26013227c50c0699b59ed6e
Gerrit-Change-Number: 9255
Gerrit-PatchSet: 5
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Comment-Date: Wed, 14 Feb 2018 19:49:02 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2274 [itest] stress test for replica replacement

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/9255 )

Change subject: KUDU-2274 [itest] stress test for replica replacement
......................................................................


Patch Set 5: Code-Review+1

(3 comments)

http://gerrit.cloudera.org:8080/#/c/9255/5//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/9255/5//COMMIT_MSG@11
PS5, Line 11: is
are


http://gerrit.cloudera.org:8080/#/c/9255/5/src/kudu/integration-tests/raft_consensus_stress-itest.cc
File src/kudu/integration-tests/raft_consensus_stress-itest.cc:

http://gerrit.cloudera.org:8080/#/c/9255/5/src/kudu/integration-tests/raft_consensus_stress-itest.cc@236
PS5, Line 236:  
nit: why not iteration++ here?


http://gerrit.cloudera.org:8080/#/c/9255/5/src/kudu/integration-tests/raft_consensus_stress-itest.cc@277
PS5, Line 277:   if (ksck_failures_in_a_row > FLAGS_test_max_ksck_failures) {
             :     // Suspecting a Raft consensus failure while running ksck with shorter
             :     // timeout (see above). Run an extra round of ksck with the regular timeout
             :     // to verify that replicas haven't really converged and, if so, just bail
             :     // right at this point.
             :     const auto& s = v.RunKsck();
             :     if (!s.ok()) {
             :       FAIL() << Substitute("$0: tablet replicas haven't converged", s.ToString());
             :     }
             :   }
isn't this the same as line 288 where we run v.CheckCluster() ?



-- 
To view, visit http://gerrit.cloudera.org:8080/9255
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I036a882f7e9132a5c26013227c50c0699b59ed6e
Gerrit-Change-Number: 9255
Gerrit-PatchSet: 5
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Comment-Date: Tue, 20 Feb 2018 23:42:10 +0000
Gerrit-HasComments: Yes

[kudu-CR] KUDU-2274 [itest] stress test for replica replacement

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/9255 )

Change subject: KUDU-2274 [itest] stress test for replica replacement
......................................................................


Patch Set 6:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/9255/5//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/9255/5//COMMIT_MSG@11
PS5, Line 11: ar
> are
Done


http://gerrit.cloudera.org:8080/#/c/9255/5/src/kudu/integration-tests/raft_consensus_stress-itest.cc
File src/kudu/integration-tests/raft_consensus_stress-itest.cc:

http://gerrit.cloudera.org:8080/#/c/9255/5/src/kudu/integration-tests/raft_consensus_stress-itest.cc@236
PS5, Line 236:  
> nit: why not iteration++ here?
Because I don't want to count retry due to a ksck failure as an iteration: if RunKsck() fails, the logic starts another loop, but  that case is not counted as 'iteration' in sense of this scenario.


http://gerrit.cloudera.org:8080/#/c/9255/5/src/kudu/integration-tests/raft_consensus_stress-itest.cc@277
PS5, Line 277:   if (ksck_failures_in_a_row > FLAGS_test_max_ksck_failures) {
             :     // Suspecting a Raft consensus failure while running ksck with shorter
             :     // timeout (see above). Run an extra round of ksck with the regular timeout
             :     // to verify that replicas haven't really converged and, if so, just bail
             :     // right at this point.
             :     const auto& s = v.RunKsck();
             :     if (!s.ok()) {
             :       FAIL() << Substitute("$0: tablet replicas haven't converged", s.ToString());
             :     }
             :   }
> isn't this the same as line 288 where we run v.CheckCluster() ?
Yes, CheckCluster contains RunKsck().  However, it's wrapped into some extra logic for retries.  Here I wanted to short-circuit and report an error right away if the ksck has already failed too many times in a row.



-- 
To view, visit http://gerrit.cloudera.org:8080/9255
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I036a882f7e9132a5c26013227c50c0699b59ed6e
Gerrit-Change-Number: 9255
Gerrit-PatchSet: 6
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Comment-Date: Tue, 20 Feb 2018 23:54:32 +0000
Gerrit-HasComments: Yes

[kudu-CR] [itest] stress test for replica replacement

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Hello Tidy Bot, Mike Percy, Kudu Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/9255

to look at the new patch set (#2).

Change subject: [itest] stress test for replica replacement
......................................................................

[itest] stress test for replica replacement

This is a stress test for the automatic replica replacement in Kudu.

Change-Id: I036a882f7e9132a5c26013227c50c0699b59ed6e
---
M src/kudu/integration-tests/CMakeLists.txt
M src/kudu/integration-tests/cluster_verifier.cc
M src/kudu/integration-tests/cluster_verifier.h
A src/kudu/integration-tests/raft_consensus_stress-itest.cc
4 files changed, 253 insertions(+), 4 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/55/9255/2
-- 
To view, visit http://gerrit.cloudera.org:8080/9255
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I036a882f7e9132a5c26013227c50c0699b59ed6e
Gerrit-Change-Number: 9255
Gerrit-PatchSet: 2
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot

[kudu-CR] KUDU-2274 [itest] stress test for replica replacement

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/9255 )

Change subject: KUDU-2274 [itest] stress test for replica replacement
......................................................................


Patch Set 6: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/9255
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I036a882f7e9132a5c26013227c50c0699b59ed6e
Gerrit-Change-Number: 9255
Gerrit-PatchSet: 6
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Comment-Date: Wed, 21 Feb 2018 01:37:02 +0000
Gerrit-HasComments: No