You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Alexey Serbin (Code Review)" <ge...@cloudera.org> on 2018/01/24 22:57:08 UTC

[kudu-CR] [tests] de-flaking DontEvictIfRemainingConfigIsUnstable

Alexey Serbin has uploaded this change for review. ( http://gerrit.cloudera.org:8080/9126


Change subject: [tests] de-flaking DontEvictIfRemainingConfigIsUnstable
......................................................................

[tests] de-flaking DontEvictIfRemainingConfigIsUnstable

Tuned configuration of the test cluster to minimize chances of reporting
on failed tablet replicas 'one-by-one' instead of 'two-at-once' for the
DontEvictIfRemainingConfigIsUnstableITest-derived scenarios.  The
'one-by-one' reporting on failed replicas was the reason behind
flakiness in the following scenarios:

  * DontEvictIfRemainingConfigIsUnstableITest.NodesDown/{0,1}
  * DontEvictIfRemainingConfigIsUnstableITest.NodesStopped/{0,1}

That was because the logic of those test scenarios assumes 2 out of 3
replicas becoming unresponsive at once.  If only 1 out of 3 replicas
became unresponsive and reported as such, the re-replication logic
(both 3-2-3 and 3-4-3 schemes) tried to initiate replacement of the
failed replica.  That's completely legit since at the time of the single
replica failure report the majority of replicas was reported as being
online.

As an additional fix, the ASSERT_EVENTUALLY() wrapper is added for the
invocation of GetTabletToReplicaUUIDsMapping().  That's to guarantee
that all relevant tablet servers are included into the mapping being
built.

Change-Id: Icb23d71b600a46f7fd6bba56091f884eae24784f
---
M src/kudu/integration-tests/tablet_replacement-itest.cc
1 file changed, 30 insertions(+), 12 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/26/9126/1
-- 
To view, visit http://gerrit.cloudera.org:8080/9126
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: Icb23d71b600a46f7fd6bba56091f884eae24784f
Gerrit-Change-Number: 9126
Gerrit-PatchSet: 1
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>

[kudu-CR] [tests] de-flaking DontEvictIfRemainingConfigIsUnstable

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/9126 )

Change subject: [tests] de-flaking DontEvictIfRemainingConfigIsUnstable
......................................................................


Patch Set 2: Code-Review+2

Thanks for trying to tune it down.


-- 
To view, visit http://gerrit.cloudera.org:8080/9126
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icb23d71b600a46f7fd6bba56091f884eae24784f
Gerrit-Change-Number: 9126
Gerrit-PatchSet: 2
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Wed, 31 Jan 2018 03:12:18 +0000
Gerrit-HasComments: No

[kudu-CR] [tests] de-flaking DontEvictIfRemainingConfigIsUnstable

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/9126 )

Change subject: [tests] de-flaking DontEvictIfRemainingConfigIsUnstable
......................................................................


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/9126/1/src/kudu/integration-tests/tablet_replacement-itest.cc
File src/kudu/integration-tests/tablet_replacement-itest.cc:

http://gerrit.cloudera.org:8080/#/c/9126/1/src/kudu/integration-tests/tablet_replacement-itest.cc@149
PS1, Line 149:   constexpr auto kHbIntervalSec = 3 * kUnavailableSec;
> can we do 3-4 sec instead of 9 sec here?
I made it 6 seconds.  3-4 turned to be less reliable in case of TSAN builds.  Even with 6 seconds it's about 0.48% failure rate if running TSAN builds with --stress-cpu-threads=16.


http://gerrit.cloudera.org:8080/#/c/9126/1/src/kudu/integration-tests/tablet_replacement-itest.cc@159
PS1, Line 159:     "--leader_failure_max_missed_heartbeat_periods=30",
> per our offline conversation, this can be removed
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/9126
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icb23d71b600a46f7fd6bba56091f884eae24784f
Gerrit-Change-Number: 9126
Gerrit-PatchSet: 1
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Fri, 26 Jan 2018 04:39:49 +0000
Gerrit-HasComments: Yes

[kudu-CR] [tests] de-flaking DontEvictIfRemainingConfigIsUnstable

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/9126 )

Change subject: [tests] de-flaking DontEvictIfRemainingConfigIsUnstable
......................................................................

[tests] de-flaking DontEvictIfRemainingConfigIsUnstable

Tuned configuration of the test cluster to minimize chances of reporting
on failed tablet replicas 'one-by-one' instead of 'two-at-once' for the
DontEvictIfRemainingConfigIsUnstableITest-derived scenarios.  The
'one-by-one' reporting on failed replicas was the reason behind
flakiness in the following scenarios:

  * DontEvictIfRemainingConfigIsUnstableITest.NodesDown/{0,1}
  * DontEvictIfRemainingConfigIsUnstableITest.NodesStopped/{0,1}

That was because the logic of those test scenarios assumes 2 out of 3
replicas becoming unresponsive at once.  If only 1 out of 3 replicas
became unresponsive and reported as such, the re-replication logic
(both 3-2-3 and 3-4-3 schemes) tried to initiate replacement of the
failed replica.  That's completely legit since at the time of the single
replica failure report the majority of replicas was reported as being
online.

As an additional fix, the ASSERT_EVENTUALLY() wrapper is added for the
invocation of GetTabletToReplicaUUIDsMapping().  That's to guarantee
that all relevant tablet servers are included into the mapping being
built.

Change-Id: Icb23d71b600a46f7fd6bba56091f884eae24784f
Reviewed-on: http://gerrit.cloudera.org:8080/9126
Tested-by: Kudu Jenkins
Reviewed-by: Mike Percy <mp...@apache.org>
---
M src/kudu/integration-tests/tablet_replacement-itest.cc
1 file changed, 29 insertions(+), 12 deletions(-)

Approvals:
  Kudu Jenkins: Verified
  Mike Percy: Looks good to me, approved

-- 
To view, visit http://gerrit.cloudera.org:8080/9126
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Icb23d71b600a46f7fd6bba56091f884eae24784f
Gerrit-Change-Number: 9126
Gerrit-PatchSet: 3
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [tests] de-flaking DontEvictIfRemainingConfigIsUnstable

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Hello Mike Percy, Kudu Jenkins, Todd Lipcon, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/9126

to look at the new patch set (#2).

Change subject: [tests] de-flaking DontEvictIfRemainingConfigIsUnstable
......................................................................

[tests] de-flaking DontEvictIfRemainingConfigIsUnstable

Tuned configuration of the test cluster to minimize chances of reporting
on failed tablet replicas 'one-by-one' instead of 'two-at-once' for the
DontEvictIfRemainingConfigIsUnstableITest-derived scenarios.  The
'one-by-one' reporting on failed replicas was the reason behind
flakiness in the following scenarios:

  * DontEvictIfRemainingConfigIsUnstableITest.NodesDown/{0,1}
  * DontEvictIfRemainingConfigIsUnstableITest.NodesStopped/{0,1}

That was because the logic of those test scenarios assumes 2 out of 3
replicas becoming unresponsive at once.  If only 1 out of 3 replicas
became unresponsive and reported as such, the re-replication logic
(both 3-2-3 and 3-4-3 schemes) tried to initiate replacement of the
failed replica.  That's completely legit since at the time of the single
replica failure report the majority of replicas was reported as being
online.

As an additional fix, the ASSERT_EVENTUALLY() wrapper is added for the
invocation of GetTabletToReplicaUUIDsMapping().  That's to guarantee
that all relevant tablet servers are included into the mapping being
built.

Change-Id: Icb23d71b600a46f7fd6bba56091f884eae24784f
---
M src/kudu/integration-tests/tablet_replacement-itest.cc
1 file changed, 29 insertions(+), 12 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/26/9126/2
-- 
To view, visit http://gerrit.cloudera.org:8080/9126
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icb23d71b600a46f7fd6bba56091f884eae24784f
Gerrit-Change-Number: 9126
Gerrit-PatchSet: 2
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [tests] de-flaking DontEvictIfRemainingConfigIsUnstable

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change. ( http://gerrit.cloudera.org:8080/9126 )

Change subject: [tests] de-flaking DontEvictIfRemainingConfigIsUnstable
......................................................................


Patch Set 1:

(2 comments)

looks good

http://gerrit.cloudera.org:8080/#/c/9126/1/src/kudu/integration-tests/tablet_replacement-itest.cc
File src/kudu/integration-tests/tablet_replacement-itest.cc:

http://gerrit.cloudera.org:8080/#/c/9126/1/src/kudu/integration-tests/tablet_replacement-itest.cc@149
PS1, Line 149:   constexpr auto kHbIntervalSec = 3 * kUnavailableSec;
can we do 3-4 sec instead of 9 sec here?


http://gerrit.cloudera.org:8080/#/c/9126/1/src/kudu/integration-tests/tablet_replacement-itest.cc@159
PS1, Line 159:     "--leader_failure_max_missed_heartbeat_periods=30",
per our offline conversation, this can be removed



-- 
To view, visit http://gerrit.cloudera.org:8080/9126
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icb23d71b600a46f7fd6bba56091f884eae24784f
Gerrit-Change-Number: 9126
Gerrit-PatchSet: 1
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Fri, 26 Jan 2018 02:06:52 +0000
Gerrit-HasComments: Yes