Posted to reviews@kudu.apache.org by "Alexey Serbin (Code Review)" <ge...@cloudera.org> on 2017/09/08 23:47:10 UTC

[kudu-CR] [tests] de-flaking catalog manager tsk-itest

Alexey Serbin has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/8017

Change subject: [tests] de-flaking catalog_manager_tsk-itest
......................................................................

[tests] de-flaking catalog_manager_tsk-itest

After recent updates the catalog_manager_tsk-itest became unstable.
One of the failure scenarios is when the test tablet server cannot
register in the cluster within the specified timeout (30 seconds).

It seems some test machines are too slow to accommodate the test
scenario with a 16ms Raft heartbeat interval.  When the test runs
with too short a Raft heartbeat interval, the following scenario
occurs on slow or very busy machines:

* An election happens among masters (e.g., term 1) and a leader master
  is elected.

* Shortly after that, the followers stop receiving some Raft heartbeats
  from the leader within the specified timeout interval.

* The followers start a new election, but experience timeouts for vote
  requests among them as well.

* The leader fails to get responses from the followers for its
  UpdateConsensus RPC requests.

* The tablet server fails to register with the cluster.

Sometimes the scenario above is aggravated by incoming Raft requests
being dropped due to backpressure on the Raft RPC service queue
in masters.

The following changes were made to address the flakiness due
to the described scenarios:
  * increasing the Raft heartbeat interval
  * increasing max length of the Raft RPC service queue
  * increasing the back-off interval after leader election failures
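
The three changes listed above could be expressed as master flags
roughly as in the sketch below.  This is only an illustration, not the
code from the patch: the helper and struct are hypothetical, the flag
names are assumed to match Kudu's gflags and should be checked against
the source tree, and every value except the 128ms heartbeat interval
(taken from the patch under review) is a placeholder.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical container for the three knobs named in the commit message.
struct RaftTuning {
  int hb_interval_ms;        // Raft heartbeat interval
  int rpc_queue_length;      // max length of the RPC service queue
  int max_backoff_delta_ms;  // cap on back-off after failed elections
};

// Hypothetical helper: renders the tuning as command-line flags.
// The flag names are an assumption about Kudu's gflags, not verified here.
std::vector<std::string> BuildMasterFlags(const RaftTuning& t) {
  return {
      "--raft_heartbeat_interval_ms=" + std::to_string(t.hb_interval_ms),
      "--rpc_service_queue_length=" + std::to_string(t.rpc_queue_length),
      "--leader_failure_exp_backoff_max_delta_ms=" +
          std::to_string(t.max_backoff_delta_ms),
  };
}
```

Keeping the values in one struct makes it easier to see at a glance
which timing assumptions the test depends on.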

After making the changes above, the test became more stable.  Not
a single failure was spotted in multiple 1K runs executed via dist-test
with --stress_cpu_threads=16:

ASAN:
  http://dist-test.cloudera.org/job?job_id=aserbin.1504914035.18113

DEBUG:
  http://dist-test.cloudera.org/job?job_id=aserbin.1504911962.26895

RELEASE:
  http://dist-test.cloudera.org/job?job_id=aserbin.1504913524.8185

TSAN:
  http://dist-test.cloudera.org/job?job_id=aserbin.1504903775.17126

This is a follow-up for faa0b14effb6e15f9989d686e5a1f8e1040a1dd6.

Change-Id: I50cee27a579cffa7232137c7039b02a1ad4ab7eb
---
M src/kudu/integration-tests/catalog_manager_tsk-itest.cc
1 file changed, 5 insertions(+), 3 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/17/8017/1
-- 
To view, visit http://gerrit.cloudera.org:8080/8017
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I50cee27a579cffa7232137c7039b02a1ad4ab7eb
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>

[kudu-CR] [tests] de-flaking catalog manager tsk-itest

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change.

Change subject: [tests] de-flaking catalog_manager_tsk-itest
......................................................................


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/8017/1//COMMIT_MSG
Commit Message:

Did you try looping this test with the recent failure-detection change (https://github.com/apache/kudu/commit/21b0f3d5e255760535e281efe5879fe657df1f1c) reverted? Todd suspected that it made elections converge more slowly, and it could be responsible for the flakiness here too.

If it is responsible, it'd be better to address that than to continually tweak the test values ever so slightly to ensure it passes.


http://gerrit.cloudera.org:8080/#/c/8017/1/src/kudu/integration-tests/catalog_manager_tsk-itest.cc
File src/kudu/integration-tests/catalog_manager_tsk-itest.cc:

Line 64:         hb_interval_ms_(128),
I get the feeling that, although these values may now be carefully tuned so that the test passes, that may change as Kudu continues to evolve, at which point we'll burn more time on deflaking it.

Why exactly does this test need so many overridden values? Are they required for the test in some way? Or can any be reverted back to the defaults?



Gerrit-MessageType: comment
Gerrit-Change-Id: I50cee27a579cffa7232137c7039b02a1ad4ab7eb
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-HasComments: Yes

[kudu-CR] [tests] de-flaking catalog manager tsk-itest

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has abandoned this change. ( http://gerrit.cloudera.org:8080/8017 )

Change subject: [tests] de-flaking catalog_manager_tsk-itest
......................................................................


Abandoned

Since the fix for KUDU-2149 was committed, this test has become stable even with its original settings.

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: abandon
Gerrit-Change-Id: I50cee27a579cffa7232137c7039b02a1ad4ab7eb
Gerrit-Change-Number: 8017
Gerrit-PatchSet: 1
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] [tests] de-flaking catalog manager tsk-itest

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change.

Change subject: [tests] de-flaking catalog_manager_tsk-itest
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8017/1//COMMIT_MSG
Commit Message:

> Did you try looping this test with the recent failure-detection change (htt
OK, here are the results of running without and with the 21b0f3d5 changelist:

Without the changelist (HEAD is at c8e04077), --stress-cpu-threads=16, 0/1024 failed:
  http://dist-test.cloudera.org//job?job_id=aserbin.1505159852.20744

With the changelist (HEAD is at 21b0f3d5), --stress-cpu-threads=16, at least 7/1024 failed:
  http://dist-test.cloudera.org//job?job_id=aserbin.1505160964.1750
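
To put the two runs in perspective, here is a minimal sketch of the
arithmetic.  Both helpers are hypothetical and only illustrate the
numbers quoted above.  The second one applies the "rule of three", a
common approximation for the 95% upper bound on a failure rate when
zero failures are seen in n independent runs; using it here is an
added assumption, not something the dist-test jobs report.

```cpp
#include <cassert>

// Observed failure rate as a percentage of all runs.
double FailureRatePercent(int failed, int total) {
  assert(total > 0 && failed >= 0 && failed <= total);
  return 100.0 * failed / total;
}

// Rule of three: with 0 failures in `total` independent runs, the true
// failure rate is below roughly 3/total with about 95% confidence.
double RuleOfThreeUpperBoundPercent(int total) {
  assert(total > 0);
  return 100.0 * 3.0 / total;
}
```

For the figures in this message, FailureRatePercent(7, 1024) is about
0.68%, while RuleOfThreeUpperBoundPercent(1024) is about 0.29%, so a
clean 1024-run loop bounds the failure rate without proving it is zero.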


Could you clarify what you want to address in this regard?

As I understand it, the test was built to induce many re-elections among masters, and the parameters were set so that the process converged more or less within the specified timeout intervals.  With the new way of sending heartbeats and doing master failure detection, it seems the masters sometimes cannot handle Raft HBs as quickly as they used to.  But as I understand it, this is all about 'boundary' conditions.



Gerrit-MessageType: comment
Gerrit-Change-Id: I50cee27a579cffa7232137c7039b02a1ad4ab7eb
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-HasComments: Yes

[kudu-CR] [tests] de-flaking catalog manager tsk-itest

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change.

Change subject: [tests] de-flaking catalog_manager_tsk-itest
......................................................................


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/8017/1/src/kudu/integration-tests/catalog_manager_tsk-itest.cc
File src/kudu/integration-tests/catalog_manager_tsk-itest.cc:

PS1, Line 84: // Add master-only flags.
Someone newly reading through this test might not understand why all these flags are necessary without reading the commit msg. Could you comment with a high-level statement explaining what the desired behavior of the master is?


PS1, Line 98: // Add tserver-only flags.
Same here.



Gerrit-MessageType: comment
Gerrit-Change-Id: I50cee27a579cffa7232137c7039b02a1ad4ab7eb
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-HasComments: Yes

[kudu-CR] [tests] de-flaking catalog manager tsk-itest

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change.

Change subject: [tests] de-flaking catalog_manager_tsk-itest
......................................................................


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/8017/1//COMMIT_MSG
Commit Message:

> Did you try looping this test with the recent failure-detection change (htt
I didn't try, but I know it was not that flaky prior to that patch.

I can double-check and report on that.

Actually, I suspect there were 2 changelists that made this test flakier: this one and another one committed between September 2 and 4.  I can dig in to find the exact ones.


http://gerrit.cloudera.org:8080/#/c/8017/1/src/kudu/integration-tests/catalog_manager_tsk-itest.cc
File src/kudu/integration-tests/catalog_manager_tsk-itest.cc:

Line 64:         hb_interval_ms_(128),
> I get the feeling that, although these values may now be carefully tuned so
In this test we want to induce many elections among masters, so that elections happen while a leader tries to write some data into the system catalog table (particularly, a new token signing key).  Adding the --catalog_manager_inject_latency_prior_tsk_write_ms=1000 flag and making the Raft HB interval shorter than that 1000ms latency (along with disabling pre-elections) gives us the desired behavior.



Gerrit-MessageType: comment
Gerrit-Change-Id: I50cee27a579cffa7232137c7039b02a1ad4ab7eb
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-HasComments: Yes

[kudu-CR] [tests] de-flaking catalog manager tsk-itest

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change.

Change subject: [tests] de-flaking catalog_manager_tsk-itest
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8017/1/src/kudu/integration-tests/catalog_manager_tsk-itest.cc
File src/kudu/integration-tests/catalog_manager_tsk-itest.cc:

Line 64:         hb_interval_ms_(128),
> In this test we want to induce many elections among masters, so that electi
The more re-elections we have among masters, the better.  That's why the Raft HB interval is set to just tens or hundreds of milliseconds.

Another approach might be setting --catalog_manager_inject_latency_prior_tsk_write_ms=10000 and using the default Raft HB interval of 1 second, but that would require longer test runtime to get the same number of master re-elections during the test.
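
As a rough way to compare these settings, the sketch below counts how many whole heartbeat periods fit into the latency injected before the TSK write.  The helper is hypothetical and the ratio is only a crude proxy for the opportunity to trigger re-elections; the real number of elections also depends on failure-detection and back-off settings that are not modeled here.

```cpp
#include <cassert>

// Hypothetical helper for illustration only; not part of Kudu.
// Returns how many whole Raft heartbeat periods fit into the latency
// injected before the TSK write.
int HeartbeatPeriodsDuringWrite(int injected_latency_ms, int hb_interval_ms) {
  assert(hb_interval_ms > 0);
  return injected_latency_ms / hb_interval_ms;
}
```

With the values mentioned in this thread, a 1000ms latency spans 62 periods at a 16ms interval and 7 periods at 128ms, while a 10000ms latency spans 10 periods at a 1 second interval, at the cost of a tenfold longer injected delay per write.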



Gerrit-MessageType: comment
Gerrit-Change-Id: I50cee27a579cffa7232137c7039b02a1ad4ab7eb
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-HasComments: Yes

[kudu-CR] [tests] de-flaking catalog manager tsk-itest

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: [tests] de-flaking catalog_manager_tsk-itest
......................................................................


Patch Set 1:

What's the verdict on this? It seems like the test is no longer as flaky as it was last week. Did we fix something?


Gerrit-MessageType: comment
Gerrit-Change-Id: I50cee27a579cffa7232137c7039b02a1ad4ab7eb
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] [tests] de-flaking catalog manager tsk-itest

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change.

Change subject: [tests] de-flaking catalog_manager_tsk-itest
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8017/1//COMMIT_MSG
Commit Message:

> OK, here the result running without and with 21b0f3d5 changelist:
I can buy that the new approach to failure detection could require rejiggering of test parameters in order to find the new boundaries.

However, if the logs/timings show that election convergence is net _less efficient_ than it was before that change, then it'd be better to treat that as a bug and figure out how to fix that than it'd be to rejigger the boundary conditions in this one test.

The main question is whether election convergence is "worse" or just "different". If the latter, then I agree with you that we should just tweak the timings in this test. But if the former, then we should address that directly.



Gerrit-MessageType: comment
Gerrit-Change-Id: I50cee27a579cffa7232137c7039b02a1ad4ab7eb
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-HasComments: Yes

[kudu-CR] [tests] de-flaking catalog manager tsk-itest

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change.

Change subject: [tests] de-flaking catalog_manager_tsk-itest
......................................................................


Patch Set 1:

> What's the verdict on this? It seems like the test is no longer as
> flaky as it was last week. Did we fix something?

I was planning to take a closer look at why this test became flaky after the mentioned changelist 21b0f3d5, but I haven't done that yet.

As for the test appearing less flaky, nothing has been fixed in that regard yet.  I think the observed 'more stable behavior' was due to lower load while the test was running (or the test may have run on more powerful machines recently).


Gerrit-MessageType: comment
Gerrit-Change-Id: I50cee27a579cffa7232137c7039b02a1ad4ab7eb
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No