Posted to reviews@kudu.apache.org by "Alexey Serbin (Code Review)" <ge...@cloudera.org> on 2018/08/30 20:15:08 UTC

[kudu-CR] [tests] make master-stress-test more stable

Alexey Serbin has uploaded this change for review. ( http://gerrit.cloudera.org:8080/11364 )


Change subject: [tests] make master-stress-test more stable
......................................................................

[tests] make master-stress-test more stable

The master-stress-test has been flaky for some time.  After looking
at those failures closely, I found about five different issues.  This
patch addresses the most prominent one: failures of the test scenario
because of timeout errors in TSAN builds.  The timeout errors
were induced by frequent RPC queue overflows.

The rest of the issues behind the flakiness will be addressed separately.

This patch also introduces rpc_negotiation_timeout as a member
for ExternalMiniClusterOptions: that's to customize the connection
negotiation timeout for the cluster's utility messenger.

Change-Id: I6b30d8afd4a24acdbd96481cadeaf8f6a9475adf
---
M src/kudu/integration-tests/master-stress-test.cc
M src/kudu/mini-cluster/external_mini_cluster.cc
M src/kudu/mini-cluster/external_mini_cluster.h
3 files changed, 77 insertions(+), 32 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/64/11364/1
-- 
To view, visit http://gerrit.cloudera.org:8080/11364
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I6b30d8afd4a24acdbd96481cadeaf8f6a9475adf
Gerrit-Change-Number: 11364
Gerrit-PatchSet: 1
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>

[kudu-CR] [tests] make master-stress-test more stable

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Hello Will Berkeley, Kudu Jenkins, Adar Dembo, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/11364

to look at the new patch set (#6).

Change subject: [tests] make master-stress-test more stable
......................................................................

[tests] make master-stress-test more stable

The master-stress-test has been flaky for some time.  After looking
at those failures closely, I found at least five different issues.

This patch addresses the most prominent one: failures of the test
scenario because of timeout errors in TSAN builds.  About 9 out
of 10 failures were due to the issue fixed by this patch.  The timeout
errors were triggered by RPC queue overflow and the timing of master
restarts with respect to the retry/back-off pattern used by KuduClient
and other test utility code.

The rest of the issues behind the flakiness will be addressed separately.

This patch also introduces rpc_negotiation_timeout as a member for
ExternalMiniClusterOptions: that's to customize the connection
negotiation timeout for the cluster's utility messenger.

Some statistics about the flakiness:

before the fix:
  37 out of 256 failed in TSAN build, where almost all failures were
  due to the issues fixed by this patch:
    http://dist-test.cloudera.org//job?job_id=aserbin.1535666928.86597

after the fix:
  2 out of 256 failed in TSAN build, where the failure was due to [2],
  which will be addressed separately:
    http://dist-test.cloudera.org/job?job_id=aserbin.1535665784.64065

A few other issues due to which the test is still a bit flaky:
  [1] https://issues.apache.org/jira/browse/KUDU-2561
  [2] https://issues.apache.org/jira/browse/KUDU-2564
  [3] https://issues.apache.org/jira/browse/HIVE-19874

By my understanding, Dan found [3] to be the reason behind one type
of HMS-related failure, and there are two more to evaluate.

Change-Id: I6b30d8afd4a24acdbd96481cadeaf8f6a9475adf
---
M src/kudu/integration-tests/master-stress-test.cc
M src/kudu/mini-cluster/external_mini_cluster.cc
M src/kudu/mini-cluster/external_mini_cluster.h
3 files changed, 93 insertions(+), 35 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/64/11364/6

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6b30d8afd4a24acdbd96481cadeaf8f6a9475adf
Gerrit-Change-Number: 11364
Gerrit-PatchSet: 6
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>

[kudu-CR] [tests] make master-stress-test more stable

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/11364 )

Change subject: [tests] make master-stress-test more stable
......................................................................


Patch Set 2:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/11364/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/11364/2//COMMIT_MSG@9
PS2, Line 9: The master-stress-test has been flaky for some time.  After looking
           : at those failure closely, I found about five different issues.  This
           : patch addresses the most prominent one: failures of the test scenario
           : because of timeouts errors in case of TSAN builds.  The timeout errors
           : were induced by frequent RPC queue overflows.
> Thanks for tackling this. Are you able to quantify how much this particular
Done


http://gerrit.cloudera.org:8080/#/c/11364/2/src/kudu/integration-tests/master-stress-test.cc
File src/kudu/integration-tests/master-stress-test.cc:

http://gerrit.cloudera.org:8080/#/c/11364/2/src/kudu/integration-tests/master-stress-test.cc@140
PS2, Line 140:     // Make the Raft heartbeat interval shorter to allow for faster detection
             :     // of failed/restarted leader masters.
             :     opts.extra_master_flags.emplace_back("--raft_heartbeat_interval_ms=500");
> Isn't this the default value?
Whoops.  Yes, 500 ms is the default value.  I was experimenting with a shorter interval here, but then went back to the default since there is no real benefit in shortening it.  I'll remove this piece.


http://gerrit.cloudera.org:8080/#/c/11364/2/src/kudu/integration-tests/master-stress-test.cc@161
PS2, Line 161:     opts.extra_tserver_flags.emplace_back("--raft_heartbeat_interval_ms=500");
> See above.
Yep, that's the default, but I thought it would be nice to list it here just to see the difference.  But if everybody remembers the default, then I'll just add a comment about that and remove this line.


http://gerrit.cloudera.org:8080/#/c/11364/2/src/kudu/integration-tests/master-stress-test.cc@327
PS2, Line 327:       if (s.IsNetworkError()) {
             :         continue;
             :       }
             :       if (tablet_ids.empty()) {
             :         continue;
             :       }
> Should we back off in both of these cases?
Yep, backing off would be nice.  Done.
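
For readers following along: the thread does not show how the backoff was
actually implemented in the patch.  The block below is only a generic sketch
of retrying with capped exponential backoff instead of a bare `continue`.
BackoffDelay and RetryWithBackoff are hypothetical names, not Kudu APIs, and
the doubling scheme is an assumption.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <functional>
#include <thread>

// Hypothetical helper (not a Kudu API): delay to wait before retrying after
// the given 0-based attempt, doubling from `base` and capped at `cap`.
inline std::chrono::milliseconds BackoffDelay(int attempt,
                                              std::chrono::milliseconds base,
                                              std::chrono::milliseconds cap) {
  // Clamp the shift so the multiplier cannot overflow for large attempts.
  const int shift = std::min(std::max(attempt, 0), 20);
  const std::chrono::milliseconds delay(base.count() *
                                        (std::int64_t{1} << shift));
  return std::min(delay, cap);
}

// Hypothetical helper: call `op` until it returns true or `max_attempts`
// calls have been made, sleeping with backoff between failed attempts.
inline bool RetryWithBackoff(const std::function<bool()>& op,
                             int max_attempts,
                             std::chrono::milliseconds base,
                             std::chrono::milliseconds cap) {
  for (int attempt = 0; attempt < max_attempts; ++attempt) {
    if (op()) {
      return true;
    }
    // No point in sleeping after the final failed attempt.
    if (attempt + 1 < max_attempts) {
      std::this_thread::sleep_for(BackoffDelay(attempt, base, cap));
    }
  }
  return false;
}
```

In a loop like the one quoted above, both the network-error case and the
empty-result case would count as a failed attempt of `op`.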


http://gerrit.cloudera.org:8080/#/c/11364/2/src/kudu/integration-tests/master-stress-test.cc@415
PS2, Line 415: it's crucial to keep the
             :       // master up
> All masters? Or just the new one?
De facto, that's about all masters.  Yes, the other two can elect a new leader and be available for requests, but in the next cycle the new leader might be shut down.  Otherwise, things get hairy from time to time and all retries eventually fail.

I think I'll just increase the default run time then.


http://gerrit.cloudera.org:8080/#/c/11364/2/src/kudu/integration-tests/master-stress-test.cc@492
PS2, Line 492:   FLAGS_timeout_ms = kDefaultAdminTimeout.ToMilliseconds();
> We don't use the CLI in this test, do we? If this is for the LeaderMasterPr
Yes, that's for the SyncRpc() calls on the LeaderMasterProxy.




Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6b30d8afd4a24acdbd96481cadeaf8f6a9475adf
Gerrit-Change-Number: 11364
Gerrit-PatchSet: 2
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-Comment-Date: Thu, 30 Aug 2018 22:31:33 +0000
Gerrit-HasComments: Yes

[kudu-CR] [tests] make master-stress-test more stable

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Hello Will Berkeley, Kudu Jenkins, Adar Dembo, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/11364

to look at the new patch set (#3).

Change subject: [tests] make master-stress-test more stable
......................................................................

[tests] make master-stress-test more stable

The master-stress-test has been flaky for some time.  After looking
at those failures closely, I found about five different issues.  This
patch addresses the most prominent one: failures of the test scenario
because of timeout errors in TSAN builds.  The timeout errors
were induced by frequent RPC queue overflows.

The rest of the issues behind the flakiness will be addressed separately.

This patch also introduces rpc_negotiation_timeout as a member
for ExternalMiniClusterOptions: that's to customize the connection
negotiation timeout for the cluster's utility messenger.

Change-Id: I6b30d8afd4a24acdbd96481cadeaf8f6a9475adf
---
M src/kudu/integration-tests/master-stress-test.cc
M src/kudu/mini-cluster/external_mini_cluster.cc
M src/kudu/mini-cluster/external_mini_cluster.h
3 files changed, 82 insertions(+), 32 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/64/11364/3

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6b30d8afd4a24acdbd96481cadeaf8f6a9475adf
Gerrit-Change-Number: 11364
Gerrit-PatchSet: 3
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>

[kudu-CR] [tests] make master-stress-test more stable

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/11364 )

Change subject: [tests] make master-stress-test more stable
......................................................................


Patch Set 5:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/11364/5/src/kudu/integration-tests/master-stress-test.cc
File src/kudu/integration-tests/master-stress-test.cc:

http://gerrit.cloudera.org:8080/#/c/11364/5/src/kudu/integration-tests/master-stress-test.cc@187
PS5, Line 187:     builder.default_admin_operation_timeout(kDefaultAdminTimeout);
> This was already done on L180.
woops


http://gerrit.cloudera.org:8080/#/c/11364/5/src/kudu/integration-tests/master-stress-test.cc@421
PS5, Line 421: to
> Nit: drop this
Done




Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6b30d8afd4a24acdbd96481cadeaf8f6a9475adf
Gerrit-Change-Number: 11364
Gerrit-PatchSet: 5
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-Comment-Date: Fri, 31 Aug 2018 17:51:26 +0000
Gerrit-HasComments: Yes

[kudu-CR] [tests] make master-stress-test more stable

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Hello Will Berkeley, Kudu Jenkins, Adar Dembo, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/11364

to look at the new patch set (#5).

Change subject: [tests] make master-stress-test more stable
......................................................................

[tests] make master-stress-test more stable

The master-stress-test has been flaky for some time.  After looking
at those failures closely, I found at least five different issues.

This patch addresses the most prominent one: failures of the test
scenario because of timeout errors in TSAN builds.  About 9 out
of 10 failures were due to the issue fixed by this patch.  The timeout
errors were triggered by RPC queue overflow and the timing of master
restarts with respect to the retry/back-off pattern used by KuduClient
and other test utility code.

The rest of the issues behind the flakiness will be addressed separately.

This patch also introduces rpc_negotiation_timeout as a member for
ExternalMiniClusterOptions: that's to customize the connection
negotiation timeout for the cluster's utility messenger.

Some statistics about the flakiness:

before the fix:
  37 out of 256 failed in TSAN build, where almost all failures are
  due to the issues fixed by this patch:
    http://dist-test.cloudera.org//job?job_id=aserbin.1535666928.86597

after the fix:
  2 out of 256 failed in TSAN build, where the failure was due to [2]
  (not addressed by this change list, it will be addressed separately):
    http://dist-test.cloudera.org/job?job_id=aserbin.1535665784.64065

A few other issues due to which the test is still a bit flaky:
  [1] https://issues.apache.org/jira/browse/KUDU-2561
  [2] https://issues.apache.org/jira/browse/KUDU-2564
  [3] https://issues.apache.org/jira/browse/HIVE-19874

By my understanding, Dan found [3] to be the reason behind one type
of HMS-related failure, and there are two more to evaluate.

Change-Id: I6b30d8afd4a24acdbd96481cadeaf8f6a9475adf
---
M src/kudu/integration-tests/master-stress-test.cc
M src/kudu/mini-cluster/external_mini_cluster.cc
M src/kudu/mini-cluster/external_mini_cluster.h
3 files changed, 93 insertions(+), 35 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/64/11364/5

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6b30d8afd4a24acdbd96481cadeaf8f6a9475adf
Gerrit-Change-Number: 11364
Gerrit-PatchSet: 5
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>

[kudu-CR] [tests] make master-stress-test more stable

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/11364 )

Change subject: [tests] make master-stress-test more stable
......................................................................

[tests] make master-stress-test more stable

The master-stress-test has been flaky for some time.  After looking
at those failures closely, I found at least five different issues.

This patch addresses the most prominent one: failures of the test
scenario because of timeout errors in TSAN builds.  About 9 out
of 10 failures were due to the issue fixed by this patch.  The timeout
errors were triggered by RPC queue overflow and the timing of master
restarts with respect to the retry/back-off pattern used by KuduClient
and other test utility code.

The rest of the issues behind the flakiness will be addressed separately.

This patch also introduces rpc_negotiation_timeout as a member for
ExternalMiniClusterOptions: that's to customize the connection
negotiation timeout for the cluster's utility messenger.

Some statistics about the flakiness:

before the fix:
  37 out of 256 failed in TSAN build, where almost all failures were
  due to the issues fixed by this patch:
    http://dist-test.cloudera.org//job?job_id=aserbin.1535666928.86597

after the fix:
  2 out of 256 failed in TSAN build, where the failure was due to [2],
  which will be addressed separately:
    http://dist-test.cloudera.org/job?job_id=aserbin.1535665784.64065

A few other issues due to which the test is still a bit flaky:
  [1] https://issues.apache.org/jira/browse/KUDU-2561
  [2] https://issues.apache.org/jira/browse/KUDU-2564
  [3] https://issues.apache.org/jira/browse/HIVE-19874

By my understanding, Dan found [3] to be the reason behind one type
of HMS-related failure, and there are two more to evaluate.

Change-Id: I6b30d8afd4a24acdbd96481cadeaf8f6a9475adf
Reviewed-on: http://gerrit.cloudera.org:8080/11364
Tested-by: Kudu Jenkins
Reviewed-by: Adar Dembo <ad...@cloudera.com>
---
M src/kudu/integration-tests/master-stress-test.cc
M src/kudu/mini-cluster/external_mini_cluster.cc
M src/kudu/mini-cluster/external_mini_cluster.h
3 files changed, 93 insertions(+), 35 deletions(-)

Approvals:
  Kudu Jenkins: Verified
  Adar Dembo: Looks good to me, approved


Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I6b30d8afd4a24acdbd96481cadeaf8f6a9475adf
Gerrit-Change-Number: 11364
Gerrit-PatchSet: 7
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>

[kudu-CR] [tests] make master-stress-test more stable

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/11364 )

Change subject: [tests] make master-stress-test more stable
......................................................................


Patch Set 2:

(7 comments)

http://gerrit.cloudera.org:8080/#/c/11364/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/11364/2//COMMIT_MSG@9
PS2, Line 9: The master-stress-test has been flaky for some time.  After looking
           : at those failure closely, I found about five different issues.  This
           : patch addresses the most prominent one: failures of the test scenario
           : because of timeouts errors in case of TSAN builds.  The timeout errors
           : were induced by frequent RPC queue overflows.
Thanks for tackling this. Are you able to quantify how much this particular patch (vs. the others) deflakes the test?


http://gerrit.cloudera.org:8080/#/c/11364/2/src/kudu/integration-tests/master-stress-test.cc
File src/kudu/integration-tests/master-stress-test.cc:

http://gerrit.cloudera.org:8080/#/c/11364/2/src/kudu/integration-tests/master-stress-test.cc@140
PS2, Line 140:     // Make the Raft heartbeat interval shorter to allow for faster detection
             :     // of failed/restarted leader masters.
             :     opts.extra_master_flags.emplace_back("--raft_heartbeat_interval_ms=500");
Isn't this the default value?


http://gerrit.cloudera.org:8080/#/c/11364/2/src/kudu/integration-tests/master-stress-test.cc@161
PS2, Line 161:     opts.extra_tserver_flags.emplace_back("--raft_heartbeat_interval_ms=500");
See above.


http://gerrit.cloudera.org:8080/#/c/11364/2/src/kudu/integration-tests/master-stress-test.cc@327
PS2, Line 327:       if (s.IsNetworkError()) {
             :         continue;
             :       }
             :       if (tablet_ids.empty()) {
             :         continue;
             :       }
Should we back off in both of these cases?


http://gerrit.cloudera.org:8080/#/c/11364/2/src/kudu/integration-tests/master-stress-test.cc@415
PS2, Line 415: it's crucial to keep the
             :       // master up
All masters? Or just the new one?

If the latter, maybe we should modify this to kill masters in a round robin instead of randomly. I'm just a little concerned that a 2s sleep is quite long when the test only runs for 5s in normal mode.
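
To make the round-robin suggestion concrete, here is a small sketch of
picking the next master to restart in a fixed rotation rather than at
random.  RoundRobinPicker is a hypothetical class for illustration only; it
is not from the test or from Kudu, and how the index maps to a master is
left to the caller.

```cpp
#include <cstddef>

// Hypothetical helper (not from the patch): yields master indices in
// round-robin order so that each master is restarted in turn.
class RoundRobinPicker {
 public:
  // `num_masters` must be greater than zero.
  explicit RoundRobinPicker(std::size_t num_masters)
      : num_masters_(num_masters) {}

  // Returns the index of the next master to restart and advances the
  // rotation, wrapping around after the last master.
  std::size_t Next() {
    const std::size_t idx = next_;
    next_ = (next_ + 1) % num_masters_;
    return idx;
  }

 private:
  const std::size_t num_masters_;
  std::size_t next_ = 0;
};
```

With three masters, successive calls to Next() visit indices 0, 1, 2 and
then wrap back to 0, so a freshly restarted master is not picked again
until the other two have had their turn.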


http://gerrit.cloudera.org:8080/#/c/11364/2/src/kudu/integration-tests/master-stress-test.cc@492
PS2, Line 492:   FLAGS_timeout_ms = kDefaultAdminTimeout.ToMilliseconds();
We don't use the CLI in this test, do we? If this is for the LeaderMasterProxy, can you note that in a comment?


http://gerrit.cloudera.org:8080/#/c/11364/2/src/kudu/mini-cluster/external_mini_cluster.h
File src/kudu/mini-cluster/external_mini_cluster.h:

http://gerrit.cloudera.org:8080/#/c/11364/2/src/kudu/mini-cluster/external_mini_cluster.h@159
PS2, Line 159:   // an incomplete connection negotiation will timeout.
Can you note the default here?




Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6b30d8afd4a24acdbd96481cadeaf8f6a9475adf
Gerrit-Change-Number: 11364
Gerrit-PatchSet: 2
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-Comment-Date: Thu, 30 Aug 2018 20:31:00 +0000
Gerrit-HasComments: Yes

[kudu-CR] [tests] make master-stress-test more stable

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/11364 )

Change subject: [tests] make master-stress-test more stable
......................................................................


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/11364/2/src/kudu/mini-cluster/external_mini_cluster.h
File src/kudu/mini-cluster/external_mini_cluster.h:

http://gerrit.cloudera.org:8080/#/c/11364/2/src/kudu/mini-cluster/external_mini_cluster.h@159
PS2, Line 159:   // an incomplete connection negotiation will timeout.
> Can you note the default here?
Done




Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6b30d8afd4a24acdbd96481cadeaf8f6a9475adf
Gerrit-Change-Number: 11364
Gerrit-PatchSet: 2
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-Comment-Date: Fri, 31 Aug 2018 02:11:26 +0000
Gerrit-HasComments: Yes

[kudu-CR] [tests] make master-stress-test more stable

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/11364 )

Change subject: [tests] make master-stress-test more stable
......................................................................


Patch Set 5:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/11364/5/src/kudu/integration-tests/master-stress-test.cc
File src/kudu/integration-tests/master-stress-test.cc:

http://gerrit.cloudera.org:8080/#/c/11364/5/src/kudu/integration-tests/master-stress-test.cc@187
PS5, Line 187:     builder.default_admin_operation_timeout(kDefaultAdminTimeout);
This was already done on L180.


http://gerrit.cloudera.org:8080/#/c/11364/5/src/kudu/integration-tests/master-stress-test.cc@421
PS5, Line 421: to
Nit: drop this




Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6b30d8afd4a24acdbd96481cadeaf8f6a9475adf
Gerrit-Change-Number: 11364
Gerrit-PatchSet: 5
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-Comment-Date: Fri, 31 Aug 2018 17:35:59 +0000
Gerrit-HasComments: Yes

[kudu-CR] [tests] make master-stress-test more stable

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Hello Will Berkeley, Kudu Jenkins, Adar Dembo, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/11364

to look at the new patch set (#4).

Change subject: [tests] make master-stress-test more stable
......................................................................

[tests] make master-stress-test more stable

The master-stress-test has been flaky for some time.  After looking
at those failures closely, I found about five different issues.  This
patch addresses the most prominent one: failures of the test scenario
because of timeout errors in TSAN builds: about 9 out of 10
failures were due to the issue fixed by this patch.  The timeout errors
were induced by frequent RPC queue overflows and the timing of master
restarts with respect to the specifics of the KuduClient backoff scheme
used for retries.

The rest of the issues behind the flakiness will be addressed separately.

This patch also introduces rpc_negotiation_timeout as a member
for ExternalMiniClusterOptions: that's to customize the connection
negotiation timeout for the cluster's utility messenger.

Some statistics:

before the fix:
  37 out of 256 failed in TSAN build, where almost all failures are
  due to the issues fixed by this patch:
    http://dist-test.cloudera.org//job?job_id=aserbin.1535666928.86597

after the fix:
  2 out of 256 failed in TSAN build, where the failure was due to the
  issue [1] not addressed by this change list
  (it will be addressed separately):
    http://dist-test.cloudera.org/job?job_id=aserbin.1535665784.64065

A couple of other issues due to which the test has failed:
[1] https://issues.apache.org/jira/browse/KUDU-2564
[2] https://issues.apache.org/jira/browse/HIVE-19874 (Dan's evaluation)

Change-Id: I6b30d8afd4a24acdbd96481cadeaf8f6a9475adf
---
M src/kudu/integration-tests/master-stress-test.cc
M src/kudu/mini-cluster/external_mini_cluster.cc
M src/kudu/mini-cluster/external_mini_cluster.h
3 files changed, 91 insertions(+), 35 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/64/11364/4

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6b30d8afd4a24acdbd96481cadeaf8f6a9475adf
Gerrit-Change-Number: 11364
Gerrit-PatchSet: 4
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>

[kudu-CR] [tests] make master-stress-test more stable

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Hello Will Berkeley, Kudu Jenkins, Adar Dembo, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/11364

to look at the new patch set (#2).

Change subject: [tests] make master-stress-test more stable
......................................................................

[tests] make master-stress-test more stable

The master-stress-test has been flaky for some time.  After looking
at those failures closely, I found about five different issues.  This
patch addresses the most prominent one: failures of the test scenario
because of timeout errors in TSAN builds.  The timeout errors
were induced by frequent RPC queue overflows.

The rest of the issues behind the flakiness will be addressed separately.

This patch also introduces rpc_negotiation_timeout as a member
for ExternalMiniClusterOptions: that's to customize the connection
negotiation timeout for the cluster's utility messenger.

Change-Id: I6b30d8afd4a24acdbd96481cadeaf8f6a9475adf
---
M src/kudu/integration-tests/master-stress-test.cc
M src/kudu/mini-cluster/external_mini_cluster.cc
M src/kudu/mini-cluster/external_mini_cluster.h
3 files changed, 82 insertions(+), 32 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/64/11364/2
-- 
To view, visit http://gerrit.cloudera.org:8080/11364
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6b30d8afd4a24acdbd96481cadeaf8f6a9475adf
Gerrit-Change-Number: 11364
Gerrit-PatchSet: 2
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>

[kudu-CR] [tests] make master-stress-test more stable

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/11364 )

Change subject: [tests] make master-stress-test more stable
......................................................................


Patch Set 6: Code-Review+2



Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6b30d8afd4a24acdbd96481cadeaf8f6a9475adf
Gerrit-Change-Number: 11364
Gerrit-PatchSet: 6
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-Comment-Date: Fri, 31 Aug 2018 18:14:06 +0000
Gerrit-HasComments: No