You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org> on 2017/03/01 21:50:43 UTC

[kudu-CR] WIP: Make ExternalDaemon::StartProcess() handle fault injection

David Ribeiro Alves has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/6211

Change subject: WIP: Make ExternalDaemon::StartProcess() handle fault injection
......................................................................

WIP: Make ExternalDaemon::StartProcess() handle fault injection

Getting some jenkins mileage. Needs tests.

Change-Id: I6046e34a321de3e324e20e3d63249e4073712447
---
M src/kudu/integration-tests/external_mini_cluster.cc
M src/kudu/integration-tests/external_mini_cluster.h
2 files changed, 58 insertions(+), 1 deletion(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/11/6211/1
-- 
To view, visit http://gerrit.cloudera.org:8080/6211
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I6046e34a321de3e324e20e3d63249e4073712447
Gerrit-PatchSet: 1
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>

[kudu-CR] Make ExternalDaemon::StartProcess() handle fault injection

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6211

to look at the new patch set (#9).

Change subject: Make ExternalDaemon::StartProcess() handle fault injection
......................................................................

Make ExternalDaemon::StartProcess() handle fault injection

Occasionally starting a kudu daemon fails becuase of fault injection.
We recently fixed this error manifesting as SIGPIPE, but we didn't
fix the underlying cause: under load, tests doing fault
injection might fail before StartProcess() completes making
this method return a non-OK() status.

This patch makes sure that if the exit status is an expected one (85)
we still return as if starting the process had succeeded. Subsequent
calls to WaitForInjectedCrash() will return Status::OK();

To make sure this works, this patch includes a test that enables
a "fake" fault early in the startup process and makes sure that
both WaitForInjectedCrash and StartProcess() return OK() in
spite of it.

Change-Id: I6046e34a321de3e324e20e3d63249e4073712447
---
M src/kudu/integration-tests/external_mini_cluster-test.cc
M src/kudu/integration-tests/external_mini_cluster.cc
M src/kudu/integration-tests/external_mini_cluster.h
M src/kudu/tserver/tablet_server_main.cc
4 files changed, 37 insertions(+), 3 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/11/6211/9
-- 
To view, visit http://gerrit.cloudera.org:8080/6211
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6046e34a321de3e324e20e3d63249e4073712447
Gerrit-PatchSet: 9
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Jean-Daniel Cryans <jd...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] Make ExternalDaemon::StartProcess() handle fault injection

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has submitted this change and it was merged.

Change subject: Make ExternalDaemon::StartProcess() handle fault injection
......................................................................


Make ExternalDaemon::StartProcess() handle fault injection

Occasionally starting a kudu daemon fails becuase of fault injection.
We recently fixed this error manifesting as SIGPIPE, but we didn't
fix the underlying cause: under load, tests doing fault
injection might fail before StartProcess() completes making
this method return a non-OK() status.

This patch makes sure that if the exit status is an expected one (85)
we still return as if starting the process had succeeded. Subsequent
calls to WaitForInjectedCrash() will return Status::OK();

To make sure this works, this patch includes a test that enables
a "fake" fault early in the startup process and makes sure that
both WaitForInjectedCrash and StartProcess() return OK() in
spite of it.

Change-Id: I6046e34a321de3e324e20e3d63249e4073712447
Reviewed-on: http://gerrit.cloudera.org:8080/6211
Reviewed-by: Todd Lipcon <to...@apache.org>
Reviewed-by: Mike Percy <mp...@apache.org>
Tested-by: Kudu Jenkins
---
M src/kudu/integration-tests/external_mini_cluster-test.cc
M src/kudu/integration-tests/external_mini_cluster.cc
M src/kudu/integration-tests/external_mini_cluster.h
M src/kudu/tserver/tablet_server_main.cc
4 files changed, 37 insertions(+), 3 deletions(-)

Approvals:
  Mike Percy: Looks good to me, approved
  Todd Lipcon: Looks good to me, approved
  Kudu Jenkins: Verified



-- 
To view, visit http://gerrit.cloudera.org:8080/6211
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I6046e34a321de3e324e20e3d63249e4073712447
Gerrit-PatchSet: 10
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Jean-Daniel Cryans <jd...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] Make ExternalDaemon::StartProcess() handle fault injection

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change.

Change subject: Make ExternalDaemon::StartProcess() handle fault injection
......................................................................


Patch Set 6:

Is this patch necessary if we pull in https://gerrit.cloudera.org/6262 ?

-- 
To view, visit http://gerrit.cloudera.org:8080/6211
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6046e34a321de3e324e20e3d63249e4073712447
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Jean-Daniel Cryans <jd...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] Make ExternalDaemon::StartProcess() handle fault injection

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
David Ribeiro Alves has posted comments on this change.

Change subject: Make ExternalDaemon::StartProcess() handle fault injection
......................................................................


Patch Set 7:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/6211/6/src/kudu/integration-tests/external_mini_cluster.h
File src/kudu/integration-tests/external_mini_cluster.h:

Line 436:   // Starts a process with the given flags.
> is retrying multiple times the best way to go? or should we just return "OK
hum, some tests might be expecting a running server after calling this, but you're right in that likely its those tests that should be fixed. done


Line 438: 
> we can get rid of the SIGPIPE case with Mike's patch, right?
k. feel kinda nervous allowing it to crash without knowing whether it has a fault flag, but I see your point. done


-- 
To view, visit http://gerrit.cloudera.org:8080/6211
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6046e34a321de3e324e20e3d63249e4073712447
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Jean-Daniel Cryans <jd...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] Make ExternalDaemon::StartProcess() handle fault injection

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6211

to look at the new patch set (#4).

Change subject: Make ExternalDaemon::StartProcess() handle fault injection
......................................................................

Make ExternalDaemon::StartProcess() handle fault injection

external_mini_cluster-test is failing frequently with:
F0302 18:10:29.412564 22826 ts_itest-base.h:487] Check failed: _s.ok() Bad status: Runtime error: /tmp/run_tha_testm13fcv/build/debug/bin/kudu-tserver: process exited on signal 13

The reason for the failure is that, under load, tests doing fault
injection might fail before StartProcess() completes.
Sometimes these errors get misreported as termination because
of a SIGPIPE signal (number 13).
     
The fix is to run StartProcess() in loop until it is
successful as long as there is a flag with 'fault' passed
to the daemon.

To reproduce the error I ran this test on dist-test with stress
with the following command:

build-support/dist_test.py loop -n 500 \
build/debug/bin/external_mini_cluster-test \
--stress_cpu_threads=10 --gtest_repeat=10 \
--gtest_break_on_failure

Without this fix the test failed 500/500 times. I inspected
some of the logs and found the same error. The results can
be found at:
http://dist-test.cloudera.org//job?job_id=david.alves.1488478196.9127

With this fix the test passes 500/500 times. The results
can be found at:
http://dist-test.cloudera.org//job?job_id=david.alves.1488481249.22125

Change-Id: I6046e34a321de3e324e20e3d63249e4073712447
---
M src/kudu/integration-tests/external_mini_cluster.cc
M src/kudu/integration-tests/external_mini_cluster.h
2 files changed, 56 insertions(+), 1 deletion(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/11/6211/4
-- 
To view, visit http://gerrit.cloudera.org:8080/6211
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6046e34a321de3e324e20e3d63249e4073712447
Gerrit-PatchSet: 4
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Jean-Daniel Cryans <jd...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] Make ExternalDaemon::StartProcess() handle fault injection

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6211

to look at the new patch set (#6).

Change subject: Make ExternalDaemon::StartProcess() handle fault injection
......................................................................

Make ExternalDaemon::StartProcess() handle fault injection

external_mini_cluster-test is failing frequently with:
F0302 18:10:29.412564 22826 ts_itest-base.h:487] Check failed: _s.ok() Bad status: Runtime error: /tmp/run_tha_testm13fcv/build/debug/bin/kudu-tserver: process exited on signal 13

The reason for the failure is that, under load, tests doing fault
injection might fail before StartProcess() completes.
Sometimes these errors get misreported as termination because
of a SIGPIPE signal (number 13).

The fix is to run StartProcess() in loop until it is
successful as long as there is a flag with 'fault' passed
to the daemon.

To reproduce the error I ran this test on dist-test with stress
with the following command:

build-support/dist_test.py loop -n 500 \
build/debug/bin/external_mini_cluster-test \
--stress_cpu_threads=10 --gtest_repeat=10 \
--gtest_break_on_failure

Without this fix the test failed 500/500 times. I inspected
some of the logs and found the same error. The results can
be found at:
http://dist-test.cloudera.org//job?job_id=david.alves.1488478196.9127

With this fix the test passes 500/500 times. The results
can be found at:
http://dist-test.cloudera.org//job?job_id=david.alves.1488481249.22125

Change-Id: I6046e34a321de3e324e20e3d63249e4073712447
---
M src/kudu/integration-tests/external_mini_cluster.cc
M src/kudu/integration-tests/external_mini_cluster.h
2 files changed, 56 insertions(+), 1 deletion(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/11/6211/6
-- 
To view, visit http://gerrit.cloudera.org:8080/6211
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6046e34a321de3e324e20e3d63249e4073712447
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Jean-Daniel Cryans <jd...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] Make ExternalDaemon::StartProcess() handle fault injection

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6211

to look at the new patch set (#8).

Change subject: Make ExternalDaemon::StartProcess() handle fault injection
......................................................................

Make ExternalDaemon::StartProcess() handle fault injection

Occasionally starting a kudu daemon fails becuase of fault injection.
We recently fixed this error manifesting as SIGPIPE, but we didn't
fix the underlying cause: under load, tests doing fault
injection might fail before StartProcess() completes making
this method return a non-OK() status.

This patch makes sure that if the exit status is an expected one (85)
we still return as if starting the process had succeeded. Subsequent
calls to WaitForInjectedCrash() will return Status::OK();

To make sure this works, this patch includes a test that enables
a "fake" fault early in the startup process and makes sure that
both WaitForInjectedCrash and StartProcess() return OK() in
spite of it.

Change-Id: I6046e34a321de3e324e20e3d63249e4073712447
---
M src/kudu/integration-tests/external_mini_cluster-test.cc
M src/kudu/integration-tests/external_mini_cluster.cc
M src/kudu/integration-tests/external_mini_cluster.h
M src/kudu/tserver/tablet_server_main.cc
4 files changed, 38 insertions(+), 3 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/11/6211/8
-- 
To view, visit http://gerrit.cloudera.org:8080/6211
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6046e34a321de3e324e20e3d63249e4073712447
Gerrit-PatchSet: 8
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Jean-Daniel Cryans <jd...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] Make ExternalDaemon::StartProcess() handle fault injection

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: Make ExternalDaemon::StartProcess() handle fault injection
......................................................................


Patch Set 9: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/6211
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6046e34a321de3e324e20e3d63249e4073712447
Gerrit-PatchSet: 9
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Jean-Daniel Cryans <jd...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No

[kudu-CR] Make ExternalDaemon::StartProcess() handle fault injection

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6211

to look at the new patch set (#5).

Change subject: Make ExternalDaemon::StartProcess() handle fault injection
......................................................................

Make ExternalDaemon::StartProcess() handle fault injection

external_mini_cluster-test is failing frequently with:
F0302 18:10:29.412564 22826 ts_itest-base.h:487] Check failed: _s.ok() Bad status: Runtime error: /tmp/run_tha_testm13fcv/build/debug/bin/kudu-tserver: process exited on signal 13

The reason for the failure is that, under load, tests doing fault
injection might fail before StartProcess() completes.
Sometimes these errors get misreported as termination because
of a SIGPIPE signal (number 13).

The fix is to run StartProcess() in loop until it is
successful as long as there is a flag with 'fault' passed
to the daemon.

To reproduce the error I ran this test on dist-test with stress
with the following command:

build-support/dist_test.py loop -n 500 \
build/debug/bin/external_mini_cluster-test \
--stress_cpu_threads=10 --gtest_repeat=10 \
--gtest_break_on_failure

Without this fix the test failed 500/500 times. I inspected
some of the logs and found the same error. The results can
be found at:
http://dist-test.cloudera.org//job?job_id=david.alves.1488478196.9127

With this fix the test passes 500/500 times. The results
can be found at:
http://dist-test.cloudera.org//job?job_id=david.alves.1488481249.22125

Change-Id: I6046e34a321de3e324e20e3d63249e4073712447
---
M src/kudu/integration-tests/external_mini_cluster.cc
M src/kudu/integration-tests/external_mini_cluster.h
2 files changed, 56 insertions(+), 1 deletion(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/11/6211/5
-- 
To view, visit http://gerrit.cloudera.org:8080/6211
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6046e34a321de3e324e20e3d63249e4073712447
Gerrit-PatchSet: 5
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Jean-Daniel Cryans <jd...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] Make ExternalDaemon::StartProcess() handle fault injection

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6211

to look at the new patch set (#7).

Change subject: Make ExternalDaemon::StartProcess() handle fault injection
......................................................................

Make ExternalDaemon::StartProcess() handle fault injection

Occasionally starting a kudu daemon fails becuase of fault injection.
We recently fixed this error manifesting as SIGPIPE, but we didn't
fix the underlying cause: under load, tests doing fault
injection might fail before StartProcess() completes making
this method return a non-OK() status.

This patch makes sure that if the exit status is an expected one (85)
we still return as if starting the process had succeeded. Subsequent
calls to WaitForInjectedCrash() will return Status::OK();

To make sure this works, this patch includes a test that enables
a "fake" fault early in the startup process and makes sure that
both WaitForInjectedCrash and StartProcess() return OK() in
spite of it.

Change-Id: I6046e34a321de3e324e20e3d63249e4073712447
---
M src/kudu/integration-tests/external_mini_cluster-test.cc
M src/kudu/integration-tests/external_mini_cluster.cc
M src/kudu/integration-tests/external_mini_cluster.h
M src/kudu/tserver/tablet_server_main.cc
4 files changed, 38 insertions(+), 3 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/11/6211/7
-- 
To view, visit http://gerrit.cloudera.org:8080/6211
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6046e34a321de3e324e20e3d63249e4073712447
Gerrit-PatchSet: 7
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Jean-Daniel Cryans <jd...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] Make ExternalDaemon::StartProcess() handle fault injection

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6211

to look at the new patch set (#3).

Change subject: Make ExternalDaemon::StartProcess() handle fault injection
......................................................................

Make ExternalDaemon::StartProcess() handle fault injection

external_mini_cluster-test is failing frequently with:
F0302 18:10:29.412564 22826 ts_itest-base.h:487] Check failed: _s.ok() Bad status: Runtime error: /tmp/run_tha_testm13fcv/build/debug/bin/kudu-tserver: process exited on signal 13

The reason for that is that, under load, test that do fault
injection might fail before StartProcess() completes.
Sometimes these errors get misreported as SIGPIPE (signal
13).

The fix is to run StartProcess() in loop until it is
successful as long as there is a flag with 'fault' passed
to the daemon.

To reproduce the error I ran this test on dist-test with stress
with the following command:

build-support/dist_test.py loop -n 500 \
build/debug/bin/external_mini_cluster-test \
--stress_cpu_threads=10 --gtest_repeat=10 \
--gtest_break_on_failure

Without this fix the test failed 500/500 times. I inspected
some of the logs and found the same error. The results can
be found at:
http://dist-test.cloudera.org//job?job_id=david.alves.1488478196.9127

With this fix the test passes 500/500 times. The results
can be found at:
http://dist-test.cloudera.org//job?job_id=david.alves.1488481249.22125

Change-Id: I6046e34a321de3e324e20e3d63249e4073712447
---
M src/kudu/integration-tests/external_mini_cluster.cc
M src/kudu/integration-tests/external_mini_cluster.h
2 files changed, 56 insertions(+), 1 deletion(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/11/6211/3
-- 
To view, visit http://gerrit.cloudera.org:8080/6211
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6046e34a321de3e324e20e3d63249e4073712447
Gerrit-PatchSet: 3
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot

[kudu-CR] Make ExternalDaemon::StartProcess() handle fault injection

Posted by "David Ribeiro Alves (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/6211

to look at the new patch set (#2).

Change subject: Make ExternalDaemon::StartProcess() handle fault injection
......................................................................

Make ExternalDaemon::StartProcess() handle fault injection

external_mini_cluster-test is failing frequently with:
F0302 18:10:29.412564 22826 ts_itest-base.h:487] Check failed: _s.ok() Bad status: Runtime error: /tmp/run_tha_testm13fcv/build/debug/bin/kudu-tserver: process exited on signal 13

The reason for that is that, under load, test that do fault
injection might fail before StartProcess() completes.
Sometimes these errors get misreported as SIGPIPE (signal
13).

The fix is to run StartProcess() in loop until it is
successful as long as there is a flag with 'fault' passed
to the daemon.

To reproduce the error I ran this test on dist-test with stress
with the following command:

build-support/dist_test.py loop -n 500 \
build/debug/bin/external_mini_cluster-test \
--stress_cpu_threads=10 --gtest_repeat=10 \
--gtest_break_on_failure

Without this fix the test failed 500/500 times. I inspected
some of the logs and found the same error. The results can
be found at:
http://dist-test.cloudera.org//job?job_id=david.alves.1488478196.9127

With this fix the test passes 500/500 times. The results
can be found at:
http://dist-test.cloudera.org//job?job_id=david.alves.1488481249.22125

Change-Id: I6046e34a321de3e324e20e3d63249e4073712447
---
M src/kudu/integration-tests/external_mini_cluster.cc
M src/kudu/integration-tests/external_mini_cluster.h
2 files changed, 57 insertions(+), 1 deletion(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/11/6211/2
-- 
To view, visit http://gerrit.cloudera.org:8080/6211
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6046e34a321de3e324e20e3d63249e4073712447
Gerrit-PatchSet: 2
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Tidy Bot

[kudu-CR] Make ExternalDaemon::StartProcess() handle fault injection

Posted by "Todd Lipcon (Code Review)" <ge...@cloudera.org>.
Todd Lipcon has posted comments on this change.

Change subject: Make ExternalDaemon::StartProcess() handle fault injection
......................................................................


Patch Set 6:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/6211/6/src/kudu/integration-tests/external_mini_cluster.h
File src/kudu/integration-tests/external_mini_cluster.h:

Line 436:   // If HasFaultFlags() is true, tries to start the process multiple times if TryStartProcess()
is retrying multiple times the best way to go? or should we just return "OK"? I can see some tests are expecting to trigger exactly one crash rather than potentially restarting multiple times.


Line 438:   // a expected exit code OR that it received a SIGPIPE.
we can get rid of the SIGPIPE case with Mike's patch, right?

And given that we now only have to worry about the explicit fault injection crash, which has a recognizable error code, perhaps we should just do it all the time rather than trying to look at the flags?


-- 
To view, visit http://gerrit.cloudera.org:8080/6211
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6046e34a321de3e324e20e3d63249e4073712447
Gerrit-PatchSet: 6
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Jean-Daniel Cryans <jd...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: Yes

[kudu-CR] Make ExternalDaemon::StartProcess() handle fault injection

Posted by "Mike Percy (Code Review)" <ge...@cloudera.org>.
Mike Percy has posted comments on this change.

Change subject: Make ExternalDaemon::StartProcess() handle fault injection
......................................................................


Patch Set 9: Code-Review+2

-- 
To view, visit http://gerrit.cloudera.org:8080/6211
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6046e34a321de3e324e20e3d63249e4073712447
Gerrit-PatchSet: 9
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-Owner: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: David Ribeiro Alves <dr...@apache.org>
Gerrit-Reviewer: Jean-Daniel Cryans <jd...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Mike Percy <mp...@apache.org>
Gerrit-Reviewer: Tidy Bot
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-HasComments: No