You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Andrew Wong (Code Review)" <ge...@cloudera.org> on 2018/08/31 19:18:26 UTC

[kudu-CR] build: retry failed tests by default

Andrew Wong has uploaded this change for review. ( http://gerrit.cloudera.org:8080/11374


Change subject: build: retry failed tests by default
......................................................................

build: retry failed tests by default

Previously, the option KUDU_FLAKY_TEST_ATTEMPTS, would either pull the list
of flaky tests from the test server and retry those tests, or retry all flaky
tests if specified by the user.

This patch updates the unit test run logic to retry tests all tests.
Now, the flaky test list will only be used to specify which tests to
run, e.g. if RUN_FLAKY_ONLY is set.

I tested this locally by running run-test.sh with a dummy failed test,
setting KUDU_FLAKY_TEST_ATTEMPTS to a couple different values.

I made the appropriate change for dist_test as well; a run with
KUDU_FLAKY_TEST_ATTEMPTS=3:
http://dist-test.cloudera.org/job?job_id=awong.1535741607.141999

One with KUDU_FLAKY_TEST_ATTEMPTS=1:
http://dist-test.cloudera.org/job?job_id=awong.1535741550.141639

Change-Id: I6fc069527d6f2553a4e04e14c821090f826cbc0a
---
M build-support/dist_test.py
M build-support/jenkins/build-and-test.sh
M build-support/run-test.sh
3 files changed, 13 insertions(+), 60 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/74/11374/1
-- 
To view, visit http://gerrit.cloudera.org:8080/11374
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I6fc069527d6f2553a4e04e14c821090f826cbc0a
Gerrit-Change-Number: 11374
Gerrit-PatchSet: 1
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>

[kudu-CR] build: retry failed tests by default

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/11374 )

Change subject: build: retry failed tests by default
......................................................................


Patch Set 1:

(5 comments)

It should be noted that this change will make it easier to merge a new test that is flaky from day one. In the past I opposed making such a policy change, but I've since come around: minimizing precommit friction is a more important goal, and the flaky test dashboard will (eventually) tell us about a new test's flakiness.

http://gerrit.cloudera.org:8080/#/c/11374/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/11374/1//COMMIT_MSG@13
PS1, Line 13: retry tests all tests
Nit: retry all failed tests


http://gerrit.cloudera.org:8080/#/c/11374/1/build-support/dist_test.py
File build-support/dist_test.py:

http://gerrit.cloudera.org:8080/#/c/11374/1/build-support/dist_test.py@61
PS1, Line 61: # Whether to retry all failed C++ tests, rather than just known flaky tests.
            : # Since Java flaky tests are not reported by the test server, Java tests are
            : # always retried, regardless of this value.
            : RETRY_ALL_TESTS = int(os.environ.get('KUDU_RETRY_ALL_FAILED_TESTS', 0))
This no longer exists, right? I mean, we no longer read this env var in run-test.sh, so I assume we shouldn't here either.


http://gerrit.cloudera.org:8080/#/c/11374/1/build-support/dist_test.py@376
PS1, Line 376:                      flaky_test_set=set()):
Doesn't seem like we're using this anymore either.


http://gerrit.cloudera.org:8080/#/c/11374/1/build-support/jenkins/build-and-test.sh
File build-support/jenkins/build-and-test.sh:

http://gerrit.cloudera.org:8080/#/c/11374/1/build-support/jenkins/build-and-test.sh@40
PS1, Line 40: #   KUDU_FLAKY_TEST_ATTEMPTS  Default: 1
Shouldn't we default this to 3 to honor the intent of this change?


http://gerrit.cloudera.org:8080/#/c/11374/1/build-support/run-test.sh
File build-support/run-test.sh:

http://gerrit.cloudera.org:8080/#/c/11374/1/build-support/run-test.sh@27
PS1, Line 27: This can be used in the gerrit workflow
            : # to prevent annoying false-negatives.
This script feels a little too far removed from gerrit to mention it.



-- 
To view, visit http://gerrit.cloudera.org:8080/11374
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6fc069527d6f2553a4e04e14c821090f826cbc0a
Gerrit-Change-Number: 11374
Gerrit-PatchSet: 1
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Comment-Date: Fri, 31 Aug 2018 20:00:34 +0000
Gerrit-HasComments: Yes

[kudu-CR] build: retry all failed tests

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/11374 )

Change subject: build: retry all failed tests
......................................................................


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/11374/4//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/11374/4//COMMIT_MSG@13
PS4, Line 13: KUDU_FLAKY_TEST_ATTEMPTS
> typo here
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/11374
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6fc069527d6f2553a4e04e14c821090f826cbc0a
Gerrit-Change-Number: 11374
Gerrit-PatchSet: 5
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Tue, 04 Sep 2018 18:50:04 +0000
Gerrit-HasComments: Yes

[kudu-CR] build: retry all failed tests

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins, Adar Dembo, Todd Lipcon, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/11374

to look at the new patch set (#4).

Change subject: build: retry all failed tests
......................................................................

build: retry all failed tests

Previously, the option KUDU_FLAKY_TEST_ATTEMPTS, would either pull the
list of flaky tests from the test server and retry those tests, or retry
all failed tests if specified by the user. This patch updates the unit
test run logic to retry all failed tests as long as
KUDU_FLAKY_TEST_ATTEMPST is set. Now, the flaky test list will only be
used to specify which tests to run, e.g. if RUN_FLAKY_ONLY is set.

The new default when running build-and-test.sh will be to retry failed
tests 3 times. If running without dist-test, this is handled naturally
by running run-test.sh with the KUDU_FLAKY_TEST_ATTEMPTS environment
variable. If running with dist-test, the tasks' 'max_retries' field is
set appropriately.

I tested this locally by running run-test.sh with a dummy failed test,
setting KUDU_FLAKY_TEST_ATTEMPTS to a couple different values.

Here are some dist test runs with KUDU_FLAKY_TEST_ATTEMPTS=3:
http://dist-test.cloudera.org/job?job_id=awong.1536084217.20341

One with KUDU_FLAKY_TEST_ATTEMPTS unset:
http://dist-test.cloudera.org/job?job_id=awong.1536084777.22826

Change-Id: I6fc069527d6f2553a4e04e14c821090f826cbc0a
---
M build-support/dist_test.py
M build-support/jenkins/build-and-test.sh
M build-support/run-test.sh
3 files changed, 17 insertions(+), 79 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/74/11374/4
-- 
To view, visit http://gerrit.cloudera.org:8080/11374
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6fc069527d6f2553a4e04e14c821090f826cbc0a
Gerrit-Change-Number: 11374
Gerrit-PatchSet: 4
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] build: retry all failed tests

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/11374 )

Change subject: build: retry all failed tests
......................................................................


Patch Set 4:

(3 comments)

> Patch Set 3:
> 
> (3 comments)
> 
> > Ah, actually pre-commit uses build-and-test too, just with enabled dist-test. So changing the default value in build-and-test.sh does affect the pre-commit case.
> 
> We have to choose between two options: 1) the status quo, wherein only tests from the flaky test list can be retried, or 2) this patch, wherein all tests can be retried. Neither are particularly good for the reasons you've outlined: #1 means we don't let ourselves retry tests that fail once per week, and #2 means we can potentially introduce new flaky tests without noticing at precommit time.
> 
> If you're open to modifying test_result_server.py, we can explore other options. For example, we could use the flaky database to figure out if a particular test is "new" (i.e. has no runs) and not retry it, but retry the rest.

I'm still for #2, at the very least for C++ tests, we do have visibility into the flakiness of tests. Empirically, it seems like a majority of failed pre-commits come from flaky tests that aren't retried.

I'm open to modifying test_result_server.py, maybe by extending the duration of failed test tracking (i.e. making the flaky list longer). Also empirically, it doesn't seem like "new" flaky tests aren't actually new tests, but tests that happen to be flaky less frequently than once a week.

http://gerrit.cloudera.org:8080/#/c/11374/3/build-support/dist_test.py
File build-support/dist_test.py:

http://gerrit.cloudera.org:8080/#/c/11374/3/build-support/dist_test.py@59
PS3, Line 59: NUM_TEST_RETRIES = int(os.environ.get('KUDU_FLAKY_TEST_ATTEMPTS', 1)) - 1
> Might want to rename/recomment this so it reflects the fact that _any_ fail
Agreed re KUDU_FLAKY_TEST_ATTEMPTS, done


http://gerrit.cloudera.org:8080/#/c/11374/3/build-support/dist_test.py@395
PS3, Line 395:                }] * replicate_tasks
> Can just put FLAKY_TEST_RETRIES directly in here.
Done


http://gerrit.cloudera.org:8080/#/c/11374/3/build-support/run-test.sh
File build-support/run-test.sh:

http://gerrit.cloudera.org:8080/#/c/11374/3/build-support/run-test.sh@71
PS3, Line 71: TEST_EXECUTION_ATTEMPTS=${KUDU_FLAKY_TEST_ATTEMPTS:-1}
> dist-test uses run-test.sh, so won't this cause any dist-test triggered tes
Right, since dist-test doesn't pass in this environment variable, it would've been retried 3 times in dist test.



-- 
To view, visit http://gerrit.cloudera.org:8080/11374
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6fc069527d6f2553a4e04e14c821090f826cbc0a
Gerrit-Change-Number: 11374
Gerrit-PatchSet: 4
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Tue, 04 Sep 2018 18:46:26 +0000
Gerrit-HasComments: Yes

[kudu-CR] build: retry failed tests by default

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/11374 )

Change subject: build: retry failed tests by default
......................................................................


Patch Set 3:

> Patch Set 2:
> 
> Ah, actually pre-commit uses build-and-test too, just with enabled dist-test. So changing the default value in build-and-test.sh does affect the pre-commit case.

And it's worth noting that the _default_ for build-and-test with dist test enabled, wrt pre-commit, may not be very important. I checked the pre-commit configuration and it sets KUDU_FLAKY_TEST_ATTEMPTS=3.


-- 
To view, visit http://gerrit.cloudera.org:8080/11374
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6fc069527d6f2553a4e04e14c821090f826cbc0a
Gerrit-Change-Number: 11374
Gerrit-PatchSet: 3
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Sat, 01 Sep 2018 01:34:14 +0000
Gerrit-HasComments: No

[kudu-CR] build: retry all failed tests

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins, Adar Dembo, Todd Lipcon, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/11374

to look at the new patch set (#5).

Change subject: build: retry all failed tests
......................................................................

build: retry all failed tests

Previously, the option KUDU_FLAKY_TEST_ATTEMPTS, would either pull the
list of flaky tests from the test server and retry those tests, or retry
all failed tests if specified by the user. This patch updates the unit
test run logic to retry all failed tests as long as
KUDU_FLAKY_TEST_ATTEMPTS is set. Now, the flaky test list will only be
used to specify which tests to run, e.g. if RUN_FLAKY_ONLY is set.

The new default when running build-and-test.sh will be to retry failed
tests 3 times. If running without dist-test, this is handled naturally
by running run-test.sh with the KUDU_FLAKY_TEST_ATTEMPTS environment
variable. If running with dist-test, the tasks' 'max_retries' fields are
set appropriately.

I tested this locally by running run-test.sh with a dummy failed test,
setting KUDU_FLAKY_TEST_ATTEMPTS to a couple different values.

Here are some dist test runs with KUDU_FLAKY_TEST_ATTEMPTS=3:
http://dist-test.cloudera.org/job?job_id=awong.1536084217.20341

One with KUDU_FLAKY_TEST_ATTEMPTS unset:
http://dist-test.cloudera.org/job?job_id=awong.1536084777.22826

Change-Id: I6fc069527d6f2553a4e04e14c821090f826cbc0a
---
M build-support/dist_test.py
M build-support/jenkins/build-and-test.sh
M build-support/run-test.sh
3 files changed, 17 insertions(+), 79 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/74/11374/5
-- 
To view, visit http://gerrit.cloudera.org:8080/11374
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6fc069527d6f2553a4e04e14c821090f826cbc0a
Gerrit-Change-Number: 11374
Gerrit-PatchSet: 5
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] build: retry failed tests by default

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/11374 )

Change subject: build: retry failed tests by default
......................................................................


Patch Set 1:

(5 comments)

Summarizing/reposting some discussion we had on Slack. I'm aware of a few places we currently run all tests, configured in the following ways:
1. during pre-commit, we run all tests with dist test and retry Java tests up to 3 times
2. periodic runs of all tests, we run all tests with dist test and retry Java tests up to 3 times
3. flaky-test exploration, we run all tests without dist test, and retry flaky tests (C++ only) up to 3 times, and Java tests up to 3 times
4. flaky-test drilldown, we run only the flaky tests without dist test, and retry flaky tests (C++ only) up to 3 times (no Java tests get run, at least until we start reporting Java flakes)

Regarding each point and how they would change if the default for dist-test vs non-dist-test runs were to be bumped.
1. it's important be mindful of introducing new flakies. There have been cases where a Java flaky test was introduced, and wasn't caught in pre-commit because we retry Java tests. If the default for dist test were bumped, there would be less friction in pre-commit flakes, but our flaky-catching would be only as good as our vigilance of the flaky dashboard.
2. similar to the pre-commit case, but by default we'd expect lower failure rates.
3. bumping the non-dist-test retry default, would go unchanged since flaky tests are already retried up to 3 times.
4. bumping the non-dist-test retry default, non-flaky tests would now be retried up to 3 times.

The current PS2 does _not_ update the dist-test default because:
- Developers actively use dist-test to test potentially broken tests. Retrying in these cases by default doesn't seem helpful.
- The concerns about pre-commit made in the above bullet 1.
- The bonus in bullet 2 isn't always a bonus. If we want to see this, it should be an active decision by setting KUDU_FLAKY_TEST_ATTEMPTS.

http://gerrit.cloudera.org:8080/#/c/11374/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/11374/1//COMMIT_MSG@13
PS1, Line 13: retry tests all tests
> Nit: retry all failed tests
Done


http://gerrit.cloudera.org:8080/#/c/11374/1/build-support/dist_test.py
File build-support/dist_test.py:

http://gerrit.cloudera.org:8080/#/c/11374/1/build-support/dist_test.py@61
PS1, Line 61: # Whether to retry all failed C++ tests, rather than just known flaky tests.
            : # Since Java flaky tests are not reported by the test server, Java tests are
            : # always retried, regardless of this value.
            : RETRY_ALL_TESTS = int(os.environ.get('KUDU_RETRY_ALL_FAILED_TESTS', 0))
> This no longer exists, right? I mean, we no longer read this env var in run
Yeah, good catch.


http://gerrit.cloudera.org:8080/#/c/11374/1/build-support/dist_test.py@376
PS1, Line 376:                      flaky_test_set=set()):
> Doesn't seem like we're using this anymore either.
Done


http://gerrit.cloudera.org:8080/#/c/11374/1/build-support/jenkins/build-and-test.sh
File build-support/jenkins/build-and-test.sh:

http://gerrit.cloudera.org:8080/#/c/11374/1/build-support/jenkins/build-and-test.sh@40
PS1, Line 40: #   KUDU_FLAKY_TEST_ATTEMPTS  Default: 1
> Shouldn't we default this to 3 to honor the intent of this change?
Done. I didn't update the default in dist_test.py, since that's widely used for testing things that may well be broken/very flaky.


http://gerrit.cloudera.org:8080/#/c/11374/1/build-support/run-test.sh
File build-support/run-test.sh:

http://gerrit.cloudera.org:8080/#/c/11374/1/build-support/run-test.sh@27
PS1, Line 27: This can be used in the gerrit workflow
            : # to prevent annoying false-negatives.
> This script feels a little too far removed from gerrit to mention it.
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/11374
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6fc069527d6f2553a4e04e14c821090f826cbc0a
Gerrit-Change-Number: 11374
Gerrit-PatchSet: 1
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Sat, 01 Sep 2018 00:54:32 +0000
Gerrit-HasComments: Yes

[kudu-CR] build: retry failed tests by default

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins, Adar Dembo, Todd Lipcon, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/11374

to look at the new patch set (#3).

Change subject: build: retry failed tests by default
......................................................................

build: retry failed tests by default

Previously, the option KUDU_FLAKY_TEST_ATTEMPTS, would either pull the
list of flaky tests from the test server and retry those tests, or retry
all flaky tests if specified by the user. This patch updates the unit
test run logic to retry all failed tests. Now, the flaky test list will
only be used to specify which tests to run, e.g. if RUN_FLAKY_ONLY is
set.

The new default when running build-and-test.sh without dist-test enabled
will be to retry 3 times for all failed tests. The default when running
with dist test enabled has been unchanged.

I tested this locally by running run-test.sh with a dummy failed test,
setting KUDU_FLAKY_TEST_ATTEMPTS to a couple different values.

Here are some dist test runs with KUDU_FLAKY_TEST_ATTEMPTS=2:
http://dist-test.cloudera.org/job?job_id=awong.1535756796.39753

One with KUDU_FLAKY_TEST_ATTEMPTS unset:
http://dist-test.cloudera.org/job?job_id=awong.1535756961.40518

Change-Id: I6fc069527d6f2553a4e04e14c821090f826cbc0a
---
M build-support/dist_test.py
M build-support/jenkins/build-and-test.sh
M build-support/run-test.sh
3 files changed, 15 insertions(+), 76 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/74/11374/3
-- 
To view, visit http://gerrit.cloudera.org:8080/11374
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6fc069527d6f2553a4e04e14c821090f826cbc0a
Gerrit-Change-Number: 11374
Gerrit-PatchSet: 3
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] build: retry failed tests by default

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Kudu Jenkins, Adar Dembo, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/11374

to look at the new patch set (#2).

Change subject: build: retry failed tests by default
......................................................................

build: retry failed tests by default

Previously, the option KUDU_FLAKY_TEST_ATTEMPTS, would either pull the
list of flaky tests from the test server and retry those tests, or retry
all flaky tests if specified by the user. This patch updates the unit
test run logic to retry all failed tests. Now, the flaky test list will
only be used to specify which tests to run, e.g. if RUN_FLAKY_ONLY is
set.

The new default when running build-and-test.sh without dist-test enabled
will be to retry 3 times for all failed tests. The default when running
with dist test enabled has been unchanged.

I tested this locally by running run-test.sh with a dummy failed test,
setting KUDU_FLAKY_TEST_ATTEMPTS to a couple different values.

Here are some dist test runs with KUDU_FLAKY_TEST_ATTEMPTS=2:
http://dist-test.cloudera.org/job?job_id=awong.1535756796.39753

One with KUDU_FLAKY_TEST_ATTEMPTS unset:
http://dist-test.cloudera.org/job?job_id=awong.1535756961.40518

Change-Id: I6fc069527d6f2553a4e04e14c821090f826cbc0a
---
M build-support/dist_test.py
M build-support/jenkins/build-and-test.sh
M build-support/run-test.sh
3 files changed, 15 insertions(+), 75 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/74/11374/2
-- 
To view, visit http://gerrit.cloudera.org:8080/11374
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6fc069527d6f2553a4e04e14c821090f826cbc0a
Gerrit-Change-Number: 11374
Gerrit-PatchSet: 2
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins

[kudu-CR] build: retry failed tests by default

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/11374 )

Change subject: build: retry failed tests by default
......................................................................


Patch Set 2:

Ah, actually pre-commit uses build-and-test too, just with enabled dist-test. So changing the default value in build-and-test.sh does affect the pre-commit case.


-- 
To view, visit http://gerrit.cloudera.org:8080/11374
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6fc069527d6f2553a4e04e14c821090f826cbc0a
Gerrit-Change-Number: 11374
Gerrit-PatchSet: 2
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Sat, 01 Sep 2018 00:58:43 +0000
Gerrit-HasComments: No

[kudu-CR] build: retry all failed tests

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/11374 )

Change subject: build: retry all failed tests
......................................................................


Patch Set 5: Code-Review+2

But would be good to get a +1/+2 from Todd too.


-- 
To view, visit http://gerrit.cloudera.org:8080/11374
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6fc069527d6f2553a4e04e14c821090f826cbc0a
Gerrit-Change-Number: 11374
Gerrit-PatchSet: 5
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Tue, 04 Sep 2018 18:59:00 +0000
Gerrit-HasComments: No

[kudu-CR] build: retry all failed tests

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has abandoned this change. ( http://gerrit.cloudera.org:8080/11374 )

Change subject: build: retry all failed tests
......................................................................


Abandoned

Ended up going down a different route
-- 
To view, visit http://gerrit.cloudera.org:8080/11374
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: abandon
Gerrit-Change-Id: I6fc069527d6f2553a4e04e14c821090f826cbc0a
Gerrit-Change-Number: 11374
Gerrit-PatchSet: 5
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>

[kudu-CR] build: retry all failed tests

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/11374 )

Change subject: build: retry all failed tests
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/11374/4//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/11374/4//COMMIT_MSG@13
PS4, Line 13: KUDU_FLAKY_TEST_ATTEMPST
typo here



-- 
To view, visit http://gerrit.cloudera.org:8080/11374
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6fc069527d6f2553a4e04e14c821090f826cbc0a
Gerrit-Change-Number: 11374
Gerrit-PatchSet: 4
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Tue, 04 Sep 2018 18:48:06 +0000
Gerrit-HasComments: Yes

[kudu-CR] build: retry failed tests by default

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/11374 )

Change subject: build: retry failed tests by default
......................................................................


Patch Set 3:

(3 comments)

> Ah, actually pre-commit uses build-and-test too, just with enabled dist-test. So changing the default value in build-and-test.sh does affect the pre-commit case.

We have to choose between two options: 1) the status quo, wherein only tests from the flaky test list can be retried, or 2) this patch, wherein all tests can be retried. Neither are particularly good for the reasons you've outlined: #1 means we don't let ourselves retry tests that fail once per week, and #2 means we can potentially introduce new flaky tests without noticing at precommit time.

If you're open to modifying test_result_server.py, we can explore other options. For example, we could use the flaky database to figure out if a particular test is "new" (i.e. has no runs) and not retry it, but retry the rest.

http://gerrit.cloudera.org:8080/#/c/11374/3/build-support/dist_test.py
File build-support/dist_test.py:

http://gerrit.cloudera.org:8080/#/c/11374/3/build-support/dist_test.py@59
PS3, Line 59: FLAKY_TEST_RETRIES = int(os.environ.get('KUDU_FLAKY_TEST_ATTEMPTS', 1)) - 1
Might want to rename/recomment this so it reflects the fact that _any_ failure is retried, not just failures in tests that are in the (now no longer used) flaky test list.

Changing KUDU_FLAKY_TEST_ATTEMPTS probably isn't worth it though (since it's something of an interface that various things may depend on).


http://gerrit.cloudera.org:8080/#/c/11374/3/build-support/dist_test.py@395
PS3, Line 395:                "max_retries": max_retries
Can just put FLAKY_TEST_RETRIES directly in here.


http://gerrit.cloudera.org:8080/#/c/11374/3/build-support/run-test.sh
File build-support/run-test.sh:

http://gerrit.cloudera.org:8080/#/c/11374/3/build-support/run-test.sh@71
PS3, Line 71: TEST_EXECUTION_ATTEMPTS=${KUDU_FLAKY_TEST_ATTEMPTS:-3}
dist-test uses run-test.sh, so won't this cause any dist-test triggered test to be retried? I thought you didn't want that.



-- 
To view, visit http://gerrit.cloudera.org:8080/11374
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I6fc069527d6f2553a4e04e14c821090f826cbc0a
Gerrit-Change-Number: 11374
Gerrit-PatchSet: 3
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Sat, 01 Sep 2018 22:48:55 +0000
Gerrit-HasComments: Yes