You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Grant Henke (Code Review)" <ge...@cloudera.org> on 2018/04/27 15:04:44 UTC

[kudu-CR](branch-1.7.x) test util: include shard number in test scratch directory name

Grant Henke has uploaded this change for review. ( http://gerrit.cloudera.org:8080/10231


Change subject: test_util: include shard number in test scratch directory name
......................................................................

test_util: include shard number in test scratch directory name

Amongst other things, build-and-test.sh checks that every test cleans up its
scratch directory. It only performs this check if all tests pass, because if
a test fails, it'll leave behind its scratch directory for inspection.

This gets a little murky when considering flaky tests, because if one fails,
it is retried a few times. As such, it's important that when run-test.sh
prepares to rerun a flaky test, it cleans up the scratch directory left
behind by the failed run. Unfortunately, run-test.sh isn't really aware of
what it needs to clean up, because each test in the test executable creates
its own test-specific scratch directory within a common root. To account for
this, run-test.sh figures out what to clean using the following algorithm:
1. Before the test, capture the list of all directories within TEST_TMPDIR.
2. After the test, repeat the capture.
3. Diff the two captures to extract a list of directories that exist after
   but not before.
4. Search for the test name as a prefix in the extracted list.

While complex, the algorithm does match and delete all scratch directories
that were left behind by a particular test executable, even if TEST_TMPDIR
is in use by other tests running in parallel. However, sharded tests break
the algorithm, because the test name now includes a shard index, and that
index isn't included in the scratch directory's name.

I experimented with various fixes that included giving run-test.sh more
ownership over TEST_TMPDIR, but eventually settled on a smaller and more
targeted fix: if the test is being sharded, include the shard index in the
scratch directory name.

I tested this patch by running:

  TEST_TMPDIR=/tmp/kudutest-1000 \
  KUDU_FLAKY_TEST_ATTEMPTS=3 \
  KUDU_FLAKY_TEST_LIST=<(echo raft_consensus-itest) \
    ctest -j8 -R raft_consensus-itest

In the background, I also induced stress on the machine. This caused
RaftConsensusParamReplicationModesITest.Test_KUDU_1735/1 to flake and fail
from time to time.

Without the patch, the flaky test's scratch directory was always left behind
in TEST_TMPDIR. With it, the directory was always gone, unless the test
failed every time it was run, which is expected behavior.

Change-Id: Ie5686eb7c3a011a94b6c8c4eadbedb23acbf6e60
Reviewed-on: http://gerrit.cloudera.org:8080/10069
Tested-by: Kudu Jenkins
Reviewed-by: Todd Lipcon <to...@apache.org>
(cherry picked from commit baa1e677fd858a2be17a316fd768746a8e076f88)
---
M src/kudu/util/test_util.cc
1 file changed, 10 insertions(+), 1 deletion(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/31/10231/1
-- 
To view, visit http://gerrit.cloudera.org:8080/10231
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: branch-1.7.x
Gerrit-MessageType: newchange
Gerrit-Change-Id: Ie5686eb7c3a011a94b6c8c4eadbedb23acbf6e60
Gerrit-Change-Number: 10231
Gerrit-PatchSet: 1
Gerrit-Owner: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>

[kudu-CR](branch-1.7.x) test util: include shard number in test scratch directory name

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/10231 )

Change subject: test_util: include shard number in test scratch directory name
......................................................................


Patch Set 1: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/10231
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: branch-1.7.x
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie5686eb7c3a011a94b6c8c4eadbedb23acbf6e60
Gerrit-Change-Number: 10231
Gerrit-PatchSet: 1
Gerrit-Owner: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Comment-Date: Fri, 27 Apr 2018 16:26:39 +0000
Gerrit-HasComments: No

[kudu-CR](branch-1.7.x) test util: include shard number in test scratch directory name

Posted by "Grant Henke (Code Review)" <ge...@cloudera.org>.
Grant Henke has removed a vote on this change.

Change subject: test_util: include shard number in test scratch directory name
......................................................................


Removed Verified-1 by Kudu Jenkins (120)
-- 
To view, visit http://gerrit.cloudera.org:8080/10231
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: branch-1.7.x
Gerrit-MessageType: deleteVote
Gerrit-Change-Id: Ie5686eb7c3a011a94b6c8c4eadbedb23acbf6e60
Gerrit-Change-Number: 10231
Gerrit-PatchSet: 2
Gerrit-Owner: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins

[kudu-CR](branch-1.7.x) test util: include shard number in test scratch directory name

Posted by "Grant Henke (Code Review)" <ge...@cloudera.org>.
Grant Henke has posted comments on this change. ( http://gerrit.cloudera.org:8080/10231 )

Change subject: test_util: include shard number in test scratch directory name
......................................................................


Patch Set 2: Verified+1 Code-Review+2

Carrying forward Adars +2 and marking verified since the failures were due to Jenkins workspace issues.


-- 
To view, visit http://gerrit.cloudera.org:8080/10231
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: branch-1.7.x
Gerrit-MessageType: comment
Gerrit-Change-Id: Ie5686eb7c3a011a94b6c8c4eadbedb23acbf6e60
Gerrit-Change-Number: 10231
Gerrit-PatchSet: 2
Gerrit-Owner: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins
Gerrit-Comment-Date: Fri, 27 Apr 2018 17:08:08 +0000
Gerrit-HasComments: No

[kudu-CR](branch-1.7.x) test util: include shard number in test scratch directory name

Posted by "Grant Henke (Code Review)" <ge...@cloudera.org>.
Grant Henke has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/10231 )

Change subject: test_util: include shard number in test scratch directory name
......................................................................

test_util: include shard number in test scratch directory name

Amongst other things, build-and-test.sh checks that every test cleans up its
scratch directory. It only performs this check if all tests pass, because if
a test fails, it'll leave behind its scratch directory for inspection.

This gets a little murky when considering flaky tests, because if one fails,
it is retried a few times. As such, it's important that when run-test.sh
prepares to rerun a flaky test, it cleans up the scratch directory left
behind by the failed run. Unfortunately, run-test.sh isn't really aware of
what it needs to clean up, because each test in the test executable creates
its own test-specific scratch directory within a common root. To account for
this, run-test.sh figures out what to clean using the following algorithm:
1. Before the test, capture the list of all directories within TEST_TMPDIR.
2. After the test, repeat the capture.
3. Diff the two captures to extract a list of directories that exist after
   but not before.
4. Search for the test name as a prefix in the extracted list.

While complex, the algorithm does match and delete all scratch directories
that were left behind by a particular test executable, even if TEST_TMPDIR
is in use by other tests running in parallel. However, sharded tests break
the algorithm, because the test name now includes a shard index, and that
index isn't included in the scratch directory's name.

I experimented with various fixes that included giving run-test.sh more
ownership over TEST_TMPDIR, but eventually settled on a smaller and more
targeted fix: if the test is being sharded, include the shard index in the
scratch directory name.

I tested this patch by running:

  TEST_TMPDIR=/tmp/kudutest-1000 \
  KUDU_FLAKY_TEST_ATTEMPTS=3 \
  KUDU_FLAKY_TEST_LIST=<(echo raft_consensus-itest) \
    ctest -j8 -R raft_consensus-itest

In the background, I also induced stress on the machine. This caused
RaftConsensusParamReplicationModesITest.Test_KUDU_1735/1 to flake and fail
from time to time.

Without the patch, the flaky test's scratch directory was always left behind
in TEST_TMPDIR. With it, the directory was always gone, unless the test
failed every time it was run, which is expected behavior.

Change-Id: Ie5686eb7c3a011a94b6c8c4eadbedb23acbf6e60
Reviewed-on: http://gerrit.cloudera.org:8080/10069
Tested-by: Kudu Jenkins
Reviewed-by: Todd Lipcon <to...@apache.org>
(cherry picked from commit baa1e677fd858a2be17a316fd768746a8e076f88)
Reviewed-on: http://gerrit.cloudera.org:8080/10231
Reviewed-by: Grant Henke <gr...@apache.org>
Tested-by: Grant Henke <gr...@apache.org>
---
M src/kudu/util/test_util.cc
1 file changed, 10 insertions(+), 1 deletion(-)

Approvals:
  Grant Henke: Looks good to me, approved; Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/10231
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: branch-1.7.x
Gerrit-MessageType: merged
Gerrit-Change-Id: Ie5686eb7c3a011a94b6c8c4eadbedb23acbf6e60
Gerrit-Change-Number: 10231
Gerrit-PatchSet: 3
Gerrit-Owner: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Kudu Jenkins