Posted to dev@mesos.apache.org by Benjamin Mahler <bm...@apache.org> on 2017/04/28 23:12:03 UTC

Re: Parallel test runner added

Is anyone using the parallel test runner? I did another test of it today
and it triggered 278 failing tests. I noticed a lot of timeouts so I tried
bumping the default wait time from 15 seconds to 120 seconds. That brought
it down to 43 failures.

Taking a look at the remaining failures, it seems it is going too wide on
my system (the system has 12 cores, 24 hyperthreads, although I see 48
entries in /proc/cpuinfo):

[----------] 1 test from DiskQuotaTest
[ RUN      ] DiskQuotaTest.SlaveRecovery
/home/bmahler/git/mesos/build/src/mesos-containerizer: fork: retry:
Resource temporarily unavailable
terminate called after throwing an instance of 'std::system_error'
  what():  Resource temporarily unavailable
../../src/tests/disk_quota_tests.cpp:666: Failure
Value of: status->state()
  Actual: TASK_FAILED
Expected: TASK_RUNNING
../../src/tests/disk_quota_tests.cpp:671: Failure
Value of: containers->size()
  Actual: 0
Expected: 1u
Which is: 1
[  FAILED  ] DiskQuotaTest.SlaveRecovery (1636 ms)
[----------] 1 test from DiskQuotaTest (1638 ms total)

[----------] 1 test from FetcherTest
[ RUN      ] FetcherTest.UNZIP_ExtractFileWithDuplicatedEntries
../../src/tests/fetcher_tests.cpp:911: Failure
(fetch).failure(): Failed to execute mesos-fetcher: Failed to clone:
Resource temporarily unavailable
[  FAILED  ] FetcherTest.UNZIP_ExtractFileWithDuplicatedEntries (8 ms)
[----------] 1 test from FetcherTest (8 ms total)

It seems we should constrain how wide it goes, as well as restrict the
number of worker threads libprocess uses in each instance.
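
A rough sketch of the arithmetic behind "too wide", for illustration only. It assumes the runner applies the `#cores*1.5` rule from the original announcement to the number of logical CPUs the OS reports (I have not checked how it actually counts them). The cap of 16 and the per-instance worker thread counts are hypothetical values, not existing settings.

```python
def default_jobs(logical_cpus, factor=1.5):
    """Job count under the `#cores*1.5` rule quoted in this thread."""
    return max(1, int(logical_cpus * factor))


def capped_jobs(logical_cpus, cap, factor=1.5):
    """Same rule, but never wider than `cap` (a hypothetical upper bound)."""
    return min(cap, default_jobs(logical_cpus, factor))


def estimated_threads(jobs, workers_per_job):
    """Rough count of worker threads summed over all parallel test instances."""
    return jobs * workers_per_job


if __name__ == "__main__":
    cpus = 48  # entries in /proc/cpuinfo on the machine described above

    # Uncapped: job count grows linearly with the reported CPU count.
    wide = default_jobs(cpus)
    # Assumption for illustration: one worker thread per logical CPU per instance.
    print(wide, estimated_threads(wide, cpus))

    # Capped: hypothetical limit of 16 jobs and 8 worker threads per instance.
    narrow = capped_jobs(cpus, cap=16)
    print(narrow, estimated_threads(narrow, 8))
```

With 48 reported CPUs the uncapped rule gives 72 jobs, and the thread count multiplies on top of that, which would be consistent with `fork` and `clone` failing with "Resource temporarily unavailable".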

On Thu, Oct 13, 2016 at 3:51 PM, Michael Park <mp...@apache.org> wrote:

> Thanks for pushing this through Benjamin!
>
> I understand if you're unable to attend the community sync on the 20th,
> but would you be able to present this as a demo somehow? maybe via a
> screencast?
>
> MPark
>
> On Thu, Oct 13, 2016 at 6:33 PM, Benjamin Mahler <bm...@apache.org>
> wrote:
>
> > Great to see this Benjamin!
> >
> > Looking forward to seeing the parallel test runner turn green, I'll help
> > file tickets under the epic (I see there are a lot of test failures for
> > me).
> >
> > Once we clear the issues and turn it green, shall we make this the
> > default? I would be in favor of that.
> >
> > Ben
> >
> > On Thu, Oct 13, 2016 at 2:28 PM, Benjamin Bannier <
> > benjamin.bannier@mesosphere.io> wrote:
> >
> > >
> > > Hi,
> > >
> > > Since most tests in the Mesos, libprocess, and stout test suites can
> > > be executed in parallel (the exception being some `ROOT` tests with
> > > global side effects in Mesos), we recently added a parallel test
> > > > runner `support/mesos-gtest-runner.py`. This should allow us to
> > > > significantly speed up test suite runs.
> > >
> > > To enable automatic parallel execution of tests for test targets
> > > executed during `make check`, configure Mesos with the option
> > > > `--enable-parallel-test-execution`. This will configure the test
> > > > runner to run all tests but the `ROOT` tests in parallel; `ROOT`
> > > > tests will be run in a separate, sequential step.
> > >
> > > * * *
> > >
> > > We use the environment variable `TEST_DRIVER` to drive parallel test
> > > execution. By setting this variable to an empty string you can
> > > temporarily disable configured parallel execution, e.g.,
> > >
> > >     % make check TEST_DRIVER=
> > >
> > > > By setting this environment variable you have control over the test
> > > > runner itself and its arguments, even without enabling parallel test
> > > > execution at `./configure` time. Be aware that many `ROOT` tests
> > > > cannot be run in parallel.
> > >
> > >
> > > The current settings oversubscribe the machine by running `#cores*1.5`
> > > parallel jobs. This was driven by the observation that currently our
> > > tests by and large do not make extended use of even a single core.
> > > > The number of parallel jobs can be controlled with the `-j` flag of
> > > the test runner.
> > >
> > > Since making more use of the machine will likely increase machine load
> > > during test execution, running tests in parallel might expose test
> > > > flakiness. Tests might also fail to run in parallel if test cases, e.g.,
> > > write data to hardcoded locations or use hardcoded ports. Please file
> > > JIRA tickets for such tests if they do not yet exist.
> > >
> > >
> > > There is still some work needed to improve reporting from parallel
> > > tests. We currently use a very silent mode if tests are running
> > > without failures, and just report the logs of failed jobs in case of
> > > failure. MESOS-6387 sketches out possible future improvements in this
> > > area.
> > >
> > >
> > > Happy testing,
> > >
> > > Benjamin with help from Kevin & Till
> > >
> > >
> >
>

Re: Parallel test runner added

Posted by Benjamin Bannier <be...@mesosphere.io>.
Hi again,

I looked at the currently committed parallel test execution tooling; below I summarize the existing options for machines with many cores.

I would still be very much interested to know how the existing defaults perform for typical dev setups. Every additional data point would be very much appreciated.

* * *

Our autotools setup declares a variable `TEST_DRIVER` which can be used to specify an alternative test driver invocation. Assuming one is in a directory `build/` directly under the main Mesos checkout, one can invoke

    $ ../configure TEST_DRIVER="$PWD/../support/mesos-gtest-runner.py -j10" --enable-parallel-test-execution

to bake a maximal concurrency of 10 into the build setup.

For an already configured setup one would specify flags with

    $ make check TEST_DRIVER="$PWD/../support/mesos-gtest-runner.py -j10"

To always have a fixed concurrency, one could declare an environment variable `TEST_DRIVER` that sets a test driver and its arguments; `./configure` will pick up its value and bake it into the build setup, so that enabling parallel test execution always uses this driver setup.


Under the covers the build setup invokes

    % ${TEST_DRIVER} ./src/mesos-tests

i.e. something like,

    % ../support/mesos-gtest-runner.py ./src/mesos-tests

so one can find an acceptable concurrency level by directly prefixing the test invocation with a driver setup and trying different values. The test runner has help strings documenting the parameters it understands.
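
As a small illustration of such an experiment, the sketch below only assembles the command lines for a few `-j` values and does not execute anything. The helper names and the chosen levels (4, 8, 16) are made up for this example; the paths are the ones from the invocation above.

```python
import shlex


def driver_command(runner, test_binary, jobs):
    """Argument list for one run: runner, its `-j` flag, then the test binary."""
    return [runner, "-j{}".format(jobs), test_binary]


def sweep(runner, test_binary, levels):
    """Shell-quoted command lines, one per concurrency level to try."""
    return [
        " ".join(shlex.quote(arg) for arg in driver_command(runner, test_binary, jobs))
        for jobs in levels
    ]


if __name__ == "__main__":
    # Hypothetical levels; pick values that make sense for the machine at hand.
    for line in sweep("../support/mesos-gtest-runner.py", "./src/mesos-tests", [4, 8, 16]):
        print(line)
```

Each printed line can be pasted into a shell from the `build/` directory and timed separately.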


HTH,

Benjamin




> On Apr 29, 2017, at 8:26 AM, Benjamin Bannier <be...@mesosphere.io> wrote:
> 
> Hi Ben,
> 
> I use the parallel test runner exclusively on a Mac OS machine with 8 hyperthreads and a 16-core Fedora box. For me only known flaky tests fail.
> 
> Currently the target parallelism is calculated rather naively and can, e.g., grow without bound, which will become an issue on machines with many cores. I would also be curious to know how the current defaults perform for "typical" setups. Every additional data point would help us decide on the best way forward.
> 
> I can take on proposing a patch to improve the situation for machines with many cores after the weekend. 
> 
> 
> Cheers,
> 
> Benjamin 
> 


Re: Parallel test runner added

Posted by Benjamin Bannier <be...@mesosphere.io>.
Hi Ben,

I use the parallel test runner exclusively on a Mac OS machine with 8 hyperthreads and a 16-core Fedora box. For me only known flaky tests fail.

Currently the target parallelism is calculated rather naively and can, e.g., grow without bound, which will become an issue on machines with many cores. I would also be curious to know how the current defaults perform for "typical" setups. Every additional data point would help us decide on the best way forward.

I can take on proposing a patch to improve the situation for machines with many cores after the weekend. 


Cheers,

Benjamin 
