Posted to dev@mesos.apache.org by Alex Rukletsov <al...@mesosphere.com> on 2015/12/15 19:15:12 UTC

Speed up Mesos tests

Folks,

I would like to share some facts and thoughts about tests. When I ran `make
check -j7` on my Mac OS machine the other day, gtest reported the following
(your numbers may vary depending on the OS you're on and filters you use):
[==========] 882 tests from 117 test cases ran. (298610 ms total)

The same command for Mesos 0.21.1, which was released about a year ago,
yields
[==========] 452 tests from 71 test cases ran. (196398 ms total)

We almost doubled the number of tests in 2015. I think this is a great
achievement per se; moreover, it makes the life of cluster operators,
release managers, and Mesos contributors less stressful. I am going to have
an extra glass of champagne to celebrate this on the upcoming New Year's
Eve : ).

There are still some flaky tests left (and there always will be; failure
is embedded into progress), but it is not the flakiness I would like to
discuss today. I would like to draw your attention to the last number in
the gtest output lines above.

When adding tests, we also add to the time it takes for the complete
test suite to run. There are multiple ways to keep this number small
(one is, heh, to write fewer tests : ) ). Today I propose to focus on
reducing the duration of individual test cases.

Mesos tests are often built around certain sequences of events; some of
those have timeouts, some depend on other events. Naive test
implementations sometimes lead to a test being blocked for the duration of
some timeout, pointlessly slowing down the whole suite! A good indicator of
such a test is that its duration is an integral number of seconds (the
timeout) plus some delta (actual testing code), for example 3123 ms or 5076
ms.
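
To make the "timeout plus delta" indicator concrete, here is a minimal
Python sketch. The function name `split_duration` and the 500 ms cutoff
for the delta are assumptions made for illustration; they are not part of
Mesos or gtest.

```python
def split_duration(duration_ms, max_delta_ms=500):
    """Split a test duration into (whole seconds, leftover ms).

    Returns None when the duration does not look like "timeout + delta":
    either it is under one second, or the leftover is larger than
    max_delta_ms (an arbitrary cutoff chosen for this sketch).
    """
    seconds, delta_ms = divmod(duration_ms, 1000)
    if seconds == 0 or delta_ms > max_delta_ms:
        return None
    return seconds, delta_ms
```

With these assumptions, 3123 ms splits into a 3 s wait plus 123 ms of
actual work, which is the pattern worth investigating.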

Suggestion: If you write a new test, please look at the test duration as
well. If it seems unreasonably long, investigate what the reasons are and
how you can make the test faster.

State of the art:
  * Slave recovery tests are known to be slow, see MESOS-733 [1].
  * Ben Mahler created an epic to track slow tests more than a year ago
(MESOS-1757 [2]) and did some work earlier (MESOS-297 [3]).
  * Dominic Hamon did pretty much what I have done (with a much nicer
command, too bad I noticed that after generating the list myself) and filed
MESOS-2059 [4].

To get a list of suspect tests I ran `./bin/mesos-tests.sh 2>/dev/null |
grep "ms)"` and noted down tests that took more than 1 second to complete.
To my knowledge, 1s is the shortest timeout we use in default values for
configurable parameters.
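
The same filtering can be scripted. Below is a rough Python sketch that
reads gtest result lines and keeps the tests slower than a threshold. The
names `RESULT_LINE` and `slow_tests` are hypothetical, and the regular
expression assumes the plain `[       OK ] Suite.Name (N ms)` line shape
shown in this message; other line shapes are simply skipped.

```python
import re

# Matches per-test result lines such as:
#   [       OK ] MasterTest.OfferTimeout (1053 ms)
# Summary lines ending in "ms total)" do not match.
RESULT_LINE = re.compile(
    r"^\[\s*(?:OK|FAILED)\s*\]\s+(\S+)\s+\((\d+) ms\)\s*$")


def slow_tests(lines, threshold_ms=1000):
    """Return (test name, duration in ms) pairs above threshold_ms,
    slowest first."""
    found = []
    for line in lines:
        match = RESULT_LINE.match(line)
        if match and int(match.group(2)) > threshold_ms:
            found.append((match.group(1), int(match.group(2))))
    return sorted(found, key=lambda item: item[1], reverse=True)
```

One way to use it would be to save the test output to a file and pass the
open file object as `lines`.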

For each test from the list I either created a JIRA ticket, or grouped a
bunch of seemingly related tickets into an epic (details below). I hijacked
MESOS-1757 [2] and made it a parent for all newly created epics and tickets.

I would like to encourage folks to look at these tickets and work on them
when they have the time and are in the mood. Apart from making `make check`
faster, I believe that most of these tickets are actually a very good way
to familiarize yourself with the Mesos codebase (hence I marked all tickets
as `newbie++`), so if you would like to contribute to Mesos but do not know
where to start, this can be a good choice!

It is clear that some tickets will turn out to be false positives, where
there is a good reason why a particular test takes longer than others. In
this case a comment explaining the reason is a proper resolution for the
ticket.

To avoid difficulties with finding a shepherd, I would suggest
investigating the test first, understanding the reason for the slowness,
and updating the ticket, so that a potential shepherd can more easily estimate
the amount of time necessary for fixing the issue. Investigating does not
require a shepherd, and once it is done, all following steps (finding a
shepherd, submitting a patch, getting it committed) are trivial.

I believe some tests may share the same root cause (for example, they rely
on the same timeout, which cannot be changed from the test harness). In
this case all such tests can be fixed by a single change.

Below are the suspect tests.
  * Examples tests, slow since early days, see MESOS-297 [3]. Filed
MESOS-4155 [6].
  * Fetcher cache and fetcher cache http tests, filed MESOS-4156 [7].
  * Zookeeper tests, some are slow since early days, see MESOS-297 [3].
Filed MESOS-4157 [8].
  * Slave recovery tests. Known to be slow, see MESOS-733 [1] and MESOS-297
[3]. Filed MESOS-4158 [9].
  * Group tests, filed MESOS-4159 [10].
  * Recover tests, filed MESOS-4160 [11].

  * SlaveTest.CommandExecutorWithOverride (1311 ms), filed MESOS-4161 [12].
  * SlaveTest.MetricsSlaveLaunchErrors (1009 ms), filed MESOS-4162 [13].
  * SlaveTest.HTTPSchedulerSlaveRestart (2307 ms), filed MESOS-4163 [14].
  * MasterTest.RecoverResources (1018 ms), filed MESOS-4164 [15].
  * MasterTest.MasterInfoOnReElection (1024 ms), filed MESOS-4165 [16].
  * MasterTest.LaunchCombinedOfferTest (2023 ms), filed MESOS-4166 [17].
  * MasterTest.OfferTimeout (1053 ms), filed MESOS-4167 [18].
  * MasterAllocatorTest/0.SlaveLost (5076 ms). Allocator related test,
MESOS-3775 [5]. The test waits 5s for an executor to terminate.
  * MasterMaintenanceTest.EnterMaintenanceMode (5087 ms), filed MESOS-4168
[19].
  * MasterMaintenanceTest.InverseOffers (2027 ms), filed MESOS-4169 [20].
  * OversubscriptionTest.UpdateAllocatorOnSchedulerFailover (1018 ms),
filed MESOS-4170 [21].
  * OversubscriptionTest.RemoveCapabilitiesOnSchedulerFailover (1018 ms),
filed MESOS-4171 [22].
  * GarbageCollectorIntegrationTest.Restart (5102 ms), filed MESOS-4172
[23].
  * HealthCheckTest.CheckCommandTimeout (15483 ms), filed MESOS-4173 [24].
  * HookTest.VerifySlaveLaunchExecutorHook (5061 ms), filed MESOS-4174 [25].
  * ContentType/SchedulerTest.Decline/0 (1022 ms), filed MESOS-4175 [26].

Thanks for reading up to this point,
AlexR


[1] https://issues.apache.org/jira/browse/MESOS-733
[2] https://issues.apache.org/jira/browse/MESOS-1757
[3] https://issues.apache.org/jira/browse/MESOS-297
[4] https://issues.apache.org/jira/browse/MESOS-2059
[5] https://issues.apache.org/jira/browse/MESOS-3775
[6] https://issues.apache.org/jira/browse/MESOS-4155
[7] https://issues.apache.org/jira/browse/MESOS-4156
[8] https://issues.apache.org/jira/browse/MESOS-4157
[9] https://issues.apache.org/jira/browse/MESOS-4158
[10] https://issues.apache.org/jira/browse/MESOS-4159
[11] https://issues.apache.org/jira/browse/MESOS-4160
[12] https://issues.apache.org/jira/browse/MESOS-4161
[13] https://issues.apache.org/jira/browse/MESOS-4162
[14] https://issues.apache.org/jira/browse/MESOS-4163
[15] https://issues.apache.org/jira/browse/MESOS-4164
[16] https://issues.apache.org/jira/browse/MESOS-4165
[17] https://issues.apache.org/jira/browse/MESOS-4166
[18] https://issues.apache.org/jira/browse/MESOS-4167
[19] https://issues.apache.org/jira/browse/MESOS-4168
[20] https://issues.apache.org/jira/browse/MESOS-4169
[21] https://issues.apache.org/jira/browse/MESOS-4170
[22] https://issues.apache.org/jira/browse/MESOS-4171
[23] https://issues.apache.org/jira/browse/MESOS-4172
[24] https://issues.apache.org/jira/browse/MESOS-4173
[25] https://issues.apache.org/jira/browse/MESOS-4174
[26] https://issues.apache.org/jira/browse/MESOS-4175

Re: Speed up Mesos tests

Posted by Benjamin Mahler <bm...@apache.org>.

There was a sharp increase in the test suite duration back when we added
the registrar: by default every test with a master uses replicated log
storage which involves many synchronous disk writes. We can swap out
replicated log storage for in-memory storage (already exists, just needs to
be wired up) if we want to get a broad improvement across the tests. The
reason that we didn't do this in the first place was that we wanted to be a
bit cautious when introducing the registrar, by trying to exercise log
storage across all the tests. This one is noted in MESOS-1757 but there's
no ticket cut out for it yet.

The other big win would be running the tests in parallel. This one is a big
shift from what we do today, but it's possible to do it even without
modifying the way we build the tests (for example, use a runner like
https://github.com/google/gtest-parallel to run many invocations of the
test binary, setting filters in order to divide the tests across the
processes). It's also a bit tricky to do in that we need to ensure that
certain tests (e.g. cgroup related) do not stomp on each other running in
parallel. Ideally we don't have to do this one.
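
As a rough illustration of dividing tests across processes with filters,
here is a small Python sketch. It only builds the `--gtest_filter`
arguments (Google Test separates patterns with `:`); `shard_filters` is a
hypothetical helper, not part of gtest-parallel, and it makes no attempt
to keep tests that must not overlap (e.g. cgroup related) in one shard.

```python
def shard_filters(test_names, shards):
    """Round-robin test names into `shards` groups and return one
    --gtest_filter argument per non-empty group."""
    if shards < 1:
        raise ValueError("shards must be at least 1")
    groups = [test_names[i::shards] for i in range(shards)]
    return ["--gtest_filter=" + ":".join(group) for group in groups if group]
```

Each returned argument could then be passed to a separate invocation of
the test binary. Google Test also has built-in sharding via the
`GTEST_TOTAL_SHARDS` and `GTEST_SHARD_INDEX` environment variables, which
may be simpler when no custom grouping is needed.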

Joris (cc'ed) had mentioned there may be other big wins we can get pretty
easily across the tests.

I thought I had wired up google test xml test reports into our jenkins job,
but perhaps this was lost during the move to docker. I just pushed a change
to generate the xml files, but I'm not sure yet how to expose them from the
docker container filesystem back to jenkins for processing.

Re: Speed up Mesos tests

Posted by Alex Rukletsov <al...@mesosphere.com>.

Greg, I think the "clock magic" is key to speeding up most of the tests; I'm
glad you raised that point. Moreover, in case some folks haven't noticed
it already, we have a doc describing some useful testing patterns:
testing-patterns.md. It would be great if we updated and enriched this doc
as well when working on these tickets.

Regarding MESOS-4101: it is an interesting and bold idea, and it would be
great to capture pros & cons and think about the potential implications or
caveats it may bring.

Re: Speed up Mesos tests

Posted by Neil Conway <ne...@gmail.com>.

+1 on the speed-up-the-tests project!

On Wed, Dec 16, 2015 at 10:29 AM, Greg Mann <gr...@mesosphere.io> wrote:
> I'd like to bring up something that both Neil and Joseph mentioned to me
> recently, which could be of use when working on these slow test tickets.
> Since we have the `process::Clock` class, it's quite easy to control the
> clock manually, and doing so can both speed up tests as well as make them
> more deterministic/less flaky. While we're working on the above tickets, I
> think it would be nice to look for opportunities to alter the tests we're
> touching to pause the clock and then advance it explicitly using `pause()`,
> `settle()`, and `advance()`, rather than letting it run as usual.

Yep -- I think eventually having the clock paused by default for tests
would probably be a good idea:

https://issues.apache.org/jira/browse/MESOS-4101

To make that happen, we might need a few more primitives to force
"pending" events to be processed before manually advancing the clock.
`Clock::settle()` works for libprocess messages, but not for socket
communication more generally (e.g., when using the HTTP API). It would
help to get rid of this kludge in `Clock::settle` as well:

https://issues.apache.org/jira/browse/MESOS-3760

Neil

Re: Speed up Mesos tests

Posted by Greg Mann <gr...@mesosphere.io>.

AlexR, thanks for this great work! :-D It's nice to hear that so many tests
have been added in the past year, and I appreciate the list of tickets to
check out; I'll definitely take one on soon when I have some time.

I'd like to bring up something that both Neil and Joseph mentioned to me
recently, which could be of use when working on these slow test tickets.
Since we have the `process::Clock` class, it's quite easy to control the
clock manually, and doing so can both speed up tests and make them
more deterministic/less flaky. While we're working on the above tickets, I
think it would be nice to look for opportunities to alter the tests we're
touching to pause the clock and then advance it explicitly using `pause()`,
`settle()`, and `advance()`, rather than letting it run as usual.
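
To show the general idea (not the libprocess API), here is a toy Python
sketch of a manually driven clock. `ManualClock`, `call_later`, and this
`advance` are hypothetical names for illustration only; the point is that
a test waiting on a 5 second timeout can jump the clock forward instead
of sleeping for real.

```python
import heapq


class ManualClock:
    """A clock that only moves when the test tells it to."""

    def __init__(self):
        self.now = 0
        self._timers = []   # heap of (deadline, sequence, callback)
        self._sequence = 0  # tie-breaker so callbacks are never compared

    def call_later(self, delay, callback):
        """Schedule callback to fire once the clock reaches now + delay."""
        heapq.heappush(
            self._timers, (self.now + delay, self._sequence, callback))
        self._sequence += 1

    def advance(self, seconds):
        """Move time forward, firing every timer that becomes due."""
        target = self.now + seconds
        while self._timers and self._timers[0][0] <= target:
            deadline, _, callback = heapq.heappop(self._timers)
            self.now = deadline
            callback()
        self.now = target
```

A test built on such a clock finishes as fast as the code under test
runs, no matter how long the simulated timeout is.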

Cheers,
Greg

Re: Speed up Mesos tests

Posted by tommy xiao <xi...@gmail.com>.

+1

-- 
Deshi Xiao
Twitter: xds2000
E-mail: xiaods(AT)gmail.com