Posted to dev@mesos.apache.org by Benno Evers <be...@mesosphere.com> on 2018/10/12 14:37:48 UTC

Mesos Flakiness Statistics

Hey all,

as you might know, we've set up an internal CI system that is running `make
check` on a variety of different platforms and configurations, 16 in total.

As we've experienced more and more pain maintaining a green master, I've
compiled some statistics about which tests are most flaky. I thought other
people might also be interested to have a look at that data:

Last Week:

    # CI Statistics since 2018-10-05 14:22:35.422882 for branches containing 'asf/master'
    Total: 41 failing tests, 28 unique. (avg 0.142361111111 failing tests per build)

    Top 5 failing tests:
    6x: [empty]
    4x: ResourceStatistics
    2x: CreateDestroyDiskRecovery
    2x: INTERNET_CURL_InvokeFetchByName
    2x: RecoverNestedContainer

Last Month:

    # CI Statistics since 2018-09-12 14:23:36.272031 for branches containing 'asf/master'
    Total: 320 failing tests, 75 unique. (avg 0.285714285714 failing tests per build)

    Top 5 failing tests:
    57x: Used
    32x: LongLivedDefaultExecutorRestart
    27x: PythonFramework
    23x: ROOT_CGROUPS_LaunchNestedContainerSessionsInParallel
    22x: ResourceStatistics

Last year:

    # CI Statistics since 2017-10-12 14:24:31.639792 for branches containing 'asf/master'
    Total: 3045 failing tests, 225 unique. (avg 0.184054642166 failing tests per build)

    Top 5 failing tests:
    292x: [empty]
    272x: ROOT_LOGROTATE_UNPRIVILEGED_USER_RotateWithSwitchUserTrueOrFalse
    136x: LOGROTATE_RotateInSandbox
    136x: LOGROTATE_CustomRotateOptions
    131x: ResourceStatistics
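
A report of this shape can be sketched in a few lines. The following is only an illustration, not the script that produced the numbers above: the input format (one list of failing test names per build, with `[empty]` standing in for a crashed run) and the function name `summarize` are assumptions.

```python
from collections import Counter


def summarize(builds, top=5):
    """Return report lines for `builds`, a list with one entry per build.

    Each entry is the list of test names that failed in that build
    (hypothetical input format; an empty list means a green build).
    """
    # Count how often each test name failed across all builds.
    counts = Counter(name for failures in builds for name in failures)
    total = sum(counts.values())
    avg = total / len(builds) if builds else 0.0

    lines = [
        "Total: %d failing tests, %d unique. (avg %.12g failing tests per build)"
        % (total, len(counts), avg),
        "Top %d failing tests:" % top,
    ]
    # most_common() sorts by descending failure count.
    lines += ["%dx: %s" % (n, name) for name, n in counts.most_common(top)]
    return lines


if __name__ == "__main__":
    example = [["ResourceStatistics", "[empty]"], [], ["ResourceStatistics"]]
    print("\n".join(summarize(example)))
```

Keeping the full `Fixture.Name` string as the counter key, rather than only the part after the dot, would avoid any ambiguity between tests that share a short name.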


I don't really have a point with all of this, but some observations:
 - [empty] means that the `mesos-tests` binary crashed
 - The data also includes "real", i.e. non-flaky, test failures, but they
should not appear in the top 5 lists because we would hopefully either
revert or fix them before they can accumulate dozens of failures
 - Over the whole year, we seem to be pretty good at fixing the nastiest
flakes, with only one of the top 5 still appearing in this week's test
results
 - Sadly, the fail percentage isn't as different between now and then as we
might have hoped.
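
As a rough cross-check of the last point, the number of builds behind each report can be recovered from the printed totals and averages. This is a sketch that assumes the average is simply the total number of failing tests divided by the number of builds; the helper name `implied_builds` is made up for illustration.

```python
# (total failing tests, average failing tests per build), copied from the
# three reports above.
windows = {
    "last week": (41, 0.142361111111),
    "last month": (320, 0.285714285714),
    "last year": (3045, 0.184054642166),
}


def implied_builds(total, avg):
    # Assuming avg = total / builds, we get builds = total / avg. The result
    # is rounded because the printed average only has 12 digits.
    return round(total / avg)


for name, (total, avg) in windows.items():
    print("%-10s %6d builds, %.3f failing tests per build"
          % (name, implied_builds(total, avg), avg))
```

Under that assumption the reports cover 288, 1120 and 16544 builds respectively, so last week's rate of about 0.14 failing tests per build is only modestly below the yearly rate of about 0.18.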

Hope this was interesting, and best regards,
-- 
Benno Evers
Software Engineer, Mesosphere

Re: Mesos Flakiness Statistics

Posted by Vinod Kone <vi...@mesosphere.io>.
This is great. Thanks Benno for sharing!

What did you use to do the analysis? I would love it if we could have graphs
that we can show on TVs.

On Mon, Oct 15, 2018 at 5:23 AM Benno Evers <be...@mesosphere.com> wrote:

> > Is there any reason the first portion of the test name is being
> truncated?
>
> There is, although it is slightly embarrassing: We currently only store the
> detailed data including full test case name and platform
> for about a week, for anything older than that the abridged version is the
> best I could find. The data should still be good, though,
> since we hopefully don't have two tests with the same name that are both
> frequently flaky.
>
> In particular, the ResourceStatistics refers to the
> 'MesosContainerizerSlaveRecoveryTest.ResourceStatistics' test tracked
> in MESOS-5048.
>

Re: Mesos Flakiness Statistics

Posted by Benno Evers <be...@mesosphere.com>.
> Is there any reason the first portion of the test name is being truncated?

There is, although it is slightly embarrassing: we currently only store the
detailed data, including the full test case name and platform, for about a
week; for anything older than that, the abridged version is the best I could
find. The data should still be good, though, since we hopefully don't have
two tests with the same name that are both frequently flaky.

In particular, ResourceStatistics refers to the
'MesosContainerizerSlaveRecoveryTest.ResourceStatistics' test tracked
in MESOS-5048.
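
To see how often an abridged name is actually ambiguous, one could scan the test sources for short names that occur under more than one fixture. The following is a minimal sketch, not part of our CI tooling; the function name `ambiguous_short_names` and the regular expression are assumptions, and the regex only covers the common googletest macros.

```python
import re
from collections import defaultdict

# Matches googletest definitions such as `TEST_F(Fixture, Name)`, including
# definitions where the test name is wrapped onto the next line.
TEST_MACRO = re.compile(
    r"\b(?:TEST|TEST_F|TEST_P|TYPED_TEST)\s*\(\s*(\w+)\s*,\s*(\w+)\s*\)")


def ambiguous_short_names(sources):
    """Map short test names to their fixtures, for names used more than once.

    `sources` is an iterable of C++ source texts.
    """
    fixtures = defaultdict(set)
    for text in sources:
        for fixture, name in TEST_MACRO.findall(text):
            fixtures[name].add(fixture)
    # Keep only short names that appear under at least two fixtures.
    return {name: sorted(f) for name, f in fixtures.items() if len(f) > 1}


if __name__ == "__main__":
    sample = [
        "TEST_F(ROOT_XFS_QuotaTest, ResourceStatistics)",
        "TEST_F(MesosContainerizerSlaveRecoveryTest,\n    ResourceStatistics)",
        "TEST_F(DiskQuotaTest, ResourceStatistics)",
    ]
    print(ambiguous_short_names(sample))
```

To run it against a checkout, the contents of the `.cpp` files under `src/tests` could be passed in as `sources`.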

On Fri, Oct 12, 2018 at 7:03 PM Benjamin Mahler <bm...@apache.org> wrote:

> Thanks for sending this Benno! I for one would love to see more regular
> communication about the state of CI, especially so that I know how I can
> help fix tests (right now I don't know which flaky tests are in areas I am
> maintaining).
>
> Is there any reason the first portion of the test name is being truncated?
> For example, ResourceStatistics matches several tests:
>
> $ grep -R ' ResourceStatistics)' src/tests
> src/tests/containerizer/xfs_quota_tests.cpp:TEST_F(ROOT_XFS_QuotaTest, ResourceStatistics)
> src/tests/slave_recovery_tests.cpp:TEST_F(MesosContainerizerSlaveRecoveryTest, ResourceStatistics)
> src/tests/disk_quota_tests.cpp:TEST_F(DiskQuotaTest, ResourceStatistics)
>
> Did we actually fix the flaky tests or did we disable them? I see only 22
> disabled tests, which is better than I expected, but I hope there's good
> tracking on getting these un-disabled again:
>
> $ grep -R DISABLED src/tests | grep -v DISABLED_ON_WINDOWS | \
>     grep -v NestedQuota | grep -v ChildRole | grep -v NestedRoles | \
>     grep -v environment.cpp | wc -l
>       22
>


-- 
Benno Evers
Software Engineer, Mesosphere

Re: Mesos Flakiness Statistics

Posted by Benjamin Mahler <bm...@apache.org>.
Thanks for sending this, Benno! I for one would love to see more regular
communication about the state of CI, especially so that I know how I can
help fix tests (right now I don't know which flaky tests are in areas I am
maintaining).

Is there any reason the first portion of the test name is being truncated?
For example, ResourceStatistics matches several tests:

$ grep -R ' ResourceStatistics)' src/tests
src/tests/containerizer/xfs_quota_tests.cpp:TEST_F(ROOT_XFS_QuotaTest, ResourceStatistics)
src/tests/slave_recovery_tests.cpp:TEST_F(MesosContainerizerSlaveRecoveryTest, ResourceStatistics)
src/tests/disk_quota_tests.cpp:TEST_F(DiskQuotaTest, ResourceStatistics)

Did we actually fix the flaky tests or did we disable them? I see only 22
disabled tests, which is better than I expected, but I hope there's good
tracking on getting these un-disabled again:

$ grep -R DISABLED src/tests | grep -v DISABLED_ON_WINDOWS | \
    grep -v NestedQuota | grep -v ChildRole | grep -v NestedRoles | \
    grep -v environment.cpp | wc -l
      22
