Posted to dev@mesos.apache.org by Alex Rukletsov <al...@mesosphere.com> on 2017/10/23 22:54:10 UTC

On the current CI state

Folks,

the state of our CI (both the Apache CI and the internal one we have at
Mesosphere) has recently degraded to the point where people no longer look
at its failures. This defeats the primary purpose of a CI: to produce a
reliable signal when a change breaks something.

You might have seen a bunch of commits fixing flaky tests and bugs over the
past two weeks; this is the beginning of our effort to bring the CI back to
the green state. To track the effort, there is a swim lane on our tech debt
board [1] and a cumulative flow diagram [2]. I believe some of the older
tickets are no longer relevant; I will clean them up once I have a better
sense of the actual state.

If you would like to help, watch out for new flakiness that new changes
might introduce. Apache CI apparently has a quirk where a test run can pause
for 15+ seconds, leading to arbitrary test failures. These are false
positives, but the pattern is easily recognizable in the logs.

We also have a dedicated channel in the Apache Mesos Slack: #ci-back-to-green

If you would like to participate, here is the list of the biggest offenders
that have not been triaged yet: MESOS-7519, MESOS-7082, MESOS-7434, MESOS-7512,
MESOS-7742, MESOS-7028, MESOS-7425, MESOS-7106, MESOS-7337, MESOS-7273,
MESOS-6724, MESOS-8112, MESOS-6949, MESOS-8000, MESOS-8047

Alex.

[1]
https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=151&view=detail&selectedIssue=MESOS-8005
[2]
https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=204&view=reporting&chart=cumulativeFlowDiagram&swimlane=501&column=774&column=775&column=776&days=7

Re: On the current CI state

Posted by Benjamin Mahler <bm...@apache.org>.
Thanks Alex!

I would also like to restate the importance of everyone subscribing to the
builds@ list and helping triage the build failure emails. In particular, if
you find an existing ticket, reply with it so that others don't have to look
into it. If there's no ticket, file one, capture the logs of the bad run (and
ideally of a good run too), and reply with the new ticket. This in itself is
a big help!
