Posted to dev@zookeeper.apache.org by Raúl Gutiérrez Segalés <rg...@itevenworks.net> on 2015/05/03 19:28:12 UTC

making CI a more pleasant experience

Hi all,

This has probably come up before, but do we have any thoughts on making CI
better? Is the problem Jenkins? Is it flaky tests? Bad CI workers? All of
the above?

I see we waste loads of time on trivial patches (or ones unrelated to the
actual failures) triggering failed builds all the time. I'd like to spend
some time improving our experience here, but would love some
pointers/thoughts.

Ideas?


Cheers,
-rgs

Re: making CI a more pleasant experience

Posted by Raúl Gutiérrez Segalés <rg...@itevenworks.net>.
On 3 May 2015 at 20:53, Patrick Hunt <ph...@apache.org> wrote:
(...)

> Trunk tests seem pretty stable - where I do see issues recently it's mostly
> a single test that seems to be intermittently failing:
> ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig
> I see someone opened a jira then closed it; perhaps it should be re-opened:
> https://issues.apache.org/jira/browse/ZOOKEEPER-2080
>

I'll re-open it; it just failed for me:

https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/2671/testReport/


-rgs

Re: making CI a more pleasant experience

Posted by Patrick Hunt <ph...@apache.org>.
Hongchao recently identified a long-standing issue in sync that originally
looked like it might be a flaky test. While flaky tests are commonly a
test/host issue, it's also important to review the failures carefully lest
we ignore real problems. I've found that reviewing the Jenkins results from
past runs (I used to try to do the analysis manually every quarter or so)
would identify some low-hanging fruit. I would then create jiras to
triage/fix. I've found that to be the most successful approach.

Sounds like we'd need to identify what the current pain points are, then
work up a solution. Introducing changes (such as the Hadoop improvements
mentioned) might be for the good, but might not really solve our core
problem(s).

Looking through a number of the recent precommit failures, most seem like
real issues to me, e.g. the author didn't include tests for the change.
Some are failed tests that might be flaky; I didn't do a full analysis.
Perhaps a tool to help with the analysis would be useful?

Trunk tests seem pretty stable - where I do see issues recently it's mostly
a single test that seems to be intermittently failing:
ReconfigRecoveryTest.testCurrentObserverIsParticipantInNewConfig
I see someone opened a jira then closed it; perhaps it should be re-opened:
https://issues.apache.org/jira/browse/ZOOKEEPER-2080

Patrick

On Sun, May 3, 2015 at 12:53 PM, Chris Nauroth <cn...@hortonworks.com>
wrote:

> Hi Raúl,
>
> Thanks for starting this thread.  Flaky CI is a big challenge for us in
> Hadoop too.  I can speak to some things that we've done to try to improve
> it.  We're still struggling, but some of these have helped improve the
> situation.
>
> We recently rewrote our test-patch.sh with a lot of nice new
> functionality.  (Credit goes to Allen Wittenauer.)
>
> https://issues.apache.org/jira/browse/HADOOP-11746
>
>
> The release notes field in the jira describes the new functionality.
> There is a lot to take in there, so here is a summary of my personal
> favorites:
>
> 1. It can run against any branch of the codebase, not just trunk, by
> following a simple naming convention that includes the branch name when
> uploading your patch.
> 2. It has some smarts to try to minimize execution time.  For example, if
> the patch only changes shell scripts or documentation, then it assumes
> there is no need to run JUnit tests.
> 3. It eliminated race conditions during concurrent test-patch.sh runs
> caused by storing state in shared local files.  The ZooKeeper
> test-patch.sh appears to be a fork of an older version of the Hadoop
> script, so maybe the ZooKeeper script is subject to similar race
> conditions.
> 4. It has some hooks for pluggability, making it easier to add more custom
> checks.
>
> We could explore porting HADOOP-11746 over to ZooKeeper.  It won't be as
> simple as copying it right over to the ZooKeeper repo, because some of the
> logic is specific to the Hadoop repo.
>
> We've also had trouble with flaky tests.  Unfortunately, these often
> require a ton of engineering time to fix.  The typical root causes I've
> seen are:
>
> 1. Tests start servers bound to hard-coded port numbers, so if multiple
> test runs execute concurrently, then one of them will get a bind
> exception.  The solution is always to bind to an ephemeral port in test
> code.
> 2. Tests do not do proper resource cleanup.  This can manifest as file
> descriptor leaks leading to hitting the open file descriptor limit, thread
> leaks, or 2 background threads trying to do the same job and interfering
> with one another.  File descriptor leaks are particularly nasty for test
> runs on Windows, where the default file locking behavior can prevent
> subsequent tests from using a working directory for test data.  The
> solution is to track these down and use try-finally, try-with-resources,
> JUnit @After etc. to ensure clean-up.
> 3. Tests are non-deterministic, such as by hard-coding a sleep time to
> wait for an asynchronous action to complete.  The solutions usually
> involve providing hooks into lower-layer logic, such as to receive a
> callback from the asynchronous action, so that the test can be
> deterministic.
> 4. Tests hard-code a file path separator to '/', which doesn't work
> correctly on Windows.  The code fixes for this usually are obvious once
> you spot the problem.
>
> It can take a long time to track these down and fix them.  To help people
> iterate faster, I've proposed another test-patch.sh enhancement that would
> allow the contributor to request that only specific tests are run on a
> patch.
>
> https://issues.apache.org/jira/browse/HADOOP-11895
>
> This would help engineers get quicker feedback, especially in the event
> that a test failure only repros on the Jenkins hosts.
>
> In my experience (again primarily on Hadoop), it's much more often that we
> see flaky tests rather than bad Jenkins hosts.  When there is a bad host,
> it's usually pretty obvious.  Each Jenkins run reports the host that ran
> the job, so we can identify a trend of a particular problem happening on a
> particular host.
>
> --Chris Nauroth
>
>
>
>
> On 5/3/15, 10:28 AM, "Raúl Gutiérrez Segalés" <rg...@itevenworks.net> wrote:
>
> >Hi all,
> >
> >This has probably come up before, but do we have any thoughts on making CI
> >better? Is the problem Jenkins? Is it flaky tests? Bad CI workers? All of
> >the above?
> >
> >I see we waste loads of time on trivial patches (or ones unrelated to the
> >actual failures) triggering failed builds all the time. I'd like to spend
> >some time improving our experience here, but would love some
> >pointers/thoughts.
> >
> >Ideas?
> >
> >
> >Cheers,
> >-rgs
>
>

Re: making CI a more pleasant experience

Posted by Patrick Hunt <ph...@apache.org>.
That's a good idea. I would think that PortAssignment could be modified to
give out truly unique ports (rather than starting from a given seed) the
way some other projects do. Agreed that our current mechanism is naive -
the problem you mention (multiple JVMs running on the same host, resulting
in port conflicts in the tests) is an annoying one that I've run into
myself.
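
For illustration, a minimal sketch of what an OS-assigned variant could
look like (the class name is hypothetical, and there is a small race
window between probing for the port and the caller actually binding it):

import java.io.IOException;
import java.net.ServerSocket;

public final class EphemeralPorts {
    private EphemeralPorts() {}

    public static int unique() throws IOException {
        // Binding to port 0 asks the OS for a currently free ephemeral
        // port, avoiding the per-JVM counter that can collide across
        // processes.
        try (ServerSocket probe = new ServerSocket(0)) {
            return probe.getLocalPort();
        }
    }
}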

Patrick

On Mon, May 4, 2015 at 2:13 PM, Chris Nauroth <cn...@hortonworks.com>
wrote:

> Another possibility for speeding up pre-commit runs might be to execute
> JUnit tests in parallel.  In build.xml, we'd specify the threads attribute
> on the <junit> task to start multiple JUnit processes.  This requires Ant
> 1.9.4.
>
> https://ant.apache.org/manual/Tasks/junit.html
>
>
> It's not as simple as just deploying Ant 1.9.4 and turning it on though.
> We'd need to make sure that test suites are capable of running
> concurrently.  They'd need isolated working directories, port numbers,
> etc.  Right now, the PortAssignment class hands out port numbers that are
> guaranteed to be unique within the same process, but not across multiple
> processes.  I just tried it locally, and many tests failed on bind
> exceptions.  (This is actually a potential problem already if multiple
> ZooKeeper pre-commit runs execute concurrently on the same Jenkins host.)
>
> I'll investigate if we can change tests to bind to ephemeral ports, which
> would give us uniqueness across multiple processes.
>
> --Chris Nauroth
>
>
>
>
> On 5/3/15, 10:10 PM, "Patrick Hunt" <ph...@apache.org> wrote:
>
> >I've pushed back on use of sleep in non-deterministic ways in the past. I
> >think we do a reasonable job there; just grepping for sleep doesn't tell
> >the whole story.
> >
> >Where you run into issues is when you
> >
> >do x
> >sleep(500)
> >check x was success
> >
> >Most of our use of sleep has migrated to
> >
> >do x
> >for (1 to 120) // or check elapsed time and cap at some large number
> >  sleep(500) // small enough that you don't waste time waiting
> >             // unnecessarily, but not so short that you spin
> >  check x was success
> >
> >unless we're able to make do without a time bound at all - sometimes we
> >migrate to a latch or something.
> >
> >It's been a while since I reviewed the tests; new code might have added
> >some bad checks again. It's a tough one to stamp out entirely.
> >
> >Re tests taking too long, I can't seem to find the jira, but iirc Henry
> >had
> >created a jira around reducing the tick time for tests - that
> >significantly
> >reduced the setup time for quorum-based tests - a big part of overall
> >overhead. We should probably categorize our tests and run a subset outside
> >of a nightly "full test run".
> >
> >Patrick
> >
> >On Sun, May 3, 2015 at 9:49 PM, Raúl Gutiérrez Segalés
> ><rg...@itevenworks.net>
> >wrote:
> >
> >> Hi,
> >>
> >> On 3 May 2015 at 12:53, Chris Nauroth <cn...@hortonworks.com> wrote:
> >>
> >> > (....)
> >> > 3. Tests are non-deterministic, such as by hard-coding a sleep time to
> >> > wait for an asynchronous action to complete.  The solutions usually
> >> > involve providing hooks into lower-layer logic, such as to receive a
> >> > callback from the asynchronous action, so that the test can be
> >> > deterministic.
> >> >
> >>
> >> Indeed:
> >>
> >> ~/src/zookeeper-svn/src/java/test/org/apache/zookeeper (master) ✔
> >> git grep -i 'sleep(' | wc -l
> >> 91
> >>
> >> Making runs shorter would be very helpful as well. Currently it just
> >> takes too long.
> >>
> >> Also, adding to what Patrick said, I'll take a closer look at the runs
> >> reported at:
> >>
> >> https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/
> >>
> >> to have a better grasp of what's going on. Thanks!
> >>
> >>
> >> -rgs
> >>
>
>

Re: making CI a more pleasant experience

Posted by Chris Nauroth <cn...@hortonworks.com>.
Another possibility for speeding up pre-commit runs might be to execute
JUnit tests in parallel.  In build.xml, we'd specify the threads attribute
on the <junit> task to start multiple JUnit processes.  This requires Ant
1.9.4.

https://ant.apache.org/manual/Tasks/junit.html


It's not as simple as just deploying Ant 1.9.4 and turning it on though.
We'd need to make sure that test suites are capable of running
concurrently.  They'd need isolated working directories, port numbers,
etc.  Right now, the PortAssignment class hands out port numbers that are
guaranteed to be unique within the same process, but not across multiple
processes.  I just tried it locally, and many tests failed on bind
exceptions.  (This is actually a potential problem already if multiple
ZooKeeper pre-commit runs execute concurrently on the same Jenkins host.)

I'll investigate if we can change tests to bind to ephemeral ports, which
would give us uniqueness across multiple processes.
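
On the isolated working directories point, JUnit 4's TemporaryFolder rule
is one option; a minimal sketch (class and test names made up for
illustration):

import java.io.File;
import org.junit.Rule;
import org.junit.Test;
import org.junit.rules.TemporaryFolder;

public class IsolatedDataDirTest {
    // A fresh directory is created before each test and deleted afterwards,
    // so concurrent suites never share a working directory.
    @Rule
    public TemporaryFolder tmp = new TemporaryFolder();

    @Test
    public void testUsesPrivateDataDir() throws Exception {
        File dataDir = tmp.newFolder("data");
        // ... point the server under test at dataDir instead of a fixed path ...
    }
}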

--Chris Nauroth




On 5/3/15, 10:10 PM, "Patrick Hunt" <ph...@apache.org> wrote:

>I've pushed back on use of sleep in non-deterministic ways in the past. I
>think we do a reasonable job there; just grepping for sleep doesn't tell
>the whole story.
>
>Where you run into issues is when you
>
>do x
>sleep(500)
>check x was success
>
>Most of our use of sleep has migrated to
>
>do x
>for (1 to 120) // or check elapsed time and cap at some large number
>  sleep(500) // small enough that you don't waste time waiting
>             // unnecessarily, but not so short that you spin
>  check x was success
>
>unless we're able to make do without a time bound at all - sometimes we
>migrate to a latch or something.
>
>It's been a while since I reviewed the tests; new code might have added
>some bad checks again. It's a tough one to stamp out entirely.
>
>Re tests taking too long, I can't seem to find the jira, but iirc Henry
>had
>created a jira around reducing the tick time for tests - that
>significantly
>reduced the setup time for quorum-based tests - a big part of overall
>overhead. We should probably categorize our tests and run a subset outside
>of a nightly "full test run".
>
>Patrick
>
>On Sun, May 3, 2015 at 9:49 PM, Raúl Gutiérrez Segalés
><rg...@itevenworks.net>
>wrote:
>
>> Hi,
>>
>> On 3 May 2015 at 12:53, Chris Nauroth <cn...@hortonworks.com> wrote:
>>
>> > (....)
>> > 3. Tests are non-deterministic, such as by hard-coding a sleep time to
>> > wait for an asynchronous action to complete.  The solutions usually
>> > involve providing hooks into lower-layer logic, such as to receive a
>> > callback from the asynchronous action, so that the test can be
>> > deterministic.
>> >
>>
>> Indeed:
>>
>> ~/src/zookeeper-svn/src/java/test/org/apache/zookeeper (master) ✔
>> git grep -i 'sleep(' | wc -l
>> 91
>>
>> Making runs shorter would be very helpful as well. Currently it just
>> takes too long.
>>
>> Also, adding to what Patrick said, I'll take a closer look at the runs
>> reported at:
>>
>> https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/
>>
>> to have a better grasp of what's going on. Thanks!
>>
>>
>> -rgs
>>


Re: making CI a more pleasant experience

Posted by Patrick Hunt <ph...@apache.org>.
I've pushed back on use of sleep in non-deterministic ways in the past. I
think we do a reasonable job there; just grepping for sleep doesn't tell
the whole story.

Where you run into issues is when you

do x
sleep(500)
check x was success

Most of our use of sleep has migrated to

do x
for (1 to 120) // or check elapsed time and cap at some large number
  sleep(500) // small enough that you don't waste time waiting
             // unnecessarily, but not so short that you spin
  check x was success

unless we're able to make do without a time bound at all - sometimes we
migrate to a latch or something.
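
In Java, the bounded-polling variant might look like this sketch
(conditionHolds() is a hypothetical stand-in for the "check x was
success" step):

import org.junit.Assert;

static void waitForX() throws InterruptedException {
    long deadline = System.currentTimeMillis() + 60000; // cap total wait at 60s
    while (System.currentTimeMillis() < deadline) {
        if (conditionHolds()) {
            return; // success - stop polling immediately
        }
        Thread.sleep(500); // small enough to react quickly, not so small we spin
    }
    Assert.fail("x did not succeed within the time bound");
}

And when the code under test can expose a completion callback, a
java.util.concurrent.CountDownLatch with await(timeout, unit) removes the
polling entirely.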

It's been a while since I reviewed the tests; new code might have added
some bad checks again. It's a tough one to stamp out entirely.

Re tests taking too long, I can't seem to find the jira, but iirc Henry had
created a jira around reducing the tick time for tests - that significantly
reduced the setup time for quorum-based tests - a big part of overall
overhead. We should probably categorize our tests and run a subset outside
of a nightly "full test run".

Patrick

On Sun, May 3, 2015 at 9:49 PM, Raúl Gutiérrez Segalés <rg...@itevenworks.net>
wrote:

> Hi,
>
> On 3 May 2015 at 12:53, Chris Nauroth <cn...@hortonworks.com> wrote:
>
> > (....)
> > 3. Tests are non-deterministic, such as by hard-coding a sleep time to
> > wait for an asynchronous action to complete.  The solutions usually
> > involve providing hooks into lower-layer logic, such as to receive a
> > callback from the asynchronous action, so that the test can be
> > deterministic.
> >
>
> Indeed:
>
> ~/src/zookeeper-svn/src/java/test/org/apache/zookeeper (master) ✔ git grep
> -i 'sleep(' | wc -l
> 91
>
> Making runs shorter would be very helpful as well. Currently it just takes
> too long.
>
> Also, adding to what Patrick said, I'll take a closer look at the runs
> reported at:
>
> https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/
>
> to have a better grasp of what's going on. Thanks!
>
>
> -rgs
>

Re: making CI a more pleasant experience

Posted by Raúl Gutiérrez Segalés <rg...@itevenworks.net>.
Hi,

On 3 May 2015 at 12:53, Chris Nauroth <cn...@hortonworks.com> wrote:

> (....)
> 3. Tests are non-deterministic, such as by hard-coding a sleep time to
> wait for an asynchronous action to complete.  The solutions usually
> involve providing hooks into lower-layer logic, such as to receive a
> callback from the asynchronous action, so that the test can be
> deterministic.
>

Indeed:

~/src/zookeeper-svn/src/java/test/org/apache/zookeeper (master) ✔ git grep
-i 'sleep(' | wc -l
91

Making runs shorter would be very helpful as well. Currently it just takes
too long.

Also, adding to what Patrick said, I'll take a closer look at the runs
reported at:

https://builds.apache.org/job/PreCommit-ZOOKEEPER-Build/

to have a better grasp of what's going on. Thanks!


-rgs

Re: making CI a more pleasant experience

Posted by Chris Nauroth <cn...@hortonworks.com>.
Hi Raúl,

Thanks for starting this thread.  Flaky CI is a big challenge for us in
Hadoop too.  I can speak to some things that we've done to try to improve
it.  We're still struggling, but some of these have helped improve the
situation.

We recently rewrote our test-patch.sh with a lot of nice new
functionality.  (Credit goes to Allen Wittenauer.)

https://issues.apache.org/jira/browse/HADOOP-11746


The release notes field in the jira describes the new functionality.
There is a lot to take in there, so here is a summary of my personal
favorites:

1. It can run against any branch of the codebase, not just trunk, by
following a simple naming convention that includes the branch name when
uploading your patch.
2. It has some smarts to try to minimize execution time.  For example, if
the patch only changes shell scripts or documentation, then it assumes
there is no need to run JUnit tests.
3. It eliminated race conditions during concurrent test-patch.sh runs
caused by storing state in shared local files.  The ZooKeeper
test-patch.sh appears to be a fork of an older version of the Hadoop
script, so maybe the ZooKeeper script is subject to similar race
conditions.
4. It has some hooks for pluggability, making it easier to add more custom
checks.

We could explore porting HADOOP-11746 over to ZooKeeper.  It won't be as
simple as copying it right over to the ZooKeeper repo, because some of the
logic is specific to the Hadoop repo.

We've also had trouble with flaky tests.  Unfortunately, these often
require a ton of engineering time to fix.  The typical root causes I've
seen are:

1. Tests start servers bound to hard-coded port numbers, so if multiple
test runs execute concurrently, then one of them will get a bind
exception.  The solution is always to bind to an ephemeral port in test
code (see the sketch after this list).
2. Tests do not do proper resource cleanup.  This can manifest as file
descriptor leaks leading to hitting the open file descriptor limit, thread
leaks, or 2 background threads trying to do the same job and interfering
with one another.  File descriptor leaks are particularly nasty for test
runs on Windows, where the default file locking behavior can prevent
subsequent tests from using a working directory for test data.  The
solution is to track these down and use try-finally, try-with-resources,
JUnit @After etc. to ensure clean-up.
3. Tests are non-deterministic, such as by hard-coding a sleep time to
wait for an asynchronous action to complete.  The solutions usually
involve providing hooks into lower-layer logic, such as to receive a
callback from the asynchronous action, so that the test can be
deterministic.
4. Tests hard-code a file path separator to '/', which doesn't work
correctly on Windows.  The code fixes for this usually are obvious once
you spot the problem.
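
To make root causes 1 and 2 concrete, here is a minimal JUnit 4 sketch
(class and test names made up for illustration): binding to port 0 lets
the OS assign a free ephemeral port, and the @After method guarantees
cleanup even when the test fails.

import java.net.ServerSocket;
import org.junit.After;
import org.junit.Assert;
import org.junit.Test;

public class EphemeralPortTest {
    private ServerSocket server;

    @Test
    public void testBindsToEphemeralPort() throws Exception {
        server = new ServerSocket(0);     // 0 = let the OS pick a free port
        int port = server.getLocalPort(); // the port actually assigned
        Assert.assertTrue(port > 0);      // hand this port to the code under test
    }

    @After
    public void cleanUp() throws Exception {
        if (server != null) {
            server.close(); // no leaked file descriptors between tests
        }
    }
}

For root cause 4, building paths with java.nio.file.Paths.get("dir",
"file") or File.separator instead of a literal '/' avoids the Windows
breakage.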

It can take a long time to track these down and fix them.  To help people
iterate faster, I've proposed another test-patch.sh enhancement that would
allow the contributor to request that only specific tests are run on a
patch.

https://issues.apache.org/jira/browse/HADOOP-11895

This would help engineers get quicker feedback, especially in the event
that a test failure only repros on the Jenkins hosts.

In my experience (again primarily on Hadoop), it's much more often that we
see flaky tests rather than bad Jenkins hosts.  When there is a bad host,
it's usually pretty obvious.  Each Jenkins run reports the host that ran
the job, so we can identify a trend of a particular problem happening on a
particular host.

--Chris Nauroth




On 5/3/15, 10:28 AM, "Raúl Gutiérrez Segalés" <rg...@itevenworks.net> wrote:

>Hi all,
>
>This has probably come up before, but do we have any thoughts on making CI
>better? Is the problem Jenkins? Is it flaky tests? Bad CI workers? All of
>the above?
>
>I see we waste loads of time on trivial patches (or ones unrelated to the
>actual failures) triggering failed builds all the time. I'd like to spend
>some time improving our experience here, but would love some
>pointers/thoughts.
>
>Ideas?
>
>
>Cheers,
>-rgs