Posted to dev@hbase.apache.org by Stack <st...@duboce.net> on 2015/11/05 01:23:34 UTC

Re: On our unit tests...

Since I wrote the below, we've figured out who the surefire-killer was
[HBASE-14589]. 9 of the last 10 1.2 builds passed (even though blue builds
are harder to achieve now since they are a compound of a jdk 1.7 and a jdk
1.8 run). 1.3 is failing on a few tests that seem legitimately flakey; I'm
looking into them. Trunk is settling down after being made into a jdk7/8
matrix; it should stabilize soon.

Repeating my petition from below, can we start putting our trust back in
Apache builds and start relying on them again? They found a flakey test at
the end of last week, soon after it went in, because builds mostly pass now
so the flakey shone through. They can find more if we all make the effort to
keep them blue. In particular, can we end the passes-locally-for-me practice,
since tests that go zombie or hang usually run fine on boxes where there is
no contention?

Thanks,
St.Ack



On Fri, Oct 23, 2015 at 2:54 PM, Stack <st...@duboce.net> wrote:

> A few of us have been doing cleanup over the last month or so (see
> HBASE-14420). As a project, we had let our unit test suite go to seed. It
> was an anthology of mysterious crashes, zombies and flakes.
>
> We are not done yet, but tests are mostly stable again, with patch builds
> passing close to 100% of the time as long as the patch is good, and trunk
> and branch-1/branch-1.2 tending back toward being blue always. Hanging
> tests have been fixed and/or disabled, to be put back after scrubbing.
> Mysterious surefire crashes/timeouts have been addressed by purging a
> problematic test set that we intend to re-add after tuneup and fix. There
> are still a few flakies in the mix.
>
> This is a petition that we go out of our way going forward to keep OUR
> test suite blue. We'll all be more productive if we can keep it this way.
> Patches will land faster because there'll be less friction getting them in
> (landing big patches was taking me a week before starting in on this
> effort). We'll catch a slew of problems before commit. New devs won't be
> confounded by mysterious unrelated test fails. There'll be no need to keep
> up an arcane knowledge of 'known flakies' or hanging tests, nor to expend
> extra effort and resources on 'look-it-works-locally-for-me' test runs.
>
> St.Ack
>
> Below are some further notes for those interested in build and work done
> to our test rig recently; ugly detail is over in HBASE-14420.
>
> Until an alternative shows up, our Apache Jenkins needs to run blue always
> if we want to do community development. True, Apache Jenkins is a trying
> environment in which to run tests, but it is shared, public, and I have yet
> to come across a hang or failure that was Apache-Jenkins-only; the only
> difference I've seen is that the incidence of hangs and flakies is higher
> on Apache.
>
> The test-patch.sh script had some hacking done to it, mostly removing code
> that was finding and killing zombies. We were reporting ANY concurrent
> build as a zombie, even those that were not hbase tests, and killing them
> in the belief that they were leftovers from previous runs (the script had a
> few different techniques for finding and killing adjacent processes).
> This made some sense when we were supposed to be the only test running on
> the box but this has not been true for a long time. Killing was
> papering-over the fact that we were leaving zombies after us.
>
> The Jenkins build configuration also had zombie code from test-patch.sh in
> it (still does -- a TODO). Builds now dump out the test machine load and a
> listing of what else is running on the box at test start to give a sense of
> how loaded the test box is.
>
> I feel particularly bad for the new contributors. They have it hard enough
> already checking out a fat project with a slow build system with hours of
> tests to run to verify changes. Let's spare them the added barrier of a
> confounding experience when their nice patch throws up a mysterious Jenkins
> failure on submit.
>

Re: On our unit tests...

Posted by Josh Elser <jo...@gmail.com>.
Huge kudos to you, Stack, for making the time to run these down.

As a contributor, I'm very moved by the thought of treating what Jenkins 
reports as truth.


Re: On our unit tests...

Posted by Andrew Purtell <an...@gmail.com>.
> In particular, can we end the passes-locally-for-me practice 

+1

Although this depends on the sanity and stability of precommit builds. We (at least I) resorted to posting locally sourced "proof" of clean test suite runs to make forward progress in the limited amount of time I had to work on a particular issue. Anyway, let's give it a shot with renewed confidence. 

