Posted to dev@hbase.apache.org by "张铎 (Duo Zhang)" <pa...@gmail.com> on 2020/03/04 00:56:14 UTC

What is the situation for our UTs now?

I see that lots of 'flaky tests' related issues have been resolved recently,
but the situation seems to be getting worse? For branch-2.2 the flaky page is
fine, but for master it is a total mess...

https://builds.apache.org/job/HBASE-Find-Flaky-Tests/job/branch-2.2/lastSuccessfulBuild/artifact/dashboard.html


https://builds.apache.org/job/HBASE-Find-Flaky-Tests/job/master/lastSuccessfulBuild/artifact/dashboard.html

Lots of UTs are in trouble, which makes it really hard to pass the precommit
check, which in turn makes it really hard to contribute to the project...

We need to fix this soon...

Re: What is the situation for our UTs now?

Posted by Stack <st...@duboce.net>.
On Wed, Mar 4, 2020 at 3:34 AM 张铎(Duo Zhang) <pa...@gmail.com> wrote:

> Due to the resource limit I do not think it is a good idea to increase the
> forkCount...
>
>
Which fork count are you referring to? After doing the math, the fork count is
about what it always was; the difference is that we now size it off the machine
CPU count. This should let the config adapt somewhat to the hardware the tests
are being run on.

There is also the -T argument, which I tried to up on general builds, but it
was causing too many failures so I reverted; on nightlies and patch builds
we are running the default of one Maven thread.

I think you might be referring to the -T I added for re-running flakies: mostly
the second panel in these pages
https://builds.apache.org/view/H-L/view/HBase/job/HBase-Find-Flaky-Tests/job/branch-2/lastSuccessfulBuild/artifact/dashboard.html.
I set it to 0.5C instead of the default of 1 thread. It highlights what happens
under resource contention; i.e. it makes the flakies fail more reliably. I set
it about a week ago. I've been keeping an eye on it but was working elsewhere
on tests. I was hoping to land patches in the next few days that deal with
resource use and that I hoped would put a dent in the current failure lists.
Let me dial this down so we disturb the flakies less and they can hide again
(and see if this makes a difference to the master patch builds).


> FWIW, can we do this on a feature branch and move master and branch-2 back?
>

Which aspect? Fixing flakies? Or re-running the flaky list aggressively? I
don't think we want the former out on a feature branch. The latter would
require infra duplication hackery: not only of the branch nightlies but also
of a complementary flaky re-run job.


>
> See here
>
> https://github.com/apache/hbase/pull/1221
>
> We tried several times and always got a large amount of failed UTs which
> are not related to the patch. And we even excluded hundreds of UTs due to
> the flaky list!
>
>
I've not been tracking master closely. Is anyone? Let me dial down the ferocity
of the flaky re-runs to see if it makes a difference.


> This makes it almost impossible to contribute to the project. Even after
> several tries we get a green result, due to the excluded hundreds of UTs,
> no one know if the patch breaks something.
>
>
Yeah, this is a problem. Let me pay more attention here. Let me take a look
at master branch. Patch builds were doing pretty well up until recently.
S


> Thanks.
>
> Stack <st...@duboce.net> wrote on Wed, Mar 4, 2020 at 2:55 PM:
>
> > Upstream branch-2 and master nightlies don't look too bad currently. There
> > are a few bad runs where there were a bunch of hangs, which makes things
> > look bad. I upped the number of tests we show from 5 to 10 on branch-2 and
> > master, which makes it so a failed test shows longer in the top half of the
> > flakies page -- and more flakies are listed. On the bottom half, I'd upped
> > the ferocity with which we run on GCE to draw out flakies. Needless to say,
> > they fail more often when resources are contended. I might knock the ferocity
> > down in the next day or so but am trying to land some patches that cut down
> > on resource usage and want to see how these do in the flaky runs first.
> >
> > Master I haven't looked at much... looks like branch-2? Branch-2.2 and
> > branch-2.1 look sleepy. Similar amounts of flakies in the nightlies. They
> > don't have the ferocity upped so the lower-half GCE section looks 'better'.
> > I can make them look like branch-2 and master if folks want (smile) but it's
> > probably ok letting the flakies lie in branches that are being bypassed.
> >
> > Generally, I've been working on unit tests with inspiration and help from
> > Mark Miller and Nick. Our tests are in a poor state. They take so long,
> > they don't get run anywhere else other than up on jenkins. They rarely pass,
> > and only then by accident if there is minimal parallelism and jitter. On
> > multi-core machines, they use 1 to 2 cores only -- even if the machine has
> > tens of them.
> >
> > I have been trying to burn down the flakies, make the tests complete
> > successfully in less time with more parallelism, using all of the machine,
> > and make them pass both on jenkins and locally. Of late, have been focused
> > on branch-2 since it is calming down getting ready for a 2.3.0RC0. Having
> > some success but it's a nasty job where it is hard to claim advances
> > because the flakies vary w/ the context in which the tests are run.
> > Hopefully we'll turn a corner on jenkins soon for folks to enjoy.
> >
> > Shout if need more detail.
> > S
> >
> >
> > On Tue, Mar 3, 2020 at 6:00 PM 张铎(Duo Zhang) <pa...@gmail.com> wrote:
> >
> > > But why branch-2.2 and branch-2.1 are still fine?
> > >
> > > Sean Busbey <bu...@apache.org> wrote on Wed, Mar 4, 2020 at 9:24 AM:
> > >
> > > > I agree in principle that excluding 100s of UTs isn't good. But we don't
> > > > really have better options given the state of tests and testing hardware
> > > > currently available to us.
> > > >
> > > > On Tue, Mar 3, 2020, 19:14 张铎(Duo Zhang) <pa...@gmail.com> wrote:
> > > >
> > > > > I think the problem is all UTs are failing randomly...
> > > > >
> > > > > And it is also not a good idea to exclude hundreds of UTs in pre commit?
> > > > >
> > > > > Sean Busbey <bu...@apache.org> wrote on Wed, Mar 4, 2020 at 9:11 AM:
> > > > >
> > > > > > Everything in the flake list should be skipped at precommit time. Is
> > > > > > that not happening?
> > > > > >
> > > > > > Are we keeping a shorter flake window so things are bouncing in and
> > > > > > out of the list?
> > > > > >
> > > > > > On Tue, Mar 3, 2020, 18:56 张铎(Duo Zhang) <pa...@gmail.com> wrote:
> > > > > >
> > > > > > > I see recently there are lots of 'flaky tests' related issues been
> > > > > > > resolved but seems the situation is getting worse? For branch-2.2
> > > > > > > the flaky page is fine, but for master it is totally a mess...
> > > > > > >
> > > > > > > https://builds.apache.org/job/HBASE-Find-Flaky-Tests/job/branch-2.2/lastSuccessfulBuild/artifact/dashboard.html
> > > > > > >
> > > > > > > https://builds.apache.org/job/HBASE-Find-Flaky-Tests/job/master/lastSuccessfulBuild/artifact/dashboard.html
> > > > > > >
> > > > > > > Lots of UTs are in trouble and it makes it really hard to pass the
> > > > > > > pre commit check which means it is really hard to contribute to the
> > > > > > > project...
> > > > > > >
> > > > > > > We need to fix this soon...

Re: What is the situation for our UTs now?

Posted by Guanghao Zhang <zg...@gmail.com>.
>
> I took a look at master branch. It's not in the same state as branch-2. Looking
> at nightlies, it seems a bit worse; I see backup tests failing (we don't
> have this in branch-2).
>
The backup UT failures may be related to HBASE-23912. Let me take a look.


Re: What is the situation for our UTs now?

Posted by Stack <st...@duboce.net>.
On Wed, Mar 4, 2020 at 3:24 PM 张铎(Duo Zhang) <pa...@gmail.com> wrote:

> OK, let's keep an eye on the flaky list of master and branch-2 till this
> weekend.
>
> If it is in a bad state then let's discuss again.
>
>
Agree.

On the rerunning of flakies, I dialed down the ferocity. It was NOT 0.5C as I'd
thought but 1.0C. I made it 0.25C. After the change, I got a blue dot for the
first time in a long time (not something to celebrate, I'd say, since I know
we've not fixed all flakies). Looking at the machine these tests run on,
it's a 16-core with 512G of RAM, so 0.25C is a forkcount of 4. Before I went
messing, it was hardcoded to 3, so close enough. Let me push this change on
master too.
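
The 0.25C arithmetic can be sketched as follows. This is only an
illustration: resolve_fork_count is a hypothetical helper, not part of Maven
or the HBase build, and rounding down with a floor of one fork is an
assumption about how a "C"-suffixed value becomes a whole number.

```python
import os
from typing import Optional


def resolve_fork_count(spec: str, cores: Optional[int] = None) -> int:
    """Turn a fork-count spec such as "3" or "0.25C" into a number of forks.

    A trailing "C" means "multiply by the number of CPU cores".
    Rounding down with a floor of one is an assumption of this sketch.
    """
    if cores is None:
        # Fall back to the local machine when no core count is given.
        cores = os.cpu_count() or 1
    spec = spec.strip()
    if spec.upper().endswith("C"):
        multiplier = float(spec[:-1])
        return max(1, int(multiplier * cores))
    # A plain integer is taken as a hardcoded fork count.
    return int(spec)


# The 16-core machine mentioned above.
print(resolve_fork_count("0.25C", cores=16))  # 4
print(resolve_fork_count("1.0C", cores=16))   # 16
print(resolve_fork_count("3", cores=16))      # 3
```

On the 16-core box, 1.0C would mean 16 forks, against 4 forks at 0.25C and
the old hardcoded 3.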

Would be good to go back to 1.0C at some time so flakies stay in the flaky
list until fixed, but I can work offline first on knocking down the length
of the flaky list before hoisting us back to 1.0C.

I'll work at cutting down the length of the flaky lists over the next day.
Let's see if it helps w/ patch builds.

I took a look at master branch. It's not in the same state as branch-2. Looking
at nightlies, it seems a bit worse; I see backup tests failing (we don't
have this in branch-2).

Thanks,
S



Re: What is the situation for our UTs now?

Posted by "张铎 (Duo Zhang)" <pa...@gmail.com>.
OK, let's keep an eye on the flaky list of master and branch-2 till this
weekend.

If it is in a bad state then let's discuss again.

Stack <st...@duboce.net> wrote on Thu, Mar 5, 2020 at 12:41 AM:

> On Wed, Mar 4, 2020 at 3:42 AM 张铎(Duo Zhang) <pa...@gmail.com> wrote:
>
> > And speak a little more on increasing the forkCount. In fact, the test
> > category is not too rough. The LargeTests just means the test will run a
> > bit long, does not mean it will consume more resources. Maybe the tests
> > just have lots of Thread.sleep so we declare it as LargeTests.
> >
> >
> I've done a few passes on test categorization of late. The notion had
> rotted pretty bad but should be cleaned up now.
>
>
> > What I can see is that, all the replication related tests are flaky now.
> > This is reasonable. In replication tests, usually we have to set up at
> > least two mini clusters, and the replication system itself will make use of
> > lots of threads. So if you run several replication related tests together,
> > it will be easy to overload and cause the UTs to time out or OOM.
> >
> >
> We have at least one test that makes four clusters inside the one JVM.
>
> Yeah, the resource usage in general needs weeding.
>
> Perhaps you are arguing that we just leave the state of tests as it is?
> That we let long tests run in series in case two or more might run together
> and fail because they are profligate in their resource use?
>
I mean increasing the fork count will lead to random test results, as the
test category cannot describe the resource usage clearly. You can run
maybe 20+ lightweight UTs without problem, but if you run 5 tests which each
set up 4 mini clusters, the resources will be exhausted and the tests will
fail, or at least become really slow and then fail...
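
To make that arithmetic concrete, here is a rough sketch, reading the example
as five tests that each set up four mini clusters. concurrent_mini_clusters
is a hypothetical helper, and the per-test cluster counts (zero for a
lightweight test in particular) are illustrative assumptions, not
measurements.

```python
from typing import Iterable


def concurrent_mini_clusters(clusters_per_running_test: Iterable[int]) -> int:
    """Total mini clusters alive at once, given the tests running side by side.

    Each entry is the number of mini clusters one concurrently running test
    starts. The per-test counts used below are illustrative assumptions.
    """
    return sum(clusters_per_running_test)


# Twenty lightweight tests; assume (hypothetically) none starts a mini cluster.
light_load = concurrent_mini_clusters([0] * 20)

# Five heavy tests, each starting four mini clusters.
heavy_load = concurrent_mini_clusters([4] * 5)

print(light_load, heavy_load)  # 0 20
```

Both groups could carry the same test category, yet the second puts twenty
mini clusters on the machine at once with a quarter of the forks.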

>
>
>
> > So, again, let's do this on a feature branch. It is fine to mess things up
> > on a feature branch. You can do everything you want as the intermediate
> > state does not affect others. On master and branch-2 it is another story. I
> > do not think this should be a blocker for 2.3.0 or 3.0.0.
> >
> > See previous note.
>
> Thanks,
> S
>
>
> > Thanks.
> >
> > 张铎(Duo Zhang) <pa...@gmail.com> wrote on Wed, Mar 4, 2020 at 7:34 PM:
> >
> > > Due to the resource limit I do not think it is a good idea to increase
> > the
> > > forkCount...
> > >
> > > FWIW, can we do this on a feature branch and move master and branch-2
> > back?
> > >
> > > See here
> > >
> > > https://github.com/apache/hbase/pull/1221
> > >
> > > We tried several times and always got a large amount of failed UTs
> which
> > > are not related to the patch. And we even excluded hundreds of UTs due
> to
> > > the flaky list!
> > >
> > > This makes it almost impossible to contribute to the project. Even
> after
> > > several tries we get a green result, due to the excluded hundreds of
> UTs,
> > > no one know if the patch breaks something.
> > >
> > > Thanks.
> > >
> > > Stack <st...@duboce.net> wrote on Wed, Mar 4, 2020 at 2:55 PM:
> > >
> > >> Upstream branch-2 and master nightlies don't look too bad currently.
> > There
> > >> are a few bad runs where there were a bunch of hangs which makes
> things
> > >> look bad. I upped the number of tests we show from 5 to 10 on branch-2
> > and
> > >> master which makes it so a failed tests shows longer in the top half
> of
> > >> the
> > >> flakies page -- and more flakies are listed. On the bottom half, I'd
> > upped
> > >> the ferocity with which we run on GCE to draw out flakies. Needless to
> > >> say,
> > >> they fail more often when contended resources. I might knock the
> > ferocity
> > >> down in the next day or so but am trying to land some patches that cut
> > >> down
> > >> on resource usage and want to see how these do in the flakie runs
> first.
> > >>
> > >> Master I haven't looked at much... looks like branch-2?  Branch-2.2
> and
> > >> branch-2.1 look sleepy. Similar amounts of flakies in the nightlies.
> > They
> > >> don't have the ferocity upped so the lower-half GCE section looks
> > >> 'better'.
> > >> I can make them look like branch-2 and master if folks want (smile)
> but
> > >> its
> > >> probably ok letting the flakies lie in branches that are being
> bypassed.
> > >>
> > >> Generally,  I've been working on unit tests with inspiration and help
> > from
> > >> Mark Miller and Nick. Our tests are in a poor state. They take so
> long,
> > >> they don't get run anywhere else other than up on jenkins. They rarely
> > >> pass
> > >> and only then on accident if minimal parallelism and jitter. On
> > multi-core
> > >> machines, they use 1 to 2 cores only -- even if the machine has tens
> of
> > >> them.
> > >>
> > >> I have been trying to burn down the flakies, make the tests complete
> > >> successfully in less time with more parallelism, using all of the
> > machine,
> > >> and make them pass both on jenkins and locally. Of late, have been
> > focused
> > >> on branch-2 since it is calming down getting ready for a 2.3.0RC0.
> > Having
> > >> some success but its a  nasty job where it is hard to claim advances
> > >> because the flakies vary w/ the context in which the tests are run.
> > >> Hopefully we'll turn a corner on jenkins soon for folks to enjoy.
> > >>
> > >> Shout if need more detail.
> > >> S
> > >>
> > >>
> > >> On Tue, Mar 3, 2020 at 6:00 PM 张铎(Duo Zhang) <pa...@gmail.com>
> > >> wrote:
> > >>
> > >> > But why branch-2.2 and branch-2.1 are still fine?
> > >> >
> > >> > Sean Busbey <bu...@apache.org> 于2020年3月4日周三 上午9:24写道:
> > >> >
> > >> > > I agree in principle that excluding 100s of UTs isn't good. But we
> > >> don't
> > >> > > really have better options given the state of tests and testing
> > >> hardware
> > >> > > currently available to us.
> > >> > >
> > >> > > On Tue, Mar 3, 2020, 19:14 张铎(Duo Zhang) <pa...@gmail.com>
> > >> wrote:
> > >> > >
> > >> > > > I think the problem is all UTs are failing randomly...
> > >> > > >
> > >> > > > And it is also not a good idea to exclude hundreds of UTs in pre
> > >> > commit?
> > >> > > >
> > >> > > > Sean Busbey <bu...@apache.org> 于2020年3月4日周三 上午9:11写道:
> > >> > > >
> > >> > > > > Everything in the flake list should be skipped at precommit
> > time.
> > >> Is
> > >> > > that
> > >> > > > > not happening?
> > >> > > > >
> > >> > > > > Are we keeping a shorter flake window so things are bouncing
> in
> > >> and
> > >> > out
> > >> > > > of
> > >> > > > > the list?
> > >> > > > >
> > >> > > > > On Tue, Mar 3, 2020, 18:56 张铎(Duo Zhang) <
> palomino219@gmail.com
> > >
> > >> > > wrote:
> > >> > > > >
> > >> > > > > > I see recently there are lots of 'flaky tests' related
> issues
> > >> been
> > >> > > > > resolved
> > >> > > > > > but seems the situation is getting worse? For branch-2.2 the
> > >> flaky
> > >> > > page
> > >> > > > > is
> > >> > > > > > fine, but for master it is totally a mess...
> > >> > > > > >
> > >> > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> https://builds.apache.org/job/HBASE-Find-Flaky-Tests/job/branch-2.2/lastSuccessfulBuild/artifact/dashboard.html
> > >> > > > > >
> > >> > > > > >
> > >> > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> https://builds.apache.org/job/HBASE-Find-Flaky-Tests/job/master/lastSuccessfulBuild/artifact/dashboard.html
> > >> > > > > >
> > >> > > > > > Lots of UTs are in trouble and it makes it really hard to
> pass
> > >> the
> > >> > > pre
> > >> > > > > > commit check which means it is really hard to contribute to
> > the
> > >> > > > > project...
> > >> > > > > >
> > >> > > > > > We need to fix this soon...
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> >
>

Re: What is the situation for our UTs now?

Posted by Stack <st...@duboce.net>.
On Wed, Mar 4, 2020 at 3:42 AM 张铎(Duo Zhang) <pa...@gmail.com> wrote:

> And speak a little more on increasing the forkCount. In fact, the test
> category is not too rough. The LargeTests just means the test will run a
> bit long, does not mean it will consume more resources. Maybe the tests
> just have lots of Thread.sleep so we declare it as LargeTests.
>
>
I've done a few passes on test categorization of late. The notion had
rotted pretty badly but should be cleaned up now.


> What I can see is that, all the replication related tests are flaky now.
> This is reasonable. In replication tests, usually we have to set up at
> least two mini clusters, and the replication system itself will make use of
> lots of threads. So if you run several replication related tests together,
> it will easy to overload and cause the UTs to timeout or OOM.
>
>
We have at least one test that makes four clusters inside the one JVM.

Yeah, the resource usage in general needs weeding.

Perhaps you are arguing that we just leave the tests in the state they are in?
That we let long tests run in series in case two or more might run together
and fail because they are profligate in their resource use?



> So, again, let's do this on a feature branch. It is fine to mess things up
> on a feature branch. You can do everything you want as the intermediate
> state does not effect others. On master and branch-2 it is another story. I
> do not think this should be a blocker for 2.3.0 or 3.0.0.
>
See previous note.

Thanks,
S


> Thanks.

Re: What is the situation for our UTs now?

Posted by "张铎 (Duo Zhang)" <pa...@gmail.com>.
To speak a little more about increasing the forkCount: in fact, the test
category is not too rough. LargeTests just means the test will run a bit
long; it does not mean it will consume more resources. Maybe the test just
has lots of Thread.sleep calls, so we declare it as LargeTests.

What I can see is that all the replication-related tests are flaky now.
This is reasonable. In replication tests, we usually have to set up at
least two mini clusters, and the replication system itself makes use of
lots of threads. So if you run several replication-related tests together,
it is easy to overload the machine and cause the UTs to time out or OOM.
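
As a rough illustration of the overload argument above, here is a
back-of-the-envelope sketch. The budget of four mini clusters per host is a
made-up number chosen only for the example, and both function names are
hypothetical.

```python
def peak_mini_clusters(fork_count, clusters_per_test):
    """Worst case: every fork runs a test that starts this many mini clusters."""
    return fork_count * clusters_per_test


def max_safe_forks(cluster_budget, clusters_per_test):
    """Largest fork count that stays within the budget; always at least one fork."""
    return max(1, cluster_budget // clusters_per_test)


# Hypothetical budget: assume the host copes with 4 mini clusters at once.
BUDGET = 4

# Four forks, each running a two-cluster replication test.
print(peak_mini_clusters(fork_count=4, clusters_per_test=2))
# Fork count that would keep two-cluster tests inside the assumed budget.
print(max_safe_forks(BUDGET, clusters_per_test=2))
```

With these assumed numbers, four forks of two-cluster tests would start
twice as many mini clusters as the host can take, which is the kind of
contention that shows up as timeouts or OOM.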

So, again, let's do this on a feature branch. It is fine to mess things up
on a feature branch. You can do everything you want as the intermediate
state does not affect others. On master and branch-2 it is another story. I
do not think this should be a blocker for 2.3.0 or 3.0.0.

Thanks.

张铎(Duo Zhang) <pa...@gmail.com> wrote on Wed, Mar 4, 2020 at 7:34 PM:

> Due to the resource limit I do not think it is a good idea to increase the
> forkCount...
>
> FWIW, can we do this on a feature branch and move master and branch-2 back?
>
> See here
>
> https://github.com/apache/hbase/pull/1221
>
> We tried several times and always got a large amount of failed UTs which
> are not related to the patch. And we even excluded hundreds of UTs due to
> the flaky list!
>
> This makes it almost impossible to contribute to the project. Even after
> several tries we get a green result, due to the excluded hundreds of UTs,
> no one know if the patch breaks something.
>
> Thanks.

Re: What is the situation for our UTs now?

Posted by "张铎 (Duo Zhang)" <pa...@gmail.com>.
Due to the resource limit I do not think it is a good idea to increase the
forkCount...

FWIW, can we do this on a feature branch and move master and branch-2 back?

See here

https://github.com/apache/hbase/pull/1221

We tried several times and always got a large number of failed UTs that
are not related to the patch. And we had even excluded hundreds of UTs via
the flaky list!

This makes it almost impossible to contribute to the project. Even if we
get a green result after several tries, hundreds of UTs were excluded, so
no one knows whether the patch breaks something.
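
To make the coverage gap concrete, here is a minimal sketch of what skipping
the flaky list means for a precommit run. The test names and the function
name are hypothetical; this is not how the actual precommit job is
implemented.

```python
def split_for_precommit(all_tests, flaky):
    """Return (tests that run, tests skipped because they are on the flaky list)."""
    flaky = set(flaky)
    run = [t for t in all_tests if t not in flaky]
    skipped = [t for t in all_tests if t in flaky]
    return run, skipped


# Hypothetical test names, not real HBase classes.
all_tests = ["TestA", "TestB", "TestC", "TestD"]
flaky = ["TestB", "TestD"]

run, skipped = split_for_precommit(all_tests, flaky)
print(f"ran {len(run)}, skipped {len(skipped)}")
```

A green result only speaks for the tests in the first list. A regression
that only a skipped test would catch goes unnoticed.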

Thanks.

Stack <st...@duboce.net> wrote on Wed, Mar 4, 2020 at 2:55 PM:

> Upstream branch-2 and master nightlies don't look too bad currently. There
> are a few bad runs where there were a bunch of hangs which makes things
> look bad. I upped the number of tests we show from 5 to 10 on branch-2 and
> master which makes it so a failed tests shows longer in the top half of the
> flakies page -- and more flakies are listed. On the bottom half, I'd upped
> the ferocity with which we run on GCE to draw out flakies. Needless to say,
> they fail more often when contended resources. I might knock the ferocity
> down in the next day or so but am trying to land some patches that cut down
> on resource usage and want to see how these do in the flakie runs first.
>
> Master I haven't looked at much... looks like branch-2?  Branch-2.2 and
> branch-2.1 look sleepy. Similar amounts of flakies in the nightlies. They
> don't have the ferocity upped so the lower-half GCE section looks 'better'.
> I can make them look like branch-2 and master if folks want (smile) but its
> probably ok letting the flakies lie in branches that are being bypassed.
>
> Generally,  I've been working on unit tests with inspiration and help from
> Mark Miller and Nick. Our tests are in a poor state. They take so long,
> they don't get run anywhere else other than up on jenkins. They rarely pass
> and only then on accident if minimal parallelism and jitter. On multi-core
> machines, they use 1 to 2 cores only -- even if the machine has tens of
> them.
>
> I have been trying to burn down the flakies, make the tests complete
> successfully in less time with more parallelism, using all of the machine,
> and make them pass both on jenkins and locally. Of late, have been focused
> on branch-2 since it is calming down getting ready for a 2.3.0RC0. Having
> some success but its a  nasty job where it is hard to claim advances
> because the flakies vary w/ the context in which the tests are run.
> Hopefully we'll turn a corner on jenkins soon for folks to enjoy.
>
> Shout if need more detail.
> S

Re: What is the situation for our UTs now?

Posted by Stack <st...@duboce.net>.
Upstream branch-2 and master nightlies don't look too bad currently. There
are a few bad runs where there were a bunch of hangs which makes things
look bad. I upped the number of tests we show from 5 to 10 on branch-2 and
master which makes it so a failed test shows longer in the top half of the
flakies page -- and more flakies are listed. On the bottom half, I'd upped
the ferocity with which we run on GCE to draw out flakies. Needless to say,
they fail more often when resources are contended. I might knock the ferocity
down in the next day or so but am trying to land some patches that cut down
on resource usage and want to see how these do in the flakie runs first.
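
The effect of widening the window from 5 to 10 can be sketched as follows.
This is only an illustration under simple assumptions (a test is listed if
any run inside the window failed); the function name and the pass/fail
encoding are hypothetical and are not taken from the actual dashboard
script.

```python
def is_listed_as_flaky(results, window):
    """results: booleans, oldest run first, True meaning the run passed.

    Assumed rule: the test is listed if any of the last `window` runs failed.
    """
    recent = results[-window:]
    return any(not passed for passed in recent)


# One failure followed by seven clean runs.
history = [False] + [True] * 7

print(is_listed_as_flaky(history, window=5))   # the failure has aged out
print(is_listed_as_flaky(history, window=10))  # the failure is still in view
```

The same history drops off the list with the narrow window and stays on it
with the wide one, so a wider window alone makes the page look worse
without any test getting worse.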

Master I haven't looked at much... looks like branch-2?  Branch-2.2 and
branch-2.1 look sleepy. Similar amounts of flakies in the nightlies. They
don't have the ferocity upped so the lower-half GCE section looks 'better'.
I can make them look like branch-2 and master if folks want (smile) but it's
probably ok letting the flakies lie in branches that are being bypassed.

Generally, I've been working on unit tests with inspiration and help from
Mark Miller and Nick. Our tests are in a poor state. They take so long that
they don't get run anywhere other than up on jenkins. They rarely pass,
and then only by accident when parallelism and jitter are minimal. On multi-core
machines, they use 1 to 2 cores only -- even if the machine has tens of
them.
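
For context on sizing parallelism from the machine: Maven's -T option and
Surefire's forkCount both accept a value with a C suffix (for example 0.5C)
that is multiplied by the number of available cores. A minimal sketch of
that scaling follows. The function name is hypothetical, and the rounding
(truncate, never below one) is an assumption rather than a statement of what
Maven does internally.

```python
import os


def resolve_count(spec, cores=None):
    """Turn a value such as '2' or '0.5C' into a concrete count.

    A trailing 'C' means "multiply by the number of cores".
    Rounding here is an assumption: truncate, but never go below 1.
    """
    if cores is None:
        cores = os.cpu_count() or 1
    if spec.endswith("C"):
        return max(1, int(float(spec[:-1]) * cores))
    return int(spec)


# On a 16-core machine, a fixed value ignores the hardware,
# while a C-suffixed value scales with it.
print(resolve_count("1", cores=16))
print(resolve_count("0.5C", cores=16))
```

A fixed count of 1 or 2 leaves most of a large machine idle, while a
core-relative value adapts to whatever hardware the build lands on.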

I have been trying to burn down the flakies, make the tests complete
successfully in less time with more parallelism, using all of the machine,
and make them pass both on jenkins and locally. Of late, I have been focused
on branch-2 since it is calming down getting ready for a 2.3.0RC0. I am having
some success but it's a nasty job where it is hard to claim advances
because the flakies vary w/ the context in which the tests are run.
Hopefully we'll turn a corner on jenkins soon for folks to enjoy.

Shout if you need more detail.
S


On Tue, Mar 3, 2020 at 6:00 PM 张铎(Duo Zhang) <pa...@gmail.com> wrote:

> But why branch-2.2 and branch-2.1 are still fine?

Re: What is the situation for our UTs now?

Posted by "张铎 (Duo Zhang)" <pa...@gmail.com>.
But why are branch-2.2 and branch-2.1 still fine?

Sean Busbey <bu...@apache.org> wrote on Wed, Mar 4, 2020 at 9:24 AM:

> I agree in principle that excluding 100s of UTs isn't good. But we don't
> really have better options given the state of tests and testing hardware
> currently available to us.

Re: What is the situation for our UTs now?

Posted by Sean Busbey <bu...@apache.org>.
I agree in principle that excluding 100s of UTs isn't good. But we don't
really have better options given the state of tests and testing hardware
currently available to us.

On Tue, Mar 3, 2020, 19:14 张铎(Duo Zhang) <pa...@gmail.com> wrote:

> I think the problem is all UTs are failing randomly...
>
> And it is also not a good idea to exclude hundreds of UTs in pre commit?

Re: What is the situation for our UTs now?

Posted by "张铎 (Duo Zhang)" <pa...@gmail.com>.
I think the problem is that all UTs are failing randomly...

And isn't it also a bad idea to exclude hundreds of UTs in pre commit?

Sean Busbey <bu...@apache.org> wrote on Wed, Mar 4, 2020 at 9:11 AM:

> Everything in the flake list should be skipped at precommit time. Is that
> not happening?
>
> Are we keeping a shorter flake window so things are bouncing in and out of
> the list?

Re: What is the situation for our UTs now?

Posted by Sean Busbey <bu...@apache.org>.
Everything in the flake list should be skipped at precommit time. Is that
not happening?

Are we keeping a shorter flake window so things are bouncing in and out of
the list?

On Tue, Mar 3, 2020, 18:56 张铎(Duo Zhang) <pa...@gmail.com> wrote:

> I see recently there are lots of 'flaky tests' related issues been resolved
> but seems the situation is getting worse? For branch-2.2 the flaky page is
> fine, but for master it is totally a mess...
>
>
> https://builds.apache.org/job/HBASE-Find-Flaky-Tests/job/branch-2.2/lastSuccessfulBuild/artifact/dashboard.html
>
>
>
> https://builds.apache.org/job/HBASE-Find-Flaky-Tests/job/master/lastSuccessfulBuild/artifact/dashboard.html
>
> Lots of UTs are in trouble and it makes it really hard to pass the pre
> commit check which means it is really hard to contribute to the project...
>
> We need to fix this soon...
>