Posted to dev@hbase.apache.org by Apekshit Sharma <ap...@cloudera.com> on 2018/01/12 02:10:15 UTC

Flaky dashboard for current branch-2

https://builds.apache.org/job/HBase-Find-Flaky-Tests-branch2.0/lastSuccessfulBuild/artifact/dashboard.html

@stack: when you branch out branch-2.0, let me know and I'll update the jobs
to point to that branch so that it's helpful for the release. Once the release
is done, I'll move them back to "branch-2".


-- Appy

Re: Flaky dashboard for current branch-2

Posted by Apekshit Sharma <ap...@cloudera.com>.
bq. Why can a 100% failing test not be detected by the precommit check?
Precommit runs only those tests which are in the modules being changed. If
a change breaks downstream modules, it can lead to such a scenario.
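To make that concrete, here's a toy illustration (not the actual precommit/Yetus
logic; the module layout and commands are made up for the example) of how a
module-scoped run can miss a downstream breakage:

    # Toy model of module-scoped precommit test selection -- an illustration,
    # not the real HBase precommit. Only the modules owning the changed files
    # get their tests run; downstream modules that depend on them do not.
    import subprocess

    def modules_for(changed_files):
        # Assume the first path component is the Maven module (hypothetical layout).
        return sorted({path.split("/", 1)[0] for path in changed_files if "/" in path})

    def run_precommit_tests(changed_files):
        for module in modules_for(changed_files):
            # Tests in modules that merely depend on `module` are never run here.
            subprocess.run(["mvn", "-pl", module, "test"], check=False)

    # A patch touching only hbase-common runs hbase-common's tests, even if it
    # breaks a test over in hbase-server; nightly is the first place that shows up.
    run_precommit_tests(["hbase-common/src/main/java/SomeChangedFile.java"])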

-- Appy


Re: Flaky dashboard for current branch-2

Posted by Zach York <zy...@gmail.com>.
Thanks for the explanation Appy!

bq. I think we can actually update the script to send a mail to dev@ when it
encounters these 100% failing tests. Wanna try? :)

That would be cool, shame people into fixing tests :) I can try to take a
look into that.




Re: Flaky dashboard for current branch-2

Posted by Ted Yu <yu...@gmail.com>.
There is more than one reason.

Sometimes QA reported that tests in a module failed.
When artifact/patchprocess/patch-unit-hbase-server.txt was checked, there
was more than one occurrence of the following:

https://pastebin.com/WBewfj3Q

It is hard to decipher what was behind the crash.
Finding hanging tests is currently not automated.
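If someone wanted to automate that, a rough sketch could scan a run's console
log for test classes that start but never report a result (the job URL and the
surefire log patterns below are assumptions):

    # Sketch: flag test classes whose surefire "Running ..." line never gets a
    # matching "... - in <class>" result line. Patterns and URL are assumptions.
    import re
    import requests

    CONSOLE_URL = "https://builds.apache.org/job/HBase-Nightly-branch-2/123/consoleText"  # placeholder

    def find_hanging_tests(console_text):
        started = set(re.findall(r"^Running (org\.apache\.\S+)$", console_text, re.M))
        finished = set(re.findall(r" - in (org\.apache\.\S+)$", console_text, re.M))
        return sorted(started - finished)

    print(find_hanging_tests(requests.get(CONSOLE_URL).text))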

Also note the following at the beginning of the test run:

https://pastebin.com/sK6ebk84

FYI


Re: Flaky dashboard for current branch-2

Posted by "张铎 (Duo Zhang)" <pa...@gmail.com>.
Why can a 100% failing test not be detected by the precommit check?


Re: Flaky dashboard for current branch-2

Posted by Ted Yu <yu...@gmail.com>.
As we get closer and closer to the beta release, it is important to have as
few flaky tests as possible.

bq. we can actually update the script to send a mail to dev@

A post to the JIRA issue that caused the 100% failing test would be better.
The committer would notice the post and take corresponding action.
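Something along these lines could leave that comment once the offending issue
is identified (how the key is found, e.g. from the suspect commit message, and
the credentials are assumptions):

    # Sketch: comment on the JIRA whose change appears to have broken the test.
    import re
    import requests

    JIRA_API = "https://issues.apache.org/jira/rest/api/2"

    def issue_key_from_commit(commit_message):
        # Pull an HBASE-nnnnn key out of the suspect commit message, if present.
        match = re.search(r"HBASE-\d+", commit_message)
        return match.group(0) if match else None

    def comment_on_issue(issue_key, test_name, auth):
        # auth is a (user, password) tuple for issues.apache.org (assumption).
        comment = {"body": "Test %s has failed in 100%% of recent nightly runs "
                           "since this change went in." % test_name}
        requests.post("%s/issue/%s/comment" % (JIRA_API, issue_key),
                      json=comment, auth=auth)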

Cheers


Re: Flaky dashboard for current branch-2

Posted by Apekshit Sharma <ap...@cloudera.com>.
>   Is Nightly now using a list of flakes?
The dashboard job was flaky yesterday, so I didn't start using it. It looks
like it's working fine now; let me exclude the flakies from the nightly job.

> Just took a look at the dashboard. Does this capture only failed runs or all
> runs?
Sorry, the question isn't clear to me. Runs of what?
Here's an attempt to answer it as best I can: the dashboard job looks at the
last X (X=6 right now) runs of the nightly branch-2 job and collects failing,
hanging, and timed-out tests.
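In other words, roughly the following (a simplified sketch: the nightly job URL
is a placeholder, and the real job also digs through console logs for hanging
and timed-out tests, which this skips):

    # Simplified sketch of the failure-rate collection; the URL and report
    # layout are placeholders rather than exactly what the script hits.
    import requests
    from collections import Counter

    NIGHTLY_JOB = "https://builds.apache.org/job/HBase-Nightly-branch-2"  # placeholder
    MAX_RUNS = 6  # the "last X" above

    def failed_tests(build_number):
        # Names of tests that failed in one nightly run, per the Jenkins test report API.
        url = "%s/%d/testReport/api/json" % (NIGHTLY_JOB, build_number)
        report = requests.get(url).json()
        return {case["className"] + "#" + case["name"]
                for suite in report.get("suites", [])
                for case in suite.get("cases", [])
                if case.get("status") in ("FAILED", "REGRESSION")}

    def failure_rates():
        # Fraction of the last MAX_RUNS nightly runs in which each test failed.
        builds = requests.get(NIGHTLY_JOB + "/api/json").json()["builds"][:MAX_RUNS]
        counts = Counter()
        for build in builds:
            counts.update(failed_tests(build["number"]))
        return {test: count / float(len(builds)) for test, count in counts.items()}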

> I see that the following tests have failed 100% of the time for the last 30
> runs [1]. If this captures all runs, this isn't truly flaky, but rather a
> legitimate failure, right?
> Maybe this tool is used to see all test failures, but if not, I feel like
> we could/should remove a test from the flaky tests/excludes if it fails
> consistently so we can fix the root cause

This has come up a lot of times before. Yes, you're right: 100% failure =
legitimate failure.
<rant>
We as a community suck at tracking nightly runs for failing tests and
fixing them, otherwise we wouldn't have ~40 bad tests, right!
In fact, we suck at fixing tests even when they're presented in a nice clean
list (this dashboard). We just don't prioritize tests in our work.
The general attitude is: tests are failing... meh, what's new, they have been
failing for years. Instead of: oh, one test failed, find the cause and
revert it!
So the real thing to change here is the community's attitude towards
tests. I am +1 for anything that'll promote/support that change.
</rant>
I think we can actually update the script to send a mail to dev@ when it
encounters these 100% failing tests. Wanna try? :)
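Roughly, I'm picturing something like this (sketch only; the SMTP host and the
shape of the failure-rate data are assumptions, and the real script would reuse
whatever the dashboard already computes):

    # Sketch: mail dev@ whenever a test failed in every one of the recent runs.
    import smtplib
    from email.mime.text import MIMEText

    def mail_consistent_failures(rates, smtp_host="localhost"):
        # rates: {test name: fraction of recent nightly runs in which it failed}.
        broken = sorted(test for test, rate in rates.items() if rate >= 1.0)
        if not broken:
            return
        body = ("The following tests failed in 100% of the recent nightly runs, so\n"
                "they are likely legitimate failures rather than flakies:\n\n"
                + "\n".join(broken))
        msg = MIMEText(body)
        msg["Subject"] = "[flaky-dashboard] Tests failing in every recent nightly run"
        msg["From"] = "builds@apache.org"
        msg["To"] = "dev@hbase.apache.org"
        with smtplib.SMTP(smtp_host) as server:
            server.send_message(msg)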

-- Appy





Re: Flaky dashboard for current branch-2

Posted by Zach York <zy...@gmail.com>.
Just took a look at the dashboard. Does this capture only failed runs or
all runs?

I see that the following tests have failed 100% of the time for the last 30
runs [1]. If this captures all runs, this isn't truly flaky, but rather a
legitimate failure, right?
Maybe this tool is used to see all test failures, but if not, I feel like
we could/should remove a test from the flaky tests/excludes if it fails
consistently so we can fix the root cause.

[1]
master.balancer.TestRegionsOnMasterOptions
client.TestMultiParallel
regionserver.TestRegionServerReadRequestMetrics
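For the three above, something like this could keep consistent failures out of
the excludes list so they stay visible and get fixed (the failure-rate input
and the excludes file name are assumptions about how the dashboard data is
shaped):

    # Sketch: only intermittently failing tests belong on the flaky/excludes
    # list; tests failing in every run should surface as real failures instead.
    def split_flaky_vs_broken(rates):
        # rates: {test name: fraction of recent runs in which it failed} (assumed shape).
        flaky = sorted(t for t, r in rates.items() if 0 < r < 1.0)
        broken = sorted(t for t, r in rates.items() if r >= 1.0)  # fix these, don't exclude them
        return flaky, broken

    def write_excludes(flaky, path="excludes"):
        # Only genuinely flaky tests end up in the excludes file the nightly job reads.
        with open(path, "w") as out:
            out.write("\n".join(flaky) + "\n")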

Thanks,
Zach


Re: Flaky dashboard for current branch-2

Posted by Stack <st...@duboce.net>.
Dashboard doesn't capture timed out tests, right Appy?
Thanks,
S


Re: Flaky dashboard for current branch-2

Posted by Balazs Meszaros <ba...@cloudera.com>.
Nice job! Thanks Appy!


Re: Flaky dashboard for current branch-2

Posted by Stack <st...@duboce.net>.
Thanks Appy. Looks beautiful. Is Nightly now using a list of flakes? Thanks.
S

>