Posted to dev@kudu.apache.org by Todd Lipcon <to...@cloudera.com> on 2017/11/20 19:50:03 UTC

Flaky tests?

Hey folks,

It seems some of our tests have gotten pretty flaky lately again. Some of
it is likely due to churn in test infrastructure (running on a different VM
type now I think) but it makes me a little nervous to go into the 1.6
release with some tests at 5%+ flaky.

Can we get some volunteers to triage the top couple most flaky? Note that
"triage" doesn't necessarily mean "fix" -- just want to investigate to the
point that we can decide it's likely to be a test issue or known existing
issue rather than a regression before the release.

I'll volunteer to look at consensus_peers-itests (the top most flaky one).

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Flaky tests?

Posted by Alexey Serbin <as...@cloudera.com>.

An update: the flakiness in raft_consensus_nonvoter-itest has been fixed.

On 11/27/17 6:55 PM, Alexey Serbin wrote:
> Yep, that CatalogManagerAddsNonVoter is the new one which was 
> committed just yesterday.
>
> On 11/27/17 6:53 PM, Alexey Serbin wrote:
>> The raft_consensus_nonvoter-itest is the set of tests added for 3-4-3 
>> re-replication improvements.  I'm adding more scenarios there right 
>> now, and I'll take care of the current flaky ones from there as well.
>>
>>
>> Thanks,
>>
>> Alexey
>>
>> On 11/27/17 6:38 PM, Andrew Wong wrote:
>>> N/w! I should have checked with you beforehand given you were already in
>>> the area (per your response last week). Seems the double-effort was fairly
>>> minimal anyway.
>>>
>>> With the fixes for tablet_copy-itest and delete_table-itest checked in, the
>>> next-highest offenders on the dashboard
>>> <http://dist-test.cloudera.org:8080/> are:
>>>
>>>     - raft_consensus_nonvoter-itest (9.62%)
>>>     - linked_list-test (8.45%)
>>>
>>> From a quick glance I'm not sure I have a grasp on what's going on in
>>> either test. Would anyone like to volunteer? 😃
>>>
>>> On Mon, Nov 27, 2017 at 6:27 PM, Alexey Serbin <as...@cloudera.com> wrote:
>>>
>>>> I just realized after re-reading this message that Andrew was about to
>>>> look at the flake in delete_table-itest as well.  I'm sorry for the
>>>> double-effort here, if any.  I read this message after posting the patch.
>>>>
>>>>
>>>>
>>>> On 11/27/17 12:09 PM, Andrew Wong wrote:
>>>>
>>>>> I'm taking a look at tablet_copy-itest and the flakiness in
>>>>> delete_table-itest beyond Alexey's outstanding patch.
>>>>>
>>>>> On Tue, Nov 21, 2017 at 10:17 AM, Todd Lipcon <to...@cloudera.com> wrote:
>>>>>
>>>>>> On Tue, Nov 21, 2017 at 10:13 AM, Alexey Serbin <as...@cloudera.com> wrote:
>>>>>>
>>>>>>> I'll take a look at delete_table-itest (at least I have had a patch in
>>>>>>> review for one flake there for a long time).
>>>>>>>
>>>>>>> BTW, it would be much better if it were possible to see the type of failed
>>>>>>> build in the dashboard (as it was prior to quasar).  Is the type of a build
>>>>>>> something inherently impossible to expose from quasar?
>>>>>>
>>>>>> I think it should be possible by just setting the BUILD_ID environment
>>>>>> variable appropriately before reporting the test result. That information
>>>>>> should be available in the environment as $BUILD_TYPE or somesuch. I think
>>>>>> Ed is out this week but maybe he can take a look at this when he gets back?
>>>>>>
>>>>>> -Todd
>>>>>>
>>>>>>> Best regards,
>>>>>>>
>>>>>>> Alexey
>>>>>>>
>>>>>>> On 11/20/17 11:50 AM, Todd Lipcon wrote:
>>>>>>>
>>>>>>>> Hey folks,
>>>>>>>>
>>>>>>>> It seems some of our tests have gotten pretty flaky lately again. Some of
>>>>>>>> it is likely due to churn in test infrastructure (running on a different VM
>>>>>>>> type now I think) but it makes me a little nervous to go into the 1.6
>>>>>>>> release with some tests at 5%+ flaky.
>>>>>>>>
>>>>>>>> Can we get some volunteers to triage the top couple most flaky? Note that
>>>>>>>> "triage" doesn't necessarily mean "fix" -- just want to investigate to the
>>>>>>>> point that we can decide it's likely to be a test issue or known existing
>>>>>>>> issue rather than a regression before the release.
>>>>>>>>
>>>>>>>> I'll volunteer to look at consensus_peers-itests (the top most flaky one).
>>>>>>>>
>>>>>>>> -Todd
>>>>>>
>>>>>> --
>>>>>> Todd Lipcon
>>>>>> Software Engineer, Cloudera
>>>>>>
>>>>>>
>>>>>
>>>
>>
>

Re: Flaky tests?

Posted by Alexey Serbin <as...@cloudera.com>.

Yep, that CatalogManagerAddsNonVoter is the new one which was committed 
just yesterday.

On 11/27/17 6:53 PM, Alexey Serbin wrote:
> The raft_consensus_nonvoter-itest is the set of tests added for 3-4-3 
> re-replication improvements.  I'm adding more scenarios there right 
> now, and I'll take care of the current flaky ones from there as well.
>
>
> Thanks,
>
> Alexey
>
> On 11/27/17 6:38 PM, Andrew Wong wrote:
>> N/w! I should have checked with you beforehand given you were already in
>> the area (per your response last week). Seems the double-effort was fairly
>> minimal anyway.
>>
>> With the fixes for tablet_copy-itest and delete_table-itest checked in, the
>> next-highest offenders on the dashboard
>> <http://dist-test.cloudera.org:8080/> are:
>>
>>     - raft_consensus_nonvoter-itest (9.62%)
>>     - linked_list-test (8.45%)
>>
>> From a quick glance I'm not sure I have a grasp on what's going on in
>> either test. Would anyone like to volunteer? 😃
>>
>> On Mon, Nov 27, 2017 at 6:27 PM, Alexey Serbin <as...@cloudera.com> wrote:
>>
>>> I just realized after re-reading this message that Andrew was about to
>>> look at the flake in delete_table-itest as well.  I'm sorry for the
>>> double-effort here, if any.  I read this message after posting the patch.
>>>
>>>
>>>
>>> On 11/27/17 12:09 PM, Andrew Wong wrote:
>>>
>>>> I'm taking a look at tablet_copy-itest and the flakiness in
>>>> delete_table-itest beyond Alexey's outstanding patch.
>>>>
>>>> On Tue, Nov 21, 2017 at 10:17 AM, Todd Lipcon <to...@cloudera.com> wrote:
>>>>
>>>>> On Tue, Nov 21, 2017 at 10:13 AM, Alexey Serbin <as...@cloudera.com> wrote:
>>>>>
>>>>>> I'll take a look at delete_table-itest (at least I have had a patch in
>>>>>> review for one flake there for a long time).
>>>>>>
>>>>>> BTW, it would be much better if it were possible to see the type of failed
>>>>>> build in the dashboard (as it was prior to quasar).  Is the type of a build
>>>>>> something inherently impossible to expose from quasar?
>>>>>
>>>>> I think it should be possible by just setting the BUILD_ID environment
>>>>> variable appropriately before reporting the test result. That information
>>>>> should be available in the environment as $BUILD_TYPE or somesuch. I think
>>>>> Ed is out this week but maybe he can take a look at this when he gets back?
>>>>>
>>>>> -Todd
>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>> Alexey
>>>>>>
>>>>>> On 11/20/17 11:50 AM, Todd Lipcon wrote:
>>>>>>
>>>>>>> Hey folks,
>>>>>>>
>>>>>>> It seems some of our tests have gotten pretty flaky lately again. Some of
>>>>>>> it is likely due to churn in test infrastructure (running on a different VM
>>>>>>> type now I think) but it makes me a little nervous to go into the 1.6
>>>>>>> release with some tests at 5%+ flaky.
>>>>>>>
>>>>>>> Can we get some volunteers to triage the top couple most flaky? Note that
>>>>>>> "triage" doesn't necessarily mean "fix" -- just want to investigate to the
>>>>>>> point that we can decide it's likely to be a test issue or known existing
>>>>>>> issue rather than a regression before the release.
>>>>>>>
>>>>>>> I'll volunteer to look at consensus_peers-itests (the top most flaky one).
>>>>>>>
>>>>>>> -Todd
>>>>>
>>>>> --
>>>>> Todd Lipcon
>>>>> Software Engineer, Cloudera
>>>>>
>>>>>
>>>>
>>
>

Re: Flaky tests?

Posted by Alexey Serbin <as...@cloudera.com>.

The raft_consensus_nonvoter-itest is the set of tests added for 3-4-3 
re-replication improvements.  I'm adding more scenarios there right now, 
and I'll take care of the current flaky ones from there as well.


Thanks,

Alexey

On 11/27/17 6:38 PM, Andrew Wong wrote:
> N/w! I should have checked with you beforehand given you were already in
> the area (per your response last week). Seems the double-effort was fairly
> minimal anyway.
>
> With the fixes for tablet_copy-itest and delete_table-itest checked in, the
> next-highest offenders on the dashboard
> <http://dist-test.cloudera.org:8080/> are:
>
>     - raft_consensus_nonvoter-itest (9.62%)
>     - linked_list-test (8.45%)
>
> From a quick glance I'm not sure I have a grasp on what's going on in
> either test. Would anyone like to volunteer? 😃
>
> On Mon, Nov 27, 2017 at 6:27 PM, Alexey Serbin <as...@cloudera.com> wrote:
>
>> I just realized after re-reading this message that Andrew was about to
>> look at the flake in delete_table-itest as well.  I'm sorry for the
>> double-effort here, if any.  I read this message after posting the patch.
>>
>>
>>
>> On 11/27/17 12:09 PM, Andrew Wong wrote:
>>
>>> I'm taking a look at tablet_copy-itest and the flakiness in
>>> delete_table-itest beyond Alexey's outstanding patch.
>>>
>>> On Tue, Nov 21, 2017 at 10:17 AM, Todd Lipcon <to...@cloudera.com> wrote:
>>>
>>>> On Tue, Nov 21, 2017 at 10:13 AM, Alexey Serbin <as...@cloudera.com> wrote:
>>>>
>>>>> I'll take a look at delete_table-itest (at least I have had a patch in
>>>>> review for one flake there for a long time).
>>>>>
>>>>> BTW, it would be much better if it were possible to see the type of failed
>>>>> build in the dashboard (as it was prior to quasar).  Is the type of a build
>>>>> something inherently impossible to expose from quasar?
>>>>
>>>> I think it should be possible by just setting the BUILD_ID environment
>>>> variable appropriately before reporting the test result. That information
>>>> should be available in the environment as $BUILD_TYPE or somesuch. I think
>>>> Ed is out this week but maybe he can take a look at this when he gets back?
>>>>
>>>> -Todd
>>>>
>>>>> Best regards,
>>>>>
>>>>> Alexey
>>>>>
>>>>> On 11/20/17 11:50 AM, Todd Lipcon wrote:
>>>>>
>>>>>> Hey folks,
>>>>>>
>>>>>> It seems some of our tests have gotten pretty flaky lately again. Some of
>>>>>> it is likely due to churn in test infrastructure (running on a different VM
>>>>>> type now I think) but it makes me a little nervous to go into the 1.6
>>>>>> release with some tests at 5%+ flaky.
>>>>>>
>>>>>> Can we get some volunteers to triage the top couple most flaky? Note that
>>>>>> "triage" doesn't necessarily mean "fix" -- just want to investigate to the
>>>>>> point that we can decide it's likely to be a test issue or known existing
>>>>>> issue rather than a regression before the release.
>>>>>>
>>>>>> I'll volunteer to look at consensus_peers-itests (the top most flaky one).
>>>>>>
>>>>>> -Todd
>>>>
>>>> --
>>>> Todd Lipcon
>>>> Software Engineer, Cloudera
>>>>
>>>>
>>>
>

Re: Flaky tests?

Posted by Andrew Wong <aw...@cloudera.com>.

N/w! I should have checked with you beforehand given you were already in
the area (per your response last week). Seems the double-effort was fairly
minimal anyway.

With the fixes for tablet_copy-itest and delete_table-itest checked in, the
next-highest offenders on the dashboard
<http://dist-test.cloudera.org:8080/> are:

   - raft_consensus_nonvoter-itest (9.62%)
   - linked_list-test (8.45%)

From a quick glance I'm not sure I have a grasp on what's going on in
either test. Would anyone like to volunteer? 😃

On Mon, Nov 27, 2017 at 6:27 PM, Alexey Serbin <as...@cloudera.com> wrote:

> I just realized after re-reading this message that Andrew was about to
> look at the flake in delete_table-itest as well.  I'm sorry for the
> double-effort here, if any.  I read this message after posting the patch.
>
>
>
> On 11/27/17 12:09 PM, Andrew Wong wrote:
>
>> I'm taking a look at tablet_copy-itest and the flakiness in
>> delete_table-itest beyond Alexey's outstanding patch.
>>
>> On Tue, Nov 21, 2017 at 10:17 AM, Todd Lipcon <to...@cloudera.com> wrote:
>>
>>> On Tue, Nov 21, 2017 at 10:13 AM, Alexey Serbin <as...@cloudera.com> wrote:
>>>
>>>> I'll take a look at delete_table-itest (at least I have had a patch in
>>>> review for one flake there for a long time).
>>>>
>>>> BTW, it would be much better if it were possible to see the type of failed
>>>> build in the dashboard (as it was prior to quasar).  Is the type of a build
>>>> something inherently impossible to expose from quasar?
>>>
>>> I think it should be possible by just setting the BUILD_ID environment
>>> variable appropriately before reporting the test result. That information
>>> should be available in the environment as $BUILD_TYPE or somesuch. I think
>>> Ed is out this week but maybe he can take a look at this when he gets back?
>>>
>>> -Todd
>>>
>>>> Best regards,
>>>>
>>>> Alexey
>>>>
>>>> On 11/20/17 11:50 AM, Todd Lipcon wrote:
>>>>
>>>>> Hey folks,
>>>>>
>>>>> It seems some of our tests have gotten pretty flaky lately again. Some of
>>>>> it is likely due to churn in test infrastructure (running on a different VM
>>>>> type now I think) but it makes me a little nervous to go into the 1.6
>>>>> release with some tests at 5%+ flaky.
>>>>>
>>>>> Can we get some volunteers to triage the top couple most flaky? Note that
>>>>> "triage" doesn't necessarily mean "fix" -- just want to investigate to the
>>>>> point that we can decide it's likely to be a test issue or known existing
>>>>> issue rather than a regression before the release.
>>>>>
>>>>> I'll volunteer to look at consensus_peers-itests (the top most flaky one).
>>>>>
>>>>> -Todd
>>>
>>> --
>>> Todd Lipcon
>>> Software Engineer, Cloudera
>>>
>>>
>>
>>
>


-- 
Andrew Wong

Re: Flaky tests?

Posted by Alexey Serbin <as...@cloudera.com>.

I just realized after re-reading this message that Andrew was about to 
look at the flake in delete_table-itest as well.  I'm sorry for the 
double-effort here, if any.  I read this message after posting the patch.


On 11/27/17 12:09 PM, Andrew Wong wrote:
> I'm taking a look at tablet_copy-itest and the flakiness in
> delete_table-itest beyond Alexey's outstanding patch.
>
> On Tue, Nov 21, 2017 at 10:17 AM, Todd Lipcon <to...@cloudera.com> wrote:
>
>> On Tue, Nov 21, 2017 at 10:13 AM, Alexey Serbin <as...@cloudera.com> wrote:
>>
>>> I'll take a look at delete_table-itest (at least I have had a patch in
>>> review for one flake there for a long time).
>>>
>>> BTW, it would be much better if it were possible to see the type of failed
>>> build in the dashboard (as it was prior to quasar).  Is the type of a build
>>> something inherently impossible to expose from quasar?
>>
>> I think it should be possible by just setting the BUILD_ID environment
>> variable appropriately before reporting the test result. That information
>> should be available in the environment as $BUILD_TYPE or somesuch. I think
>> Ed is out this week but maybe he can take a look at this when he gets back?
>>
>> -Todd
>>
>>> Best regards,
>>>
>>> Alexey
>>>
>>> On 11/20/17 11:50 AM, Todd Lipcon wrote:
>>>
>>>> Hey folks,
>>>>
>>>> It seems some of our tests have gotten pretty flaky lately again. Some of
>>>> it is likely due to churn in test infrastructure (running on a different VM
>>>> type now I think) but it makes me a little nervous to go into the 1.6
>>>> release with some tests at 5%+ flaky.
>>>>
>>>> Can we get some volunteers to triage the top couple most flaky? Note that
>>>> "triage" doesn't necessarily mean "fix" -- just want to investigate to the
>>>> point that we can decide it's likely to be a test issue or known existing
>>>> issue rather than a regression before the release.
>>>>
>>>> I'll volunteer to look at consensus_peers-itests (the top most flaky one).
>>>>
>>>> -Todd
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>
>

Re: Flaky tests?

Posted by Andrew Wong <aw...@cloudera.com>.

I'm taking a look at tablet_copy-itest and the flakiness in
delete_table-itest beyond Alexey's outstanding patch.

On Tue, Nov 21, 2017 at 10:17 AM, Todd Lipcon <to...@cloudera.com> wrote:

> On Tue, Nov 21, 2017 at 10:13 AM, Alexey Serbin <as...@cloudera.com> wrote:
>
> > I'll take a look at delete_table-itest (at least I have had a patch in
> > review for one flake there for a long time).
> >
> > BTW, it would be much better if it were possible to see the type of failed
> > build in the dashboard (as it was prior to quasar).  Is the type of a build
> > something inherently impossible to expose from quasar?
>
> I think it should be possible by just setting the BUILD_ID environment
> variable appropriately before reporting the test result. That information
> should be available in the environment as $BUILD_TYPE or somesuch. I think
> Ed is out this week but maybe he can take a look at this when he gets back?
>
> -Todd
>
> > Best regards,
> >
> > Alexey
> >
> > On 11/20/17 11:50 AM, Todd Lipcon wrote:
> >
> >> Hey folks,
> >>
> >> It seems some of our tests have gotten pretty flaky lately again. Some of
> >> it is likely due to churn in test infrastructure (running on a different VM
> >> type now I think) but it makes me a little nervous to go into the 1.6
> >> release with some tests at 5%+ flaky.
> >>
> >> Can we get some volunteers to triage the top couple most flaky? Note that
> >> "triage" doesn't necessarily mean "fix" -- just want to investigate to the
> >> point that we can decide it's likely to be a test issue or known existing
> >> issue rather than a regression before the release.
> >>
> >> I'll volunteer to look at consensus_peers-itests (the top most flaky one).
> >>
> >> -Todd
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>



-- 
Andrew Wong

Re: Flaky tests?

Posted by Todd Lipcon <to...@cloudera.com>.

On Tue, Nov 21, 2017 at 10:13 AM, Alexey Serbin <as...@cloudera.com>
wrote:

> I'll take a look at delete_table-itest (at least I have had a patch in
> review for one flake there for a long time).
>
> BTW, it would be much better if it were possible to see the type of failed
> build in the dashboard (as it was prior to quasar).  Is the type of a build
> something inherently impossible to expose from quasar?
>

I think it should be possible by just setting the BUILD_ID environment
variable appropriately before reporting the test result. That information
should be available in the environment as $BUILD_TYPE or somesuch. I think
Ed is out this week but maybe he can take a look at this when he gets back?
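
A minimal sketch of that idea, assuming the build type really is exposed
as BUILD_TYPE and that the result reporter reads BUILD_ID (neither is
confirmed in this thread). The helper name reporting_env and the
"id-type" naming scheme are hypothetical, chosen only for illustration:

```python
import os


def reporting_env(base_env=None, source_var="BUILD_TYPE", target_var="BUILD_ID"):
    """Return a copy of the environment with target_var tagged by source_var.

    Hypothetical helper: the variable names follow the suggestion above, but
    the names actually used by the CI jobs and the reporter are assumptions.
    """
    # Copy so the caller's (or the process's) environment is never mutated.
    env = dict(os.environ if base_env is None else base_env)
    build_type = env.get(source_var)
    if build_type:
        # Keep any existing identifier and append the build type, so a
        # dashboard could tell runs of different build types apart.
        existing = env.get(target_var)
        env[target_var] = f"{existing}-{build_type}" if existing else build_type
    return env
```

The returned dict could be passed as the env argument of subprocess.run
when invoking whatever command reports the test result; if BUILD_TYPE is
not set, the environment is returned unchanged.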

-Todd


>
>
> Best regards,
>
> Alexey
>
>
> On 11/20/17 11:50 AM, Todd Lipcon wrote:
>
>> Hey folks,
>>
>> It seems some of our tests have gotten pretty flaky lately again. Some of
>> it is likely due to churn in test infrastructure (running on a different VM
>> type now I think) but it makes me a little nervous to go into the 1.6
>> release with some tests at 5%+ flaky.
>>
>> Can we get some volunteers to triage the top couple most flaky? Note that
>> "triage" doesn't necessarily mean "fix" -- just want to investigate to the
>> point that we can decide it's likely to be a test issue or known existing
>> issue rather than a regression before the release.
>>
>> I'll volunteer to look at consensus_peers-itests (the top most flaky one).
>>
>> -Todd
>>
>
>


-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Flaky tests?

Posted by Alexey Serbin <as...@cloudera.com>.

I'll take a look at delete_table-itest (at least I have had a patch in 
review for one flake there for a long time).

BTW, it would be much better if it were possible to see the type of 
failed build in the dashboard (as it was prior to quasar).  Is the type 
of a build something inherently impossible to expose from quasar?

Best regards,

Alexey

On 11/20/17 11:50 AM, Todd Lipcon wrote:
> Hey folks,
>
> It seems some of our tests have gotten pretty flaky lately again. Some of
> it is likely due to churn in test infrastructure (running on a different VM
> type now I think) but it makes me a little nervous to go into the 1.6
> release with some tests at 5%+ flaky.
>
> Can we get some volunteers to triage the top couple most flaky? Note that
> "triage" doesn't necessarily mean "fix" -- just want to investigate to the
> point that we can decide it's likely to be a test issue or known existing
> issue rather than a regression before the release.
>
> I'll volunteer to look at consensus_peers-itests (the top most flaky one).
>
> -Todd