Posted to dev@kudu.apache.org by Todd Lipcon <to...@cloudera.com> on 2018/03/23 16:58:11 UTC

Flaky pre-commits

It seems that over recent weeks our precommits have gotten somewhat flaky.
Some of this is due to actual flaky tests (most of which are tracked by
JIRAs) but a lot has been due to issues like clock synchronization problems
on the dist-test slaves.

I'd like to consider changing precommit to retry _all_ tests up to 3 times,
instead of just known-flakies. It's a bit of a heavy hammer -- the risk is
that if you introduce flakiness in a test you aren't likely to see it
precommit, but I think the upside of avoiding wasted effort triaging failed
precommits is probably worth it.

Longer term, hopefully we can improve the dist-test software to support
something like a "retry if results match a certain regex" to check for
clock sync errors or somesuch, but I think it's non-trivial.
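
To make the two retry policies in this message concrete, here is a minimal sketch. It is not the actual dist-test or precommit code: the function name, the `run_test` callback, and the error regex are all assumptions made for illustration, and the real log text of a clock-sync failure is not given in this thread.

```python
import re
from typing import Callable, Optional, Pattern, Tuple

# Hypothetical pattern for an infrastructure failure; the real message
# emitted on a clock-sync error may look different.
INFRA_ERROR_RE = re.compile(r"clock.*(unsynchronized|not synchronized)",
                            re.IGNORECASE)


def run_with_retries(run_test: Callable[[], Tuple[bool, str]],
                     max_attempts: int = 3,
                     retry_if: Optional[Pattern[str]] = None) -> Tuple[bool, int]:
    """Run a test up to max_attempts times.

    run_test returns (passed, output). With retry_if=None every failure is
    retried (the "retry all tests" policy). With a pattern, a failure is
    retried only when its output matches (the "retry on regex" policy).
    Returns (passed, attempts_used).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        passed, output = run_test()
        if passed:
            return True, attempt
        if retry_if is not None and not retry_if.search(output):
            # A failure that does not look like infra trouble is reported
            # immediately, so newly introduced flakiness stays visible.
            return False, attempt
    return False, max_attempts
```

Calling `run_with_retries(test, 3)` gives the heavy-hammer behavior, while `run_with_retries(test, 3, INFRA_ERROR_RE)` retries only failures whose output matches the pattern.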

Thoughts?

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Flaky pre-commits

Posted by Adar Lieber-Dembo <ad...@cloudera.com>.

> I do think the 'NTP in kudu' could help a bit here, especially if it were
> only used as a "backup" in case the kernel is unsynchronized. I'm a little
> nervous about the impact on NTP servers, though, in our minicluster based
> tests where we might start and stop tens of thousands of times in the
> course of a 15-minute dist-test run. Wouldn't be surprised if that caused
> us to get blacklisted unless we took some effort to ensure that
> miniclusters "reuse" some NTP state instead of resynchronizing at startup.

If, like you said, we were to limit the user-space NTP client to only
operate in cases where the system ntpd isn't working properly, would
that mitigate the impact on remote NTP servers?

I didn't say this in my first reply, but I really do value precommit's
ability to surface new flaky tests. When it happens it serves as a
good reminder that my new tests should be run in dist-test. It's more
expensive (in terms of troubleshooting time) to figure that out after
the fact; we don't monitor the flaky test dashboard as much as we
should. That's why I'm interested in finding solutions to the general
"precommit tests failed due to flaky infra" problem.

Re: Flaky pre-commits

Posted by Todd Lipcon <to...@cloudera.com>.

On Fri, Mar 23, 2018 at 11:03 AM, Adar Lieber-Dembo <ad...@cloudera.com> wrote:

> The clock sync errors do seem to have increased over the past few
> months. If we could just fix those, I think we'd be left with almost
> entirely "known" flakies. Any ideas as to what's going on? Think it's
> something that could be addressed with your NTP-client-in-Kudu patch
> series?
>

Nothing has changed in the node configurations as far as I'm aware, so not
sure why it appears to have become more common lately.

I do think the 'NTP in kudu' could help a bit here, especially if it were
only used as a "backup" in case the kernel is unsynchronized. I'm a little
nervous about the impact on NTP servers, though, in our minicluster based
tests where we might start and stop tens of thousands of times in the
course of a 15-minute dist-test run. Wouldn't be surprised if that caused
us to get blacklisted unless we took some effort to ensure that
miniclusters "reuse" some NTP state instead of resynchronizing at startup.
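
A rough sketch of the "backup only, and reuse state" idea follows. It is not taken from the NTP-client patch series: the class, the function, the 900-second maximum age, and the injected `query_offset` callback are all hypothetical. It also caches within one process only; miniclusters that start separate daemon processes would need to share the state some other way, such as a file.

```python
import time
from typing import Callable, Optional


class SharedNtpState:
    """Caches one clock-offset measurement so repeated startups reuse it."""

    def __init__(self,
                 query_offset: Callable[[], float],
                 max_age_s: float = 900.0,
                 now: Callable[[], float] = time.monotonic) -> None:
        self._query_offset = query_offset  # stands in for a real NTP query
        self._max_age_s = max_age_s
        self._now = now
        self._offset: Optional[float] = None
        self._measured_at: Optional[float] = None
        self.queries = 0  # number of times the remote server was contacted

    def offset(self) -> float:
        now = self._now()
        stale = (self._measured_at is None
                 or now - self._measured_at > self._max_age_s)
        if stale:
            self._offset = self._query_offset()
            self._measured_at = now
            self.queries += 1
        return self._offset


def pick_clock_offset(kernel_synchronized: bool,
                      state: SharedNtpState) -> float:
    """Trust the kernel clock when it is synchronized; otherwise fall back
    to the cached user-space measurement."""
    if kernel_synchronized:
        return 0.0
    return state.offset()
```

Under these assumptions, a thousand minicluster startups inside the maximum age cause at most one query to the remote server, and none at all while the kernel reports itself synchronized.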

>
> On Fri, Mar 23, 2018 at 9:58 AM, Todd Lipcon <to...@cloudera.com> wrote:
> > It seems that over recent weeks our precommits have gotten somewhat flaky.
> > Some of this is due to actual flaky tests (most of which are tracked by
> > JIRAs) but a lot has been due to issues like clock synchronization problems
> > on the dist-test slaves.
> >
> > I'd like to consider changing precommit to retry _all_ tests up to 3 times,
> > instead of just known-flakies. It's a bit of a heavy hammer -- the risk is
> > that if you introduce flakiness in a test you aren't likely to see it
> > precommit, but I think the upside of avoiding wasted effort triaging failed
> > precommits is probably worth it.
> >
> > Longer term hopefully we can improve the dist-test software to support
> > something like a "retry if results match a certain regex" to check for
> > clock sync errors or somesuch, but I think it's non-trivial.
> >
> > Thoughts?
> >
> > -Todd
> > --
> > Todd Lipcon
> > Software Engineer, Cloudera
>

-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Flaky pre-commits

Posted by Adar Lieber-Dembo <ad...@cloudera.com>.

The clock sync errors do seem to have increased over the past few
months. If we could just fix those, I think we'd be left with almost
entirely "known" flakies. Any ideas as to what's going on? Think it's
something that could be addressed with your NTP-client-in-Kudu patch
series?

On Fri, Mar 23, 2018 at 9:58 AM, Todd Lipcon <to...@cloudera.com> wrote:
> It seems that over recent weeks our precommits have gotten somewhat flaky.
> Some of this is due to actual flaky tests (most of which are tracked by
> JIRAs) but a lot has been due to issues like clock synchronization problems
> on the dist-test slaves.
>
> I'd like to consider changing precommit to retry _all_ tests up to 3 times,
> instead of just known-flakies. It's a bit of a heavy hammer -- the risk is
> that if you introduce flakiness in a test you aren't likely to see it
> precommit, but I think the upside of avoiding wasted effort triaging failed
> precommits is probably worth it.
>
> Longer term hopefully we can improve the dist-test software to support
> something like a "retry if results match a certain regex" to check for
> clock sync errors or somesuch, but I think it's non-trivial.
>
> Thoughts?
>
> -Todd
> --
> Todd Lipcon
> Software Engineer, Cloudera