Posted to dev@cassandra.apache.org by Berenguer Blasi <be...@gmail.com> on 2022/07/05 04:47:05 UTC

Raise test timeouts?

Hi All,

bringing https://issues.apache.org/jira/browse/CASSANDRA-17729 to the ML 
for visibility as this has been a discussion point with some of you.

I noticed tests time out much more on jenkins than on circle. I was
wondering if legit bugs were hiding behind those timeouts, and it might
be the case. Feel free to jump in on the ticket :-)
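
A minimal sketch of the kind of per-test timeout the ticket is about,
assuming JUnit 4; the class name and the 240s/30s values below are
illustrative only, not the values proposed in CASSANDRA-17729:

    import org.junit.Rule;
    import org.junit.Test;
    import org.junit.rules.Timeout;

    public class ExampleTimeoutTest
    {
        // Class-wide ceiling: any individual test in this class that runs
        // longer than 4 minutes fails with a timeout instead of hanging the run.
        @Rule
        public Timeout globalTimeout = Timeout.seconds(240);

        // Per-test override: this one is cut off after 30 seconds.
        @Test(timeout = 30_000)
        public void readAfterWrite() throws Exception
        {
            // ... exercise the code under test ...
        }
    }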

Regards


Re: Raise test timeouts?

Posted by Berenguer Blasi <be...@gmail.com>.
Hi All,

if I am parsing this thread correctly, it seems we have a number of
fronts to attack and some are already progressing: tmp misconfig,
docker misconfig, unmatched resources in different CI envs, no
definition of minimal HW requirements, etc.

But so far there has been nothing against merging CASSANDRA-17729 (in
fact it already has a +1), as tests seem to indicate it may reveal legit
bugs. Correct me if I am wrong, but I will assume lazy consensus and
merge by the end of the week if nobody objects.

Given we're in the holiday season I will have no problem reverting,
it's quite easy in fact, if I missed something.

Regards

On 7/7/22 22:44, Mick Semb Wever wrote:
>
>
>     However, the docker space
>     issue needs to be resolved first since we don't have the capacity to
>     experiment with those nodes out of commission.
>
>
> ETA on fixing the docker space issues is this/next week. Once that 
> lands we can take a look at the abnormal CPU usage on some nodes.
>

Re: Raise test timeouts?

Posted by Mick Semb Wever <mc...@apache.org>.
> However, the docker space
> issue needs to be resolved first since we don't have the capacity to
> experiment with those nodes out of commission.
>


ETA on fixing the docker space issues is this/next week. Once that lands we
can take a look at the abnormal CPU usage on some nodes.

Re: Raise test timeouts?

Posted by Ekaterina Dimitrova <e....@gmail.com>.
Just wanted to bring up that we started seeing a trend pre-4.0 and it keeps
showing up now on the way to 4.1 - legit bugs are found more often in
CircleCI even when they do not pop up at all in Jenkins. So my appeal is to
keep checking CircleCI thoroughly as well, even if some failures are not
visible in Butler.

On Wed, 6 Jul 2022 at 11:27, Josh McKenzie <jm...@apache.org> wrote:

> Bringing discussion from JIRA (CASSANDRA-17729) to here:
>
> Mick said:
>
> Agree with the notion that Jenkins (lower resources/more contention) is
> better at exposing flakies, but that there's a trade-off between
> encouraging flakies and creating difficult-to-deal-with noise.
>
> I come back to the question: what minimum spec of hardware do we want to
> support for C*, and how can we best configure our CI infrastructure to be
> representative of that? Given the complexity and temporal relationships
> w/multiple actors in a distributed system, there's *always* going to be
> "defects" that show up if you sufficiently under-provision a host. That
> doesn't necessarily mean it's a user-facing bug that needs to be fixed.
>
> What I mean by that specifically: if you under-provision a node with 2
> cpus, 1.5 gigs of ram, slow disks, slow networking, and noisy neighbors,
> and the nodes take so long with GC pauses, compaction, streaming, etc that
> they don't correctly complete certain operations in expected time,
> completely time out, fall over, or otherwise *preserve correctness but
> die or don't complete operations in time* - is that a bug?
>
> And if the angle is more "the test isn't deterministic and fails on
> under-provisioned hosts; there's a bug *in the test*", well, that's just
> our lives. We have a lot of technical debt in the form of brittle
> non-deterministic tests we'd have to target excising to get past this if we
> keep our container provisioning where it is.
>
> If in the lead up to 4.0 we saw a sub 20% hit rate in product defects from
> flaky tests vs. test environment flakes alone, we have to consider how much
> effort from how many engineers it's taking in the run up to a release to
> hammer all these "flaky due to provisioning" tests back down vs. using
> other methodologies of testing to uncover correctness defects in timing,
> schema propagation, consistency level guarantees, etc.
>
> On Wed, Jul 6, 2022, at 10:43 AM, Brandon Williams wrote:
>
> I suspect there's another problem with some of the Jenkins nodes where
> the system CPU usage is high and drives the load much higher than
> other nodes, possibly causing timeouts. However, the docker space
> issue needs to be resolved first since we don't have the capacity to
> experiment with those nodes out of commission.
>
> On Tue, Jul 5, 2022 at 10:53 AM Josh McKenzie <jm...@apache.org>
> wrote:
> >
> > Another option would be to increase the resources dedicated to each
> agent container and run less in parallel. Or, best yet, do both (up
> timeouts and lower parallelization / up resources).
> >
> > As far as I can tell the failures on Jenkins aren't value-add compared
> to what we're seeing on circleci and are just generating busywork.
> >
> > There's a reasonable discussion to be had about "what's the smallest
> footprint of hardware we consider C* supported on" and targeting ASF CI to
> validate that. I believe the noisy env + low resources on ASF CI currently
> are lower than whatever floor we'd reasonably agree on.
> >
> > On Tue, Jul 5, 2022, at 12:47 AM, Berenguer Blasi wrote:
> >
> > Hi All,
> >
> > bringing https://issues.apache.org/jira/browse/CASSANDRA-17729 to the ML
> > for visibility as this has been a discussion point with some of you.
> >
> > I noticed tests time out much more on jenkins than on circle. I was
> > wondering if legit bugs were hiding behind those timeouts, and it might
> > be the case. Feel free to jump in on the ticket :-)
> >
> > Regards
> >
> >
> >
>
>
>

Re: Raise test timeouts?

Posted by Josh McKenzie <jm...@apache.org>.
> Having parity between CI systems is important, no matter how we approach it.
How much does the hardware allocation (cpu, memory, disk throughput, network throughput) differ between ASF Jenkins and circle midres? How much does the container isolation differ?

i.e. why are we seeing bugged tests that flake out on ASF CI but don't fail in Circle midres, for example?
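
As a starting point for comparing the two, a hedged sketch of a probe that
could be run on each environment to see what the JVM inside the container
actually gets; EnvProbe is a hypothetical name and nothing here is wired
into either CI setup:

    import java.io.File;
    import java.lang.management.ManagementFactory;
    import java.lang.management.OperatingSystemMXBean;

    public class EnvProbe
    {
        public static void main(String[] args)
        {
            OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
            Runtime rt = Runtime.getRuntime();

            // CPUs visible to this JVM, and the host's load average.
            System.out.println("processors       : " + rt.availableProcessors());
            System.out.println("load average     : " + os.getSystemLoadAverage());

            // Heap ceiling this JVM was started with.
            System.out.println("max heap (MB)    : " + rt.maxMemory() / (1024 * 1024));

            // Free space on the working directory's filesystem.
            System.out.println("usable disk (GB) : " + new File(".").getUsableSpace() / (1024L * 1024 * 1024));
        }
    }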

On Wed, Jul 6, 2022, at 1:31 PM, Mick Semb Wever wrote:
>> What I mean by that specifically: if you under-provision a node with 2 cpus, 1.5 gigs of ram, slow disks, slow networking, and noisy neighbors, and the nodes take so long with GC pauses, compaction, streaming, etc that they don't correctly complete certain operations in expected time, completely time out, fall over, or otherwise *preserve correctness but die or don't complete operations in time* - is that a bug?
> 
>  
> I'd say it is a bug in the test if we can't distinguish between the test failing and the test not completing/crashing. How much time folk want to spend on the different test frameworks we have to improve such things (on a distributed system), or what the expected time saving such improvements would provide, I leave to others. I appreciate how demotivating it is.
> 
> Having parity between CI systems is important, no matter how we approach it.

Re: Raise test timeouts?

Posted by Mick Semb Wever <mc...@apache.org>.
>
> What I mean by that specifically: if you under-provision a node with 2
> cpus, 1.5 gigs of ram, slow disks, slow networking, and noisy neighbors,
> and the nodes take so long with GC pauses, compaction, streaming, etc that
> they don't correctly complete certain operations in expected time,
> completely time out, fall over, or otherwise *preserve correctness but
> die or don't complete operations in time* - is that a bug?
>


I'd say it is a bug in the test if we can't distinguish between the test
failing and the test not completing/crashing. How much time folk want to
spend on the different test frameworks we have to improve such things (on a
distributed system), or what the expected time saving such improvements
would provide, I leave to others. I appreciate how demotivating it is.
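
As a hedged illustration of that distinction (the Await name below is
hypothetical, not an existing utility): a helper that polls a condition and
throws TimeoutException when the deadline passes, so "didn't complete in
time" is reported differently from "completed and the assertion failed".

    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;
    import java.util.function.BooleanSupplier;

    public final class Await
    {
        private Await() {}

        // Polls the condition until it holds or the deadline passes.
        // A deadline overrun surfaces as TimeoutException, which is distinct
        // from the AssertionError a wrong result would produce.
        public static void until(BooleanSupplier condition, long timeout, TimeUnit unit)
            throws TimeoutException, InterruptedException
        {
            long deadlineNanos = System.nanoTime() + unit.toNanos(timeout);
            while (!condition.getAsBoolean())
            {
                if (System.nanoTime() > deadlineNanos)
                    throw new TimeoutException("condition not met within " + timeout + " " + unit);
                TimeUnit.MILLISECONDS.sleep(100);
            }
        }
    }

A test would then wait on the observable condition with a generous deadline
before asserting on results, so a slow environment shows up as a timeout
rather than as a misleading assertion failure.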

Having parity between CI systems is important, no matter how we approach it.

Re: Raise test timeouts?

Posted by Josh McKenzie <jm...@apache.org>.
Bringing discussion from JIRA (CASSANDRA-17729) to here:

Mick said:
> Agree with the notion that Jenkins (lower resources/more contention) is better at exposing flakies, but that there's a trade-off between encouraging flakies and creating difficult-to-deal-with noise.
I come back to the question: what minimum spec of hardware do we want to support for C*, and how can we best configure our CI infrastructure to be representative of that? Given the complexity and temporal relationships w/multiple actors in a distributed system, there's *always* going to be "defects" that show up if you sufficiently under-provision a host. That doesn't necessarily mean it's a user-facing bug that needs to be fixed.
What I mean by that specifically: if you under-provision a node with 2 cpus, 1.5 gigs of ram, slow disks, slow networking, and noisy neighbors, and the nodes take so long with GC pauses, compaction, streaming, etc that they don't correctly complete certain operations in expected time, completely time out, fall over, or otherwise *preserve correctness but die or don't complete operations in time* - is that a bug?

And if the angle is more "the test isn't deterministic and fails on under-provisioned hosts; there's a bug *in the test*", well, that's just our lives. We have a lot of technical debt in the form of brittle non-deterministic tests we'd have to target excising to get past this if we keep our container provisioning where it is.
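
To make "brittle non-deterministic test" concrete, a hypothetical
before/after sketch (neither snippet is taken from the tree); the first
version bakes wall-clock time into the test and flakes whenever the host is
slow, the second waits on the observable condition itself:

    import java.util.concurrent.TimeUnit;

    public class VisibilityExample
    {
        // Hypothetical observable; stands in for whatever the real test checks.
        static volatile int visibleRows = 0;

        // Brittle: assumes the work always lands within 2 seconds, which an
        // under-provisioned host with GC pauses will regularly violate.
        static void brittle() throws InterruptedException
        {
            Thread.sleep(2000);
            if (visibleRows != 10)
                throw new AssertionError("rows not visible after 2s");
        }

        // Sturdier: wait on the condition itself, up to a generous deadline,
        // so slowness only fails the test when something is genuinely stuck.
        static void sturdier() throws InterruptedException
        {
            long deadline = System.nanoTime() + TimeUnit.MINUTES.toNanos(2);
            while (visibleRows != 10)
            {
                if (System.nanoTime() > deadline)
                    throw new AssertionError("rows never became visible within 2 minutes");
                Thread.sleep(100);
            }
        }
    }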

If in the lead up to 4.0 we saw a sub 20% hit rate in product defects from flaky tests vs. test environment flakes alone, we have to consider how much effort from how many engineers it's taking in the run up to a release to hammer all these "flaky due to provisioning" tests back down vs. using other methodologies of testing to uncover correctness defects in timing, schema propagation, consistency level guarantees, etc.


On Wed, Jul 6, 2022, at 10:43 AM, Brandon Williams wrote:
> I suspect there's another problem with some of the Jenkins nodes where
> the system CPU usage is high and drives the load much higher than
> other nodes, possibly causing timeouts. However, the docker space
> issue needs to be resolved first since we don't have the capacity to
> experiment with those nodes out of commission.
> 
> On Tue, Jul 5, 2022 at 10:53 AM Josh McKenzie <jm...@apache.org> wrote:
> >
> > Another option would be to increase the resources dedicated to each agent container and run less in parallel. Or, best yet, do both (up timeouts and lower parallelization / up resources).
> >
> > As far as I can tell the failures on Jenkins aren't value-add compared to what we're seeing on circleci and are just generating busywork.
> >
> > There's a reasonable discussion to be had about "what's the smallest footprint of hardware we consider C* supported on" and targeting ASF CI to validate that. I believe the noisy env + low resources on ASF CI currently are lower than whatever floor we'd reasonably agree on.
> >
> > On Tue, Jul 5, 2022, at 12:47 AM, Berenguer Blasi wrote:
> >
> > Hi All,
> >
> > bringing https://issues.apache.org/jira/browse/CASSANDRA-17729 to the ML
> > for visibility as this has been a discussion point with some of you.
> >
> > I noticed tests time out much more on jenkins than on circle. I was
> > wondering if legit bugs were hiding behind those timeouts, and it might
> > be the case. Feel free to jump in on the ticket :-)
> >
> > Regards
> >
> >
> >
> 

Re: Raise test timeouts?

Posted by Brandon Williams <dr...@gmail.com>.
I suspect there's another problem with some of the Jenkins nodes where
the system CPU usage is high and drives the load much higher than
other nodes, possibly causing timeouts. However, the docker space
issue needs to be resolved first since we don't have the capacity to
experiment with those nodes out of commission.
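
Once nodes are available again, a hedged sketch of the kind of snapshot that
could be logged at suite start to correlate timeouts with overloaded agents
(hypothetical, not something currently in the build):

    import java.lang.management.ManagementFactory;
    import java.lang.management.OperatingSystemMXBean;

    public class LoadSnapshot
    {
        public static void log()
        {
            OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
            double load = os.getSystemLoadAverage(); // 1-minute host average, -1.0 if unavailable
            int cpus = os.getAvailableProcessors();  // CPUs visible to this JVM

            // A load far above the CPU count points at the agent, not the
            // test, as the reason things are timing out.
            System.out.printf("system load %.2f over %d visible CPUs%n", load, cpus);
        }
    }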

On Tue, Jul 5, 2022 at 10:53 AM Josh McKenzie <jm...@apache.org> wrote:
>
> Another option would be to increase the resources dedicated to each agent container and run less in parallel. Or, best yet, do both (up timeouts and lower parallelization / up resources).
>
> As far as I can tell the failures on Jenkins aren't value-add compared to what we're seeing on circleci and are just generating busywork.
>
> There's a reasonable discussion to be had about "what's the smallest footprint of hardware we consider C* supported on" and targeting ASF CI to validate that. I believe the noisy env + low resources on ASF CI currently are lower than whatever floor we'd reasonably agree on.
>
> On Tue, Jul 5, 2022, at 12:47 AM, Berenguer Blasi wrote:
>
> Hi All,
>
> bringing https://issues.apache.org/jira/browse/CASSANDRA-17729 to the ML
> for visibility as this has been a discussion point with some of you.
>
> I noticed tests time out much more on jenkins than on circle. I was
> wondering if legit bugs were hiding behind those timeouts, and it might
> be the case. Feel free to jump in on the ticket :-)
>
> Regards
>
>
>

Re: Raise test timeouts?

Posted by Josh McKenzie <jm...@apache.org>.
Another option would be to increase the resources dedicated to each agent container and run less in parallel. Or, best yet, do both (up timeouts and lower parallelization / up resources).

As far as I can tell the failures on Jenkins aren't value-add compared to what we're seeing on circleci and are just generating busywork.

There's a reasonable discussion to be had about "what's the smallest footprint of hardware we consider C* supported on" and targeting ASF CI to validate that. I believe the noisy env + low resources on ASF CI currently are lower than whatever floor we'd reasonably agree on.
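
For the "do both" option, an illustrative (purely hypothetical) sketch of
sizing the number of parallel runners from what the agent container actually
provides, rather than from a fixed count that assumes a bigger box:

    public class RunnerSizing
    {
        public static int runnerCount()
        {
            int cpus = Runtime.getRuntime().availableProcessors();
            long maxHeapMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
            long heapPerRunnerMb = 1024; // assumed per-runner heap budget

            int byCpu = Math.max(1, cpus / 2); // leave headroom for the nodes under test
            int byMemory = Math.max(1, (int) (maxHeapMb / heapPerRunnerMb));
            return Math.min(byCpu, byMemory);
        }
    }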

On Tue, Jul 5, 2022, at 12:47 AM, Berenguer Blasi wrote:
> Hi All,
> 
> bringing https://issues.apache.org/jira/browse/CASSANDRA-17729 to the ML 
> for visibility as this has been a discussion point with some of you.
> 
> I noticed tests time out much more on jenkins than on circle. I was
> wondering if legit bugs were hiding behind those timeouts, and it might
> be the case. Feel free to jump in on the ticket :-)
> 
> Regards
> 
>