
Posted to dev@cassandra.apache.org by Josh McKenzie <jm...@apache.org> on 2022/08/17 18:04:56 UTC

Cassandra project status update 2022-08-17

This update comes to you from day 5 of quarantining in the basement. Thanks Pandemic. (╯°□°)╯︵ ┻━┻

(Today we're going to test whether the ASF mailing lists allow a variety of ASCII characters! I almost hope, for everyone's sake, that they don't; I abuse these things. :))

Let's start with 4.1:
Latest run has 7 failures. If we dig a bit deeper into the detail panel (https://butler.cassandra.apache.org/#/ci/upstream/compare/Cassandra-4.1/cassandra-4.1), you can see that the CASTest failures in https://issues.apache.org/jira/browse/CASSANDRA-17461 are the long pole blocking the release. It looks like there are multiple folks working on that (thanks Brandon, Benedict, Andres, and Berenguer!), but it also looks like there's still no assignee, so we're maybe holding it at arm's length. Either that or we're just going to keep dogpiling on it, which is great too; I don't see it falling off the radar any time soon.

testAutoSnapshotTTlOnDropAfterRestart has failed a few times, so there's some legit flake there: https://ci-cassandra.apache.org/job/Cassandra-4.1/138/testReport/org.apache.cassandra.distributed.test/AutoSnapshotTtlTest/testAutoSnapshotTTlOnDropAfterRestart_2/. We haven't had a build lead lately, so we don't have a JIRA created for it or associated with it on the board (https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=496&quickFilter=2252); I may put that mantle back on in the near future.

There are 2 other failures that push us up to 7:
1) org.apache.cassandra.distributed.test.RepairTest.testForcedNormalRepairWithOneNodeDown (https://ci-cassandra.apache.org/job/Cassandra-4.1/138/testReport/org.apache.cassandra.distributed.test/RepairTest/testForcedNormalRepairWithOneNodeDown/). Looks like not all endpoints replied to the repair request so probably worth trying to repro locally and troubleshoot.

2) org.apache.cassandra.net.ProxyHandlerConnectionsTest.testExpireSome (https://ci-cassandra.apache.org/job/Cassandra-4.1/138/testReport/org.apache.cassandra.net/ProxyHandlerConnectionsTest/testExpireSome_2/). This is a timeout, so it's anyone's guess. :)

Taking a step back to look at the general CI health of 4.1, there's quite a bit of flake there: https://butler.cassandra.apache.org/#/ci/upstream/compare/Cassandra-4.1/cassandra-4.1. If we toss out build 122 as an anomaly, there are many tests that fail only once in the past 16 runs. For me this continues to highlight the tradeoff of re-running flakes vs. blocking on them. We've chatted a bit about this on other ML threads; while prepping this email I noticed that the high-level dashboard for our branches shows varied and pervasive flakiness, which is particularly challenging.

Getting to 0 flakes with a "run once 0 tolerance" policy with the current ASF CI infra (which is definitely being improved upon!) looks to be something of a Sisyphean task. 

We're down from 13 tickets blocking 4.1 beta to 7: https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=484&quickFilter=2455. As mentioned above, we have some test failures w/out tickets, so the real number is probably closer to the previous count.

We have one unassigned ticket blocking 4.1 if anyone wants to pick it up: https://issues.apache.org/jira/browse/CASSANDRA-17773 (Incorrect cassandra.logdir on Debian systems).


[New Contributors Getting Started]
Follow your curiosity! We have a small number of things that still need to be fixed blocking 4.1, but if you have something specific you're interested in and there's an open ticket in jira on Cassandra, feel free to ping in slack (see below) to see if there's any context you need to dive in and get yourself assigned to that ticket. Hit up the @cassandra_mentors alias to reach volunteers who are available to help you get situated and link up with you as a mentor.

To search JIRA for a topic of interest, replace "ReplaceTextHere" with the topic on the following JIRA search: https://issues.apache.org/jira/issues/?jql=project%20%3D%20cassandra%20AND%20resolution%20!%3D%20unresolved%20AND%20assignee%20is%20EMPTY%20AND%20summary%20~%20%27ReplaceTextHere%27%20ORDER%20BY%20priority%20ASC
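
For anyone who would rather script that substitution, here is a minimal Python sketch. The helper name `jira_search_url` is made up for illustration, the JQL text is just the decoded form of the link above (kept as written there), and the sketch assumes the topic contains no single quotes.

```python
from urllib.parse import quote

# Decoded form of the JQL in the search link above; {topic} marks the spot
# where "ReplaceTextHere" sits in the original URL.
JQL_TEMPLATE = (
    "project = cassandra AND resolution != unresolved "
    "AND assignee is EMPTY AND summary ~ '{topic}' "
    "ORDER BY priority ASC"
)


def jira_search_url(topic: str) -> str:
    """Hypothetical helper: build the JIRA search URL for a topic of interest."""
    jql = JQL_TEMPLATE.format(topic=topic)
    # Keep "!" unescaped so the result uses the same encoding as the link above.
    return "https://issues.apache.org/jira/issues/?jql=" + quote(jql, safe="!")


print(jira_search_url("compaction"))
```

Calling it with "ReplaceTextHere" should reproduce the link above character for character, which is an easy way to check the encoding before swapping in a real topic.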

To get situated, here's an explanation of various types of contribution: https://cassandra.apache.org/_/community.html#how-to-contribute
An overview of the C* architecture: https://cassandra.apache.org/doc/latest/cassandra/architecture/overview.html
And here's our getting started contributing guide: https://cassandra.apache.org/_/development/index.html
We hang out in #cassandra-dev on https://the-asf.slack.com so come join us.


[Dev list Digest]
https://lists.apache.org/list?dev@cassandra.apache.org:lte=2w:

We've had a fairly active couple of weeks. Caleb is shopping for feedback on what we do with hints during decommission: https://lists.apache.org/thread/0o2kd2hntbdjhpf8t1j9l9ys7k7y1wo5. See CASSANDRA-17808 for more details: https://issues.apache.org/jira/browse/CASSANDRA-17808

Claude brought up the state of our open pull requests (so many that are open and stale) and the optics and inclusivity of our current MO: https://lists.apache.org/thread/7r6wd2p8kyz0g7rw2mnlw411gdmymlld. I'll refrain from further editorializing here as I've shared my perspective on the proposal thread; thanks Claude for bringing that up!

Claude later brought up a formal proposal to add a pull request template to the project: https://lists.apache.org/thread/bwogjbpmwxd7qongq86lcv03ljqq83ps

Mick put forward the proposal to move our official debian and redhat repositories from downloads.apache.org to a redirect to apache.jfrog.io: https://lists.apache.org/thread/09kj80xld5dkt7cv73m6xs56lqh4jd18

In pursuit of CEP-15: Accord and multi-key transactions, Caleb is working on the syntax discussed in a previous thread: https://lists.apache.org/thread/6p2flc3ql14nkn76m3dp1cldmqx0kz96. See https://issues.apache.org/jira/browse/CASSANDRA-17719 as well for details. Patrick shared quite a few thoughts a few days ago; curious to see what others think.


[CI Trends]
https://butler.cassandra.apache.org/#/

Here are the trends on our branches for the last two weeks:

3.0: 14 -> 11
3.11: 17 -> 20
4.0: 6 -> 5
4.1: 4 -> 7
trunk: 7 -> 8
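
To make the direction of each branch easier to see, the per-branch change can be computed with a quick sketch like the one below. It assumes the numbers above are failing-test counts per branch (two weeks ago, then latest); the dictionary and function names are illustrative only.

```python
# Counts per branch from the list above: (two weeks ago, latest).
trends = {
    "3.0": (14, 11),
    "3.11": (17, 20),
    "4.0": (6, 5),
    "4.1": (4, 7),
    "trunk": (7, 8),
}


def deltas(counts):
    """Return {branch: after - before}; a negative value means fewer failures."""
    return {branch: after - before for branch, (before, after) in counts.items()}


for branch, change in deltas(trends).items():
    print(f"{branch}: {change:+d}")

# Totals across all branches, before and after.
total_before = sum(before for before, _ in trends.values())
total_after = sum(after for _, after in trends.values())
print(f"all branches: {total_before} -> {total_after}")
```

Under that reading, 3.0 and 4.0 improved while 3.11, 4.1, and trunk got worse, for a total of 48 -> 51.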

Most of the consistent failures on 3.0 don't have assignees yet but do have JIRA tickets: https://butler.cassandra.apache.org/#/ci/upstream/compare/Cassandra-3.0/cassandra-3.0 (Thanks Brandon for creating those JIRAs)


[Release progress]
https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=484&quickFilter=2175

Going to try a new "editorialized changes.txt" style format here:

4.1 beta: 11 issues
- Fixed documentation surrounding semantics of token ranges in nodetool compact (CASSANDRA-17575)
- A variety of flaky test failures
- Build and packaging fixes (CASSANDRA-17766, CASSANDRA-17765)
- Fixing clientInitialization setting the failure detector (CASSANDRA-17782)
- Fixing BulkLoader initializing schema via streaming (CASSANDRA-17740)

4.X / Next: 17 issues
- New Guardrail for column sizes added (CASSANDRA-17151)
- A fix to add an additional check that a node being replaced is reported as live so we don't fail incorrectly (CASSANDRA-17805)
- Added the ability to do a one-time heap dump to a file on an Unhandled Exception, configurable by file or JMX (CASSANDRA-17795)
- Added a separate thread pool that handles high cost auth responses so new client connections don't overwhelm bcrypt (CASSANDRA-17812)
- A pretty significant improvement in DataOutputBuffer's memory usage and GC pressure (CASSANDRA-16471)
- Added the ability to read the TTL and WRITETIME of an element in a collection (CASSANDRA-8877)
- Some cleaning up of python linting and legacy code fragments (CASSANDRA-17694, CASSANDRA-17779)
- A bug causing an NPE during streaming fixed (CASSANDRA-17801)
- Upstream CEP-15 work, using a seeded crc for PaxosBallotTracker checksum (CASSANDRA-17793)
- UUID for tracking nodetool import logging (CASSANDRA-17800)
- Lack of JNA not exploding things when running as a client (CASSANDRA-17794)
- Skipping the node kill on the startup check for unknown things in system, since that can happen if you upgrade from older versions of C* (CASSANDRA-17777)
- Logging duplicate keys if they show up during verify (CASSANDRA-17789)
- Breaking out secondary index building to its own thread pool so it doesn't block compaction (CASSANDRA-17781)

So to sum it up:
- CASTest continues to be the biggest block on 4.1: https://issues.apache.org/jira/browse/CASSANDRA-17461 but folks are working on it
- The lack of a consistent build lead means our Kanban board tracking test fixes is drifting further from the state of CI
- It's pretty expensive and painful to defer cleaning up CI to the end of the release cycle

Keep fighting the good fight!


~Josh

Re: Cassandra project status update 2022-08-17

Posted by Ekaterina Dimitrova <e....@gmail.com>.

One correction: testAutoSnapshotTTlOnDropAfterRestart already has a ticket
in review; it just wasn't linked in Butler. I will link it now. Thanks
Paulo and Caleb for looking into it.


Re: Cassandra project status update 2022-08-17

Posted by Josh McKenzie <jm...@apache.org>.

A few thoughts:
 1. If we gate the classification behind each test failure being root-caused, we consistently need people who are dedicating their time to doing that or we end up with a backlog of unclassified CI failures (like we have now and have always had historically).
 2. Newcomers to the project or folks who haven't worked in the CI space that want to pitch in to push a release across the line right now don't have any guidance as to how to classify things.
 3. Unless that classification is more rigorous and we share a definition of what we're considering "flaky", rules like "no non-flaky test failures" don't actually mean the same thing to everyone on the project.
Also, one other data point in favor of a simple frequency heuristic is that if these tests are flaking on ci-cassandra but not flaking on circleci, that's more evidence that they're *likely* test environment + authoring failures rather than product failures.
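
As a rough illustration of that cross-CI comparison, here is a small Python sketch. The function name and the test names in the usage lines are hypothetical, and the result is only a hint about where to look first, not proof that a failure is environmental.

```python
from typing import Set


def likely_environmental(flaky_on_ci_cassandra: Set[str],
                         flaky_on_circleci: Set[str]) -> Set[str]:
    """Tests that flake on ci-cassandra but not on circleci.

    Per the reasoning above, these are *likely* test environment or test
    authoring problems rather than product failures; each one still needs
    a human look before it is written off.
    """
    return flaky_on_ci_cassandra - flaky_on_circleci


# Hypothetical inputs, for illustration only.
jenkins_flakes = {"FooTest.testA", "BarTest.testB", "BazTest.testC"}
circle_flakes = {"BarTest.testB"}
print(sorted(likely_environmental(jenkins_flakes, circle_flakes)))
```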

On Thu, Aug 18, 2022, at 10:36 AM, Brandon Williams wrote:
> > I think a simple metric for "is something flaky" is "does it only fail once in the butler history (of 15 or so builds)".
> 
> Does that make it considered flaky?  What if the one failure is a
> timeout?  I think each failing case has to have the failures
> investigated in order to know.
> 
> Kind Regards,
> Brandon
> 
> On Thu, Aug 18, 2022 at 9:31 AM Josh McKenzie <jm...@apache.org> wrote:
> >
> > So move to beta when:
> >
> > all non-flaky test *failures* (NOT tickets, see below) are resolved
> > We get a green ci-cassandra run
> >
> > Move to rc when:
> >
> > Three consecutive green runs in ci-cassandra
> >
> > Release when:
> >
> > All rc tickets are closed
> > Some time-based gate maybe?
> > Three more consecutive green ci-cassandra runs?
> >
> >
> > We don't have people volunteering for the build lead role so we don't consistently have tickets created for flaky or non-flaky test failures, thus we can't use that as a gatekeeper IMO as it's non-deterministic. Using "no non-flaky failures in butler (i.e. ci-cassandra + history analysis)" should shore that up. We also need a more rigorous designation for flaky vs. non-flaky in our tickets outside an informal practice of adding that to the Summary.
> >
> > I think a simple metric for "is something flaky" is "does it only fail once in the butler history (of 15 or so builds)".
> >
> > We can then filter out our kanban to reflect that as well (flaky tests to their own swimlane as they're "iffy" as RC blockers; it'd technically be a roll of the dice as to whether any flake on the 3 consecutive runs we need to get green to release... which I don't love ;) ).
> >
> > We did something similar last time, this would be the same exception to the rules, rules we continue to get closer to.
> >
> > If we did something similar last time and this is the same exception to the rules, I don't think we're getting closer to satisfying those rules are we? i.e. I think we should consider revising the rules formally to match the above metrics that are a little fuzzier and more tolerant to the current (and richly historical!) reality of our CI environment.
> >
> > Would save us a lot of back and forth on subsequent releases. :)
> >
> > ~Josh
> >
> > On Thu, Aug 18, 2022, at 1:24 AM, Berenguer Blasi wrote:
> >
> > +1 to Mick's points.
> >
> > Also notice in circle 4.1 green runs are the norm lately imo. Yes it's not the official CI but it helps build an overall picture of improvement towards green CI. On jenkins, if you check the latest 4.1 runs, <5-ish failures per run are starting to be common and those that don't are known failures being worked on (CAS i.e.), infra or flakies taking you back to the <5-ish failures. So overall, if I am not missing anything, the signal among the infra and flaky noise is pretty good.
> >
> > Regards
> >
> > On 17/8/22 22:50, Ekaterina Dimitrova wrote:
> >
> > +1, I second Mick on both points.
> >
> > On Wed, 17 Aug 2022 at 16:23, Mick Semb Wever <mc...@apache.org> wrote:
> >
> > We're down from 13 tickets blocking 4.1 beta down to 7: https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=484&quickFilter=2455. As mentioned above, we have some test failures w/out tickets so that 7 is probably closer realistically to the previous count.
> >
> >
> >
> > I suggest we move to beta when all non-flaky-test tickets are resolved and we get our first green ci-cassandra run.
> > And I suggest we move to rc when we get three consecutive green runs.
> >
> > We did something similar last time, this would be the same exception to the rules, rules we continue to get closer to.
> >
> > An alternative is to replace "green" with "builds with only non-regression and infra-caused failures".
> >
> >
> >
> > - It's pretty expensive and painful to defer cleaning up CI to the end of the release cycle
> >
> >
> >
> > This^
> >
> >
>

Re: Cassandra project status update 2022-08-17

Posted by Brandon Williams <dr...@gmail.com>.

> I think a simple metric for "is something flaky" is "does it only fail once in the butler history (of 15 or so builds)".

Does that make it considered flaky?  What if the one failure is a
timeout?  I think each failing case has to have the failures
investigated in order to know.

Kind Regards,
Brandon

On Thu, Aug 18, 2022 at 9:31 AM Josh McKenzie <jm...@apache.org> wrote:
>
> So move to beta when:
>
> all non-flaky test *failures* (NOT tickets, see below) are resolved
> We get a green ci-cassandra run
>
> Move to rc when:
>
> Three consecutive green runs in ci-cassandra
>
> Release when:
>
> All rc tickets are closed
> Some time-based gate maybe?
> Three more consecutive green ci-cassandra runs?
>
>
> We don't have people volunteering for the build lead role so we don't consistently have tickets created for flaky or non-flaky test failures, thus we can't use that as a gatekeeper IMO as it's non-deterministic. Using "no non-flaky failures in butler (i.e. ci-cassandra + history analysis)" should shore that up. We also need a more rigorous designation for flaky vs. non-flaky in our tickets outside an informal practice of adding that to the Summary.
>
> I think a simple metric for "is something flaky" is "does it only fail once in the butler history (of 15 or so builds)".
>
> We can then filter out our kanban to reflect that as well (flaky tests to their own swimlane as they're "iffy" as RC blockers; it'd technically be a roll of the dice as to whether any flake on the 3 consecutive runs we need to get green to release... which I don't love ;) ).
>
> We did something similar last time, this would be the same exception to the rules, rules we continue to get closer to.
>
> If we did something similar last time and this is the same exception to the rules, I don't think we're getting closer to satisfying those rules are we? i.e. I think we should consider revising the rules formally to match the above metrics that are a little fuzzier and more tolerant to the current (and richly historical!) reality of our CI environment.
>
> Would save us a lot of back and forth on subsequent releases. :)
>
> ~Josh
>
> On Thu, Aug 18, 2022, at 1:24 AM, Berenguer Blasi wrote:
>
> +1 to Mick's points.
>
> Also notice in circle 4.1 green runs are the norm lately imo. Yes it's not the official CI but it helps build an overall picture of improvement towards green CI. On jenkins, if you check the latest 4.1 runs, <5-ish failures per run are starting to be common and those that don't are known failures being worked on (CAS i.e.), infra or flakies taking you back to the <5-ish failures. So overall, if I am not missing anything, the signal among the infra and flaky noise is pretty good.
>
> Regards
>
> On 17/8/22 22:50, Ekaterina Dimitrova wrote:
>
> +1, I second Mick on both points.
>
> On Wed, 17 Aug 2022 at 16:23, Mick Semb Wever <mc...@apache.org> wrote:
>
> We're down from 13 tickets blocking 4.1 beta down to 7: https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=484&quickFilter=2455. As mentioned above, we have some test failures w/out tickets so that 7 is probably closer realistically to the previous count.
>
>
>
> I suggest we move to beta when all non-flaky-test tickets are resolved and we get our first green ci-cassandra run.
> And I suggest we move to rc when we get three consecutive green runs.
>
> We did something similar last time, this would be the same exception to the rules, rules we continue to get closer to.
>
> An alternative is to replace "green" with "builds with only non-regression and infra-caused failures".
>
>
>
> - It's pretty expensive and painful to defer cleaning up CI to the end of the release cycle
>
>
>
> This^
>
>

Re: Cassandra project status update 2022-08-17

Posted by Josh McKenzie <jm...@apache.org>.

So move to beta when:
 1. all non-flaky test *failures* (NOT tickets, see below) are resolved
 2. We get a green ci-cassandra run
Move to rc when:
 1. Three consecutive green runs in ci-cassandra
Release when:
 1. All rc tickets are closed
 2. Some time-based gate maybe?
 3. Three more consecutive green ci-cassandra runs?
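
The beta and rc gates above can be sketched as simple checks over the run history. This is only an illustration: the function names are made up, "a green run" is read here as "the latest run is green" (an assumption), and the release gate is left out because its items are still open questions.

```python
from typing import List


def consecutive_green(runs: List[bool]) -> int:
    """Count green runs at the end of the history (True = green, newest last)."""
    count = 0
    for green in reversed(runs):
        if not green:
            break
        count += 1
    return count


def ready_for_beta(non_flaky_failures: int, runs: List[bool]) -> bool:
    # Beta: no unresolved non-flaky failures, and the latest run is green.
    return non_flaky_failures == 0 and consecutive_green(runs) >= 1


def ready_for_rc(runs: List[bool]) -> bool:
    # RC: three consecutive green runs.
    return consecutive_green(runs) >= 3
```

Modeling the history as a plain list of booleans keeps the sketch independent of how ci-cassandra or butler actually expose their results.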

We don't have people volunteering for the build lead role so we don't consistently have tickets created for flaky or non-flaky test failures, thus we can't use that as a gatekeeper IMO as it's non-deterministic. Using "no non-flaky failures in butler (i.e. ci-cassandra + history analysis)" should shore that up. We also need a more rigorous designation for flaky vs. non-flaky in our tickets outside an informal practice of adding that to the Summary.

I think a simple metric for "is something flaky" is "does it only fail once in the butler history (of 15 or so builds)".
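
A minimal sketch of that frequency heuristic, assuming we already have per-test pass/fail results for recent builds in some form. The `classify` function, the shape of `history`, and the test names are all hypothetical; nothing here reflects how butler stores or exposes its data.

```python
from typing import Dict, List


def classify(history: Dict[str, List[bool]], window: int = 15) -> Dict[str, str]:
    """Label each test by how often it failed in the last `window` builds.

    `history` maps a test name to per-build results, oldest build first,
    where True means the test failed in that build.
    """
    labels = {}
    for test, results in history.items():
        failures = sum(results[-window:])
        if failures == 0:
            labels[test] = "passing"
        elif failures == 1:
            labels[test] = "flaky"      # failed exactly once in the window
        else:
            labels[test] = "non-flaky"  # failed repeatedly in the window
    return labels


# Hypothetical history, for illustration only.
history = {
    "FooTest.testA": [False] * 14 + [True],   # one failure
    "BarTest.testB": [True, True] + [False] * 13,  # two failures
    "BazTest.testC": [False] * 15,            # no failures
}
print(classify(history))
```

The label is purely frequency-based, so a single failure that is really a product bug (a one-off timeout, say) would still come out as "flaky"; it is a triage shortcut, not a root-cause analysis.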

We can then filter out our kanban to reflect that as well (flaky tests to their own swimlane as they're "iffy" as RC blockers; it'd technically be a roll of the dice as to whether any flake on the 3 consecutive runs we need to get green to release... which I don't love ;) ).

> We did something similar last time, this would be the same exception to the rules, rules we continue to get closer to.
If we did something similar last time and this is the same exception to the rules, I don't think we're getting closer to satisfying those rules, are we? I.e., I think we should consider formally revising the rules to match the above metrics, which are a little fuzzier and more tolerant of the current (and richly historical!) reality of our CI environment.

Would save us a lot of back and forth on subsequent releases. :)

~Josh

On Thu, Aug 18, 2022, at 1:24 AM, Berenguer Blasi wrote:
> +1 to Mick's points.
> 
> Also notice in circle 4.1 green runs are the norm lately imo. Yes it's not the official CI but it helps build an overall picture of improvement towards green CI. On jenkins, if you check the latest 4.1 runs, <5-ish failures per run are starting to be common and those that don't are known failures being worked on (CAS i.e.), infra or flakies taking you back to the <5-ish failures. So overall, if I am not missing anything, the signal among the infra and flaky noise is pretty good.
> 
> Regards
> 
> On 17/8/22 22:50, Ekaterina Dimitrova wrote:
>> +1, I second Mick on both points. 
>> 
>> On Wed, 17 Aug 2022 at 16:23, Mick Semb Wever <mc...@apache.org> wrote:
>>>> We're down from 13 tickets blocking 4.1 beta down to 7: https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=484&quickFilter=2455. As mentioned above, we have some test failures w/out tickets so that 7 is probably closer realistically to the previous count.
>>> 
>>> 
>>> I suggest we move to beta when all non-flaky-test tickets are resolved and we get our first green ci-cassandra run. 
>>> And I suggest we move to rc when we get three consecutive green runs.
>>> 
>>> We did something similar last time, this would be the same exception to the rules, rules we continue to get closer to.
>>> 
>>> An alternative is to replace "green" with "builds with only non-regression and infra-caused failures".
>>> 
>>>  
>>>> - It's pretty expensive and painful to defer cleaning up CI to the end of the release cycle
>>> 
>>> 
>>> This^

Re: Cassandra project status update 2022-08-17

Posted by Berenguer Blasi <be...@gmail.com>.

+1 to Mick's points.

Also notice that in Circle, green 4.1 runs are the norm lately imo. Yes, it's
not the official CI, but it helps build an overall picture of improvement
towards green CI. On Jenkins, if you check the latest 4.1 runs, <5-ish
failures per run are starting to be common, and the runs that exceed that
have known failures being worked on (e.g. CAS), infra issues, or flakies;
discounting those takes you back to the <5-ish failures. So overall, if I am
not missing anything, the signal among the infra and flaky noise is pretty good.

Regards

On 17/8/22 22:50, Ekaterina Dimitrova wrote:
> +1, I second Mick on both points.
>
> On Wed, 17 Aug 2022 at 16:23, Mick Semb Wever <mc...@apache.org> wrote:
>
>         We're down from 13 tickets blocking 4.1 beta down to 7:
>         https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=484&quickFilter=2455
>         <https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=484&quickFilter=2455>.
>         As mentioned above, we have some test failures w/out tickets
>         so that 7 is probably closer realistically to the previous count.
>
>
>
>     I suggest we move to beta when all non-flaky-test tickets are
>     resolved and we get our first green ci-cassandra run.
>     And I suggest we move to rc when we get three consecutive green runs.
>
>     We did something similar last time, this would be the same
>     exception to the rules, rules we continue to get closer to.
>
>     An alternative is to replace "green" with "builds with only
>     non-regression and infra-caused failures".
>
>         - It's pretty expensive and painful to defer cleaning up CI to
>         the end of the release cycle
>
>
>
>     This^
>

Re: Cassandra project status update 2022-08-17

Posted by Ekaterina Dimitrova <e....@gmail.com>.

+1, I second Mick on both points.

On Wed, 17 Aug 2022 at 16:23, Mick Semb Wever <mc...@apache.org> wrote:

> We're down from 13 tickets blocking 4.1 beta down to 7:
>> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=484&quickFilter=2455.
>> As mentioned above, we have some test failures w/out tickets so that 7 is
>> probably closer realistically to the previous count.
>>
>
>
> I suggest we move to beta when all non-flaky-test tickets are resolved and
> we get our first green ci-cassandra run.
> And I suggest we move to rc when we get three consecutive green runs.
>
> We did something similar last time, this would be the same exception to
> the rules, rules we continue to get closer to.
>
> An alternative is to replace "green" with "builds with only non-regression
> and infra-caused failures".
>
>
>
>> - It's pretty expensive and painful to defer cleaning up CI to the end of
>> the release cycle
>>
>
>
> This^
>

Re: Cassandra project status update 2022-08-17

Posted by Paulo Motta <pa...@gmail.com>.

Please disregard my previous message, I missed Ekaterina's message :)

On Wed, 17 Aug 2022 at 17:35, Paulo Motta <pa...@gmail.com> wrote:

> > testAutoSnapshotTTIOnDropAfterRestart has failed a few times so there's
> some legit flake there:
> https://ci-cassandra.apache.org/job/Cassandra-4.1/138/testReport/org.apache.cassandra.distributed.test/AutoSnapshotTtlTest/testAutoSnapshotTTlOnDropAfterRestart_2/.
> No build lead lately so we don't have a JIRA for it or associated with it (
> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=496&quickFilter=2252);
> I may put that mantle back on in the near future.
>
> There's https://issues.apache.org/jira/browse/CASSANDRA-17804, for some
> reason it's not showing up in the kanban board..
>
> On Wed, 17 Aug 2022 at 17:24, Mick Semb Wever <mc...@apache.org> wrote:
>
>> We're down from 13 tickets blocking 4.1 beta down to 7:
>>> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=484&quickFilter=2455.
>>> As mentioned above, we have some test failures w/out tickets so that 7 is
>>> probably closer realistically to the previous count.
>>>
>>
>>
>> I suggest we move to beta when all non-flaky-test tickets are resolved
>> and we get our first green ci-cassandra run.
>> And I suggest we move to rc when we get three consecutive green runs.
>>
>> We did something similar last time, this would be the same exception to
>> the rules, rules we continue to get closer to.
>>
>> An alternative is to replace "green" with "builds with only
>> non-regression and infra-caused failures".
>>
>>
>>
>>> - It's pretty expensive and painful to defer cleaning up CI to the end
>>> of the release cycle
>>>
>>
>>
>> This^
>>
>

Re: Cassandra project status update 2022-08-17

Posted by Paulo Motta <pa...@gmail.com>.

> testAutoSnapshotTTIOnDropAfterRestart has failed a few times so there's
> some legit flake there:
> https://ci-cassandra.apache.org/job/Cassandra-4.1/138/testReport/org.apache.cassandra.distributed.test/AutoSnapshotTtlTest/testAutoSnapshotTTlOnDropAfterRestart_2/.
> No build lead lately so we don't have a JIRA for it or associated with it (
> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=496&quickFilter=2252);
> I may put that mantle back on in the near future.

There's https://issues.apache.org/jira/browse/CASSANDRA-17804; for some
reason it's not showing up on the kanban board.

On Wed, 17 Aug 2022 at 17:24, Mick Semb Wever <mc...@apache.org> wrote:

> We're down from 13 tickets blocking 4.1 beta down to 7:
>> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=484&quickFilter=2455.
>> As mentioned above, we have some test failures w/out tickets so that 7 is
>> probably closer realistically to the previous count.
>>
>
>
> I suggest we move to beta when all non-flaky-test tickets are resolved and
> we get our first green ci-cassandra run.
> And I suggest we move to rc when we get three consecutive green runs.
>
> We did something similar last time, this would be the same exception to
> the rules, rules we continue to get closer to.
>
> An alternative is to replace "green" with "builds with only non-regression
> and infra-caused failures".
>
>
>
>> - It's pretty expensive and painful to defer cleaning up CI to the end of
>> the release cycle
>>
>
>
> This^
>

Re: Cassandra project status update 2022-08-17

Posted by Mick Semb Wever <mc...@apache.org>.

>
> We're down from 13 tickets blocking 4.1 beta down to 7:
> https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=484&quickFilter=2455.
> As mentioned above, we have some test failures w/out tickets so that 7 is
> probably closer realistically to the previous count.
>


I suggest we move to beta when all non-flaky-test tickets are resolved and
we get our first green ci-cassandra run.
And I suggest we move to rc when we get three consecutive green runs.

We did something similar last time; this would be the same exception to the
rules, rules we continue to get closer to.

An alternative is to replace "green" with "builds with only non-regression
and infra-caused failures".
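
As a rough sketch only (this is not an existing project script), the criteria
above could be checked mechanically along these lines. The `Run` type, its
field names, and the choice to look at the most recent three runs for the rc
check are all assumptions made for illustration; `relaxed=True` models the
alternative definition of "green".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Run:
    # Hypothetical summary of one ci-cassandra run; field names are assumptions.
    regressions: int = 0      # failures that are new regressions
    infra_failures: int = 0   # failures caused by CI infrastructure
    other_failures: int = 0   # other non-regression failures (e.g. known flaky tests)


def is_green(run: Run, relaxed: bool = False) -> bool:
    """Strict: no failures at all. Relaxed: only non-regression and infra-caused failures."""
    if relaxed:
        return run.regressions == 0
    return run.regressions + run.infra_failures + run.other_failures == 0


def ready_for_beta(runs: List[Run], open_non_flaky_tickets: int, relaxed: bool = False) -> bool:
    # Beta: all non-flaky-test tickets resolved and at least one green run so far.
    return open_non_flaky_tickets == 0 and any(is_green(r, relaxed) for r in runs)


def ready_for_rc(runs: List[Run], relaxed: bool = False, needed: int = 3) -> bool:
    # RC: the most recent `needed` runs are all green (runs ordered oldest to newest).
    return len(runs) >= needed and all(is_green(r, relaxed) for r in runs[-needed:])
```

For example, with a history of one run with two flaky failures followed by
three clean runs, `ready_for_rc` returns True under the strict definition,
while the first three runs of that history alone would not qualify.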



> - It's pretty expensive and painful to defer cleaning up CI to the end of
> the release cycle
>


This^