Posted to dev@cassandra.apache.org by Jordan West <jo...@gmail.com> on 2019/04/10 15:25:57 UTC

Cassandra 4.0 Quality and Stability Update

In September, the community chose to freeze trunk to begin working on
Quality and Stability with the goal of releasing the most stable Cassandra
major in the project’s history. While lots of work has been ongoing and
folks could follow along with progress on JIRA, I thought it would be useful
to cover what has been accomplished so far, since I’ve spent a good amount
of time working with others on various testing projects.

During this time we have made significant progress on improving the Quality
and Stability of Cassandra — not only Cassandra 4.0 but also the Cassandra
3.x series and future Cassandra releases. Additionally, testing has
provided the opportunity for new community members and committers to
contribute. While this count is not comprehensive, the community has found
at least 25 bugs that can be classified as Data Loss, Corruption, Incorrect
Response, Loss of Stability, Loss of Availability, Concurrency Issues,
Performance Issues, or Lack of Safety. These bugs have been found by a
variety of methodologies including commonly used ones like unit testing and
canary deployments. However, the majority of the bugs have been found or
confirmed using new methodologies like the ones described in some recent
blog posts [1] [2].

Additionally, the state of the test suites and test tooling has improved.
CASSANDRA-14806 [3] brought some much-welcomed improvements to the CircleCI
workflow and made it easier for people to run (d)tests on supported
platforms (jdk8/11), and the work to get upgrade tests running found several
bugs including CASSANDRA-14958 [4].

While we have made significant progress, there is still more to do before we
can be truly confident in a Cassandra 4.0 release. Some ongoing and
outstanding work includes:

* Improving the state of the cqlsh tests [5]
* There is ongoing discussion on the new MessagingService [6] which will
require significant review and testing
* Additional upgrade testing for Cassandra 4.0 including additional support
for upgrade testing using in-jvm dtests [7]
* Work to increase coverage of important areas and new features in
Cassandra 4.0 [8]

While the list above may seem short, the last item contains a long list of
important areas the community has previously discussed adding coverage to.
If you are looking for areas to contribute, this is a great starting point.
If a name is already listed against an area you are interested in, I would
encourage you to reach out to that person to discuss how you can help
further increase the community’s confidence in the Quality and Stability of
Cassandra.

Below is an incomplete list of many of the severe bugs found during this
part of the release cycle. Thanks again to all of the community members who
contributed to finding these bugs and improving Cassandra for everyone.

CASSANDRA-15004: Anti-compaction briefly removes sstables from the read path
CASSANDRA-14958: Counters fail to increment on 2.X to 3.X mixed version
clusters
CASSANDRA-14936: Anticompaction should throw exceptions on errors, not just
log them
CASSANDRA-14672: After deleting data in 3.11.3, reads fail: "open marker
and close marker have different deletion times"
CASSANDRA-14912: LegacyLayout errors on collection tombstones from dropped
columns
CASSANDRA-14843: Drop/add column name with different Kind can result in
corruption
CASSANDRA-14568: CorruptSSTableExceptions in 3.0.17.1 (CASSANDRA-14568 v2)
Static collection deletions are corrupted in 3.0 <-> 2.{1,2} messages
CASSANDRA-14749: Collection Deletions for Dropped Columns in 2.1/3.0
mixed-mode can delete rows
CASSANDRA-14568: Static collection deletions are corrupted in 3.0 ->
2.{1,2} messages
CASSANDRA-14861: Inaccurate sstable min/max metadata can cause data loss
CASSANDRA-14823: Legacy sstables with range tombstones spanning multiple
index blocks create invalid bound sequences on 3.0+ (#1193)
CASSANDRA-14873: Missing rows when reading 2.1 SSTables in 3.0
CASSANDRA-14838: Dropped columns can cause reverse sstable iteration to
return prematurely
CASSANDRA-14803: Rows that cross index block boundaries can cause
incomplete reverse reads in some cases.
CASSANDRA-14766: DESC order reads can fail to return the last Unfiltered in
the partition (#1170)
CASSANDRA-14991: SSL Cert Hot Reloading should defensively check for sanity
of the new keystore/truststore before loading it
CASSANDRA-14794: Avoid calling iter.next() in a loop when notifying
indexers about range tombstones
CASSANDRA-14780: Avoid creating empty compaction tasks after truncate
CASSANDRA-14657: Handle failures in upgradesstables/cleanup/relocate
CASSANDRA-14638: Column result order can change in 'SELECT *' results when
upgrading from 2.1 to 3.0 causing response corruption for queries using
prepared statements when static columns are used
CASSANDRA-14919: Regression in paging queries in mixed version clusters
CASSANDRA-14554: LifecycleTransaction encounters
ConcurrentModificationException when used in multi-threaded context
CASSANDRA-14935: PendingAntiCompaction should be more judicious in the
compactions it cancels
CASSANDRA-14894: RangeTombstoneList doesn't properly clean up mergeable or
superseded rts in some cases
CASSANDRA-14824: Expand range tombstone validation checks to multiple
interim request stages
CASSANDRA-14763: Fail incremental repair prepare phase if it encounters
sstables from un-finalized sessions
CASSANDRA-14920: Some comparisons used for verifying paging queries in
dtests only test the column names and not values

Jordan

[1]
http://cassandra.apache.org/blog/2018/08/21/testing_apache_cassandra.html
[2]
http://cassandra.apache.org/blog/2018/10/17/finding_bugs_with_property_based_testing.html
[3] https://issues.apache.org/jira/browse/CASSANDRA-14806
[4] https://issues.apache.org/jira/browse/CASSANDRA-14958
[5] https://issues.apache.org/jira/browse/CASSANDRA-14951
[6] https://issues.apache.org/jira/browse/CASSANDRA-15066
[7] https://issues.apache.org/jira/browse/CASSANDRA-15078
[8]
https://cwiki.apache.org/confluence/display/CASSANDRA/4.0+Quality%3A+Components+and+Test+Plans

Re: Cassandra 4.0 Quality and Stability Update

Posted by Jordan West <jo...@gmail.com>.
Hi Dinesh,

Great question! Unfortunately, I don’t have a great definition of what
“alpha” means in the Cassandra community, so it’s hard for me to answer that
directly. However, I am of the opinion that we are not yet at the point of
being able to branch trunk — we are finding too many bugs at too quick a
pace still and have yet to make enough significant progress on the test
plan [1] previously linked. I do think it would be beneficial to cut an
official build (maybe after internode messaging settles down) as a preview
for the community and to make it easier for folks to run on dev/test
hardware. In the Riak community we call these “pre” builds (Riak 2.0.0preX)
and they were nothing more than a stable place on trunk released
periodically until we reached a point where we branched.

Regarding metrics, the first major step towards that was Benedict’s and
others’ work (thanks all!) to re-organize JIRA. We now have a better set of
inputs to automatically build reports around release quality metrics, etc.
We have yet to take this and turn it into JIRA reports but I am working
with Scott Andreas on it — I don’t have a timeframe just yet but I hope
soon. If you would like to help please let me know.

In the meantime, Scott and I have kept a list which is where the data I
used came from. We absolutely need to make this public and the efforts
mentioned above will accomplish that.

Jordan

[1]
https://cwiki.apache.org/confluence/display/CASSANDRA/4.0+Quality%3A+Components+and+Test+Plans

On Thu, Apr 11, 2019 at 4:21 PM Dinesh Joshi <dj...@apache.org> wrote:

> Hey Jordan,
>
> Thanks for the update! Do you have a sense of where we are in terms of
> stability and where we need to be in order to cut an alpha? I also
> remember a discussion on measuring release quality[1]. Not sure where we
> landed on it. Any idea how we are doing on that front?
>
> Thanks,
>
> Dinesh
>
> [1]
> https://lists.apache.org/thread.html/3a444be1a3097c0c55d15268ccb0fe7aab83ef276b87bf55bf4f3bc2@%3Cdev.cassandra.apache.org%3E

Re: Cassandra 4.0 Quality and Stability Update

Posted by Dinesh Joshi <dj...@apache.org>.
Hey Jordan, 

Thanks for the update! Do you have a sense of where we are in terms of stability and where we need to be in order to cut an alpha? I also remember a discussion on measuring release quality[1]. Not sure where we landed on it. Any idea how we are doing on that front?

Thanks, 

Dinesh

[1] https://lists.apache.org/thread.html/3a444be1a3097c0c55d15268ccb0fe7aab83ef276b87bf55bf4f3bc2@%3Cdev.cassandra.apache.org%3E



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@cassandra.apache.org
For additional commands, e-mail: dev-help@cassandra.apache.org