Posted to dev@cassandra.apache.org by "benedict@apache.org" <be...@apache.org> on 2021/09/05 13:54:13 UTC

[DISCUSS] CEP-15: General Purpose Transactions

Wiki: https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-15%3A+General+Purpose+Transactions
Whitepaper: https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf
Prototype: https://github.com/belliottsmith/accord

Hi everyone, I’d like to propose this CEP for adoption by the community.

Cassandra has benefitted from LWTs for many years, but application developers that want to ensure consistency for complex operations must either accept the scalability bottleneck of serializing all related state through a single partition, or layer a complex state machine on top of the database. These are sophisticated and costly activities that our users should not be expected to undertake. Since distributed databases are beginning to offer distributed transactions with fewer caveats, it is past time for Cassandra to do so as well.

This CEP proposes the use of several novel techniques that build upon research (that followed EPaxos) to deliver (non-interactive) general purpose distributed transactions. The approach is outlined in the wikipage and in more detail in the linked whitepaper. Importantly, by adopting this approach we will be the _only_ distributed database to offer global, scalable, strict serializable transactions in one wide area round-trip. This would represent a significant improvement in the state of the art, both in the academic literature and in commercial or open source offerings.
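
To make "non-interactive" concrete, here is a minimal sketch, assuming a toy single-process store rather than anything in the Accord prototype. The names OneShotTxn, ToyStore and transfer are hypothetical and are not part of the proposal or of any Cassandra API. The point it illustrates is only that the keys to read and the logic deriving the writes are declared up front and submitted in one request, with no client round-trips while the transaction is in flight.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet


@dataclass(frozen=True)
class OneShotTxn:
    """Hypothetical: a transaction declared in full before it is submitted."""
    read_keys: FrozenSet[str]
    # Pure function from the values read to the writes to apply.
    compute_writes: Callable[[Dict[str, int]], Dict[str, int]]


class ToyStore:
    """Single-process stand-in for a cluster; nothing here is distributed."""

    def __init__(self, data: Dict[str, int]):
        self.data = dict(data)

    def execute(self, txn: OneShotTxn) -> Dict[str, int]:
        # Read every declared key, then apply all resulting writes together.
        snapshot = {k: self.data.get(k, 0) for k in txn.read_keys}
        self.data.update(txn.compute_writes(snapshot))
        return snapshot


def transfer(src: str, dst: str, amount: int) -> OneShotTxn:
    """Build a conditional multi-key update as a single declared transaction."""
    def compute(values: Dict[str, int]) -> Dict[str, int]:
        if values[src] < amount:
            return {}  # condition failed: apply no writes
        return {src: values[src] - amount, dst: values[dst] + amount}
    return OneShotTxn(frozenset({src, dst}), compute)


store = ToyStore({"alice": 100, "bob": 20})
store.execute(transfer("alice", "bob", 30))   # applies both writes
store.execute(transfer("bob", "alice", 500))  # insufficient funds: no writes
```

An interactive transaction, by contrast, would let the client read a value, decide what to do next, and issue further statements before committing.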

This work has been partially realised in a prototype. This partial prototype has been verified against Jepsen.io’s Maelstrom library and dedicated in-tree strict serializability verification tools, but much work remains before it is production capable and integrated into Cassandra.

I propose including the prototype in the project as a new source repository, to be developed as a standalone library for integration into Cassandra. I hope the community sees the important value proposition of this proposal, and will adopt the CEP after this discussion, so that the library and its integration into Cassandra can be developed in parallel and with the involvement of the wider community.

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by "benedict@apache.org" <be...@apache.org>.

Hi Paulo,

> First and foremost, I believe this proposal in its current form focuses on the protocol details (HOW?) but lacks the bigger picture on how this is going to be exposed to the user (WHAT)?

In my opinion this CEP embodies a coherent, distinct and complex piece of work that requires specialist expertise. You have after all just suggested a month to read only the existing proposal 😊

UX is a whole other kind of discussion, one that can be quite opinionated and requires different expertise. It is, in my opinion, helpful to break out work that is not tightly coupled, as well as work that requires different expertise. As you point out, multi-key UX features are largely independent of any underlying implementation, and can likely be done in parallel, even by different contributors.

> Can we not start using it as an external dependency

I would love to understand your rationale, as this is a surprising suggestion to me. This is just like any other subsystem, but we would be managing it as a separate library primarily for modularity reasons. The reality is that this option should anyway be considered unavailable. This is a proposed contribution to the Cassandra project, which we can either accept or reject.

> Isn't this a good chance to make the serialization protocol pluggable with clearly defined integration points

It has recently been demonstrated to be possible to build a system that can safely switch between different consensus protocols. However, this was very sophisticated work that would require its own CEP, one that we would be unable to resource. Even if we could, this would be insufficient. To my knowledge, this goal has never been achieved for a multi-shard transaction protocol, and multi-shard transaction protocols are much more divergent in implementation detail than consensus protocols.

> so we could easily switch implementations with different guarantees… (ie. Apache Ratis)

As far as I know, there are no other strict serializable protocols available to plug in today. Apache Ratis appears to be a straightforward Raft implementation, and therefore it is a linearizable consensus protocol. It is not a multi-shard transaction protocol at all, let alone strict serializable. It could be used in place of Paxos, but not Accord.



From: Paulo Motta <pa...@gmail.com>
Date: Tuesday, 14 September 2021 at 22:55
To: Cassandra DEV <de...@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
I can start with some preliminary comments while I get more familiarized
with the proposal:

- First and foremost, I believe this proposal in its current form focuses
on the protocol details (HOW?) but lacks the bigger picture on how this is
going to be exposed to the user (WHAT)? Is exposing linearizable
transactions to the user not a goal of this proposal? If not, I think the
proposal is missing the UX (i.e. what CQL commands are going to be added,
etc.) on how these transactions are going to be exposed.

- Why do we need to bring the library into the project umbrella? Can we not
start using it as an external dependency, and later re-evaluate if it's
necessary to bring it into the project or even incubate it as another
Apache project? I feel we may be importing unnecessary management overhead
into the project while only a small subset of contributors will be involved
with the core protocol.

- Isn't this a good chance to make the serialization protocol pluggable
with clearly defined integration points, so we could easily switch
implementations with different guarantees, trade-offs and performance
considerations while leaving the UX intact? This would also allow us to
easily benchmark the protocol against alternatives (i.e. Apache Ratis) and
validate the performance claims. I think the best way to do that would be
to define what the feature will look like to the end user (UX), define the
integration points necessary to support this feature, and use Accord as the
first implementation of these integration points.


Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by Paulo Motta <pa...@gmail.com>.

I can start with some preliminary comments while I get more familiarized
with the proposal:

- First and foremost, I believe this proposal in its current form focuses
on the protocol details (HOW?) but lacks the bigger picture on how this is
going to be exposed to the user (WHAT)? Is exposing linearizable
transactions to the user not a goal of this proposal? If not, I think the
proposal is missing the UX (i.e. what CQL commands are going to be added,
etc.) on how these transactions are going to be exposed.

- Why do we need to bring the library into the project umbrella? Can we not
start using it as an external dependency, and later re-evaluate if it's
necessary to bring it into the project or even incubate it as another
Apache project? I feel we may be importing unnecessary management overhead
into the project while only a small subset of contributors will be involved
with the core protocol.

- Isn't this a good chance to make the serialization protocol pluggable
with clearly defined integration points, so we could easily switch
implementations with different guarantees, trade-offs and performance
considerations while leaving the UX intact? This would also allow us to
easily benchmark the protocol against alternatives (i.e. Apache Ratis) and
validate the performance claims. I think the best way to do that would be
to define what the feature will look like to the end user (UX), define the
integration points necessary to support this feature, and use Accord as the
first implementation of these integration points.

On Tue, 14 Sep 2021 at 17:57, Paulo Motta <pa...@gmail.com> wrote:

> Given the extensiveness and complexity of the proposal I'd suggest leaving
> it a little longer (perhaps 4 weeks from the publish date?) for people to
> get a bit more familiarized and have the chance to comment before casting a
> vote. I glanced through the proposal - and it looks outstanding, very
> promising work guys! - but would like a bit more time to take a deeper look
> and digest it before potentially commenting on it.

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by Paulo Motta <pa...@gmail.com>.

Given the extensiveness and complexity of the proposal I'd suggest leaving
it a little longer (perhaps 4 weeks from the publish date?) for people to
get a bit more familiarized and have the chance to comment before casting a
vote. I glanced through the proposal - and it looks outstanding, very
promising work guys! - but would like a bit more time to take a deeper look
and digest it before potentially commenting on it.

On Tue, 14 Sep 2021 at 17:30, benedict@apache.org <benedict@apache.org> wrote:

> Has anyone had a chance to read the drafts, and has any feedback or
> questions? Does anybody still anticipate doing so in the near future? Or
> shall we move to a vote?

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by "benedict@apache.org" <be...@apache.org>.

Has anyone had a chance to read the drafts, and has any feedback or questions? Does anybody still anticipate doing so in the near future? Or shall we move to a vote?

From: benedict@apache.org <be...@apache.org>
Date: Tuesday, 7 September 2021 at 21:27
To: dev@cassandra.apache.org <de...@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
Hi Jake,

> What structural changes are planned to support an external dependency project like this

To add to Blake’s answer, in case there’s some confusion over this, the proposal is to include this library within the Apache Cassandra project. So I wouldn’t think of it as an external dependency. This PMC and community will still have the usual oversight over direction and development, and APIs will be developed solely with the intention of their integration with Cassandra.

> Will this effort eventually replace consistency levels in C*?

I hope we’ll have some very related discussions around consistency levels in the coming months more generally, but I don’t think that is tightly coupled to this work. I agree with you both that we won’t want to perpetuate the problems you’ve highlighted though.

Henrik:
> I was referring to the property that Calvin transactions also need to be sent to the cluster in a single shot

Ah, yes. In that case I agree, and I tried to point to this direction in an earlier email, where I discussed the use of scripting languages (i.e. transactionally modifying the database with some subset of arbitrary computation). I think the JVM is particularly suited to offering quite powerful distributed transactions in this vein, and it will be interesting to see what we might develop in this direction in future.


From: Jake Luciani <ja...@gmail.com>
Date: Tuesday, 7 September 2021 at 19:27
To: dev@cassandra.apache.org <de...@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
Great thanks for the information

On Tue, Sep 7, 2021 at 12:44 PM Blake Eggleston
<be...@apple.com.invalid> wrote:

> Hi Jake,
>
> > 1.  Will this effort eventually replace consistency levels in C*?  I ask
> > because one of the shortcomings of our paxos today is it can be easily
> > mixed with non serialized consistencies and therefore users commonly
> > break consistency by for example reading at CL.ONE while also using LWTs.
>
> This will likely require CLs to be specified at the schema level for
> tables using multi partition transactions. I’d expect this to be available
> for other tables, but not required.
>
> > 2. What structural changes are planned to support an external dependency
> > project like this?  Are there some high level interfaces you expect the
> > project to adhere to?
>
> There will be some interfaces that need to be implemented in C* to support
> the library. You can find the current interfaces in the accord.api package,
> but these were written to support some initial testing, and not intended
> for integration into C* as is. Things are pretty fluid right now and will
> be rewritten / refactored multiple times over the next few months.
>
> Thanks,
>
> Blake
>
>
> > On Sun, Sep 5, 2021 at 10:33 AM benedict@apache.org <benedict@apache.org
> >
> > wrote:
> >
> >> Wiki:
> >>
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-15%3A+General+Purpose+Transactions
> >> Whitepaper:
> >>
> https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf
> >> <
> >>
> https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf?version=1&modificationDate=1630847736966&api=v2
> >>>
> >> Prototype: https://github.com/belliottsmith/accord
> >>
> >> Hi everyone, I’d like to propose this CEP for adoption by the community.
> >>
> >> Cassandra has benefitted from LWTs for many years, but application
> >> developers that want to ensure consistency for complex operations must
> >> either accept the scalability bottleneck of serializing all related
> state
> >> through a single partition, or layer a complex state machine on top of
> the
> >> database. These are sophisticated and costly activities that our users
> >> should not be expected to undertake. Since distributed databases are
> >> beginning to offer distributed transactions with fewer caveats, it is
> past
> >> time for Cassandra to do so as well.
> >>
> >> This CEP proposes the use of several novel techniques that build upon
> >> research (that followed EPaxos) to deliver (non-interactive) general
> >> purpose distributed transactions. The approach is outlined in the
> wikipage
> >> and in more detail in the linked whitepaper. Importantly, by adopting
> this
> >> approach we will be the _only_ distributed database to offer global,
> >> scalable, strict serializable transactions in one wide area round-trip.
> >> This would represent a significant improvement in the state of the art,
> >> both in the academic literature and in commercial or open source
> offerings.
> >>
> >> This work has been partially realised in a prototype. This partial
> >> prototype has been verified against Jepsen.io’s Maelstrom library and
> >> dedicated in-tree strict serializability verification tools, but much
> work
> >> remains for the work to be production capable and integrated into
> Cassandra.
> >>
> >> I propose including the prototype in the project as a new source
> >> repository, to be developed as a standalone library for integration into
> >> Cassandra. I hope the community sees the important value proposition of
> >> this proposal, and will adopt the CEP after this discussion, so that the
> >> library and its integration into Cassandra can be developed in parallel
> and
> >> with the involvement of the wider community.
> >>
> >
> >
> > --
> > http://twitter.com/tjake
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@cassandra.apache.org
> For additional commands, e-mail: dev-help@cassandra.apache.org
>
>

--
http://twitter.com/tjake

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by Jake Luciani <ja...@gmail.com>.

Great, thanks for the information.

On Tue, Sep 7, 2021 at 12:44 PM Blake Eggleston
<be...@apple.com.invalid> wrote:

> Hi Jake,
>
> > 1.  Will this effort eventually replace consistency levels in C*?  I ask
> > because one of the shortcomings of our paxos today is
> > it can be easily mixed with non serialized consistencies and therefore
> > users commonly break consistency by for example reading at CL.ONE while
> > also
> > using LWTs.
>
> This will likely require CLs to be specified at the schema level for
> tables using multi partition transactions. I’d expect this to be available
> for other tables, but not required.
>
> > 2. What structural changes are planned to support an external dependency
> > project like this?  Are there some high level interfaces you expect the
> > project to adhere to?
>
> There will be some interfaces that need to be implemented in C* to support
> the library. You can find the current interfaces in the accord.api package,
> but these were written to support some initial testing, and not intended
> for integration into C* as is. Things are pretty fluid right now and will
> be rewritten / refactored multiple times over the next few months.
>
> Thanks,
>
> Blake

-- 
http://twitter.com/tjake

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by Blake Eggleston <be...@apple.com.INVALID>.

Hi Jake,

> 1.  Will this effort eventually replace consistency levels in C*?  I ask
> because one of the shortcomings of our paxos today is
> it can be easily mixed with non serialized consistencies and therefore
> users commonly break consistency by for example reading at CL.ONE while
> also using LWTs.

This will likely require CLs to be specified at the schema level for tables using multi-partition transactions. I’d expect this to be available for other tables, but not required.

> 2. What structural changes are planned to support an external dependency
> project like this?  Are there some high level interfaces you expect the
> project to adhere to?

There will be some interfaces that need to be implemented in C* to support the library. You can find the current interfaces in the accord.api package, but these were written to support some initial testing, and were not intended for integration into C* as is. Things are pretty fluid right now and will be rewritten / refactored multiple times over the next few months.

Thanks,

Blake

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by Jake Luciani <ja...@gmail.com>.

Hi Benedict!

I haven't gone too deeply into this proposal but it's very exciting to see
this kind of innovation!

Some basic questions, tangentially related to this effort, that I didn't
see covered in the CEP:

1.  Will this effort eventually replace consistency levels in C*?  I ask
because one of the shortcomings of our Paxos today is that it can easily be
mixed with non-serial consistency levels, and therefore users commonly
break consistency by, for example, reading at CL.ONE while also using LWTs.

2. What structural changes are planned to support an external dependency
project like this?  Are there some high level interfaces you expect the
project to adhere to?
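
The hazard raised in question 1 can be illustrated with a toy model. This is only a sketch in Python under simplifying assumptions: the `Replica` class and the `quorum_write`, `read_one` and `read_quorum` helpers are hypothetical names invented for this example, the version counter stands in for a real Paxos ballot or timestamp, and nothing here reflects Cassandra's actual read or write path. It shows only why a single-replica read can miss a write that a majority has already accepted.

```python
class Replica:
    """One copy of a single value, tagged with a version number."""
    def __init__(self):
        self.value = None
        self.version = 0


def quorum_write(replicas, reachable, value, version):
    """Apply a write to the reachable replicas; require a strict majority."""
    if len(reachable) <= len(replicas) // 2:
        raise RuntimeError("no quorum")
    for i in reachable:
        replicas[i].value = value
        replicas[i].version = version


def read_one(replicas, index):
    """Read from a single replica, loosely like a CL.ONE read."""
    return replicas[index].value


def read_quorum(replicas, indexes):
    """Read from a strict majority and keep the highest-versioned value."""
    if len(indexes) <= len(replicas) // 2:
        raise RuntimeError("no quorum")
    newest = max((replicas[i] for i in indexes), key=lambda r: r.version)
    return newest.value


replicas = [Replica() for _ in range(3)]

# The conditional write lands on replicas 0 and 1 only (a majority of 3).
quorum_write(replicas, reachable=[0, 1], value="locked-by-A", version=1)

# A single-replica read that happens to hit replica 2 misses the write.
stale = read_one(replicas, 2)

# Any majority read overlaps the write's majority, so it sees the write.
fresh = read_quorum(replicas, [1, 2])
```

In this model the single-replica read returns nothing while the majority read returns the committed value, which is the kind of inconsistency that mixing LWTs with weaker reads can expose.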

Thanks
Jake

-- 
http://twitter.com/tjake

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by Henrik Ingo <he...@datastax.com>.

On Tue, Sep 7, 2021 at 5:06 PM benedict@apache.org <be...@apache.org>
wrote:

> > I was thinking that a path similar to Calvin/FaunaDB is certainly
> looming in the horizon at least.
>
> I’m not sure which aspect of these systems you are referring to. Unless I
> have misunderstood, I consider them to be strictly inferior approaches
> (particularly for Cassandra) as they require a _global_ leader process and
> as a result have scalability limits. Users simply shift the sharding
> problem to the cluster level rather than the node level, but the
> fundamental problem remains. This may be acceptable for many users, but was
> contrary to the goals of this CEP.
>

Oh yes. For sure it's one of the strengths of the CEP that it is clearly
designed to fit well into the existing Cassandra architecture and
experience.

I was referring to the property that Calvin transactions also need to be
sent to the cluster in a single shot, but then they have extended the
functionality by allowing programming logic to be executed inside the
transaction. (Like a stored procedure, if you will.) So the transactions
can be multi-statement with complex logic, they just can't communicate
outside the cluster, for example back and forth between client and server.
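
As a rough illustration of the "single shot with logic inside" idea, here is a minimal Python sketch. Everything in it is an assumption made for the example: `ToyStore`, `submit` and `transfer` are hypothetical names, and a single process-wide lock stands in for whatever ordering protocol a real system uses. It is not how Calvin, FaunaDB or Accord order transactions. The point is only that the client declares its keys and ships its logic up front, with no round trips during execution.

```python
import threading


class ToyStore:
    """In-memory key/value store that runs one-shot transactions."""

    def __init__(self, data=None):
        self._data = dict(data or {})
        self._lock = threading.Lock()  # stand-in for a real ordering protocol

    def submit(self, keys, logic):
        """Run `logic` atomically over the declared `keys`.

        `logic` receives a dict of current values and returns a dict of
        writes; writes to undeclared keys are rejected.
        """
        with self._lock:
            snapshot = {k: self._data.get(k) for k in keys}
            writes = logic(snapshot)
            undeclared = set(writes) - set(keys)
            if undeclared:
                raise ValueError(f"undeclared keys: {sorted(undeclared)}")
            self._data.update(writes)
            return writes

    def get(self, key):
        with self._lock:
            return self._data.get(key)


def transfer(amount):
    """Build transaction logic: move `amount` from alice to bob if affordable."""
    def logic(values):
        if values["alice"] < amount:
            return {}  # condition failed: apply nothing
        return {
            "alice": values["alice"] - amount,
            "bob": values["bob"] + amount,
        }
    return logic


store = ToyStore({"alice": 100, "bob": 20})
applied = store.submit(["alice", "bob"], transfer(30))
rejected = store.submit(["alice", "bob"], transfer(500))
```

The condition and the writes span two keys, yet the client sends a single request; the transaction cannot consult the client again once it has started.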


> > good job pulling together ingredients from state of the art work in this
> area
>
> In case this was lost in the noise: this work is not simply an assembly of
> prior work. It introduces entirely novel approaches that permit the work to
> exceed the capabilities of any prior research or production system. It is
> worth properly highlighting that if we deliver this, Cassandra will have
> the most sophisticated transaction system full stop.
>
>
Of course. Maybe it's just me, but I'm at least equally impressed by the
"level of education" the authors show in not reinventing the wheel for the
details where copying a feature, or at least being inspired by one, from
some existing publication or implementation was possible. Knowing what to
keep vs what you want to improve isn't easy. Also, it makes the whitepaper
an interesting read when in addition to learning about Accord I also
learned about several other systems that I hadn't previously read about.

henrik

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by "benedict@apache.org" <be...@apache.org>.

> I was thinking that a path similar to Calvin/FaunaDB is certainly looming in the horizon at least.

I’m not sure which aspect of these systems you are referring to. Unless I have misunderstood, I consider them to be strictly inferior approaches (particularly for Cassandra) as they require a _global_ leader process and as a result have scalability limits. Users simply shift the sharding problem to the cluster level rather than the node level, but the fundamental problem remains. This may be acceptable for many users, but was contrary to the goals of this CEP.

> It seems to me at that point long running queries and interactive transactions are mostly the same problem.

I would estimate long running queries to be easier to deliver by at least an order of magnitude. They’re not unrelated, but they’re still quite distinct in my opinion.

> good job pulling together ingredients from state of the art work in this area

In case this was lost in the noise: this work is not simply an assembly of prior work. It introduces entirely novel approaches that permit the work to exceed the capabilities of any prior research or production system. It is worth properly highlighting that if we deliver this, Cassandra will have the most sophisticated transaction system full stop.

There are, to my knowledge, no databases offering distributed transactions that are both strict serializable and free of scalability bottlenecks. Every database today clearly aims for this combination, but accepts some trade-off: either only guaranteeing serializable isolation, requiring special timekeeping hardware to guarantee strict serializability, or using a global leader process (or using two-phase commit, though this is quite niche).

From: Henrik Ingo <he...@datastax.com>
Date: Tuesday, 7 September 2021 at 14:06
To: dev@cassandra.apache.org <de...@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
On Tue, Sep 7, 2021 at 12:26 PM benedict@apache.org <be...@apache.org>
wrote:

> > whether I should just* think of this as "better and more efficient LWT"
>
> So, the LWT concept is a Cassandra one and doesn’t have an agreed-upon
> definition. My understanding of a core feature/limitation of LWTs is that
> they operate over a single partition, and as a result many operations are
> impossible even in multiple rounds without complex distributed state
> machines. The core improvement here, besides improved performance, is that
> we will be able to operate over any set of keys at-once.
>
>
My bad, I have never used LWT and forgot / didn't know they were single
partition. The CEP makes more sense now.

> How this facility is evolved into user-facing capabilities is an
> open-ended question. Initially of course we will at least support the same
> syntax but remove the restriction on operating over a single partition. I
> haven’t thought about this much, as the CEP is primarily for enabling
> works, but I think we will want to expand the syntax in two ways:
>
>  1) to support more complex conditions (simple AND conditions across all
> partitions seem likely too restrictive, though they might make sense for
> the single partition case);
>   2) to support inserting data from one row into another, potentially with
> transformations being applied (including via UDFs).
>
> These are both relatively manageable improvements that we might want to
> land in the same major release as the transactions themselves. The core
> facility can be expanded quite broadly, though. It would be possible for
> instance to support some interpreted language(s) as part of a query, so
> that arbitrary work can be applied in the transaction.
>

I was thinking that a path similar to Calvin/FaunaDB is certainly looming
on the horizon at least. I've been following those with interest, because
a) it's refreshingly outside-the-box thinking, and b) they seem to be
able to push the limitations of this approach much beyond what one might
imagine when reading about it the first time. But like you also point out,
it remains to be seen whether users actually want those kinds of
transactions. We are creatures of habit for sure.

> Or, perhaps the community would rather build atop the feature to support
> interactive transactions at the client. I can’t predict resourcing for
> this, though, and it might be a community effort. I think it would be quite
> tractable once this work lands, however.
>
> > Suppose I wanted to do a long running read-only transaction
>
> So, there’s two sides to this: with and without paging. A long running
> read-only transaction taking a few seconds is quite likely to be fine and
> we will probably support with some MVCC within the transaction system
> itself. This may or may not be part of v1, it’s hard to predict with
> certainty as this is going to be a large undertaking.
>
> But for paged queries we’d be talking about SNAPSHOT isolation. This is
> likely to be something the community wants to support before long anyway
> and is probably not as hard as you might think. It is probably outside of
> the scope of this work, though the two would dovetail very nicely.
>

I've pointed out to some of my colleagues that since Cassandra's storage
engine is an LSM engine, with some additional work it could become an
MVCC-style storage engine. Your thinking here seems to be in the same
direction, even if it's beyond version 1. (Just for context, and for the
benefit of other readers on the list: it took MongoDB 5 years and 6 major
releases to develop distributed multi-shard transactions. So it's good to
talk about the general direction, while understanding that this is not
something anyone will finish before Christmas.)
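
To make the MVCC idea above concrete, here is a minimal Python sketch of a multi-version store. It is an illustration under stated assumptions, not a description of Cassandra's storage engine: `VersionedStore` and its `write` and `read` methods are hypothetical names, and timestamps are plain integers supplied by the caller. Each key keeps every version it has been given, and a read at a fixed snapshot timestamp ignores anything written later.

```python
import bisect


class VersionedStore:
    """Keeps every version of every key, ordered by timestamp."""

    def __init__(self):
        # key -> (sorted list of timestamps, parallel list of values)
        self._versions = {}

    def write(self, key, timestamp, value):
        """Record `value` for `key` at `timestamp` without overwriting history."""
        ts_list, values = self._versions.setdefault(key, ([], []))
        pos = bisect.bisect_right(ts_list, timestamp)
        ts_list.insert(pos, timestamp)
        values.insert(pos, value)

    def read(self, key, as_of):
        """Return the newest value written at or before `as_of`, else None."""
        ts_list, values = self._versions.get(key, ([], []))
        pos = bisect.bisect_right(ts_list, as_of)
        return values[pos - 1] if pos else None


store = VersionedStore()
store.write("k", 10, "v1")
store.write("k", 20, "v2")

snapshot_ts = 15                       # a reader picks its snapshot once
before = store.read("k", snapshot_ts)  # sees only the write at timestamp 10

store.write("k", 30, "v3")             # a later write arrives mid-read
after = store.read("k", snapshot_ts)   # the snapshot view is unchanged
latest = store.read("k", 100)          # a newer snapshot sees the new write
```

A paged or long-running read that keeps using the same `snapshot_ts` gets a consistent view across pages, which is the property snapshot isolation relies on.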

It seems to me at that point long running queries and interactive
transactions are mostly the same problem.

****

Benedict, thanks for the answers. Since I'm not a Cassandra developer I
feel it would be inappropriate for me to express an opinion for or against,
so I'll just end by saying this is an interesting proposal and the
authors have done a good job pulling together ingredients from
state-of-the-art work in this area. As such it will be interesting to follow
the discussion and work from whitepaper to implementation.

A secondary objective was also to just let everyone know I am lurking here.
If you ever want to reach out for an out-of-band discussion, you now have my
contact details.

henrik

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by "benedict@apache.org" <be...@apache.org>.

> Sorry if a few comments were a bit "editorial" in the first message

Not a problem at all – more than happy to talk about suggestions in that vein! Just probably best not to subject everyone else to the discussion.

> What I would like to understand better and without guessing is, what do these transactions look like from a client/user point of view?

This is a fair question, and perhaps something I should pinpoint more directly for the reader. The CEP does stipulate non-interactive transactions, i.e. those that are one-shot. The only other limitation is that the partition keys must be known upfront; however, I expect we will follow up soon after with some weaker semantics that build on top (probably using optimistic concurrency control) to support transactions where only some partition keys are known upfront, so that we may support global secondary indexes with proper isolation and consistency.

> whether I should just* think of this as "better and more efficient LWT”

So, the LWT concept is a Cassandra one and doesn’t have an agreed-upon definition. My understanding of a core feature/limitation of LWTs is that they operate over a single partition, and as a result many operations are impossible even in multiple rounds without complex distributed state machines. The core improvement here, besides improved performance, is that we will be able to operate over any set of keys at once.

How this facility is evolved into user-facing capabilities is an open-ended question. Initially of course we will at least support the same syntax but remove the restriction on operating over a single partition. I haven’t thought about this much, as the CEP is primarily for enabling works, but I think we will want to expand the syntax in two ways:

 1) to support more complex conditions (simple AND conditions across all partitions seem likely too restrictive, though they might make sense for the single partition case);
 2) to support inserting data from one row into another, potentially with transformations being applied (including via UDFs).

These are both relatively manageable improvements that we might want to land in the same major release as the transactions themselves. The core facility can be expanded quite broadly, though. It would be possible for instance to support some interpreted language(s) as part of a query, so that arbitrary work can be applied in the transaction.
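
To make the shape of such a one-shot transaction concrete, here is a minimal single-process sketch. It is not CQL and not the prototype's API: `OneShotTxn`, `execute`, and the in-memory `store` are hypothetical names invented for illustration, and nothing here models consensus or replication. It only shows the two properties discussed above: every partition key is declared upfront, and both the condition and the writes are computed from reads across several partitions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

Row = Dict[str, int]
Rows = Dict[str, Row]  # partition key -> row (in-memory stand-in for a table)


@dataclass(frozen=True)
class OneShotTxn:
    """Hypothetical one-shot transaction: every key is declared upfront."""
    keys: Tuple[str, ...]
    condition: Callable[[Rows], bool]  # may span several partitions
    update: Callable[[Rows], Rows]     # derives the writes from the reads


def execute(store: Rows, txn: OneShotTxn) -> bool:
    # Read every declared key; a missing partition reads as an empty row.
    reads = {key: dict(store.get(key, {})) for key in txn.keys}
    if not txn.condition(reads):
        return False  # condition failed, nothing is written
    writes = txn.update(reads)
    undeclared = set(writes) - set(txn.keys)
    if undeclared:
        raise ValueError(f"writes to undeclared keys: {sorted(undeclared)}")
    for key, row in writes.items():
        store.setdefault(key, {}).update(row)
    return True


store: Rows = {"alice": {"balance": 10}, "bob": {"balance": 0}}
transfer = OneShotTxn(
    keys=("alice", "bob"),
    # A condition that is more than a simple AND of per-partition checks.
    condition=lambda r: r["alice"]["balance"] - r["bob"]["balance"] >= 5,
    # Data from one row flows into another, with a transformation applied.
    update=lambda r: {
        "alice": {"balance": r["alice"]["balance"] - 5},
        "bob": {"balance": r["bob"]["balance"] + 5},
    },
)
print(execute(store, transfer), store)
```

Because the whole transaction (keys, condition, update) is handed over in one piece, there is no client round trip between the reads and the writes, which is what distinguishes it from an interactive transaction.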

Or, perhaps the community would rather build atop the feature to support interactive transactions at the client. I can’t predict resourcing for this, though, and it might be a community effort. I think it would be quite tractable once this work lands, however.

> Suppose I wanted to do a long running read-only transaction

So, there are two sides to this: with and without paging. A long running read-only transaction taking a few seconds is quite likely to be fine, and we will probably support it with some MVCC within the transaction system itself. This may or may not be part of v1; it’s hard to predict with certainty as this is going to be a large undertaking.

But for paged queries we’d be talking about SNAPSHOT isolation. This is likely to be something the community wants to support before long anyway and is probably not as hard as you might think. It is probably outside of the scope of this work, though the two would dovetail very nicely.
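
As a rough illustration of why paging points towards snapshot reads, the sketch below keeps every version of a value and lets each page read as of the same timestamp. `VersionedStore`, `read_at`, and `page` are hypothetical names, and this is an assumption about how such a facility could look, not a description of Cassandra's storage engine or of the proposal.

```python
from collections import defaultdict


class VersionedStore:
    """Toy multi-version store: key -> list of (timestamp, value)."""

    def __init__(self):
        self._versions = defaultdict(list)

    def write(self, key, ts, value):
        self._versions[key].append((ts, value))

    def read_at(self, key, snapshot_ts):
        # Only versions written at or before the snapshot are visible.
        visible = [(ts, v) for ts, v in self._versions.get(key, [])
                   if ts <= snapshot_ts]
        if not visible:
            return None
        return max(visible, key=lambda pair: pair[0])[1]

    def page(self, keys, snapshot_ts):
        # Every page of one query reuses the same snapshot timestamp.
        return {key: self.read_at(key, snapshot_ts) for key in keys}


db = VersionedStore()
db.write("a", 1, "a-v1")
db.write("b", 1, "b-v1")

snapshot = 1
first_page = db.page(["a"], snapshot)

# A concurrent writer updates both keys between the two pages.
db.write("a", 2, "a-v2")
db.write("b", 2, "b-v2")

second_page = db.page(["b"], snapshot)
print(first_page, second_page)
```

Both pages see the state as of timestamp 1, even though newer versions exist by the time the second page is fetched.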


Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by Henrik Ingo <he...@datastax.com>.

On Tue, Sep 7, 2021 at 1:31 AM benedict@apache.org <be...@apache.org>
wrote:

>
> Of course, but we may have to be selective in our back-and-forth. We can
> always take some discussion off-list to keep it manageable.
>
>
I'll try to converge. Sorry if a few comments were a bit "editorial" in the
first message. I find that sometimes it pays off to also ask the dumb
questions, as long as we don't get stuck on any of them.

> > The algorithm is hard to read since you omit the roles of the
> participants.
>
> Thanks. I will consider how I might make it clearer that the portions of
> the algorithm that execute on receipt of messages that may only be received
> by replicas, are indeed executed by those replicas.
>
>
In fact the same algorithm in the CEP was easier to read exactly because of
this, I now realize.

> > So I guess my question is how and when reads happen?
>
> I think this is reasonably well specified in the protocol and, since it’s
> unclear what you’ve found confusing, I don’t know it would be productive to
> try to explain it again here on list. You can look at the prototype, if
> Java is easier for you to parse, as it is of course fully specified there
> with no ambiguity. Or we can discuss off list, or perhaps on the community
> slack channel.
>
>
Maybe my question was a bit too open ended, as I didn't want to lead into
any specific direction.

I can of course tell where reads happen in the execution algorithm. What I
would like to understand better and without guessing is, what do these
transactions look like from a client/user point of view? You already
confirmed that interactive transactions aren't intended by this proposal.
At the other end of the spectrum, given that this is a Cassandra
Enhancement Proposal, and the CEP does in fact state this, it seems like
providing equivalent functionality to already existing LWT is a goal. So my
question is whether I should just* think of this as "better and more
efficient LWT" or is there something more? Would this CEP or follow-up work
introduce any new CQL syntax, for example?

To give just one more example of the kind of questions I'm triangulating
at: Suppose I wanted to do a long running read-only transaction, such as
querying a secondary index. Like SERIAL in current Cassandra, but taking
seconds to execute and returning thousands of rows. How would you see the
possibilities and limits of such operations in Accord?

*) Should emphasize that better scaling LWTs isn't just "just". If I
imagine a future Cassandra cluster where all reads and writes are
transactional and therefore strict serializable, that would be quite a
change from today.

henrik

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by "benedict@apache.org" <be...@apache.org>.

Hi Henrik,

Welcome, and thanks for the feedback.

> I hope it's ok to use this list for comments on the whitepaper?

Of course, but we may have to be selective in our back-and-forth. We can always take some discussion off-list to keep it manageable.

> if in addition to a deadline you also impose some upper bound for the maximum allowed timestamp

I expect that, much like with LWTs, there will be no facility for user-provided timestamps with these transactions. But yes, I anticipate many knock-on improvements for tables that are managed with this transaction facility.

> The algorithm is hard to read since you omit the roles of the participants.

Thanks. I will consider how I might make it clearer that the portions of the algorithm that execute on receipt of messages that may only be received by replicas, are indeed executed by those replicas.

> Is this sentence correct?

Yes, but perhaps it may be made clearer. In a previous draft there was an additional upsilon variable that likely made this clearer, but it is hard to use in this location (it would replace tau, which is already bound by the wider context), and for consistency I have tried to ensure gamma < tau < upsilon throughout the paper.

> Proofs of theorems 3.1 and 3.2 appear to be identical?

Nope. There’s a single but important digit difference.

> * Are interactive transactions possible?

No, I don’t think this protocol can be easily made to natively support interactive transactions, even discounting the problems you highlight - but I haven’t thought about it much as it was not a goal. Interactive transactions can certainly be built on top.

> Are the results of the Jepsen testing available too? (Or will be?)

There are no publishable results, nor any intention to publish them. There is a (fairly rough) implementation of the Jepsen.io Maelstrom txn-append workload that you may run at your leisure in the prototype repository. The in-tree strict serializability verifier is in all honesty probably more useful today and is I think functionally equivalent. You are welcome to browse and run both. As things progress towards completion, if Kyle is interested or funding can be found I’d love to discuss the possibility of an in-depth Jepsen analysis that could be published, but that’s a totally separate conversation and I think very premature.

> So I guess my question is how and when reads happen?

I think this is reasonably well specified in the protocol and, since it’s unclear what you’ve found confusing, I don’t know it would be productive to try to explain it again here on list. You can look at the prototype, if Java is easier for you to parse, as it is of course fully specified there with no ambiguity. Or we can discuss off list, or perhaps on the community slack channel.


Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by Henrik Ingo <he...@datastax.com>.

Hi all

I should start by briefly introducing myself: I've worked a year plus at
Datastax, but in a manager role. I have no expectations near term to
actually contribute code or docs to Cassandra, rather I hope my work
indirectly will enable others to do so. As such I also don't expect to be
very vocal on this list, but today seemed like a perfect day to make that
one exception! I hope that's ok?

Before joining the Cassandra world I've worked at MongoDB and several
companies in the MySQL ecosystem. If you read the Raft mailing list you
will have met me there. Since my focus was always on high availability and
performance, I've felt very much at home working in the Cassandra ecosystem.

To the authors of the white paper I want to say this is very inspiring
work. I agree it is time to bring general purpose transactions to
Cassandra, and you are introducing them in a way that builds upon
Cassandra's existing Dynamo protocol with natural timestamps. When I was
learning Cassandra 16 months ago I had similar thoughts to what you are now
presenting.

I hope it's ok to use this list for comments on the whitepaper?

1. Introduction

While I agree that cross shard transactions are only recently becoming
mainstream, for academic level accuracy of your paper you may want to
reference NDB, also known as MySQL NDB Cluster.
 * https://en.wikipedia.org/wiki/MySQL_Cluster
 * http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.48.884

The above thesis is from 1997 and MySQL acquired the technology for 1 dollar in
2004. Since shortly after that year it has been in widespread use in
our mobile phone networks, with some early e-commerce and OLAP/ML type use
as secondary use cases. In short, NDB provides cross shard transactions
simply via 2PC. A curious detail of the design is that it actually does
both replication and cross-shard transactions via 2PC. Two of the participants just
happen to be replicas of each other.
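
For readers less familiar with 2PC, a minimal textbook-style sketch follows. It is not NDB's implementation; `Participant` and `two_phase_commit` are hypothetical names, and failure handling, logging, and timeouts are left out. The point it illustrates is that a replica and a remote shard can be treated as the same kind of participant.

```python
class Participant:
    """A shard or a replica; the coordinator does not distinguish them."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self):
        # Phase 1: vote yes only if the change can be made durable.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    # Phase 1: collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    # Phase 2: commit only on a unanimous yes, otherwise abort everywhere.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False


# Two shards, each with a primary and a replica: four participants in all.
healthy = [Participant(n) for n in ("s1-primary", "s1-replica",
                                    "s2-primary", "s2-replica")]
ok = two_phase_commit(healthy)

degraded = [Participant("s1-primary"), Participant("s1-replica", can_commit=False)]
failed = two_phase_commit(degraded)
print(ok, failed)
```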

2.2 Timestamp Reorder buffer

It's probably the case this is obvious, and it's omitted because it's not
required by ACCORD, but I wanted to add here that if in addition to a
deadline you also impose some upper bound for the maximum allowed
timestamp, you will make all our issues with tombstones from the future go
away. (And since you are now creating an ordered commit log, this will also
avoid having to keep tombstones for 10 days, simplify anti-entropy for
failed nodes, etc...)
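
A sketch of the upper bound suggested here, under the assumption that a replica simply compares a proposed timestamp with its own clock. The function name `admit` and the microsecond units are illustrative choices, not anything taken from the whitepaper.

```python
def admit(timestamp_us: int, now_us: int, max_future_us: int) -> bool:
    """Accept a timestamp only if it is not too far ahead of the local clock."""
    return timestamp_us <= now_us + max_future_us


now = 1_000_000
print(admit(now + 500, now, max_future_us=1_000))    # within the bound
print(admit(now + 5_000, now, max_future_us=1_000))  # too far in the future
```

With such a check in place, a write or tombstone stamped far in the future would be rejected at admission rather than shadowing later writes.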

3.2 Consensus

The algorithm is hard to read since you omit the roles of the participants.
It's as if all of it was executed on the Coordinator.

Is this sentence correct? Probably it is and I'm at the limits of my
understanding... *"Note that any transitive dependency of another γ ∈ deps_τ
where Committed_γ may be pruned from deps_τ, as it is durably a transitive
dependency of τ."*
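
One possible reading of the quoted sentence, sketched as code: if γ is in the dependency set of τ and γ is committed, then anything γ itself depends on is still reachable through γ and can be dropped from τ's direct dependencies. The names `prune`, `deps_of`, and `committed` are hypothetical, the sketch assumes the dependency graph is acyclic, and it only removes the direct dependencies of each committed γ, which is a conservative subset of what the sentence allows.

```python
def prune(deps_tau, deps_of, committed):
    """Drop dependencies of tau that are covered by a committed dependency."""
    covered = set()
    for gamma in deps_tau:
        if gamma in committed:
            # gamma's own dependencies are durably recorded with gamma.
            covered |= deps_of.get(gamma, set())
    return deps_tau - covered


deps_of = {"g2": {"g1"}, "g3": {"g0"}}
deps_tau = {"g0", "g1", "g2", "g3"}

# g2 is committed, so g1 is covered; g3 is not committed, so g0 stays.
print(sorted(prune(deps_tau, deps_of, committed={"g2"})))
```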

3.4 Safety

Proofs of theorems 3.1 and 3.2 appear to be identical?

End:

Ok so reads were discussed very briefly in 3.3, leaving the reader to guess
quite a lot...

* Are interactive transactions possible? It appears they could be, even if
Algorithm 2 only allows for one pass at reads.
* Do I understand correctly that t0 is essentially both the start and end
time of the transaction? ...and that serializability is provided by the
fact that a later transaction gamma will not even start to execute reads
before earlier transaction tau has committed?
* If interactive transactions are possible, it seems a client can
denial-of-service a row by never committing, keeping locks open forever?

So I guess my question is how and when reads happen?

More precisely... how is it possible that the Consensus protocol is
executed first, and it already knows its dependencies, even if the
Execution protocol - aka reads and writes - are only executed after?

Similarly, how do you expect to apply writes before reads were returned to
the client? Even if you were proposing some Calvin-like single-shot
transaction, it still begs the question what mechanism can consume read
results and based on those impact the writes?

Reading the CEP:

Are the results of the Jepsen testing available too? (Or will be?)

henrik

On Sun, Sep 5, 2021 at 5:33 PM benedict@apache.org <be...@apache.org>
wrote:

> Wiki:
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-15%3A+General+Purpose+Transactions
> Whitepaper:
> https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf
> <
> https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf?version=1&modificationDate=1630847736966&api=v2
> >
> Prototype: https://github.com/belliottsmith/accord
>
> Hi everyone, I’d like to propose this CEP for adoption by the community.
>
> Cassandra has benefitted from LWTs for many years, but application
> developers that want to ensure consistency for complex operations must
> either accept the scalability bottleneck of serializing all related state
> through a single partition, or layer a complex state machine on top of the
> database. These are sophisticated and costly activities that our users
> should not be expected to undertake. Since distributed databases are
> beginning to offer distributed transactions with fewer caveats, it is past
> time for Cassandra to do so as well.
>
> This CEP proposes the use of several novel techniques that build upon
> research (that followed EPaxos) to deliver (non-interactive) general
> purpose distributed transactions. The approach is outlined in the wikipage
> and in more detail in the linked whitepaper. Importantly, by adopting this
> approach we will be the _only_ distributed database to offer global,
> scalable, strict serializable transactions in one wide area round-trip.
> This would represent a significant improvement in the state of the art,
> both in the academic literature and in commercial or open source offerings.
>
> This work has been partially realised in a prototype. This partial
> prototype has been verified against Jepsen.io’s Maelstrom library and
> dedicated in-tree strict serializability verification tools, but much work
> remains for the work to be production capable and integrated into Cassandra.
>
> I propose including the prototype in the project as a new source
> repository, to be developed as a standalone library for integration into
> Cassandra. I hope the community sees the important value proposition of
> this proposal, and will adopt the CEP after this discussion, so that the
> library and its integration into Cassandra can be developed in parallel and
> with the involvement of the wider community.
>

-- 

Henrik Ingo


Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by Dinesh Joshi <dj...@icloud.com.INVALID>.

+1

One of the major advantages of a separate library would be modularity.

Dinesh

> On Sep 5, 2021, at 3:02 PM, benedict@apache.org wrote:
> 
> Yep, that’s correct. In fact my goal is that we maintain this as a standalone library long term. While its primary goal will be integration with Cassandra, I think there is value in maintaining a distinct library for the core functionality - so long as the burden remains manageable.
> 
> From: Nate McCall <zz...@gmail.com>
> Date: Sunday, 5 September 2021 at 22:30
> To: dev <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> Hi Benedict,
> If I'm parsing this correctly, you want to include the stand-alone library
> in the project as a separate repo to begin with, correct? (I'm +1 on that,
> if so).
> 
> Otherwise I am very intrigued by the paper and proposal. This looks
> excellent. Thanks Benedict, et al. for putting this together!
> 
> -Nate
> 
>> On Mon, Sep 6, 2021 at 2:33 AM benedict@apache.org <be...@apache.org>
>> wrote:
>> 
>> Wiki:
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-15%3A+General+Purpose+Transactions
>> Whitepaper:
>> https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf
>> <
>> https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf?version=1&modificationDate=1630847736966&api=v2
>> >
>> Prototype: https://github.com/belliottsmith/accord
>> 
>> Hi everyone, I’d like to propose this CEP for adoption by the community.
>> 
>> Cassandra has benefitted from LWTs for many years, but application
>> developers that want to ensure consistency for complex operations must
>> either accept the scalability bottleneck of serializing all related state
>> through a single partition, or layer a complex state machine on top of the
>> database. These are sophisticated and costly activities that our users
>> should not be expected to undertake. Since distributed databases are
>> beginning to offer distributed transactions with fewer caveats, it is past
>> time for Cassandra to do so as well.
>> 
>> This CEP proposes the use of several novel techniques that build upon
>> research (that followed EPaxos) to deliver (non-interactive) general
>> purpose distributed transactions. The approach is outlined in the wikipage
>> and in more detail in the linked whitepaper. Importantly, by adopting this
>> approach we will be the _only_ distributed database to offer global,
>> scalable, strict serializable transactions in one wide area round-trip.
>> This would represent a significant improvement in the state of the art,
>> both in the academic literature and in commercial or open source offerings.
>> 
>> This work has been partially realised in a prototype. This partial
>> prototype has been verified against Jepsen.io’s Maelstrom library and
>> dedicated in-tree strict serializability verification tools, but much work
>> remains for the work to be production capable and integrated into Cassandra.
>> 
>> I propose including the prototype in the project as a new source
>> repository, to be developed as a standalone library for integration into
>> Cassandra. I hope the community sees the important value proposition of
>> this proposal, and will adopt the CEP after this discussion, so that the
>> library and its integration into Cassandra can be developed in parallel and
>> with the involvement of the wider community.
>> 


Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by "benedict@apache.org" <be...@apache.org>.

Yep, that’s correct. In fact my goal is that we maintain this as a standalone library long term. While its primary goal will be integration with Cassandra, I think there is value in maintaining a distinct library for the core functionality - so long as the burden remains manageable.

From: Nate McCall <zz...@gmail.com>
Date: Sunday, 5 September 2021 at 22:30
To: dev <de...@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
Hi Benedict,
If I'm parsing this correctly, you want to include the stand-alone library
in the project as a separate repo to begin with, correct? (I'm +1 on that,
if so).

Otherwise I am very intrigued by the paper and proposal. This looks
excellent. Thanks Benedict, et al. for putting this together!

-Nate

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by Nate McCall <zz...@gmail.com>.

Hi Benedict,
If I'm parsing this correctly, you want to include the stand-alone library
in the project as a separate repo to begin with, correct? (I'm +1 on that,
if so).

Otherwise I am very intrigued by the paper and proposal. This looks
excellent. Thanks Benedict, et al. for putting this together!

-Nate

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by Henrik Ingo <he...@datastax.com>.

I feel like I should volunteer to write about MongoDB transactions.

TL;DR: Snapshot Isolation and Causal Consistency using Raft'ish replication,
a Lamport clock and 2PC. This leads to the age-old discussion of whether
users really want serializability or not.

On Wed, Sep 22, 2021 at 1:44 AM Jonathan Ellis <jb...@gmail.com> wrote:

> The whitepaper here is a good description of the consensus algorithm itself
> as well as its robustness and stability characteristics, and its comparison
> with other state-of-the-art consensus algorithms is very useful.  In the
> context of Cassandra, where a consensus algorithm is only part of what will
> be implemented, I'd like to see a more complete evaluation of the
> transactional side of things as well, including performance characteristics
> as well as the types of transactions that can be supported and at least a
> general idea of what it would look like applied to Cassandra. This will
> allow the PMC to make a more informed decision about what tradeoffs are
> best for the entire long-term project of first supplementing and ultimately
> replacing LWT.
>
> (Allowing users to mix LWT and AP Cassandra operations against the same
> rows was probably a mistake, so in contrast with LWT we’re not looking for
> something fast enough for occasional use but rather something within a
> reasonable factor of AP operations, appropriate to being the only way to
> interact with tables declared as such.)
>
> Besides Accord, this should cover
>
> - Calvin and FaunaDB
> - A Spanner derivative (no opinion on whether that should be Cockroach or
> Yugabyte, I don’t think it’s necessary to cover both)
> - A 2PC implementation (the Accord paper mentions DynamoDB but I suspect
> there is more public information about MongoDB)
> - RAMP
>
>
=MongoDB=

References:
Presentation: https://www.youtube.com/watch?v=quFheFrLLGQ
Slides:
http://henrikingo.github.io/presentations/HighLoad%202019%20-%20Distributed%20transactions%20top%20to%20bottom/index.html#/step-1
Lamport implementation:
http://delivery.acm.org/10.1145/3320000/3314049/p636-tyulenev.pdf
Replication: http://www.vldb.org/pvldb/vol12/p2071-schultz.pdf and
https://www.usenix.org/system/files/nsdi21-zhou.pdf
TPC-C benchmark: http://www.vldb.org/pvldb/vol12/p2254-kamsky.pdf
(Nothing published on cross shard trx...)

Approach and Guarantees: Shards are independent replica sets; multi-shard
transactions and queries are handled through a coordinator, aka query router.
Replica sets are Raft-like, so leader-based. When using 2PC, the 2PC
coordinator is also a replica set, so that the coordinator state is made
durable via majority commits. This means that a cross-shard transaction
actually needs 4 majority commits, but it would be possible to reduce latency
to client ack to 2 commits (https://jira.mongodb.org/browse/SERVER-47130).
Because of this, the trx-coordinator is also its own recovery manager, and it
is assumed that the replica set will always be able to recover from
failures, usually quickly.

Cluster time is a Lamport clock; in practice the implementation uses a
unix timestamp plus a counter to generate monotonically increasing
integers. Time is passed along with each message, and each recipient updates
its own cluster time to the higher timestamp. All nodes, including clients,
participate in this. Causal Consistency is basically a client asking to read
at or later than its current timestamp. A replica will block if needed to
satisfy this request. The Lamport clock is incremented by leaders to ensure
progress in the absence of write transactions.
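
To make the clock rule above concrete, here is a minimal sketch in Python. It
is not MongoDB's implementation; the class name ClusterClock, its methods and
the (seconds, counter) tuple representation are all assumptions made for
illustration. It only shows the two rules described: generate increasing
values from a unix timestamp plus a counter, and adopt the higher timestamp
seen on any incoming message.

```python
import time


class ClusterClock:
    """Hypothetical Lamport-style cluster time held as (unix seconds, counter)."""

    def __init__(self, now=time.time):
        self._now = now      # injectable wall clock, so the sketch is testable
        self.time = (0, 0)   # tuples compare by seconds first, then counter

    def tick(self):
        """Advance local cluster time, e.g. for a write or a leader no-op."""
        secs, counter = self.time
        wall = int(self._now())
        # Jump to the wall clock if it has moved ahead; otherwise bump the
        # counter so the value still strictly increases.
        self.time = (wall, 0) if wall > secs else (secs, counter + 1)
        return self.time

    def observe(self, remote_time):
        """Merge the cluster time carried on an incoming message."""
        self.time = max(self.time, remote_time)
        return self.time
```

A client taking part in this scheme would call observe() on every response and
send its current time along with each request, so that a causal read can ask
for data at or after that time.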

The storage engine provides MVCC semantics. Extending this to the
replication system is straightforward, since replicas apply transactions
serially in the same order. For cross shard transactions it's the job of
the transaction coordinator to commit the transaction with the same cluster
time on all shards. If I remember correctly in the 2PC phase it will simply
choose the timestamp returned by each shard as the global transaction
timestamp. Combined, MongoDB transactions are snapshot isolation + causal
consistency.

Performance: 2PC is used only if a transaction actually has multiple
participating shards. It is possible, though not fun or realistic, to specify
partition boundaries so that related records from two collections will
always reside on the same shard. The 2PC protocol actually requires 4
majority commits, although as of MongoDB 5.0, the client only waits for 3.
Majority commit is exactly what QUORUM is in Cassandra, so in a multi-DC
cluster, commit waits for replication latency. Notably, single shard
transactions parallelize well, because conflicting transactions can execute
on the leader, even when the majority commit isn't yet finished. (This
involves some speculative execution optimization.) I don't believe the same
is true for cross shard transactions using 2PC.

The paper by Asya Kamsky uses a single replica set and reports 60-70k TPM
for a non-standard TPC-C where varying client threads was allowed and
schema was modified to take advantage of denormalization in a document
model. I'm not aware of benchmarks for cross shard transactions, nor would
I expect such results to be great. The philosophy there has been that cross
shard transactions are expected to be a minority.

Functionality and limitations: MongoDB's approach has been similar in
spirit to what we can observe in the RDBMS market. Even if MySQL (since
forever) and PostgreSQL (2011) provide serializable isolation, it is not
the default, and it's hard to find a single user who ever wanted to use it.
Snapshot Isolation and Causal Consistency are considered the optimal
tradeoff between good consistency and performance, with minimal hassle from
aborted transactions. The typical MongoDB user is, like the typical
MySQL and PostgreSQL user, happy with this. It is possible to emulate SELECT
FOR UPDATE by using findAndModify, which will turn your reads into writes
and therefore take a write lock on all touched records.

Note that the first versions of MongoDB transactions got a quite bad Jepsen
review. This was mostly a function of none of the above guarantees being
the default, and the client API being really confusing, so most users -
including Kyle Kingsbury and yours truly - would struggle to get all
parameters right to actually enjoy the above-mentioned guarantees. This is
a sober reminder that this is complex stuff to get right end to end.

Note that MongoDB also supports linearizable writes and reads, but only on
a per-record basis. Linearizable is not available for transactions.

It should be noted MongoDB's approach allows for interactive transactions.

Application to Cassandra: D.

Replication being leader based is a poor fit for expectations of a typical
Cassandra user. It's hard to predict whether a typical Cassandra workload
can expect cross-partition transactions to be the exceptional case, but my
instinct says no. The Lamport clock and the causal consistency it provides
are simple to understand and could be a building block in a transactional
Cassandra cluster. My personal opinion is that a "synchronized timestamp"
(or Hybrid Logical Clock I guess?) scheme like in Accord is closer to
current Cassandra semantics.

henrik
-- 

Henrik Ingo

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by "benedict@apache.org" <be...@apache.org>.

Oh, finally, to address your question about how Fauna achieves low-cost reads: they default to serializable isolation only. They no doubt ensure the transaction log is replicated in order, so that any read from the DC-local transaction log is serializable. Accord will similarly be able to offer cheap serializable reads, and additionally is able to offer strict serializable reads without performing any write during consensus (nod to Alex Miller for pointing out this advantage over Calvin).

From: benedict@apache.org <be...@apache.org>
Date: Wednesday, 22 September 2021 at 04:19
To: dev@cassandra.apache.org <de...@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
Demonstrating how subtle, complex and difficult to pin down this topic is, Fauna’s recent blog post implies they may have migrated to a leaderless sequencing protocol (an earlier blog post made clear they used a leader process). However, Calvin still assumes a global sequencing shard, so this only modifies latency for clients, i.e. goal (3). Whether they have also removed Calvin’s single-shard linearization of transactions is unclear; there is no public information to suggest that they have met goal (1). With this the protocol would in essence begin to look a lot like Accord, and perhaps they are moving towards a similar approach.

From: benedict@apache.org <be...@apache.org>
Date: Wednesday, 22 September 2021 at 03:52
To: dev@cassandra.apache.org <de...@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
Hi Jonathan,

These other systems are incompatible with the goals of the CEP. I do discuss them (besides 2PC) in both the whitepaper and the CEP, and will summarise that discussion below. A true and accurate comparison of these other systems is essentially intractable, as there are complex subtleties to each flavour, and those who are interested would be better served by performing their own research.

I think it is more productive to focus on what we want to achieve as a community. If you believe the goals of this CEP are wrong for the project, let’s focus on that. If you want to compare and contrast specific facets of alternative systems that you consider to be preferable in some dimension, let’s do that here or in a Q&A as proposed by Joey.

The relevant goals are that we:

  1.  Guarantee strict serializable isolation on commodity hardware
  2.  Scale to any cluster size
  3.  Achieve optimal latency

The approach taken by Spanner derivatives is rejected by (1) because they guarantee only Serializable isolation (they additionally fail (3)). From watching talks by YugaByte, and inferring from Cockroach’s panic-cluster-death under clock skew, this is clearly considered by everyone to be undesirable but necessary to achieve scalability.

The approach taken by FaunaDB (Calvin) is rejected by (2) because its sequencing layer requires a global leader process for the cluster, which is incompatible with Cassandra’s scalability requirements. It additionally fails (3) for global clients.

Two phase commit fails (3). As an aside, AFAICT DynamoDB is today a Spanner clone for its multi-key transaction functionality, not 2PC.

Systems such as RAMP with even weaker isolation are not considered for the simple reason that they do not even claim to meet (1).

If we want to additionally offer weaker isolation levels than Serializable, such as that provided by the recent RAMP-TAO paper, Cassandra is likely able to support multiple distinct transaction layers that operate independently. I would encourage you to file a CEP to explore how we can meet these distinct use cases, but I consider them to be niche. I expect that a majority of our user base desire strict serializable isolation, and certainly no less than serializable isolation, to augment the existing weaker isolation offered by quorum reads and writes.

I would tangentially note that we are not an AP database under normal recommended operation. A minority in any network partition cannot reach QUORUM, so under recommended usage we are a high-availability leaderless CP database.
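
The QUORUM point above comes down to simple arithmetic, sketched here in
Python. The function names are hypothetical and this is not Cassandra code;
it only assumes the usual definition of QUORUM as a strict majority of
replicas, floor(RF / 2) + 1.

```python
def quorum(replication_factor):
    """Smallest strict majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def can_reach_quorum(reachable_replicas, replication_factor):
    """True if one side of a partition still holds a majority of replicas."""
    return reachable_replicas >= quorum(replication_factor)
```

With RF=3, a side of a partition that can reach only one replica falls short
of the two needed, so QUORUM reads and writes fail there, while the side with
two replicas keeps serving requests.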

From: Jonathan Ellis <jb...@gmail.com>
Date: Tuesday, 21 September 2021 at 23:45
To: dev <de...@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
Benedict, thanks for taking the lead in putting this together. Since
Cassandra is the only relevant database today designed around a leaderless
architecture, it's quite likely that we'll be better served with a custom
transaction design instead of trying to retrofit one from CP systems.

The whitepaper here is a good description of the consensus algorithm itself
as well as its robustness and stability characteristics, and its comparison
with other state-of-the-art consensus algorithms is very useful.  In the
context of Cassandra, where a consensus algorithm is only part of what will
be implemented, I'd like to see a more complete evaluation of the
transactional side of things as well, including performance characteristics
as well as the types of transactions that can be supported and at least a
general idea of what it would look like applied to Cassandra. This will
allow the PMC to make a more informed decision about what tradeoffs are
best for the entire long-term project of first supplementing and ultimately
replacing LWT.

(Allowing users to mix LWT and AP Cassandra operations against the same
rows was probably a mistake, so in contrast with LWT we’re not looking for
something fast enough for occasional use but rather something within a
reasonable factor of AP operations, appropriate to being the only way to
interact with tables declared as such.)

Besides Accord, this should cover

- Calvin and FaunaDB
- A Spanner derivative (no opinion on whether that should be Cockroach or
Yugabyte, I don’t think it’s necessary to cover both)
- A 2PC implementation (the Accord paper mentions DynamoDB but I suspect
there is more public information about MongoDB)
- RAMP

Here’s an example of what I mean:

=Calvin=

Approach: global consensus (Paxos in Calvin, Raft in FaunaDB) to order
transactions, then replicas execute the transactions independently with no
further coordination.  No SPOF.  Transactions are batched by each sequencer
to keep this from becoming a bottleneck.

Performance: Calvin paper (published 2012) reports linear scaling of TPC-C
New Order up to 500,000 transactions/s on 100 machines (EC2 XL machines
with 7GB ram and 8 virtual cores).  Note that TPC-C New Order is composed
of four reads and four writes, so this is effectively 2M reads and 2M
writes as we normally measure them in C*.

Calvin supports mixed read/write transactions, but because the transaction
execution logic requires knowing all partition keys in advance to ensure
that all replicas can reproduce the same results with no coordination,
reads against non-PK predicates must be done ahead of time (transparently,
by the server) to determine the set of keys, and this must be retried if
the set of rows affected is updated before the actual transaction executes.

Batching and global consensus add latency -- 100ms in the Calvin paper and
apparently about 50ms in FaunaDB.  Glass half full: all transactions
(including multi-partition updates) are equally performant in Calvin since
the coordination is handled up front in the sequencing step.  Glass half
empty: even single-row reads and writes have to pay the full coordination
cost.  Fauna has optimized this away for reads but I am not aware of a
description of how they changed the design to allow this.

Functionality and limitations: since the entire transaction must be known
in advance to allow coordination-less execution at the replicas, Calvin
cannot support interactive transactions at all.  FaunaDB mitigates this by
allowing server-side logic to be included, but a Calvin approach will never
be able to offer SQL compatibility.

Guarantees: Calvin transactions are strictly serializable.  There is no
additional complexity or performance hit to generalizing to multiple
regions, apart from the speed of light.  And since Calvin is already paying
a batching latency penalty, this is less painful than for other systems.

Application to Cassandra: B-.  Distributed transactions are handled by the
sequencing and scheduling layers, which are leaderless, and Calvin’s
requirements for the storage layer are easily met by C*.  But Calvin also
requires a global consensus protocol and LWT is almost certainly not
sufficiently performant, so this would require ZK or etcd (reasonable for a
library approach but not for replacing LWT in C* itself), or an
implementation of Accord.  I don’t believe Calvin would require additional
table-level metadata in Cassandra.

--
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by "benedict@apache.org" <be...@apache.org>.

FWIW I retract this – looking again at the blog post I don’t see adequate reason to infer they are using a leaderless approach. On balance I expect Fauna is still using a stable leader. Do you have reason to believe they are now leaderless?

From: benedict@apache.org <be...@apache.org>
Date: Wednesday, 22 September 2021 at 04:19
To: dev@cassandra.apache.org <de...@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
Demonstrating how subtle, complex and difficult to pin down this topic is, Fauna’s recent blog post implies they may have migrated to a leaderless sequencing protocol (an earlier blog post made clear they used a leader process). However, Calvin still assumes a global sequencing shard, so this only modifies latency for clients, i.e. goal (3). Whether they have also removed Calvin’s single-shard linearization of transactions is unclear; there is no public information to suggest that they have met goal (1). With this the protocol would in essence begin to look a lot like Accord, and perhaps they are moving towards a similar approach.

On Sun, Sep 5, 2021 at 9:33 AM benedict@apache.org <be...@apache.org>
wrote:

> Wiki:
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-15%3A+General+Purpose+Transactions
> Whitepaper:
> https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf
> <
> https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf?version=1&modificationDate=1630847736966&api=v2
> >
> Prototype: https://github.com/belliottsmith/accord
>
> Hi everyone, I’d like to propose this CEP for adoption by the community.
>
> Cassandra has benefitted from LWTs for many years, but application
> developers that want to ensure consistency for complex operations must
> either accept the scalability bottleneck of serializing all related state
> through a single partition, or layer a complex state machine on top of the
> database. These are sophisticated and costly activities that our users
> should not be expected to undertake. Since distributed databases are
> beginning to offer distributed transactions with fewer caveats, it is past
> time for Cassandra to do so as well.
>
> This CEP proposes the use of several novel techniques that build upon
> research (that followed EPaxos) to deliver (non-interactive) general
> purpose distributed transactions. The approach is outlined in the wikipage
> and in more detail in the linked whitepaper. Importantly, by adopting this
> approach we will be the _only_ distributed database to offer global,
> scalable, strict serializable transactions in one wide area round-trip.
> This would represent a significant improvement in the state of the art,
> both in the academic literature and in commercial or open source offerings.
>
> This work has been partially realised in a prototype. This partial
> prototype has been verified against Jepsen.io’s Maelstrom library and
> dedicated in-tree strict serializability verification tools, but much work
> remains for the work to be production capable and integrated into Cassandra.
>
> I propose including the prototype in the project as a new source
> repository, to be developed as a standalone library for integration into
> Cassandra. I hope the community sees the important value proposition of
> this proposal, and will adopt the CEP after this discussion, so that the
> library and its integration into Cassandra can be developed in parallel and
> with the involvement of the wider community.
>

--
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by "benedict@apache.org" <be...@apache.org>.

Demonstrating how subtle, complex and difficult to pin down this topic is, Fauna’s recent blog post implies they may have migrated to a leaderless sequencing protocol (an earlier blog post made clear they used a leader process). However, Calvin still assumes a global sequencing shard, so this only modifies latency for clients, i.e. goal (3). Whether they have also removed Calvin’s single-shard linearization of transactions is unclear; there is no public information to suggest that they have met goal (1). With that change the protocol would in essence begin to look a lot like Accord, and perhaps they are moving towards a similar approach.

From: benedict@apache.org <be...@apache.org>
Date: Wednesday, 22 September 2021 at 03:52
To: dev@cassandra.apache.org <de...@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
Hi Jonathan,

These other systems are incompatible with the goals of the CEP. I do discuss them (besides 2PC) in both the whitepaper and the CEP, and will summarise that discussion below. A true and accurate comparison of these other systems is essentially intractable, as there are complex subtleties to each flavour, and those who are interested would be better served by performing their own research.

I think it is more productive to focus on what we want to achieve as a community. If you believe the goals of this CEP are wrong for the project, let’s focus on that. If you want to compare and contrast specific facets of alternative systems that you consider to be preferable in some dimension, let’s do that here or in a Q&A as proposed by Joey.

The relevant goals are that we:

  1.  Guarantee strict serializable isolation on commodity hardware
  2.  Scale to any cluster size
  3.  Achieve optimal latency

The approach taken by Spanner derivatives is rejected by (1) because they guarantee only Serializable isolation (they additionally fail (3)). From watching talks by YugaByte, and inferring from Cockroach’s panic-cluster-death under clock skew, this is clearly considered by everyone to be undesirable but necessary to achieve scalability.

The approach taken by FaunaDB (Calvin) is rejected by (2) because its sequencing layer requires a global leader process for the cluster, which is incompatible with Cassandra’s scalability requirements. It additionally fails (3) for global clients.

Two phase commit fails (3). As an aside, AFAICT DynamoDB is today a Spanner clone for its multi-key transaction functionality, not 2PC.

Systems such as RAMP with even weaker isolation are not considered for the simple reason that they do not even claim to meet (1).

If we want to additionally offer weaker isolation levels than Serializable, such as that provided by the recent RAMP-TAO paper, Cassandra is likely able to support multiple distinct transaction layers that operate independently. I would encourage you to file a CEP to explore how we can meet these distinct use cases, but I consider them to be niche. I expect that a majority of our user base desire strict serializable isolation, and certainly no less than serializable isolation, to augment the existing weaker isolation offered by quorum reads and writes.

I would tangentially note that we are not an AP database under normal recommended operation. A minority in any network partition cannot reach QUORUM, so under recommended usage we are a high-availability leaderless CP database.
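
The quorum point can be checked with a few lines of arithmetic. The sketch below uses the usual majority formula, floor(RF / 2) + 1; the function names are hypothetical and it models only a two-way split of the replicas for one key.

```python
from typing import Tuple


def quorum(replication_factor: int) -> int:
    """Majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def partition_outcome(replication_factor: int, side_a: int) -> Tuple[bool, bool]:
    """Which side of a two-way network split can still reach QUORUM."""
    side_b = replication_factor - side_a
    needed = quorum(replication_factor)
    return side_a >= needed, side_b >= needed


print(partition_outcome(3, 1))
print(partition_outcome(5, 2))
print(partition_outcome(4, 2))
```

With RF=3 or RF=5 only the majority side can proceed, and with an even split of RF=4 neither side can, so at most one side of a partition ever makes progress at QUORUM.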

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by "benedict@apache.org" <be...@apache.org>.

I am of course more than happy to continue discussing CEP-15 with respect to the proposed goals, and queries about the proposed protocol. I hope people feel free to continue raising queries. If anybody disagrees with the goals or any specific part of the proposal on substantive (rather than aesthetic/structural) grounds I also remain very open to further discussion.

However, I think at this point it is reasonable to request that we engage with the proposal as defined, and in particular the goals that have been proposed. Those who wish for a different proposal can produce one so that it may be engaged with on the same terms.

From: benedict@apache.org <be...@apache.org>
Date: Friday, 1 October 2021 at 14:19
To: dev@cassandra.apache.org <de...@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
I think this is getting circular and unproductive. I am inclined to leave basic disagreements about whether the CEP specifies a feature for a vote. In my view the CEP specifies several features, both immediate ones for the user (ACID batches and multi-key LWTs) and developer-focused ones around the ground-breaking semantics that will be enabled.

The proposal as it stands today is exceptionally thorough, more so than any other CEP to date, and more so than any CEP is likely to be in the near future.

This is a Cassandra Enhancement *Proposal*, and at some point we have to engage with what is proposed, not what you might like to be proposed. Since it remains unclear to me what either you or Jonathan want to see as an alternative, at this point it would seem more productive to produce your own proposals for the community to consider. It is possible for multiple transaction systems to co-exist, if you feel this is necessary.



From: Paulo Motta <pa...@gmail.com>
Date: Friday, 1 October 2021 at 13:58
To: Cassandra DEV <de...@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
I share jbellis's feeling that this proposal seems to be focusing on the
protocol itself but lacking the actual feature that will use the
protocol, which IMO is a key element to discuss in a CEP.

It's similar to saying: hey I want to add this Tries Serialization Protocol
to Cassandra, but not providing specific details of how this protocol is
going to be used.

I think the right route for a CEP is to describe the feature that will be
added to the database and the protocol is a mere requirement of the
high-level feature, for example:

CEP: Add Trie-backed memtable
- Trie Serialization Protocol: implementation detail of the above CEP

What is the difficulty of taking this approach, picking one of the myriad
of features that will be enabled by Accord and using that as the initial
CEP to introduce the protocol to the database?

On Fri, Oct 1, 2021 at 08:37, benedict@apache.org <
benedict@apache.org> wrote:

> Actually, thinking about it again, the simple optimistic protocol would in
> fact guarantee system forward progress (i.e. independent of transaction
> formulation).
>
>
> From: benedict@apache.org <be...@apache.org>
> Date: Friday, 1 October 2021 at 09:14
> To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> Hi Jonathan,
>
> It would be great if we could achieve a bandwidth higher than 1-2 short
> emails per week. It remains unclear to me what your goal is, and it would
> help if you could make a statement like “I want Cassandra to be able to do
> X” so that we can respond directly to it. I am also available to have
> another call, in which we can have a back and forth, please feel free to
> propose a London-compatible time within the next week that is suitable for
> you.
>
> In my opinion we are at risk of veering off-topic, though. This CEP is not
> to deliver interactive transactions, and to my knowledge nobody is
> proposing a CEP for interactive transactions. So, for the CEP at hand the
> salient question seems: does this CEP prevent us from implementing
> interactive transactions with properties X, Y, Z in future? To which the
> answer is almost certainly no.
>
> However, to continue the discussion and respond directly to your queries,
> I believe we agree on the definition of an interactive transaction.
>
> Two protocols were loosely outlined. The first, using timestamps for
> optimistic concurrency control, would indeed involve the possibility of
> aborts. It would not however inherently adopt the issue of LWTs where no
> transaction is able to make progress. Whether or not progress is guaranteed
> (in a livelock-free sense) would depend on the structure of the
> transactions that were interfering.
>
> This approach has the advantage of being very simple to implement, so that
> we could realistically support interactive transactions quite quickly. It
> has the additional advantage that transactions would execute very quickly
> by avoiding the WAN during construction, and as a result may in practice
> experience fewer aborts than protocols that guarantee livelock-freedom.
>
> The second protocol proposed using read/write intents and would be able to
> support almost any behaviour you want. We could even utilise pessimistic
> concurrency control, or anything in-between. This is its own huge design
> space, and discussion of this approach and the trade-offs that could be
> made is (in my opinion) entirely out of scope for this CEP.
>
>
> From: Jonathan Ellis <jb...@gmail.com>
> Date: Friday, 1 October 2021 at 05:00
> To: dev <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> The obstacle for me is you've provided a protocol but not a fully fleshed
> out architecture, so it's hard to fill in some of the blanks.  But it looks
> to me like optimistic concurrency control for interactive transactions
> applied to Accord would leave you in a LWT-like situation under fairly
> light contention where nobody actually makes progress due to retries.
>
> To make sure we're talking about the same thing, as Henrik pointed out,
> interactive transactions mean multiple round trips from the client within a
> transaction.  For example, here
> <
> https://github.com/apavlo/py-tpcc/blob/master/pytpcc/drivers/sqlitedriver.py#L213
> >
> is a simple implementation of the TPC-C New Order transaction.  The high
> level logic (via
> <
> https://courses.cs.washington.edu/courses/csep545/01wi/lectures/class1/tsld039.htm
> >)
> is,
>
>    1. Get records describing a warehouse, customer, & district
>    2. Update the district
>    3. Increment next available order number
>    4. Insert record into Order and New-Order tables
>    5. For 5-15 items, get Item record, get/update Stock record
>    6. Insert Order-Line Record
>
> As you can see, this requires a lot of client-side logic mixed in with the
> actual SQL commands.
>
>
> On Thu, Sep 30, 2021 at 2:30 AM benedict@apache.org <be...@apache.org>
> wrote:
>
> > Essentially this, although I think in practice we will need to track each
> > partition’s timestamp separately (or optionally for reduced conflicts,
> > each row or datum’s), and make them all part of the conditional application
> > of the transaction - at least for strict-serializability.
> >
> > The alternative is to insert read/write intents for the transaction
> > during each step, and to confirm they are still valid on commit, but this
> > approach would require a WAN round-trip for each step in the interactive
> > transaction, whereas the timestamp-validating approach can use a LAN
> > round-trip for each step besides the final one, and is also much simpler
> > to implement.
> >
> >
> > From: Blake Eggleston <be...@apple.com.INVALID>
> > Date: Thursday, 30 September 2021 at 05:47
> > To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > You could establish a lower timestamp bound and buffer transaction state
> > on the coordinator, then make the commit an operation that only applies
> > if all partitions involved haven’t been changed by a more recent timestamp.
> > You could also implement MVCC either in the storage layer or for some
> > period of time by buffering commits on each replica before applying.
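
A minimal single-process sketch of the timestamp-validating commit described in the quoted paragraph above. It is an illustration under stated assumptions, not Accord code: `Partition`, `InteractiveTxn` and `commit` are hypothetical names, timestamps are plain integers, and the conditional apply that would go through consensus is modelled as one local function call.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Partition:
    value: int = 0
    last_write_ts: int = 0


@dataclass
class InteractiveTxn:
    start_ts: int  # the lower timestamp bound
    writes: Dict[str, int] = field(default_factory=dict)  # buffered on the coordinator
    touched: Set[str] = field(default_factory=set)

    def read(self, store: Dict[str, Partition], key: str) -> int:
        self.touched.add(key)
        if key in self.writes:
            return self.writes[key]  # read your own buffered write
        return store[key].value

    def write(self, key: str, value: int) -> None:
        self.touched.add(key)
        self.writes[key] = value


def commit(store: Dict[str, Partition], txn: InteractiveTxn, commit_ts: int) -> bool:
    """Apply buffered writes only if no touched partition changed after start_ts."""
    if any(store[key].last_write_ts > txn.start_ts for key in txn.touched):
        return False  # abort; the client would retry
    for key, value in txn.writes.items():
        store[key].value = value
        store[key].last_write_ts = commit_ts
    return True


store = {"x": Partition(10, 1), "y": Partition(0, 1)}
t1 = InteractiveTxn(start_ts=5)
t2 = InteractiveTxn(start_ts=5)

t1.write("y", t1.read(store, "x") + 1)  # t1 depends on x
t2.write("x", 20)                       # t2 overwrites x

first_ok = commit(store, t2, commit_ts=6)
second_ok = commit(store, t1, commit_ts=7)
print(first_ok, second_ok, store["x"].value, store["y"].value)
```

Here `t2` commits and `t1` aborts, because `x` was written at timestamp 6, after `t1`'s lower bound of 5. Tracking a timestamp per row or datum instead of per partition would narrow what counts as a conflict.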
> >
> > > On Sep 29, 2021, at 6:18 PM, Jonathan Ellis <jb...@gmail.com> wrote:
> > >
> > > How are interactive transactions possible with Accord?
> > >
> > >
> > >
> > > On Tue, Sep 21, 2021 at 11:56 PM benedict@apache.org <benedict@apache.org>
> > > wrote:
> > >
> > >> Could you explain why you believe this trade-off is necessary? We can
> > >> support full SQL just fine with Accord, and I hope that we eventually
> > >> do so.
> > >>
> > >> This domain is incredibly complex, so it is easy to reach wrong
> > >> conclusions. I would invite you again to propose a system for discussion
> > >> that you think offers something Accord is unable to, and that you
> > >> consider desirable, and we can work from there.
> > >>
> > >> To pre-empt some possible discussions, I am not aware of anything we
> > >> cannot do with Accord that we could do with either Calvin or Spanner.
> > >> Interactive transactions are possible on top of Accord, as are
> > >> transactions with an unknown read/write set. In each case the only cost
> > >> is that they would use optimistic concurrency control, which is no worse
> > >> than Spanner derivatives anyway (which I have to assume is your benchmark
> > >> in this regard). I do not expect to deliver either functionality
> > >> initially, but Accord takes us most of the way there for both.
> > >>
> > >>
> > >> From: Jonathan Ellis <jb...@gmail.com>
> > >> Date: Wednesday, 22 September 2021 at 05:36
> > >> To: dev <de...@cassandra.apache.org>
> > >> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > >> Right, I'm looking for exactly a discussion on the high level goals.
> > >> Instead of saying "here's the goals and we ruled out X because Y" we
> > >> should start with a discussion around, "Approach A allows X and W,
> > >> approach B allows Y and Z" and decide together what the goals should be
> > >> and what we are willing to trade to get those goals, e.g., are we willing
> > >> to give up global strict serializability to get the ability to support
> > >> full SQL. Both of these are nice to have!
> > >>
> > >>> --
> > >>> Jonathan Ellis
> > >>> co-founder, http://www.datastax.com
> > >>> @spyced

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by "benedict@apache.org" <be...@apache.org>.

Jonathan,

This work will only determine Cassandra’s future if no other contributors choose to take a different route in future. If in future the community decides this work is incompatible with its direction, it remains in the community’s power to remove the facility, or to make it optional.

OSS is a living thing, and this CEP will shape the future of the community only by virtue of the work that I and others will do. You are equally capable of investing this time and effort.

Today, this is the only CEP of the kind on offer. If another competing proposal were to be made, we could either work to reconcile them, or to ensure they may co-exist. You cannot, however, expect to impose your _goals_ on the work that I and others will undertake. That is not how the community works.

Since we are going around in circles, I propose a simple majority vote to establish if the community endorses the stated goals of the CEP.

From: Jonathan Ellis <jb...@gmail.com>
Date: Wednesday, 6 October 2021 at 16:05
To: dev <de...@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
The problem that I keep pointing out is that you've created this CEP for
Accord without first getting consensus that the goals and the tradeoffs it
makes to achieve those goals (and that it will impose on future work around
transactions) are the right ones for Cassandra long term.

At this point I'm done repeating myself.  For the convenience of anyone
following this thread intermittently, I'll quote my first reply on this
thread to illustrate the kind of discussion I'd like to have.

-----

The whitepaper here is a good description of the consensus algorithm itself
as well as its robustness and stability characteristics, and its comparison
with other state-of-the-art consensus algorithms is very useful.  In the
context of Cassandra, where a consensus algorithm is only part of what will
be implemented, I'd like to see a more complete evaluation of the
transactional side of things as well, including performance characteristics
as well as the types of transactions that can be supported and at least a
general idea of what it would look like applied to Cassandra. This will
allow the PMC to make a more informed decision about what tradeoffs are
best for the entire long-term project of first supplementing and ultimately
replacing LWT.

(Allowing users to mix LWT and AP Cassandra operations against the same
rows was probably a mistake, so in contrast with LWT we’re not looking for
something fast enough for occasional use but rather something within a
reasonable factor of AP operations, appropriate to being the only way to
interact with tables declared as such.)

Besides Accord, this should cover

- Calvin and FaunaDB
- A Spanner derivative (no opinion on whether that should be Cockroach or
Yugabyte, I don’t think it’s necessary to cover both)
- A 2PC implementation (the Accord paper mentions DynamoDB but I suspect
there is more public information about MongoDB)
- RAMP

Here’s an example of what I mean:

=Calvin=

Approach: global consensus (Paxos in Calvin, Raft in FaunaDB) to order
transactions, then replicas execute the transactions independently with no
further coordination.  No SPOF.  Transactions are batched by each sequencer
to keep this from becoming a bottleneck.

Performance: Calvin paper (published 2012) reports linear scaling of TPC-C
New Order up to 500,000 transactions/s on 100 machines (EC2 XL machines
with 7GB RAM and 8 virtual cores).  Note that TPC-C New Order is composed
of four reads and four writes, so this is effectively 2M reads and 2M
writes as we normally measure them in C*.

Calvin supports mixed read/write transactions, but because the transaction
execution logic requires knowing all partition keys in advance to ensure
that all replicas can reproduce the same results with no coordination,
reads against non-PK predicates must be done ahead of time (transparently,
by the server) to determine the set of keys, and this must be retried if
the set of rows affected is updated before the actual transaction executes.

Batching and global consensus add latency -- 100ms in the Calvin paper and
apparently about 50ms in FaunaDB.  Glass half full: all transactions
(including multi-partition updates) are equally performant in Calvin since
the coordination is handled up front in the sequencing step.  Glass half
empty: even single-row reads and writes have to pay the full coordination
cost.  Fauna has optimized this away for reads but I am not aware of a
description of how they changed the design to allow this.

Functionality and limitations: since the entire transaction must be known
in advance to allow coordination-less execution at the replicas, Calvin
cannot support interactive transactions at all.  FaunaDB mitigates this by
allowing server-side logic to be included, but a Calvin approach will never
be able to offer SQL compatibility.

Guarantees: Calvin transactions are strictly serializable.  There is no
additional complexity or performance hit to generalizing to multiple
regions, apart from the speed of light.  And since Calvin is already paying
a batching latency penalty, this is less painful than for other systems.

Application to Cassandra: B-.  Distributed transactions are handled by the
sequencing and scheduling layers, which are leaderless, and Calvin’s
requirements for the storage layer are easily met by C*.  But Calvin also
requires a global consensus protocol and LWT is almost certainly not
sufficiently performant, so this would require ZK or etcd (reasonable for a
library approach but not for replacing LWT in C* itself), or an
implementation of Accord.  I don’t believe Calvin would require additional
table-level metadata in Cassandra.

On Wed, Oct 6, 2021 at 9:53 AM benedict@apache.org <be...@apache.org>
wrote:

> The problem with dropping a patch on Jira is that there is no opportunity
> to point out problems, either with the fundamental approach or with the
> specific implementation. So please point out some problems I can engage
> with!
>
>
> From: Jonathan Ellis <jb...@gmail.com>
> Date: Wednesday, 6 October 2021 at 15:48
> To: dev <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> On Wed, Oct 6, 2021 at 9:21 AM benedict@apache.org <be...@apache.org>
> wrote:
>
> > The goals of the CEP are stated clearly, and these were the goals we had
> > going into the (multi-month) research project we undertook before
> > proposing this CEP. These goals are necessarily value judgements, so we
> > cannot expect that everyone will agree that they are optimal.
> >
>
> Right, so I'm saying that this is exactly the most important thing to get
> consensus on, and creating a CEP for a protocol to achieve goals that you
> have not discussed with the community is the CEP equivalent of dropping a
> patch on Jira without discussing its goals either.
>
> That's why our conversations haven't gone anywhere, because I keep saying
> "we need to discuss the goals and tradeoffs", and I'll give an example of what
> I mean, and you keep addressing the examples (sometimes very shallowly, "it
> would be possible to X" or "Y could be done as an optimization") while
> ignoring the request to open a discussion around the big picture.
>

--
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by Jonathan Ellis <jb...@gmail.com>.

The problem that I keep pointing out is that you've created this CEP for
Accord without first getting consensus that the goals and the tradeoffs it
makes to achieve those goals (and that it will impose on future work around
transactions) are the right ones for Cassandra long term.

At this point I'm done repeating myself.  For the convenience of anyone
following this thread intermittently, I'll quote my first reply on this
thread to illustrate the kind of discussion I'd like to have.

-----

The whitepaper here is a good description of the consensus algorithm itself
as well as its robustness and stability characteristics, and its comparison
with other state-of-the-art consensus algorithms is very useful.  In the
context of Cassandra, where a consensus algorithm is only part of what will
be implemented, I'd like to see a more complete evaluation of the
transactional side of things as well, including performance characteristics
as well as the types of transactions that can be supported and at least a
general idea of what it would look like applied to Cassandra. This will
allow the PMC to make a more informed decision about what tradeoffs are
best for the entire long-term project of first supplementing and ultimately
replacing LWT.

(Allowing users to mix LWT and AP Cassandra operations against the same
rows was probably a mistake, so in contrast with LWT we’re not looking for
something fast enough for occasional use but rather something within a
reasonable factor of AP operations, appropriate to being the only way to
interact with tables declared as such.)

Besides Accord, this should cover

- Calvin and FaunaDB
- A Spanner derivative (no opinion on whether that should be Cockroach or
Yugabyte, I don’t think it’s necessary to cover both)
- A 2PC implementation (the Accord paper mentions DynamoDB but I suspect
there is more public information about MongoDB)
- RAMP

Here’s an example of what I mean:

=Calvin=

Approach: global consensus (Paxos in Calvin, Raft in FaunaDB) to order
transactions, then replicas execute the transactions independently with no
further coordination.  No SPOF.  Transactions are batched by each sequencer
to keep this from becoming a bottleneck.
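
To make the approach concrete, here is a minimal sketch of "agree on an order first, then execute independently". It is not Calvin's or FaunaDB's code: the Txn type, the (epoch, sequencer id) ordering rule, and the additive writes are all assumptions chosen for illustration, and the consensus step is replaced by a plain sort.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Txn:
    """Hypothetical transaction whose full write set is known in advance."""
    txn_id: int
    writes: tuple  # ((key, delta), ...)


def sequence(batches):
    """Stand-in for global consensus: merge sequencer batches into one log.

    Assumed rule: batches are ordered by (epoch, sequencer_id) and each
    batch keeps its internal order. A real system would agree on this
    order with Paxos or Raft rather than a local sort.
    """
    log = []
    for _key, batch in sorted(batches.items()):
        log.extend(batch)
    return log


def execute(log):
    """Apply the agreed log deterministically, with no further coordination."""
    state = {}
    for txn in log:
        for key, delta in txn.writes:
            state[key] = state.get(key, 0) + delta
    return state


# Three batches from two hypothetical sequencers, "a" and "b".
batches = {
    (1, "b"): [Txn(3, (("x", 5),))],
    (1, "a"): [Txn(1, (("x", 1), ("y", 2)))],
    (2, "a"): [Txn(2, (("y", -1),))],
}
log = sequence(batches)
replica_1 = execute(log)
replica_2 = execute(log)
print(replica_1 == replica_2)  # both replicas replay the same log
```

Because every replica replays the same log with deterministic logic, the replicas converge without talking to each other after the sequencing step.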

Performance: Calvin paper (published 2012) reports linear scaling of TPC-C
New Order up to 500,000 transactions/s on 100 machines (EC2 XL machines
with 7GB RAM and 8 virtual cores).  Note that TPC-C New Order is composed
of four reads and four writes, so this is effectively 2M reads and 2M
writes as we normally measure them in C*.
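
The conversion above is simple enough to spell out. The sketch below uses only the figures quoted in this paragraph; the per-machine number is a plain average added for illustration, not a figure from the paper, and it ignores replication.

```python
# Figures quoted from the Calvin paper in the paragraph above.
TXNS_PER_SECOND = 500_000
MACHINES = 100
# As stated above: one New Order transaction is four reads and four writes.
READS_PER_TXN = 4
WRITES_PER_TXN = 4


def operation_rates(txns_per_second, reads_per_txn, writes_per_txn):
    """Convert a transaction rate into separate read and write rates."""
    return txns_per_second * reads_per_txn, txns_per_second * writes_per_txn


reads, writes = operation_rates(TXNS_PER_SECOND, READS_PER_TXN, WRITES_PER_TXN)
print(f"{reads:,} reads/s and {writes:,} writes/s cluster-wide")
# Simple average across machines (an illustrative assumption, not a paper result).
print(f"{reads // MACHINES:,} reads/s and {writes // MACHINES:,} writes/s per machine")
```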

Calvin supports mixed read/write transactions, but because the transaction
execution logic requires knowing all partition keys in advance to ensure
that all replicas can reproduce the same results with no coordination,
reads against non-PK predicates must be done ahead of time (transparently,
by the server) to determine the set of keys, and this must be retried if
the set of rows affected is updated before the actual transaction executes.
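
A rough sketch of that read-ahead-and-retry loop follows. Everything in it is hypothetical: the in-memory table, the helper names, and the interleave hook that simulates a concurrent writer are stand-ins for illustration, not Calvin's actual mechanism.

```python
def find_keys(table, predicate):
    """Ahead-of-time read: which keys does the non-PK predicate select?"""
    return frozenset(key for key, row in table.items() if predicate(row))


def run_dependent_txn(table, predicate, update, interleave=None, max_attempts=5):
    """Run a transaction whose key set depends on a predicate.

    Returns the attempt number on which the transaction executed.
    """
    for attempt in range(1, max_attempts + 1):
        # 1. Determine the key set before the transaction is sequenced.
        expected_keys = find_keys(table, predicate)
        # 2. Hypothetical hook: other writers may run before execution.
        if interleave is not None:
            interleave(table, attempt)
        # 3. At execution time, re-check that the key set is unchanged.
        if find_keys(table, predicate) != expected_keys:
            continue  # affected rows changed, so retry with a fresh read
        for key in expected_keys:
            table[key] = update(table[key])
        return attempt
    raise RuntimeError("key set kept changing; giving up")


table = {1: {"status": "open", "n": 0}, 2: {"status": "closed", "n": 0}}


def concurrent_writer(tbl, attempt):
    # On the first attempt only, another client reopens row 2.
    if attempt == 1:
        tbl[2] = {**tbl[2], "status": "open"}


attempts = run_dependent_txn(
    table,
    predicate=lambda row: row["status"] == "open",
    update=lambda row: {**row, "n": row["n"] + 1},
    interleave=concurrent_writer,
)
print(attempts, table)
```

The retry is the cost of not knowing the keys up front: a predicate whose matching rows change often may need several attempts.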

Batching and global consensus add latency -- 100ms in the Calvin paper and
apparently about 50ms in FaunaDB.  Glass half full: all transactions
(including multi-partition updates) are equally performant in Calvin since
the coordination is handled up front in the sequencing step.  Glass half
empty: even single-row reads and writes have to pay the full coordination
cost.  Fauna has optimized this away for reads but I am not aware of a
description of how they changed the design to allow this.

Functionality and limitations: since the entire transaction must be known
in advance to allow coordination-less execution at the replicas, Calvin
cannot support interactive transactions at all.  FaunaDB mitigates this by
allowing server-side logic to be included, but a Calvin approach will never
be able to offer SQL compatibility.

Guarantees: Calvin transactions are strictly serializable.  There is no
additional complexity or performance hit to generalizing to multiple
regions, apart from the speed of light.  And since Calvin is already paying
a batching latency penalty, this is less painful than for other systems.

Application to Cassandra: B-.  Distributed transactions are handled by the
sequencing and scheduling layers, which are leaderless, and Calvin’s
requirements for the storage layer are easily met by C*.  But Calvin also
requires a global consensus protocol and LWT is almost certainly not
sufficiently performant, so this would require ZK or etcd (reasonable for a
library approach but not for replacing LWT in C* itself), or an
implementation of Accord.  I don’t believe Calvin would require additional
table-level metadata in Cassandra.

On Wed, Oct 6, 2021 at 9:53 AM benedict@apache.org <be...@apache.org>
wrote:

> The problem with dropping a patch on Jira is that there is no opportunity
> to point out problems, either with the fundamental approach or with the
> specific implementation. So please point out some problems I can engage
> with!
>
>
> From: Jonathan Ellis <jb...@gmail.com>
> Date: Wednesday, 6 October 2021 at 15:48
> To: dev <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> On Wed, Oct 6, 2021 at 9:21 AM benedict@apache.org <be...@apache.org>
> wrote:
>
> > The goals of the CEP are stated clearly, and these were the goals we had
> > going into the (multi-month) research project we undertook before
> > proposing this CEP. These goals are necessarily value judgements, so we
> > cannot expect that everyone will agree that they are optimal.
> >
>
> Right, so I'm saying that this is exactly the most important thing to get
> consensus on, and creating a CEP for a protocol to achieve goals that you
> have not discussed with the community is the CEP equivalent of dropping a
> patch on Jira without discussing its goals either.
>
> That's why our conversations haven't gone anywhere, because I keep saying
> "we need to discuss the goals and tradeoffs", and I'll give an example of what
> I mean, and you keep addressing the examples (sometimes very shallowly, "it
> would be possible to X" or "Y could be done as an optimization") while
> ignoring the request to open a discussion around the big picture.
>

-- 
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by "benedict@apache.org" <be...@apache.org>.

The problem with dropping a patch on Jira is that there is no opportunity to point out problems, either with the fundamental approach or with the specific implementation. So please point out some problems I can engage with!

From: Jonathan Ellis <jb...@gmail.com>
Date: Wednesday, 6 October 2021 at 15:48
To: dev <de...@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
On Wed, Oct 6, 2021 at 9:21 AM benedict@apache.org <be...@apache.org>
wrote:

> The goals of the CEP are stated clearly, and these were the goals we had
> going into the (multi-month) research project we undertook before proposing
> this CEP. These goals are necessarily value judgements, so we cannot expect
> that everyone will agree that they are optimal.
>

Right, so I'm saying that this is exactly the most important thing to get
consensus on, and creating a CEP for a protocol to achieve goals that you
have not discussed with the community is the CEP equivalent of dropping a
patch on Jira without discussing its goals either.

That's why our conversations haven't gone anywhere, because I keep saying
"we need to discuss the goals and tradeoffs", and I'll give an example of what
I mean, and you keep addressing the examples (sometimes very shallowly, "it
would be possible to X" or "Y could be done as an optimization") while
ignoring the request to open a discussion around the big picture.

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by Jonathan Ellis <jb...@gmail.com>.

On Wed, Oct 6, 2021 at 9:21 AM benedict@apache.org <be...@apache.org>
wrote:

> The goals of the CEP are stated clearly, and these were the goals we had
> going into the (multi-month) research project we undertook before proposing
> this CEP. These goals are necessarily value judgements, so we cannot expect
> that everyone will agree that they are optimal.
>

Right, so I'm saying that this is exactly the most important thing to get
consensus on, and creating a CEP for a protocol to achieve goals that you
have not discussed with the community is the CEP equivalent of dropping a
patch on Jira without discussing its goals either.

That's why our conversations haven't gone anywhere, because I keep saying
"we need to discuss the goals and tradeoffs", and I'll give an example of what
I mean, and you keep addressing the examples (sometimes very shallowly, "it
would be possible to X" or "Y could be done as an optimization") while
ignoring the request to open a discussion around the big picture.

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by "benedict@apache.org" <be...@apache.org>.

The goals of the CEP are stated clearly, and these were the goals we had going into the (multi-month) research project we undertook before proposing this CEP. These goals are necessarily value judgements, so we cannot expect that everyone will agree that they are optimal.

So far you have not engaged with these goals to state any specific disagreement. I have engaged with all of the trade-offs you imagined, and every specific concern you have raised. Despite a month having elapsed and a great deal of time spent answering your emails, this is the first confirmation I have that you are dissatisfied with my responses to you.

The role of the CEP is to advertise a project, allowing people to register their interest in collaborating, and for technical concerns to be stated in advance. So far you have expressed no specific technical concerns that I have not engaged with, and yet I have received no response to my engagements.

The role of the CEP is *not* to permit members of the community to dictate their preferences on the proposers, or to declare that the CEP is inadequate because it doesn’t meet their goals, or to demand additional work to explore others’ preferred research avenues on the topic.

You have to do some of the work here, Jonathan.

If you have an alternative approach, I continue to ask you to propose it so we may compare and contrast in a specific and technical manner. If you have any specific technical concerns I exhort you to raise them, so we may discuss them. If you dispute the goals, please make an argument as to why. If our goals are irreconcilable, file another CEP.

From: Jonathan Ellis <jb...@gmail.com>
Date: Wednesday, 6 October 2021 at 14:41
To: dev <de...@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
I've repeatedly explained why I'm unhappy: instead of starting with a
discussion of what API we want and what tradeoffs we should make to get it,
this CEP starts with a protocol and asks us to figure out what API we can
build with it.

Of course by API I mean, what kinds of CQL and SQL operations we can
perform, with what kinds of ACID semantics and what kinds of performance,
not "Result perform(Transaction transaction)".  And it's not simply SQL
syntax, either.  I realize that this could sound a little vague, but that's
why I gave an example of the kind of analysis I'm talking about in my first
reply.  Your responses have been to attempt to avoid the discussion
entirely ("the relevant goals are [mine]") or to declare it to be out of
scope.

The CEP process is intended to help get to alignment across the community
of PMC members, committers, and contributors on goals and outcomes before
starting to write code, not simply to bless a completed design.  That's
why we're going in circles here.

On Wed, Oct 6, 2021 at 2:12 AM benedict@apache.org <be...@apache.org>
wrote:

> We have discussed the API at length in this thread. The API primarily
> involves the semantics of the transactions, as besides this the API of a
> transaction is simply:
>
> Result perform(Transaction transaction)
>
> As discussed in follow-up to that email, a prototype API is specified
> alongside the prototype protocol. I am unsure what more you want than this,
> or the above, or the prior semantic discussions.
>
> It seems clear that you’re unhappy with the proposal, but it remains
> ambiguous as to why. Your emails are terse, infrequent and unclear. My
> responses receive no follow up from you, even to clarify if I have answered
> your query. Sometime later I seem to be able to expect a new unrelated
> problem that you are unhappy about. You have not yet responded to even one
> of my repeated offers to hop on a call to hash out any of your concerns,
> even if only to decline.
>
> This does not feel like constructive and respectful engagement to me, and
> I am losing interest.
>
>
>
> From: Jonathan Ellis <jb...@gmail.com>
> Date: Wednesday, 6 October 2021 at 00:02
> To: dev <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> I honestly can't understand the perspective that on the one hand, you're
> asking for approval of a specific protocol as part of the CEP, but on the
> other, you think discussion of the APIs this will enable is not warranted.
> Surely we need agreement on what APIs we're trying to build, before we
> discuss the protocols and architectures with which to build them.
>
> On Fri, Oct 1, 2021 at 9:34 AM benedict@apache.org <be...@apache.org>
> wrote:
>
> > > The current document details thoroughly the protocol but in my view
> > lacks to illustrate what specific API, methods, modules will become
> > available to developers
> >
> > With respect to this, in my view this kind of detail is not warranted
> > within a CEP. Software development is an exploratory process with respect
> > to structure, and these decisions will be made as the CEP progresses. If
> > these need to be specified upfront, then the purpose of a CEP – seeking
> > buy in – is invalidated, because the work must be complete before you
> > know the answers.
> >

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by Jonathan Ellis <jb...@gmail.com>.

I've repeatedly explained why I'm unhappy: instead of starting with a
discussion of what API we want and what tradeoffs we should make to get it,
this CEP starts with a protocol and asks us to figure out what API we can
build with it.

Of course by API I mean, what kinds of CQL and SQL operations we can
perform, with what kinds of ACID semantics and what kinds of performance,
not "Result perform(Transaction transaction)".  And it's not simply SQL
syntax, either.  I realize that this could sound a little vague, but that's
why I gave an example of the kind of analysis I'm talking about in my first
reply.  Your responses have been to attempt to avoid the discussion
entirely ("the relevant goals are [mine]") or to declare it to be out of
scope.

The CEP process is intended to help get to alignment across the community
of PMC members, committers, and contributors on goals and outcomes before
starting to write code, not simply to bless a completed design.  That's
why we're going in circles here.

On Wed, Oct 6, 2021 at 2:12 AM benedict@apache.org <be...@apache.org>
wrote:

> We have discussed the API at length in this thread. The API primarily
> involves the semantics of the transactions, as besides this the API of a
> transaction is simply:
>
> Result perform(Transaction transaction)
>
> As discussed in follow-up to that email, a prototype API is specified
> alongside the prototype protocol. I am unsure what more you want than this,
> or the above, or the prior semantic discussions.
>
> It seems clear that you’re unhappy with the proposal, but it remains
> ambiguous as to why. Your emails are terse, infrequent and unclear. My
> responses receive no follow up from you, even to clarify if I have answered
> your query. Sometime later I seem to be able to expect a new unrelated
> problem that you are unhappy about. You have not yet responded to even one
> of my repeated offers to hop on a call to hash out any of your concerns,
> even if only to decline.
>
> This does not feel like constructive and respectful engagement to me, and
> I am losing interest.
>
>
>
> From: Jonathan Ellis <jb...@gmail.com>
> Date: Wednesday, 6 October 2021 at 00:02
> To: dev <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> I honestly can't understand the perspective that on the one hand, you're
> asking for approval of a specific protocol as part of the CEP, but on the
> other, you think discussion of the APIs this will enable is not warranted.
> Surely we need agreement on what APIs we're trying to build, before we
> discuss the protocols and architectures with which to build them.
>
> On Fri, Oct 1, 2021 at 9:34 AM benedict@apache.org <be...@apache.org>
> wrote:
>
> > > The current document details thoroughly the protocol but in my view
> > lacks to illustrate what specific API, methods, modules will become
> > available to developers
> >
> > With respect to this, in my view this kind of detail is not warranted
> > within a CEP. Software development is an exploratory process with respect
> > to structure, and these decisions will be made as the CEP progresses. If
> > these need to be specified upfront, then the purpose of a CEP – seeking
> > buy in – is invalidated, because the work must be complete before you
> > know the answers.
> >

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by "benedict@apache.org" <be...@apache.org>.

We have discussed the API at length in this thread. The API primarily involves the semantics of the transactions, as besides this the API of a transaction is simply:

Result perform(Transaction transaction)

As discussed in follow-up to that email, a prototype API is specified alongside the prototype protocol. I am unsure what more you want than this, or the above, or the prior semantic discussions.

It seems clear that you’re unhappy with the proposal, but it remains ambiguous as to why. Your emails are terse, infrequent and unclear. My responses receive no follow up from you, even to clarify if I have answered your query. Sometime later I seem to be able to expect a new unrelated problem that you are unhappy about. You have not yet responded to even one of my repeated offers to hop on a call to hash out any of your concerns, even if only to decline.

This does not feel like constructive and respectful engagement to me, and I am losing interest.



From: Jonathan Ellis <jb...@gmail.com>
Date: Wednesday, 6 October 2021 at 00:02
To: dev <de...@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
I honestly can't understand the perspective that on the one hand, you're
asking for approval of a specific protocol as part of the CEP, but on the
other, you think discussion of the APIs this will enable is not warranted.
Surely we need agreement on what APIs we're trying to build, before we
discuss the protocols and architectures with which to build them.

On Fri, Oct 1, 2021 at 9:34 AM benedict@apache.org <be...@apache.org>
wrote:

> > The current document thoroughly details the protocol but in my view
> > does not illustrate which specific APIs, methods and modules will become
> > available to developers
>
> With respect to this, in my view this kind of detail is not warranted
> within a CEP. Software development is an exploratory process with respect
> to structure, and these decisions will be made as the CEP progresses. If
> these need to be specified upfront, then the purpose of a CEP – seeking buy
> in – is invalidated, because the work must be complete before you know the
> answers.
>
>
> From: benedict@apache.org <be...@apache.org>
> Date: Friday, 1 October 2021 at 15:31
> To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> From the CEP:
>
> Batches (including unconditional batches) on transactional tables will
> receive ACID properties, and grammatically correct conditional batch
> operations that would be rejected for operating over multiple CQL
> partitions will now be supported
>
>
> From: Paulo Motta <pa...@gmail.com>
> Date: Friday, 1 October 2021 at 15:30
> To: Cassandra DEV <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> Can you just say what palpable feature will be available once this CEP
> lands? This is still not clear to me (and perhaps to others) from
> the current CEP structure. The current document thoroughly details the
> protocol but in my view does not illustrate which specific APIs, methods and
> modules will become available to developers, how it fits into the larger
> picture and interacts with existing modules (if at all), and perhaps a few
> examples of how it can be used to build features on top.
>
> On Fri, 1 Oct 2021 at 11:10, benedict@apache.org <benedict@apache.org> wrote:
>
> > I’m not, though it might seem that way. I disagree with your views about
> > how CEPs should be structured. Since the CEP process was itself codified via
> > the CEP process, if you want to recodify how CEPs work, the correct way is
> > via the CEP process itself.
> >
> > The discussion is being drawn in multiple directions away from the CEP
> > itself, and I am trying to keep this particular thread focused on the
> > business at hand, not meta discussions around CEP structure that will no
> > doubt be unproductive given likely irreconcilable views about the topic,
> > nor discussions about other CEP that could have been.
> >
> > If you want to start a separate exploratory discussion thread about CEP
> > structure without filing a CEP feel free to do so.
> >
> >
> > From: Paulo Motta <pa...@gmail.com>
> > Date: Friday, 1 October 2021 at 15:04
> > To: Cassandra DEV <de...@cassandra.apache.org>
> > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > If you want to impose your views on CEP structure on others, please
> file
> > a CEP with the additional restrictions and guidance you want to impose
> and
> > start a discussion thread. I can then respond in detail to why I perceive
> > this approach to be flawed, in a dedicated context.
> >
> > This sounds very kafkaesque. You know I won't file a meta-CEP to change
> the
> > structure of CEP so you're just using this as an excuse to just shut the
> > discussion on the lack of clarity on what actual palpable feature will be
> > available once the CEP lands. :-)
> >
> > I'm just providing my humble feedback on how a CEP could be more
> digestible
> > and easier to consume from an external point of view, and this seems like
> > an appropriate and contextualized place to voice this opinion which is
> > perhaps shared by others.
> >
> > On Fri, 1 Oct 2021 at 10:55, benedict@apache.org <benedict@apache.org> wrote:
> >
> > > I disagree with you. However, this is the wrong forum to have a meta
> > > discussion about how CEP should be structured.
> > >
> > > If you want to impose your views on CEP structure on others, please
> file
> > a
> > > CEP with the additional restrictions and guidance you want to impose
> and
> > > start a discussion thread. I can then respond in detail to why I
> perceive
> > > this approach to be flawed, in a dedicated context.
> > >
> > >
> > > From: Paulo Motta <pa...@gmail.com>
> > > Date: Friday, 1 October 2021 at 14:48
> > > To: Cassandra DEV <de...@cassandra.apache.org>
> > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > >  The proposal as it stands today is exceptionally thorough, more so
> > than
> > > any other CEP to date, or any CEP is likely to be in the near future.
> > >
> > > The protocol is thoroughly described, but in my view CEP is a forum to
> > > discuss the high level architecture and plan for adding a full
> end-to-end
> > > enhancement to the database, breaking it into sub-CEPs if needed, as
> long
> > > as the full plan is known in advance, otherwise the community will not
> > have
> > > the context to judge the full extent and impact of the proposed
> > > enhancement.
> > >
> > > > Since it remains unclear to me what either yourself or Jonathan want
> to
> > > see as an alternative
> > >
> > > I would personally like to see something along these lines:
> > >
> > > CEP1: Add ACID-compliant atomic batches
> > > - UX changes needed: none, CQL provides the grammar we need.
> > > > - Distributed transaction protocol needed: Accord (link to white paper if
> > > > you want specific details about the protocol)
> > > - High-level architecture: what new components will be added, how
> > existing
> > > components will be modified, what new messages will be added, what new
> > > configuration knobs will be introduced, what are the milestones of the
> > > project, etc.
> > >
> > > CEP2: Make LWT faster and more reliable
> > > - UX changes needed: none
> > > - Distributed transaction protocol needed: Accord, already added by
> > > previous CEP.
> > > - High-level architecture: blablabla... and so on.
> > >
> > > On Fri, 1 Oct 2021 at 10:19, benedict@apache.org <benedict@apache.org> wrote:
> > >
> > > > I think this is getting circular and unproductive. Basic disagreements
> > > > about whether the CEP specifies a feature I am inclined to leave for a
> > > > vote. In my view the CEP specifies several features, both immediate ones
> > > > for the user (ACID batches and multi-key LWTs) and developer-focused ones
> > > > around ground-breaking semantics that will be enabled.
> > > >
> > > > The proposal as it stands today is exceptionally thorough, more so
> than
> > > > any other CEP to date, or any CEP is likely to be in the near future.
> > > >
> > > > This is a Cassandra Enhancement *Proposal*, and at some point we have
> > to
> > > > engage with what is proposed, not what you might like to be proposed.
> > > Since
> > > > it remains unclear to me what either yourself or Jonathan want to see
> > as
> > > an
> > > > alternative, at this point it would seem more productive to produce
> > your
> > > > own proposals for the community to consider. It is possible for
> > multiple
> > > > transaction systems to co-exist, if you feel this is necessary.
> > > >
> > > >
> > > >
> > > > From: Paulo Motta <pa...@gmail.com>
> > > > Date: Friday, 1 October 2021 at 13:58
> > > > To: Cassandra DEV <de...@cassandra.apache.org>
> > > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > I share jbellis's feeling that this proposal seems to be focusing
> > > > > on the protocol itself but lacking the actual feature that will use the
> > > > > protocol, which IMO is a key element to discuss in a CEP.
> > > >
> > > > It's similar to saying: hey I want to add this Tries Serialization
> > > Protocol
> > > > to Cassandra, but not providing specific details of how this protocol
> > is
> > > > going to be used.
> > > >
> > > > I think the right route for a CEP is to describe the feature that
> will
> > be
> > > > added to the database and the protocol is a mere requirement of the
> > > > high-level feature, for example:
> > > >
> > > > CEP: Add Trie-backed memtable
> > > > - Trie Serialization Protocol: implementation detail of the above CEP
> > > >
> > > > What is the difficulty of taking this approach, picking one of the
> > myriad
> > > > of features that will be enabled by Accord and using that as the
> > initial
> > > > CEP to introduce the protocol to the database?
> > > >
> > > > On Fri, 1 Oct 2021 at 08:37, benedict@apache.org <benedict@apache.org> wrote:
> > > >
> > > > > Actually, thinking about it again, the simple optimistic protocol
> > would
> > > > in
> > > > > fact guarantee system forward progress (i.e. independent of
> > transaction
> > > > > formulation).
> > > > >
> > > > >
> > > > > From: benedict@apache.org <be...@apache.org>
> > > > > Date: Friday, 1 October 2021 at 09:14
> > > > > To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> > > > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > Hi Jonathan,
> > > > >
> > > > > It would be great if we could achieve a bandwidth higher than 1-2
> > short
> > > > > emails per week. It remains unclear to me what your goal is, and it
> > > would
> > > > > help if you could make a statement like “I want Cassandra to be
> able
> > to
> > > > do
> > > > > X” so that we can respond directly to it. I am also available to
> have
> > > > > another call, in which we can have a back and forth, please feel
> free
> > > to
> > > > > propose a London-compatible time within the next week that is
> > suitable
> > > > for
> > > > > you.
> > > > >
> > > > > In my opinion we are at risk of veering off-topic, though. This CEP
> > is
> > > > not
> > > > > to deliver interactive transactions, and to my knowledge nobody is
> > > > > proposing a CEP for interactive transactions. So, for the CEP at
> hand
> > > the
> > > > > salient question seems: does this CEP prevent us from implementing
> > > > > interactive transactions with properties X, Y, Z in future? To
> which
> > > the
> > > > > answer is almost certainly no.
> > > > >
> > > > > However, to continue the discussion and respond directly to your
> > > queries,
> > > > > I believe we agree on the definition of an interactive transaction.
> > > > >
> > > > > Two protocols were loosely outlined. The first, using timestamps
> for
> > > > > optimistic concurrency control, would indeed involve the
> possibility
> > of
> > > > > aborts. It would not however inherently adopt the issue of LWTs
> where
> > > no
> > > > > transaction is able to make progress. Whether or not progress is
> > > > guaranteed
> > > > > (in a livelock-free sense) would depend on the structure of the
> > > > > transactions that were interfering.
> > > > >
> > > > > This approach has the advantage of being very simple to implement,
> so
> > > > that
> > > > > we could realistically support interactive transactions quite
> > quickly.
> > > It
> > > > > has the additional advantage that transactions would execute very
> > > quickly
> > > > > by avoiding the WAN during construction, and as a result may in
> > > practice
> > > > > experience fewer aborts than protocols that guarantee
> > livelock-freedom.
> > > > >
> > > > > The second protocol proposed using read/write intents and would be
> > able
> > > > to
> > > > > support almost any behaviour you want. We could even utilise
> > > pessimistic
> > > > > concurrency control, or anything in-between. This is its own huge
> > > design
> > > > > space, and discussion of this approach and the trade-offs that
> could
> > be
> > > > > made is (in my opinion) entirely out of scope for this CEP.
> > > > >
> > > > >
> > > > > From: Jonathan Ellis <jb...@gmail.com>
> > > > > Date: Friday, 1 October 2021 at 05:00
> > > > > To: dev <de...@cassandra.apache.org>
> > > > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > The obstacle for me is you've provided a protocol but not a fully
> > > fleshed
> > > > > out architecture, so it's hard to fill in some of the blanks.  But
> it
> > > > looks
> > > > > to me like optimistic concurrency control for interactive
> > transactions
> > > > > applied to Accord would leave you in a LWT-like situation under
> > fairly
> > > > > light contention where nobody actually makes progress due to
> retries.
> > > > >
> > > > > To make sure we're talking about the same thing, as Henrik pointed
> > out,
> > > > > interactive transactions mean multiple round trips from the client
> > > > within a
> > > > > transaction.  For example, here
> > > > > <
> > > > >
> > > >
> > >
> >
> https://github.com/apavlo/py-tpcc/blob/master/pytpcc/drivers/sqlitedriver.py#L213
> > > > > >
> > > > > is a simple implementation of the TPC-C New Order transaction.  The
> > > high
> > > > > level logic (via
> > > > > <
> > > > >
> > > >
> > >
> >
> https://courses.cs.washington.edu/courses/csep545/01wi/lectures/class1/tsld039.htm
> > > > > >)
> > > > > is,
> > > > >
> > > > >    1. Get records describing a warehouse, customer, & district
> > > > >    2. Update the district
> > > > >    3. Increment next available order number
> > > > >    4. Insert record into Order and New-Order tables
> > > > >    5. For 5-15 items, get Item record, get/update Stock record
> > > > >    6. Insert Order-Line Record
> > > > >
> > > > > As you can see, this requires a lot of client-side logic mixed in
> > with
> > > > the
> > > > > actual SQL commands.
> > > > >
> > > > >
> > > > > On Thu, Sep 30, 2021 at 2:30 AM benedict@apache.org <
> > > benedict@apache.org
> > > > >
> > > > > wrote:
> > > > >
> > > > > > Essentially this, although I think in practice we will need to
> > track
> > > > each
> > > > > > partition’s timestamp separately (or optionally for reduced
> > > conflicts,
> > > > > each
> > > > > > row or datum’s), and make them all part of the conditional
> > > application
> > > > of
> > > > > > the transaction - at least for strict-serializability.
> > > > > >
> > > > > > The alternative is to insert read/write intents for the
> transaction
> > > > > during
> > > > > > each step, and to confirm they are still valid on commit, but
> this
> > > > > approach
> > > > > > would require a WAN round-trip for each step in the interactive
> > > > > > transaction, whereas the timestamp-validating approach can use a
> > LAN
> > > > > > round-trip for each step besides the final one, and is also much
> > > > simpler
> > > > > to
> > > > > > implement.
> > > > > >
> > > > > >
> > > > > > From: Blake Eggleston <be...@apple.com.INVALID>
> > > > > > Date: Thursday, 30 September 2021 at 05:47
> > > > > > To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> > > > > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > > > You could establish a lower timestamp bound and buffer transaction state
> > > > > > > on the coordinator, then make the commit an operation that only applies if
> > > > > > > all partitions involved haven’t been changed by a more recent timestamp.
> > > > > > > You could also implement MVCC either in the storage layer or for some
> > > > > > > period of time by buffering commits on each replica before applying.
> > > > > >
> > > > > > > On Sep 29, 2021, at 6:18 PM, Jonathan Ellis <jbellis@gmail.com
> >
> > > > wrote:
> > > > > > >
> > > > > > > How are interactive transactions possible with Accord?
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Tue, Sep 21, 2021 at 11:56 PM benedict@apache.org <
> > > > > > benedict@apache.org>
> > > > > > > wrote:
> > > > > > >
> > > > > > >> Could you explain why you believe this trade-off is necessary?
> > We
> > > > can
> > > > > > >> support full SQL just fine with Accord, and I hope that we
> > > > eventually
> > > > > > do so.
> > > > > > >>
> > > > > > >> This domain is incredibly complex, so it is easy to reach
> wrong
> > > > > > >> conclusions. I would invite you again to propose a system for
> > > > > discussion
> > > > > > >> that you think offers something Accord is unable to, and that
> > you
> > > > > > consider
> > > > > > >> desirable, and we can work from there.
> > > > > > >>
> > > > > > > >> To pre-empt some possible discussions, I am not aware of anything we
> > > > > > > >> cannot do with Accord that we could do with either Calvin or Spanner.
> > > > > > > >> Interactive transactions are possible on top of Accord, as are transactions
> > > > > > > >> with an unknown read/write set. In each case the only cost is that they
> > > > > > > >> would use optimistic concurrency control, which is no worse than the Spanner
> > > > > > > >> derivatives anyway (which I have to assume is your benchmark in this
> > > > > > > >> regard). I do not expect to deliver either functionality initially, but
> > > > > > > >> Accord takes us most of the way there for both.
> > > > > > >>
> > > > > > >>
> > > > > > >> From: Jonathan Ellis <jb...@gmail.com>
> > > > > > >> Date: Wednesday, 22 September 2021 at 05:36
> > > > > > >> To: dev <de...@cassandra.apache.org>
> > > > > > >> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > > > >> Right, I'm looking for exactly a discussion on the high level goals.
> > > > > > > >> Instead of saying "here's the goals and we ruled out X because Y" we should
> > > > > > > >> start with a discussion around, "Approach A allows X and W, approach B
> > > > > > > >> allows Y and Z" and decide together what the goals should be and what
> > > > > > > >> we are willing to trade to get those goals, e.g., are we willing to give up
> > > > > > > >> global strict serializability to get the ability to support full SQL. Both
> > > > > > > >> of these are nice to have!
> > > > > > >>
> > > > > > >> On Tue, Sep 21, 2021 at 9:52 PM benedict@apache.org <
> > > > > > benedict@apache.org>
> > > > > > >> wrote:
> > > > > > >>
> > > > > > >>> Hi Jonathan,
> > > > > > >>>
> > > > > > >>> These other systems are incompatible with the goals of the
> > CEP. I
> > > > do
> > > > > > >>> discuss them (besides 2PC) in both the whitepaper and the
> CEP,
> > > and
> > > > > will
> > > > > > >>> summarise that discussion below. A true and accurate
> comparison
> > > of
> > > > > > these
> > > > > > >>> other systems is essentially intractable, as there are
> complex
> > > > > > subtleties
> > > > > > >>> to each flavour, and those who are interested would be better
> > > > served
> > > > > by
> > > > > > >>> performing their own research.
> > > > > > >>>
> > > > > > >>> I think it is more productive to focus on what we want to
> > achieve
> > > > as
> > > > > a
> > > > > > >>> community. If you believe the goals of this CEP are wrong for
> > the
> > > > > > >> project,
> > > > > > >>> let’s focus on that. If you want to compare and contrast
> > specific
> > > > > > facets
> > > > > > >> of
> > > > > > >>> alternative systems that you consider to be preferable in
> some
> > > > > > dimension,
> > > > > > >>> let’s do that here or in a Q&A as proposed by Joey.
> > > > > > >>>
> > > > > > >>> The relevant goals are that we:
> > > > > > >>>
> > > > > > >>>
> > > > > > >>>  1.  Guarantee strict serializable isolation on commodity
> > > hardware
> > > > > > >>>  2.  Scale to any cluster size
> > > > > > >>>  3.  Achieve optimal latency
> > > > > > >>>
> > > > > > >>> The approach taken by Spanner derivatives is rejected by (1)
> > > > because
> > > > > > they
> > > > > > >>> guarantee only Serializable isolation (they additionally fail
> > > (3)).
> > > > > > From
> > > > > > >>> watching talks by YugaByte, and inferring from Cockroach’s
> > > > > > >>> panic-cluster-death under clock skew, this is clearly
> > considered
> > > by
> > > > > > >>> everyone to be undesirable but necessary to achieve
> > scalability.
> > > > > > >>>
> > > > > > >>> The approach taken by FaunaDB (Calvin) is rejected by (2)
> > because
> > > > its
> > > > > > >>> sequencing layer requires a global leader process for the
> > > cluster,
> > > > > > which
> > > > > > >> is
> > > > > > >>> incompatible with Cassandra’s scalability requirements. It
> > > > > additionally
> > > > > > >>> fails (3) for global clients.
> > > > > > >>>
> > > > > > >>> Two phase commit fails (3). As an aside, AFAICT DynamoDB is
> > > today a
> > > > > > >>> Spanner clone for its multi-key transaction functionality,
> not
> > > 2PC.
> > > > > > >>>
> > > > > > >>> Systems such as RAMP with even weaker isolation are not
> > > considered
> > > > > for
> > > > > > >> the
> > > > > > >>> simple reason that they do not even claim to meet (1).
> > > > > > >>>
> > > > > > >>> If we want to additionally offer weaker isolation levels than
> > > > > > >>> Serializable, such as that provided by the recent RAMP-TAO
> > paper,
> > > > > > >> Cassandra
> > > > > > >>> is likely able to support multiple distinct transaction
> layers
> > > that
> > > > > > >> operate
> > > > > > >>> independently. I would encourage you to file a CEP to explore
> > how
> > > > we
> > > > > > can
> > > > > > >>> meet these distinct use cases, but I consider them to be
> > niche. I
> > > > > > expect
> > > > > > >>> that a majority of our user base desire strict serializable
> > > > > isolation,
> > > > > > >> and
> > > > > > >>> certainly no less than serializable isolation, to augment the
> > > > > existing
> > > > > > >>> weaker isolation offered by quorum reads and writes.
> > > > > > >>>
> > > > > > >>> I would tangentially note that we are not an AP database
> under
> > > > normal
> > > > > > >>> recommended operation. A minority in any network partition
> > cannot
> > > > > reach
> > > > > > >>> QUORUM, so under recommended usage we are a high-availability
> > > > > > leaderless
> > > > > > >> CP
> > > > > > >>> database.
> > > > > > >>>
> > > > > > >>>
> > > > > > >>> From: Jonathan Ellis <jb...@gmail.com>
> > > > > > >>> Date: Tuesday, 21 September 2021 at 23:45
> > > > > > >>> To: dev <de...@cassandra.apache.org>
> > > > > > >>> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > > >>> Benedict, thanks for taking the lead in putting this
> together.
> > > > Since
> > > > > > >>> Cassandra is the only relevant database today designed
> around a
> > > > > > >> leaderless
> > > > > > >>> architecture, it's quite likely that we'll be better served
> > with
> > > a
> > > > > > custom
> > > > > > >>> transaction design instead of trying to retrofit one from CP
> > > > systems.
> > > > > > >>>
> > > > > > >>> The whitepaper here is a good description of the consensus
> > > > algorithm
> > > > > > >> itself
> > > > > > >>> as well as its robustness and stability characteristics, and
> > its
> > > > > > >> comparison
> > > > > > >>> with other state-of-the-art consensus algorithms is very
> > useful.
> > > > In
> > > > > > the
> > > > > > >>> context of Cassandra, where a consensus algorithm is only
> part
> > of
> > > > > what
> > > > > > >> will
> > > > > > >>> be implemented, I'd like to see a more complete evaluation of
> > the
> > > > > > >>> transactional side of things as well, including performance
> > > > > > >> characteristics
> > > > > > >>> as well as the types of transactions that can be supported
> and
> > at
> > > > > > least a
> > > > > > >>> general idea of what it would look like applied to Cassandra.
> > > This
> > > > > will
> > > > > > >>> allow the PMC to make a more informed decision about what
> > > tradeoffs
> > > > > are
> > > > > > >>> best for the entire long-term project of first supplementing
> > and
> > > > > > >> ultimately
> > > > > > >>> replacing LWT.
> > > > > > >>>
> > > > > > >>> (Allowing users to mix LWT and AP Cassandra operations
> against
> > > the
> > > > > same
> > > > > > >>> rows was probably a mistake, so in contrast with LWT we’re
> not
> > > > > looking
> > > > > > >> for
> > > > > > >>> something fast enough for occasional use but rather something
> > > > within
> > > > > a
> > > > > > >>> reasonable factor of AP operations, appropriate to being the
> > only
> > > > way
> > > > > > to
> > > > > > >>> interact with tables declared as such.)
> > > > > > >>>
> > > > > > >>> Besides Accord, this should cover
> > > > > > >>>
> > > > > > >>> - Calvin and FaunaDB
> > > > > > >>> - A Spanner derivative (no opinion on whether that should be
> > > > > Cockroach
> > > > > > or
> > > > > > >>> Yugabyte, I don’t think it’s necessary to cover both)
> > > > > > >>> - A 2PC implementation (the Accord paper mentions DynamoDB
> but
> > I
> > > > > > suspect
> > > > > > >>> there is more public information about MongoDB)
> > > > > > >>> - RAMP
> > > > > > >>>
> > > > > > >>> Here’s an example of what I mean:
> > > > > > >>>
> > > > > > >>> =Calvin=
> > > > > > >>>
> > > > > > >>> Approach: global consensus (Paxos in Calvin, Raft in FaunaDB)
> > to
> > > > > order
> > > > > > >>> transactions, then replicas execute the transactions
> > > independently
> > > > > with
> > > > > > >> no
> > > > > > >>> further coordination.  No SPOF.  Transactions are batched by
> > each
> > > > > > >> sequencer
> > > > > > >>> to keep this from becoming a bottleneck.
> > > > > > >>>
> > > > > > >>> Performance: Calvin paper (published 2012) reports linear
> > scaling
> > > > of
> > > > > > >> TPC-C
> > > > > > >>> New Order up to 500,000 transactions/s on 100 machines (EC2
> XL
> > > > > machines
> > > > > > >>> with 7GB ram and 8 virtual cores).  Note that TPC-C New Order
> > is
> > > > > > composed
> > > > > > >>> of four reads and four writes, so this is effectively 2M
> reads
> > > and
> > > > 2M
> > > > > > >>> writes as we normally measure them in C*.
> > > > > > >>>
> > > > > > >>> Calvin supports mixed read/write transactions, but because
> the
> > > > > > >> transaction
> > > > > > >>> execution logic requires knowing all partition keys in
> advance
> > to
> > > > > > ensure
> > > > > > >>> that all replicas can reproduce the same results with no
> > > > > coordination,
> > > > > > >>> reads against non-PK predicates must be done ahead of time
> > > > > > >> (transparently,
> > > > > > >>> by the server) to determine the set of keys, and this must be
> > > > retried
> > > > > > if
> > > > > > >>> the set of rows affected is updated before the actual
> > transaction
> > > > > > >> executes.
> > > > > > >>>
> > > > > > > >>> Batching and global consensus add latency -- 100ms in the Calvin paper and
> > > > > > > >>> apparently about 50ms in FaunaDB.  Glass half full: all transactions
> > > > > > > >>> (including multi-partition updates) are equally performant in Calvin since
> > > > > > > >>> the coordination is handled up front in the sequencing step.  Glass half
> > > > > > > >>> empty: even single-row reads and writes have to pay the full coordination
> > > > > > > >>> cost.  Fauna has optimized this away for reads but I am not aware of a
> > > > > > > >>> description of how they changed the design to allow this.
> > > > > > >>>
> > > > > > >>> Functionality and limitations: since the entire transaction
> > must
> > > be
> > > > > > known
> > > > > > >>> in advance to allow coordination-less execution at the
> > replicas,
> > > > > Calvin
> > > > > > >>> cannot support interactive transactions at all.  FaunaDB
> > > mitigates
> > > > > this
> > > > > > >> by
> > > > > > >>> allowing server-side logic to be included, but a Calvin
> > approach
> > > > will
> > > > > > >> never
> > > > > > >>> be able to offer SQL compatibility.
> > > > > > >>>
> > > > > > >>> Guarantees: Calvin transactions are strictly serializable.
> > There
> > > > is
> > > > > no
> > > > > > >>> additional complexity or performance hit to generalizing to
> > > > multiple
> > > > > > >>> regions, apart from the speed of light.  And since Calvin is
> > > > already
> > > > > > >> paying
> > > > > > >>> a batching latency penalty, this is less painful than for
> other
> > > > > > systems.
> > > > > > >>>
> > > > > > >>> Application to Cassandra: B-.  Distributed transactions are
> > > handled
> > > > > by
> > > > > > >> the
> > > > > > >>> sequencing and scheduling layers, which are leaderless, and
> > > > Calvin’s
> > > > > > >>> requirements for the storage layer are easily met by C*.  But
> > > > Calvin
> > > > > > also
> > > > > > >>> requires a global consensus protocol and LWT is almost
> > certainly
> > > > not
> > > > > > >>> sufficiently performant, so this would require ZK or etcd
> > > > (reasonable
> > > > > > >> for a
> > > > > > >>> library approach but not for replacing LWT in C* itself), or
> an
> > > > > > >>> implementation of Accord.  I don’t believe Calvin would
> require
> > > > > > >> additional
> > > > > > >>> table-level metadata in Cassandra.
> > > > > > >>>
> > > > > > >>> On Sun, Sep 5, 2021 at 9:33 AM benedict@apache.org <
> > > > > > benedict@apache.org>
> > > > > > >>> wrote:
> > > > > > >>>
> > > > > > > >>>> Wiki: https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-15%3A+General+Purpose+Transactions
> > > > > > > >>>> Whitepaper: https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf
> > > > > > > >>>> <https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf?version=1&modificationDate=1630847736966&api=v2>
> > > > > > >>>> Prototype: https://github.com/belliottsmith/accord
> > > > > > >>>>
> > > > > > >>>> Hi everyone, I’d like to propose this CEP for adoption by
> the
> > > > > > >> community.
> > > > > > >>>>
> > > > > > >>>> Cassandra has benefitted from LWTs for many years, but
> > > application
> > > > > > >>>> developers that want to ensure consistency for complex
> > > operations
> > > > > must
> > > > > > >>>> either accept the scalability bottleneck of serializing all
> > > > > > >>>> related state through a single partition, or layer a complex
> > > > > > >>>> state machine on top of the database. These are sophisticated
> > > > > > >>>> and costly activities that our users should not be expected to
> > > > > > >>>> undertake. Since distributed databases are beginning to offer
> > > > > > >>>> distributed transactions with fewer caveats, it is past time
> > > > > > >>>> for Cassandra to do so as well.
> > > > > > >>>>
> > > > > > >>>> This CEP proposes the use of several novel techniques that
> > > > > > >>>> build upon research (that followed EPaxos) to deliver
> > > > > > >>>> (non-interactive) general purpose distributed transactions. The
> > > > > > >>>> approach is outlined in the wikipage and in more detail in the
> > > > > > >>>> linked whitepaper. Importantly, by adopting this approach we
> > > > > > >>>> will be the _only_ distributed database to offer global,
> > > > > > >>>> scalable, strict serializable transactions in one wide area
> > > > > > >>>> round-trip. This would represent a significant improvement in
> > > > > > >>>> the state of the art, both in the academic literature and in
> > > > > > >>>> commercial or open source offerings.
> > > > > > >>>>
> > > > > > >>>> This work has been partially realised in a prototype. This
> > > > > > >>>> partial prototype has been verified against Jepsen.io’s
> > > > > > >>>> Maelstrom library and dedicated in-tree strict serializability
> > > > > > >>>> verification tools, but much work remains for the work to be
> > > > > > >>>> production capable and integrated into Cassandra.
> > > > > > >>>>
> > > > > > >>>> I propose including the prototype in the project as a new
> > > > > > >>>> source repository, to be developed as a standalone library for
> > > > > > >>>> integration into Cassandra. I hope the community sees the
> > > > > > >>>> important value proposition of this proposal, and will adopt
> > > > > > >>>> the CEP after this discussion, so that the library and its
> > > > > > >>>> integration into Cassandra can be developed in parallel and
> > > > > > >>>> with the involvement of the wider community.
> > > > > > >>>>
> > > > > > >>>
> > > > > > >>>
> > > > > > >>> --
> > > > > > >>> Jonathan Ellis
> > > > > > >>> co-founder, http://www.datastax.com
> > > > > > >>> @spyced
> > > > > > >>>
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Jonathan Ellis
> > > > > co-founder, http://www.datastax.com
> > > > > @spyced
> > > > >
> > > >
> > >
> >
>


--
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by "benedict@apache.org" <be...@apache.org>.

You can take a look at the Accord library, as linked in the CEP: https://github.com/belliottsmith/accord

It will of course be modified extensively over time, but this is the basic shape of the API that is envisaged. You can take a look at the Maelstrom implementation for how this will be integrated with Cassandra (which of course will be much more involved).

There will be a function for describing atomic transactions involving some combination of reads and writes, and it will be possible to submit these operations and receive an answer back. The relevant point of integration for this is accord.local.Node#coordinate.

There will likely be separate APIs for providing the system with topology changes, which it will ensure are linearized correctly with respect to ongoing transactions.

But when it boils down to it, we are providing a single point of entry for one-shot transactions. So the API from the perspective of a developer building features on top is pretty simple.
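
As a rough illustration only of what "a single point of entry for one-shot transactions" can mean for a caller, here is a toy, single-process sketch. It is not the Accord API: ToyNode, Txn, Result and transfer are hypothetical names invented for this example, and a plain lock stands in for the role the distributed protocol would play.

```python
from dataclasses import dataclass
from threading import Lock
from typing import Callable, Dict, FrozenSet


@dataclass(frozen=True)
class Txn:
    """Hypothetical one-shot transaction: the keys to read, plus a pure
    function from the values read to the writes to apply."""
    read_keys: FrozenSet[str]
    update: Callable[[Dict[str, int]], Dict[str, int]]


@dataclass(frozen=True)
class Result:
    """What the caller gets back: the values observed and the writes applied."""
    reads: Dict[str, int]
    writes: Dict[str, int]


class ToyNode:
    """Hypothetical stand-in for a node. The lock is an assumption of this
    sketch; it replaces the agreement protocol of a real cluster."""

    def __init__(self) -> None:
        self._store: Dict[str, int] = {}
        self._lock = Lock()

    def coordinate(self, txn: Txn) -> Result:
        # The whole transaction is submitted at once and applied atomically.
        with self._lock:
            reads = {key: self._store.get(key, 0) for key in txn.read_keys}
            writes = txn.update(reads)
            self._store.update(writes)
            return Result(reads, writes)


def transfer(src: str, dst: str, amount: int) -> Txn:
    """Conditional multi-key update: move `amount` only if `src` can cover it."""
    def update(reads: Dict[str, int]) -> Dict[str, int]:
        if reads[src] < amount:
            return {}  # condition failed, so nothing is written
        return {src: reads[src] - amount, dst: reads[dst] + amount}
    return Txn(frozenset({src, dst}), update)


node = ToyNode()
node.coordinate(Txn(frozenset(), lambda reads: {"a": 100, "b": 0}))
ok = node.coordinate(transfer("a", "b", 30))
rejected = node.coordinate(transfer("a", "b", 500))
```

The point of the sketch is only the shape of the call: the reads, the condition and the writes travel together in one submission, so there is no client round trip in the middle of the transaction.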


From: Paulo Motta <pa...@gmail.com>
Date: Friday, 1 October 2021 at 15:40
To: Cassandra DEV <de...@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> With respect to this, in my view this kind of detail is not warranted
> within a CEP. Software development is an exploratory process with respect
> to structure, and these decisions will be made as the CEP progresses. If
> these need to be specified upfront, then the purpose of a CEP – seeking buy
> in – is invalidated, because the work must be complete before you know the
> answers.

These need not be set in stone; they're just a rough sketch of what the
end product will look like, to make it easier to build a mental model of the
project, especially for those not directly involved with it, as well as to
guide its development for those involved. At least for me it's much easier
to visualize a project top-down (from how it's going to be used to its
particular implementation details) than the other way around.

On Fri, 1 Oct 2021 at 11:33, benedict@apache.org <benedict@apache.org> wrote:

> > The current document details thoroughly the protocol but in my view
> > fails to illustrate what specific API, methods, modules will become
> > available to developers
>
> With respect to this, in my view this kind of detail is not warranted
> within a CEP. Software development is an exploratory process with respect
> to structure, and these decisions will be made as the CEP progresses. If
> these need to be specified upfront, then the purpose of a CEP – seeking buy
> in – is invalidated, because the work must be complete before you know the
> answers.
>
>
> From: benedict@apache.org <be...@apache.org>
> Date: Friday, 1 October 2021 at 15:31
> To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> From the CEP:
>
> Batches (including unconditional batches) on transactional tables will
> receive ACID properties, and grammatically correct conditional batch
> operations that would be rejected for operating over multiple CQL
> partitions will now be supported
>
>
> From: Paulo Motta <pa...@gmail.com>
> Date: Friday, 1 October 2021 at 15:30
> To: Cassandra DEV <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> Can you just answer what palpable feature will be available once this CEP
> lands? This is still not clear to me (and perhaps to others) from
> the current CEP structure. The current document details the protocol
> thoroughly but in my view fails to illustrate what specific APIs, methods
> and modules will become available to developers, and how it fits into the
> larger picture and interacts with existing modules, if at all. A few
> examples of how it can be used to build features on top would also help.
>
> On Fri, 1 Oct 2021 at 11:10, benedict@apache.org <benedict@apache.org> wrote:
>
> > I’m not, though it might seem that way. I disagree with your views about
> > how CEPs should be structured. Since the CEP process was itself codified
> > via the CEP process, if you want to recodify how CEPs work, the correct
> > way is via the CEP process itself.
> >
> > The discussion is being drawn in multiple directions away from the CEP
> > itself, and I am trying to keep this particular thread focused on the
> > business at hand, not meta discussions around CEP structure that will no
> > doubt be unproductive given likely irreconcilable views about the topic,
> > nor discussions about other CEPs that could have been.
> >
> > If you want to start a separate exploratory discussion thread about CEP
> > structure without filing a CEP feel free to do so.
> >
> >
> > From: Paulo Motta <pa...@gmail.com>
> > Date: Friday, 1 October 2021 at 15:04
> > To: Cassandra DEV <de...@cassandra.apache.org>
> > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > If you want to impose your views on CEP structure on others, please
> > > file a CEP with the additional restrictions and guidance you want to
> > > impose and start a discussion thread. I can then respond in detail to
> > > why I perceive this approach to be flawed, in a dedicated context.
> >
> > This sounds very Kafkaesque. You know I won't file a meta-CEP to change
> > the structure of CEPs, so you're just using this as an excuse to shut down
> > the discussion on the lack of clarity about what actual palpable feature
> > will be available once the CEP lands. :-)
> >
> > I'm just providing my humble feedback on how a CEP could be more
> > digestible and easier to consume from an external point of view, and this
> > seems like an appropriate and contextualized place to voice this opinion
> > which is perhaps shared by others.
> >
> > On Fri, 1 Oct 2021 at 10:55, benedict@apache.org <benedict@apache.org> wrote:
> >
> > > I disagree with you. However, this is the wrong forum to have a meta
> > > discussion about how CEP should be structured.
> > >
> > > If you want to impose your views on CEP structure on others, please
> > > file a CEP with the additional restrictions and guidance you want to
> > > impose and start a discussion thread. I can then respond in detail to
> > > why I perceive this approach to be flawed, in a dedicated context.
> > >
> > >
> > > From: Paulo Motta <pa...@gmail.com>
> > > Date: Friday, 1 October 2021 at 14:48
> > > To: Cassandra DEV <de...@cassandra.apache.org>
> > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > The proposal as it stands today is exceptionally thorough, more so
> > > > than any other CEP to date, or any CEP is likely to be in the near
> > > > future.
> > >
> > > The protocol is thoroughly described, but in my view a CEP is a forum to
> > > discuss the high level architecture and plan for adding a full
> > > end-to-end enhancement to the database, breaking it into sub-CEPs if
> > > needed, as long as the full plan is known in advance; otherwise the
> > > community will not have the context to judge the full extent and impact
> > > of the proposed enhancement.
> > >
> > > > Since it remains unclear to me what either yourself or Jonathan want
> > > > to see as an alternative
> > >
> > > I would personally like to see something along these lines:
> > >
> > > CEP1: Add ACID-compliant atomic batches
> > > - UX changes needed: none, CQL provides the grammar we need.
> > > - Distributed transaction protocol needed: Accord (link to white paper
> > > if you want specific details about the protocol)
> > > - High-level architecture: what new components will be added, how
> > > existing components will be modified, what new messages will be added,
> > > what new configuration knobs will be introduced, what are the milestones
> > > of the project, etc.
> > >
> > > CEP2: Make LWT faster and more reliable
> > > - UX changes needed: none
> > > - Distributed transaction protocol needed: Accord, already added by
> > > previous CEP.
> > > - High-level architecture: blablabla... and so on.
> > >
> > > On Fri, 1 Oct 2021 at 10:19, benedict@apache.org <benedict@apache.org> wrote:
> > >
> > > > I think this is getting circular and unproductive. Basic disagreements
> > > > about whether the CEP specifies a feature I am inclined to leave for a
> > > > vote. In my view the CEP specifies several features, both immediate
> > > > ones for the user (ACID batches and multi-key LWTs) and
> > > > developer-focused ones around ground-breaking semantics that will be
> > > > enabled.
> > > >
> > > > The proposal as it stands today is exceptionally thorough, more so than
> > > > any other CEP to date, or any CEP is likely to be in the near future.
> > > >
> > > > This is a Cassandra Enhancement *Proposal*, and at some point we have
> > > > to engage with what is proposed, not what you might like to be
> > > > proposed. Since it remains unclear to me what either yourself or
> > > > Jonathan want to see as an alternative, at this point it would seem
> > > > more productive to produce your own proposals for the community to
> > > > consider. It is possible for multiple transaction systems to co-exist,
> > > > if you feel this is necessary.
> > > >
> > > >
> > > >
> > > > From: Paulo Motta <pa...@gmail.com>
> > > > Date: Friday, 1 October 2021 at 13:58
> > > > To: Cassandra DEV <de...@cassandra.apache.org>
> > > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > I share similar feelings as jbellis that this proposal seems to be
> > > > focusing on the protocol itself but lacking the actual feature that
> > > > will use the protocol, which IMO is a key element to discuss in a CEP.
> > > >
> > > > It's similar to saying: hey, I want to add this Tries Serialization
> > > > Protocol to Cassandra, but not providing specific details of how this
> > > > protocol is going to be used.
> > > >
> > > > I think the right route for a CEP is to describe the feature that will
> > > > be added to the database, with the protocol a mere requirement of the
> > > > high-level feature, for example:
> > > >
> > > > CEP: Add Trie-backed memtable
> > > > - Trie Serialization Protocol: implementation detail of the above CEP
> > > >
> > > > What is the difficulty of taking this approach, picking one of the
> > > > myriad of features that will be enabled by Accord and using that as the
> > > > initial CEP to introduce the protocol to the database?
> > > >
> > > > On Fri, 1 Oct 2021 at 08:37, benedict@apache.org <benedict@apache.org> wrote:
> > > >
> > > > > Actually, thinking about it again, the simple optimistic protocol
> > > > > would in fact guarantee system forward progress (i.e. independent of
> > > > > transaction formulation).
> > > > >
> > > > >
> > > > > From: benedict@apache.org <be...@apache.org>
> > > > > Date: Friday, 1 October 2021 at 09:14
> > > > > To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> > > > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > Hi Jonathan,
> > > > >
> > > > > It would be great if we could achieve a bandwidth higher than 1-2
> > > > > short emails per week. It remains unclear to me what your goal is,
> > > > > and it would help if you could make a statement like “I want
> > > > > Cassandra to be able to do X” so that we can respond directly to it.
> > > > > I am also available to have another call, in which we can have a back
> > > > > and forth, please feel free to propose a London-compatible time
> > > > > within the next week that is suitable for you.
> > > > >
> > > > > In my opinion we are at risk of veering off-topic, though. This CEP
> > > > > is not to deliver interactive transactions, and to my knowledge
> > > > > nobody is proposing a CEP for interactive transactions. So, for the
> > > > > CEP at hand the salient question seems: does this CEP prevent us from
> > > > > implementing interactive transactions with properties X, Y, Z in
> > > > > future? To which the answer is almost certainly no.
> > > > >
> > > > > However, to continue the discussion and respond directly to your
> > > > > queries, I believe we agree on the definition of an interactive
> > > > > transaction.
> > > > >
> > > > > Two protocols were loosely outlined. The first, using timestamps for
> > > > > optimistic concurrency control, would indeed involve the possibility
> > > > > of aborts. It would not however inherently adopt the issue of LWTs
> > > > > where no transaction is able to make progress. Whether or not
> > > > > progress is guaranteed (in a livelock-free sense) would depend on the
> > > > > structure of the transactions that were interfering.
> > > > >
> > > > > This approach has the advantage of being very simple to implement, so
> > > > > that we could realistically support interactive transactions quite
> > > > > quickly. It has the additional advantage that transactions would
> > > > > execute very quickly by avoiding the WAN during construction, and as
> > > > > a result may in practice experience fewer aborts than protocols that
> > > > > guarantee livelock-freedom.
> > > > >
> > > > > The second protocol proposed using read/write intents and would be
> > > > > able to support almost any behaviour you want. We could even utilise
> > > > > pessimistic concurrency control, or anything in-between. This is its
> > > > > own huge design space, and discussion of this approach and the
> > > > > trade-offs that could be made is (in my opinion) entirely out of
> > > > > scope for this CEP.
> > > > >
> > > > >
> > > > > From: Jonathan Ellis <jb...@gmail.com>
> > > > > Date: Friday, 1 October 2021 at 05:00
> > > > > To: dev <de...@cassandra.apache.org>
> > > > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > The obstacle for me is you've provided a protocol but not a fully
> > > > > fleshed out architecture, so it's hard to fill in some of the blanks.
> > > > > But it looks to me like optimistic concurrency control for
> > > > > interactive transactions applied to Accord would leave you in a
> > > > > LWT-like situation under fairly light contention where nobody
> > > > > actually makes progress due to retries.
> > > > >
> > > > > To make sure we're talking about the same thing, as Henrik pointed
> > > > > out, interactive transactions mean multiple round trips from the
> > > > > client within a transaction.  For example, here
> > > > > <https://github.com/apavlo/py-tpcc/blob/master/pytpcc/drivers/sqlitedriver.py#L213>
> > > > > is a simple implementation of the TPC-C New Order transaction.  The
> > > > > high level logic (via
> > > > > <https://courses.cs.washington.edu/courses/csep545/01wi/lectures/class1/tsld039.htm>)
> > > > > is,
> > > > >
> > > > >    1. Get records describing a warehouse, customer, & district
> > > > >    2. Update the district
> > > > >    3. Increment next available order number
> > > > >    4. Insert record into Order and New-Order tables
> > > > >    5. For 5-15 items, get Item record, get/update Stock record
> > > > >    6. Insert Order-Line Record
> > > > >
> > > > > As you can see, this requires a lot of client-side logic mixed in
> > > > > with the actual SQL commands.
> > > > >
> > > > >
> > > > > On Thu, Sep 30, 2021 at 2:30 AM benedict@apache.org <benedict@apache.org> wrote:
> > > > >
> > > > > > Essentially this, although I think in practice we will need to
> > > > > > track each partition’s timestamp separately (or optionally for
> > > > > > reduced conflicts, each row or datum’s), and make them all part of
> > > > > > the conditional application of the transaction - at least for
> > > > > > strict-serializability.
> > > > > >
> > > > > > The alternative is to insert read/write intents for the transaction
> > > > > > during each step, and to confirm they are still valid on commit,
> > > > > > but this approach would require a WAN round-trip for each step in
> > > > > > the interactive transaction, whereas the timestamp-validating
> > > > > > approach can use a LAN round-trip for each step besides the final
> > > > > > one, and is also much simpler to implement.
> > > > > >
> > > > > >
> > > > > > From: Blake Eggleston <be...@apple.com.INVALID>
> > > > > > Date: Thursday, 30 September 2021 at 05:47
> > > > > > To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> > > > > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > > You could establish a lower timestamp bound and buffer transaction
> > > > > > state on the coordinator, then make the commit an operation that
> > > > > > only applies if all partitions involved haven’t been changed by a
> > > > > > more recent timestamp. You could also implement MVCC, either in the
> > > > > > storage layer or for some period of time by buffering commits on
> > > > > > each replica before applying.
> > > > > >
> > > > > > > On Sep 29, 2021, at 6:18 PM, Jonathan Ellis <jbellis@gmail.com> wrote:
> > > > > > >
> > > > > > > How are interactive transactions possible with Accord?
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Tue, Sep 21, 2021 at 11:56 PM benedict@apache.org <
> > > > > > benedict@apache.org>
> > > > > > > wrote:
> > > > > > >
> > > > > > >> Could you explain why you believe this trade-off is necessary?
> > We
> > > > can
> > > > > > >> support full SQL just fine with Accord, and I hope that we
> > > > eventually
> > > > > > do so.
> > > > > > >>
> > > > > > >> This domain is incredibly complex, so it is easy to reach
> wrong
> > > > > > >> conclusions. I would invite you again to propose a system for
> > > > > discussion
> > > > > > >> that you think offers something Accord is unable to, and that
> > you
> > > > > > consider
> > > > > > >> desirable, and we can work from there.
> > > > > > >>
> > > > > > >> To pre-empt some possible discussions, I am not aware of anything
> > > > > > >> we cannot do with Accord that we could do with either Calvin or
> > > > > > >> Spanner. Interactive transactions are possible on top of Accord,
> > > > > > >> as are transactions with an unknown read/write set. In each case
> > > > > > >> the only cost is that they would use optimistic concurrency
> > > > > > >> control, which is no worse than the Spanner derivatives anyway
> > > > > > >> (which I have to assume is your benchmark in this regard). I do
> > > > > > >> not expect to deliver either functionality initially, but Accord
> > > > > > >> takes us most of the way there for both.
> > > > > > >>
> > > > > > >>
> > > > > > >> From: Jonathan Ellis <jb...@gmail.com>
> > > > > > >> Date: Wednesday, 22 September 2021 at 05:36
> > > > > > >> To: dev <de...@cassandra.apache.org>
> > > > > > >> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > > >> Right, I'm looking for exactly a discussion on the high level
> > > > > > >> goals. Instead of saying "here's the goals and we ruled out X
> > > > > > >> because Y" we should start with a discussion around, "Approach A
> > > > > > >> allows X and W, approach B allows Y and Z" and decide together
> > > > > > >> what the goals should be and what we are willing to trade to get
> > > > > > >> those goals, e.g., are we willing to give up global strict
> > > > > > >> serializability to get the ability to support full SQL. Both of
> > > > > > >> these are nice to have!
> > > > > > >>
> > > > > > >> On Tue, Sep 21, 2021 at 9:52 PM benedict@apache.org <
> > > > > > benedict@apache.org>
> > > > > > >> wrote:
> > > > > > >>
> > > > > > >>> Hi Jonathan,
> > > > > > >>>
> > > > > > >>> These other systems are incompatible with the goals of the
> > CEP. I
> > > > do
> > > > > > >>> discuss them (besides 2PC) in both the whitepaper and the
> CEP,
> > > and
> > > > > will
> > > > > > >>> summarise that discussion below. A true and accurate
> comparison
> > > of
> > > > > > these
> > > > > > >>> other systems is essentially intractable, as there are
> complex
> > > > > > subtleties
> > > > > > >>> to each flavour, and those who are interested would be better
> > > > served
> > > > > by
> > > > > > >>> performing their own research.
> > > > > > >>>
> > > > > > >>> I think it is more productive to focus on what we want to
> > achieve
> > > > as
> > > > > a
> > > > > > >>> community. If you believe the goals of this CEP are wrong for
> > the
> > > > > > >> project,
> > > > > > >>> let’s focus on that. If you want to compare and contrast
> > specific
> > > > > > facets
> > > > > > >> of
> > > > > > >>> alternative systems that you consider to be preferable in
> some
> > > > > > dimension,
> > > > > > >>> let’s do that here or in a Q&A as proposed by Joey.
> > > > > > >>>
> > > > > > >>> The relevant goals are that we:
> > > > > > >>>
> > > > > > >>>
> > > > > > >>>  1.  Guarantee strict serializable isolation on commodity
> > > hardware
> > > > > > >>>  2.  Scale to any cluster size
> > > > > > >>>  3.  Achieve optimal latency
> > > > > > >>>
> > > > > > >>> The approach taken by Spanner derivatives is rejected by (1)
> > > > because
> > > > > > they
> > > > > > >>> guarantee only Serializable isolation (they additionally fail
> > > (3)).
> > > > > > From
> > > > > > >>> watching talks by YugaByte, and inferring from Cockroach’s
> > > > > > >>> panic-cluster-death under clock skew, this is clearly
> > considered
> > > by
> > > > > > >>> everyone to be undesirable but necessary to achieve
> > scalability.
> > > > > > >>>
> > > > > > >>> The approach taken by FaunaDB (Calvin) is rejected by (2)
> > because
> > > > its
> > > > > > >>> sequencing layer requires a global leader process for the
> > > cluster,
> > > > > > which
> > > > > > >> is
> > > > > > >>> incompatible with Cassandra’s scalability requirements. It
> > > > > additionally
> > > > > > >>> fails (3) for global clients.
> > > > > > >>>
> > > > > > >>> Two phase commit fails (3). As an aside, AFAICT DynamoDB is
> > > today a
> > > > > > >>> Spanner clone for its multi-key transaction functionality,
> not
> > > 2PC.
> > > > > > >>>
> > > > > > >>> Systems such as RAMP with even weaker isolation are not
> > > considered
> > > > > for
> > > > > > >> the
> > > > > > >>> simple reason that they do not even claim to meet (1).
> > > > > > >>>
> > > > > > >>> If we want to additionally offer weaker isolation levels than
> > > > > > >>> Serializable, such as that provided by the recent RAMP-TAO
> > paper,
> > > > > > >> Cassandra
> > > > > > >>> is likely able to support multiple distinct transaction
> layers
> > > that
> > > > > > >> operate
> > > > > > >>> independently. I would encourage you to file a CEP to explore
> > how
> > > > we
> > > > > > can
> > > > > > >>> meet these distinct use cases, but I consider them to be
> > niche. I
> > > > > > expect
> > > > > > >>> that a majority of our user base desire strict serializable
> > > > > isolation,
> > > > > > >> and
> > > > > > >>> certainly no less than serializable isolation, to augment the
> > > > > existing
> > > > > > >>> weaker isolation offered by quorum reads and writes.
> > > > > > >>>
> > > > > > >>> I would tangentially note that we are not an AP database
> under
> > > > normal
> > > > > > >>> recommended operation. A minority in any network partition
> > cannot
> > > > > reach
> > > > > > >>> QUORUM, so under recommended usage we are a high-availability
> > > > > > leaderless
> > > > > > >> CP
> > > > > > >>> database.
> > > > > > >>>
> > > > > > >>>
> > > > > > >>> From: Jonathan Ellis <jb...@gmail.com>
> > > > > > >>> Date: Tuesday, 21 September 2021 at 23:45
> > > > > > >>> To: dev <de...@cassandra.apache.org>
> > > > > > >>> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > > >>> Benedict, thanks for taking the lead in putting this
> together.
> > > > Since
> > > > > > >>> Cassandra is the only relevant database today designed
> around a
> > > > > > >> leaderless
> > > > > > >>> architecture, it's quite likely that we'll be better served
> > with
> > > a
> > > > > > custom
> > > > > > >>> transaction design instead of trying to retrofit one from CP
> > > > systems.
> > > > > > >>>
> > > > > > >>> The whitepaper here is a good description of the consensus
> > > > algorithm
> > > > > > >> itself
> > > > > > >>> as well as its robustness and stability characteristics, and
> > its
> > > > > > >> comparison
> > > > > > >>> with other state-of-the-art consensus algorithms is very
> > useful.
> > > > In
> > > > > > the
> > > > > > >>> context of Cassandra, where a consensus algorithm is only
> part
> > of
> > > > > what
> > > > > > >> will
> > > > > > >>> be implemented, I'd like to see a more complete evaluation of
> > the
> > > > > > >>> transactional side of things as well, including performance
> > > > > > >> characteristics
> > > > > > >>> as well as the types of transactions that can be supported
> and
> > at
> > > > > > least a
> > > > > > >>> general idea of what it would look like applied to Cassandra.
> > > This
> > > > > will
> > > > > > >>> allow the PMC to make a more informed decision about what
> > > tradeoffs
> > > > > are
> > > > > > >>> best for the entire long-term project of first supplementing
> > and
> > > > > > >> ultimately
> > > > > > >>> replacing LWT.
> > > > > > >>>
> > > > > > >>> (Allowing users to mix LWT and AP Cassandra operations
> against
> > > the
> > > > > same
> > > > > > >>> rows was probably a mistake, so in contrast with LWT we’re
> not
> > > > > looking
> > > > > > >> for
> > > > > > >>> something fast enough for occasional use but rather something
> > > > within
> > > > > a
> > > > > > >>> reasonable factor of AP operations, appropriate to being the
> > only
> > > > way
> > > > > > to
> > > > > > >>> interact with tables declared as such.)
> > > > > > >>>
> > > > > > >>> Besides Accord, this should cover
> > > > > > >>>
> > > > > > >>> - Calvin and FaunaDB
> > > > > > >>> - A Spanner derivative (no opinion on whether that should be
> > > > > Cockroach
> > > > > > or
> > > > > > >>> Yugabyte, I don’t think it’s necessary to cover both)
> > > > > > >>> - A 2PC implementation (the Accord paper mentions DynamoDB
> but
> > I
> > > > > > suspect
> > > > > > >>> there is more public information about MongoDB)
> > > > > > >>> - RAMP
> > > > > > >>>
> > > > > > >>> Here’s an example of what I mean:
> > > > > > >>>
> > > > > > >>> =Calvin=
> > > > > > >>>
> > > > > > >>> Approach: global consensus (Paxos in Calvin, Raft in FaunaDB)
> > to
> > > > > order
> > > > > > >>> transactions, then replicas execute the transactions
> > > independently
> > > > > with
> > > > > > >> no
> > > > > > >>> further coordination.  No SPOF.  Transactions are batched by
> > each
> > > > > > >> sequencer
> > > > > > >>> to keep this from becoming a bottleneck.
> > > > > > >>>
> > > > > > >>> Performance: Calvin paper (published 2012) reports linear
> > scaling
> > > > of
> > > > > > >> TPC-C
> > > > > > >>> New Order up to 500,000 transactions/s on 100 machines (EC2
> XL
> > > > > machines
> > > > > > >>> with 7GB ram and 8 virtual cores).  Note that TPC-C New Order
> > is
> > > > > > composed
> > > > > > >>> of four reads and four writes, so this is effectively 2M
> reads
> > > and
> > > > 2M
> > > > > > >>> writes as we normally measure them in C*.
> > > > > > >>>
> > > > > > >>> Calvin supports mixed read/write transactions, but because
> the
> > > > > > >> transaction
> > > > > > >>> execution logic requires knowing all partition keys in
> advance
> > to
> > > > > > ensure
> > > > > > >>> that all replicas can reproduce the same results with no
> > > > > coordination,
> > > > > > >>> reads against non-PK predicates must be done ahead of time
> > > > > > >> (transparently,
> > > > > > >>> by the server) to determine the set of keys, and this must be
> > > > retried
> > > > > > if
> > > > > > >>> the set of rows affected is updated before the actual
> > transaction
> > > > > > >> executes.
> > > > > > >>>
> > > > > > >>> Batching and global consensus adds latency -- 100ms in the
> > Calvin
> > > > > paper
> > > > > > >> and
> > > > > > >>> apparently about 50ms in FaunaDB.  Glass half full: all
> > > > transactions
> > > > > > >>> (including multi-partition updates) are equally performant in
> > > > Calvin
> > > > > > >> since
> > > > > > >>> the coordination is handled up front in the sequencing step.
> > > Glass
> > > > > > half
> > > > > > >>> empty: even single-row reads and writes have to pay the full
> > > > > > coordination
> > > > > > >>> cost.  Fauna has optimized this away for reads but I am not
> > aware
> > > > of
> > > > > a
> > > > > > >>> description of how they changed the design to allow this.
> > > > > > >>>
> > > > > > >>> Functionality and limitations: since the entire transaction
> > must
> > > be
> > > > > > known
> > > > > > >>> in advance to allow coordination-less execution at the
> > replicas,
> > > > > Calvin
> > > > > > >>> cannot support interactive transactions at all.  FaunaDB
> > > mitigates
> > > > > this
> > > > > > >> by
> > > > > > >>> allowing server-side logic to be included, but a Calvin
> > approach
> > > > will
> > > > > > >> never
> > > > > > >>> be able to offer SQL compatibility.
> > > > > > >>>
> > > > > > >>> Guarantees: Calvin transactions are strictly serializable.
> > There
> > > > is
> > > > > no
> > > > > > >>> additional complexity or performance hit to generalizing to
> > > > multiple
> > > > > > >>> regions, apart from the speed of light.  And since Calvin is
> > > > already
> > > > > > >> paying
> > > > > > >>> a batching latency penalty, this is less painful than for
> other
> > > > > > systems.
> > > > > > >>>
> > > > > > >>> Application to Cassandra: B-.  Distributed transactions are
> > > handled
> > > > > by
> > > > > > >> the
> > > > > > >>> sequencing and scheduling layers, which are leaderless, and
> > > > Calvin’s
> > > > > > >>> requirements for the storage layer are easily met by C*.  But
> > > > Calvin
> > > > > > also
> > > > > > >>> requires a global consensus protocol and LWT is almost
> > certainly
> > > > not
> > > > > > >>> sufficiently performant, so this would require ZK or etcd
> > > > (reasonable
> > > > > > >> for a
> > > > > > >>> library approach but not for replacing LWT in C* itself), or
> an
> > > > > > >>> implementation of Accord.  I don’t believe Calvin would
> require
> > > > > > >> additional
> > > > > > >>> table-level metadata in Cassandra.
> > > > > > >>>
> > > > > > >>> On Sun, Sep 5, 2021 at 9:33 AM benedict@apache.org <
> > > > > > benedict@apache.org>
> > > > > > >>> wrote:
> > > > > > >>>
> > > > > > > >>>> Wiki: https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-15%3A+General+Purpose+Transactions
> > > > > > > >>>> Whitepaper: https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf
> > > > > > > >>>> <https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf?version=1&modificationDate=1630847736966&api=v2>
> > > > > > > >>>> Prototype: https://github.com/belliottsmith/accord
> > > > > > >>>>
> > > > > > >>>> Hi everyone, I’d like to propose this CEP for adoption by
> the
> > > > > > >> community.
> > > > > > >>>>
> > > > > > >>>> Cassandra has benefitted from LWTs for many years, but
> > > application
> > > > > > >>>> developers that want to ensure consistency for complex
> > > operations
> > > > > must
> > > > > > >>>> either accept the scalability bottleneck of serializing all
> > > > related
> > > > > > >> state
> > > > > > >>>> through a single partition, or layer a complex state machine
> > on
> > > > top
> > > > > of
> > > > > > >>> the
> > > > > > >>>> database. These are sophisticated and costly activities that
> > our
> > > > > users
> > > > > > >>>> should not be expected to undertake. Since distributed
> > databases
> > > > are
> > > > > > >>>> beginning to offer distributed transactions with fewer
> > caveats,
> > > it
> > > > > is
> > > > > > >>> past
> > > > > > >>>> time for Cassandra to do so as well.
> > > > > > >>>>
> > > > > > >>>> This CEP proposes the use of several novel techniques that
> > build
> > > > > upon
> > > > > > >>>> research (that followed EPaxos) to deliver (non-interactive)
> > > > general
> > > > > > >>>> purpose distributed transactions. The approach is outlined
> in
> > > the
> > > > > > >>> wikipage
> > > > > > >>>> and in more detail in the linked whitepaper. Importantly, by
> > > > > adopting
> > > > > > >>> this
> > > > > > >>>> approach we will be the _only_ distributed database to offer
> > > > global,
> > > > > > >>>> scalable, strict serializable transactions in one wide area
> > > > > > round-trip.
> > > > > > >>>> This would represent a significant improvement in the state
> of
> > > the
> > > > > > art,
> > > > > > >>>> both in the academic literature and in commercial or open
> > source
> > > > > > >>> offerings.
> > > > > > >>>>
> > > > > > >>>> This work has been partially realised in a prototype. This
> > > partial
> > > > > > >>>> prototype has been verified against Jepsen.io’s Maelstrom
> > > library
> > > > > and
> > > > > > >>>> dedicated in-tree strict serializability verification tools,
> > but
> > > > > much
> > > > > > >>> work
> > > > > > >>>> remains for the work to be production capable and integrated
> > > into
> > > > > > >>> Cassandra.
> > > > > > >>>>
> > > > > > >>>> I propose including the prototype in the project as a new
> > source
> > > > > > >>>> repository, to be developed as a standalone library for
> > > > integration
> > > > > > >> into
> > > > > > >>>> Cassandra. I hope the community sees the important value
> > > > proposition
> > > > > > of
> > > > > > >>>> this proposal, and will adopt the CEP after this discussion,
> > so
> > > > that
> > > > > > >> the
> > > > > > >>>> library and its integration into Cassandra can be developed
> in
> > > > > > parallel
> > > > > > >>> and
> > > > > > >>>> with the involvement of the wider community.
> > > > > > >>>>
> > > > > > >>>
> > > > > > >>>
> > > > > > >>> --
> > > > > > >>> Jonathan Ellis
> > > > > > >>> co-founder, http://www.datastax.com
> > > > > > >>> @spyced
> > > > > > >>>
> > > > > > >>
> > > > > > >>
> > > > > > >> --
> > > > > > >> Jonathan Ellis
> > > > > > >> co-founder, http://www.datastax.com
> > > > > > >> @spyced
> > > > > > >>
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Jonathan Ellis
> > > > > > > co-founder, http://www.datastax.com
> > > > > > > @spyced
> > > > > >
> > > > > >
> > > > > >
> > ---------------------------------------------------------------------
> > > > > > To unsubscribe, e-mail: dev-unsubscribe@cassandra.apache.org
> > > > > > For additional commands, e-mail: dev-help@cassandra.apache.org
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Jonathan Ellis
> > > > > co-founder, http://www.datastax.com
> > > > > @spyced
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by Jonathan Ellis <jb...@gmail.com>.

I honestly can't understand the perspective that on the one hand, you're
asking for approval of a specific protocol as part of the CEP, but on the
other, you think discussion of the APIs this will enable is not warranted.
Surely we need agreement on what APIs we're trying to build, before we
discuss the protocols and architectures with which to build them.

On Fri, Oct 1, 2021 at 9:34 AM benedict@apache.org <be...@apache.org>
wrote:

> > The current document thoroughly details the protocol but in my view
> > fails to illustrate what specific APIs, methods, and modules will become
> > available to developers
>
> With respect to this, in my view this kind of detail is not warranted
> within a CEP. Software development is an exploratory process with respect
> to structure, and these decisions will be made as the CEP progresses. If
> these need to be specified upfront, then the purpose of a CEP – seeking
> buy-in – is invalidated, because the work must be complete before you know
> the answers.
>
>
> From: benedict@apache.org <be...@apache.org>
> Date: Friday, 1 October 2021 at 15:31
> To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> From the CEP:
>
> Batches (including unconditional batches) on transactional tables will
> receive ACID properties, and grammatically correct conditional batch
> operations that would be rejected for operating over multiple CQL
> partitions will now be supported
>
>
> From: Paulo Motta <pa...@gmail.com>
> Date: Friday, 1 October 2021 at 15:30
> To: Cassandra DEV <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> Can you just answer what palpable feature will be available once this CEP
> lands? This is still not clear to me (and perhaps to others) from
> the current CEP structure. The current document thoroughly details the
> protocol but in my view fails to illustrate what specific APIs, methods, and
> modules will become available to developers, how it fits into the larger
> picture and interacts with existing modules, if at all, and perhaps a few
> examples of how it can be used to build features on top.
>
> On Fri, Oct 1, 2021 at 11:10, benedict@apache.org <
> benedict@apache.org> wrote:
>
> > I’m not, though it might seem that way. I disagree with your views about
> > how CEPs should be structured. Since the CEP process was itself codified via
> > the CEP process, if you want to recodify how CEPs work, the correct way is
> > via the CEP process itself.
> >
> > The discussion is being drawn in multiple directions away from the CEP
> > itself, and I am trying to keep this particular thread focused on the
> > business at hand, not meta discussions around CEP structure that will no
> > doubt be unproductive given likely irreconcilable views about the topic,
> > nor discussions about other CEPs that could have been.
> >
> > If you want to start a separate exploratory discussion thread about CEP
> > structure without filing a CEP feel free to do so.
> >
> >
> > From: Paulo Motta <pa...@gmail.com>
> > Date: Friday, 1 October 2021 at 15:04
> > To: Cassandra DEV <de...@cassandra.apache.org>
> > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > If you want to impose your views on CEP structure on others, please file
> > > a CEP with the additional restrictions and guidance you want to impose and
> > > start a discussion thread. I can then respond in detail to why I perceive
> > > this approach to be flawed, in a dedicated context.
> >
> > This sounds very Kafkaesque. You know I won't file a meta-CEP to change the
> > structure of CEPs, so you're just using this as an excuse to shut down the
> > discussion on the lack of clarity on what actual palpable feature will be
> > available once the CEP lands. :-)
> >
> > I'm just providing my humble feedback on how a CEP could be more
> digestible
> > and easier to consume from an external point of view, and this seems like
> > an appropriate and contextualized place to voice this opinion which is
> > perhaps shared by others.
> >
> > On Fri, Oct 1, 2021 at 10:55, benedict@apache.org <
> > benedict@apache.org> wrote:
> >
> > > I disagree with you. However, this is the wrong forum to have a meta
> > > discussion about how CEPs should be structured.
> > >
> > > If you want to impose your views on CEP structure on others, please
> file
> > a
> > > CEP with the additional restrictions and guidance you want to impose
> and
> > > start a discussion thread. I can then respond in detail to why I
> perceive
> > > this approach to be flawed, in a dedicated context.
> > >
> > >
> > > From: Paulo Motta <pa...@gmail.com>
> > > Date: Friday, 1 October 2021 at 14:48
> > > To: Cassandra DEV <de...@cassandra.apache.org>
> > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > The proposal as it stands today is exceptionally thorough, more so than
> > > > any other CEP to date, or any CEP is likely to be in the near future.
> > >
> > > The protocol is thoroughly described, but in my view a CEP is a forum to
> > > discuss the high-level architecture and plan for adding a full end-to-end
> > > enhancement to the database, breaking it into sub-CEPs if needed, as long
> > > as the full plan is known in advance. Otherwise the community will not have
> > > the context to judge the full extent and impact of the proposed
> > > enhancement.
> > >
> > > > Since it remains unclear to me what either yourself or Jonathan want
> to
> > > see as an alternative
> > >
> > > I would personally like to see something along these lines:
> > >
> > > CEP1: Add ACID-compliant atomic batches
> > > - UX changes needed: none, CQL provides the grammar we need.
> > > - Distributed transaction protocol needed: Accord (link to white paper if
> > > you want specific details about the protocol)
> > > - High-level architecture: what new components will be added, how
> > existing
> > > components will be modified, what new messages will be added, what new
> > > configuration knobs will be introduced, what are the milestones of the
> > > project, etc.
> > >
> > > CEP2: Make LWT faster and more reliable
> > > - UX changes needed: none
> > > - Distributed transaction protocol needed: Accord, already added by
> > > previous CEP.
> > > - High-level architecture: blablabla... and so on.
> > >
> > > On Fri, Oct 1, 2021 at 10:19, benedict@apache.org <
> > > benedict@apache.org> wrote:
> > >
> > > > I think this is getting circular and unproductive. Basic
> disagreements
> > > > about whether the CEP specifies a feature I am inclined to leave for
> a
> > > > vote. In my view the CEP specifies several features, both immediate
> > ones
> > > > for the user (ACID batches and multi-key LWTs) and developer-focused
> > ones
> > > > around ground-breaking semantics that will be enabled.
> > > >
> > > > The proposal as it stands today is exceptionally thorough, more so
> than
> > > > any other CEP to date, or any CEP is likely to be in the near future.
> > > >
> > > > This is a Cassandra Enhancement *Proposal*, and at some point we have
> > to
> > > > engage with what is proposed, not what you might like to be proposed.
> > > Since
> > > > it remains unclear to me what either yourself or Jonathan want to see
> > as
> > > an
> > > > alternative, at this point it would seem more productive to produce
> > your
> > > > own proposals for the community to consider. It is possible for
> > multiple
> > > > transaction systems to co-exist, if you feel this is necessary.
> > > >
> > > >
> > > >
> > > > From: Paulo Motta <pa...@gmail.com>
> > > > Date: Friday, 1 October 2021 at 13:58
> > > > To: Cassandra DEV <de...@cassandra.apache.org>
> > > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > I share jbellis's feeling that this proposal seems to be focusing
> > > > on the protocol itself but lacking the actual feature that will use the
> > > > protocol, which IMO is a key element to discuss in a CEP.
> > > >
> > > > It's similar to saying: hey I want to add this Tries Serialization
> > > Protocol
> > > > to Cassandra, but not providing specific details of how this protocol
> > is
> > > > going to be used.
> > > >
> > > > I think the right route for a CEP is to describe the feature that
> will
> > be
> > > > added to the database and the protocol is a mere requirement of the
> > > > high-level feature, for example:
> > > >
> > > > CEP: Add Trie-backed memtable
> > > > - Trie Serialization Protocol: implementation detail of the above CEP
> > > >
> > > > What is the difficulty of taking this approach, picking one of the
> > myriad
> > > > of features that will be enabled by Accord and using that as the
> > initial
> > > > CEP to introduce the protocol to the database?
> > > >
> > > > On Fri, Oct 1, 2021 at 08:37, benedict@apache.org <
> > > > benedict@apache.org> wrote:
> > > >
> > > > > Actually, thinking about it again, the simple optimistic protocol
> > would
> > > > in
> > > > > fact guarantee system forward progress (i.e. independent of
> > transaction
> > > > > formulation).
> > > > >
> > > > >
> > > > > From: benedict@apache.org <be...@apache.org>
> > > > > Date: Friday, 1 October 2021 at 09:14
> > > > > To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> > > > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > Hi Jonathan,
> > > > >
> > > > > It would be great if we could achieve a bandwidth higher than 1-2
> > short
> > > > > emails per week. It remains unclear to me what your goal is, and it
> > > would
> > > > > help if you could make a statement like “I want Cassandra to be
> able
> > to
> > > > do
> > > > > X” so that we can respond directly to it. I am also available to
> have
> > > > > another call, in which we can have a back and forth, please feel
> free
> > > to
> > > > > propose a London-compatible time within the next week that is
> > suitable
> > > > for
> > > > > you.
> > > > >
> > > > > In my opinion we are at risk of veering off-topic, though. This CEP
> > is
> > > > not
> > > > > to deliver interactive transactions, and to my knowledge nobody is
> > > > > proposing a CEP for interactive transactions. So, for the CEP at
> hand
> > > the
> > > > > salient question seems: does this CEP prevent us from implementing
> > > > > interactive transactions with properties X, Y, Z in future? To
> which
> > > the
> > > > > answer is almost certainly no.
> > > > >
> > > > > However, to continue the discussion and respond directly to your
> > > queries,
> > > > > I believe we agree on the definition of an interactive transaction.
> > > > >
> > > > > Two protocols were loosely outlined. The first, using timestamps
> for
> > > > > optimistic concurrency control, would indeed involve the
> possibility
> > of
> > > > > aborts. It would not however inherently adopt the issue of LWTs
> where
> > > no
> > > > > transaction is able to make progress. Whether or not progress is
> > > > guaranteed
> > > > > (in a livelock-free sense) would depend on the structure of the
> > > > > transactions that were interfering.
> > > > >
> > > > > This approach has the advantage of being very simple to implement,
> so
> > > > that
> > > > > we could realistically support interactive transactions quite
> > quickly.
> > > It
> > > > > has the additional advantage that transactions would execute very
> > > quickly
> > > > > by avoiding the WAN during construction, and as a result may in
> > > practice
> > > > > experience fewer aborts than protocols that guarantee
> > livelock-freedom.
> > > > >
> > > > > The second protocol proposed using read/write intents and would be
> > able
> > > > to
> > > > > support almost any behaviour you want. We could even utilise
> > > pessimistic
> > > > > concurrency control, or anything in-between. This is its own huge
> > > design
> > > > > space, and discussion of this approach and the trade-offs that
> could
> > be
> > > > > made is (in my opinion) entirely out of scope for this CEP.
> > > > >
> > > > >
> > > > > From: Jonathan Ellis <jb...@gmail.com>
> > > > > Date: Friday, 1 October 2021 at 05:00
> > > > > To: dev <de...@cassandra.apache.org>
> > > > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > The obstacle for me is you've provided a protocol but not a fully
> > > fleshed
> > > > > out architecture, so it's hard to fill in some of the blanks.  But
> it
> > > > looks
> > > > > to me like optimistic concurrency control for interactive
> > transactions
> > > > > applied to Accord would leave you in a LWT-like situation under
> > fairly
> > > > > light contention where nobody actually makes progress due to
> retries.
> > > > >
> > > > > To make sure we're talking about the same thing, as Henrik pointed
> > out,
> > > > > interactive transactions mean multiple round trips from the client
> > > > within a
> > > > > > transaction.  For example, here
> > > > > > <https://github.com/apavlo/py-tpcc/blob/master/pytpcc/drivers/sqlitedriver.py#L213>
> > > > > > is a simple implementation of the TPC-C New Order transaction.  The high
> > > > > > level logic (via
> > > > > > <https://courses.cs.washington.edu/courses/csep545/01wi/lectures/class1/tsld039.htm>)
> > > > > > is,
> > > > >
> > > > >    1. Get records describing a warehouse, customer, & district
> > > > >    2. Update the district
> > > > >    3. Increment next available order number
> > > > >    4. Insert record into Order and New-Order tables
> > > > >    5. For 5-15 items, get Item record, get/update Stock record
> > > > >    6. Insert Order-Line Record
> > > > >
> > > > > As you can see, this requires a lot of client-side logic mixed in
> > with
> > > > the
> > > > > actual SQL commands.
> > > > >
> > > > >
> > > > > On Thu, Sep 30, 2021 at 2:30 AM benedict@apache.org <
> > > benedict@apache.org
> > > > >
> > > > > wrote:
> > > > >
> > > > > > Essentially this, although I think in practice we will need to
> > track
> > > > each
> > > > > > partition’s timestamp separately (or optionally for reduced
> > > conflicts,
> > > > > each
> > > > > > row or datum’s), and make them all part of the conditional
> > > application
> > > > of
> > > > > > the transaction - at least for strict-serializability.
> > > > > >
> > > > > > The alternative is to insert read/write intents for the
> transaction
> > > > > during
> > > > > > each step, and to confirm they are still valid on commit, but
> this
> > > > > approach
> > > > > > would require a WAN round-trip for each step in the interactive
> > > > > > transaction, whereas the timestamp-validating approach can use a
> > LAN
> > > > > > round-trip for each step besides the final one, and is also much
> > > > simpler
> > > > > to
> > > > > > implement.
> > > > > >
> > > > > >
> > > > > > From: Blake Eggleston <be...@apple.com.INVALID>
> > > > > > Date: Thursday, 30 September 2021 at 05:47
> > > > > > To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> > > > > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > > You could establish a lower timestamp bound and buffer
> transaction
> > > > state
> > > > > > on the coordinator, then make the commit an operation that only
> > > applies
> > > > > if
> > > > > > all partitions involved haven’t been changed by a more recent
> > > > timestamp.
> > > > > > You could also implement mvcc either in the storage layer or for
> > some
> > > > > > period of time by buffering commits on each replica before
> > applying.
> > > > > >
> > > > > > > > On Sep 29, 2021, at 6:18 PM, Jonathan Ellis <jbellis@gmail.com> wrote:
> > > > > > >
> > > > > > > How are interactive transactions possible with Accord?
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Tue, Sep 21, 2021 at 11:56 PM benedict@apache.org <
> > > > > > benedict@apache.org>
> > > > > > > wrote:
> > > > > > >
> > > > > > >> Could you explain why you believe this trade-off is necessary?
> > We
> > > > can
> > > > > > >> support full SQL just fine with Accord, and I hope that we
> > > > eventually
> > > > > > do so.
> > > > > > >>
> > > > > > >> This domain is incredibly complex, so it is easy to reach
> wrong
> > > > > > >> conclusions. I would invite you again to propose a system for
> > > > > discussion
> > > > > > >> that you think offers something Accord is unable to, and that
> > you
> > > > > > consider
> > > > > > >> desirable, and we can work from there.
> > > > > > >>
> > > > > > >> To pre-empt some possible discussions, I am not aware of
> > anything
> > > we
> > > > > > >> cannot do with Accord that we could do with either Calvin or
> > > > Spanner.
> > > > > > >> Interactive transactions are possible on top of Accord, as are
> > > > > > transactions
> > > > > > > >> with an unknown read/write set. In each case the only cost is that they
> > > > > > > >> would use optimistic concurrency control, which is no worse than the
> > > > > > > >> Spanner derivatives anyway (which I have to assume is your benchmark in
> > > > > > > >> this regard). I do not expect to deliver either functionality
> > > initially,
> > > > > but
> > > > > > >> Accord takes us most of the way there for both.
> > > > > > >>
> > > > > > >>
> > > > > > >> From: Jonathan Ellis <jb...@gmail.com>
> > > > > > >> Date: Wednesday, 22 September 2021 at 05:36
> > > > > > >> To: dev <de...@cassandra.apache.org>
> > > > > > >> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > > >> Right, I'm looking for exactly a discussion on the high level
> > > goals.
> > > > > > >> Instead of saying "here's the goals and we ruled out X because
> > Y"
> > > we
> > > > > > should
> > > > > > > >> start with a discussion around, "Approach A allows X and W, approach B
> > > > > > > >> allows Y and Z" and decide together what the goals should be and what
> > > > > > > >> we are willing to trade to get those goals, e.g., are we willing to give up
> > > > > > > >> global strict serializability to get the ability to support full SQL? Both
> > > > > > > >> of these are nice to have!
> > > > > > >>
> > > > > > >> On Tue, Sep 21, 2021 at 9:52 PM benedict@apache.org <
> > > > > > benedict@apache.org>
> > > > > > >> wrote:
> > > > > > >>
> > > > > > >>> Hi Jonathan,
> > > > > > >>>
> > > > > > >>> These other systems are incompatible with the goals of the
> > CEP. I
> > > > do
> > > > > > >>> discuss them (besides 2PC) in both the whitepaper and the
> CEP,
> > > and
> > > > > will
> > > > > > >>> summarise that discussion below. A true and accurate
> comparison
> > > of
> > > > > > these
> > > > > > >>> other systems is essentially intractable, as there are
> complex
> > > > > > subtleties
> > > > > > >>> to each flavour, and those who are interested would be better
> > > > served
> > > > > by
> > > > > > >>> performing their own research.
> > > > > > >>>
> > > > > > >>> I think it is more productive to focus on what we want to
> > achieve
> > > > as
> > > > > a
> > > > > > >>> community. If you believe the goals of this CEP are wrong for
> > the
> > > > > > >> project,
> > > > > > >>> let’s focus on that. If you want to compare and contrast
> > specific
> > > > > > facets
> > > > > > >> of
> > > > > > >>> alternative systems that you consider to be preferable in
> some
> > > > > > dimension,
> > > > > > >>> let’s do that here or in a Q&A as proposed by Joey.
> > > > > > >>>
> > > > > > >>> The relevant goals are that we:
> > > > > > >>>
> > > > > > >>>
> > > > > > >>>  1.  Guarantee strict serializable isolation on commodity
> > > hardware
> > > > > > >>>  2.  Scale to any cluster size
> > > > > > >>>  3.  Achieve optimal latency
> > > > > > >>>
> > > > > > >>> The approach taken by Spanner derivatives is rejected by (1)
> > > > because
> > > > > > they
> > > > > > >>> guarantee only Serializable isolation (they additionally fail
> > > (3)).
> > > > > > From
> > > > > > >>> watching talks by YugaByte, and inferring from Cockroach’s
> > > > > > >>> panic-cluster-death under clock skew, this is clearly
> > considered
> > > by
> > > > > > >>> everyone to be undesirable but necessary to achieve
> > scalability.
> > > > > > >>>
> > > > > > >>> The approach taken by FaunaDB (Calvin) is rejected by (2)
> > because
> > > > its
> > > > > > >>> sequencing layer requires a global leader process for the
> > > cluster,
> > > > > > which
> > > > > > >> is
> > > > > > >>> incompatible with Cassandra’s scalability requirements. It
> > > > > additionally
> > > > > > >>> fails (3) for global clients.
> > > > > > >>>
> > > > > > >>> Two phase commit fails (3). As an aside, AFAICT DynamoDB is
> > > today a
> > > > > > >>> Spanner clone for its multi-key transaction functionality,
> not
> > > 2PC.
> > > > > > >>>
> > > > > > >>> Systems such as RAMP with even weaker isolation are not
> > > considered
> > > > > for
> > > > > > >> the
> > > > > > >>> simple reason that they do not even claim to meet (1).
> > > > > > >>>
> > > > > > >>> If we want to additionally offer weaker isolation levels than
> > > > > > >>> Serializable, such as that provided by the recent RAMP-TAO
> > paper,
> > > > > > >> Cassandra
> > > > > > >>> is likely able to support multiple distinct transaction
> layers
> > > that
> > > > > > >> operate
> > > > > > >>> independently. I would encourage you to file a CEP to explore
> > how
> > > > we
> > > > > > can
> > > > > > >>> meet these distinct use cases, but I consider them to be
> > niche. I
> > > > > > expect
> > > > > > >>> that a majority of our user base desire strict serializable
> > > > > isolation,
> > > > > > >> and
> > > > > > >>> certainly no less than serializable isolation, to augment the
> > > > > existing
> > > > > > >>> weaker isolation offered by quorum reads and writes.
> > > > > > >>>
> > > > > > >>> I would tangentially note that we are not an AP database
> under
> > > > normal
> > > > > > >>> recommended operation. A minority in any network partition
> > cannot
> > > > > reach
> > > > > > >>> QUORUM, so under recommended usage we are a high-availability
> > > > > > leaderless
> > > > > > >> CP
> > > > > > >>> database.
> > > > > > >>>
> > > > > > >>>
> > > > > > >>> From: Jonathan Ellis <jb...@gmail.com>
> > > > > > >>> Date: Tuesday, 21 September 2021 at 23:45
> > > > > > >>> To: dev <de...@cassandra.apache.org>
> > > > > > >>> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > > >>> Benedict, thanks for taking the lead in putting this
> together.
> > > > Since
> > > > > > >>> Cassandra is the only relevant database today designed
> around a
> > > > > > >> leaderless
> > > > > > >>> architecture, it's quite likely that we'll be better served
> > with
> > > a
> > > > > > custom
> > > > > > >>> transaction design instead of trying to retrofit one from CP
> > > > systems.
> > > > > > >>>
> > > > > > >>> The whitepaper here is a good description of the consensus
> > > > algorithm
> > > > > > >> itself
> > > > > > >>> as well as its robustness and stability characteristics, and
> > its
> > > > > > >> comparison
> > > > > > >>> with other state-of-the-art consensus algorithms is very
> > useful.
> > > > In
> > > > > > the
> > > > > > >>> context of Cassandra, where a consensus algorithm is only
> part
> > of
> > > > > what
> > > > > > >> will
> > > > > > >>> be implemented, I'd like to see a more complete evaluation of
> > the
> > > > > > >>> transactional side of things as well, including performance
> > > > > > >> characteristics
> > > > > > >>> as well as the types of transactions that can be supported
> and
> > at
> > > > > > least a
> > > > > > >>> general idea of what it would look like applied to Cassandra.
> > > This
> > > > > will
> > > > > > >>> allow the PMC to make a more informed decision about what
> > > tradeoffs
> > > > > are
> > > > > > >>> best for the entire long-term project of first supplementing
> > and
> > > > > > >> ultimately
> > > > > > >>> replacing LWT.
> > > > > > >>>
> > > > > > >>> (Allowing users to mix LWT and AP Cassandra operations
> against
> > > the
> > > > > same
> > > > > > >>> rows was probably a mistake, so in contrast with LWT we’re
> not
> > > > > looking
> > > > > > >> for
> > > > > > >>> something fast enough for occasional use but rather something
> > > > within
> > > > > a
> > > > > > >>> reasonable factor of AP operations, appropriate to being the
> > only
> > > > way
> > > > > > to
> > > > > > >>> interact with tables declared as such.)
> > > > > > >>>
> > > > > > >>> Besides Accord, this should cover
> > > > > > >>>
> > > > > > >>> - Calvin and FaunaDB
> > > > > > >>> - A Spanner derivative (no opinion on whether that should be
> > > > > Cockroach
> > > > > > or
> > > > > > >>> Yugabyte, I don’t think it’s necessary to cover both)
> > > > > > >>> - A 2PC implementation (the Accord paper mentions DynamoDB
> but
> > I
> > > > > > suspect
> > > > > > >>> there is more public information about MongoDB)
> > > > > > >>> - RAMP
> > > > > > >>>
> > > > > > >>> Here’s an example of what I mean:
> > > > > > >>>
> > > > > > >>> =Calvin=
> > > > > > >>>
> > > > > > >>> Approach: global consensus (Paxos in Calvin, Raft in FaunaDB)
> > to
> > > > > order
> > > > > > >>> transactions, then replicas execute the transactions
> > > independently
> > > > > with
> > > > > > >> no
> > > > > > >>> further coordination.  No SPOF.  Transactions are batched by
> > each
> > > > > > >> sequencer
> > > > > > >>> to keep this from becoming a bottleneck.
> > > > > > >>>
> > > > > > >>> Performance: Calvin paper (published 2012) reports linear
> > scaling
> > > > of
> > > > > > >> TPC-C
> > > > > > >>> New Order up to 500,000 transactions/s on 100 machines (EC2
> XL
> > > > > machines
> > > > > > >>> with 7GB ram and 8 virtual cores).  Note that TPC-C New Order
> > is
> > > > > > composed
> > > > > > >>> of four reads and four writes, so this is effectively 2M
> reads
> > > and
> > > > 2M
> > > > > > >>> writes as we normally measure them in C*.
> > > > > > >>>
> > > > > > >>> Calvin supports mixed read/write transactions, but because
> the
> > > > > > >> transaction
> > > > > > >>> execution logic requires knowing all partition keys in
> advance
> > to
> > > > > > ensure
> > > > > > >>> that all replicas can reproduce the same results with no
> > > > > coordination,
> > > > > > >>> reads against non-PK predicates must be done ahead of time
> > > > > > >> (transparently,
> > > > > > >>> by the server) to determine the set of keys, and this must be
> > > > retried
> > > > > > if
> > > > > > >>> the set of rows affected is updated before the actual
> > transaction
> > > > > > >> executes.
> > > > > > >>>
> > > > > > >>> Batching and global consensus adds latency -- 100ms in the
> > Calvin
> > > > > paper
> > > > > > >> and
> > > > > > >>> apparently about 50ms in FaunaDB.  Glass half full: all
> > > > transactions
> > > > > > >>> (including multi-partition updates) are equally performant in
> > > > Calvin
> > > > > > >> since
> > > > > > >>> the coordination is handled up front in the sequencing step.
> > > Glass
> > > > > > half
> > > > > > >>> empty: even single-row reads and writes have to pay the full
> > > > > > coordination
> > > > > > >>> cost.  Fauna has optimized this away for reads but I am not
> > aware
> > > > of
> > > > > a
> > > > > > >>> description of how they changed the design to allow this.
> > > > > > >>>
> > > > > > >>> Functionality and limitations: since the entire transaction
> > must
> > > be
> > > > > > known
> > > > > > >>> in advance to allow coordination-less execution at the
> > replicas,
> > > > > Calvin
> > > > > > >>> cannot support interactive transactions at all.  FaunaDB
> > > mitigates
> > > > > this
> > > > > > >> by
> > > > > > >>> allowing server-side logic to be included, but a Calvin
> > approach
> > > > will
> > > > > > >> never
> > > > > > >>> be able to offer SQL compatibility.
> > > > > > >>>
> > > > > > >>> Guarantees: Calvin transactions are strictly serializable.
> > There
> > > > is
> > > > > no
> > > > > > >>> additional complexity or performance hit to generalizing to
> > > > multiple
> > > > > > >>> regions, apart from the speed of light.  And since Calvin is
> > > > already
> > > > > > >> paying
> > > > > > >>> a batching latency penalty, this is less painful than for
> other
> > > > > > systems.
> > > > > > >>>
> > > > > > >>> Application to Cassandra: B-.  Distributed transactions are
> > > handled
> > > > > by
> > > > > > >> the
> > > > > > >>> sequencing and scheduling layers, which are leaderless, and
> > > > Calvin’s
> > > > > > >>> requirements for the storage layer are easily met by C*.  But
> > > > Calvin
> > > > > > also
> > > > > > >>> requires a global consensus protocol and LWT is almost
> > certainly
> > > > not
> > > > > > >>> sufficiently performant, so this would require ZK or etcd
> > > > (reasonable
> > > > > > >> for a
> > > > > > >>> library approach but not for replacing LWT in C* itself), or
> an
> > > > > > >>> implementation of Accord.  I don’t believe Calvin would
> require
> > > > > > >> additional
> > > > > > >>> table-level metadata in Cassandra.
> > > > > > >>>
> > > > > > >>> On Sun, Sep 5, 2021 at 9:33 AM benedict@apache.org <
> > > > > > benedict@apache.org>
> > > > > > >>> wrote:
> > > > > > >>>
> > > > > > > >>>> Wiki: https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-15%3A+General+Purpose+Transactions
> > > > > > > >>>> Whitepaper: https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf
> > > > > > > >>>> <https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf?version=1&modificationDate=1630847736966&api=v2>
> > > > > > >>>> Prototype: https://github.com/belliottsmith/accord
> > > > > > >>>>
> > > > > > >>>> Hi everyone, I’d like to propose this CEP for adoption by
> the
> > > > > > >> community.
> > > > > > >>>>
> > > > > > >>>> Cassandra has benefitted from LWTs for many years, but
> > > application
> > > > > > >>>> developers that want to ensure consistency for complex
> > > operations
> > > > > must
> > > > > > >>>> either accept the scalability bottleneck of serializing all
> > > > related
> > > > > > >> state
> > > > > > >>>> through a single partition, or layer a complex state machine
> > on
> > > > top
> > > > > of
> > > > > > >>> the
> > > > > > >>>> database. These are sophisticated and costly activities that
> > our
> > > > > users
> > > > > > >>>> should not be expected to undertake. Since distributed
> > databases
> > > > are
> > > > > > >>>> beginning to offer distributed transactions with fewer
> > caveats,
> > > it
> > > > > is
> > > > > > >>> past
> > > > > > >>>> time for Cassandra to do so as well.
> > > > > > >>>>
> > > > > > >>>> This CEP proposes the use of several novel techniques that
> > build
> > > > > upon
> > > > > > >>>> research (that followed EPaxos) to deliver (non-interactive)
> > > > general
> > > > > > >>>> purpose distributed transactions. The approach is outlined
> in
> > > the
> > > > > > >>> wikipage
> > > > > > >>>> and in more detail in the linked whitepaper. Importantly, by
> > > > > adopting
> > > > > > >>> this
> > > > > > >>>> approach we will be the _only_ distributed database to offer
> > > > global,
> > > > > > >>>> scalable, strict serializable transactions in one wide area
> > > > > > round-trip.
> > > > > > >>>> This would represent a significant improvement in the state
> of
> > > the
> > > > > > art,
> > > > > > >>>> both in the academic literature and in commercial or open
> > source
> > > > > > >>> offerings.
> > > > > > >>>>
> > > > > > >>>> This work has been partially realised in a prototype. This
> > > partial
> > > > > > >>>> prototype has been verified against Jepsen.io’s Maelstrom
> > > library
> > > > > and
> > > > > > >>>> dedicated in-tree strict serializability verification tools,
> > but
> > > > > much
> > > > > > >>> work
> > > > > > >>>> remains for the work to be production capable and integrated
> > > into
> > > > > > >>> Cassandra.
> > > > > > >>>>
> > > > > > >>>> I propose including the prototype in the project as a new
> > source
> > > > > > >>>> repository, to be developed as a standalone library for
> > > > integration
> > > > > > >> into
> > > > > > >>>> Cassandra. I hope the community sees the important value
> > > > proposition
> > > > > > of
> > > > > > >>>> this proposal, and will adopt the CEP after this discussion,
> > so
> > > > that
> > > > > > >> the
> > > > > > >>>> library and its integration into Cassandra can be developed
> in
> > > > > > parallel
> > > > > > >>> and
> > > > > > >>>> with the involvement of the wider community.
> > > > > > >>>>
> > > > > > >>>
> > > > > > >>>
> > > > > > >>> --
> > > > > > >>> Jonathan Ellis
> > > > > > >>> co-founder, http://www.datastax.com
> > > > > > >>> @spyced
> > > > > > >>>
> > > > > > >>
> > > > > > >>
> > > > > > >> --
> > > > > > >> Jonathan Ellis
> > > > > > >> co-founder, http://www.datastax.com
> > > > > > >> @spyced
> > > > > > >>
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Jonathan Ellis
> > > > > > > co-founder, http://www.datastax.com
> > > > > > > @spyced
> > > > > >
> > > > > >
> > > > > >
> > ---------------------------------------------------------------------
> > > > > > To unsubscribe, e-mail: dev-unsubscribe@cassandra.apache.org
> > > > > > For additional commands, e-mail: dev-help@cassandra.apache.org
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Jonathan Ellis
> > > > > co-founder, http://www.datastax.com
> > > > > @spyced
> > > > >
> > > >
> > >
> >
>


-- 
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by Paulo Motta <pa...@gmail.com>.

> With respect to this, in my view this kind of detail is not warranted
> within a CEP. Software development is an exploratory process with respect
> to structure, and these decisions will be made as the CEP progresses. If
> these need to be specified upfront, then the purpose of a CEP – seeking buy
> in – is invalidated, because the work must be complete before you know the
> answers.

These need not be set in stone; they're just a rough sketch of what the
end product will look like, to make it easier to build a mental model of the
project, especially for those not directly involved with it, as well as to
guide its development for those involved. At least for me, it's much easier
to visualize a project top-down (from how it's going to be used to its
particular implementation details) than the other way around.

On Fri, Oct 1, 2021 at 11:33, benedict@apache.org <benedict@apache.org>
wrote:

> > The current document thoroughly details the protocol but, in my view,
> > fails to illustrate what specific APIs, methods, and modules will become
> > available to developers
>
> With respect to this, in my view this kind of detail is not warranted
> within a CEP. Software development is an exploratory process with respect
> to structure, and these decisions will be made as the CEP progresses. If
> these need to be specified upfront, then the purpose of a CEP – seeking buy
> in – is invalidated, because the work must be complete before you know the
> answers.
>
>
> From: benedict@apache.org <be...@apache.org>
> Date: Friday, 1 October 2021 at 15:31
> To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> From the CEP:
>
> Batches (including unconditional batches) on transactional tables will
> receive ACID properties, and grammatically correct conditional batch
> operations that would be rejected for operating over multiple CQL
> partitions will now be supported
>
>
> From: Paulo Motta <pa...@gmail.com>
> Date: Friday, 1 October 2021 at 15:30
> To: Cassandra DEV <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> Can you just answer what palpable feature will be available once this CEP
> lands? This is still not clear to me (and perhaps to others) from the
> current CEP structure. The current document thoroughly details the
> protocol but, in my view, fails to illustrate what specific APIs, methods,
> and modules will become available to developers, how it fits into the
> larger picture and interacts with existing modules, if at all, and perhaps
> a few examples of how it can be used to build features on top.
>
> On Fri, Oct 1, 2021 at 11:10, benedict@apache.org <benedict@apache.org>
> wrote:
>
> > I’m not, though it might seem that way. I disagree with your views about
> > how CEPs should be structured. Since the CEP process was itself codified
> > via the CEP process, if you want to recodify how CEPs work, the correct
> > way is via the CEP process itself.
> >
> > The discussion is being drawn in multiple directions away from the CEP
> > itself, and I am trying to keep this particular thread focused on the
> > business at hand, not meta discussions around CEP structure that will no
> > doubt be unproductive given likely irreconcilable views about the topic,
> > nor discussions about other CEPs that could have been.
> >
> > If you want to start a separate exploratory discussion thread about CEP
> > structure without filing a CEP feel free to do so.
> >
> >
> > From: Paulo Motta <pa...@gmail.com>
> > Date: Friday, 1 October 2021 at 15:04
> > To: Cassandra DEV <de...@cassandra.apache.org>
> > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > If you want to impose your views on CEP structure on others, please
> > > file a CEP with the additional restrictions and guidance you want to
> > > impose and start a discussion thread. I can then respond in detail to
> > > why I perceive this approach to be flawed, in a dedicated context.
> >
> > This sounds very Kafkaesque. You know I won't file a meta-CEP to change
> > the structure of CEPs, so you're just using this as an excuse to shut
> > down the discussion on the lack of clarity about what actual palpable
> > feature will be available once the CEP lands. :-)
> >
> > I'm just providing my humble feedback on how a CEP could be more
> > digestible and easier to consume from an external point of view, and this
> > seems like an appropriate and contextualized place to voice this opinion,
> > which is perhaps shared by others.
> >
> > On Fri, Oct 1, 2021 at 10:55, benedict@apache.org <benedict@apache.org>
> > wrote:
> >
> > > I disagree with you. However, this is the wrong forum to have a meta
> > > discussion about how CEPs should be structured.
> > >
> > > If you want to impose your views on CEP structure on others, please
> > > file a CEP with the additional restrictions and guidance you want to
> > > impose and start a discussion thread. I can then respond in detail to
> > > why I perceive this approach to be flawed, in a dedicated context.
> > >
> > >
> > > From: Paulo Motta <pa...@gmail.com>
> > > Date: Friday, 1 October 2021 at 14:48
> > > To: Cassandra DEV <de...@cassandra.apache.org>
> > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > >  The proposal as it stands today is exceptionally thorough, more so
> > > > than any other CEP to date, or any CEP is likely to be in the near
> > > > future.
> > >
> > > The protocol is thoroughly described, but in my view a CEP is a forum to
> > > discuss the high level architecture and plan for adding a full
> > > end-to-end enhancement to the database, breaking it into sub-CEPs if
> > > needed, as long as the full plan is known in advance; otherwise the
> > > community will not have the context to judge the full extent and impact
> > > of the proposed enhancement.
> > >
> > > > Since it remains unclear to me what either yourself or Jonathan want
> > > > to see as an alternative
> > >
> > > I would personally like to see something along these lines:
> > >
> > > CEP1: Add ACID-compliant atomic batches
> > > - UX changes needed: none, CQL provides the grammar we need.
> > > - Distributed transaction protocol needed: Accord (link to white paper
> > > if you want specific details about the protocol)
> > > - High-level architecture: what new components will be added, how
> > > existing components will be modified, what new messages will be added,
> > > what new configuration knobs will be introduced, what are the milestones
> > > of the project, etc.
> > >
> > > CEP2: Make LWT faster and more reliable
> > > - UX changes needed: none
> > > - Distributed transaction protocol needed: Accord, already added by
> > > previous CEP.
> > > - High-level architecture: blablabla... and so on.
> > >
> > > On Fri, Oct 1, 2021 at 10:19, benedict@apache.org <benedict@apache.org>
> > > wrote:
> > >
> > > > I think this is getting circular and unproductive. Basic
> > > > disagreements about whether the CEP specifies a feature I am inclined
> > > > to leave for a vote. In my view the CEP specifies several features,
> > > > both immediate ones for the user (ACID batches and multi-key LWTs) and
> > > > developer-focused ones around ground-breaking semantics that will be
> > > > enabled.
> > > >
> > > > The proposal as it stands today is exceptionally thorough, more so
> > > > than any other CEP to date, or any CEP is likely to be in the near
> > > > future.
> > > >
> > > > This is a Cassandra Enhancement *Proposal*, and at some point we have
> > > > to engage with what is proposed, not what you might like to be
> > > > proposed. Since it remains unclear to me what either yourself or
> > > > Jonathan want to see as an alternative, at this point it would seem
> > > > more productive to produce your own proposals for the community to
> > > > consider. It is possible for multiple transaction systems to co-exist,
> > > > if you feel this is necessary.
> > > >
> > > >
> > > >
> > > > From: Paulo Motta <pa...@gmail.com>
> > > > Date: Friday, 1 October 2021 at 13:58
> > > > To: Cassandra DEV <de...@cassandra.apache.org>
> > > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > I share similar feelings as jbellis that this proposal seems to be
> > > > focusing on the protocol itself but lacking the actual feature that
> > > > will use the protocol, which IMO is a key element to discuss in a CEP.
> > > >
> > > > It's similar to saying: hey, I want to add this Tries Serialization
> > > > Protocol to Cassandra, but not providing specific details of how this
> > > > protocol is going to be used.
> > > >
> > > > I think the right route for a CEP is to describe the feature that will
> > > > be added to the database, with the protocol being a mere requirement of
> > > > the high-level feature, for example:
> > > >
> > > > CEP: Add Trie-backed memtable
> > > > - Trie Serialization Protocol: implementation detail of the above CEP
> > > >
> > > > What is the difficulty of taking this approach, picking one of the
> > > > myriad features that will be enabled by Accord and using that as the
> > > > initial CEP to introduce the protocol to the database?
> > > >
> > > > On Fri, Oct 1, 2021 at 08:37, benedict@apache.org <benedict@apache.org>
> > > > wrote:
> > > >
> > > > > Actually, thinking about it again, the simple optimistic protocol
> > > > > would in fact guarantee system forward progress (i.e. independent of
> > > > > transaction formulation).
> > > > >
> > > > >
> > > > > From: benedict@apache.org <be...@apache.org>
> > > > > Date: Friday, 1 October 2021 at 09:14
> > > > > To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> > > > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > Hi Jonathan,
> > > > >
> > > > > It would be great if we could achieve a bandwidth higher than 1-2
> > short
> > > > > emails per week. It remains unclear to me what your goal is, and it
> > > would
> > > > > help if you could make a statement like “I want Cassandra to be
> able
> > to
> > > > do
> > > > > X” so that we can respond directly to it. I am also available to
> have
> > > > > another call, in which we can have a back and forth, please feel
> free
> > > to
> > > > > propose a London-compatible time within the next week that is
> > suitable
> > > > for
> > > > > you.
> > > > >
> > > > > In my opinion we are at risk of veering off-topic, though. This CEP
> > is
> > > > not
> > > > > to deliver interactive transactions, and to my knowledge nobody is
> > > > > proposing a CEP for interactive transactions. So, for the CEP at
> hand
> > > the
> > > > > salient question seems: does this CEP prevent us from implementing
> > > > > interactive transactions with properties X, Y, Z in future? To
> which
> > > the
> > > > > answer is almost certainly no.
> > > > >
> > > > > However, to continue the discussion and respond directly to your
> > > queries,
> > > > > I believe we agree on the definition of an interactive transaction.
> > > > >
> > > > > Two protocols were loosely outlined. The first, using timestamps
> for
> > > > > optimistic concurrency control, would indeed involve the
> possibility
> > of
> > > > > aborts. It would not however inherently adopt the issue of LWTs
> where
> > > no
> > > > > transaction is able to make progress. Whether or not progress is
> > > > guaranteed
> > > > > (in a livelock-free sense) would depend on the structure of the
> > > > > transactions that were interfering.
> > > > >
> > > > > This approach has the advantage of being very simple to implement,
> so
> > > > that
> > > > > we could realistically support interactive transactions quite
> > quickly.
> > > It
> > > > > has the additional advantage that transactions would execute very
> > > quickly
> > > > > by avoiding the WAN during construction, and as a result may in
> > > practice
> > > > > experience fewer aborts than protocols that guarantee
> > livelock-freedom.
> > > > >
> > > > > The second protocol proposed using read/write intents and would be
> > able
> > > > to
> > > > > support almost any behaviour you want. We could even utilise
> > > pessimistic
> > > > > concurrency control, or anything in-between. This is its own huge
> > > design
> > > > > space, and discussion of this approach and the trade-offs that
> could
> > be
> > > > > made is (in my opinion) entirely out of scope for this CEP.
> > > > >
> > > > >
> > > > > From: Jonathan Ellis <jb...@gmail.com>
> > > > > Date: Friday, 1 October 2021 at 05:00
> > > > > To: dev <de...@cassandra.apache.org>
> > > > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > The obstacle for me is you've provided a protocol but not a fully
> > > fleshed
> > > > > out architecture, so it's hard to fill in some of the blanks.  But
> it
> > > > looks
> > > > > to me like optimistic concurrency control for interactive
> > transactions
> > > > > applied to Accord would leave you in a LWT-like situation under
> > fairly
> > > > > light contention where nobody actually makes progress due to
> retries.
> > > > >
> > > > > > To make sure we're talking about the same thing, as Henrik pointed
> > > > > > out, interactive transactions mean multiple round trips from the
> > > > > > client within a transaction.  For example, here
> > > > > > <https://github.com/apavlo/py-tpcc/blob/master/pytpcc/drivers/sqlitedriver.py#L213>
> > > > > > is a simple implementation of the TPC-C New Order transaction.  The
> > > > > > high level logic (via
> > > > > > <https://courses.cs.washington.edu/courses/csep545/01wi/lectures/class1/tsld039.htm>)
> > > > > > is,
> > > > >    1. Get records describing a warehouse, customer, & district
> > > > >    2. Update the district
> > > > >    3. Increment next available order number
> > > > >    4. Insert record into Order and New-Order tables
> > > > >    5. For 5-15 items, get Item record, get/update Stock record
> > > > >    6. Insert Order-Line Record
> > > > >
> > > > > As you can see, this requires a lot of client-side logic mixed in
> > with
> > > > the
> > > > > actual SQL commands.
> > > > >
> > > > >
> > > > > On Thu, Sep 30, 2021 at 2:30 AM benedict@apache.org <
> > > benedict@apache.org
> > > > >
> > > > > wrote:
> > > > >
> > > > > > > Essentially this, although I think in practice we will need to
> > > > > > > track each partition’s timestamp separately (or optionally for
> > > > > > > reduced conflicts, each row or datum’s), and make them all part of
> > > > > > > the conditional application of the transaction - at least for
> > > > > > > strict-serializability.
> > > > > > >
> > > > > > > The alternative is to insert read/write intents for the
> > > > > > > transaction during each step, and to confirm they are still valid
> > > > > > > on commit, but this approach would require a WAN round-trip for
> > > > > > > each step in the interactive transaction, whereas the
> > > > > > > timestamp-validating approach can use a LAN round-trip for each
> > > > > > > step besides the final one, and is also much simpler to implement.
> > > > > >
> > > > > >
> > > > > > From: Blake Eggleston <be...@apple.com.INVALID>
> > > > > > Date: Thursday, 30 September 2021 at 05:47
> > > > > > To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> > > > > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > > > You could establish a lower timestamp bound and buffer
> > > > > > > transaction state on the coordinator, then make the commit an
> > > > > > > operation that only applies if all partitions involved haven’t
> > > > > > > been changed by a more recent timestamp. You could also implement
> > > > > > > MVCC either in the storage layer or for some period of time by
> > > > > > > buffering commits on each replica before applying.
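
The timestamp-validating commit described above can be sketched roughly as follows. This is only an illustrative, single-process model and not Accord or Cassandra code: the names Store, Partition and BufferedTxn are hypothetical, the integer clock stands in for a real timestamp source, and replication and the consensus round are left out entirely.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """One partition's current value and the timestamp of its last write."""
    value: int = 0
    last_write_ts: int = 0


@dataclass
class BufferedTxn:
    """Transaction state held on the (hypothetical) coordinator until commit."""
    start_ts: int                                # lower timestamp bound
    reads: set = field(default_factory=set)      # partition keys read
    writes: dict = field(default_factory=dict)   # partition key -> new value


class Store:
    def __init__(self):
        self.partitions = {}
        self.clock = 0  # assumption: a simple counter instead of real timestamps

    def begin(self):
        return BufferedTxn(start_ts=self.clock)

    def read(self, txn, key):
        txn.reads.add(key)
        if key in txn.writes:                    # read-your-own-writes
            return txn.writes[key]
        return self.partitions.setdefault(key, Partition()).value

    def write(self, txn, key, value):
        txn.writes[key] = value                  # buffered, not yet visible

    def commit(self, txn):
        """Apply the buffered writes only if no touched partition has changed."""
        for key in txn.reads | set(txn.writes):
            part = self.partitions.setdefault(key, Partition())
            if part.last_write_ts > txn.start_ts:
                return False                     # conflict: the caller may retry
        self.clock += 1
        for key, value in txn.writes.items():
            part = self.partitions[key]
            part.value, part.last_write_ts = value, self.clock
        return True
```

With two transactions that both start before either commits and both update the same partition, the first commit succeeds and the second is rejected, which is the abort-and-retry behaviour discussed in this thread.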
> > > > > >
> > > > > > > On Sep 29, 2021, at 6:18 PM, Jonathan Ellis <jbellis@gmail.com
> >
> > > > wrote:
> > > > > > >
> > > > > > > How are interactive transactions possible with Accord?
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Tue, Sep 21, 2021 at 11:56 PM benedict@apache.org <
> > > > > > benedict@apache.org>
> > > > > > > wrote:
> > > > > > >
> > > > > > >> Could you explain why you believe this trade-off is necessary?
> > We
> > > > can
> > > > > > >> support full SQL just fine with Accord, and I hope that we
> > > > eventually
> > > > > > do so.
> > > > > > >>
> > > > > > >> This domain is incredibly complex, so it is easy to reach
> wrong
> > > > > > >> conclusions. I would invite you again to propose a system for
> > > > > discussion
> > > > > > >> that you think offers something Accord is unable to, and that
> > you
> > > > > > consider
> > > > > > >> desirable, and we can work from there.
> > > > > > >>
> > > > > > > >> To pre-empt some possible discussions, I am not aware of
> > > > > > > >> anything we cannot do with Accord that we could do with either
> > > > > > > >> Calvin or Spanner. Interactive transactions are possible on top
> > > > > > > >> of Accord, as are transactions with an unknown read/write set.
> > > > > > > >> In each case the only cost is that they would use optimistic
> > > > > > > >> concurrency control, which is no worse than the Spanner
> > > > > > > >> derivatives anyway (which I have to assume is your benchmark in
> > > > > > > >> this regard). I do not expect to deliver either functionality
> > > > > > > >> initially, but Accord takes us most of the way there for both.
> > > > > > >>
> > > > > > >>
> > > > > > >> From: Jonathan Ellis <jb...@gmail.com>
> > > > > > >> Date: Wednesday, 22 September 2021 at 05:36
> > > > > > >> To: dev <de...@cassandra.apache.org>
> > > > > > >> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > > > >> Right, I'm looking for exactly a discussion on the high level
> > > > > > > >> goals. Instead of saying "here's the goals and we ruled out X
> > > > > > > >> because Y" we should start with a discussion around, "Approach
> > > > > > > >> A allows X and W, approach B allows Y and Z" and decide together
> > > > > > > >> what the goals should be and what we are willing to trade to get
> > > > > > > >> those goals, e.g., are we willing to give up global strict
> > > > > > > >> serializability to get the ability to support full SQL.  Both of
> > > > > > > >> these are nice to have!
> > > > > > >>
> > > > > > >> On Tue, Sep 21, 2021 at 9:52 PM benedict@apache.org <
> > > > > > benedict@apache.org>
> > > > > > >> wrote:
> > > > > > >>
> > > > > > >>> Hi Jonathan,
> > > > > > >>>
> > > > > > >>> These other systems are incompatible with the goals of the
> > CEP. I
> > > > do
> > > > > > >>> discuss them (besides 2PC) in both the whitepaper and the
> CEP,
> > > and
> > > > > will
> > > > > > >>> summarise that discussion below. A true and accurate
> comparison
> > > of
> > > > > > these
> > > > > > >>> other systems is essentially intractable, as there are
> complex
> > > > > > subtleties
> > > > > > >>> to each flavour, and those who are interested would be better
> > > > served
> > > > > by
> > > > > > >>> performing their own research.
> > > > > > >>>
> > > > > > >>> I think it is more productive to focus on what we want to
> > achieve
> > > > as
> > > > > a
> > > > > > >>> community. If you believe the goals of this CEP are wrong for
> > the
> > > > > > >> project,
> > > > > > >>> let’s focus on that. If you want to compare and contrast
> > specific
> > > > > > facets
> > > > > > >> of
> > > > > > >>> alternative systems that you consider to be preferable in
> some
> > > > > > dimension,
> > > > > > >>> let’s do that here or in a Q&A as proposed by Joey.
> > > > > > >>>
> > > > > > >>> The relevant goals are that we:
> > > > > > >>>
> > > > > > >>>
> > > > > > >>>  1.  Guarantee strict serializable isolation on commodity
> > > hardware
> > > > > > >>>  2.  Scale to any cluster size
> > > > > > >>>  3.  Achieve optimal latency
> > > > > > >>>
> > > > > > >>> The approach taken by Spanner derivatives is rejected by (1)
> > > > because
> > > > > > they
> > > > > > >>> guarantee only Serializable isolation (they additionally fail
> > > (3)).
> > > > > > From
> > > > > > >>> watching talks by YugaByte, and inferring from Cockroach’s
> > > > > > >>> panic-cluster-death under clock skew, this is clearly
> > considered
> > > by
> > > > > > >>> everyone to be undesirable but necessary to achieve
> > scalability.
> > > > > > >>>
> > > > > > >>> The approach taken by FaunaDB (Calvin) is rejected by (2)
> > because
> > > > its
> > > > > > >>> sequencing layer requires a global leader process for the
> > > cluster,
> > > > > > which
> > > > > > >> is
> > > > > > >>> incompatible with Cassandra’s scalability requirements. It
> > > > > additionally
> > > > > > >>> fails (3) for global clients.
> > > > > > >>>
> > > > > > >>> Two phase commit fails (3). As an aside, AFAICT DynamoDB is
> > > today a
> > > > > > >>> Spanner clone for its multi-key transaction functionality,
> not
> > > 2PC.
> > > > > > >>>
> > > > > > >>> Systems such as RAMP with even weaker isolation are not
> > > considered
> > > > > for
> > > > > > >> the
> > > > > > >>> simple reason that they do not even claim to meet (1).
> > > > > > >>>
> > > > > > >>> If we want to additionally offer weaker isolation levels than
> > > > > > >>> Serializable, such as that provided by the recent RAMP-TAO
> > paper,
> > > > > > >> Cassandra
> > > > > > >>> is likely able to support multiple distinct transaction
> layers
> > > that
> > > > > > >> operate
> > > > > > >>> independently. I would encourage you to file a CEP to explore
> > how
> > > > we
> > > > > > can
> > > > > > >>> meet these distinct use cases, but I consider them to be
> > niche. I
> > > > > > expect
> > > > > > >>> that a majority of our user base desire strict serializable
> > > > > isolation,
> > > > > > >> and
> > > > > > >>> certainly no less than serializable isolation, to augment the
> > > > > existing
> > > > > > >>> weaker isolation offered by quorum reads and writes.
> > > > > > >>>
> > > > > > >>> I would tangentially note that we are not an AP database
> under
> > > > normal
> > > > > > >>> recommended operation. A minority in any network partition
> > cannot
> > > > > reach
> > > > > > >>> QUORUM, so under recommended usage we are a high-availability
> > > > > > leaderless
> > > > > > >> CP
> > > > > > >>> database.
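
The point about a minority partition being unable to reach QUORUM follows from the quorum size being a strict majority of the replicas. A small sketch of that arithmetic, with hypothetical helper names and a single replication factor assumed for simplicity:

```python
def quorum(replication_factor: int) -> int:
    """Replicas needed for QUORUM: a strict majority of all replicas."""
    return replication_factor // 2 + 1


def can_reach_quorum(reachable_replicas: int, replication_factor: int) -> bool:
    """True if this side of a network partition can still serve QUORUM."""
    return reachable_replicas >= quorum(replication_factor)
```

For a replication factor of 3 split 2/1 by a partition, only the side with two replicas can complete QUORUM operations. Two disjoint sides can never both hold a majority, so at most one side makes progress.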
> > > > > > >>>
> > > > > > >>>
> > > > > > >>> From: Jonathan Ellis <jb...@gmail.com>
> > > > > > >>> Date: Tuesday, 21 September 2021 at 23:45
> > > > > > >>> To: dev <de...@cassandra.apache.org>
> > > > > > >>> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > > >>> Benedict, thanks for taking the lead in putting this
> together.
> > > > Since
> > > > > > >>> Cassandra is the only relevant database today designed
> around a
> > > > > > >> leaderless
> > > > > > >>> architecture, it's quite likely that we'll be better served
> > with
> > > a
> > > > > > custom
> > > > > > >>> transaction design instead of trying to retrofit one from CP
> > > > systems.
> > > > > > >>>
> > > > > > >>> The whitepaper here is a good description of the consensus
> > > > algorithm
> > > > > > >> itself
> > > > > > >>> as well as its robustness and stability characteristics, and
> > its
> > > > > > >> comparison
> > > > > > >>> with other state-of-the-art consensus algorithms is very
> > useful.
> > > > In
> > > > > > the
> > > > > > >>> context of Cassandra, where a consensus algorithm is only
> part
> > of
> > > > > what
> > > > > > >> will
> > > > > > >>> be implemented, I'd like to see a more complete evaluation of
> > the
> > > > > > >>> transactional side of things as well, including performance
> > > > > > >> characteristics
> > > > > > >>> as well as the types of transactions that can be supported
> and
> > at
> > > > > > least a
> > > > > > >>> general idea of what it would look like applied to Cassandra.
> > > This
> > > > > will
> > > > > > >>> allow the PMC to make a more informed decision about what
> > > tradeoffs
> > > > > are
> > > > > > >>> best for the entire long-term project of first supplementing
> > and
> > > > > > >> ultimately
> > > > > > >>> replacing LWT.
> > > > > > >>>
> > > > > > >>> (Allowing users to mix LWT and AP Cassandra operations
> against
> > > the
> > > > > same
> > > > > > >>> rows was probably a mistake, so in contrast with LWT we’re
> not
> > > > > looking
> > > > > > >> for
> > > > > > >>> something fast enough for occasional use but rather something
> > > > within
> > > > > a
> > > > > > >>> reasonable factor of AP operations, appropriate to being the
> > only
> > > > way
> > > > > > to
> > > > > > >>> interact with tables declared as such.)
> > > > > > >>>
> > > > > > >>> Besides Accord, this should cover
> > > > > > >>>
> > > > > > >>> - Calvin and FaunaDB
> > > > > > >>> - A Spanner derivative (no opinion on whether that should be
> > > > > Cockroach
> > > > > > or
> > > > > > >>> Yugabyte, I don’t think it’s necessary to cover both)
> > > > > > >>> - A 2PC implementation (the Accord paper mentions DynamoDB
> but
> > I
> > > > > > suspect
> > > > > > >>> there is more public information about MongoDB)
> > > > > > >>> - RAMP
> > > > > > >>>
> > > > > > >>> Here’s an example of what I mean:
> > > > > > >>>
> > > > > > >>> =Calvin=
> > > > > > >>>
> > > > > > >>> Approach: global consensus (Paxos in Calvin, Raft in FaunaDB)
> > to
> > > > > order
> > > > > > >>> transactions, then replicas execute the transactions
> > > independently
> > > > > with
> > > > > > >> no
> > > > > > >>> further coordination.  No SPOF.  Transactions are batched by
> > each
> > > > > > >> sequencer
> > > > > > >>> to keep this from becoming a bottleneck.
> > > > > > >>>
> > > > > > >>> Performance: Calvin paper (published 2012) reports linear
> > scaling
> > > > of
> > > > > > >> TPC-C
> > > > > > >>> New Order up to 500,000 transactions/s on 100 machines (EC2
> XL
> > > > > machines
> > > > > > >>> with 7GB ram and 8 virtual cores).  Note that TPC-C New Order
> > is
> > > > > > composed
> > > > > > >>> of four reads and four writes, so this is effectively 2M
> reads
> > > and
> > > > 2M
> > > > > > >>> writes as we normally measure them in C*.
> > > > > > >>>
> > > > > > >>> Calvin supports mixed read/write transactions, but because
> the
> > > > > > >> transaction
> > > > > > >>> execution logic requires knowing all partition keys in
> advance
> > to
> > > > > > ensure
> > > > > > >>> that all replicas can reproduce the same results with no
> > > > > coordination,
> > > > > > >>> reads against non-PK predicates must be done ahead of time
> > > > > > >> (transparently,
> > > > > > >>> by the server) to determine the set of keys, and this must be
> > > > retried
> > > > > > if
> > > > > > >>> the set of rows affected is updated before the actual
> > transaction
> > > > > > >> executes.
> > > > > > >>>
> > > > > > > >>> Batching and global consensus add latency -- 100ms in the
> > Calvin
> > > > > paper
> > > > > > >> and
> > > > > > >>> apparently about 50ms in FaunaDB.  Glass half full: all
> > > > transactions
> > > > > > >>> (including multi-partition updates) are equally performant in
> > > > Calvin
> > > > > > >> since
> > > > > > >>> the coordination is handled up front in the sequencing step.
> > > Glass
> > > > > > half
> > > > > > >>> empty: even single-row reads and writes have to pay the full
> > > > > > coordination
> > > > > > >>> cost.  Fauna has optimized this away for reads but I am not
> > aware
> > > > of
> > > > > a
> > > > > > >>> description of how they changed the design to allow this.
> > > > > > >>>
> > > > > > >>> Functionality and limitations: since the entire transaction
> > must
> > > be
> > > > > > known
> > > > > > >>> in advance to allow coordination-less execution at the
> > replicas,
> > > > > Calvin
> > > > > > >>> cannot support interactive transactions at all.  FaunaDB
> > > mitigates
> > > > > this
> > > > > > >> by
> > > > > > >>> allowing server-side logic to be included, but a Calvin
> > approach
> > > > will
> > > > > > >> never
> > > > > > >>> be able to offer SQL compatibility.
> > > > > > >>>
> > > > > > >>> Guarantees: Calvin transactions are strictly serializable.
> > There
> > > > is
> > > > > no
> > > > > > >>> additional complexity or performance hit to generalizing to
> > > > multiple
> > > > > > >>> regions, apart from the speed of light.  And since Calvin is
> > > > already
> > > > > > >> paying
> > > > > > >>> a batching latency penalty, this is less painful than for
> other
> > > > > > systems.
> > > > > > >>>
> > > > > > >>> Application to Cassandra: B-.  Distributed transactions are
> > > handled
> > > > > by
> > > > > > >> the
> > > > > > >>> sequencing and scheduling layers, which are leaderless, and
> > > > Calvin’s
> > > > > > >>> requirements for the storage layer are easily met by C*.  But
> > > > Calvin
> > > > > > also
> > > > > > >>> requires a global consensus protocol and LWT is almost
> > certainly
> > > > not
> > > > > > >>> sufficiently performant, so this would require ZK or etcd
> > > > (reasonable
> > > > > > >> for a
> > > > > > >>> library approach but not for replacing LWT in C* itself), or
> an
> > > > > > >>> implementation of Accord.  I don’t believe Calvin would
> require
> > > > > > >> additional
> > > > > > >>> table-level metadata in Cassandra.
> > > > > > >>>
> > > > > > >>> On Sun, Sep 5, 2021 at 9:33 AM benedict@apache.org <
> > > > > > benedict@apache.org>
> > > > > > >>> wrote:
> > > > > > >>>
> > > > > > > >>>> Wiki: https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-15%3A+General+Purpose+Transactions
> > > > > > > >>>> Whitepaper: https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf
> > > > > > > >>>> <https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf?version=1&modificationDate=1630847736966&api=v2>
> > > > > > >>>> Prototype: https://github.com/belliottsmith/accord
> > > > > > >>>>
> > > > > > >>>> Hi everyone, I’d like to propose this CEP for adoption by
> the
> > > > > > >> community.
> > > > > > >>>>
> > > > > > >>>> Cassandra has benefitted from LWTs for many years, but
> > > application
> > > > > > >>>> developers that want to ensure consistency for complex
> > > operations
> > > > > must
> > > > > > >>>> either accept the scalability bottleneck of serializing all
> > > > related
> > > > > > >> state
> > > > > > >>>> through a single partition, or layer a complex state machine
> > on
> > > > top
> > > > > of
> > > > > > >>> the
> > > > > > >>>> database. These are sophisticated and costly activities that
> > our
> > > > > users
> > > > > > >>>> should not be expected to undertake. Since distributed
> > databases
> > > > are
> > > > > > >>>> beginning to offer distributed transactions with fewer
> > caveats,
> > > it
> > > > > is
> > > > > > >>> past
> > > > > > >>>> time for Cassandra to do so as well.
> > > > > > >>>>
> > > > > > >>>> This CEP proposes the use of several novel techniques that
> > build
> > > > > upon
> > > > > > >>>> research (that followed EPaxos) to deliver (non-interactive)
> > > > general
> > > > > > >>>> purpose distributed transactions. The approach is outlined
> in
> > > the
> > > > > > >>> wikipage
> > > > > > >>>> and in more detail in the linked whitepaper. Importantly, by
> > > > > adopting
> > > > > > >>> this
> > > > > > >>>> approach we will be the _only_ distributed database to offer
> > > > global,
> > > > > > >>>> scalable, strict serializable transactions in one wide area
> > > > > > round-trip.
> > > > > > >>>> This would represent a significant improvement in the state
> of
> > > the
> > > > > > art,
> > > > > > >>>> both in the academic literature and in commercial or open
> > source
> > > > > > >>> offerings.
> > > > > > >>>>
> > > > > > >>>> This work has been partially realised in a prototype. This
> > > partial
> > > > > > >>>> prototype has been verified against Jepsen.io’s Maelstrom
> > > library
> > > > > and
> > > > > > >>>> dedicated in-tree strict serializability verification tools,
> > but
> > > > > much
> > > > > > >>> work
> > > > > > >>>> remains for the work to be production capable and integrated
> > > into
> > > > > > >>> Cassandra.
> > > > > > >>>>
> > > > > > >>>> I propose including the prototype in the project as a new
> > source
> > > > > > >>>> repository, to be developed as a standalone library for
> > > > integration
> > > > > > >> into
> > > > > > >>>> Cassandra. I hope the community sees the important value
> > > > proposition
> > > > > > of
> > > > > > >>>> this proposal, and will adopt the CEP after this discussion,
> > so
> > > > that
> > > > > > >> the
> > > > > > >>>> library and its integration into Cassandra can be developed
> in
> > > > > > parallel
> > > > > > >>> and
> > > > > > >>>> with the involvement of the wider community.
> > > > > > >>>>
> > > > > > >>>
> > > > > > >>>
> > > > > > >>> --
> > > > > > >>> Jonathan Ellis
> > > > > > >>> co-founder, http://www.datastax.com
> > > > > > >>> @spyced
> > > > > > >>>
> > > > > > >>
> > > > > > >>
> > > > > > >> --
> > > > > > >> Jonathan Ellis
> > > > > > >> co-founder, http://www.datastax.com
> > > > > > >> @spyced
> > > > > > >>
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Jonathan Ellis
> > > > > > > co-founder, http://www.datastax.com
> > > > > > > @spyced
> > > > > >
> > > > > >
> > > > > >
> > ---------------------------------------------------------------------
> > > > > > To unsubscribe, e-mail: dev-unsubscribe@cassandra.apache.org
> > > > > > For additional commands, e-mail: dev-help@cassandra.apache.org
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Jonathan Ellis
> > > > > co-founder, http://www.datastax.com
> > > > > @spyced
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by "benedict@apache.org" <be...@apache.org>.

> The current document details the protocol thoroughly but, in my view, fails to illustrate which specific APIs, methods, and modules will become available to developers

In my view, this kind of detail is not warranted within a CEP. Software development is an exploratory process with respect to structure, and these decisions will be made as the work progresses. If they had to be specified upfront, the purpose of a CEP (seeking buy-in) would be defeated, because the work would have to be complete before the answers were known.


From: benedict@apache.org <be...@apache.org>
Date: Friday, 1 October 2021 at 15:31
To: dev@cassandra.apache.org <de...@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
From the CEP:

Batches (including unconditional batches) on transactional tables will receive ACID properties, and grammatically correct conditional batch operations that would be rejected for operating over multiple CQL partitions will now be supported


From: Paulo Motta <pa...@gmail.com>
Date: Friday, 1 October 2021 at 15:30
To: Cassandra DEV <de...@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
Can you just say what tangible feature will be available once this CEP
lands? This is still not clear to me (and perhaps to others) from
the current CEP structure. The current document details the protocol
thoroughly but, in my view, fails to illustrate which specific APIs, methods,
and modules will become available to developers, how it fits into the larger
picture and interacts with existing modules (if at all), or to give a few
examples of how it can be used to build features on top.

On Fri, Oct 1, 2021 at 11:10, benedict@apache.org <
benedict@apache.org> wrote:

> I’m not, though it might seem that way. I disagree with your views about
> how CEPs should be structured. Since the CEP process was itself codified via
> the CEP process, if you want to recodify how CEPs work, the correct way is
> via the CEP process itself.
>
> The discussion is being drawn in multiple directions away from the CEP
> itself, and I am trying to keep this particular thread focused on the
> business at hand, not meta discussions around CEP structure that will no
> doubt be unproductive given likely irreconcilable views about the topic,
> nor discussions about other CEPs that could have been.
>
> If you want to start a separate exploratory discussion thread about CEP
> structure without filing a CEP feel free to do so.
>
>
> From: Paulo Motta <pa...@gmail.com>
> Date: Friday, 1 October 2021 at 15:04
> To: Cassandra DEV <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > If you want to impose your views on CEP structure on others, please file
> a CEP with the additional restrictions and guidance you want to impose and
> start a discussion thread. I can then respond in detail to why I perceive
> this approach to be flawed, in a dedicated context.
>
> This sounds very Kafkaesque. You know I won't file a meta-CEP to change the
> structure of CEPs, so you're just using this as an excuse to shut down the
> discussion on the lack of clarity about what tangible feature will be
> available once the CEP lands. :-)
>
> I'm just providing my humble feedback on how a CEP could be more digestible
> and easier to consume from an external point of view, and this seems like
> an appropriate and contextualized place to voice this opinion which is
> perhaps shared by others.
>
> On Fri, Oct 1, 2021 at 10:55, benedict@apache.org <
> benedict@apache.org> wrote:
>
> > I disagree with you. However, this is the wrong forum to have a meta
> > discussion about how CEPs should be structured.
> >
> > If you want to impose your views on CEP structure on others, please file a
> > CEP with the additional restrictions and guidance you want to impose and
> > start a discussion thread. I can then respond in detail to why I perceive
> > this approach to be flawed, in a dedicated context.
> >
> >
> > From: Paulo Motta <pa...@gmail.com>
> > Date: Friday, 1 October 2021 at 14:48
> > To: Cassandra DEV <de...@cassandra.apache.org>
> > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > The proposal as it stands today is exceptionally thorough, more so than
> > > any other CEP to date, or any CEP is likely to be in the near future.
> >
> > The protocol is thoroughly described, but in my view a CEP is a forum to
> > discuss the high-level architecture and plan for adding a full end-to-end
> > enhancement to the database, breaking it into sub-CEPs if needed, as long
> > as the full plan is known in advance. Otherwise the community will not have
> > the context to judge the full extent and impact of the proposed
> > enhancement.
> >
> > > Since it remains unclear to me what either yourself or Jonathan want to
> > see as an alternative
> >
> > I would personally like to see something along these lines:
> >
> > CEP1: Add ACID-compliant atomic batches
> > - UX changes needed: none, CQL provides the grammar we need.
> > - Distributed transaction protocol needed: Accord (link to white paper if
> > you want specific details about the protocol)
> > - High-level architecture: what new components will be added, how existing
> > components will be modified, what new messages will be added, what new
> > configuration knobs will be introduced, what are the milestones of the
> > project, etc.
> >
> > CEP2: Make LWT faster and more reliable
> > - UX changes needed: none
> > - Distributed transaction protocol needed: Accord, already added by
> > previous CEP.
> > - High-level architecture: blablabla... and so on.
> >
> > On Fri, Oct 1, 2021 at 10:19, benedict@apache.org <
> > benedict@apache.org> wrote:
> >
> > > I think this is getting circular and unproductive. Basic disagreements
> > > about whether the CEP specifies a feature I am inclined to leave for a
> > > vote. In my view the CEP specifies several features, both immediate ones
> > > for the user (ACID batches and multi-key LWTs) and developer-focused ones
> > > around ground-breaking semantics that will be enabled.
> > >
> > > The proposal as it stands today is exceptionally thorough, more so than
> > > any other CEP to date, or any CEP is likely to be in the near future.
> > >
> > > This is a Cassandra Enhancement *Proposal*, and at some point we have to
> > > engage with what is proposed, not what you might like to be proposed. Since
> > > it remains unclear to me what either yourself or Jonathan want to see as an
> > > alternative, at this point it would seem more productive to produce your
> > > own proposals for the community to consider. It is possible for multiple
> > > transaction systems to co-exist, if you feel this is necessary.
> > >
> > >
> > >
> > > From: Paulo Motta <pa...@gmail.com>
> > > Date: Friday, 1 October 2021 at 13:58
> > > To: Cassandra DEV <de...@cassandra.apache.org>
> > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > I share similar feelings as jbellis that this proposal seems to be focusing
> > > on the protocol itself but lacking the actual feature that will use the
> > > protocol, which IMO is a key element to discuss in a CEP.
> > >
> > > It's similar to saying: hey, I want to add this Trie Serialization Protocol
> > > to Cassandra, but not providing specific details of how this protocol is
> > > going to be used.
> > >
> > > I think the right route for a CEP is to describe the feature that will
> be
> > > added to the database and the protocol is a mere requirement of the
> > > high-level feature, for example:
> > >
> > > CEP: Add Trie-backed memtable
> > > - Trie Serialization Protocol: implementation detail of the above CEP
> > >
> > > What is the difficulty of taking this approach, picking one of the
> myriad
> > > of features that will be enabled by Accord and using that as the
> initial
> > > CEP to introduce the protocol to the database?
> > >
> > > On Fri, Oct 1, 2021 at 08:37, benedict@apache.org <
> > > benedict@apache.org> wrote:
> > >
> > > > Actually, thinking about it again, the simple optimistic protocol
> would
> > > in
> > > > fact guarantee system forward progress (i.e. independent of
> transaction
> > > > formulation).
> > > >
> > > >
> > > > From: benedict@apache.org <be...@apache.org>
> > > > Date: Friday, 1 October 2021 at 09:14
> > > > To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> > > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > Hi Jonathan,
> > > >
> > > > It would be great if we could achieve a bandwidth higher than 1-2
> short
> > > > emails per week. It remains unclear to me what your goal is, and it
> > would
> > > > help if you could make a statement like “I want Cassandra to be able
> to
> > > do
> > > > X” so that we can respond directly to it. I am also available to have
> > > > another call, in which we can have a back and forth, please feel free
> > to
> > > > propose a London-compatible time within the next week that is
> suitable
> > > for
> > > > you.
> > > >
> > > > In my opinion we are at risk of veering off-topic, though. This CEP
> is
> > > not
> > > > to deliver interactive transactions, and to my knowledge nobody is
> > > > proposing a CEP for interactive transactions. So, for the CEP at hand
> > the
> > > > salient question seems: does this CEP prevent us from implementing
> > > > interactive transactions with properties X, Y, Z in future? To which
> > the
> > > > answer is almost certainly no.
> > > >
> > > > However, to continue the discussion and respond directly to your
> > queries,
> > > > I believe we agree on the definition of an interactive transaction.
> > > >
> > > > Two protocols were loosely outlined. The first, using timestamps for
> > > > optimistic concurrency control, would indeed involve the possibility
> of
> > > > aborts. It would not however inherently adopt the issue of LWTs where
> > no
> > > > transaction is able to make progress. Whether or not progress is
> > > guaranteed
> > > > (in a livelock-free sense) would depend on the structure of the
> > > > transactions that were interfering.
> > > >
> > > > This approach has the advantage of being very simple to implement, so
> > > that
> > > > we could realistically support interactive transactions quite
> quickly.
> > It
> > > > has the additional advantage that transactions would execute very
> > quickly
> > > > by avoiding the WAN during construction, and as a result may in
> > practice
> > > > experience fewer aborts than protocols that guarantee
> livelock-freedom.
> > > >
> > > > The second protocol proposed using read/write intents and would be
> able
> > > to
> > > > support almost any behaviour you want. We could even utilise
> > pessimistic
> > > > concurrency control, or anything in-between. This is its own huge
> > design
> > > > space, and discussion of this approach and the trade-offs that could
> be
> > > > made is (in my opinion) entirely out of scope for this CEP.
> > > >
> > > >
> > > > From: Jonathan Ellis <jb...@gmail.com>
> > > > Date: Friday, 1 October 2021 at 05:00
> > > > To: dev <de...@cassandra.apache.org>
> > > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > The obstacle for me is you've provided a protocol but not a fully
> > fleshed
> > > > out architecture, so it's hard to fill in some of the blanks.  But it
> > > looks
> > > > to me like optimistic concurrency control for interactive
> transactions
> > > > applied to Accord would leave you in a LWT-like situation under
> fairly
> > > > light contention where nobody actually makes progress due to retries.
> > > >
> > > > To make sure we're talking about the same thing, as Henrik pointed
> out,
> > > > interactive transactions mean multiple round trips from the client
> > > within a
> > > > transaction.  For example, here
> > > > <https://github.com/apavlo/py-tpcc/blob/master/pytpcc/drivers/sqlitedriver.py#L213>
> > > > is a simple implementation of the TPC-C New Order transaction.  The high
> > > > level logic (via
> > > > <https://courses.cs.washington.edu/courses/csep545/01wi/lectures/class1/tsld039.htm>)
> > > > is,
> > > >
> > > >    1. Get records describing a warehouse, customer, & district
> > > >    2. Update the district
> > > >    3. Increment next available order number
> > > >    4. Insert record into Order and New-Order tables
> > > >    5. For 5-15 items, get Item record, get/update Stock record
> > > >    6. Insert Order-Line Record
> > > >
> > > > As you can see, this requires a lot of client-side logic mixed in
> with
> > > the
> > > > actual SQL commands.
> > > >
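To make the "multiple round trips with client-side logic" point concrete, here is a minimal sketch of an interactive transaction, loosely following the New Order steps listed above. It uses Python's built-in sqlite3 module purely for illustration; the table and column names are simplified assumptions and are not the TPC-C schema or the linked driver's code.

```python
import sqlite3


def make_db():
    # isolation_level=None leaves transaction control to explicit BEGIN/COMMIT.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    # Hypothetical, heavily simplified schema (not the real TPC-C tables).
    conn.executescript("""
        CREATE TABLE district (id INTEGER PRIMARY KEY, next_o_id INTEGER);
        CREATE TABLE stock (item_id INTEGER PRIMARY KEY, quantity INTEGER);
        CREATE TABLE orders (id INTEGER, district_id INTEGER);
        CREATE TABLE order_line (order_id INTEGER, line_no INTEGER, item_id INTEGER);
        INSERT INTO district VALUES (1, 1);
        INSERT INTO stock VALUES (10, 5), (11, 1);
    """)
    return conn


def new_order(conn, district_id, item_ids):
    """Run a simplified New Order as one interactive transaction."""
    cur = conn.cursor()
    cur.execute("BEGIN")
    try:
        # Round trip: read the district's next available order number.
        cur.execute("SELECT next_o_id FROM district WHERE id = ?", (district_id,))
        order_id = cur.fetchone()[0]
        # Client-side logic uses the value just read to decide what to write.
        cur.execute("UPDATE district SET next_o_id = ? WHERE id = ?",
                    (order_id + 1, district_id))
        cur.execute("INSERT INTO orders VALUES (?, ?)", (order_id, district_id))
        for line_no, item_id in enumerate(item_ids, start=1):
            # One more read, then a dependent write, per item.
            cur.execute("SELECT quantity FROM stock WHERE item_id = ?", (item_id,))
            quantity = cur.fetchone()[0]
            if quantity < 1:
                raise ValueError("item %d is out of stock" % item_id)
            cur.execute("UPDATE stock SET quantity = ? WHERE item_id = ?",
                        (quantity - 1, item_id))
            cur.execute("INSERT INTO order_line VALUES (?, ?, ?)",
                        (order_id, line_no, item_id))
        cur.execute("COMMIT")
        return order_id
    except Exception:
        cur.execute("ROLLBACK")
        raise
```

Each SELECT result feeds a later statement, so the full read/write set is only known to the client as the transaction unfolds; this is what distinguishes it from a transaction submitted in one shot.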
> > > >
> > > > On Thu, Sep 30, 2021 at 2:30 AM benedict@apache.org <
> > benedict@apache.org
> > > >
> > > > wrote:
> > > >
> > > > > Essentially this, although I think in practice we will need to
> track
> > > each
> > > > > partition’s timestamp separately (or optionally for reduced
> > conflicts,
> > > > each
> > > > > row or datum’s), and make them all part of the conditional
> > application
> > > of
> > > > > the transaction - at least for strict-serializability.
> > > > >
> > > > > The alternative is to insert read/write intents for the transaction
> > > > during
> > > > > each step, and to confirm they are still valid on commit, but this
> > > > approach
> > > > > would require a WAN round-trip for each step in the interactive
> > > > > transaction, whereas the timestamp-validating approach can use a
> LAN
> > > > > round-trip for each step besides the final one, and is also much
> > > simpler
> > > > to
> > > > > implement.
> > > > >
> > > > >
> > > > > From: Blake Eggleston <be...@apple.com.INVALID>
> > > > > Date: Thursday, 30 September 2021 at 05:47
> > > > > To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> > > > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > You could establish a lower timestamp bound and buffer transaction
> > > state
> > > > > on the coordinator, then make the commit an operation that only
> > applies
> > > > if
> > > > > all partitions involved haven’t been changed by a more recent
> > > timestamp.
> > > > > You could also implement mvcc either in the storage layer or for
> some
> > > > > period of time by buffering commits on each replica before
> applying.
> > > > >
> > > > > > On Sep 29, 2021, at 6:18 PM, Jonathan Ellis <jb...@gmail.com>
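The timestamp-validating commit described above can be sketched roughly as follows. This is a single-process toy model under stated assumptions, not Accord or Cassandra code: the Store and Transaction classes, their method names, and the integer logical clock are all hypothetical. Writes are buffered on the "coordinator" (the Transaction object), and the commit applies only if no partition the transaction touched has been written with a timestamp newer than the transaction's lower bound.

```python
class Store:
    """Single-process stand-in for the replicated data store (hypothetical)."""

    def __init__(self):
        self.values = {}      # partition key -> committed value
        self.last_write = {}  # partition key -> timestamp of last committed write
        self.clock = 0        # logical timestamp of the most recent commit

    def commit_if_unchanged(self, start_ts, touched, writes):
        # Reject if any involved partition changed after the lower bound.
        if any(self.last_write.get(key, 0) > start_ts for key in touched):
            return False
        self.clock += 1
        for key, value in writes.items():
            self.values[key] = value
            self.last_write[key] = self.clock
        return True


class Transaction:
    """Buffers reads and writes on the coordinator until commit."""

    def __init__(self, store):
        self.store = store
        self.start_ts = store.clock  # lower timestamp bound
        self.touched = set()         # every partition read or written
        self.writes = {}             # buffered, not yet visible to others

    def read(self, key):
        self.touched.add(key)
        if key in self.writes:       # read-your-own-writes
            return self.writes[key]
        return self.store.values.get(key)

    def write(self, key, value):
        self.touched.add(key)
        self.writes[key] = value

    def commit(self):
        return self.store.commit_if_unchanged(
            self.start_ts, self.touched, self.writes)
```

A transaction whose commit returns False has lost a race on at least one partition and would have to be retried by the client, which is the abort behaviour discussed elsewhere in this thread.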
> > > wrote:
> > > > > >
> > > > > > How are interactive transactions possible with Accord?
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Tue, Sep 21, 2021 at 11:56 PM benedict@apache.org <
> > > > > benedict@apache.org>
> > > > > > wrote:
> > > > > >
> > > > > >> Could you explain why you believe this trade-off is necessary?
> We
> > > can
> > > > > >> support full SQL just fine with Accord, and I hope that we
> > > eventually
> > > > > do so.
> > > > > >>
> > > > > >> This domain is incredibly complex, so it is easy to reach wrong
> > > > > >> conclusions. I would invite you again to propose a system for
> > > > discussion
> > > > > >> that you think offers something Accord is unable to, and that
> you
> > > > > consider
> > > > > >> desirable, and we can work from there.
> > > > > >>
> > > > > >> To pre-empt some possible discussions, I am not aware of
> anything
> > we
> > > > > >> cannot do with Accord that we could do with either Calvin or
> > > Spanner.
> > > > > >> Interactive transactions are possible on top of Accord, as are
> > > > > transactions
> > > > > >> with an unknown read/write set. In each case the only cost is
> that
> > > > they
> > > > > >> would use optimistic concurrency control, which is no worse than the Spanner
> > > > > >> derivatives anyway (which I have to assume is your benchmark in
> > this
> > > > > >> regard). I do not expect to deliver either functionality
> > initially,
> > > > but
> > > > > >> Accord takes us most of the way there for both.
> > > > > >>
> > > > > >>
> > > > > >> From: Jonathan Ellis <jb...@gmail.com>
> > > > > >> Date: Wednesday, 22 September 2021 at 05:36
> > > > > >> To: dev <de...@cassandra.apache.org>
> > > > > >> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > >> Right, I'm looking for exactly a discussion on the high level
> > goals.
> > > > > >> Instead of saying "here's the goals and we ruled out X because
> Y"
> > we
> > > > > should
> > > > > >> start with a discussion around, "Approach A allows X and W,
> > > approach B
> > > > > >> allows Y and Z" and decide together what the goals should be and
> > and
> > > > > what
> > > > > >> we are willing to trade to get those goals, e.g., are we willing
> > to
> > > > > give up
> > > > > >> global strict serializability to get the ability to support full
> > > SQL.
> > > > > Both
> > > > > >> of these are nice to have!
> > > > > >>
> > > > > >> On Tue, Sep 21, 2021 at 9:52 PM benedict@apache.org <
> > > > > benedict@apache.org>
> > > > > >> wrote:
> > > > > >>
> > > > > >>> Hi Jonathan,
> > > > > >>>
> > > > > >>> These other systems are incompatible with the goals of the
> CEP. I
> > > do
> > > > > >>> discuss them (besides 2PC) in both the whitepaper and the CEP,
> > and
> > > > will
> > > > > >>> summarise that discussion below. A true and accurate comparison
> > of
> > > > > these
> > > > > >>> other systems is essentially intractable, as there are complex
> > > > > subtleties
> > > > > >>> to each flavour, and those who are interested would be better
> > > served
> > > > by
> > > > > >>> performing their own research.
> > > > > >>>
> > > > > >>> I think it is more productive to focus on what we want to
> achieve
> > > as
> > > > a
> > > > > >>> community. If you believe the goals of this CEP are wrong for
> the
> > > > > >> project,
> > > > > >>> let’s focus on that. If you want to compare and contrast
> specific
> > > > > facets
> > > > > >> of
> > > > > >>> alternative systems that you consider to be preferable in some
> > > > > dimension,
> > > > > >>> let’s do that here or in a Q&A as proposed by Joey.
> > > > > >>>
> > > > > >>> The relevant goals are that we:
> > > > > >>>
> > > > > >>>
> > > > > >>>  1.  Guarantee strict serializable isolation on commodity
> > hardware
> > > > > >>>  2.  Scale to any cluster size
> > > > > >>>  3.  Achieve optimal latency
> > > > > >>>
> > > > > >>> The approach taken by Spanner derivatives is rejected by (1)
> > > because
> > > > > they
> > > > > >>> guarantee only Serializable isolation (they additionally fail
> > (3)).
> > > > > From
> > > > > >>> watching talks by YugaByte, and inferring from Cockroach’s
> > > > > >>> panic-cluster-death under clock skew, this is clearly
> considered
> > by
> > > > > >>> everyone to be undesirable but necessary to achieve
> scalability.
> > > > > >>>
> > > > > >>> The approach taken by FaunaDB (Calvin) is rejected by (2)
> because
> > > its
> > > > > >>> sequencing layer requires a global leader process for the
> > cluster,
> > > > > which
> > > > > >> is
> > > > > >>> incompatible with Cassandra’s scalability requirements. It
> > > > additionally
> > > > > >>> fails (3) for global clients.
> > > > > >>>
> > > > > >>> Two phase commit fails (3). As an aside, AFAICT DynamoDB is
> > today a
> > > > > >>> Spanner clone for its multi-key transaction functionality, not
> > 2PC.
> > > > > >>>
> > > > > >>> Systems such as RAMP with even weaker isolation are not
> > considered
> > > > for
> > > > > >> the
> > > > > >>> simple reason that they do not even claim to meet (1).
> > > > > >>>
> > > > > >>> If we want to additionally offer weaker isolation levels than
> > > > > >>> Serializable, such as that provided by the recent RAMP-TAO
> paper,
> > > > > >> Cassandra
> > > > > >>> is likely able to support multiple distinct transaction layers
> > that
> > > > > >> operate
> > > > > >>> independently. I would encourage you to file a CEP to explore
> how
> > > we
> > > > > can
> > > > > >>> meet these distinct use cases, but I consider them to be
> niche. I
> > > > > expect
> > > > > >>> that a majority of our user base desire strict serializable
> > > > isolation,
> > > > > >> and
> > > > > >>> certainly no less than serializable isolation, to augment the
> > > > existing
> > > > > >>> weaker isolation offered by quorum reads and writes.
> > > > > >>>
> > > > > >>> I would tangentially note that we are not an AP database under
> > > normal
> > > > > >>> recommended operation. A minority in any network partition
> cannot
> > > > reach
> > > > > >>> QUORUM, so under recommended usage we are a high-availability
> > > > > leaderless
> > > > > >> CP
> > > > > >>> database.
> > > > > >>>
> > > > > >>>
> > > > > >>> From: Jonathan Ellis <jb...@gmail.com>
> > > > > >>> Date: Tuesday, 21 September 2021 at 23:45
> > > > > >>> To: dev <de...@cassandra.apache.org>
> > > > > >>> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > >>> Benedict, thanks for taking the lead in putting this together.
> > > Since
> > > > > >>> Cassandra is the only relevant database today designed around a
> > > > > >> leaderless
> > > > > >>> architecture, it's quite likely that we'll be better served
> with
> > a
> > > > > custom
> > > > > >>> transaction design instead of trying to retrofit one from CP
> > > systems.
> > > > > >>>
> > > > > >>> The whitepaper here is a good description of the consensus
> > > algorithm
> > > > > >> itself
> > > > > >>> as well as its robustness and stability characteristics, and
> its
> > > > > >> comparison
> > > > > >>> with other state-of-the-art consensus algorithms is very
> useful.
> > > In
> > > > > the
> > > > > >>> context of Cassandra, where a consensus algorithm is only part
> of
> > > > what
> > > > > >> will
> > > > > >>> be implemented, I'd like to see a more complete evaluation of
> the
> > > > > >>> transactional side of things as well, including performance
> > > > > >> characteristics
> > > > > >>> as well as the types of transactions that can be supported and
> at
> > > > > least a
> > > > > >>> general idea of what it would look like applied to Cassandra.
> > This
> > > > will
> > > > > >>> allow the PMC to make a more informed decision about what
> > tradeoffs
> > > > are
> > > > > >>> best for the entire long-term project of first supplementing
> and
> > > > > >> ultimately
> > > > > >>> replacing LWT.
> > > > > >>>
> > > > > >>> (Allowing users to mix LWT and AP Cassandra operations against
> > the
> > > > same
> > > > > >>> rows was probably a mistake, so in contrast with LWT we’re not
> > > > looking
> > > > > >> for
> > > > > >>> something fast enough for occasional use but rather something
> > > within
> > > > a
> > > > > >>> reasonable factor of AP operations, appropriate to being the
> only
> > > way
> > > > > to
> > > > > >>> interact with tables declared as such.)
> > > > > >>>
> > > > > >>> Besides Accord, this should cover
> > > > > >>>
> > > > > >>> - Calvin and FaunaDB
> > > > > >>> - A Spanner derivative (no opinion on whether that should be
> > > > Cockroach
> > > > > or
> > > > > >>> Yugabyte, I don’t think it’s necessary to cover both)
> > > > > >>> - A 2PC implementation (the Accord paper mentions DynamoDB but
> I
> > > > > suspect
> > > > > >>> there is more public information about MongoDB)
> > > > > >>> - RAMP
> > > > > >>>
> > > > > >>> Here’s an example of what I mean:
> > > > > >>>
> > > > > >>> =Calvin=
> > > > > >>>
> > > > > >>> Approach: global consensus (Paxos in Calvin, Raft in FaunaDB)
> to
> > > > order
> > > > > >>> transactions, then replicas execute the transactions
> > independently
> > > > with
> > > > > >> no
> > > > > >>> further coordination.  No SPOF.  Transactions are batched by
> each
> > > > > >> sequencer
> > > > > >>> to keep this from becoming a bottleneck.
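
The "order first, then execute independently" idea described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the ordered list stands in for the output of the sequencing layer, and the transfer and execute helpers are hypothetical names, not anything from Calvin or FaunaDB.

```python
from typing import Callable, Dict, List

State = Dict[str, int]
Txn = Callable[[State], None]  # a deterministic function of the current state


def transfer(src: str, dst: str, amount: int) -> Txn:
    """Build a transaction that moves funds only if the source can cover it."""
    def apply(state: State) -> None:
        if state.get(src, 0) >= amount:
            state[src] = state.get(src, 0) - amount
            state[dst] = state.get(dst, 0) + amount
    return apply


def execute(log: List[Txn], initial: State) -> State:
    """Replay an agreed transaction order against a private copy of the state."""
    state = dict(initial)
    for txn in log:
        txn(state)
    return state


# The agreed order stands in for what the sequencers would produce.
ordered_log = [transfer("a", "b", 70), transfer("a", "c", 50), transfer("b", "a", 20)]
initial = {"a": 100, "b": 0, "c": 0}

replica_1 = execute(ordered_log, initial)
replica_2 = execute(ordered_log, initial)
reordered = execute(list(reversed(ordered_log)), initial)
print(replica_1 == replica_2, replica_1 == reordered)
```

Both replicas reach the same state without talking to each other, while replaying the same transactions in a different order gives a different result. That is why the ordering step is the only place coordination is needed in this model.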
> > > > > >>>
> > > > > >>> Performance: Calvin paper (published 2012) reports linear
> scaling
> > > of
> > > > > >> TPC-C
> > > > > >>> New Order up to 500,000 transactions/s on 100 machines (EC2 XL
> > > > machines
> > > > > > >>> with 7GB RAM and 8 virtual cores).  Note that TPC-C New Order
> is
> > > > > composed
> > > > > >>> of four reads and four writes, so this is effectively 2M reads
> > and
> > > 2M
> > > > > >>> writes as we normally measure them in C*.
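
The conversion above is simple arithmetic. A short sketch using only the figures quoted in the paragraph (the per-machine number is derived from them and is not a separate measurement):

```python
txns_per_second = 500_000   # reported New Order throughput
machines = 100              # cluster size in the quoted result
reads_per_txn = 4           # as stated in the paragraph above
writes_per_txn = 4

reads_per_second = txns_per_second * reads_per_txn
writes_per_second = txns_per_second * writes_per_txn
txns_per_machine = txns_per_second // machines

print(reads_per_second, writes_per_second, txns_per_machine)
```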
> > > > > >>>
> > > > > >>> Calvin supports mixed read/write transactions, but because the
> > > > > >> transaction
> > > > > >>> execution logic requires knowing all partition keys in advance
> to
> > > > > ensure
> > > > > >>> that all replicas can reproduce the same results with no
> > > > coordination,
> > > > > >>> reads against non-PK predicates must be done ahead of time
> > > > > >> (transparently,
> > > > > >>> by the server) to determine the set of keys, and this must be
> > > retried
> > > > > if
> > > > > >>> the set of rows affected is updated before the actual
> transaction
> > > > > >> executes.
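
The ahead-of-time read and retry described above can be sketched as follows. This is a single-process illustration with hypothetical names (recon_keys, run_with_recon, and the interfere hook that simulates a concurrent writer); in a real system the re-check and the execution would happen together at the transaction's agreed position in the order.

```python
from typing import Callable, Dict, FrozenSet

Row = Dict[str, object]
Table = Dict[str, Row]


def recon_keys(table: Table, predicate: Callable[[Row], bool]) -> FrozenSet[str]:
    """Reconnaissance read: resolve a non-PK predicate to a concrete key set."""
    return frozenset(key for key, row in table.items() if predicate(row))


def run_with_recon(table, predicate, update, interfere=None, max_attempts=5) -> int:
    """Declare the key set up front; retry if it is stale at execution time."""
    for attempt in range(1, max_attempts + 1):
        declared = recon_keys(table, predicate)
        if interfere is not None:
            interfere(attempt, table)  # simulated write between recon and execution
        if recon_keys(table, predicate) != declared:
            continue  # the affected rows changed: redo the reconnaissance read
        for key in declared:
            update(table[key])
        return attempt
    raise RuntimeError("key set kept changing; giving up")


users = {
    "u1": {"city": "London", "flagged": False},
    "u2": {"city": "Paris", "flagged": False},
}


def in_london(row: Row) -> bool:
    return row["city"] == "London"


def flag(row: Row) -> None:
    row["flagged"] = True


def concurrent_move(attempt: int, table: Table) -> None:
    if attempt == 1:  # only the first attempt is disturbed
        table["u2"]["city"] = "London"


attempts = run_with_recon(users, in_london, flag, interfere=concurrent_move)
print(attempts, users)
```

The first attempt is discarded because the set of matching rows changed after the reconnaissance read; the second attempt sees a stable key set and applies the update.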
> > > > > >>>
> > > > > > >>> Batching and global consensus add latency -- 100ms in the
> Calvin
> > > > paper
> > > > > >> and
> > > > > >>> apparently about 50ms in FaunaDB.  Glass half full: all
> > > transactions
> > > > > >>> (including multi-partition updates) are equally performant in
> > > Calvin
> > > > > >> since
> > > > > >>> the coordination is handled up front in the sequencing step.
> > Glass
> > > > > half
> > > > > >>> empty: even single-row reads and writes have to pay the full
> > > > > coordination
> > > > > >>> cost.  Fauna has optimized this away for reads but I am not
> aware
> > > of
> > > > a
> > > > > >>> description of how they changed the design to allow this.
> > > > > >>>
> > > > > >>> Functionality and limitations: since the entire transaction
> must
> > be
> > > > > known
> > > > > >>> in advance to allow coordination-less execution at the
> replicas,
> > > > Calvin
> > > > > >>> cannot support interactive transactions at all.  FaunaDB
> > mitigates
> > > > this
> > > > > >> by
> > > > > >>> allowing server-side logic to be included, but a Calvin
> approach
> > > will
> > > > > >> never
> > > > > >>> be able to offer SQL compatibility.
> > > > > >>>
> > > > > >>> Guarantees: Calvin transactions are strictly serializable.
> There
> > > is
> > > > no
> > > > > >>> additional complexity or performance hit to generalizing to
> > > multiple
> > > > > >>> regions, apart from the speed of light.  And since Calvin is
> > > already
> > > > > >> paying
> > > > > >>> a batching latency penalty, this is less painful than for other
> > > > > systems.
> > > > > >>>
> > > > > >>> Application to Cassandra: B-.  Distributed transactions are
> > handled
> > > > by
> > > > > >> the
> > > > > >>> sequencing and scheduling layers, which are leaderless, and
> > > Calvin’s
> > > > > >>> requirements for the storage layer are easily met by C*.  But
> > > Calvin
> > > > > also
> > > > > >>> requires a global consensus protocol and LWT is almost
> certainly
> > > not
> > > > > >>> sufficiently performant, so this would require ZK or etcd
> > > (reasonable
> > > > > >> for a
> > > > > >>> library approach but not for replacing LWT in C* itself), or an
> > > > > >>> implementation of Accord.  I don’t believe Calvin would require
> > > > > >> additional
> > > > > >>> table-level metadata in Cassandra.
> > > > > >>>
> > > > > >>> On Sun, Sep 5, 2021 at 9:33 AM benedict@apache.org <
> > > > > benedict@apache.org>
> > > > > >>> wrote:
> > > > > >>>
> > > > > > >>>> Wiki: https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-15%3A+General+Purpose+Transactions
> > > > > > >>>> Whitepaper: https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf
> > > > > > >>>> <https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf?version=1&modificationDate=1630847736966&api=v2>
> > > > > > >>>> Prototype: https://github.com/belliottsmith/accord
> > > > > >>>>
> > > > > >>>> Hi everyone, I’d like to propose this CEP for adoption by the
> > > > > >> community.
> > > > > >>>>
> > > > > >>>> Cassandra has benefitted from LWTs for many years, but
> > application
> > > > > >>>> developers that want to ensure consistency for complex
> > operations
> > > > must
> > > > > >>>> either accept the scalability bottleneck of serializing all
> > > related
> > > > > >> state
> > > > > >>>> through a single partition, or layer a complex state machine
> on
> > > top
> > > > of
> > > > > >>> the
> > > > > >>>> database. These are sophisticated and costly activities that
> our
> > > > users
> > > > > >>>> should not be expected to undertake. Since distributed
> databases
> > > are
> > > > > >>>> beginning to offer distributed transactions with fewer
> caveats,
> > it
> > > > is
> > > > > >>> past
> > > > > >>>> time for Cassandra to do so as well.
> > > > > >>>>
> > > > > >>>> This CEP proposes the use of several novel techniques that
> build
> > > > upon
> > > > > >>>> research (that followed EPaxos) to deliver (non-interactive)
> > > general
> > > > > >>>> purpose distributed transactions. The approach is outlined in
> > the
> > > > > >>> wikipage
> > > > > >>>> and in more detail in the linked whitepaper. Importantly, by
> > > > adopting
> > > > > >>> this
> > > > > >>>> approach we will be the _only_ distributed database to offer
> > > global,
> > > > > >>>> scalable, strict serializable transactions in one wide area
> > > > > round-trip.
> > > > > >>>> This would represent a significant improvement in the state of
> > the
> > > > > art,
> > > > > >>>> both in the academic literature and in commercial or open
> source
> > > > > >>> offerings.
> > > > > >>>>
> > > > > >>>> This work has been partially realised in a prototype. This
> > partial
> > > > > >>>> prototype has been verified against Jepsen.io’s Maelstrom
> > library
> > > > and
> > > > > >>>> dedicated in-tree strict serializability verification tools,
> but
> > > > much
> > > > > >>> work
> > > > > >>>> remains for the work to be production capable and integrated
> > into
> > > > > >>> Cassandra.
> > > > > >>>>
> > > > > >>>> I propose including the prototype in the project as a new
> source
> > > > > >>>> repository, to be developed as a standalone library for
> > > integration
> > > > > >> into
> > > > > >>>> Cassandra. I hope the community sees the important value
> > > proposition
> > > > > of
> > > > > >>>> this proposal, and will adopt the CEP after this discussion,
> so
> > > that
> > > > > >> the
> > > > > >>>> library and its integration into Cassandra can be developed in
> > > > > parallel
> > > > > >>> and
> > > > > >>>> with the involvement of the wider community.
> > > > > >>>>
> > > > > >>>
> > > > > >>>
> > > > > >>> --
> > > > > >>> Jonathan Ellis
> > > > > >>> co-founder, http://www.datastax.com
> > > > > >>> @spyced
> > > > > >>>
> > > > > >>
> > > > > >>
> > > > > >> --
> > > > > >> Jonathan Ellis
> > > > > >> co-founder, http://www.datastax.com
> > > > > >> @spyced
> > > > > >>
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Jonathan Ellis
> > > > > > co-founder, http://www.datastax.com
> > > > > > @spyced
> > > > >
> > > > >
> > > > >
> ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: dev-unsubscribe@cassandra.apache.org
> > > > > For additional commands, e-mail: dev-help@cassandra.apache.org
> > > > >
> > > >
> > > >
> > > > --
> > > > Jonathan Ellis
> > > > co-founder, http://www.datastax.com
> > > > @spyced
> > > >
> > >
> >
>

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by "benedict@apache.org" <be...@apache.org>.

From the CEP:

Batches (including unconditional batches) on transactional tables will receive ACID properties, and grammatically correct conditional batch operations that would be rejected for operating over multiple CQL partitions will now be supported


From: Paulo Motta <pa...@gmail.com>
Date: Friday, 1 October 2021 at 15:30
To: Cassandra DEV <de...@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
Can you just say what palpable feature will be available once this CEP
lands? This is still not clear to me (and perhaps to others) from
the current CEP structure. The current document thoroughly details the
protocol, but in my view it fails to illustrate what specific APIs, methods,
and modules will become available to developers, how it fits into the larger
picture and interacts with existing modules (if at all), and perhaps a few
examples of how it can be used to build features on top.

On Fri, 1 Oct 2021 at 11:10, benedict@apache.org <benedict@apache.org> wrote:

> I’m not, though it might seem that way. I disagree with your views about
> how CEPs should be structured. Since the CEP process was itself codified via
> the CEP process, if you want to recodify how CEPs work, the correct way is
> via the CEP process itself.
>
> The discussion is being drawn in multiple directions away from the CEP
> itself, and I am trying to keep this particular thread focused on the
> business at hand, not meta discussions around CEP structure that will no
> doubt be unproductive given likely irreconcilable views about the topic,
> nor discussions about other CEPs that could have been.
>
> If you want to start a separate exploratory discussion thread about CEP
> structure without filing a CEP feel free to do so.
>
>
> From: Paulo Motta <pa...@gmail.com>
> Date: Friday, 1 October 2021 at 15:04
> To: Cassandra DEV <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > If you want to impose your views on CEP structure on others, please file
> > a CEP with the additional restrictions and guidance you want to impose and
> > start a discussion thread. I can then respond in detail to why I perceive
> > this approach to be flawed, in a dedicated context.
>
> This sounds very kafkaesque. You know I won't file a meta-CEP to change the
> structure of CEPs, so you're using this as an excuse to shut down the
> discussion on the lack of clarity on what actual palpable feature will be
> available once the CEP lands. :-)
>
> I'm just providing my humble feedback on how a CEP could be more digestible
> and easier to consume from an external point of view, and this seems like
> an appropriate and contextualized place to voice this opinion which is
> perhaps shared by others.
>
> On Fri, 1 Oct 2021 at 10:55, benedict@apache.org <benedict@apache.org> wrote:
>
> > I disagree with you. However, this is the wrong forum to have a meta
> > discussion about how CEPs should be structured.
> >
> > If you want to impose your views on CEP structure on others, please file a
> > CEP with the additional restrictions and guidance you want to impose and
> > start a discussion thread. I can then respond in detail to why I perceive
> > this approach to be flawed, in a dedicated context.
> >
> >
> > From: Paulo Motta <pa...@gmail.com>
> > Date: Friday, 1 October 2021 at 14:48
> > To: Cassandra DEV <de...@cassandra.apache.org>
> > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > The proposal as it stands today is exceptionally thorough, more so than
> > > any other CEP to date, or any CEP is likely to be in the near future.
> >
> > The protocol is thoroughly described, but in my view a CEP is a forum to
> > discuss the high level architecture and plan for adding a full end-to-end
> > enhancement to the database, breaking it into sub-CEPs if needed, as long
> > as the full plan is known in advance; otherwise the community will not have
> > the context to judge the full extent and impact of the proposed
> > enhancement.
> >
> > > Since it remains unclear to me what either yourself or Jonathan want to
> > see as an alternative
> >
> > I would personally like to see something along these lines:
> >
> > CEP1: Add ACID-compliant atomic batches
> > - UX changes needed: none, CQL provides the grammar we need.
> > - Distributed transaction protocol needed: Accord (link to white paper if
> > you want specific details about the protocol)
> > - High-level architecture: what new components will be added, how existing
> > components will be modified, what new messages will be added, what new
> > configuration knobs will be introduced, what are the milestones of the
> > project, etc.
> >
> > CEP2: Make LWT faster and more reliable
> > - UX changes needed: none
> > - Distributed transaction protocol needed: Accord, already added by
> > previous CEP.
> > - High-level architecture: blablabla... and so on.
> >
> > On Fri, 1 Oct 2021 at 10:19, benedict@apache.org <benedict@apache.org> wrote:
> >
> > > I think this is getting circular and unproductive. Basic disagreements
> > > about whether the CEP specifies a feature I am inclined to leave for a
> > > vote. In my view the CEP specifies several features, both immediate ones
> > > for the user (ACID batches and multi-key LWTs) and developer-focused ones
> > > around ground-breaking semantics that will be enabled.
> > >
> > > The proposal as it stands today is exceptionally thorough, more so than
> > > any other CEP to date, or any CEP is likely to be in the near future.
> > >
> > > This is a Cassandra Enhancement *Proposal*, and at some point we have to
> > > engage with what is proposed, not what you might like to be proposed. Since
> > > it remains unclear to me what either yourself or Jonathan want to see as an
> > > alternative, at this point it would seem more productive to produce your
> > > own proposals for the community to consider. It is possible for multiple
> > > transaction systems to co-exist, if you feel this is necessary.
> > >
> > >
> > >
> > > From: Paulo Motta <pa...@gmail.com>
> > > Date: Friday, 1 October 2021 at 13:58
> > > To: Cassandra DEV <de...@cassandra.apache.org>
> > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > I share similar feelings as jbellis that this proposal seems to be focusing
> > > on the protocol itself but lacking the actual feature that will use the
> > > protocol, which IMO is a key element to discuss in a CEP.
> > >
> > > It's similar to saying: hey I want to add this Tries Serialization Protocol
> > > to Cassandra, but not providing specific details of how this protocol is
> > > going to be used.
> > >
> > > I think the right route for a CEP is to describe the feature that will be
> > > added to the database and the protocol is a mere requirement of the
> > > high-level feature, for example:
> > >
> > > CEP: Add Trie-backed memtable
> > > - Trie Serialization Protocol: implementation detail of the above CEP
> > >
> > > What is the difficulty of taking this approach, picking one of the myriad
> > > of features that will be enabled by Accord and using that as the initial
> > > CEP to introduce the protocol to the database?
> > >
> > > On Fri, 1 Oct 2021 at 08:37, benedict@apache.org <benedict@apache.org> wrote:
> > >
> > > > Actually, thinking about it again, the simple optimistic protocol would in
> > > > fact guarantee system forward progress (i.e. independent of transaction
> > > > formulation).
> > > >
> > > >
> > > > From: benedict@apache.org <be...@apache.org>
> > > > Date: Friday, 1 October 2021 at 09:14
> > > > To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> > > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > Hi Jonathan,
> > > >
> > > > It would be great if we could achieve a bandwidth higher than 1-2 short
> > > > emails per week. It remains unclear to me what your goal is, and it would
> > > > help if you could make a statement like “I want Cassandra to be able to do
> > > > X” so that we can respond directly to it. I am also available to have
> > > > another call, in which we can have a back and forth, please feel free to
> > > > propose a London-compatible time within the next week that is suitable for
> > > > you.
> > > >
> > > > In my opinion we are at risk of veering off-topic, though. This CEP is not
> > > > to deliver interactive transactions, and to my knowledge nobody is
> > > > proposing a CEP for interactive transactions. So, for the CEP at hand the
> > > > salient question seems: does this CEP prevent us from implementing
> > > > interactive transactions with properties X, Y, Z in future? To which the
> > > > answer is almost certainly no.
> > > >
> > > > However, to continue the discussion and respond directly to your queries,
> > > > I believe we agree on the definition of an interactive transaction.
> > > >
> > > > Two protocols were loosely outlined. The first, using timestamps for
> > > > optimistic concurrency control, would indeed involve the possibility of
> > > > aborts. It would not however inherently adopt the issue of LWTs where no
> > > > transaction is able to make progress. Whether or not progress is guaranteed
> > > > (in a livelock-free sense) would depend on the structure of the
> > > > transactions that were interfering.
> > > >
> > > > This approach has the advantage of being very simple to implement, so that
> > > > we could realistically support interactive transactions quite quickly. It
> > > > has the additional advantage that transactions would execute very quickly
> > > > by avoiding the WAN during construction, and as a result may in practice
> > > > experience fewer aborts than protocols that guarantee livelock-freedom.
> > > >
> > > > The second protocol proposed using read/write intents and would be able to
> > > > support almost any behaviour you want. We could even utilise pessimistic
> > > > concurrency control, or anything in-between. This is its own huge design
> > > > space, and discussion of this approach and the trade-offs that could be
> > > > made is (in my opinion) entirely out of scope for this CEP.
> > > >
> > > >
> > > > From: Jonathan Ellis <jb...@gmail.com>
> > > > Date: Friday, 1 October 2021 at 05:00
> > > > To: dev <de...@cassandra.apache.org>
> > > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > The obstacle for me is you've provided a protocol but not a fully fleshed
> > > > out architecture, so it's hard to fill in some of the blanks.  But it looks
> > > > to me like optimistic concurrency control for interactive transactions
> > > > applied to Accord would leave you in a LWT-like situation under fairly
> > > > light contention where nobody actually makes progress due to retries.
> > > >
> > > > To make sure we're talking about the same thing, as Henrik pointed out,
> > > > interactive transactions mean multiple round trips from the client within a
> > > > transaction.  For example, here
> > > > <https://github.com/apavlo/py-tpcc/blob/master/pytpcc/drivers/sqlitedriver.py#L213>
> > > > is a simple implementation of the TPC-C New Order transaction.  The high
> > > > level logic (via
> > > > <https://courses.cs.washington.edu/courses/csep545/01wi/lectures/class1/tsld039.htm>)
> > > > is,
> > > >
> > > >    1. Get records describing a warehouse, customer, & district
> > > >    2. Update the district
> > > >    3. Increment next available order number
> > > >    4. Insert record into Order and New-Order tables
> > > >    5. For 5-15 items, get Item record, get/update Stock record
> > > >    6. Insert Order-Line Record
> > > >
> > > > As you can see, this requires a lot of client-side logic mixed in with the
> > > > actual SQL commands.
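
To show what "client-side logic mixed in with the SQL" looks like in practice, here is a heavily simplified sketch of the New Order steps listed above, using Python's standard sqlite3 module against an in-memory database. The schema, the new_order helper and the restocking rule are illustrative assumptions; this is not the linked py-tpcc driver.

```python
import sqlite3

# isolation_level=None means autocommit; BEGIN/COMMIT are issued explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.executescript("""
    CREATE TABLE district (d_id INTEGER PRIMARY KEY, next_o_id INTEGER);
    CREATE TABLE stock (i_id INTEGER PRIMARY KEY, quantity INTEGER);
    CREATE TABLE orders (o_id INTEGER PRIMARY KEY, d_id INTEGER);
    CREATE TABLE order_line (o_id INTEGER, i_id INTEGER, qty INTEGER);
    INSERT INTO district VALUES (1, 1);
    INSERT INTO stock VALUES (10, 5), (11, 50);
""")


def new_order(conn, d_id, items):
    """Each statement is a client round trip; later ones depend on earlier reads."""
    conn.execute("BEGIN")
    try:
        # Read the next order number, then use it in statements built client-side.
        (o_id,) = conn.execute(
            "SELECT next_o_id FROM district WHERE d_id = ?", (d_id,)).fetchone()
        conn.execute(
            "UPDATE district SET next_o_id = ? WHERE d_id = ?", (o_id + 1, d_id))
        conn.execute("INSERT INTO orders VALUES (?, ?)", (o_id, d_id))
        for i_id, qty in items:
            (in_stock,) = conn.execute(
                "SELECT quantity FROM stock WHERE i_id = ?", (i_id,)).fetchone()
            # Client-side branch on a value read inside the transaction
            # (illustrative restocking rule: top up when stock would run low).
            remaining = in_stock - qty
            if remaining < 10:
                remaining += 91
            conn.execute(
                "UPDATE stock SET quantity = ? WHERE i_id = ?", (remaining, i_id))
            conn.execute("INSERT INTO order_line VALUES (?, ?, ?)", (o_id, i_id, qty))
        conn.execute("COMMIT")
        return o_id
    except Exception:
        conn.execute("ROLLBACK")
        raise


order_id = new_order(conn, 1, [(10, 3), (11, 4)])
print(order_id, conn.execute("SELECT * FROM stock ORDER BY i_id").fetchall())
```

The point of the sketch is that the statements issued in the loop cannot be known before the earlier reads return, which is what distinguishes an interactive transaction from one submitted to the database in a single request.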
> > > >
> > > >
> > > > On Thu, Sep 30, 2021 at 2:30 AM benedict@apache.org <benedict@apache.org>
> > > > wrote:
> > > >
> > > > > Essentially this, although I think in practice we will need to track each
> > > > > partition’s timestamp separately (or optionally for reduced conflicts, each
> > > > > row or datum’s), and make them all part of the conditional application of
> > > > > the transaction - at least for strict-serializability.
> > > > >
> > > > > The alternative is to insert read/write intents for the transaction during
> > > > > each step, and to confirm they are still valid on commit, but this approach
> > > > > would require a WAN round-trip for each step in the interactive
> > > > > transaction, whereas the timestamp-validating approach can use a LAN
> > > > > round-trip for each step besides the final one, and is also much simpler to
> > > > > implement.
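
The round-trip comparison above can be put into rough numbers. This is a back-of-the-envelope sketch: the function names are hypothetical, the 1 ms LAN and 100 ms WAN round-trip times are assumed purely for illustration, and the commit is counted as the final step in both cases.

```python
def intents_latency_ms(steps: int, wan_rtt_ms: float) -> float:
    """Read/write intents: every step of the transaction pays a WAN round trip."""
    return steps * wan_rtt_ms


def timestamp_validation_latency_ms(steps: int, lan_rtt_ms: float,
                                    wan_rtt_ms: float) -> float:
    """Timestamp validation: LAN round trips for all steps except the final one."""
    return (steps - 1) * lan_rtt_ms + wan_rtt_ms


STEPS, LAN_MS, WAN_MS = 5, 1.0, 100.0  # assumed values, not measurements

intents_total = intents_latency_ms(STEPS, WAN_MS)
validation_total = timestamp_validation_latency_ms(STEPS, LAN_MS, WAN_MS)
print(intents_total, validation_total)
```

Under these assumed numbers the intents approach grows linearly with the number of steps in WAN round trips, while the timestamp-validating approach stays close to a single WAN round trip.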
> > > > >
> > > > >
> > > > > From: Blake Eggleston <be...@apple.com.INVALID>
> > > > > Date: Thursday, 30 September 2021 at 05:47
> > > > > To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> > > > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > > You could establish a lower timestamp bound and buffer transaction state
> > > > > > on the coordinator, then make the commit an operation that only applies if
> > > > > > all partitions involved haven’t been changed by a more recent timestamp.
> > > > > > You could also implement MVCC either in the storage layer or for some
> > > > > > period of time by buffering commits on each replica before applying.
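
The first suggestion above (a lower timestamp bound, state buffered on the coordinator, and a commit that applies only if nothing touched has changed since) can be sketched in a few lines. This is a toy single-process model with hypothetical class names; in the proposal the conditional commit would itself be a distributed transaction, which this sketch does not attempt to model.

```python
from typing import Dict, Set


class Store:
    """Toy store that tracks a last-modified timestamp per partition."""

    def __init__(self) -> None:
        self.values: Dict[str, int] = {}
        self.last_modified: Dict[str, int] = {}
        self.clock = 0

    def write(self, key: str, value: int) -> None:
        self.clock += 1
        self.values[key] = value
        self.last_modified[key] = self.clock


class BufferedTxn:
    """Interactive transaction whose reads and writes are buffered until commit."""

    def __init__(self, store: Store) -> None:
        self.store = store
        self.lower_bound = store.clock  # timestamp bound fixed when the txn begins
        self.touched: Set[str] = set()
        self.writes: Dict[str, int] = {}

    def read(self, key: str) -> int:
        self.touched.add(key)
        return self.writes.get(key, self.store.values.get(key, 0))

    def write(self, key: str, value: int) -> None:
        self.touched.add(key)
        self.writes[key] = value

    def commit(self) -> bool:
        # Apply only if no touched partition was modified after the lower bound.
        for key in self.touched:
            if self.store.last_modified.get(key, 0) > self.lower_bound:
                return False  # abort; the client may retry
        for key, value in self.writes.items():
            self.store.write(key, value)
        return True


store = Store()
store.write("x", 10)
t1, t2 = BufferedTxn(store), BufferedTxn(store)
t1.write("x", t1.read("x") + 1)
t2.write("x", t2.read("x") + 2)
results = (t1.commit(), t2.commit())
print(results, store.values["x"])
```

Two transactions that both read and update the same partition cannot both commit: the first one to commit wins and the second is aborted, which is the abort behaviour of optimistic concurrency control discussed elsewhere in the thread.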
> > > > >
> > > > > > On Sep 29, 2021, at 6:18 PM, Jonathan Ellis <jb...@gmail.com>
> > > wrote:
> > > > > >
> > > > > > How are interactive transactions possible with Accord?
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Tue, Sep 21, 2021 at 11:56 PM benedict@apache.org <
> > > > > benedict@apache.org>
> > > > > > wrote:
> > > > > >
> > > > > >> Could you explain why you believe this trade-off is necessary?
> We
> > > can
> > > > > >> support full SQL just fine with Accord, and I hope that we
> > > eventually
> > > > > do so.
> > > > > >>
> > > > > >> This domain is incredibly complex, so it is easy to reach wrong
> > > > > >> conclusions. I would invite you again to propose a system for
> > > > discussion
> > > > > >> that you think offers something Accord is unable to, and that
> you
> > > > > consider
> > > > > >> desirable, and we can work from there.
> > > > > >>
> > > > > > >> To pre-empt some possible discussions, I am not aware of anything we
> > > > > > >> cannot do with Accord that we could do with either Calvin or Spanner.
> > > > > > >> Interactive transactions are possible on top of Accord, as are transactions
> > > > > > >> with an unknown read/write set. In each case the only cost is that they
> > > > > > >> would use optimistic concurrency control, which is no worse than the Spanner
> > > > > > >> derivatives anyway (which I have to assume is your benchmark in this
> > > > > > >> regard). I do not expect to deliver either functionality initially, but
> > > > > > >> Accord takes us most of the way there for both.
> > > > > >>
> > > > > >>
> > > > > >> From: Jonathan Ellis <jb...@gmail.com>
> > > > > >> Date: Wednesday, 22 September 2021 at 05:36
> > > > > >> To: dev <de...@cassandra.apache.org>
> > > > > >> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > > >> Right, I'm looking for exactly a discussion on the high level goals.
> > > > > > >> Instead of saying "here's the goals and we ruled out X because Y" we should
> > > > > > >> start with a discussion around, "Approach A allows X and W, approach B
> > > > > > >> allows Y and Z" and decide together what the goals should be and what
> > > > > > >> we are willing to trade to get those goals, e.g., are we willing to give up
> > > > > > >> global strict serializability to get the ability to support full SQL. Both
> > > > > > >> of these are nice to have!
> > > > > >>
> > > > > >> On Tue, Sep 21, 2021 at 9:52 PM benedict@apache.org <
> > > > > benedict@apache.org>
> > > > > >> wrote:
> > > > > >>
> > > > > >>> Hi Jonathan,
> > > > > >>>
> > > > > >>> These other systems are incompatible with the goals of the
> CEP. I
> > > do
> > > > > >>> discuss them (besides 2PC) in both the whitepaper and the CEP,
> > and
> > > > will
> > > > > >>> summarise that discussion below. A true and accurate comparison
> > of
> > > > > these
> > > > > >>> other systems is essentially intractable, as there are complex
> > > > > subtleties
> > > > > >>> to each flavour, and those who are interested would be better
> > > served
> > > > by
> > > > > >>> performing their own research.
> > > > > >>>
> > > > > >>> I think it is more productive to focus on what we want to
> achieve
> > > as
> > > > a
> > > > > >>> community. If you believe the goals of this CEP are wrong for
> the
> > > > > >> project,
> > > > > >>> let’s focus on that. If you want to compare and contrast
> specific
> > > > > facets
> > > > > >> of
> > > > > >>> alternative systems that you consider to be preferable in some
> > > > > dimension,
> > > > > >>> let’s do that here or in a Q&A as proposed by Joey.
> > > > > >>>
> > > > > >>> The relevant goals are that we:
> > > > > >>>
> > > > > >>>
> > > > > >>>  1.  Guarantee strict serializable isolation on commodity
> > hardware
> > > > > >>>  2.  Scale to any cluster size
> > > > > >>>  3.  Achieve optimal latency
> > > > > >>>
> > > > > >>> The approach taken by Spanner derivatives is rejected by (1)
> > > because
> > > > > they
> > > > > >>> guarantee only Serializable isolation (they additionally fail
> > (3)).
> > > > > From
> > > > > >>> watching talks by YugaByte, and inferring from Cockroach’s
> > > > > >>> panic-cluster-death under clock skew, this is clearly
> considered
> > by
> > > > > >>> everyone to be undesirable but necessary to achieve
> scalability.
> > > > > >>>
> > > > > >>> The approach taken by FaunaDB (Calvin) is rejected by (2)
> because
> > > its
> > > > > >>> sequencing layer requires a global leader process for the
> > cluster,
> > > > > which
> > > > > >> is
> > > > > >>> incompatible with Cassandra’s scalability requirements. It
> > > > additionally
> > > > > >>> fails (3) for global clients.
> > > > > >>>
> > > > > >>> Two phase commit fails (3). As an aside, AFAICT DynamoDB is
> > today a
> > > > > >>> Spanner clone for its multi-key transaction functionality, not
> > 2PC.
> > > > > >>>
> > > > > >>> Systems such as RAMP with even weaker isolation are not
> > considered
> > > > for
> > > > > >> the
> > > > > >>> simple reason that they do not even claim to meet (1).
> > > > > >>>
> > > > > >>> If we want to additionally offer weaker isolation levels than
> > > > > >>> Serializable, such as that provided by the recent RAMP-TAO
> paper,
> > > > > >> Cassandra
> > > > > >>> is likely able to support multiple distinct transaction layers
> > that
> > > > > >> operate
> > > > > >>> independently. I would encourage you to file a CEP to explore
> how
> > > we
> > > > > can
> > > > > >>> meet these distinct use cases, but I consider them to be
> niche. I
> > > > > expect
> > > > > >>> that a majority of our user base desire strict serializable
> > > > isolation,
> > > > > >> and
> > > > > >>> certainly no less than serializable isolation, to augment the
> > > > existing
> > > > > >>> weaker isolation offered by quorum reads and writes.
> > > > > >>>
> > > > > >>> I would tangentially note that we are not an AP database under
> > > normal
> > > > > >>> recommended operation. A minority in any network partition
> cannot
> > > > reach
> > > > > >>> QUORUM, so under recommended usage we are a high-availability
> > > > > leaderless
> > > > > >> CP
> > > > > >>> database.
> > > > > >>>
> > > > > >>>
> > > > > >>> From: Jonathan Ellis <jb...@gmail.com>
> > > > > >>> Date: Tuesday, 21 September 2021 at 23:45
> > > > > >>> To: dev <de...@cassandra.apache.org>
> > > > > >>> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > >>> Benedict, thanks for taking the lead in putting this together.
> > > Since
> > > > > >>> Cassandra is the only relevant database today designed around a
> > > > > >> leaderless
> > > > > >>> architecture, it's quite likely that we'll be better served
> with
> > a
> > > > > custom
> > > > > >>> transaction design instead of trying to retrofit one from CP
> > > systems.
> > > > > >>>
> > > > > >>> The whitepaper here is a good description of the consensus
> > > algorithm
> > > > > >> itself
> > > > > >>> as well as its robustness and stability characteristics, and
> its
> > > > > >> comparison
> > > > > >>> with other state-of-the-art consensus algorithms is very
> useful.
> > > In
> > > > > the
> > > > > >>> context of Cassandra, where a consensus algorithm is only part
> of
> > > > what
> > > > > >> will
> > > > > >>> be implemented, I'd like to see a more complete evaluation of
> the
> > > > > >>> transactional side of things as well, including performance
> > > > > >> characteristics
> > > > > >>> as well as the types of transactions that can be supported and
> at
> > > > > least a
> > > > > >>> general idea of what it would look like applied to Cassandra.
> > This
> > > > will
> > > > > >>> allow the PMC to make a more informed decision about what
> > tradeoffs
> > > > are
> > > > > >>> best for the entire long-term project of first supplementing
> and
> > > > > >> ultimately
> > > > > >>> replacing LWT.
> > > > > >>>
> > > > > >>> (Allowing users to mix LWT and AP Cassandra operations against
> > the
> > > > same
> > > > > >>> rows was probably a mistake, so in contrast with LWT we’re not
> > > > looking
> > > > > >> for
> > > > > >>> something fast enough for occasional use but rather something
> > > within
> > > > a
> > > > > >>> reasonable factor of AP operations, appropriate to being the
> only
> > > way
> > > > > to
> > > > > >>> interact with tables declared as such.)
> > > > > >>>
> > > > > >>> Besides Accord, this should cover
> > > > > >>>
> > > > > >>> - Calvin and FaunaDB
> > > > > >>> - A Spanner derivative (no opinion on whether that should be
> > > > Cockroach
> > > > > or
> > > > > >>> Yugabyte, I don’t think it’s necessary to cover both)
> > > > > >>> - A 2PC implementation (the Accord paper mentions DynamoDB but
> I
> > > > > suspect
> > > > > >>> there is more public information about MongoDB)
> > > > > >>> - RAMP
> > > > > >>>
> > > > > >>> Here’s an example of what I mean:
> > > > > >>>
> > > > > >>> =Calvin=
> > > > > >>>
> > > > > >>> Approach: global consensus (Paxos in Calvin, Raft in FaunaDB)
> to
> > > > order
> > > > > >>> transactions, then replicas execute the transactions
> > independently
> > > > with
> > > > > >> no
> > > > > >>> further coordination.  No SPOF.  Transactions are batched by
> each
> > > > > >> sequencer
> > > > > >>> to keep this from becoming a bottleneck.
> > > > > >>>
> > > > > >>> Performance: Calvin paper (published 2012) reports linear
> scaling
> > > of
> > > > > >> TPC-C
> > > > > >>> New Order up to 500,000 transactions/s on 100 machines (EC2 XL
> > > > machines
> > > > > >>> with 7GB ram and 8 virtual cores).  Note that TPC-C New Order
> is
> > > > > composed
> > > > > >>> of four reads and four writes, so this is effectively 2M reads
> > and
> > > 2M
> > > > > >>> writes as we normally measure them in C*.
> > > > > >>>
> > > > > >>> Calvin supports mixed read/write transactions, but because the
> > > > > >> transaction
> > > > > >>> execution logic requires knowing all partition keys in advance
> to
> > > > > ensure
> > > > > >>> that all replicas can reproduce the same results with no
> > > > coordination,
> > > > > >>> reads against non-PK predicates must be done ahead of time
> > > > > >> (transparently,
> > > > > >>> by the server) to determine the set of keys, and this must be
> > > retried
> > > > > if
> > > > > >>> the set of rows affected is updated before the actual
> transaction
> > > > > >> executes.
> > > > > >>>
> > > > > >>> Batching and global consensus adds latency -- 100ms in the
> Calvin
> > > > paper
> > > > > >> and
> > > > > >>> apparently about 50ms in FaunaDB.  Glass half full: all
> > > transactions
> > > > > >>> (including multi-partition updates) are equally performant in
> > > Calvin
> > > > > >> since
> > > > > >>> the coordination is handled up front in the sequencing step.
> > Glass
> > > > > half
> > > > > >>> empty: even single-row reads and writes have to pay the full
> > > > > coordination
> > > > > >>> cost.  Fauna has optimized this away for reads but I am not
> aware
> > > of
> > > > a
> > > > > >>> description of how they changed the design to allow this.
> > > > > >>>
> > > > > >>> Functionality and limitations: since the entire transaction
> must
> > be
> > > > > known
> > > > > >>> in advance to allow coordination-less execution at the
> replicas,
> > > > Calvin
> > > > > >>> cannot support interactive transactions at all.  FaunaDB
> > mitigates
> > > > this
> > > > > >> by
> > > > > >>> allowing server-side logic to be included, but a Calvin
> approach
> > > will
> > > > > >> never
> > > > > >>> be able to offer SQL compatibility.
> > > > > >>>
> > > > > >>> Guarantees: Calvin transactions are strictly serializable.
> There
> > > is
> > > > no
> > > > > >>> additional complexity or performance hit to generalizing to
> > > multiple
> > > > > >>> regions, apart from the speed of light.  And since Calvin is
> > > already
> > > > > >> paying
> > > > > >>> a batching latency penalty, this is less painful than for other
> > > > > systems.
> > > > > >>>
> > > > > >>> Application to Cassandra: B-.  Distributed transactions are
> > handled
> > > > by
> > > > > >> the
> > > > > >>> sequencing and scheduling layers, which are leaderless, and
> > > Calvin’s
> > > > > >>> requirements for the storage layer are easily met by C*.  But
> > > Calvin
> > > > > also
> > > > > >>> requires a global consensus protocol and LWT is almost
> certainly
> > > not
> > > > > >>> sufficiently performant, so this would require ZK or etcd
> > > (reasonable
> > > > > >> for a
> > > > > >>> library approach but not for replacing LWT in C* itself), or an
> > > > > >>> implementation of Accord.  I don’t believe Calvin would require
> > > > > >> additional
> > > > > >>> table-level metadata in Cassandra.
> > > > > >>>
> > > > > >>> On Sun, Sep 5, 2021 at 9:33 AM benedict@apache.org <
> > > > > benedict@apache.org>
> > > > > >>> wrote:
> > > > > >>>
> > > > > >>>> Wiki:
> > > > > >>>>
> > > > > >>>
> > > > > >>
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-15%3A+General+Purpose+Transactions
> > > > > >>>> Whitepaper:
> > > > > >>>>
> > > > > >>>
> > > > > >>
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf
> > > > > >>>> <
> > > > > >>>>
> > > > > >>>
> > > > > >>
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf?version=1&modificationDate=1630847736966&api=v2
> > > > > >>>>>
> > > > > >>>> Prototype: https://github.com/belliottsmith/accord
> > > > > >>>>
> > > > > >>>> Hi everyone, I’d like to propose this CEP for adoption by the
> > > > > >> community.
> > > > > >>>>
> > > > > >>>> Cassandra has benefitted from LWTs for many years, but
> > application
> > > > > >>>> developers that want to ensure consistency for complex
> > operations
> > > > must
> > > > > >>>> either accept the scalability bottleneck of serializing all
> > > related
> > > > > >> state
> > > > > >>>> through a single partition, or layer a complex state machine
> on
> > > top
> > > > of
> > > > > >>> the
> > > > > >>>> database. These are sophisticated and costly activities that
> our
> > > > users
> > > > > >>>> should not be expected to undertake. Since distributed
> databases
> > > are
> > > > > >>>> beginning to offer distributed transactions with fewer
> caveats,
> > it
> > > > is
> > > > > >>> past
> > > > > >>>> time for Cassandra to do so as well.
> > > > > >>>>
> > > > > >>>> This CEP proposes the use of several novel techniques that
> build
> > > > upon
> > > > > >>>> research (that followed EPaxos) to deliver (non-interactive)
> > > general
> > > > > >>>> purpose distributed transactions. The approach is outlined in
> > the
> > > > > >>> wikipage
> > > > > >>>> and in more detail in the linked whitepaper. Importantly, by
> > > > adopting
> > > > > >>> this
> > > > > >>>> approach we will be the _only_ distributed database to offer
> > > global,
> > > > > >>>> scalable, strict serializable transactions in one wide area
> > > > > round-trip.
> > > > > >>>> This would represent a significant improvement in the state of
> > the
> > > > > art,
> > > > > >>>> both in the academic literature and in commercial or open
> source
> > > > > >>> offerings.
> > > > > >>>>
> > > > > >>>> This work has been partially realised in a prototype. This
> > partial
> > > > > >>>> prototype has been verified against Jepsen.io’s Maelstrom
> > library
> > > > and
> > > > > >>>> dedicated in-tree strict serializability verification tools,
> but
> > > > much
> > > > > >>> work
> > > > > >>>> remains for the work to be production capable and integrated
> > into
> > > > > >>> Cassandra.
> > > > > >>>>
> > > > > >>>> I propose including the prototype in the project as a new
> source
> > > > > >>>> repository, to be developed as a standalone library for
> > > integration
> > > > > >> into
> > > > > >>>> Cassandra. I hope the community sees the important value
> > > proposition
> > > > > of
> > > > > >>>> this proposal, and will adopt the CEP after this discussion,
> so
> > > that
> > > > > >> the
> > > > > >>>> library and its integration into Cassandra can be developed in
> > > > > parallel
> > > > > >>> and
> > > > > >>>> with the involvement of the wider community.
> > > > > >>>>
> > > > > >>>
> > > > > >>>
> > > > > >>> --
> > > > > >>> Jonathan Ellis
> > > > > >>> co-founder, http://www.datastax.com
> > > > > >>> @spyced
> > > > > >>>
> > > > > >>
> > > > > >>
> > > > > >> --
> > > > > >> Jonathan Ellis
> > > > > >> co-founder, http://www.datastax.com
> > > > > >> @spyced
> > > > > >>
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Jonathan Ellis
> > > > > > co-founder, http://www.datastax.com
> > > > > > @spyced
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Jonathan Ellis
> > > > co-founder, http://www.datastax.com
> > > > @spyced
> > > >
> > >
> >
>

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by Paulo Motta <pa...@gmail.com>.

Can you just answer what palpable feature will be available once this CEP
lands? This is still not clear to me (and perhaps to others) from
the current CEP structure. The current document thoroughly details the
protocol but, in my view, fails to illustrate what specific APIs, methods, and
modules will become available to developers, how it fits into the larger
picture and interacts with existing modules (if at all), and perhaps a few
examples of how it can be used to build features on top.

On Fri, Oct 1, 2021 at 11:10, benedict@apache.org <
benedict@apache.org> wrote:

> I’m not, though it might seem that way. I disagree with your views about
> how CEPs should be structured. Since the CEP process was itself codified via
> the CEP process, if you want to recodify how CEPs work, the correct way is
> via the CEP process itself.
>
> The discussion is being drawn in multiple directions away from the CEP
> itself, and I am trying to keep this particular thread focused on the
> business at hand, not meta discussions around CEP structure that will no
> doubt be unproductive given likely irreconcilable views about the topic,
> nor discussions about other CEPs that could have been.
>
> If you want to start a separate exploratory discussion thread about CEP
> structure without filing a CEP, feel free to do so.
>
>
> From: Paulo Motta <pa...@gmail.com>
> Date: Friday, 1 October 2021 at 15:04
> To: Cassandra DEV <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > If you want to impose your views on CEP structure on others, please file
> a CEP with the additional restrictions and guidance you want to impose and
> start a discussion thread. I can then respond in detail to why I perceive
> this approach to be flawed, in a dedicated context.
>
> This sounds very kafkaesque. You know I won't file a meta-CEP to change the
> structure of CEPs, so you're just using this as an excuse to shut down the
> discussion on the lack of clarity on what actual palpable feature will be
> available once the CEP lands. :-)
>
> I'm just providing my humble feedback on how a CEP could be more digestible
> and easier to consume from an external point of view, and this seems like
> an appropriate and contextualized place to voice this opinion which is
> perhaps shared by others.
>
> On Fri, Oct 1, 2021 at 10:55, benedict@apache.org <
> benedict@apache.org> wrote:
>
> > I disagree with you. However, this is the wrong forum to have a meta
> > discussion about how CEP should be structured.
> >
> > If you want to impose your views on CEP structure on others, please file
> a
> > CEP with the additional restrictions and guidance you want to impose and
> > start a discussion thread. I can then respond in detail to why I perceive
> > this approach to be flawed, in a dedicated context.
> >
> >
> > From: Paulo Motta <pa...@gmail.com>
> > Date: Friday, 1 October 2021 at 14:48
> > To: Cassandra DEV <de...@cassandra.apache.org>
> > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > The proposal as it stands today is exceptionally thorough, more so than
> > any other CEP to date, or any CEP is likely to be in the near future.
> >
> > The protocol is thoroughly described, but in my view a CEP is a forum to
> > discuss the high level architecture and plan for adding a full end-to-end
> > enhancement to the database, breaking it into sub-CEPs if needed, as long
> > as the full plan is known in advance; otherwise the community will not have
> > the context to judge the full extent and impact of the proposed
> > enhancement.
> >
> > > Since it remains unclear to me what either yourself or Jonathan want to
> > see as an alternative
> >
> > I would personally like to see something along these lines:
> >
> > CEP1: Add ACID-compliant atomic batches
> > - UX changes needed: none, CQL provides the grammar we need.
> > - Distributed transaction protocol needed: Accord (link to white paper if
> > you want specific details about the protocol)
> > - High-level architecture: what new components will be added, how existing
> > components will be modified, what new messages will be added, what new
> > configuration knobs will be introduced, what are the milestones of the
> > project, etc.
> >
> > CEP2: Make LWT faster and more reliable
> > - UX changes needed: none
> > - Distributed transaction protocol needed: Accord, already added by
> > previous CEP.
> > - High-level architecture: blablabla... and so on.
> >
> > On Fri, Oct 1, 2021 at 10:19, benedict@apache.org <
> > benedict@apache.org> wrote:
> >
> > > I think this is getting circular and unproductive. Basic disagreements
> > > about whether the CEP specifies a feature I am inclined to leave for a
> > > vote. In my view the CEP specifies several features, both immediate ones
> > > for the user (ACID batches and multi-key LWTs) and developer-focused ones
> > > around ground-breaking semantics that will be enabled.
> > >
> > > The proposal as it stands today is exceptionally thorough, more so than
> > > any other CEP to date, or any CEP is likely to be in the near future.
> > >
> > > This is a Cassandra Enhancement *Proposal*, and at some point we have to
> > > engage with what is proposed, not what you might like to be proposed. Since
> > > it remains unclear to me what either yourself or Jonathan want to see as an
> > > alternative, at this point it would seem more productive to produce your
> > > own proposals for the community to consider. It is possible for multiple
> > > transaction systems to co-exist, if you feel this is necessary.
> > >
> > >
> > >
> > > From: Paulo Motta <pa...@gmail.com>
> > > Date: Friday, 1 October 2021 at 13:58
> > > To: Cassandra DEV <de...@cassandra.apache.org>
> > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > I share similar feelings as jbellis that this proposal seems to be focusing
> > > on the protocol itself but lacking the actual feature that will use the
> > > protocol, which IMO is a key element to discuss in a CEP.
> > >
> > > It's similar to saying: hey I want to add this Tries Serialization Protocol
> > > to Cassandra, but not providing specific details of how this protocol is
> > > going to be used.
> > >
> > > I think the right route for a CEP is to describe the feature that will be
> > > added to the database and the protocol is a mere requirement of the
> > > high-level feature, for example:
> > >
> > > CEP: Add Trie-backed memtable
> > > - Trie Serialization Protocol: implementation detail of the above CEP
> > >
> > > What is the difficulty of taking this approach, picking one of the myriad
> > > of features that will be enabled by Accord and using that as the initial
> > > CEP to introduce the protocol to the database?
> > >
> > > On Fri, Oct 1, 2021 at 08:37, benedict@apache.org <
> > > benedict@apache.org> wrote:
> > >
> > > > Actually, thinking about it again, the simple optimistic protocol would in
> > > > fact guarantee system forward progress (i.e. independent of transaction
> > > > formulation).
> > > >
> > > >
> > > > From: benedict@apache.org <be...@apache.org>
> > > > Date: Friday, 1 October 2021 at 09:14
> > > > To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> > > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > Hi Jonathan,
> > > >
> > > > It would be great if we could achieve a bandwidth higher than 1-2 short
> > > > emails per week. It remains unclear to me what your goal is, and it would
> > > > help if you could make a statement like “I want Cassandra to be able to do
> > > > X” so that we can respond directly to it. I am also available to have
> > > > another call, in which we can have a back and forth, please feel free to
> > > > propose a London-compatible time within the next week that is suitable for
> > > > you.
> > > >
> > > > In my opinion we are at risk of veering off-topic, though. This CEP is not
> > > > to deliver interactive transactions, and to my knowledge nobody is
> > > > proposing a CEP for interactive transactions. So, for the CEP at hand the
> > > > salient question seems: does this CEP prevent us from implementing
> > > > interactive transactions with properties X, Y, Z in future? To which the
> > > > answer is almost certainly no.
> > > >
> > > > However, to continue the discussion and respond directly to your queries,
> > > > I believe we agree on the definition of an interactive transaction.
> > > >
> > > > Two protocols were loosely outlined. The first, using timestamps for
> > > > optimistic concurrency control, would indeed involve the possibility of
> > > > aborts. It would not however inherently adopt the issue of LWTs where no
> > > > transaction is able to make progress. Whether or not progress is guaranteed
> > > > (in a livelock-free sense) would depend on the structure of the
> > > > transactions that were interfering.
> > > >
> > > > This approach has the advantage of being very simple to implement, so that
> > > > we could realistically support interactive transactions quite quickly. It
> > > > has the additional advantage that transactions would execute very quickly
> > > > by avoiding the WAN during construction, and as a result may in practice
> > > > experience fewer aborts than protocols that guarantee livelock-freedom.
> > > >
> > > > The second protocol proposed using read/write intents and would be able to
> > > > support almost any behaviour you want. We could even utilise pessimistic
> > > > concurrency control, or anything in-between. This is its own huge design
> > > > space, and discussion of this approach and the trade-offs that could be
> > > > made is (in my opinion) entirely out of scope for this CEP.
> > > >
> > > >
> > > > From: Jonathan Ellis <jb...@gmail.com>
> > > > Date: Friday, 1 October 2021 at 05:00
> > > > To: dev <de...@cassandra.apache.org>
> > > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > The obstacle for me is you've provided a protocol but not a fully fleshed
> > > > out architecture, so it's hard to fill in some of the blanks.  But it looks
> > > > to me like optimistic concurrency control for interactive transactions
> > > > applied to Accord would leave you in a LWT-like situation under fairly
> > > > light contention where nobody actually makes progress due to retries.
> > > >
> > > > To make sure we're talking about the same thing, as Henrik pointed out,
> > > > interactive transactions mean multiple round trips from the client within a
> > > > transaction.  For example, here
> > > > <https://github.com/apavlo/py-tpcc/blob/master/pytpcc/drivers/sqlitedriver.py#L213>
> > > > is a simple implementation of the TPC-C New Order transaction.  The high
> > > > level logic (via
> > > > <https://courses.cs.washington.edu/courses/csep545/01wi/lectures/class1/tsld039.htm>)
> > > > is,
> > > >
> > > >    1. Get records describing a warehouse, customer, & district
> > > >    2. Update the district
> > > >    3. Increment next available order number
> > > >    4. Insert record into Order and New-Order tables
> > > >    5. For 5-15 items, get Item record, get/update Stock record
> > > >    6. Insert Order-Line Record
> > > >
> > > > As you can see, this requires a lot of client-side logic mixed in with the
> > > > actual SQL commands.
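
To make the point about client-side logic concrete, here is a minimal sketch of a heavily simplified New Order-style interactive transaction against an in-memory SQLite database. The schema, the table and column names, and the `make_db` and `new_order` helpers are all hypothetical and far smaller than real TPC-C; it is not taken from the linked py-tpcc driver. The only thing it is meant to show is that each statement's result feeds client code that decides the next statement, so the statements cannot all be submitted up front.

```python
import sqlite3


def make_db():
    """Create a tiny, hypothetical schema loosely modelled on TPC-C."""
    # isolation_level=None: we issue BEGIN/COMMIT/ROLLBACK ourselves.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.executescript(
        """
        CREATE TABLE district (id INTEGER PRIMARY KEY, next_o_id INTEGER);
        CREATE TABLE stock (i_id INTEGER PRIMARY KEY, quantity INTEGER);
        CREATE TABLE orders (id INTEGER, d_id INTEGER);
        CREATE TABLE order_line (o_id INTEGER, i_id INTEGER, qty INTEGER);
        INSERT INTO district VALUES (1, 1);
        INSERT INTO stock VALUES (1, 50), (2, 5);
        """
    )
    return conn


def new_order(conn, d_id, items):
    """Run one simplified New Order; `items` maps item id -> ordered quantity."""
    conn.execute("BEGIN")
    try:
        # Round trip: read the district to learn the next order number.
        (o_id,) = conn.execute(
            "SELECT next_o_id FROM district WHERE id = ?", (d_id,)
        ).fetchone()
        # Round trip: the value written depends on what was just read.
        conn.execute(
            "UPDATE district SET next_o_id = ? WHERE id = ?", (o_id + 1, d_id)
        )
        conn.execute("INSERT INTO orders VALUES (?, ?)", (o_id, d_id))
        for i_id, qty in items.items():
            row = conn.execute(
                "SELECT quantity FROM stock WHERE i_id = ?", (i_id,)
            ).fetchone()
            if row is None:
                raise LookupError(f"unknown item {i_id}")  # aborts the order
            # Client-side branch between statements: restock when running low.
            remaining = row[0] - qty
            if remaining < 10:
                remaining += 91
            conn.execute(
                "UPDATE stock SET quantity = ? WHERE i_id = ?", (remaining, i_id)
            )
            conn.execute(
                "INSERT INTO order_line VALUES (?, ?, ?)", (o_id, i_id, qty)
            )
        conn.execute("COMMIT")
        return o_id
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

Calling `new_order(make_db(), 1, {1: 3, 2: 2})` runs the whole sequence as one transaction; an unknown item id raises and rolls everything back.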
> > > >
> > > >
> > > > On Thu, Sep 30, 2021 at 2:30 AM benedict@apache.org <
> > benedict@apache.org
> > > >
> > > > wrote:
> > > >
> > > > > Essentially this, although I think in practice we will need to track each
> > > > > partition’s timestamp separately (or optionally for reduced conflicts, each
> > > > > row or datum’s), and make them all part of the conditional application of
> > > > > the transaction - at least for strict-serializability.
> > > > >
> > > > > The alternative is to insert read/write intents for the transaction during
> > > > > each step, and to confirm they are still valid on commit, but this approach
> > > > > would require a WAN round-trip for each step in the interactive
> > > > > transaction, whereas the timestamp-validating approach can use a LAN
> > > > > round-trip for each step besides the final one, and is also much simpler to
> > > > > implement.
> > > > >
> > > > >
> > > > > From: Blake Eggleston <be...@apple.com.INVALID>
> > > > > Date: Thursday, 30 September 2021 at 05:47
> > > > > To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> > > > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > You could establish a lower timestamp bound and buffer transaction state
> > > > > on the coordinator, then make the commit an operation that only applies if
> > > > > all partitions involved haven’t been changed by a more recent timestamp.
> > > > > You could also implement MVCC either in the storage layer or for some
> > > > > period of time by buffering commits on each replica before applying.
> > > > >
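
The timestamp-bounded commit described above can be sketched as a single-process toy. This is only an illustration under stated assumptions: `Store`, `InteractiveTxn`, and every method name below are hypothetical, there is no replication or real clock, and the validate-then-apply step is atomic only because the example is single-threaded. In the design being discussed, that final step would presumably be the one operation submitted through the consensus protocol.

```python
class Store:
    """Toy in-memory store that remembers the last write timestamp per partition."""

    def __init__(self):
        self.values = {}
        self.last_write_ts = {}
        self.clock = 0

    def next_ts(self):
        # Stand-in for a real timestamp source: a simple counter.
        self.clock += 1
        return self.clock


class InteractiveTxn:
    """Buffers reads and writes on the 'coordinator' until commit."""

    def __init__(self, store):
        self.store = store
        self.lower_bound = store.next_ts()  # lower timestamp bound
        self.touched = set()                # partitions read or written
        self.buffered = {}                  # writes not yet applied

    def read(self, key):
        self.touched.add(key)
        if key in self.buffered:            # read your own buffered writes
            return self.buffered[key]
        return self.store.values.get(key)

    def write(self, key, value):
        self.touched.add(key)
        self.buffered[key] = value

    def commit(self):
        """Apply buffered writes only if no touched partition changed since the bound."""
        for key in self.touched:
            if self.store.last_write_ts.get(key, 0) > self.lower_bound:
                return False                # conflict: caller must retry
        commit_ts = self.store.next_ts()
        for key, value in self.buffered.items():
            self.store.values[key] = value
            self.store.last_write_ts[key] = commit_ts
        return True
```

Two transactions that both read and then write the same key illustrate the abort case: whichever commits second sees a newer write timestamp than its lower bound and has to retry, while transactions on disjoint keys both commit.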
> > > > > > On Sep 29, 2021, at 6:18 PM, Jonathan Ellis <jb...@gmail.com>
> > > wrote:
> > > > > >
> > > > > > How are interactive transactions possible with Accord?
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Tue, Sep 21, 2021 at 11:56 PM benedict@apache.org <
> > > > > benedict@apache.org>
> > > > > > wrote:
> > > > > >
> > > > > > >> Could you explain why you believe this trade-off is necessary? We can
> > > > > > >> support full SQL just fine with Accord, and I hope that we eventually do so.
> > > > > > >>
> > > > > > >> This domain is incredibly complex, so it is easy to reach wrong
> > > > > > >> conclusions. I would invite you again to propose a system for discussion
> > > > > > >> that you think offers something Accord is unable to, and that you consider
> > > > > > >> desirable, and we can work from there.
> > > > > > >>
> > > > > > >> To pre-empt some possible discussions, I am not aware of anything we
> > > > > > >> cannot do with Accord that we could do with either Calvin or Spanner.
> > > > > > >> Interactive transactions are possible on top of Accord, as are transactions
> > > > > > >> with an unknown read/write set. In each case the only cost is that they
> > > > > > >> would use optimistic concurrency control, which is no worse than Spanner
> > > > > > >> derivatives anyway (which I have to assume is your benchmark in this
> > > > > > >> regard). I do not expect to deliver either functionality initially, but
> > > > > > >> Accord takes us most of the way there for both.
> > > > > >>
> > > > > >>
> > > > > >> From: Jonathan Ellis <jb...@gmail.com>
> > > > > >> Date: Wednesday, 22 September 2021 at 05:36
> > > > > >> To: dev <de...@cassandra.apache.org>
> > > > > >> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > > >> Right, I'm looking for exactly a discussion on the high level goals.
> > > > > > >> Instead of saying "here's the goals and we ruled out X because Y" we should
> > > > > > >> start with a discussion around, "Approach A allows X and W, approach B
> > > > > > >> allows Y and Z" and decide together what the goals should be and what
> > > > > > >> we are willing to trade to get those goals, e.g., are we willing to give up
> > > > > > >> global strict serializability to get the ability to support full SQL. Both
> > > > > > >> of these are nice to have!
> > > > > >>
> > > > > >> On Tue, Sep 21, 2021 at 9:52 PM benedict@apache.org <
> > > > > benedict@apache.org>
> > > > > >> wrote:
> > > > > >>
> > > > > >>> Hi Jonathan,
> > > > > >>>
> > > > > > >>> These other systems are incompatible with the goals of the CEP. I do
> > > > > > >>> discuss them (besides 2PC) in both the whitepaper and the CEP, and will
> > > > > > >>> summarise that discussion below. A true and accurate comparison of these
> > > > > > >>> other systems is essentially intractable, as there are complex subtleties
> > > > > > >>> to each flavour, and those who are interested would be better served by
> > > > > > >>> performing their own research.
> > > > > > >>>
> > > > > > >>> I think it is more productive to focus on what we want to achieve as a
> > > > > > >>> community. If you believe the goals of this CEP are wrong for the project,
> > > > > > >>> let’s focus on that. If you want to compare and contrast specific facets of
> > > > > > >>> alternative systems that you consider to be preferable in some dimension,
> > > > > > >>> let’s do that here or in a Q&A as proposed by Joey.
> > > > > >>>
> > > > > >>> The relevant goals are that we:
> > > > > >>>
> > > > > >>>
> > > > > > >>>  1.  Guarantee strict serializable isolation on commodity hardware
> > > > > > >>>  2.  Scale to any cluster size
> > > > > > >>>  3.  Achieve optimal latency
> > > > > > >>>
> > > > > > >>> The approach taken by Spanner derivatives is rejected by (1) because they
> > > > > > >>> guarantee only Serializable isolation (they additionally fail (3)). From
> > > > > > >>> watching talks by YugaByte, and inferring from Cockroach’s
> > > > > > >>> panic-cluster-death under clock skew, this is clearly considered by
> > > > > > >>> everyone to be undesirable but necessary to achieve scalability.
> > > > > > >>>
> > > > > > >>> The approach taken by FaunaDB (Calvin) is rejected by (2) because its
> > > > > > >>> sequencing layer requires a global leader process for the cluster, which is
> > > > > > >>> incompatible with Cassandra’s scalability requirements. It additionally
> > > > > > >>> fails (3) for global clients.
> > > > > > >>>
> > > > > > >>> Two phase commit fails (3). As an aside, AFAICT DynamoDB is today a
> > > > > > >>> Spanner clone for its multi-key transaction functionality, not 2PC.
> > > > > > >>>
> > > > > > >>> Systems such as RAMP with even weaker isolation are not considered for the
> > > > > > >>> simple reason that they do not even claim to meet (1).
> > > > > >>>
> > > > > > >>> If we want to additionally offer weaker isolation levels than
> > > > > > >>> Serializable, such as that provided by the recent RAMP-TAO paper, Cassandra
> > > > > > >>> is likely able to support multiple distinct transaction layers that operate
> > > > > > >>> independently. I would encourage you to file a CEP to explore how we can
> > > > > > >>> meet these distinct use cases, but I consider them to be niche. I expect
> > > > > > >>> that a majority of our user base desire strict serializable isolation, and
> > > > > > >>> certainly no less than serializable isolation, to augment the existing
> > > > > > >>> weaker isolation offered by quorum reads and writes.
> > > > > > >>>
> > > > > > >>> I would tangentially note that we are not an AP database under normal
> > > > > > >>> recommended operation. A minority in any network partition cannot reach
> > > > > > >>> QUORUM, so under recommended usage we are a high-availability leaderless CP
> > > > > > >>> database.
> > > > > >>>
> > > > > >>>
> > > > > >>> From: Jonathan Ellis <jb...@gmail.com>
> > > > > >>> Date: Tuesday, 21 September 2021 at 23:45
> > > > > >>> To: dev <de...@cassandra.apache.org>
> > > > > >>> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > >>> Benedict, thanks for taking the lead in putting this together.
> > > Since
> > > > > >>> Cassandra is the only relevant database today designed around a
> > > > > >> leaderless
> > > > > >>> architecture, it's quite likely that we'll be better served
> with
> > a
> > > > > custom
> > > > > >>> transaction design instead of trying to retrofit one from CP
> > > systems.
> > > > > >>>
> > > > > >>> The whitepaper here is a good description of the consensus
> > > algorithm
> > > > > >> itself
> > > > > >>> as well as its robustness and stability characteristics, and
> its
> > > > > >> comparison
> > > > > >>> with other state-of-the-art consensus algorithms is very
> useful.
> > > In
> > > > > the
> > > > > >>> context of Cassandra, where a consensus algorithm is only part
> of
> > > > what
> > > > > >> will
> > > > > >>> be implemented, I'd like to see a more complete evaluation of
> the
> > > > > >>> transactional side of things as well, including performance
> > > > > >> characteristics
> > > > > >>> as well as the types of transactions that can be supported and
> at
> > > > > least a
> > > > > >>> general idea of what it would look like applied to Cassandra.
> > This
> > > > will
> > > > > >>> allow the PMC to make a more informed decision about what
> > tradeoffs
> > > > are
> > > > > >>> best for the entire long-term project of first supplementing
> and
> > > > > >> ultimately
> > > > > >>> replacing LWT.
> > > > > >>>
> > > > > >>> (Allowing users to mix LWT and AP Cassandra operations against
> > the
> > > > same
> > > > > >>> rows was probably a mistake, so in contrast with LWT we’re not
> > > > looking
> > > > > >> for
> > > > > >>> something fast enough for occasional use but rather something
> > > within
> > > > a
> > > > > >>> reasonable factor of AP operations, appropriate to being the
> only
> > > way
> > > > > to
> > > > > >>> interact with tables declared as such.)
> > > > > >>>
> > > > > >>> Besides Accord, this should cover
> > > > > >>>
> > > > > >>> - Calvin and FaunaDB
> > > > > >>> - A Spanner derivative (no opinion on whether that should be
> > > > Cockroach
> > > > > or
> > > > > >>> Yugabyte, I don’t think it’s necessary to cover both)
> > > > > >>> - A 2PC implementation (the Accord paper mentions DynamoDB but
> I
> > > > > suspect
> > > > > >>> there is more public information about MongoDB)
> > > > > >>> - RAMP
> > > > > >>>
> > > > > >>> Here’s an example of what I mean:
> > > > > >>>
> > > > > >>> =Calvin=
> > > > > >>>
> > > > > >>> Approach: global consensus (Paxos in Calvin, Raft in FaunaDB)
> to
> > > > order
> > > > > >>> transactions, then replicas execute the transactions
> > independently
> > > > with
> > > > > >> no
> > > > > >>> further coordination.  No SPOF.  Transactions are batched by
> each
> > > > > >> sequencer
> > > > > >>> to keep this from becoming a bottleneck.
> > > > > >>>
> > > > > >>> Performance: Calvin paper (published 2012) reports linear
> scaling
> > > of
> > > > > >> TPC-C
> > > > > >>> New Order up to 500,000 transactions/s on 100 machines (EC2 XL
> > > > machines
> > > > > >>> with 7GB ram and 8 virtual cores).  Note that TPC-C New Order
> is
> > > > > composed
> > > > > >>> of four reads and four writes, so this is effectively 2M reads
> > and
> > > 2M
> > > > > >>> writes as we normally measure them in C*.
> > > > > >>>
> > > > > >>> Calvin supports mixed read/write transactions, but because the
> > > > > >> transaction
> > > > > >>> execution logic requires knowing all partition keys in advance
> to
> > > > > ensure
> > > > > >>> that all replicas can reproduce the same results with no
> > > > coordination,
> > > > > >>> reads against non-PK predicates must be done ahead of time
> > > > > >> (transparently,
> > > > > >>> by the server) to determine the set of keys, and this must be
> > > retried
> > > > > if
> > > > > >>> the set of rows affected is updated before the actual
> transaction
> > > > > >> executes.
> > > > > >>>
> > > > > >>> Batching and global consensus adds latency -- 100ms in the
> Calvin
> > > > paper
> > > > > >> and
> > > > > >>> apparently about 50ms in FaunaDB.  Glass half full: all
> > > transactions
> > > > > >>> (including multi-partition updates) are equally performant in
> > > Calvin
> > > > > >> since
> > > > > >>> the coordination is handled up front in the sequencing step.
> > Glass
> > > > > half
> > > > > >>> empty: even single-row reads and writes have to pay the full
> > > > > coordination
> > > > > >>> cost.  Fauna has optimized this away for reads but I am not
> aware
> > > of
> > > > a
> > > > > >>> description of how they changed the design to allow this.
> > > > > >>>
> > > > > >>> Functionality and limitations: since the entire transaction
> must
> > be
> > > > > known
> > > > > >>> in advance to allow coordination-less execution at the
> replicas,
> > > > Calvin
> > > > > >>> cannot support interactive transactions at all.  FaunaDB
> > mitigates
> > > > this
> > > > > >> by
> > > > > >>> allowing server-side logic to be included, but a Calvin
> approach
> > > will
> > > > > >> never
> > > > > >>> be able to offer SQL compatibility.
> > > > > >>>
> > > > > >>> Guarantees: Calvin transactions are strictly serializable.
> There
> > > is
> > > > no
> > > > > >>> additional complexity or performance hit to generalizing to
> > > multiple
> > > > > >>> regions, apart from the speed of light.  And since Calvin is
> > > already
> > > > > >> paying
> > > > > >>> a batching latency penalty, this is less painful than for other
> > > > > systems.
> > > > > >>>
> > > > > >>> Application to Cassandra: B-.  Distributed transactions are
> > handled
> > > > by
> > > > > >> the
> > > > > >>> sequencing and scheduling layers, which are leaderless, and
> > > Calvin’s
> > > > > >>> requirements for the storage layer are easily met by C*.  But
> > > Calvin
> > > > > also
> > > > > >>> requires a global consensus protocol and LWT is almost
> certainly
> > > not
> > > > > >>> sufficiently performant, so this would require ZK or etcd
> > > (reasonable
> > > > > >> for a
> > > > > >>> library approach but not for replacing LWT in C* itself), or an
> > > > > >>> implementation of Accord.  I don’t believe Calvin would require
> > > > > >> additional
> > > > > >>> table-level metadata in Cassandra.
> > > > > >>>
> > > > > > >>> On Sun, Sep 5, 2021 at 9:33 AM benedict@apache.org <benedict@apache.org>
> > > > > > >>> wrote:
> > > > > >>>
> > > > > > >>>> Wiki:
> > > > > > >>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-15%3A+General+Purpose+Transactions
> > > > > > >>>> Whitepaper:
> > > > > > >>>> https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf
> > > > > > >>>> <https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf?version=1&modificationDate=1630847736966&api=v2>
> > > > > >>>> Prototype: https://github.com/belliottsmith/accord
> > > > > >>>>
> > > > > >>>> Hi everyone, I’d like to propose this CEP for adoption by the
> > > > > >> community.
> > > > > >>>>
> > > > > >>>> Cassandra has benefitted from LWTs for many years, but
> > application
> > > > > >>>> developers that want to ensure consistency for complex
> > operations
> > > > must
> > > > > >>>> either accept the scalability bottleneck of serializing all
> > > related
> > > > > >> state
> > > > > >>>> through a single partition, or layer a complex state machine
> on
> > > top
> > > > of
> > > > > >>> the
> > > > > >>>> database. These are sophisticated and costly activities that
> our
> > > > users
> > > > > >>>> should not be expected to undertake. Since distributed
> databases
> > > are
> > > > > >>>> beginning to offer distributed transactions with fewer
> caveats,
> > it
> > > > is
> > > > > >>> past
> > > > > >>>> time for Cassandra to do so as well.
> > > > > >>>>
> > > > > >>>> This CEP proposes the use of several novel techniques that
> build
> > > > upon
> > > > > >>>> research (that followed EPaxos) to deliver (non-interactive)
> > > general
> > > > > >>>> purpose distributed transactions. The approach is outlined in
> > the
> > > > > >>> wikipage
> > > > > >>>> and in more detail in the linked whitepaper. Importantly, by
> > > > adopting
> > > > > >>> this
> > > > > >>>> approach we will be the _only_ distributed database to offer
> > > global,
> > > > > >>>> scalable, strict serializable transactions in one wide area
> > > > > round-trip.
> > > > > >>>> This would represent a significant improvement in the state of
> > the
> > > > > art,
> > > > > >>>> both in the academic literature and in commercial or open
> source
> > > > > >>> offerings.
> > > > > >>>>
> > > > > >>>> This work has been partially realised in a prototype. This
> > partial
> > > > > >>>> prototype has been verified against Jepsen.io’s Maelstrom
> > library
> > > > and
> > > > > >>>> dedicated in-tree strict serializability verification tools,
> but
> > > > much
> > > > > >>> work
> > > > > >>>> remains for the work to be production capable and integrated
> > into
> > > > > >>> Cassandra.
> > > > > >>>>
> > > > > >>>> I propose including the prototype in the project as a new
> source
> > > > > >>>> repository, to be developed as a standalone library for
> > > integration
> > > > > >> into
> > > > > >>>> Cassandra. I hope the community sees the important value
> > > proposition
> > > > > of
> > > > > >>>> this proposal, and will adopt the CEP after this discussion,
> so
> > > that
> > > > > >> the
> > > > > >>>> library and its integration into Cassandra can be developed in
> > > > > parallel
> > > > > >>> and
> > > > > >>>> with the involvement of the wider community.
> > > > > >>>>
> > > > > >>>
> > > > > >>>
> > > > > >>> --
> > > > > >>> Jonathan Ellis
> > > > > >>> co-founder, http://www.datastax.com
> > > > > >>> @spyced
> > > > > >>>
> > > > > >>
> > > > > >>
> > > > > >> --
> > > > > >> Jonathan Ellis
> > > > > >> co-founder, http://www.datastax.com
> > > > > >> @spyced
> > > > > >>
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Jonathan Ellis
> > > > > > co-founder, http://www.datastax.com
> > > > > > @spyced
> > > > >
> > > > >
> > > > >
> ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: dev-unsubscribe@cassandra.apache.org
> > > > > For additional commands, e-mail: dev-help@cassandra.apache.org
> > > > >
> > > >
> > > >
> > > > --
> > > > Jonathan Ellis
> > > > co-founder, http://www.datastax.com
> > > > @spyced
> > > >
> > >
> >
>

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by "benedict@apache.org" <be...@apache.org>.

I’m not, though it might seem that way. I disagree with your views about how CEPs should be structured. Since the CEP process was itself codified via the CEP process, if you want to recodify how CEPs work, the correct way is via the CEP process itself.

The discussion is being drawn in multiple directions away from the CEP itself, and I am trying to keep this particular thread focused on the business at hand, not meta discussions around CEP structure that will no doubt be unproductive given likely irreconcilable views about the topic, nor discussions about other CEPs that could have been.

If you want to start a separate exploratory discussion thread about CEP structure without filing a CEP, feel free to do so.


From: Paulo Motta <pa...@gmail.com>
Date: Friday, 1 October 2021 at 15:04
To: Cassandra DEV <de...@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> If you want to impose your views on CEP structure on others, please file
> a CEP with the additional restrictions and guidance you want to impose and
> start a discussion thread. I can then respond in detail to why I perceive
> this approach to be flawed, in a dedicated context.

This sounds very Kafkaesque. You know I won't file a meta-CEP to change the
structure of CEPs, so you're just using this as an excuse to shut down the
discussion on the lack of clarity about what actual palpable feature will be
available once the CEP lands. :-)

I'm just providing my humble feedback on how a CEP could be more digestible
and easier to consume from an external point of view, and this seems like
an appropriate and contextualized place to voice this opinion which is
perhaps shared by others.

On Fri, 1 Oct 2021 at 10:55, benedict@apache.org <benedict@apache.org> wrote:

> I disagree with you. However, this is the wrong forum to have a meta
> discussion about how CEP should be structured.
>
> If you want to impose your views on CEP structure on others, please file a
> CEP with the additional restrictions and guidance you want to impose and
> start a discussion thread. I can then respond in detail to why I perceive
> this approach to be flawed, in a dedicated context.
>
>
> From: Paulo Motta <pa...@gmail.com>
> Date: Friday, 1 October 2021 at 14:48
> To: Cassandra DEV <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > The proposal as it stands today is exceptionally thorough, more so than
> > any other CEP to date, or any CEP is likely to be in the near future.
>
> The protocol is thoroughly described, but in my view CEP is a forum to
> discuss the high level architecture and plan for adding a full end-to-end
> enhancement to the database, breaking it into sub-CEPs if needed, as long
> as the full plan is known in advance, otherwise the community will not have
> the context to judge the full extent and impact of the proposed
> enhancement.
>
> > Since it remains unclear to me what either yourself or Jonathan want to
> > see as an alternative
>
> I would personally like to see something along these lines:
>
> CEP1: Add ACID-compliant atomic batches
> - UX changes needed: none, CQL provides the grammar we need.
> - Distributed transaction protocol needed: Accord (link to white paper if
> you want specific details about the protocol)
> - High-level architecture: what new components will be added, how existing
> components will be modified, what new messages will be added, what new
> configuration knobs will be introduced, what are the milestones of the
> project, etc.
>
> CEP2: Make LWT faster and more reliable
> - UX changes needed: none
> - Distributed transaction protocol needed: Accord, already added by
> previous CEP.
> - High-level architecture: blablabla... and so on.
>
> On Fri, 1 Oct 2021 at 10:19, benedict@apache.org <benedict@apache.org> wrote:
>
> > I think this is getting circular and unproductive. Basic disagreements
> > about whether the CEP specifies a feature I am inclined to leave for a
> > vote. In my view the CEP specifies several features, both immediate ones
> > for the user (ACID batches and multi-key LWTs) and developer-focused ones
> > around ground-breaking semantics that will be enabled.
> >
> > The proposal as it stands today is exceptionally thorough, more so than
> > any other CEP to date, or any CEP is likely to be in the near future.
> >
> > This is a Cassandra Enhancement *Proposal*, and at some point we have to
> > engage with what is proposed, not what you might like to be proposed.
> > Since it remains unclear to me what either yourself or Jonathan want to
> > see as an alternative, at this point it would seem more productive to
> > produce your own proposals for the community to consider. It is possible
> > for multiple transaction systems to co-exist, if you feel this is necessary.
> >
> >
> >
> > From: Paulo Motta <pa...@gmail.com>
> > Date: Friday, 1 October 2021 at 13:58
> > To: Cassandra DEV <de...@cassandra.apache.org>
> > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > I share similar feelings as jbellis that this proposal seems to be
> > focusing on the protocol itself but lacking the actual feature that will
> > use the protocol, which IMO is a key element to discuss in a CEP.
> >
> > It's similar to saying: hey, I want to add this Trie Serialization
> > Protocol to Cassandra, but not providing specific details of how this
> > protocol is going to be used.
> >
> > I think the right route for a CEP is to describe the feature that will be
> > added to the database and the protocol is a mere requirement of the
> > high-level feature, for example:
> >
> > CEP: Add Trie-backed memtable
> > - Trie Serialization Protocol: implementation detail of the above CEP
> >
> > What is the difficulty of taking this approach, picking one of the myriad
> > of features that will be enabled by Accord and using that as the initial
> > CEP to introduce the protocol to the database?
> >
> > On Fri, 1 Oct 2021 at 08:37, benedict@apache.org <benedict@apache.org> wrote:
> >
> > > Actually, thinking about it again, the simple optimistic protocol would
> > > in fact guarantee system forward progress (i.e. independent of
> > > transaction formulation).
> > >
> > >
> > > From: benedict@apache.org <be...@apache.org>
> > > Date: Friday, 1 October 2021 at 09:14
> > > To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > Hi Jonathan,
> > >
> > > It would be great if we could achieve a bandwidth higher than 1-2 short
> > > emails per week. It remains unclear to me what your goal is, and it
> > > would help if you could make a statement like “I want Cassandra to be
> > > able to do X” so that we can respond directly to it. I am also available
> > > to have another call, in which we can have a back and forth; please feel
> > > free to propose a London-compatible time within the next week that is
> > > suitable for you.
> > >
> > > In my opinion we are at risk of veering off-topic, though. This CEP is
> > > not to deliver interactive transactions, and to my knowledge nobody is
> > > proposing a CEP for interactive transactions. So, for the CEP at hand
> > > the salient question seems: does this CEP prevent us from implementing
> > > interactive transactions with properties X, Y, Z in future? To which
> > > the answer is almost certainly no.
> > >
> > > However, to continue the discussion and respond directly to your
> > > queries, I believe we agree on the definition of an interactive
> > > transaction.
> > >
> > > Two protocols were loosely outlined. The first, using timestamps for
> > > optimistic concurrency control, would indeed involve the possibility of
> > > aborts. It would not however inherently adopt the issue of LWTs where
> > > no transaction is able to make progress. Whether or not progress is
> > > guaranteed (in a livelock-free sense) would depend on the structure of
> > > the transactions that were interfering.
> > >
> > > This approach has the advantage of being very simple to implement, so
> > > that we could realistically support interactive transactions quite
> > > quickly. It has the additional advantage that transactions would execute
> > > very quickly by avoiding the WAN during construction, and as a result
> > > may in practice experience fewer aborts than protocols that guarantee
> > > livelock-freedom.
> > >
> > > The second protocol proposed using read/write intents and would be able
> > > to support almost any behaviour you want. We could even utilise
> > > pessimistic concurrency control, or anything in-between. This is its own
> > > huge design space, and discussion of this approach and the trade-offs
> > > that could be made is (in my opinion) entirely out of scope for this CEP.
> > >
> > >
> > > From: Jonathan Ellis <jb...@gmail.com>
> > > Date: Friday, 1 October 2021 at 05:00
> > > To: dev <de...@cassandra.apache.org>
> > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > The obstacle for me is you've provided a protocol but not a fully
> > > fleshed out architecture, so it's hard to fill in some of the blanks.
> > > But it looks to me like optimistic concurrency control for interactive
> > > transactions applied to Accord would leave you in a LWT-like situation
> > > under fairly light contention where nobody actually makes progress due
> > > to retries.
> > >
> > > To make sure we're talking about the same thing, as Henrik pointed out,
> > > interactive transactions mean multiple round trips from the client
> > > within a transaction.  For example, here
> > > <https://github.com/apavlo/py-tpcc/blob/master/pytpcc/drivers/sqlitedriver.py#L213>
> > > is a simple implementation of the TPC-C New Order transaction.  The
> > > high level logic (via
> > > <https://courses.cs.washington.edu/courses/csep545/01wi/lectures/class1/tsld039.htm>)
> > > is,
> > >
> > >    1. Get records describing a warehouse, customer, & district
> > >    2. Update the district
> > >    3. Increment next available order number
> > >    4. Insert record into Order and New-Order tables
> > >    5. For 5-15 items, get Item record, get/update Stock record
> > >    6. Insert Order-Line Record
> > >
> > > As you can see, this requires a lot of client-side logic mixed in with
> > > the actual SQL commands.
> > >
> > >
> > > On Thu, Sep 30, 2021 at 2:30 AM benedict@apache.org <benedict@apache.org>
> > > wrote:
> > >
> > > > Essentially this, although I think in practice we will need to track
> > > > each partition’s timestamp separately (or optionally for reduced
> > > > conflicts, each row or datum’s), and make them all part of the
> > > > conditional application of the transaction - at least for
> > > > strict-serializability.
> > > >
> > > > The alternative is to insert read/write intents for the transaction
> > > > during each step, and to confirm they are still valid on commit, but
> > > > this approach would require a WAN round-trip for each step in the
> > > > interactive transaction, whereas the timestamp-validating approach can
> > > > use a LAN round-trip for each step besides the final one, and is also
> > > > much simpler to implement.
> > > >
> > > >
> > > > From: Blake Eggleston <be...@apple.com.INVALID>
> > > > Date: Thursday, 30 September 2021 at 05:47
> > > > To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> > > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > You could establish a lower timestamp bound and buffer transaction
> > > > state on the coordinator, then make the commit an operation that only
> > > > applies if all partitions involved haven’t been changed by a more
> > > > recent timestamp. You could also implement MVCC either in the storage
> > > > layer or for some period of time by buffering commits on each replica
> > > > before applying.
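A minimal sketch of the timestamp-validated commit described above, offered only as an illustration and not as Accord's or Cassandra's implementation. `Store` and `OptimisticTxn` are hypothetical names. The store is a single in-memory dict with a logical clock, so validation and apply are trivially atomic here; a real system would rely on the consensus protocol to provide that atomicity.

```python
from dataclasses import dataclass, field


@dataclass
class Store:
    """Hypothetical in-memory store: partition key -> (value, write timestamp)."""
    data: dict = field(default_factory=dict)
    clock: int = 0  # logical clock standing in for real timestamps

    def read(self, key):
        return self.data.get(key, (None, 0))

    def write(self, key, value):
        self.clock += 1
        self.data[key] = (value, self.clock)


class OptimisticTxn:
    """Buffers transaction state on the coordinator until commit."""

    def __init__(self, store):
        self.store = store
        self.lower_bound = store.clock  # timestamp bound taken at start
        self.touched = set()            # partitions read or written
        self.writes = {}                # buffered writes, not yet visible

    def read(self, key):
        self.touched.add(key)
        if key in self.writes:
            return self.writes[key]
        return self.store.read(key)[0]

    def write(self, key, value):
        self.touched.add(key)
        self.writes[key] = value

    def commit(self):
        # Abort if any involved partition changed after the lower bound.
        for key in self.touched:
            if self.store.read(key)[1] > self.lower_bound:
                return False
        for key, value in self.writes.items():
            self.store.write(key, value)
        return True


store = Store()
store.write("a", 10)
store.write("b", 20)
t1, t2 = OptimisticTxn(store), OptimisticTxn(store)
t1.write("a", t1.read("a") - 5)
t1.write("b", t1.read("b") + 5)
t2.write("a", t2.read("a") + 1)
print(t1.commit())  # True
print(t2.commit())  # False: "a" was changed after t2's lower bound
```

The aborted transaction would have to be retried by the client, which is the retry cost under contention that the surrounding messages debate.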
> > > >
> > > > > On Sep 29, 2021, at 6:18 PM, Jonathan Ellis <jb...@gmail.com> wrote:
> > > > >
> > > > > How are interactive transactions possible with Accord?
> > > > >
> > > > >
> > > > >
> > > > > On Tue, Sep 21, 2021 at 11:56 PM benedict@apache.org <benedict@apache.org>
> > > > > wrote:
> > > > >
> > > > > >> Could you explain why you believe this trade-off is necessary? We can
> > > > > >> support full SQL just fine with Accord, and I hope that we eventually
> > > > > >> do so.
> > > > > >>
> > > > > >> This domain is incredibly complex, so it is easy to reach wrong
> > > > > >> conclusions. I would invite you again to propose a system for
> > > > > >> discussion that you think offers something Accord is unable to, and
> > > > > >> that you consider desirable, and we can work from there.
> > > > > >>
> > > > > >> To pre-empt some possible discussions, I am not aware of anything we
> > > > > >> cannot do with Accord that we could do with either Calvin or Spanner.
> > > > > >> Interactive transactions are possible on top of Accord, as are
> > > > > >> transactions with an unknown read/write set. In each case the only
> > > > > >> cost is that they would use optimistic concurrency control, which is
> > > > > >> no worse than Spanner derivatives anyway (which I have to assume is
> > > > > >> your benchmark in this regard). I do not expect to deliver either
> > > > > >> functionality initially, but Accord takes us most of the way there
> > > > > >> for both.
> > > > >>
> > > > >>
> > > > >> From: Jonathan Ellis <jb...@gmail.com>
> > > > >> Date: Wednesday, 22 September 2021 at 05:36
> > > > >> To: dev <de...@cassandra.apache.org>
> > > > >> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > >> Right, I'm looking for exactly a discussion on the high level goals.
> > > > > >> Instead of saying "here's the goals and we ruled out X because Y" we
> > > > > >> should start with a discussion around, "Approach A allows X and W,
> > > > > >> approach B allows Y and Z" and decide together what the goals should
> > > > > >> be and what we are willing to trade to get those goals, e.g., are we
> > > > > >> willing to give up global strict serializability to get the ability
> > > > > >> to support full SQL. Both of these are nice to have!
> > > > >>
> > > > > >> On Tue, Sep 21, 2021 at 9:52 PM benedict@apache.org <benedict@apache.org>
> > > > > >> wrote:
> > > > >>
> > > > >>> Hi Jonathan,
> > > > >>>
> > > > > >>> These other systems are incompatible with the goals of the CEP. I do
> > > > > >>> discuss them (besides 2PC) in both the whitepaper and the CEP, and
> > > > > >>> will summarise that discussion below. A true and accurate comparison
> > > > > >>> of these other systems is essentially intractable, as there are
> > > > > >>> complex subtleties to each flavour, and those who are interested
> > > > > >>> would be better served by performing their own research.
> > > > > >>>
> > > > > >>> I think it is more productive to focus on what we want to achieve as
> > > > > >>> a community. If you believe the goals of this CEP are wrong for the
> > > > > >>> project, let’s focus on that. If you want to compare and contrast
> > > > > >>> specific facets of alternative systems that you consider to be
> > > > > >>> preferable in some dimension, let’s do that here or in a Q&A as
> > > > > >>> proposed by Joey.
> > > > >>>
> > > > >>> The relevant goals are that we:
> > > > >>>
> > > > >>>
> > > > > >>>  1.  Guarantee strict serializable isolation on commodity hardware
> > > > > >>>  2.  Scale to any cluster size
> > > > > >>>  3.  Achieve optimal latency
> > > > > >>>
> > > > > >>> The approach taken by Spanner derivatives is rejected by (1) because
> > > > > >>> they guarantee only Serializable isolation (they additionally fail
> > > > > >>> (3)). From watching talks by YugaByte, and inferring from Cockroach’s
> > > > > >>> panic-cluster-death under clock skew, this is clearly considered by
> > > > > >>> everyone to be undesirable but necessary to achieve scalability.
> > > > > >>>
> > > > > >>> The approach taken by FaunaDB (Calvin) is rejected by (2) because its
> > > > > >>> sequencing layer requires a global leader process for the cluster,
> > > > > >>> which is incompatible with Cassandra’s scalability requirements. It
> > > > > >>> additionally fails (3) for global clients.
> > > > > >>>
> > > > > >>> Two phase commit fails (3). As an aside, AFAICT DynamoDB is today a
> > > > > >>> Spanner clone for its multi-key transaction functionality, not 2PC.
> > > > > >>>
> > > > > >>> Systems such as RAMP with even weaker isolation are not considered
> > > > > >>> for the simple reason that they do not even claim to meet (1).
> > > > > >>>
> > > > > >>> If we want to additionally offer weaker isolation levels than
> > > > > >>> Serializable, such as that provided by the recent RAMP-TAO paper,
> > > > > >>> Cassandra is likely able to support multiple distinct transaction
> > > > > >>> layers that operate independently. I would encourage you to file a
> > > > > >>> CEP to explore how we can meet these distinct use cases, but I
> > > > > >>> consider them to be niche. I expect that a majority of our user base
> > > > > >>> desire strict serializable isolation, and certainly no less than
> > > > > >>> serializable isolation, to augment the existing weaker isolation
> > > > > >>> offered by quorum reads and writes.
> > > > >>>
> > > > > >>> I would tangentially note that we are not an AP database under normal
> > > > > >>> recommended operation. A minority in any network partition cannot
> > > > > >>> reach QUORUM, so under recommended usage we are a high-availability
> > > > > >>> leaderless CP database.
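To make the quorum argument concrete, a small sketch follows. It uses the standard definition of QUORUM as floor(RF / 2) + 1 replicas; the helper names are illustrative and are not Cassandra APIs.

```python
def quorum(replication_factor: int) -> int:
    """Replicas required for QUORUM: a strict majority of the replica set."""
    return replication_factor // 2 + 1


def can_reach_quorum(reachable: int, replication_factor: int) -> bool:
    """True if a group that can see `reachable` replicas can satisfy QUORUM."""
    return reachable >= quorum(replication_factor)


# A network partition splits RF=3 replicas into groups of 2 and 1.
print(can_reach_quorum(2, 3))  # True
print(can_reach_quorum(1, 3))  # False
```

Because a strict majority is required, the two sides of any split can never both reach QUORUM, which is the sense in which the minority side gives up availability.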
> > > > >>>
> > > > >>>
> > > > >>> From: Jonathan Ellis <jb...@gmail.com>
> > > > >>> Date: Tuesday, 21 September 2021 at 23:45
> > > > >>> To: dev <de...@cassandra.apache.org>
> > > > >>> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > >>> Benedict, thanks for taking the lead in putting this together.  Since
> > > > > >>> Cassandra is the only relevant database today designed around a
> > > > > >>> leaderless architecture, it's quite likely that we'll be better
> > > > > >>> served with a custom transaction design instead of trying to retrofit
> > > > > >>> one from CP systems.
> > > > > >>>
> > > > > >>> The whitepaper here is a good description of the consensus algorithm
> > > > > >>> itself as well as its robustness and stability characteristics, and
> > > > > >>> its comparison with other state-of-the-art consensus algorithms is
> > > > > >>> very useful.  In the context of Cassandra, where a consensus
> > > > > >>> algorithm is only part of what will be implemented, I'd like to see a
> > > > > >>> more complete evaluation of the transactional side of things as well,
> > > > > >>> including performance characteristics as well as the types of
> > > > > >>> transactions that can be supported and at least a general idea of
> > > > > >>> what it would look like applied to Cassandra.  This will allow the
> > > > > >>> PMC to make a more informed decision about what tradeoffs are best
> > > > > >>> for the entire long-term project of first supplementing and
> > > > > >>> ultimately replacing LWT.
> > > > > >>>
> > > > > >>> (Allowing users to mix LWT and AP Cassandra operations against the
> > > > > >>> same rows was probably a mistake, so in contrast with LWT we’re not
> > > > > >>> looking for something fast enough for occasional use but rather
> > > > > >>> something within a reasonable factor of AP operations, appropriate to
> > > > > >>> being the only way to interact with tables declared as such.)
> > > > >>>
> > > > >>> Besides Accord, this should cover
> > > > >>>
> > > > >>> - Calvin and FaunaDB
> > > > > >>> - A Spanner derivative (no opinion on whether that should be
> > > > > >>> Cockroach or Yugabyte, I don’t think it’s necessary to cover both)
> > > > > >>> - A 2PC implementation (the Accord paper mentions DynamoDB but I
> > > > > >>> suspect there is more public information about MongoDB)
> > > > >>> - RAMP
> > > > >>>
> > > > >>> Here’s an example of what I mean:
> > > > >>>
> > > > >>> =Calvin=
> > > > >>>
> > > > > >>> Approach: global consensus (Paxos in Calvin, Raft in FaunaDB) to
> > > > > >>> order transactions, then replicas execute the transactions
> > > > > >>> independently with no further coordination.  No SPOF.  Transactions
> > > > > >>> are batched by each sequencer to keep this from becoming a
> > > > > >>> bottleneck.
> > > > >>>
> > > > > >>> Performance: Calvin paper (published 2012) reports linear scaling of
> > > > > >>> TPC-C New Order up to 500,000 transactions/s on 100 machines (EC2 XL
> > > > > >>> machines with 7GB RAM and 8 virtual cores).  Note that TPC-C New
> > > > > >>> Order is composed of four reads and four writes, so this is
> > > > > >>> effectively 2M reads and 2M writes as we normally measure them in C*.
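The conversion in the paragraph above is simple arithmetic. A sketch using only the figures quoted in the message (500,000 transactions/s, 100 machines, four reads and four writes per transaction); the variable names are illustrative:

```python
txns_per_second = 500_000
machines = 100
reads_per_txn = 4
writes_per_txn = 4

# Convert transaction throughput into individual operations per second.
reads_per_second = txns_per_second * reads_per_txn
writes_per_second = txns_per_second * writes_per_txn
txns_per_machine = txns_per_second // machines

print(reads_per_second)   # 2000000
print(writes_per_second)  # 2000000
print(txns_per_machine)   # 5000
```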
> > > > >>>
> > > > > >>> Calvin supports mixed read/write transactions, but because the
> > > > > >>> transaction execution logic requires knowing all partition keys in
> > > > > >>> advance to ensure that all replicas can reproduce the same results
> > > > > >>> with no coordination, reads against non-PK predicates must be done
> > > > > >>> ahead of time (transparently, by the server) to determine the set of
> > > > > >>> keys, and this must be retried if the set of rows affected is updated
> > > > > >>> before the actual transaction executes.
> > > > > >>>
> > > > > >>> Batching and global consensus add latency -- 100ms in the Calvin
> > > > > >>> paper and apparently about 50ms in FaunaDB.  Glass half full: all
> > > > > >>> transactions (including multi-partition updates) are equally
> > > > > >>> performant in Calvin since the coordination is handled up front in
> > > > > >>> the sequencing step.  Glass half empty: even single-row reads and
> > > > > >>> writes have to pay the full coordination cost.  Fauna has optimized
> > > > > >>> this away for reads but I am not aware of a description of how they
> > > > > >>> changed the design to allow this.
> > > > > >>>
> > > > > >>> Functionality and limitations: since the entire transaction must be
> > > > > >>> known in advance to allow coordination-less execution at the
> > > > > >>> replicas, Calvin cannot support interactive transactions at all.
> > > > > >>> FaunaDB mitigates this by allowing server-side logic to be included,
> > > > > >>> but a Calvin approach will never be able to offer SQL compatibility.
> > > > > >>>
> > > > > >>> Guarantees: Calvin transactions are strictly serializable.  There is
> > > > > >>> no additional complexity or performance hit to generalizing to
> > > > > >>> multiple regions, apart from the speed of light.  And since Calvin is
> > > > > >>> already paying a batching latency penalty, this is less painful than
> > > > > >>> for other systems.
> > > > > >>>
> > > > > >>> Application to Cassandra: B-.  Distributed transactions are handled
> > > > > >>> by the sequencing and scheduling layers, which are leaderless, and
> > > > > >>> Calvin’s requirements for the storage layer are easily met by C*.
> > > > > >>> But Calvin also requires a global consensus protocol and LWT is
> > > > > >>> almost certainly not sufficiently performant, so this would require
> > > > > >>> ZK or etcd (reasonable for a library approach but not for replacing
> > > > > >>> LWT in C* itself), or an implementation of Accord.  I don’t believe
> > > > > >>> Calvin would require additional table-level metadata in Cassandra.
> > > > >>>
> > > > > >>> On Sun, Sep 5, 2021 at 9:33 AM benedict@apache.org <benedict@apache.org>
> > > > > >>> wrote:
> > > > >>>
> > > > > >>>> Wiki:
> > > > > >>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-15%3A+General+Purpose+Transactions
> > > > > >>>> Whitepaper:
> > > > > >>>> https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf
> > > > > >>>> <https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf?version=1&modificationDate=1630847736966&api=v2>
> > > > >>>> Prototype: https://github.com/belliottsmith/accord
> > > > >>>>
> > > > > >>>> Hi everyone, I’d like to propose this CEP for adoption by the
> > > > > >>>> community.
> > > > > >>>>
> > > > > >>>> Cassandra has benefitted from LWTs for many years, but application
> > > > > >>>> developers that want to ensure consistency for complex operations
> > > > > >>>> must either accept the scalability bottleneck of serializing all
> > > > > >>>> related state through a single partition, or layer a complex state
> > > > > >>>> machine on top of the database. These are sophisticated and costly
> > > > > >>>> activities that our users should not be expected to undertake. Since
> > > > > >>>> distributed databases are beginning to offer distributed
> > > > > >>>> transactions with fewer caveats, it is past time for Cassandra to do
> > > > > >>>> so as well.
> > > > > >>>>
> > > > > >>>> This CEP proposes the use of several novel techniques that build
> > > > > >>>> upon research (that followed EPaxos) to deliver (non-interactive)
> > > > > >>>> general purpose distributed transactions. The approach is outlined
> > > > > >>>> in the wikipage and in more detail in the linked whitepaper.
> > > > > >>>> Importantly, by adopting this approach we will be the _only_
> > > > > >>>> distributed database to offer global, scalable, strict serializable
> > > > > >>>> transactions in one wide area round-trip. This would represent a
> > > > > >>>> significant improvement in the state of the art, both in the
> > > > > >>>> academic literature and in commercial or open source offerings.
> > > > > >>>>
> > > > > >>>> This work has been partially realised in a prototype. This partial
> > > > > >>>> prototype has been verified against Jepsen.io’s Maelstrom library
> > > > > >>>> and dedicated in-tree strict serializability verification tools, but
> > > > > >>>> much work remains for the work to be production capable and
> > > > > >>>> integrated into Cassandra.
> > > > >>>>
> > > > >>>> I propose including the prototype in the project as a new source
> > > > >>>> repository, to be developed as a standalone library for
> > integration
> > > > >> into
> > > > >>>> Cassandra. I hope the community sees the important value
> > proposition
> > > > of
> > > > >>>> this proposal, and will adopt the CEP after this discussion, so
> > that
> > > > >> the
> > > > >>>> library and its integration into Cassandra can be developed in
> > > > parallel
> > > > >>> and
> > > > >>>> with the involvement of the wider community.
> > > > >>>>
> > > > >>>
> > > > >>>
> > > > >>> --
> > > > >>> Jonathan Ellis
> > > > >>> co-founder, http://www.datastax.com
> > > > >>> @spyced
> > > > >>>
> > > > >>
> > > > >>
> > > > >> --
> > > > >> Jonathan Ellis
> > > > >> co-founder, http://www.datastax.com
> > > > >> @spyced
> > > > >>
> > > > >
> > > > >
> > > > > --
> > > > > Jonathan Ellis
> > > > > co-founder, http://www.datastax.com
> > > > > @spyced
> > > >
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: dev-unsubscribe@cassandra.apache.org
> > > > For additional commands, e-mail: dev-help@cassandra.apache.org
> > > >
> > >
> > >
> > > --
> > > Jonathan Ellis
> > > co-founder, http://www.datastax.com
> > > @spyced
> > >
> >
>

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by Paulo Motta <pa...@gmail.com>.

> If you want to impose your views on CEP structure on others, please file
a CEP with the additional restrictions and guidance you want to impose and
start a discussion thread. I can then respond in detail to why I perceive
this approach to be flawed, in a dedicated context.

This sounds very Kafkaesque. You know I won't file a meta-CEP to change the
structure of CEPs, so you're just using this as an excuse to shut down the
discussion on the lack of clarity about what actual, palpable feature will be
available once the CEP lands. :-)

I'm just providing my humble feedback on how a CEP could be more digestible
and easier to consume from an external point of view, and this seems like
an appropriate and contextualized place to voice this opinion which is
perhaps shared by others.

On Fri, Oct 1, 2021 at 10:55, benedict@apache.org <
benedict@apache.org> wrote:

> I disagree with you. However, this is the wrong forum to have a meta
> discussion about how CEPs should be structured.
>
> If you want to impose your views on CEP structure on others, please file a
> CEP with the additional restrictions and guidance you want to impose and
> start a discussion thread. I can then respond in detail to why I perceive
> this approach to be flawed, in a dedicated context.
>
>
> From: Paulo Motta <pa...@gmail.com>
> Date: Friday, 1 October 2021 at 14:48
> To: Cassandra DEV <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> >  The proposal as it stands today is exceptionally thorough, more so than
> any other CEP to date, or any CEP is likely to be in the near future.
>
> The protocol is thoroughly described, but in my view a CEP is a forum to
> discuss the high-level architecture and plan for adding a full end-to-end
> enhancement to the database, breaking it into sub-CEPs if needed, as long
> as the full plan is known in advance. Otherwise the community will not have
> the context to judge the full extent and impact of the proposed
> enhancement.
>
> > Since it remains unclear to me what either yourself or Jonathan want to
> see as an alternative
>
> I would personally like to see something along these lines:
>
> CEP1: Add ACID-compliant atomic batches
> - UX changes needed: none, CQL provides the grammar we need.
> - Distributed transaction protocol needed: Accord (link to white paper if
> you want specific details about the protocol)
> - High-level architecture: what new components will be added, how existing
> components will be modified, what new messages will be added, what new
> configuration knobs will be introduced, what are the milestones of the
> project, etc.
>
> CEP2: Make LWT faster and more reliable
> - UX changes needed: none
> - Distributed transaction protocol needed: Accord, already added by
> previous CEP.
> - High-level architecture: blablabla... and so on.
>
> On Fri, Oct 1, 2021 at 10:19, benedict@apache.org <
> benedict@apache.org> wrote:
>
> > I think this is getting circular and unproductive. Basic disagreements
> > about whether the CEP specifies a feature I am inclined to leave for a
> > vote. In my view the CEP specifies several features, both immediate ones
> > for the user (ACID batches and multi-key LWTs) and developer-focused ones
> > around ground-breaking semantics that will be enabled.
> >
> > The proposal as it stands today is exceptionally thorough, more so than
> > any other CEP to date, or any CEP is likely to be in the near future.
> >
> > This is a Cassandra Enhancement *Proposal*, and at some point we have to
> > engage with what is proposed, not what you might like to be proposed.
> Since
> > it remains unclear to me what either yourself or Jonathan want to see as
> an
> > alternative, at this point it would seem more productive to produce your
> > own proposals for the community to consider. It is possible for multiple
> > transaction systems to co-exist, if you feel this is necessary.
> >
> >
> >
> > From: Paulo Motta <pa...@gmail.com>
> > Date: Friday, 1 October 2021 at 13:58
> > To: Cassandra DEV <de...@cassandra.apache.org>
> > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > I share similar feelings as jbellis that this proposal seems to be
> focusing
> > on the protocol itself but lacking the actual feature that will use the
> > protocol, which IMO is a key element to discuss in a CEP.
> >
> > It's similar to saying: hey I want to add this Tries Serialization
> Protocol
> > to Cassandra, but not providing specific details of how this protocol is
> > going to be used.
> >
> > I think the right route for a CEP is to describe the feature that will be
> > added to the database and the protocol is a mere requirement of the
> > high-level feature, for example:
> >
> > CEP: Add Trie-backed memtable
> > - Trie Serialization Protocol: implementation detail of the above CEP
> >
> > What is the difficulty of taking this approach, picking one of the myriad
> > of features that will be enabled by Accord and using that as the initial
> > CEP to introduce the protocol to the database?
> >
> > On Fri, Oct 1, 2021 at 08:37, benedict@apache.org <
> > benedict@apache.org> wrote:
> >
> > > Actually, thinking about it again, the simple optimistic protocol would
> > in
> > > fact guarantee system forward progress (i.e. independent of transaction
> > > formulation).
> > >
> > >
> > > From: benedict@apache.org <be...@apache.org>
> > > Date: Friday, 1 October 2021 at 09:14
> > > To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > Hi Jonathan,
> > >
> > > It would be great if we could achieve a bandwidth higher than 1-2 short
> > > emails per week. It remains unclear to me what your goal is, and it
> would
> > > help if you could make a statement like “I want Cassandra to be able to
> > do
> > > X” so that we can respond directly to it. I am also available to have
> > > another call, in which we can have a back and forth, please feel free
> to
> > > propose a London-compatible time within the next week that is suitable
> > for
> > > you.
> > >
> > > In my opinion we are at risk of veering off-topic, though. This CEP is
> > not
> > > to deliver interactive transactions, and to my knowledge nobody is
> > > proposing a CEP for interactive transactions. So, for the CEP at hand
> the
> > > salient question seems: does this CEP prevent us from implementing
> > > interactive transactions with properties X, Y, Z in future? To which
> the
> > > answer is almost certainly no.
> > >
> > > However, to continue the discussion and respond directly to your
> queries,
> > > I believe we agree on the definition of an interactive transaction.
> > >
> > > Two protocols were loosely outlined. The first, using timestamps for
> > > optimistic concurrency control, would indeed involve the possibility of
> > > aborts. It would not however inherently adopt the issue of LWTs where
> no
> > > transaction is able to make progress. Whether or not progress is
> > guaranteed
> > > (in a livelock-free sense) would depend on the structure of the
> > > transactions that were interfering.
> > >
> > > This approach has the advantage of being very simple to implement, so
> > that
> > > we could realistically support interactive transactions quite quickly.
> It
> > > has the additional advantage that transactions would execute very
> quickly
> > > by avoiding the WAN during construction, and as a result may in
> practice
> > > experience fewer aborts than protocols that guarantee livelock-freedom.
> > >
> > > The second protocol proposed using read/write intents and would be able
> > to
> > > support almost any behaviour you want. We could even utilise
> pessimistic
> > > concurrency control, or anything in-between. This is its own huge
> design
> > > space, and discussion of this approach and the trade-offs that could be
> > > made is (in my opinion) entirely out of scope for this CEP.
> > >
> > >
> > > From: Jonathan Ellis <jb...@gmail.com>
> > > Date: Friday, 1 October 2021 at 05:00
> > > To: dev <de...@cassandra.apache.org>
> > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > The obstacle for me is you've provided a protocol but not a fully
> fleshed
> > > out architecture, so it's hard to fill in some of the blanks.  But it
> > looks
> > > to me like optimistic concurrency control for interactive transactions
> > > applied to Accord would leave you in a LWT-like situation under fairly
> > > light contention where nobody actually makes progress due to retries.
> > >
> > > To make sure we're talking about the same thing, as Henrik pointed out,
> > > interactive transactions mean multiple round trips from the client
> > within a
> > > transaction.  For example, here
> > > <
> > >
> >
> https://github.com/apavlo/py-tpcc/blob/master/pytpcc/drivers/sqlitedriver.py#L213
> > > >
> > > is a simple implementation of the TPC-C New Order transaction.  The
> high
> > > level logic (via
> > > <
> > >
> >
> https://courses.cs.washington.edu/courses/csep545/01wi/lectures/class1/tsld039.htm
> > > >)
> > > is,
> > >
> > >    1. Get records describing a warehouse, customer, & district
> > >    2. Update the district
> > >    3. Increment next available order number
> > >    4. Insert record into Order and New-Order tables
> > >    5. For 5-15 items, get Item record, get/update Stock record
> > >    6. Insert Order-Line Record
> > >
> > > As you can see, this requires a lot of client-side logic mixed in with
> > the
> > > actual SQL commands.
> > >
> > >
> > > On Thu, Sep 30, 2021 at 2:30 AM benedict@apache.org <
> benedict@apache.org
> > >
> > > wrote:
> > >
> > > > Essentially this, although I think in practice we will need to track
> > each
> > > > partition’s timestamp separately (or optionally for reduced
> conflicts,
> > > each
> > > > row or datum’s), and make them all part of the conditional
> application
> > of
> > > > the transaction - at least for strict-serializability.
> > > >
> > > > The alternative is to insert read/write intents for the transaction
> > > during
> > > > each step, and to confirm they are still valid on commit, but this
> > > approach
> > > > would require a WAN round-trip for each step in the interactive
> > > > transaction, whereas the timestamp-validating approach can use a LAN
> > > > round-trip for each step besides the final one, and is also much
> > simpler
> > > to
> > > > implement.
> > > >
> > > >
> > > > From: Blake Eggleston <be...@apple.com.INVALID>
> > > > Date: Thursday, 30 September 2021 at 05:47
> > > > To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> > > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > You could establish a lower timestamp bound and buffer transaction
> > state
> > > > on the coordinator, then make the commit an operation that only
> applies
> > > if
> > > > all partitions involved haven’t been changed by a more recent
> > timestamp.
> > > > You could also implement mvcc either in the storage layer or for some
> > > > period of time by buffering commits on each replica before applying.
> > > >
> > > > > On Sep 29, 2021, at 6:18 PM, Jonathan Ellis <jb...@gmail.com>
> > wrote:
> > > > >
> > > > > How are interactive transactions possible with Accord?
> > > > >
> > > > >
> > > > >
> > > > > On Tue, Sep 21, 2021 at 11:56 PM benedict@apache.org <
> > > > benedict@apache.org>
> > > > > wrote:
> > > > >
> > > > >> Could you explain why you believe this trade-off is necessary? We
> > can
> > > > >> support full SQL just fine with Accord, and I hope that we
> > eventually
> > > > do so.
> > > > >>
> > > > >> This domain is incredibly complex, so it is easy to reach wrong
> > > > >> conclusions. I would invite you again to propose a system for
> > > discussion
> > > > >> that you think offers something Accord is unable to, and that you
> > > > consider
> > > > >> desirable, and we can work from there.
> > > > >>
> > > > >> To pre-empt some possible discussions, I am not aware of anything
> we
> > > > >> cannot do with Accord that we could do with either Calvin or
> > Spanner.
> > > > >> Interactive transactions are possible on top of Accord, as are
> > > > transactions
> > > > >> with an unknown read/write set. In each case the only cost is that
> > > they
> > > > >> would use optimistic concurrency control, which is no worse than the
> > > Spanner
> > > > >> derivatives anyway (which I have to assume is your benchmark in
> this
> > > > >> regard). I do not expect to deliver either functionality
> initially,
> > > but
> > > > >> Accord takes us most of the way there for both.
> > > > >>
> > > > >>
> > > > >> From: Jonathan Ellis <jb...@gmail.com>
> > > > >> Date: Wednesday, 22 September 2021 at 05:36
> > > > >> To: dev <de...@cassandra.apache.org>
> > > > >> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > >> Right, I'm looking for exactly a discussion on the high level
> goals.
> > > > >> Instead of saying "here's the goals and we ruled out X because Y"
> we
> > > > should
> > > > >> start with a discussion around, "Approach A allows X and W,
> > approach B
> > > > >> allows Y and Z" and decide together what the goals should be and
> > > > what
> > > > >> we are willing to trade to get those goals, e.g., are we willing
> to
> > > > give up
> > > > >> global strict serializability to get the ability to support full
> > SQL.
> > > > Both
> > > > >> of these are nice to have!
> > > > >>
> > > > >> On Tue, Sep 21, 2021 at 9:52 PM benedict@apache.org <
> > > > benedict@apache.org>
> > > > >> wrote:
> > > > >>
> > > > >>> Hi Jonathan,
> > > > >>>
> > > > >>> These other systems are incompatible with the goals of the CEP. I
> > do
> > > > >>> discuss them (besides 2PC) in both the whitepaper and the CEP,
> and
> > > will
> > > > >>> summarise that discussion below. A true and accurate comparison
> of
> > > > these
> > > > >>> other systems is essentially intractable, as there are complex
> > > > subtleties
> > > > >>> to each flavour, and those who are interested would be better
> > served
> > > by
> > > > >>> performing their own research.
> > > > >>>
> > > > >>> I think it is more productive to focus on what we want to achieve
> > as
> > > a
> > > > >>> community. If you believe the goals of this CEP are wrong for the
> > > > >> project,
> > > > >>> let’s focus on that. If you want to compare and contrast specific
> > > > facets
> > > > >> of
> > > > >>> alternative systems that you consider to be preferable in some
> > > > dimension,
> > > > >>> let’s do that here or in a Q&A as proposed by Joey.
> > > > >>>
> > > > >>> The relevant goals are that we:
> > > > >>>
> > > > >>>
> > > > >>>  1.  Guarantee strict serializable isolation on commodity
> hardware
> > > > >>>  2.  Scale to any cluster size
> > > > >>>  3.  Achieve optimal latency
> > > > >>>
> > > > >>> The approach taken by Spanner derivatives is rejected by (1)
> > because
> > > > they
> > > > >>> guarantee only Serializable isolation (they additionally fail
> (3)).
> > > > From
> > > > >>> watching talks by YugaByte, and inferring from Cockroach’s
> > > > >>> panic-cluster-death under clock skew, this is clearly considered
> by
> > > > >>> everyone to be undesirable but necessary to achieve scalability.
> > > > >>>
> > > > >>> The approach taken by FaunaDB (Calvin) is rejected by (2) because
> > its
> > > > >>> sequencing layer requires a global leader process for the
> cluster,
> > > > which
> > > > >> is
> > > > >>> incompatible with Cassandra’s scalability requirements. It
> > > additionally
> > > > >>> fails (3) for global clients.
> > > > >>>
> > > > >>> Two phase commit fails (3). As an aside, AFAICT DynamoDB is
> today a
> > > > >>> Spanner clone for its multi-key transaction functionality, not
> 2PC.
> > > > >>>
> > > > >>> Systems such as RAMP with even weaker isolation are not
> considered
> > > for
> > > > >> the
> > > > >>> simple reason that they do not even claim to meet (1).
> > > > >>>
> > > > >>> If we want to additionally offer weaker isolation levels than
> > > > >>> Serializable, such as that provided by the recent RAMP-TAO paper,
> > > > >> Cassandra
> > > > >>> is likely able to support multiple distinct transaction layers
> that
> > > > >> operate
> > > > >>> independently. I would encourage you to file a CEP to explore how
> > we
> > > > can
> > > > >>> meet these distinct use cases, but I consider them to be niche. I
> > > > expect
> > > > >>> that a majority of our user base desire strict serializable
> > > isolation,
> > > > >> and
> > > > >>> certainly no less than serializable isolation, to augment the
> > > existing
> > > > >>> weaker isolation offered by quorum reads and writes.
> > > > >>>
> > > > >>> I would tangentially note that we are not an AP database under
> > normal
> > > > >>> recommended operation. A minority in any network partition cannot
> > > reach
> > > > >>> QUORUM, so under recommended usage we are a high-availability
> > > > leaderless
> > > > >> CP
> > > > >>> database.
> > > > >>>
> > > > >>>
> > > > >>> From: Jonathan Ellis <jb...@gmail.com>
> > > > >>> Date: Tuesday, 21 September 2021 at 23:45
> > > > >>> To: dev <de...@cassandra.apache.org>
> > > > >>> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > >>> Benedict, thanks for taking the lead in putting this together.
> > Since
> > > > >>> Cassandra is the only relevant database today designed around a
> > > > >> leaderless
> > > > >>> architecture, it's quite likely that we'll be better served with
> a
> > > > custom
> > > > >>> transaction design instead of trying to retrofit one from CP
> > systems.
> > > > >>>
> > > > >>> The whitepaper here is a good description of the consensus
> > algorithm
> > > > >> itself
> > > > >>> as well as its robustness and stability characteristics, and its
> > > > >> comparison
> > > > >>> with other state-of-the-art consensus algorithms is very useful.
> > In
> > > > the
> > > > >>> context of Cassandra, where a consensus algorithm is only part of
> > > what
> > > > >> will
> > > > >>> be implemented, I'd like to see a more complete evaluation of the
> > > > >>> transactional side of things as well, including performance
> > > > >> characteristics
> > > > >>> as well as the types of transactions that can be supported and at
> > > > least a
> > > > >>> general idea of what it would look like applied to Cassandra.
> This
> > > will
> > > > >>> allow the PMC to make a more informed decision about what
> tradeoffs
> > > are
> > > > >>> best for the entire long-term project of first supplementing and
> > > > >> ultimately
> > > > >>> replacing LWT.
> > > > >>>
> > > > >>> (Allowing users to mix LWT and AP Cassandra operations against
> the
> > > same
> > > > >>> rows was probably a mistake, so in contrast with LWT we’re not
> > > looking
> > > > >> for
> > > > >>> something fast enough for occasional use but rather something
> > within
> > > a
> > > > >>> reasonable factor of AP operations, appropriate to being the only
> > way
> > > > to
> > > > >>> interact with tables declared as such.)
> > > > >>>
> > > > >>> Besides Accord, this should cover
> > > > >>>
> > > > >>> - Calvin and FaunaDB
> > > > >>> - A Spanner derivative (no opinion on whether that should be
> > > Cockroach
> > > > or
> > > > >>> Yugabyte, I don’t think it’s necessary to cover both)
> > > > >>> - A 2PC implementation (the Accord paper mentions DynamoDB but I
> > > > suspect
> > > > >>> there is more public information about MongoDB)
> > > > >>> - RAMP
> > > > >>>
> > > > >>> Here’s an example of what I mean:
> > > > >>>
> > > > >>> =Calvin=
> > > > >>>
> > > > >>> Approach: global consensus (Paxos in Calvin, Raft in FaunaDB) to
> > > order
> > > > >>> transactions, then replicas execute the transactions
> independently
> > > with
> > > > >> no
> > > > >>> further coordination.  No SPOF.  Transactions are batched by each
> > > > >> sequencer
> > > > >>> to keep this from becoming a bottleneck.
> > > > >>>
> > > > >>> Performance: Calvin paper (published 2012) reports linear scaling
> > of
> > > > >> TPC-C
> > > > >>> New Order up to 500,000 transactions/s on 100 machines (EC2 XL
> > > machines
> > > > >>> with 7GB ram and 8 virtual cores).  Note that TPC-C New Order is
> > > > composed
> > > > >>> of four reads and four writes, so this is effectively 2M reads
> and
> > 2M
> > > > >>> writes as we normally measure them in C*.
> > > > >>>
> > > > >>> Calvin supports mixed read/write transactions, but because the
> > > > >> transaction
> > > > >>> execution logic requires knowing all partition keys in advance to
> > > > ensure
> > > > >>> that all replicas can reproduce the same results with no
> > > coordination,
> > > > >>> reads against non-PK predicates must be done ahead of time
> > > > >> (transparently,
> > > > >>> by the server) to determine the set of keys, and this must be
> > retried
> > > > if
> > > > >>> the set of rows affected is updated before the actual transaction
> > > > >> executes.
> > > > >>>
> > > > >>> Batching and global consensus add latency -- 100ms in the Calvin
> > > paper
> > > > >> and
> > > > >>> apparently about 50ms in FaunaDB.  Glass half full: all
> > transactions
> > > > >>> (including multi-partition updates) are equally performant in
> > Calvin
> > > > >> since
> > > > >>> the coordination is handled up front in the sequencing step.
> Glass
> > > > half
> > > > >>> empty: even single-row reads and writes have to pay the full
> > > > coordination
> > > > >>> cost.  Fauna has optimized this away for reads but I am not aware
> > of
> > > a
> > > > >>> description of how they changed the design to allow this.
> > > > >>>
> > > > >>> Functionality and limitations: since the entire transaction must
> be
> > > > known
> > > > >>> in advance to allow coordination-less execution at the replicas,
> > > Calvin
> > > > >>> cannot support interactive transactions at all.  FaunaDB
> mitigates
> > > this
> > > > >> by
> > > > >>> allowing server-side logic to be included, but a Calvin approach
> > will
> > > > >> never
> > > > >>> be able to offer SQL compatibility.
> > > > >>>
> > > > >>> Guarantees: Calvin transactions are strictly serializable.  There
> > is
> > > no
> > > > >>> additional complexity or performance hit to generalizing to
> > multiple
> > > > >>> regions, apart from the speed of light.  And since Calvin is
> > already
> > > > >> paying
> > > > >>> a batching latency penalty, this is less painful than for other
> > > > systems.
> > > > >>>
> > > > >>> Application to Cassandra: B-.  Distributed transactions are
> handled
> > > by
> > > > >> the
> > > > >>> sequencing and scheduling layers, which are leaderless, and
> > Calvin’s
> > > > >>> requirements for the storage layer are easily met by C*.  But
> > Calvin
> > > > also
> > > > >>> requires a global consensus protocol and LWT is almost certainly
> > not
> > > > >>> sufficiently performant, so this would require ZK or etcd
> > (reasonable
> > > > >> for a
> > > > >>> library approach but not for replacing LWT in C* itself), or an
> > > > >>> implementation of Accord.  I don’t believe Calvin would require
> > > > >> additional
> > > > >>> table-level metadata in Cassandra.
> > > > >>>
> > > > >>> On Sun, Sep 5, 2021 at 9:33 AM benedict@apache.org <
> > > > benedict@apache.org>
> > > > >>> wrote:
> > > > >>>
> > > > >>>> Wiki:
> > > > >>>>
> > > > >>>
> > > > >>
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-15%3A+General+Purpose+Transactions
> > > > >>>> Whitepaper:
> > > > >>>>
> > > > >>>
> > > > >>
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf
> > > > >>>> <
> > > > >>>>
> > > > >>>
> > > > >>
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf?version=1&modificationDate=1630847736966&api=v2
> > > > >>>>>
> > > > >>>> Prototype: https://github.com/belliottsmith/accord
> > > > >>>>
> > > > >>>> Hi everyone, I’d like to propose this CEP for adoption by the
> > > > >> community.
> > > > >>>>
> > > > >>>> Cassandra has benefitted from LWTs for many years, but
> application
> > > > >>>> developers that want to ensure consistency for complex
> operations
> > > must
> > > > >>>> either accept the scalability bottleneck of serializing all
> > related
> > > > >> state
> > > > >>>> through a single partition, or layer a complex state machine on
> > top
> > > of
> > > > >>> the
> > > > >>>> database. These are sophisticated and costly activities that our
> > > users
> > > > >>>> should not be expected to undertake. Since distributed databases
> > are
> > > > >>>> beginning to offer distributed transactions with fewer caveats,
> it
> > > is
> > > > >>> past
> > > > >>>> time for Cassandra to do so as well.
> > > > >>>>
> > > > >>>> This CEP proposes the use of several novel techniques that build
> > > upon
> > > > >>>> research (that followed EPaxos) to deliver (non-interactive)
> > general
> > > > >>>> purpose distributed transactions. The approach is outlined in
> the
> > > > >>> wikipage
> > > > >>>> and in more detail in the linked whitepaper. Importantly, by
> > > adopting
> > > > >>> this
> > > > >>>> approach we will be the _only_ distributed database to offer
> > global,
> > > > >>>> scalable, strict serializable transactions in one wide area
> > > > round-trip.
> > > > >>>> This would represent a significant improvement in the state of
> the
> > > > art,
> > > > >>>> both in the academic literature and in commercial or open source
> > > > >>> offerings.
> > > > >>>>
> > > > >>>> This work has been partially realised in a prototype. This
> partial
> > > > >>>> prototype has been verified against Jepsen.io’s Maelstrom
> library
> > > and
> > > > >>>> dedicated in-tree strict serializability verification tools, but
> > > much
> > > > >>> work
> > > > >>>> remains for the work to be production capable and integrated
> into
> > > > >>> Cassandra.
> > > > >>>>
> > > > >>>> I propose including the prototype in the project as a new source
> > > > >>>> repository, to be developed as a standalone library for
> > integration
> > > > >> into
> > > > >>>> Cassandra. I hope the community sees the important value
> > proposition
> > > > of
> > > > >>>> this proposal, and will adopt the CEP after this discussion, so
> > that
> > > > >> the
> > > > >>>> library and its integration into Cassandra can be developed in
> > > > parallel
> > > > >>> and
> > > > >>>> with the involvement of the wider community.
> > > > >>>>
> > > > >>>
> > > > >>>
> > > > >>> --
> > > > >>> Jonathan Ellis
> > > > >>> co-founder, http://www.datastax.com
> > > > >>> @spyced
> > > > >>>
> > > > >>
> > > > >>
> > > > >> --
> > > > >> Jonathan Ellis
> > > > >> co-founder, http://www.datastax.com
> > > > >> @spyced
> > > > >>
> > > > >
> > > > >
> > > > > --
> > > > > Jonathan Ellis
> > > > > co-founder, http://www.datastax.com
> > > > > @spyced
> > > >
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: dev-unsubscribe@cassandra.apache.org
> > > > For additional commands, e-mail: dev-help@cassandra.apache.org
> > > >
> > >
> > >
> > > --
> > > Jonathan Ellis
> > > co-founder, http://www.datastax.com
> > > @spyced
> > >
> >
>
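
The timestamp-validating approach to interactive transactions discussed in the quoted thread above (buffer transaction state on the coordinator, then commit only if none of the partitions involved has been changed by a more recent timestamp) can be illustrated with a small sketch. This is not Accord and not code from the prototype: Store, InteractiveTxn, and the single-process integer clock are hypothetical names and simplifications chosen for illustration. In a real system the validate-and-apply step would presumably be submitted as one non-interactive transaction.

```python
from dataclasses import dataclass, field


@dataclass
class Store:
    """Toy in-memory store (hypothetical): key -> (value, last-write timestamp)."""
    data: dict = field(default_factory=dict)
    clock: int = 0

    def read(self, key):
        # Keys that were never written report timestamp 0.
        return self.data.get(key, (None, 0))

    def next_timestamp(self):
        self.clock += 1
        return self.clock


class InteractiveTxn:
    """Buffers reads and writes on the coordinator until commit."""

    def __init__(self, store):
        self.store = store
        self.read_timestamps = {}  # key -> timestamp observed at first read
        self.writes = {}           # key -> buffered new value

    def get(self, key):
        if key in self.writes:     # read our own buffered write
            return self.writes[key]
        value, ts = self.store.read(key)
        self.read_timestamps.setdefault(key, ts)
        return value

    def put(self, key, value):
        self.writes[key] = value

    def commit(self):
        """Apply buffered writes only if nothing we read has changed since."""
        for key, seen_ts in self.read_timestamps.items():
            _, current_ts = self.store.read(key)
            if current_ts != seen_ts:
                return False       # abort; the client may retry
        commit_ts = self.store.next_timestamp()
        for key, value in self.writes.items():
            self.store.data[key] = (value, commit_ts)
        return True


# Two clients race to increment the same counter.
store = Store()
t1, t2 = InteractiveTxn(store), InteractiveTxn(store)
t1.put("order_id", (t1.get("order_id") or 0) + 1)
t2.put("order_id", (t2.get("order_id") or 0) + 1)
first = t1.commit()    # validates against timestamp 0 and applies
second = t2.commit()   # sees a newer timestamp on "order_id" and aborts
```

Only the final commit step needs to be agreed across replicas in this sketch, which matches the point made in the thread that the intermediate steps can avoid the WAN. The cost is that a conflicting transaction aborts and has to be retried by the client.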

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by "benedict@apache.org" <be...@apache.org>.

I disagree with you. However, this is the wrong forum to have a meta discussion about how CEPs should be structured.

If you want to impose your views on CEP structure on others, please file a CEP with the additional restrictions and guidance you want to impose and start a discussion thread. I can then respond in detail to why I perceive this approach to be flawed, in a dedicated context.


From: Paulo Motta <pa...@gmail.com>
Date: Friday, 1 October 2021 at 14:48
To: Cassandra DEV <de...@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
>  The proposal as it stands today is exceptionally thorough, more so than
any other CEP to date, or any CEP is likely to be in the near future.

The protocol is thoroughly described, but in my view a CEP is a forum to
discuss the high-level architecture and plan for adding a full end-to-end
enhancement to the database, breaking it into sub-CEPs if needed, as long
as the full plan is known in advance. Otherwise the community will not have
the context to judge the full extent and impact of the proposed enhancement.

> Since it remains unclear to me what either yourself or Jonathan want to
see as an alternative

I would personally like to see something along these lines:

CEP1: Add ACID-compliant atomic batches
- UX changes needed: none, CQL provides the grammar we need.
- Distributed transaction protocol needed: Accord (link to white paper if
you want specific details about the protocol)
- High-level architecture: what new components will be added, how existing
components will be modified, what new messages will be added, what new
configuration knobs will be introduced, what are the milestones of the
project, etc.

CEP2: Make LWT faster and more reliable
- UX changes needed: none
- Distributed transaction protocol needed: Accord, already added by
previous CEP.
- High-level architecture: blablabla... and so on.

On Fri, 1 Oct 2021 at 10:19, benedict@apache.org <
benedict@apache.org> wrote:

> I think this is getting circular and unproductive. Basic disagreements
> about whether the CEP specifies a feature I am inclined to leave for a
> vote. In my view the CEP specifies several features, both immediate ones
> for the user (ACID batches and multi-key LWTs) and developer-focused ones
> around ground-breaking semantics that will be enabled.
>
> The proposal as it stands today is exceptionally thorough, more so than
> any other CEP to date, or any CEP is likely to be in the near future.
>
> This is a Cassandra Enhancement *Proposal*, and at some point we have to
> engage with what is proposed, not what you might like to be proposed. Since
> it remains unclear to me what either yourself or Jonathan want to see as an
> alternative, at this point it would seem more productive to produce your
> own proposals for the community to consider. It is possible for multiple
> transaction systems to co-exist, if you feel this is necessary.
>
>
>
> From: Paulo Motta <pa...@gmail.com>
> Date: Friday, 1 October 2021 at 13:58
> To: Cassandra DEV <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> I share similar feelings to jbellis that this proposal seems to be focusing
> on the protocol itself but lacking the actual feature that will use the
> protocol, which IMO is a key element to discuss in a CEP.
>
> It's similar to saying: hey I want to add this Tries Serialization Protocol
> to Cassandra, but not providing specific details of how this protocol is
> going to be used.
>
> I think the right route for a CEP is to describe the feature that will be
> added to the database and the protocol is a mere requirement of the
> high-level feature, for example:
>
> CEP: Add Trie-backed memtable
> - Trie Serialization Protocol: implementation detail of the above CEP
>
> What is the difficulty of taking this approach, picking one of the myriad
> of features that will be enabled by Accord and using that as the initial
> CEP to introduce the protocol to the database?
>
> On Fri, 1 Oct 2021 at 08:37, benedict@apache.org <
> benedict@apache.org> wrote:
>
> > Actually, thinking about it again, the simple optimistic protocol would in
> > fact guarantee system forward progress (i.e. independent of transaction
> > formulation).
> >
> >
> > From: benedict@apache.org <be...@apache.org>
> > Date: Friday, 1 October 2021 at 09:14
> > To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > Hi Jonathan,
> >
> > It would be great if we could achieve a bandwidth higher than 1-2 short
> > emails per week. It remains unclear to me what your goal is, and it would
> > help if you could make a statement like “I want Cassandra to be able to
> do
> > X” so that we can respond directly to it. I am also available to have
> > another call, in which we can have a back and forth, please feel free to
> > propose a London-compatible time within the next week that is suitable
> for
> > you.
> >
> > In my opinion we are at risk of veering off-topic, though. This CEP is
> not
> > to deliver interactive transactions, and to my knowledge nobody is
> > proposing a CEP for interactive transactions. So, for the CEP at hand the
> > salient question seems: does this CEP prevent us from implementing
> > interactive transactions with properties X, Y, Z in future? To which the
> > answer is almost certainly no.
> >
> > However, to continue the discussion and respond directly to your queries,
> > I believe we agree on the definition of an interactive transaction.
> >
> > Two protocols were loosely outlined. The first, using timestamps for
> > optimistic concurrency control, would indeed involve the possibility of
> > aborts. It would not however inherently adopt the issue of LWTs where no
> > transaction is able to make progress. Whether or not progress is
> guaranteed
> > (in a livelock-free sense) would depend on the structure of the
> > transactions that were interfering.
> >
> > This approach has the advantage of being very simple to implement, so
> that
> > we could realistically support interactive transactions quite quickly. It
> > has the additional advantage that transactions would execute very quickly
> > by avoiding the WAN during construction, and as a result may in practice
> > experience fewer aborts than protocols that guarantee livelock-freedom.
> >
> > The second protocol proposed using read/write intents and would be able
> to
> > support almost any behaviour you want. We could even utilise pessimistic
> > concurrency control, or anything in-between. This is its own huge design
> > space, and discussion of this approach and the trade-offs that could be
> > made is (in my opinion) entirely out of scope for this CEP.
> >
> >
> > From: Jonathan Ellis <jb...@gmail.com>
> > Date: Friday, 1 October 2021 at 05:00
> > To: dev <de...@cassandra.apache.org>
> > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > The obstacle for me is you've provided a protocol but not a fully fleshed
> > out architecture, so it's hard to fill in some of the blanks.  But it looks
> > to me like optimistic concurrency control for interactive transactions
> > applied to Accord would leave you in a LWT-like situation under fairly
> > light contention where nobody actually makes progress due to retries.
> >
> > To make sure we're talking about the same thing, as Henrik pointed out,
> > interactive transactions mean multiple round trips from the client within a
> > transaction.  For example, here
> > <https://github.com/apavlo/py-tpcc/blob/master/pytpcc/drivers/sqlitedriver.py#L213>
> > is a simple implementation of the TPC-C New Order transaction.  The high
> > level logic (via
> > <https://courses.cs.washington.edu/courses/csep545/01wi/lectures/class1/tsld039.htm>)
> > is,
> >
> >    1. Get records describing a warehouse, customer, & district
> >    2. Update the district
> >    3. Increment next available order number
> >    4. Insert record into Order and New-Order tables
> >    5. For 5-15 items, get Item record, get/update Stock record
> >    6. Insert Order-Line Record
> >
> > As you can see, this requires a lot of client-side logic mixed in with the
> > actual SQL commands.
> >
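To make the "client-side logic mixed in with SQL" point concrete, here is a
minimal sketch of an interactive transaction in the spirit of the New Order
steps quoted above. It is not the py-tpcc driver: the three-table schema, the
new_order function and the restock rule are simplified assumptions made up for
illustration, and it uses Python's built-in sqlite3 only so that it can run
standalone. Each execute() stands in for one client round trip inside the
transaction.

```python
import sqlite3


def new_order(conn, district_id, item_ids):
    """Toy New Order: every statement is a separate client round trip."""
    cur = conn.cursor()
    try:
        cur.execute("BEGIN")
        # Round trip: read the next available order number.
        (order_id,) = cur.execute(
            "SELECT next_o_id FROM district WHERE id = ?", (district_id,)
        ).fetchone()
        # Round trip: the client computed order_id + 1 and writes it back.
        cur.execute(
            "UPDATE district SET next_o_id = ? WHERE id = ?",
            (order_id + 1, district_id),
        )
        for item_id in item_ids:
            (qty,) = cur.execute(
                "SELECT quantity FROM stock WHERE item_id = ?", (item_id,)
            ).fetchone()
            # Client-side decision (hypothetical, simplified restock rule).
            new_qty = qty - 1 if qty > 1 else qty + 90
            cur.execute(
                "UPDATE stock SET quantity = ? WHERE item_id = ?",
                (new_qty, item_id),
            )
            cur.execute(
                "INSERT INTO order_line(order_id, item_id) VALUES (?, ?)",
                (order_id, item_id),
            )
        cur.execute("COMMIT")
        return order_id
    except Exception:
        # Any failure (e.g. an unknown item) undoes every earlier step.
        cur.execute("ROLLBACK")
        raise


# isolation_level=None: the module issues no implicit BEGIN; we control it.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.executescript("""
CREATE TABLE district(id INTEGER PRIMARY KEY, next_o_id INTEGER);
CREATE TABLE stock(item_id INTEGER PRIMARY KEY, quantity INTEGER);
CREATE TABLE order_line(order_id INTEGER, item_id INTEGER);
INSERT INTO district VALUES (1, 3001);
INSERT INTO stock VALUES (10, 5), (11, 1);
""")
order_id = new_order(conn, 1, [10, 11])
```

The values written in later steps depend on values read in earlier ones, and
the decision in between is made by the client. That dependency is what
distinguishes an interactive transaction from one submitted in a single request.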
> >
> > On Thu, Sep 30, 2021 at 2:30 AM benedict@apache.org <benedict@apache.org
> >
> > wrote:
> >
> > > Essentially this, although I think in practice we will need to track each
> > > partition’s timestamp separately (or optionally for reduced conflicts, each
> > > row or datum’s), and make them all part of the conditional application of
> > > the transaction - at least for strict-serializability.
> > >
> > > The alternative is to insert read/write intents for the transaction during
> > > each step, and to confirm they are still valid on commit, but this approach
> > > would require a WAN round-trip for each step in the interactive
> > > transaction, whereas the timestamp-validating approach can use a LAN
> > > round-trip for each step besides the final one, and is also much simpler to
> > > implement.
> > >
> > >
> > > From: Blake Eggleston <be...@apple.com.INVALID>
> > > Date: Thursday, 30 September 2021 at 05:47
> > > To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > You could establish a lower timestamp bound and buffer transaction state
> > > on the coordinator, then make the commit an operation that only applies if
> > > all partitions involved haven’t been changed by a more recent timestamp.
> > > You could also implement MVCC either in the storage layer or for some
> > > period of time by buffering commits on each replica before applying.
> > >
> > > > On Sep 29, 2021, at 6:18 PM, Jonathan Ellis <jb...@gmail.com>
> wrote:
> > > >
> > > > How are interactive transactions possible with Accord?
> > > >
> > > >
> > > >
> > > > On Tue, Sep 21, 2021 at 11:56 PM benedict@apache.org <
> > > benedict@apache.org>
> > > > wrote:
> > > >
> > > >> Could you explain why you believe this trade-off is necessary? We
> can
> > > >> support full SQL just fine with Accord, and I hope that we
> eventually
> > > do so.
> > > >>
> > > >> This domain is incredibly complex, so it is easy to reach wrong
> > > >> conclusions. I would invite you again to propose a system for
> > discussion
> > > >> that you think offers something Accord is unable to, and that you
> > > consider
> > > >> desirable, and we can work from there.
> > > >>
> > > >> To pre-empt some possible discussions, I am not aware of anything we
> > > >> cannot do with Accord that we could do with either Calvin or
> Spanner.
> > > >> Interactive transactions are possible on top of Accord, as are
> > > transactions
> > > >> with an unknown read/write set. In each case the only cost is that
> > they
> > > > >> would use optimistic concurrency control, which is no worse than Spanner
> > > > >> derivatives anyway (which I have to assume is your benchmark in this
> > > >> regard). I do not expect to deliver either functionality initially,
> > but
> > > >> Accord takes us most of the way there for both.
> > > >>
> > > >>
> > > >> From: Jonathan Ellis <jb...@gmail.com>
> > > >> Date: Wednesday, 22 September 2021 at 05:36
> > > >> To: dev <de...@cassandra.apache.org>
> > > >> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > >> Right, I'm looking for exactly a discussion on the high level goals.
> > > >> Instead of saying "here's the goals and we ruled out X because Y" we
> > > should
> > > >> start with a discussion around, "Approach A allows X and W,
> approach B
> > > > >> allows Y and Z" and decide together what the goals should be and
> > > what
> > > >> we are willing to trade to get those goals, e.g., are we willing to
> > > give up
> > > >> global strict serializability to get the ability to support full
> SQL.
> > > Both
> > > >> of these are nice to have!
> > > >>
> > > >> On Tue, Sep 21, 2021 at 9:52 PM benedict@apache.org <
> > > benedict@apache.org>
> > > >> wrote:
> > > >>
> > > >>> Hi Jonathan,
> > > >>>
> > > >>> These other systems are incompatible with the goals of the CEP. I
> do
> > > >>> discuss them (besides 2PC) in both the whitepaper and the CEP, and
> > will
> > > >>> summarise that discussion below. A true and accurate comparison of
> > > these
> > > >>> other systems is essentially intractable, as there are complex
> > > subtleties
> > > >>> to each flavour, and those who are interested would be better
> served
> > by
> > > >>> performing their own research.
> > > >>>
> > > >>> I think it is more productive to focus on what we want to achieve
> as
> > a
> > > >>> community. If you believe the goals of this CEP are wrong for the
> > > >> project,
> > > >>> let’s focus on that. If you want to compare and contrast specific
> > > facets
> > > >> of
> > > >>> alternative systems that you consider to be preferable in some
> > > dimension,
> > > >>> let’s do that here or in a Q&A as proposed by Joey.
> > > >>>
> > > >>> The relevant goals are that we:
> > > >>>
> > > >>>
> > > >>>  1.  Guarantee strict serializable isolation on commodity hardware
> > > >>>  2.  Scale to any cluster size
> > > >>>  3.  Achieve optimal latency
> > > >>>
> > > > >>> The approach taken by Spanner derivatives is rejected by (1) because they
> > > > >>> guarantee only Serializable isolation (they additionally fail (3)). From
> > > > >>> watching talks by YugaByte, and inferring from Cockroach’s
> > > > >>> panic-cluster-death under clock skew, this is clearly considered by
> > > > >>> everyone to be undesirable but necessary to achieve scalability.
> > > > >>>
> > > > >>> The approach taken by FaunaDB (Calvin) is rejected by (2) because its
> > > > >>> sequencing layer requires a global leader process for the cluster, which is
> > > > >>> incompatible with Cassandra’s scalability requirements. It additionally
> > > > >>> fails (3) for global clients.
> > > > >>>
> > > > >>> Two phase commit fails (3). As an aside, AFAICT DynamoDB is today a
> > > > >>> Spanner clone for its multi-key transaction functionality, not 2PC.
> > > > >>>
> > > > >>> Systems such as RAMP with even weaker isolation are not considered for the
> > > > >>> simple reason that they do not even claim to meet (1).
> > > >>>
> > > >>> If we want to additionally offer weaker isolation levels than
> > > >>> Serializable, such as that provided by the recent RAMP-TAO paper,
> > > >> Cassandra
> > > >>> is likely able to support multiple distinct transaction layers that
> > > >> operate
> > > >>> independently. I would encourage you to file a CEP to explore how
> we
> > > can
> > > >>> meet these distinct use cases, but I consider them to be niche. I
> > > expect
> > > >>> that a majority of our user base desire strict serializable
> > isolation,
> > > >> and
> > > >>> certainly no less than serializable isolation, to augment the
> > existing
> > > >>> weaker isolation offered by quorum reads and writes.
> > > >>>
> > > > >>> I would tangentially note that we are not an AP database under normal
> > > > >>> recommended operation. A minority in any network partition cannot reach
> > > > >>> QUORUM, so under recommended usage we are a high-availability leaderless CP
> > > > >>> database.
> > > >>>
> > > >>>
> > > >>> From: Jonathan Ellis <jb...@gmail.com>
> > > >>> Date: Tuesday, 21 September 2021 at 23:45
> > > >>> To: dev <de...@cassandra.apache.org>
> > > >>> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > >>> Benedict, thanks for taking the lead in putting this together.
> Since
> > > >>> Cassandra is the only relevant database today designed around a
> > > >> leaderless
> > > >>> architecture, it's quite likely that we'll be better served with a
> > > custom
> > > >>> transaction design instead of trying to retrofit one from CP
> systems.
> > > >>>
> > > >>> The whitepaper here is a good description of the consensus
> algorithm
> > > >> itself
> > > >>> as well as its robustness and stability characteristics, and its
> > > >> comparison
> > > >>> with other state-of-the-art consensus algorithms is very useful.
> In
> > > the
> > > >>> context of Cassandra, where a consensus algorithm is only part of
> > what
> > > >> will
> > > >>> be implemented, I'd like to see a more complete evaluation of the
> > > >>> transactional side of things as well, including performance
> > > >> characteristics
> > > >>> as well as the types of transactions that can be supported and at
> > > least a
> > > >>> general idea of what it would look like applied to Cassandra. This
> > will
> > > >>> allow the PMC to make a more informed decision about what tradeoffs
> > are
> > > >>> best for the entire long-term project of first supplementing and
> > > >> ultimately
> > > >>> replacing LWT.
> > > >>>
> > > >>> (Allowing users to mix LWT and AP Cassandra operations against the
> > same
> > > >>> rows was probably a mistake, so in contrast with LWT we’re not
> > looking
> > > >> for
> > > >>> something fast enough for occasional use but rather something
> within
> > a
> > > >>> reasonable factor of AP operations, appropriate to being the only
> way
> > > to
> > > >>> interact with tables declared as such.)
> > > >>>
> > > >>> Besides Accord, this should cover
> > > >>>
> > > >>> - Calvin and FaunaDB
> > > >>> - A Spanner derivative (no opinion on whether that should be
> > Cockroach
> > > or
> > > >>> Yugabyte, I don’t think it’s necessary to cover both)
> > > >>> - A 2PC implementation (the Accord paper mentions DynamoDB but I
> > > suspect
> > > >>> there is more public information about MongoDB)
> > > >>> - RAMP
> > > >>>
> > > >>> Here’s an example of what I mean:
> > > >>>
> > > >>> =Calvin=
> > > >>>
> > > >>> Approach: global consensus (Paxos in Calvin, Raft in FaunaDB) to
> > order
> > > >>> transactions, then replicas execute the transactions independently
> > with
> > > >> no
> > > >>> further coordination.  No SPOF.  Transactions are batched by each
> > > >> sequencer
> > > >>> to keep this from becoming a bottleneck.
> > > >>>
> > > >>> Performance: Calvin paper (published 2012) reports linear scaling
> of
> > > >> TPC-C
> > > >>> New Order up to 500,000 transactions/s on 100 machines (EC2 XL
> > machines
> > > >>> with 7GB ram and 8 virtual cores).  Note that TPC-C New Order is
> > > composed
> > > >>> of four reads and four writes, so this is effectively 2M reads and
> 2M
> > > >>> writes as we normally measure them in C*.
> > > >>>
> > > >>> Calvin supports mixed read/write transactions, but because the
> > > >> transaction
> > > >>> execution logic requires knowing all partition keys in advance to
> > > ensure
> > > >>> that all replicas can reproduce the same results with no
> > coordination,
> > > >>> reads against non-PK predicates must be done ahead of time
> > > >> (transparently,
> > > >>> by the server) to determine the set of keys, and this must be
> retried
> > > if
> > > >>> the set of rows affected is updated before the actual transaction
> > > >> executes.
> > > >>>
> > > >>> Batching and global consensus adds latency -- 100ms in the Calvin
> > paper
> > > >> and
> > > >>> apparently about 50ms in FaunaDB.  Glass half full: all
> transactions
> > > >>> (including multi-partition updates) are equally performant in
> Calvin
> > > >> since
> > > >>> the coordination is handled up front in the sequencing step.  Glass
> > > half
> > > >>> empty: even single-row reads and writes have to pay the full
> > > coordination
> > > >>> cost.  Fauna has optimized this away for reads but I am not aware
> of
> > a
> > > >>> description of how they changed the design to allow this.
> > > >>>
> > > >>> Functionality and limitations: since the entire transaction must be
> > > known
> > > >>> in advance to allow coordination-less execution at the replicas,
> > Calvin
> > > >>> cannot support interactive transactions at all.  FaunaDB mitigates
> > this
> > > >> by
> > > >>> allowing server-side logic to be included, but a Calvin approach
> will
> > > >> never
> > > >>> be able to offer SQL compatibility.
> > > >>>
> > > >>> Guarantees: Calvin transactions are strictly serializable.  There
> is
> > no
> > > >>> additional complexity or performance hit to generalizing to
> multiple
> > > >>> regions, apart from the speed of light.  And since Calvin is
> already
> > > >> paying
> > > >>> a batching latency penalty, this is less painful than for other
> > > systems.
> > > >>>
> > > >>> Application to Cassandra: B-.  Distributed transactions are handled
> > by
> > > >> the
> > > >>> sequencing and scheduling layers, which are leaderless, and
> Calvin’s
> > > >>> requirements for the storage layer are easily met by C*.  But
> Calvin
> > > also
> > > >>> requires a global consensus protocol and LWT is almost certainly
> not
> > > >>> sufficiently performant, so this would require ZK or etcd
> (reasonable
> > > >> for a
> > > >>> library approach but not for replacing LWT in C* itself), or an
> > > >>> implementation of Accord.  I don’t believe Calvin would require
> > > >> additional
> > > >>> table-level metadata in Cassandra.
> > > >>>
> > > >>> On Sun, Sep 5, 2021 at 9:33 AM benedict@apache.org <
> > > benedict@apache.org>
> > > >>> wrote:
> > > >>>
> > > > >>>> Wiki: https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-15%3A+General+Purpose+Transactions
> > > > >>>> Whitepaper: https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf
> > > > >>>> <https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf?version=1&modificationDate=1630847736966&api=v2>
> > > >>>> Prototype: https://github.com/belliottsmith/accord
> > > >>>>
> > > >>>> Hi everyone, I’d like to propose this CEP for adoption by the
> > > >> community.
> > > >>>>
> > > >>>> Cassandra has benefitted from LWTs for many years, but application
> > > >>>> developers that want to ensure consistency for complex operations
> > must
> > > >>>> either accept the scalability bottleneck of serializing all
> related
> > > >> state
> > > >>>> through a single partition, or layer a complex state machine on
> top
> > of
> > > >>> the
> > > >>>> database. These are sophisticated and costly activities that our
> > users
> > > >>>> should not be expected to undertake. Since distributed databases
> are
> > > >>>> beginning to offer distributed transactions with fewer caveats, it
> > is
> > > >>> past
> > > >>>> time for Cassandra to do so as well.
> > > >>>>
> > > >>>> This CEP proposes the use of several novel techniques that build
> > upon
> > > >>>> research (that followed EPaxos) to deliver (non-interactive)
> general
> > > >>>> purpose distributed transactions. The approach is outlined in the
> > > >>> wikipage
> > > >>>> and in more detail in the linked whitepaper. Importantly, by
> > adopting
> > > >>> this
> > > >>>> approach we will be the _only_ distributed database to offer
> global,
> > > >>>> scalable, strict serializable transactions in one wide area
> > > round-trip.
> > > >>>> This would represent a significant improvement in the state of the
> > > art,
> > > >>>> both in the academic literature and in commercial or open source
> > > >>> offerings.
> > > >>>>
> > > >>>> This work has been partially realised in a prototype. This partial
> > > >>>> prototype has been verified against Jepsen.io’s Maelstrom library
> > and
> > > >>>> dedicated in-tree strict serializability verification tools, but
> > much
> > > >>> work
> > > >>>> remains for the work to be production capable and integrated into
> > > >>> Cassandra.
> > > >>>>
> > > >>>> I propose including the prototype in the project as a new source
> > > >>>> repository, to be developed as a standalone library for
> integration
> > > >> into
> > > >>>> Cassandra. I hope the community sees the important value
> proposition
> > > of
> > > >>>> this proposal, and will adopt the CEP after this discussion, so
> that
> > > >> the
> > > >>>> library and its integration into Cassandra can be developed in
> > > parallel
> > > >>> and
> > > >>>> with the involvement of the wider community.
> > > >>>>
> > > >>>
> > > >>>
> > > >>> --
> > > >>> Jonathan Ellis
> > > >>> co-founder, http://www.datastax.com
> > > >>> @spyced
> > > >>>
> > > >>
> > > >>
> > > >> --
> > > >> Jonathan Ellis
> > > >> co-founder, http://www.datastax.com
> > > >> @spyced
> > > >>
> > > >
> > > >
> > > > --
> > > > Jonathan Ellis
> > > > co-founder, http://www.datastax.com
> > > > @spyced
> > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: dev-unsubscribe@cassandra.apache.org
> > > For additional commands, e-mail: dev-help@cassandra.apache.org
> > >
> >
> >
> > --
> > Jonathan Ellis
> > co-founder, http://www.datastax.com
> > @spyced
> >
>

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by Paulo Motta <pa...@gmail.com>.

> The proposal as it stands today is exceptionally thorough, more so than
> any other CEP to date, or any CEP is likely to be in the near future.

The protocol is thoroughly described, but in my view a CEP is a forum to
discuss the high-level architecture and plan for adding a full end-to-end
enhancement to the database, breaking it into sub-CEPs if needed, as long
as the full plan is known in advance. Otherwise the community will not have
the context to judge the full extent and impact of the proposed enhancement.

> Since it remains unclear to me what either yourself or Jonathan want to
> see as an alternative

I would personally like to see something along these lines:

CEP1: Add ACID-compliant atomic batches
- UX changes needed: none, CQL provides the grammar we need.
- Distributed transaction protocol needed: Accord (link to white paper if
you want specific details about the protocol)
- High-level architecture: what new components will be added, how existing
components will be modified, what new messages will be added, what new
configuration knobs will be introduced, what are the milestones of the
project, etc.

CEP2: Make LWT faster and more reliable
- UX changes needed: none
- Distributed transaction protocol needed: Accord, already added by
previous CEP.
- High-level architecture: blablabla... and so on.

On Fri, 1 Oct 2021 at 10:19, benedict@apache.org <
benedict@apache.org> wrote:

> I think this is getting circular and unproductive. Basic disagreements
> about whether the CEP specifies a feature I am inclined to leave for a
> vote. In my view the CEP specifies several features, both immediate ones
> for the user (ACID batches and multi-key LWTs) and developer-focused ones
> around ground-breaking semantics that will be enabled.
>
> The proposal as it stands today is exceptionally thorough, more so than
> any other CEP to date, or any CEP is likely to be in the near future.
>
> This is a Cassandra Enhancement *Proposal*, and at some point we have to
> engage with what is proposed, not what you might like to be proposed. Since
> it remains unclear to me what either yourself or Jonathan want to see as an
> alternative, at this point it would seem more productive to produce your
> own proposals for the community to consider. It is possible for multiple
> transaction systems to co-exist, if you feel this is necessary.
>
>
>
> From: Paulo Motta <pa...@gmail.com>
> Date: Friday, 1 October 2021 at 13:58
> To: Cassandra DEV <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> I share similar feelings to jbellis that this proposal seems to be focusing
> on the protocol itself but lacking the actual feature that will use the
> protocol, which IMO is a key element to discuss in a CEP.
>
> It's similar to saying: hey I want to add this Tries Serialization Protocol
> to Cassandra, but not providing specific details of how this protocol is
> going to be used.
>
> I think the right route for a CEP is to describe the feature that will be
> added to the database and the protocol is a mere requirement of the
> high-level feature, for example:
>
> CEP: Add Trie-backed memtable
> - Trie Serialization Protocol: implementation detail of the above CEP
>
> What is the difficulty of taking this approach, picking one of the myriad
> of features that will be enabled by Accord and using that as the initial
> CEP to introduce the protocol to the database?
>
> On Fri, 1 Oct 2021 at 08:37, benedict@apache.org <
> benedict@apache.org> wrote:
>
> > Actually, thinking about it again, the simple optimistic protocol would in
> > fact guarantee system forward progress (i.e. independent of transaction
> > formulation).
> >
> >
> > From: benedict@apache.org <be...@apache.org>
> > Date: Friday, 1 October 2021 at 09:14
> > To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > Hi Jonathan,
> >
> > It would be great if we could achieve a bandwidth higher than 1-2 short
> > emails per week. It remains unclear to me what your goal is, and it would
> > help if you could make a statement like “I want Cassandra to be able to
> do
> > X” so that we can respond directly to it. I am also available to have
> > another call, in which we can have a back and forth, please feel free to
> > propose a London-compatible time within the next week that is suitable
> for
> > you.
> >
> > In my opinion we are at risk of veering off-topic, though. This CEP is
> not
> > to deliver interactive transactions, and to my knowledge nobody is
> > proposing a CEP for interactive transactions. So, for the CEP at hand the
> > salient question seems: does this CEP prevent us from implementing
> > interactive transactions with properties X, Y, Z in future? To which the
> > answer is almost certainly no.
> >
> > However, to continue the discussion and respond directly to your queries,
> > I believe we agree on the definition of an interactive transaction.
> >
> > Two protocols were loosely outlined. The first, using timestamps for
> > optimistic concurrency control, would indeed involve the possibility of
> > aborts. It would not however inherently adopt the issue of LWTs where no
> > transaction is able to make progress. Whether or not progress is
> guaranteed
> > (in a livelock-free sense) would depend on the structure of the
> > transactions that were interfering.
> >
> > This approach has the advantage of being very simple to implement, so
> that
> > we could realistically support interactive transactions quite quickly. It
> > has the additional advantage that transactions would execute very quickly
> > by avoiding the WAN during construction, and as a result may in practice
> > experience fewer aborts than protocols that guarantee livelock-freedom.
> >
> > The second protocol proposed using read/write intents and would be able
> to
> > support almost any behaviour you want. We could even utilise pessimistic
> > concurrency control, or anything in-between. This is its own huge design
> > space, and discussion of this approach and the trade-offs that could be
> > made is (in my opinion) entirely out of scope for this CEP.
> >
> >
> > From: Jonathan Ellis <jb...@gmail.com>
> > Date: Friday, 1 October 2021 at 05:00
> > To: dev <de...@cassandra.apache.org>
> > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > The obstacle for me is you've provided a protocol but not a fully fleshed
> > out architecture, so it's hard to fill in some of the blanks.  But it looks
> > to me like optimistic concurrency control for interactive transactions
> > applied to Accord would leave you in a LWT-like situation under fairly
> > light contention where nobody actually makes progress due to retries.
> >
> > To make sure we're talking about the same thing, as Henrik pointed out,
> > interactive transactions mean multiple round trips from the client within a
> > transaction.  For example, here
> > <https://github.com/apavlo/py-tpcc/blob/master/pytpcc/drivers/sqlitedriver.py#L213>
> > is a simple implementation of the TPC-C New Order transaction.  The high
> > level logic (via
> > <https://courses.cs.washington.edu/courses/csep545/01wi/lectures/class1/tsld039.htm>)
> > is,
> >
> >    1. Get records describing a warehouse, customer, & district
> >    2. Update the district
> >    3. Increment next available order number
> >    4. Insert record into Order and New-Order tables
> >    5. For 5-15 items, get Item record, get/update Stock record
> >    6. Insert Order-Line Record
> >
> > As you can see, this requires a lot of client-side logic mixed in with the
> > actual SQL commands.
> >
> >
> > On Thu, Sep 30, 2021 at 2:30 AM benedict@apache.org <benedict@apache.org
> >
> > wrote:
> >
> > > Essentially this, although I think in practice we will need to track each
> > > partition’s timestamp separately (or optionally for reduced conflicts, each
> > > row or datum’s), and make them all part of the conditional application of
> > > the transaction - at least for strict-serializability.
> > >
> > > The alternative is to insert read/write intents for the transaction during
> > > each step, and to confirm they are still valid on commit, but this approach
> > > would require a WAN round-trip for each step in the interactive
> > > transaction, whereas the timestamp-validating approach can use a LAN
> > > round-trip for each step besides the final one, and is also much simpler to
> > > implement.
> > >
> > >
> > > From: Blake Eggleston <be...@apple.com.INVALID>
> > > Date: Thursday, 30 September 2021 at 05:47
> > > To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > You could establish a lower timestamp bound and buffer transaction
> state
> > > on the coordinator, then make the commit an operation that only applies
> > if
> > > all partitions involved haven’t been changed by a more recent
> timestamp.
> > > You could also implement mvcc either in the storage layer or for some
> > > period of time by buffering commits on each replica before applying.
> > >
> > > > On Sep 29, 2021, at 6:18 PM, Jonathan Ellis <jb...@gmail.com>
> wrote:
> > > >
> > > > How are interactive transactions possible with Accord?
> > > >
> > > >
> > > >
> > > > On Tue, Sep 21, 2021 at 11:56 PM benedict@apache.org <
> > > benedict@apache.org>
> > > > wrote:
> > > >
> > > >> Could you explain why you believe this trade-off is necessary? We
> can
> > > >> support full SQL just fine with Accord, and I hope that we
> eventually
> > > do so.
> > > >>
> > > >> This domain is incredibly complex, so it is easy to reach wrong
> > > >> conclusions. I would invite you again to propose a system for
> > discussion
> > > >> that you think offers something Accord is unable to, and that you
> > > consider
> > > >> desirable, and we can work from there.
> > > >>
> > > >> To pre-empt some possible discussions, I am not aware of anything we
> > > >> cannot do with Accord that we could do with either Calvin or
> Spanner.
> > > >> Interactive transactions are possible on top of Accord, as are
> > > transactions
> > > >> with an unknown read/write set. In each case the only cost is that they
> > > >> would use optimistic concurrency control, which is no worse than the
> > > >> Spanner derivatives anyway (which I have to assume is your benchmark in
> > > >> this regard). I do not expect to deliver either functionality initially,
> > > >> but Accord takes us most of the way there for both.
> > > >>
> > > >>
> > > >> From: Jonathan Ellis <jb...@gmail.com>
> > > >> Date: Wednesday, 22 September 2021 at 05:36
> > > >> To: dev <de...@cassandra.apache.org>
> > > >> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > >> Right, I'm looking for exactly a discussion on the high level goals.
> > > >> Instead of saying "here's the goals and we ruled out X because Y" we
> > > should
> > > >> start with a discussion around, "Approach A allows X and W, approach B
> > > >> allows Y and Z" and decide together what the goals should be and what
> > > >> we are willing to trade to get those goals, e.g., are we willing to give
> > > >> up global strict serializability to get the ability to support full SQL.
> > > >> Both of these are nice to have!
> > > >>
> > > >> On Tue, Sep 21, 2021 at 9:52 PM benedict@apache.org <
> > > benedict@apache.org>
> > > >> wrote:
> > > >>
> > > >>> Hi Jonathan,
> > > >>>
> > > >>> These other systems are incompatible with the goals of the CEP. I
> do
> > > >>> discuss them (besides 2PC) in both the whitepaper and the CEP, and
> > will
> > > >>> summarise that discussion below. A true and accurate comparison of
> > > these
> > > >>> other systems is essentially intractable, as there are complex
> > > subtleties
> > > >>> to each flavour, and those who are interested would be better
> served
> > by
> > > >>> performing their own research.
> > > >>>
> > > >>> I think it is more productive to focus on what we want to achieve
> as
> > a
> > > >>> community. If you believe the goals of this CEP are wrong for the
> > > >> project,
> > > >>> let’s focus on that. If you want to compare and contrast specific
> > > facets
> > > >> of
> > > >>> alternative systems that you consider to be preferable in some
> > > dimension,
> > > >>> let’s do that here or in a Q&A as proposed by Joey.
> > > >>>
> > > >>> The relevant goals are that we:
> > > >>>
> > > >>>
> > > >>>  1.  Guarantee strict serializable isolation on commodity hardware
> > > >>>  2.  Scale to any cluster size
> > > >>>  3.  Achieve optimal latency
> > > >>>
> > > >>> The approach taken by Spanner derivatives is rejected by (1)
> because
> > > they
> > > >>> guarantee only Serializable isolation (they additionally fail (3)).
> > > From
> > > >>> watching talks by YugaByte, and inferring from Cockroach’s
> > > >>> panic-cluster-death under clock skew, this is clearly considered by
> > > >>> everyone to be undesirable but necessary to achieve scalability.
> > > >>>
> > > >>> The approach taken by FaunaDB (Calvin) is rejected by (2) because
> its
> > > >>> sequencing layer requires a global leader process for the cluster,
> > > which
> > > >> is
> > > >>> incompatible with Cassandra’s scalability requirements. It
> > additionally
> > > >>> fails (3) for global clients.
> > > >>>
> > > >>> Two phase commit fails (3). As an aside, AFAICT DynamoDB is today a
> > > >>> Spanner clone for its multi-key transaction functionality, not 2PC.
> > > >>>
> > > >>> Systems such as RAMP with even weaker isolation are not considered
> > for
> > > >> the
> > > >>> simple reason that they do not even claim to meet (1).
> > > >>>
> > > >>> If we want to additionally offer weaker isolation levels than
> > > >>> Serializable, such as that provided by the recent RAMP-TAO paper,
> > > >> Cassandra
> > > >>> is likely able to support multiple distinct transaction layers that
> > > >> operate
> > > >>> independently. I would encourage you to file a CEP to explore how
> we
> > > can
> > > >>> meet these distinct use cases, but I consider them to be niche. I
> > > expect
> > > >>> that a majority of our user base desire strict serializable
> > isolation,
> > > >> and
> > > >>> certainly no less than serializable isolation, to augment the
> > existing
> > > >>> weaker isolation offered by quorum reads and writes.
> > > >>>
> > > >>> I would tangentially note that we are not an AP database under
> normal
> > > >>> recommended operation. A minority in any network partition cannot
> > reach
> > > >>> QUORUM, so under recommended usage we are a high-availability
> > > leaderless
> > > >> CP
> > > >>> database.
> > > >>>
> > > >>>
> > > >>> From: Jonathan Ellis <jb...@gmail.com>
> > > >>> Date: Tuesday, 21 September 2021 at 23:45
> > > >>> To: dev <de...@cassandra.apache.org>
> > > >>> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > >>> Benedict, thanks for taking the lead in putting this together.
> Since
> > > >>> Cassandra is the only relevant database today designed around a
> > > >> leaderless
> > > >>> architecture, it's quite likely that we'll be better served with a
> > > custom
> > > >>> transaction design instead of trying to retrofit one from CP
> systems.
> > > >>>
> > > >>> The whitepaper here is a good description of the consensus
> algorithm
> > > >> itself
> > > >>> as well as its robustness and stability characteristics, and its
> > > >> comparison
> > > >>> with other state-of-the-art consensus algorithms is very useful.
> In
> > > the
> > > >>> context of Cassandra, where a consensus algorithm is only part of
> > what
> > > >> will
> > > >>> be implemented, I'd like to see a more complete evaluation of the
> > > >>> transactional side of things as well, including performance
> > > >> characteristics
> > > >>> as well as the types of transactions that can be supported and at
> > > least a
> > > >>> general idea of what it would look like applied to Cassandra. This
> > will
> > > >>> allow the PMC to make a more informed decision about what tradeoffs
> > are
> > > >>> best for the entire long-term project of first supplementing and
> > > >> ultimately
> > > >>> replacing LWT.
> > > >>>
> > > >>> (Allowing users to mix LWT and AP Cassandra operations against the
> > same
> > > >>> rows was probably a mistake, so in contrast with LWT we’re not
> > looking
> > > >> for
> > > >>> something fast enough for occasional use but rather something
> within
> > a
> > > >>> reasonable factor of AP operations, appropriate to being the only
> way
> > > to
> > > >>> interact with tables declared as such.)
> > > >>>
> > > >>> Besides Accord, this should cover
> > > >>>
> > > >>> - Calvin and FaunaDB
> > > >>> - A Spanner derivative (no opinion on whether that should be
> > Cockroach
> > > or
> > > >>> Yugabyte, I don’t think it’s necessary to cover both)
> > > >>> - A 2PC implementation (the Accord paper mentions DynamoDB but I
> > > suspect
> > > >>> there is more public information about MongoDB)
> > > >>> - RAMP
> > > >>>
> > > >>> Here’s an example of what I mean:
> > > >>>
> > > >>> =Calvin=
> > > >>>
> > > >>> Approach: global consensus (Paxos in Calvin, Raft in FaunaDB) to
> > order
> > > >>> transactions, then replicas execute the transactions independently
> > with
> > > >> no
> > > >>> further coordination.  No SPOF.  Transactions are batched by each
> > > >> sequencer
> > > >>> to keep this from becoming a bottleneck.
> > > >>>
> > > >>> Performance: Calvin paper (published 2012) reports linear scaling
> of
> > > >> TPC-C
> > > >>> New Order up to 500,000 transactions/s on 100 machines (EC2 XL
> > machines
> > > >>> with 7GB ram and 8 virtual cores).  Note that TPC-C New Order is
> > > composed
> > > >>> of four reads and four writes, so this is effectively 2M reads and
> 2M
> > > >>> writes as we normally measure them in C*.
> > > >>>
> > > >>> Calvin supports mixed read/write transactions, but because the
> > > >> transaction
> > > >>> execution logic requires knowing all partition keys in advance to
> > > ensure
> > > >>> that all replicas can reproduce the same results with no
> > coordination,
> > > >>> reads against non-PK predicates must be done ahead of time
> > > >> (transparently,
> > > >>> by the server) to determine the set of keys, and this must be
> retried
> > > if
> > > >>> the set of rows affected is updated before the actual transaction
> > > >> executes.
> > > >>>
> > > >>> Batching and global consensus adds latency -- 100ms in the Calvin
> > paper
> > > >> and
> > > >>> apparently about 50ms in FaunaDB.  Glass half full: all
> transactions
> > > >>> (including multi-partition updates) are equally performant in
> Calvin
> > > >> since
> > > >>> the coordination is handled up front in the sequencing step.  Glass
> > > half
> > > >>> empty: even single-row reads and writes have to pay the full
> > > coordination
> > > >>> cost.  Fauna has optimized this away for reads but I am not aware
> of
> > a
> > > >>> description of how they changed the design to allow this.
> > > >>>
> > > >>> Functionality and limitations: since the entire transaction must be
> > > known
> > > >>> in advance to allow coordination-less execution at the replicas,
> > Calvin
> > > >>> cannot support interactive transactions at all.  FaunaDB mitigates
> > this
> > > >> by
> > > >>> allowing server-side logic to be included, but a Calvin approach
> will
> > > >> never
> > > >>> be able to offer SQL compatibility.
> > > >>>
> > > >>> Guarantees: Calvin transactions are strictly serializable.  There
> is
> > no
> > > >>> additional complexity or performance hit to generalizing to
> multiple
> > > >>> regions, apart from the speed of light.  And since Calvin is
> already
> > > >> paying
> > > >>> a batching latency penalty, this is less painful than for other
> > > systems.
> > > >>>
> > > >>> Application to Cassandra: B-.  Distributed transactions are handled
> > by
> > > >> the
> > > >>> sequencing and scheduling layers, which are leaderless, and
> Calvin’s
> > > >>> requirements for the storage layer are easily met by C*.  But
> Calvin
> > > also
> > > >>> requires a global consensus protocol and LWT is almost certainly
> not
> > > >>> sufficiently performant, so this would require ZK or etcd
> (reasonable
> > > >> for a
> > > >>> library approach but not for replacing LWT in C* itself), or an
> > > >>> implementation of Accord.  I don’t believe Calvin would require
> > > >> additional
> > > >>> table-level metadata in Cassandra.
> > > >>>
> > > >>> On Sun, Sep 5, 2021 at 9:33 AM benedict@apache.org <
> > > benedict@apache.org>
> > > >>> wrote:
> > > >>>
> > > >>>> Wiki:
> > > >>>>
> > > >>>
> > > >>
> > >
> >
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-15%3A+General+Purpose+Transactions
> > > >>>> Whitepaper:
> > > >>>>
> > > >>>
> > > >>
> > >
> >
> https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf
> > > >>>> <
> > > >>>>
> > > >>>
> > > >>
> > >
> >
> https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf?version=1&modificationDate=1630847736966&api=v2
> > > >>>>>
> > > >>>> Prototype: https://github.com/belliottsmith/accord
> > > >>>>
> > > >>>> Hi everyone, I’d like to propose this CEP for adoption by the
> > > >> community.
> > > >>>>
> > > >>>> Cassandra has benefitted from LWTs for many years, but application
> > > >>>> developers that want to ensure consistency for complex operations
> > must
> > > >>>> either accept the scalability bottleneck of serializing all
> related
> > > >> state
> > > >>>> through a single partition, or layer a complex state machine on
> top
> > of
> > > >>> the
> > > >>>> database. These are sophisticated and costly activities that our
> > users
> > > >>>> should not be expected to undertake. Since distributed databases
> are
> > > >>>> beginning to offer distributed transactions with fewer caveats, it
> > is
> > > >>> past
> > > >>>> time for Cassandra to do so as well.
> > > >>>>
> > > >>>> This CEP proposes the use of several novel techniques that build
> > upon
> > > >>>> research (that followed EPaxos) to deliver (non-interactive)
> general
> > > >>>> purpose distributed transactions. The approach is outlined in the
> > > >>> wikipage
> > > >>>> and in more detail in the linked whitepaper. Importantly, by
> > adopting
> > > >>> this
> > > >>>> approach we will be the _only_ distributed database to offer
> global,
> > > >>>> scalable, strict serializable transactions in one wide area
> > > round-trip.
> > > >>>> This would represent a significant improvement in the state of the
> > > art,
> > > >>>> both in the academic literature and in commercial or open source
> > > >>> offerings.
> > > >>>>
> > > >>>> This work has been partially realised in a prototype. This partial
> > > >>>> prototype has been verified against Jepsen.io’s Maelstrom library
> > and
> > > >>>> dedicated in-tree strict serializability verification tools, but
> > much
> > > >>> work
> > > >>>> remains for the work to be production capable and integrated into
> > > >>> Cassandra.
> > > >>>>
> > > >>>> I propose including the prototype in the project as a new source
> > > >>>> repository, to be developed as a standalone library for
> integration
> > > >> into
> > > >>>> Cassandra. I hope the community sees the important value
> proposition
> > > of
> > > >>>> this proposal, and will adopt the CEP after this discussion, so
> that
> > > >> the
> > > >>>> library and its integration into Cassandra can be developed in
> > > parallel
> > > >>> and
> > > >>>> with the involvement of the wider community.
> > > >>>>
> > > >>>
> > > >>>
> > > >>> --
> > > >>> Jonathan Ellis
> > > >>> co-founder, http://www.datastax.com
> > > >>> @spyced
> > > >>>
> > > >>
> > > >>
> > > >> --
> > > >> Jonathan Ellis
> > > >> co-founder, http://www.datastax.com
> > > >> @spyced
> > > >>
> > > >
> > > >
> > > > --
> > > > Jonathan Ellis
> > > > co-founder, http://www.datastax.com
> > > > @spyced
> > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: dev-unsubscribe@cassandra.apache.org
> > > For additional commands, e-mail: dev-help@cassandra.apache.org
> > >
> >
> >
> > --
> > Jonathan Ellis
> > co-founder, http://www.datastax.com
> > @spyced
> >
>

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by "benedict@apache.org" <be...@apache.org>.

I think this is getting circular and unproductive. I am inclined to leave basic disagreements about whether the CEP specifies a feature to a vote. In my view the CEP specifies several features: immediate ones for the user (ACID batches and multi-key LWTs) and developer-focused ones around the ground-breaking semantics that will be enabled.

The proposal as it stands today is exceptionally thorough, more so than any other CEP to date, and more so than any CEP is likely to be in the near future.

This is a Cassandra Enhancement *Proposal*, and at some point we have to engage with what is proposed, not what you might like to be proposed. Since it remains unclear to me what either you or Jonathan want to see as an alternative, at this point it would seem more productive for you to produce your own proposals for the community to consider. It is possible for multiple transaction systems to co-exist, if you feel this is necessary.



From: Paulo Motta <pa...@gmail.com>
Date: Friday, 1 October 2021 at 13:58
To: Cassandra DEV <de...@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
I share jbellis's feeling that this proposal focuses on the protocol itself
but lacks the actual feature that will use the protocol, which IMO is a key
element to discuss in a CEP.

It's similar to saying: hey, I want to add this Trie Serialization Protocol
to Cassandra, but not providing specific details of how this protocol is
going to be used.

I think the right route for a CEP is to describe the feature that will be
added to the database, with the protocol as a mere requirement of the
high-level feature, for example:

CEP: Add Trie-backed memtable
- Trie Serialization Protocol: implementation detail of the above CEP

What is the difficulty of taking this approach, picking one of the myriad
of features that will be enabled by Accord and using that as the initial
CEP to introduce the protocol to the database?

On Fri, 1 Oct 2021 at 08:37, benedict@apache.org <benedict@apache.org> wrote:

> Actually, thinking about it again, the simple optimistic protocol would in
> fact guarantee system forward progress (i.e. independent of transaction
> formulation).
>
>
> From: benedict@apache.org <be...@apache.org>
> Date: Friday, 1 October 2021 at 09:14
> To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> Hi Jonathan,
>
> It would be great if we could achieve a bandwidth higher than 1-2 short
> emails per week. It remains unclear to me what your goal is, and it would
> help if you could make a statement like “I want Cassandra to be able to do
> X” so that we can respond directly to it. I am also available to have
> another call, in which we can have a back and forth. Please feel free to
> propose a London-compatible time within the next week that is suitable for
> you.
>
> In my opinion we are at risk of veering off-topic, though. This CEP is not
> to deliver interactive transactions, and to my knowledge nobody is
> proposing a CEP for interactive transactions. So, for the CEP at hand the
> salient question seems: does this CEP prevent us from implementing
> interactive transactions with properties X, Y, Z in future? To which the
> answer is almost certainly no.
>
> However, to continue the discussion and respond directly to your queries,
> I believe we agree on the definition of an interactive transaction.
>
> Two protocols were loosely outlined. The first, using timestamps for
> optimistic concurrency control, would indeed involve the possibility of
> aborts. It would not, however, inherently share the problem LWTs have, where
> no transaction is able to make progress. Whether or not progress is
> guaranteed (in a livelock-free sense) would depend on the structure of the
> transactions that were interfering.
>
> This approach has the advantage of being very simple to implement, so that
> we could realistically support interactive transactions quite quickly. It
> has the additional advantage that transactions would execute very quickly
> by avoiding the WAN during construction, and as a result may in practice
> experience fewer aborts than protocols that guarantee livelock-freedom.
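
A minimal sketch of the first (timestamp-validating) protocol described above, written for illustration only and not taken from the CEP or the prototype. It assumes a toy single-process store in place of replicated partitions; the names Store, InteractiveTxn and commit_if_unchanged are hypothetical. Reads record the timestamp they observed, writes are buffered on the coordinator, and the commit applies only if no observed partition has since been changed by a more recent timestamp. Partitions that are written without being read are not validated in this sketch.

```python
class Store:
    """Toy stand-in for a set of partitions (an assumption, not Cassandra's API)."""

    def __init__(self):
        self.values = {}      # partition key -> current value
        self.timestamps = {}  # partition key -> timestamp of last committed write
        self.clock = 0        # logical clock used to stamp commits

    def read(self, key):
        # Return the value together with the timestamp that produced it.
        return self.values.get(key), self.timestamps.get(key, 0)

    def commit_if_unchanged(self, observed, writes):
        # Conditional application: abort if any partition that was read has
        # been changed by a more recent timestamp than the one observed.
        for key, seen_ts in observed.items():
            if self.timestamps.get(key, 0) > seen_ts:
                return False
        self.clock += 1
        for key, value in writes.items():
            self.values[key] = value
            self.timestamps[key] = self.clock
        return True


class InteractiveTxn:
    """Buffers transaction state on the coordinator until commit."""

    def __init__(self, store):
        self.store = store
        self.observed = {}  # partition key -> timestamp seen at first read
        self.writes = {}    # partition key -> buffered new value

    def read(self, key):
        if key in self.writes:  # read-your-own-writes within the transaction
            return self.writes[key]
        value, ts = self.store.read(key)
        self.observed.setdefault(key, ts)
        return value

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        return self.store.commit_if_unchanged(self.observed, self.writes)


store = Store()
store.commit_if_unchanged({}, {"a": 10})

t1, t2 = InteractiveTxn(store), InteractiveTxn(store)
t1.write("a", t1.read("a") - 1)  # both transactions read "a" at timestamp 1
t2.write("a", t2.read("a") - 5)
first = t1.commit()   # applies: "a" is unchanged since t1 read it
second = t2.commit()  # aborts: "a" now carries a more recent timestamp
```

In this toy run one of the two conflicting transactions always commits, which is the system-level forward progress discussed in this thread; the aborted transaction would have to be retried by the client.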
>
> The second protocol proposed using read/write intents and would be able to
> support almost any behaviour you want. We could even utilise pessimistic
> concurrency control, or anything in-between. This is its own huge design
> space, and discussion of this approach and the trade-offs that could be
> made is (in my opinion) entirely out of scope for this CEP.
>
>
> From: Jonathan Ellis <jb...@gmail.com>
> Date: Friday, 1 October 2021 at 05:00
> To: dev <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> The obstacle for me is you've provided a protocol but not a fully fleshed
> out architecture, so it's hard to fill in some of the blanks.  But it looks
> to me like optimistic concurrency control for interactive transactions
> applied to Accord would leave you in a LWT-like situation under fairly
> light contention where nobody actually makes progress due to retries.
>
> To make sure we're talking about the same thing, as Henrik pointed out,
> interactive transactions mean multiple round trips from the client within a
> transaction.  For example, here
> <
> https://github.com/apavlo/py-tpcc/blob/master/pytpcc/drivers/sqlitedriver.py#L213
> >
> is a simple implementation of the TPC-C New Order transaction.  The high
> level logic (via
> <
> https://courses.cs.washington.edu/courses/csep545/01wi/lectures/class1/tsld039.htm
> >)
> is,
>
>    1. Get records describing a warehouse, customer, & district
>    2. Update the district
>    3. Increment next available order number
>    4. Insert record into Order and New-Order tables
>    5. For 5-15 items, get Item record, get/update Stock record
>    6. Insert Order-Line Record
>
> As you can see, this requires a lot of client-side logic mixed in with the
> actual SQL commands.
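
To make the shape of such an interactive transaction concrete, here is a heavily simplified sketch of the New Order flow using Python's standard sqlite3 module. It is an illustration under assumptions, not the linked py-tpcc driver: the schema, the new_order function and the abbreviated restock rule are all hypothetical, and the warehouse and customer reads are omitted. The point is only that each SELECT is a round trip whose result feeds client-side logic before the next statement is issued.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode, so the
# transaction boundaries below are the explicit BEGIN and COMMIT statements.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.executescript("""
    CREATE TABLE district (id INTEGER PRIMARY KEY, next_order_id INTEGER);
    CREATE TABLE orders (id INTEGER, district_id INTEGER);
    CREATE TABLE stock (item_id INTEGER PRIMARY KEY, quantity INTEGER);
    CREATE TABLE order_line (order_id INTEGER, item_id INTEGER, quantity INTEGER);
    INSERT INTO district VALUES (1, 100);
    INSERT INTO stock VALUES (7, 20), (8, 12);
""")


def new_order(conn, district_id, items):
    """items is a list of (item_id, quantity) pairs chosen by the client."""
    conn.execute("BEGIN")
    # Round trip: read the district to learn the next available order number.
    (order_id,) = conn.execute(
        "SELECT next_order_id FROM district WHERE id = ?", (district_id,)
    ).fetchone()
    # Client-side logic decides what to write back.
    conn.execute(
        "UPDATE district SET next_order_id = ? WHERE id = ?",
        (order_id + 1, district_id),
    )
    conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, district_id))
    for item_id, qty in items:
        # Another round trip per item: read stock, compute, then update.
        (stock,) = conn.execute(
            "SELECT quantity FROM stock WHERE item_id = ?", (item_id,)
        ).fetchone()
        remaining = stock - qty
        if remaining < 10:  # simplified restock rule, for illustration only
            remaining += 91
        conn.execute(
            "UPDATE stock SET quantity = ? WHERE item_id = ?", (remaining, item_id)
        )
        conn.execute(
            "INSERT INTO order_line VALUES (?, ?, ?)", (order_id, item_id, qty)
        )
    conn.execute("COMMIT")
    return order_id


placed = new_order(conn, 1, [(7, 5), (8, 5)])
```

Because the values written depend on values read earlier in the same transaction, the full read/write set cannot be submitted to the database in a single request without moving this logic server-side.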
>
>
> On Thu, Sep 30, 2021 at 2:30 AM benedict@apache.org <be...@apache.org>
> wrote:
>
> > Essentially this, although I think in practice we will need to track each
> > partition’s timestamp separately (or optionally for reduced conflicts, each
> > row or datum’s), and make them all part of the conditional application of
> > the transaction - at least for strict-serializability.
> >
> > The alternative is to insert read/write intents for the transaction during
> > each step, and to confirm they are still valid on commit, but this approach
> > would require a WAN round-trip for each step in the interactive
> > transaction, whereas the timestamp-validating approach can use a LAN
> > round-trip for each step besides the final one, and is also much simpler to
> > implement.
> >
> >
> > From: Blake Eggleston <be...@apple.com.INVALID>
> > Date: Thursday, 30 September 2021 at 05:47
> > To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > You could establish a lower timestamp bound and buffer transaction state
> > on the coordinator, then make the commit an operation that only applies if
> > all partitions involved haven’t been changed by a more recent timestamp.
> > You could also implement mvcc either in the storage layer or for some
> > period of time by buffering commits on each replica before applying.
> >
> > > On Sep 29, 2021, at 6:18 PM, Jonathan Ellis <jb...@gmail.com> wrote:
> > >
> > > How are interactive transactions possible with Accord?
> > >
> > >
> > >
> > > On Tue, Sep 21, 2021 at 11:56 PM benedict@apache.org <
> > benedict@apache.org>
> > > wrote:
> > >
> > >> Could you explain why you believe this trade-off is necessary? We can
> > >> support full SQL just fine with Accord, and I hope that we eventually
> > do so.
> > >>
> > >> This domain is incredibly complex, so it is easy to reach wrong
> > >> conclusions. I would invite you again to propose a system for
> discussion
> > >> that you think offers something Accord is unable to, and that you
> > consider
> > >> desirable, and we can work from there.
> > >>
> > >> To pre-empt some possible discussions, I am not aware of anything we
> > >> cannot do with Accord that we could do with either Calvin or Spanner.
> > >> Interactive transactions are possible on top of Accord, as are
> > >> transactions with an unknown read/write set. In each case the only cost
> > >> is that they would use optimistic concurrency control, which is no worse
> > >> than the Spanner derivatives anyway (which I have to assume is your
> > >> benchmark in this regard). I do not expect to deliver either
> > >> functionality initially, but Accord takes us most of the way there for
> > >> both.
> > >>
> > >>
> > >> From: Jonathan Ellis <jb...@gmail.com>
> > >> Date: Wednesday, 22 September 2021 at 05:36
> > >> To: dev <de...@cassandra.apache.org>
> > >> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > >> Right, I'm looking for exactly a discussion on the high level goals.
> > >> Instead of saying "here's the goals and we ruled out X because Y" we
> > should
> > >> start with a discussion around, "Approach A allows X and W, approach B
> > >> allows Y and Z" and decide together what the goals should be and what
> > >> we are willing to trade to get those goals, e.g., are we willing to give
> > >> up global strict serializability to get the ability to support full SQL.
> > >> Both of these are nice to have!
> > >>
> > >> On Tue, Sep 21, 2021 at 9:52 PM benedict@apache.org <
> > benedict@apache.org>
> > >> wrote:
> > >>
> > >>> Hi Jonathan,
> > >>>
> > >>> These other systems are incompatible with the goals of the CEP. I do
> > >>> discuss them (besides 2PC) in both the whitepaper and the CEP, and
> will
> > >>> summarise that discussion below. A true and accurate comparison of
> > these
> > >>> other systems is essentially intractable, as there are complex
> > subtleties
> > >>> to each flavour, and those who are interested would be better served
> by
> > >>> performing their own research.
> > >>>
> > >>> I think it is more productive to focus on what we want to achieve as
> a
> > >>> community. If you believe the goals of this CEP are wrong for the
> > >> project,
> > >>> let’s focus on that. If you want to compare and contrast specific
> > facets
> > >> of
> > >>> alternative systems that you consider to be preferable in some
> > dimension,
> > >>> let’s do that here or in a Q&A as proposed by Joey.
> > >>>
> > >>> The relevant goals are that we:
> > >>>
> > >>>
> > >>>  1.  Guarantee strict serializable isolation on commodity hardware
> > >>>  2.  Scale to any cluster size
> > >>>  3.  Achieve optimal latency
> > >>>
> > >>> The approach taken by Spanner derivatives is rejected by (1) because
> > they
> > >>> guarantee only Serializable isolation (they additionally fail (3)).
> > From
> > >>> watching talks by YugaByte, and inferring from Cockroach’s
> > >>> panic-cluster-death under clock skew, this is clearly considered by
> > >>> everyone to be undesirable but necessary to achieve scalability.
> > >>>
> > >>> The approach taken by FaunaDB (Calvin) is rejected by (2) because its
> > >>> sequencing layer requires a global leader process for the cluster,
> > which
> > >> is
> > >>> incompatible with Cassandra’s scalability requirements. It
> additionally
> > >>> fails (3) for global clients.
> > >>>
> > >>> Two phase commit fails (3). As an aside, AFAICT DynamoDB is today a
> > >>> Spanner clone for its multi-key transaction functionality, not 2PC.
> > >>>
> > >>> Systems such as RAMP with even weaker isolation are not considered
> for
> > >> the
> > >>> simple reason that they do not even claim to meet (1).
> > >>>
> > >>> If we want to additionally offer weaker isolation levels than
> > >>> Serializable, such as that provided by the recent RAMP-TAO paper,
> > >> Cassandra
> > >>> is likely able to support multiple distinct transaction layers that
> > >> operate
> > >>> independently. I would encourage you to file a CEP to explore how we
> > can
> > >>> meet these distinct use cases, but I consider them to be niche. I
> > expect
> > >>> that a majority of our user base desire strict serializable
> isolation,
> > >> and
> > >>> certainly no less than serializable isolation, to augment the
> existing
> > >>> weaker isolation offered by quorum reads and writes.
> > >>>
> > >>> I would tangentially note that we are not an AP database under normal
> > >>> recommended operation. A minority in any network partition cannot
> reach
> > >>> QUORUM, so under recommended usage we are a high-availability
> > leaderless
> > >> CP
> > >>> database.
> > >>>
> > >>>
> > >>> From: Jonathan Ellis <jb...@gmail.com>
> > >>> Date: Tuesday, 21 September 2021 at 23:45
> > >>> To: dev <de...@cassandra.apache.org>
> > >>> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > >>> Benedict, thanks for taking the lead in putting this together. Since
> > >>> Cassandra is the only relevant database today designed around a
> > >> leaderless
> > >>> architecture, it's quite likely that we'll be better served with a
> > custom
> > >>> transaction design instead of trying to retrofit one from CP systems.
> > >>>
> > >>> The whitepaper here is a good description of the consensus algorithm
> > >> itself
> > >>> as well as its robustness and stability characteristics, and its
> > >> comparison
> > >>> with other state-of-the-art consensus algorithms is very useful.  In
> > the
> > >>> context of Cassandra, where a consensus algorithm is only part of
> what
> > >> will
> > >>> be implemented, I'd like to see a more complete evaluation of the
> > >>> transactional side of things as well, including performance
> > >> characteristics
> > >>> as well as the types of transactions that can be supported and at
> > least a
> > >>> general idea of what it would look like applied to Cassandra. This
> will
> > >>> allow the PMC to make a more informed decision about what tradeoffs
> are
> > >>> best for the entire long-term project of first supplementing and
> > >> ultimately
> > >>> replacing LWT.
> > >>>
> > >>> (Allowing users to mix LWT and AP Cassandra operations against the
> same
> > >>> rows was probably a mistake, so in contrast with LWT we’re not
> looking
> > >> for
> > >>> something fast enough for occasional use but rather something within
> a
> > >>> reasonable factor of AP operations, appropriate to being the only way
> > to
> > >>> interact with tables declared as such.)
> > >>>
> > >>> Besides Accord, this should cover
> > >>>
> > >>> - Calvin and FaunaDB
> > >>> - A Spanner derivative (no opinion on whether that should be
> Cockroach
> > or
> > >>> Yugabyte, I don’t think it’s necessary to cover both)
> > >>> - A 2PC implementation (the Accord paper mentions DynamoDB but I
> > suspect
> > >>> there is more public information about MongoDB)
> > >>> - RAMP
> > >>>
> > >>> Here’s an example of what I mean:
> > >>>
> > >>> =Calvin=
> > >>>
> > >>> Approach: global consensus (Paxos in Calvin, Raft in FaunaDB) to
> order
> > >>> transactions, then replicas execute the transactions independently
> with
> > >> no
> > >>> further coordination.  No SPOF.  Transactions are batched by each
> > >> sequencer
> > >>> to keep this from becoming a bottleneck.
> > >>>
> > >>> Performance: Calvin paper (published 2012) reports linear scaling of
> > >> TPC-C
> > >>> New Order up to 500,000 transactions/s on 100 machines (EC2 XL
> machines
> > >>> with 7GB ram and 8 virtual cores).  Note that TPC-C New Order is
> > composed
> > >>> of four reads and four writes, so this is effectively 2M reads and 2M
> > >>> writes as we normally measure them in C*.
> > >>>
> > >>> Calvin supports mixed read/write transactions, but because the
> > >> transaction
> > >>> execution logic requires knowing all partition keys in advance to
> > ensure
> > >>> that all replicas can reproduce the same results with no
> coordination,
> > >>> reads against non-PK predicates must be done ahead of time
> > >> (transparently,
> > >>> by the server) to determine the set of keys, and this must be retried
> > if
> > >>> the set of rows affected is updated before the actual transaction
> > >> executes.
> > >>>
> > >>> Batching and global consensus adds latency -- 100ms in the Calvin
> paper
> > >> and
> > >>> apparently about 50ms in FaunaDB.  Glass half full: all transactions
> > >>> (including multi-partition updates) are equally performant in Calvin
> > >> since
> > >>> the coordination is handled up front in the sequencing step.  Glass
> > half
> > >>> empty: even single-row reads and writes have to pay the full
> > coordination
> > >>> cost.  Fauna has optimized this away for reads but I am not aware of
> a
> > >>> description of how they changed the design to allow this.
> > >>>
> > >>> Functionality and limitations: since the entire transaction must be
> > known
> > >>> in advance to allow coordination-less execution at the replicas,
> Calvin
> > >>> cannot support interactive transactions at all.  FaunaDB mitigates
> this
> > >> by
> > >>> allowing server-side logic to be included, but a Calvin approach will
> > >> never
> > >>> be able to offer SQL compatibility.
> > >>>
> > >>> Guarantees: Calvin transactions are strictly serializable.  There is
> no
> > >>> additional complexity or performance hit to generalizing to multiple
> > >>> regions, apart from the speed of light.  And since Calvin is already
> > >> paying
> > >>> a batching latency penalty, this is less painful than for other
> > systems.
> > >>>
> > >>> Application to Cassandra: B-.  Distributed transactions are handled
> by
> > >> the
> > >>> sequencing and scheduling layers, which are leaderless, and Calvin’s
> > >>> requirements for the storage layer are easily met by C*.  But Calvin
> > also
> > >>> requires a global consensus protocol and LWT is almost certainly not
> > >>> sufficiently performant, so this would require ZK or etcd (reasonable
> > >> for a
> > >>> library approach but not for replacing LWT in C* itself), or an
> > >>> implementation of Accord.  I don’t believe Calvin would require
> > >> additional
> > >>> table-level metadata in Cassandra.
> > >>>
> > >>> On Sun, Sep 5, 2021 at 9:33 AM benedict@apache.org <
> > benedict@apache.org>
> > >>> wrote:
> > >>>
> > >>>> Wiki: https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-15%3A+General+Purpose+Transactions
> > >>>> Whitepaper: https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf
> > >>>> <https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf?version=1&modificationDate=1630847736966&api=v2>
> > >>>> Prototype: https://github.com/belliottsmith/accord
> > >>>>
> > >>>> Hi everyone, I’d like to propose this CEP for adoption by the
> > >> community.
> > >>>>
> > >>>> Cassandra has benefitted from LWTs for many years, but application
> > >>>> developers that want to ensure consistency for complex operations
> must
> > >>>> either accept the scalability bottleneck of serializing all related
> > >> state
> > >>>> through a single partition, or layer a complex state machine on top
> of
> > >>> the
> > >>>> database. These are sophisticated and costly activities that our
> users
> > >>>> should not be expected to undertake. Since distributed databases are
> > >>>> beginning to offer distributed transactions with fewer caveats, it
> is
> > >>> past
> > >>>> time for Cassandra to do so as well.
> > >>>>
> > >>>> This CEP proposes the use of several novel techniques that build
> upon
> > >>>> research (that followed EPaxos) to deliver (non-interactive) general
> > >>>> purpose distributed transactions. The approach is outlined in the
> > >>> wikipage
> > >>>> and in more detail in the linked whitepaper. Importantly, by
> adopting
> > >>> this
> > >>>> approach we will be the _only_ distributed database to offer global,
> > >>>> scalable, strict serializable transactions in one wide area
> > round-trip.
> > >>>> This would represent a significant improvement in the state of the
> > art,
> > >>>> both in the academic literature and in commercial or open source
> > >>> offerings.
> > >>>>
> > >>>> This work has been partially realised in a prototype. This partial
> > >>>> prototype has been verified against Jepsen.io’s Maelstrom library
> and
> > >>>> dedicated in-tree strict serializability verification tools, but
> much
> > >>> work
> > >>>> remains for the work to be production capable and integrated into
> > >>> Cassandra.
> > >>>>
> > >>>> I propose including the prototype in the project as a new source
> > >>>> repository, to be developed as a standalone library for integration
> > >> into
> > >>>> Cassandra. I hope the community sees the important value proposition
> > of
> > >>>> this proposal, and will adopt the CEP after this discussion, so that
> > >> the
> > >>>> library and its integration into Cassandra can be developed in
> > parallel
> > >>> and
> > >>>> with the involvement of the wider community.
> > >>>>
> > >>>
> > >>>
> > >>> --
> > >>> Jonathan Ellis
> > >>> co-founder, http://www.datastax.com
> > >>> @spyced
> > >>>

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by Paulo Motta <pa...@gmail.com>.

I share jbellis's feeling that this proposal focuses on the protocol itself
but lacks the actual feature that will use the protocol, which IMO is a key
element to discuss in a CEP.

It's similar to saying "hey, I want to add this Trie Serialization Protocol
to Cassandra" without providing specific details of how the protocol is
going to be used.

I think the right route for a CEP is to describe the feature that will be
added to the database, with the protocol as a mere requirement of the
high-level feature, for example:

CEP: Add Trie-backed memtable
- Trie Serialization Protocol: implementation detail of the above CEP

What is the difficulty of taking this approach, picking one of the myriad
of features that will be enabled by Accord and using that as the initial
CEP to introduce the protocol to the database?

On Fri, Oct 1, 2021 at 08:37, benedict@apache.org <benedict@apache.org> wrote:

> Actually, thinking about it again, the simple optimistic protocol would in
> fact guarantee system forward progress (i.e. independent of transaction
> formulation).
>
>
> From: benedict@apache.org <be...@apache.org>
> Date: Friday, 1 October 2021 at 09:14
> To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> Hi Jonathan,
>
> It would be great if we could achieve a bandwidth higher than 1-2 short
> emails per week. It remains unclear to me what your goal is, and it would
> help if you could make a statement like “I want Cassandra to be able to do
> X” so that we can respond directly to it. I am also available to have
> another call, in which we can have a back and forth, please feel free to
> propose a London-compatible time within the next week that is suitable for
> you.
>
> In my opinion we are at risk of veering off-topic, though. This CEP is not
> to deliver interactive transactions, and to my knowledge nobody is
> proposing a CEP for interactive transactions. So, for the CEP at hand the
> salient question seems: does this CEP prevent us from implementing
> interactive transactions with properties X, Y, Z in future? To which the
> answer is almost certainly no.
>
> However, to continue the discussion and respond directly to your queries,
> I believe we agree on the definition of an interactive transaction.
>
> Two protocols were loosely outlined. The first, using timestamps for
> optimistic concurrency control, would indeed involve the possibility of
> aborts. It would not however inherently adopt the issue of LWTs where no
> transaction is able to make progress. Whether or not progress is guaranteed
> (in a livelock-free sense) would depend on the structure of the
> transactions that were interfering.
>
> This approach has the advantage of being very simple to implement, so that
> we could realistically support interactive transactions quite quickly. It
> has the additional advantage that transactions would execute very quickly
> by avoiding the WAN during construction, and as a result may in practice
> experience fewer aborts than protocols that guarantee livelock-freedom.
>
> The second protocol proposed using read/write intents and would be able to
> support almost any behaviour you want. We could even utilise pessimistic
> concurrency control, or anything in-between. This is its own huge design
> space, and discussion of this approach and the trade-offs that could be
> made is (in my opinion) entirely out of scope for this CEP.
>
>
> From: Jonathan Ellis <jb...@gmail.com>
> Date: Friday, 1 October 2021 at 05:00
> To: dev <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> The obstacle for me is you've provided a protocol but not a fully fleshed
> out architecture, so it's hard to fill in some of the blanks.  But it looks
> to me like optimistic concurrency control for interactive transactions
> applied to Accord would leave you in a LWT-like situation under fairly
> light contention where nobody actually makes progress due to retries.
>
> To make sure we're talking about the same thing, as Henrik pointed out,
> interactive transactions mean multiple round trips from the client within a
> transaction.  For example, here
> <
> https://github.com/apavlo/py-tpcc/blob/master/pytpcc/drivers/sqlitedriver.py#L213
> >
> is a simple implementation of the TPC-C New Order transaction.  The high
> level logic (via
> <
> https://courses.cs.washington.edu/courses/csep545/01wi/lectures/class1/tsld039.htm
> >)
> is,
>
>    1. Get records describing a warehouse, customer, & district
>    2. Update the district
>    3. Increment next available order number
>    4. Insert record into Order and New-Order tables
>    5. For 5-15 items, get Item record, get/update Stock record
>    6. Insert Order-Line Record
>
> As you can see, this requires a lot of client-side logic mixed in with the
> actual SQL commands.
>
>
> On Thu, Sep 30, 2021 at 2:30 AM benedict@apache.org <be...@apache.org>
> wrote:
>
> > Essentially this, although I think in practice we will need to track each
> > partition’s timestamp separately (or optionally for reduced conflicts,
> each
> > row or datum’s), and make them all part of the conditional application of
> > the transaction - at least for strict-serializability.
> >
> > The alternative is to insert read/write intents for the transaction
> during
> > each step, and to confirm they are still valid on commit, but this
> approach
> > would require a WAN round-trip for each step in the interactive
> > transaction, whereas the timestamp-validating approach can use a LAN
> > round-trip for each step besides the final one, and is also much simpler
> to
> > implement.
> >
> >
> > From: Blake Eggleston <be...@apple.com.INVALID>
> > Date: Thursday, 30 September 2021 at 05:47
> > To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > You could establish a lower timestamp bound and buffer transaction state
> > on the coordinator, then make the commit an operation that only applies
> if
> > all partitions involved haven’t been changed by a more recent timestamp.
> > You could also implement mvcc either in the storage layer or for some
> > period of time by buffering commits on each replica before applying.
> >
> > > On Sep 29, 2021, at 6:18 PM, Jonathan Ellis <jb...@gmail.com> wrote:
> > >
> > > How are interactive transactions possible with Accord?
> > >
> > >
> > >
> > > On Tue, Sep 21, 2021 at 11:56 PM benedict@apache.org <
> > benedict@apache.org>
> > > wrote:
> > >
> > >> Could you explain why you believe this trade-off is necessary? We can
> > >> support full SQL just fine with Accord, and I hope that we eventually
> > do so.
> > >>
> > >> This domain is incredibly complex, so it is easy to reach wrong
> > >> conclusions. I would invite you again to propose a system for
> discussion
> > >> that you think offers something Accord is unable to, and that you
> > consider
> > >> desirable, and we can work from there.
> > >>
> > >> To pre-empt some possible discussions, I am not aware of anything we
> > >> cannot do with Accord that we could do with either Calvin or Spanner.
> > >> Interactive transactions are possible on top of Accord, as are
> > transactions
> > >> with an unknown read/write set. In each case the only cost is that
> they
> > >> would use optimistic concurrency control, which is no worse than the
> Spanner
> > >> derivatives anyway (which I have to assume is your benchmark in this
> > >> regard). I do not expect to deliver either functionality initially,
> but
> > >> Accord takes us most of the way there for both.
> > >>
> > >>
> > >> From: Jonathan Ellis <jb...@gmail.com>
> > >> Date: Wednesday, 22 September 2021 at 05:36
> > >> To: dev <de...@cassandra.apache.org>
> > >> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > >> Right, I'm looking for exactly a discussion on the high level goals.
> > >> Instead of saying "here's the goals and we ruled out X because Y" we
> > should
> > >> start with a discussion around, "Approach A allows X and W, approach B
> > >> allows Y and Z" and decide together what the goals should be and
> > what
> > >> we are willing to trade to get those goals, e.g., are we willing to
> > give up
> > >> global strict serializability to get the ability to support full SQL.
> > Both
> > >> of these are nice to have!
> > >>
> > >> On Tue, Sep 21, 2021 at 9:52 PM benedict@apache.org <
> > benedict@apache.org>
> > >> wrote:
> > >>
> > >>> Hi Jonathan,
> > >>>
> > >>> These other systems are incompatible with the goals of the CEP. I do
> > >>> discuss them (besides 2PC) in both the whitepaper and the CEP, and
> will
> > >>> summarise that discussion below. A true and accurate comparison of
> > these
> > >>> other systems is essentially intractable, as there are complex
> > subtleties
> > >>> to each flavour, and those who are interested would be better served
> by
> > >>> performing their own research.
> > >>>
> > >>> I think it is more productive to focus on what we want to achieve as
> a
> > >>> community. If you believe the goals of this CEP are wrong for the
> > >> project,
> > >>> let’s focus on that. If you want to compare and contrast specific
> > facets
> > >> of
> > >>> alternative systems that you consider to be preferable in some
> > dimension,
> > >>> let’s do that here or in a Q&A as proposed by Joey.
> > >>>
> > >>> The relevant goals are that we:
> > >>>
> > >>>
> > >>>  1.  Guarantee strict serializable isolation on commodity hardware
> > >>>  2.  Scale to any cluster size
> > >>>  3.  Achieve optimal latency
> > >>>
> > >>> The approach taken by Spanner derivatives is rejected by (1) because
> > they
> > >>> guarantee only Serializable isolation (they additionally fail (3)).
> > From
> > >>> watching talks by YugaByte, and inferring from Cockroach’s
> > >>> panic-cluster-death under clock skew, this is clearly considered by
> > >>> everyone to be undesirable but necessary to achieve scalability.
> > >>>
> > >>> The approach taken by FaunaDB (Calvin) is rejected by (2) because its
> > >>> sequencing layer requires a global leader process for the cluster,
> > which
> > >> is
> > >>> incompatible with Cassandra’s scalability requirements. It
> additionally
> > >>> fails (3) for global clients.
> > >>>
> > >>> Two phase commit fails (3). As an aside, AFAICT DynamoDB is today a
> > >>> Spanner clone for its multi-key transaction functionality, not 2PC.
> > >>>
> > >>> Systems such as RAMP with even weaker isolation are not considered
> for
> > >> the
> > >>> simple reason that they do not even claim to meet (1).
> > >>>
> > >>> If we want to additionally offer weaker isolation levels than
> > >>> Serializable, such as that provided by the recent RAMP-TAO paper,
> > >> Cassandra
> > >>> is likely able to support multiple distinct transaction layers that
> > >> operate
> > >>> independently. I would encourage you to file a CEP to explore how we
> > can
> > >>> meet these distinct use cases, but I consider them to be niche. I
> > expect
> > >>> that a majority of our user base desire strict serializable
> isolation,
> > >> and
> > >>> certainly no less than serializable isolation, to augment the
> existing
> > >>> weaker isolation offered by quorum reads and writes.
> > >>>
> > >>> I would tangentially note that we are not an AP database under normal
> > >>> recommended operation. A minority in any network partition cannot
> reach
> > >>> QUORUM, so under recommended usage we are a high-availability
> > leaderless
> > >> CP
> > >>> database.
> > >>>
> > >>>
> > >>> From: Jonathan Ellis <jb...@gmail.com>
> > >>> Date: Tuesday, 21 September 2021 at 23:45
> > >>> To: dev <de...@cassandra.apache.org>
> > >>> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > >>> Benedict, thanks for taking the lead in putting this together. Since
> > >>> Cassandra is the only relevant database today designed around a
> > >> leaderless
> > >>> architecture, it's quite likely that we'll be better served with a
> > custom
> > >>> transaction design instead of trying to retrofit one from CP systems.
> > >>>
> > >>> The whitepaper here is a good description of the consensus algorithm
> > >> itself
> > >>> as well as its robustness and stability characteristics, and its
> > >> comparison
> > >>> with other state-of-the-art consensus algorithms is very useful.  In
> > the
> > >>> context of Cassandra, where a consensus algorithm is only part of
> what
> > >> will
> > >>> be implemented, I'd like to see a more complete evaluation of the
> > >>> transactional side of things as well, including performance
> > >> characteristics
> > >>> as well as the types of transactions that can be supported and at
> > least a
> > >>> general idea of what it would look like applied to Cassandra. This
> will
> > >>> allow the PMC to make a more informed decision about what tradeoffs
> are
> > >>> best for the entire long-term project of first supplementing and
> > >> ultimately
> > >>> replacing LWT.
> > >>>
> > >>> (Allowing users to mix LWT and AP Cassandra operations against the
> same
> > >>> rows was probably a mistake, so in contrast with LWT we’re not
> looking
> > >> for
> > >>> something fast enough for occasional use but rather something within
> a
> > >>> reasonable factor of AP operations, appropriate to being the only way
> > to
> > >>> interact with tables declared as such.)
> > >>>
> > >>> Besides Accord, this should cover
> > >>>
> > >>> - Calvin and FaunaDB
> > >>> - A Spanner derivative (no opinion on whether that should be
> Cockroach
> > or
> > >>> Yugabyte, I don’t think it’s necessary to cover both)
> > >>> - A 2PC implementation (the Accord paper mentions DynamoDB but I
> > suspect
> > >>> there is more public information about MongoDB)
> > >>> - RAMP
> > >>>
> > >>> Here’s an example of what I mean:
> > >>>
> > >>> =Calvin=
> > >>>
> > >>> Approach: global consensus (Paxos in Calvin, Raft in FaunaDB) to
> order
> > >>> transactions, then replicas execute the transactions independently
> with
> > >> no
> > >>> further coordination.  No SPOF.  Transactions are batched by each
> > >> sequencer
> > >>> to keep this from becoming a bottleneck.
> > >>>
> > >>> Performance: Calvin paper (published 2012) reports linear scaling of
> > >> TPC-C
> > >>> New Order up to 500,000 transactions/s on 100 machines (EC2 XL
> machines
> > >>> with 7GB ram and 8 virtual cores).  Note that TPC-C New Order is
> > composed
> > >>> of four reads and four writes, so this is effectively 2M reads and 2M
> > >>> writes as we normally measure them in C*.
> > >>>
> > >>> Calvin supports mixed read/write transactions, but because the
> > >> transaction
> > >>> execution logic requires knowing all partition keys in advance to
> > ensure
> > >>> that all replicas can reproduce the same results with no
> coordination,
> > >>> reads against non-PK predicates must be done ahead of time
> > >> (transparently,
> > >>> by the server) to determine the set of keys, and this must be retried
> > if
> > >>> the set of rows affected is updated before the actual transaction
> > >> executes.
> > >>>
> > >>> Batching and global consensus adds latency -- 100ms in the Calvin
> paper
> > >> and
> > >>> apparently about 50ms in FaunaDB.  Glass half full: all transactions
> > >>> (including multi-partition updates) are equally performant in Calvin
> > >> since
> > >>> the coordination is handled up front in the sequencing step.  Glass
> > half
> > >>> empty: even single-row reads and writes have to pay the full
> > coordination
> > >>> cost.  Fauna has optimized this away for reads but I am not aware of
> a
> > >>> description of how they changed the design to allow this.
> > >>>
> > >>> Functionality and limitations: since the entire transaction must be
> > known
> > >>> in advance to allow coordination-less execution at the replicas,
> Calvin
> > >>> cannot support interactive transactions at all.  FaunaDB mitigates
> this
> > >> by
> > >>> allowing server-side logic to be included, but a Calvin approach will
> > >> never
> > >>> be able to offer SQL compatibility.
> > >>>
> > >>> Guarantees: Calvin transactions are strictly serializable.  There is
> no
> > >>> additional complexity or performance hit to generalizing to multiple
> > >>> regions, apart from the speed of light.  And since Calvin is already
> > >> paying
> > >>> a batching latency penalty, this is less painful than for other
> > systems.
> > >>>
> > >>> Application to Cassandra: B-.  Distributed transactions are handled
> by
> > >> the
> > >>> sequencing and scheduling layers, which are leaderless, and Calvin’s
> > >>> requirements for the storage layer are easily met by C*.  But Calvin
> > also
> > >>> requires a global consensus protocol and LWT is almost certainly not
> > >>> sufficiently performant, so this would require ZK or etcd (reasonable
> > >> for a
> > >>> library approach but not for replacing LWT in C* itself), or an
> > >>> implementation of Accord.  I don’t believe Calvin would require
> > >> additional
> > >>> table-level metadata in Cassandra.
> > >>>
> > >>> On Sun, Sep 5, 2021 at 9:33 AM benedict@apache.org <
> > benedict@apache.org>
> > >>> wrote:
> > >>>
> > >>>> Wiki: https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-15%3A+General+Purpose+Transactions
> > >>>> Whitepaper: https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf
> > >>>> <https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf?version=1&modificationDate=1630847736966&api=v2>
> > >>>> Prototype: https://github.com/belliottsmith/accord
> > >>>>
> > >>>> Hi everyone, I’d like to propose this CEP for adoption by the
> > >> community.
> > >>>>
> > >>>> Cassandra has benefitted from LWTs for many years, but application
> > >>>> developers that want to ensure consistency for complex operations
> must
> > >>>> either accept the scalability bottleneck of serializing all related
> > >> state
> > >>>> through a single partition, or layer a complex state machine on top
> of
> > >>> the
> > >>>> database. These are sophisticated and costly activities that our
> users
> > >>>> should not be expected to undertake. Since distributed databases are
> > >>>> beginning to offer distributed transactions with fewer caveats, it
> is
> > >>> past
> > >>>> time for Cassandra to do so as well.
> > >>>>
> > >>>> This CEP proposes the use of several novel techniques that build
> upon
> > >>>> research (that followed EPaxos) to deliver (non-interactive) general
> > >>>> purpose distributed transactions. The approach is outlined in the
> > >>> wikipage
> > >>>> and in more detail in the linked whitepaper. Importantly, by
> adopting
> > >>> this
> > >>>> approach we will be the _only_ distributed database to offer global,
> > >>>> scalable, strict serializable transactions in one wide area
> > round-trip.
> > >>>> This would represent a significant improvement in the state of the
> > art,
> > >>>> both in the academic literature and in commercial or open source
> > >>> offerings.
> > >>>>
> > >>>> This work has been partially realised in a prototype. This partial
> > >>>> prototype has been verified against Jepsen.io’s Maelstrom library
> and
> > >>>> dedicated in-tree strict serializability verification tools, but
> much
> > >>> work
> > >>>> remains for the work to be production capable and integrated into
> > >>> Cassandra.
> > >>>>
> > >>>> I propose including the prototype in the project as a new source
> > >>>> repository, to be developed as a standalone library for integration
> > >> into
> > >>>> Cassandra. I hope the community sees the important value proposition
> > of
> > >>>> this proposal, and will adopt the CEP after this discussion, so that
> > >> the
> > >>>> library and its integration into Cassandra can be developed in
> > parallel
> > >>> and
> > >>>> with the involvement of the wider community.
> > >>>>
> > >>>
> > >>>
> > >>> --
> > >>> Jonathan Ellis
> > >>> co-founder, http://www.datastax.com
> > >>> @spyced
> > >>>

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by "benedict@apache.org" <be...@apache.org>.

Actually, thinking about it again, the simple optimistic protocol would in fact guarantee system forward progress (i.e. independent of transaction formulation).


From: benedict@apache.org <be...@apache.org>
Date: Friday, 1 October 2021 at 09:14
To: dev@cassandra.apache.org <de...@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
Hi Jonathan,

It would be great if we could achieve a bandwidth higher than 1-2 short emails per week. It remains unclear to me what your goal is, and it would help if you could make a statement like “I want Cassandra to be able to do X” so that we can respond directly to it. I am also available for another call, in which we can have a back and forth; please feel free to propose a London-compatible time within the next week that suits you.

In my opinion we are at risk of veering off-topic, though. This CEP is not to deliver interactive transactions, and to my knowledge nobody is proposing a CEP for interactive transactions. So, for the CEP at hand the salient question seems: does this CEP prevent us from implementing interactive transactions with properties X, Y, Z in future? To which the answer is almost certainly no.

However, to continue the discussion and respond directly to your queries, I believe we agree on the definition of an interactive transaction.

Two protocols were loosely outlined. The first, using timestamps for optimistic concurrency control, would indeed involve the possibility of aborts. It would not, however, inherently share the problem LWTs have, where no transaction is able to make progress. Whether or not progress is guaranteed (in a livelock-free sense) would depend on the structure of the transactions that were interfering.

This approach has the advantage of being very simple to implement, so that we could realistically support interactive transactions quite quickly. It has the additional advantage that transactions would execute very quickly by avoiding the WAN during construction, and as a result may in practice experience fewer aborts than protocols that guarantee livelock-freedom.
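
To make the first protocol concrete, here is a minimal single-process Python sketch of timestamp-validated optimistic concurrency control as loosely outlined in this thread: writes are buffered on the coordinator, and commit applies only if no partition the transaction touched has been written since it was first observed. `Store`, `InteractiveTxn` and their methods are hypothetical names for illustration; nothing here is Accord's API, and a real design would presumably run the final validate-and-apply step as one distributed transaction rather than in local memory.

```python
from dataclasses import dataclass, field


@dataclass
class Store:
    """Toy in-memory stand-in for a set of partitions (hypothetical)."""
    values: dict = field(default_factory=dict)      # partition key -> value
    last_write: dict = field(default_factory=dict)  # partition key -> commit timestamp
    clock: int = 0

    def next_timestamp(self) -> int:
        self.clock += 1
        return self.clock


class InteractiveTxn:
    """Buffers writes locally and validates observed timestamps at commit."""

    def __init__(self, store: Store):
        self.store = store
        self.observed = {}  # partition key -> last-write timestamp when first touched
        self.writes = {}    # buffered writes, invisible to others until commit

    def _observe(self, key):
        # Remember the version this transaction is based on (first touch only).
        self.observed.setdefault(key, self.store.last_write.get(key, 0))

    def read(self, key):
        if key in self.writes:
            return self.writes[key]
        self._observe(key)
        return self.store.values.get(key)

    def write(self, key, value):
        self._observe(key)
        self.writes[key] = value

    def commit(self) -> bool:
        # Conditional application: abort if any touched partition has changed.
        for key, seen in self.observed.items():
            if self.store.last_write.get(key, 0) != seen:
                return False
        ts = self.store.next_timestamp()
        for key, value in self.writes.items():
            self.store.values[key] = value
            self.store.last_write[key] = ts
        return True


# Two transactions race to increment the same partition.
store = Store()
t1, t2 = InteractiveTxn(store), InteractiveTxn(store)
t1.write("x", (t1.read("x") or 0) + 1)
t2.write("x", (t2.read("x") or 0) + 1)
t1_committed = t1.commit()  # first committer validates against an unchanged partition
t2_committed = t2.commit()  # second sees a newer timestamp and must abort and retry
```

In this toy, a transaction aborts only because a conflicting one committed first, which is the intuition behind the forward-progress remark above; it says nothing about latency or fairness in a distributed setting.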

The second protocol proposed using read/write intents and would be able to support almost any behaviour you want. We could even utilise pessimistic concurrency control, or anything in-between. This is its own huge design space, and discussion of this approach and the trade-offs that could be made is (in my opinion) entirely out of scope for this CEP.


From: Jonathan Ellis <jb...@gmail.com>
Date: Friday, 1 October 2021 at 05:00
To: dev <de...@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
The obstacle for me is you've provided a protocol but not a fully fleshed
out architecture, so it's hard to fill in some of the blanks.  But it looks
to me like optimistic concurrency control for interactive transactions
applied to Accord would leave you in a LWT-like situation under fairly
light contention where nobody actually makes progress due to retries.

To make sure we're talking about the same thing, as Henrik pointed out,
interactive transactions mean multiple round trips from the client within a
transaction.  For example, here
<https://github.com/apavlo/py-tpcc/blob/master/pytpcc/drivers/sqlitedriver.py#L213>
is a simple implementation of the TPC-C New Order transaction.  The high
level logic (via
<https://courses.cs.washington.edu/courses/csep545/01wi/lectures/class1/tsld039.htm>)
is,

   1. Get records describing a warehouse, customer, & district
   2. Update the district
   3. Increment next available order number
   4. Insert record into Order and New-Order tables
   5. For 5-15 items, get Item record, get/update Stock record
   6. Insert Order-Line Record

As you can see, this requires a lot of client-side logic mixed in with the
actual SQL commands.
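
As an illustration of that mix of client-side logic and SQL, here is a heavily simplified Python sketch loosely modelled on the New Order steps listed above, using the standard-library `sqlite3` module. The schema, the `make_db` and `new_order` helpers, and the one-unit-per-line restock rule are assumptions made for this example, not the TPC-C specification or the linked driver. Each `execute` call stands for one client round trip, with application code deciding what to send next.

```python
import sqlite3


def make_db():
    """Create a tiny hypothetical schema with a little seed data."""
    # isolation_level=None puts sqlite3 in autocommit mode, so BEGIN/COMMIT are explicit.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.executescript("""
        CREATE TABLE district (id INTEGER PRIMARY KEY, next_o_id INTEGER);
        CREATE TABLE item (id INTEGER PRIMARY KEY, price REAL);
        CREATE TABLE stock (item_id INTEGER PRIMARY KEY, quantity INTEGER);
        CREATE TABLE orders (id INTEGER, district_id INTEGER);
        CREATE TABLE order_line (order_id INTEGER, line_no INTEGER,
                                 item_id INTEGER, amount REAL);
        INSERT INTO district VALUES (1, 3001);
        INSERT INTO item VALUES (10, 2.5), (11, 4.0);
        INSERT INTO stock VALUES (10, 50), (11, 5);
    """)
    return conn


def new_order(conn, district_id, item_ids):
    """Simplified New Order: one unit of each item, one statement per round trip."""
    cur = conn.cursor()
    cur.execute("BEGIN")
    # Read the next available order number, then increment it.
    (order_id,) = cur.execute(
        "SELECT next_o_id FROM district WHERE id = ?", (district_id,)).fetchone()
    cur.execute("UPDATE district SET next_o_id = ? WHERE id = ?",
                (order_id + 1, district_id))
    cur.execute("INSERT INTO orders (id, district_id) VALUES (?, ?)",
                (order_id, district_id))
    for line_no, item_id in enumerate(item_ids, start=1):
        (price,) = cur.execute(
            "SELECT price FROM item WHERE id = ?", (item_id,)).fetchone()
        (quantity,) = cur.execute(
            "SELECT quantity FROM stock WHERE item_id = ?", (item_id,)).fetchone()
        # Client-side logic: the next write depends on a value just read.
        new_quantity = quantity - 1 if quantity >= 11 else quantity - 1 + 91
        cur.execute("UPDATE stock SET quantity = ? WHERE item_id = ?",
                    (new_quantity, item_id))
        cur.execute("INSERT INTO order_line VALUES (?, ?, ?, ?)",
                    (order_id, line_no, item_id, price))
    cur.execute("COMMIT")
    return order_id


conn = make_db()
order_id = new_order(conn, district_id=1, item_ids=[10, 11])
```

Because the stock update depends on a value the client has just read, the full read/write set is not known when the transaction starts, which is what separates this from a transaction submitted in a single request.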


On Thu, Sep 30, 2021 at 2:30 AM benedict@apache.org <be...@apache.org>
wrote:

> Essentially this, although I think in practice we will need to track each
> partition’s timestamp separately (or optionally for reduced conflicts, each
> row or datum’s), and make them all part of the conditional application of
> the transaction - at least for strict-serializability.
>
> The alternative is to insert read/write intents for the transaction during
> each step, and to confirm they are still valid on commit, but this approach
> would require a WAN round-trip for each step in the interactive
> transaction, whereas the timestamp-validating approach can use a LAN
> round-trip for each step besides the final one, and is also much simpler to
> implement.
>
>
> From: Blake Eggleston <be...@apple.com.INVALID>
> Date: Thursday, 30 September 2021 at 05:47
> To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> You could establish a lower timestamp bound and buffer transaction state
> on the coordinator, then make the commit an operation that only applies if
> all partitions involved haven’t been changed by a more recent timestamp.
> You could also implement mvcc either in the storage layer or for some
> period of time by buffering commits on each replica before applying.
>
> > On Sep 29, 2021, at 6:18 PM, Jonathan Ellis <jb...@gmail.com> wrote:
> >
> > How are interactive transactions possible with Accord?
> >
> >
> >
> > On Tue, Sep 21, 2021 at 11:56 PM benedict@apache.org <
> benedict@apache.org>
> > wrote:
> >
> >> Could you explain why you believe this trade-off is necessary? We can
> >> support full SQL just fine with Accord, and I hope that we eventually
> do so.
> >>
> >> This domain is incredibly complex, so it is easy to reach wrong
> >> conclusions. I would invite you again to propose a system for discussion
> >> that you think offers something Accord is unable to, and that you
> consider
> >> desirable, and we can work from there.
> >>
> >> To pre-empt some possible discussions, I am not aware of anything we
> >> cannot do with Accord that we could do with either Calvin or Spanner.
> >> Interactive transactions are possible on top of Accord, as are
> >> transactions with an unknown read/write set. In each case the only cost
> >> is that they would use optimistic concurrency control, which is no worse
> >> than Spanner derivatives anyway (which I have to assume is your benchmark
> >> in this regard). I do not expect to deliver either functionality
> >> initially, but Accord takes us most of the way there for both.
> >>
> >>
> >> From: Jonathan Ellis <jb...@gmail.com>
> >> Date: Wednesday, 22 September 2021 at 05:36
> >> To: dev <de...@cassandra.apache.org>
> >> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> >> Right, I'm looking for exactly a discussion on the high level goals.
> >> Instead of saying "here's the goals and we ruled out X because Y" we
> >> should start with a discussion around, "Approach A allows X and W,
> >> approach B allows Y and Z" and decide together what the goals should be
> >> and what we are willing to trade to get those goals, e.g., are we
> >> willing to give up global strict serializability to get the ability to
> >> support full SQL. Both of these are nice to have!
> >>
> >> On Tue, Sep 21, 2021 at 9:52 PM benedict@apache.org <
> benedict@apache.org>
> >> wrote:
> >>
> >>> Hi Jonathan,
> >>>
> >>> These other systems are incompatible with the goals of the CEP. I do
> >>> discuss them (besides 2PC) in both the whitepaper and the CEP, and will
> >>> summarise that discussion below. A true and accurate comparison of
> these
> >>> other systems is essentially intractable, as there are complex
> subtleties
> >>> to each flavour, and those who are interested would be better served by
> >>> performing their own research.
> >>>
> >>> I think it is more productive to focus on what we want to achieve as a
> >>> community. If you believe the goals of this CEP are wrong for the
> >> project,
> >>> let’s focus on that. If you want to compare and contrast specific
> facets
> >> of
> >>> alternative systems that you consider to be preferable in some
> dimension,
> >>> let’s do that here or in a Q&A as proposed by Joey.
> >>>
> >>> The relevant goals are that we:
> >>>
> >>>
> >>>  1.  Guarantee strict serializable isolation on commodity hardware
> >>>  2.  Scale to any cluster size
> >>>  3.  Achieve optimal latency
> >>>
> >>> The approach taken by Spanner derivatives is rejected by (1) because
> they
> >>> guarantee only Serializable isolation (they additionally fail (3)).
> From
> >>> watching talks by YugaByte, and inferring from Cockroach’s
> >>> panic-cluster-death under clock skew, this is clearly considered by
> >>> everyone to be undesirable but necessary to achieve scalability.
> >>>
> >>> The approach taken by FaunaDB (Calvin) is rejected by (2) because its
> >>> sequencing layer requires a global leader process for the cluster,
> which
> >> is
> >>> incompatible with Cassandra’s scalability requirements. It additionally
> >>> fails (3) for global clients.
> >>>
> >>> Two phase commit fails (3). As an aside, AFAICT DynamoDB is today a
> >>> Spanner clone for its multi-key transaction functionality, not 2PC.
> >>>
> >>> Systems such as RAMP with even weaker isolation are not considered for
> >> the
> >>> simple reason that they do not even claim to meet (1).
> >>>
> >>> If we want to additionally offer weaker isolation levels than
> >>> Serializable, such as that provided by the recent RAMP-TAO paper,
> >> Cassandra
> >>> is likely able to support multiple distinct transaction layers that
> >> operate
> >>> independently. I would encourage you to file a CEP to explore how we
> can
> >>> meet these distinct use cases, but I consider them to be niche. I
> expect
> >>> that a majority of our user base desire strict serializable isolation,
> >> and
> >>> certainly no less than serializable isolation, to augment the existing
> >>> weaker isolation offered by quorum reads and writes.
> >>>
> >>> I would tangentially note that we are not an AP database under normal
> >>> recommended operation. A minority in any network partition cannot reach
> >>> QUORUM, so under recommended usage we are a high-availability
> leaderless
> >> CP
> >>> database.
> >>>
> >>>
> >>> From: Jonathan Ellis <jb...@gmail.com>
> >>> Date: Tuesday, 21 September 2021 at 23:45
> >>> To: dev <de...@cassandra.apache.org>
> >>> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> >>> Benedict, thanks for taking the lead in putting this together. Since
> >>> Cassandra is the only relevant database today designed around a
> >> leaderless
> >>> architecture, it's quite likely that we'll be better served with a
> custom
> >>> transaction design instead of trying to retrofit one from CP systems.
> >>>
> >>> The whitepaper here is a good description of the consensus algorithm
> >> itself
> >>> as well as its robustness and stability characteristics, and its
> >> comparison
> >>> with other state-of-the-art consensus algorithms is very useful.  In
> the
> >>> context of Cassandra, where a consensus algorithm is only part of what
> >> will
> >>> be implemented, I'd like to see a more complete evaluation of the
> >>> transactional side of things as well, including performance
> >> characteristics
> >>> as well as the types of transactions that can be supported and at
> least a
> >>> general idea of what it would look like applied to Cassandra. This will
> >>> allow the PMC to make a more informed decision about what tradeoffs are
> >>> best for the entire long-term project of first supplementing and
> >> ultimately
> >>> replacing LWT.
> >>>
> >>> (Allowing users to mix LWT and AP Cassandra operations against the same
> >>> rows was probably a mistake, so in contrast with LWT we’re not looking
> >> for
> >>> something fast enough for occasional use but rather something within a
> >>> reasonable factor of AP operations, appropriate to being the only way
> to
> >>> interact with tables declared as such.)
> >>>
> >>> Besides Accord, this should cover
> >>>
> >>> - Calvin and FaunaDB
> >>> - A Spanner derivative (no opinion on whether that should be Cockroach
> or
> >>> Yugabyte, I don’t think it’s necessary to cover both)
> >>> - A 2PC implementation (the Accord paper mentions DynamoDB but I
> suspect
> >>> there is more public information about MongoDB)
> >>> - RAMP
> >>>
> >>> Here’s an example of what I mean:
> >>>
> >>> =Calvin=
> >>>
> >>> Approach: global consensus (Paxos in Calvin, Raft in FaunaDB) to order
> >>> transactions, then replicas execute the transactions independently with
> >> no
> >>> further coordination.  No SPOF.  Transactions are batched by each
> >> sequencer
> >>> to keep this from becoming a bottleneck.
> >>>
> >>> Performance: Calvin paper (published 2012) reports linear scaling of
> >> TPC-C
> >>> New Order up to 500,000 transactions/s on 100 machines (EC2 XL machines
> >>> with 7GB ram and 8 virtual cores).  Note that TPC-C New Order is
> composed
> >>> of four reads and four writes, so this is effectively 2M reads and 2M
> >>> writes as we normally measure them in C*.
> >>>
> >>> Calvin supports mixed read/write transactions, but because the
> >> transaction
> >>> execution logic requires knowing all partition keys in advance to
> ensure
> >>> that all replicas can reproduce the same results with no coordination,
> >>> reads against non-PK predicates must be done ahead of time
> >> (transparently,
> >>> by the server) to determine the set of keys, and this must be retried
> if
> >>> the set of rows affected is updated before the actual transaction
> >> executes.
> >>>
> >>> Batching and global consensus adds latency -- 100ms in the Calvin paper
> >> and
> >>> apparently about 50ms in FaunaDB.  Glass half full: all transactions
> >>> (including multi-partition updates) are equally performant in Calvin
> >> since
> >>> the coordination is handled up front in the sequencing step.  Glass
> half
> >>> empty: even single-row reads and writes have to pay the full
> coordination
> >>> cost.  Fauna has optimized this away for reads but I am not aware of a
> >>> description of how they changed the design to allow this.
> >>>
> >>> Functionality and limitations: since the entire transaction must be
> known
> >>> in advance to allow coordination-less execution at the replicas, Calvin
> >>> cannot support interactive transactions at all.  FaunaDB mitigates this
> >> by
> >>> allowing server-side logic to be included, but a Calvin approach will
> >> never
> >>> be able to offer SQL compatibility.
> >>>
> >>> Guarantees: Calvin transactions are strictly serializable.  There is no
> >>> additional complexity or performance hit to generalizing to multiple
> >>> regions, apart from the speed of light.  And since Calvin is already
> >> paying
> >>> a batching latency penalty, this is less painful than for other
> systems.
> >>>
> >>> Application to Cassandra: B-.  Distributed transactions are handled by
> >> the
> >>> sequencing and scheduling layers, which are leaderless, and Calvin’s
> >>> requirements for the storage layer are easily met by C*.  But Calvin
> also
> >>> requires a global consensus protocol and LWT is almost certainly not
> >>> sufficiently performant, so this would require ZK or etcd (reasonable
> >> for a
> >>> library approach but not for replacing LWT in C* itself), or an
> >>> implementation of Accord.  I don’t believe Calvin would require
> >> additional
> >>> table-level metadata in Cassandra.
> >>>
> >>> On Sun, Sep 5, 2021 at 9:33 AM benedict@apache.org <
> benedict@apache.org>
> >>> wrote:
> >>>
> >>>> Wiki:
> >>>>
> >>>
> >>
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-15%3A+General+Purpose+Transactions
> >>>> Whitepaper:
> >>>>
> >>>
> >>
> https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf
> >>>> <
> >>>>
> >>>
> >>
> https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf?version=1&modificationDate=1630847736966&api=v2
> >>>>>
> >>>> Prototype: https://github.com/belliottsmith/accord
> >>>>
> >>>> Hi everyone, I’d like to propose this CEP for adoption by the
> >> community.
> >>>>
> >>>> Cassandra has benefitted from LWTs for many years, but application
> >>>> developers that want to ensure consistency for complex operations must
> >>>> either accept the scalability bottleneck of serializing all related
> >> state
> >>>> through a single partition, or layer a complex state machine on top of
> >>> the
> >>>> database. These are sophisticated and costly activities that our users
> >>>> should not be expected to undertake. Since distributed databases are
> >>>> beginning to offer distributed transactions with fewer caveats, it is
> >>> past
> >>>> time for Cassandra to do so as well.
> >>>>
> >>>> This CEP proposes the use of several novel techniques that build upon
> >>>> research (that followed EPaxos) to deliver (non-interactive) general
> >>>> purpose distributed transactions. The approach is outlined in the
> >>> wikipage
> >>>> and in more detail in the linked whitepaper. Importantly, by adopting
> >>> this
> >>>> approach we will be the _only_ distributed database to offer global,
> >>>> scalable, strict serializable transactions in one wide area
> round-trip.
> >>>> This would represent a significant improvement in the state of the
> art,
> >>>> both in the academic literature and in commercial or open source
> >>> offerings.
> >>>>
> >>>> This work has been partially realised in a prototype. This partial
> >>>> prototype has been verified against Jepsen.io’s Maelstrom library and
> >>>> dedicated in-tree strict serializability verification tools, but much
> >>> work
> >>>> remains for the work to be production capable and integrated into
> >>> Cassandra.
> >>>>
> >>>> I propose including the prototype in the project as a new source
> >>>> repository, to be developed as a standalone library for integration
> >> into
> >>>> Cassandra. I hope the community sees the important value proposition
> of
> >>>> this proposal, and will adopt the CEP after this discussion, so that
> >> the
> >>>> library and its integration into Cassandra can be developed in
> parallel
> >>> and
> >>>> with the involvement of the wider community.
> >>>>
> >>>
> >>>
> >>> --
> >>> Jonathan Ellis
> >>> co-founder, http://www.datastax.com
> >>> @spyced
> >>>
> >>
> >>
> >> --
> >> Jonathan Ellis
> >> co-founder, http://www.datastax.com
> >> @spyced
> >>
> >
> >
> > --
> > Jonathan Ellis
> > co-founder, http://www.datastax.com
> > @spyced
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@cassandra.apache.org
> For additional commands, e-mail: dev-help@cassandra.apache.org
>


--
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by "benedict@apache.org" <be...@apache.org>.

Hi Jonathan,

It would be great if we could achieve a bandwidth higher than 1-2 short emails per week. It remains unclear to me what your goal is, and it would help if you could make a statement like “I want Cassandra to be able to do X” so that we can respond directly to it. I am also available for another call, in which we can have a back and forth; please feel free to propose a London-compatible time within the next week that suits you.

In my opinion we are at risk of veering off-topic, though. This CEP does not aim to deliver interactive transactions, and to my knowledge nobody is proposing a CEP for interactive transactions. So, for the CEP at hand, the salient question seems to be: does this CEP prevent us from implementing interactive transactions with properties X, Y, Z in future? To which the answer is almost certainly no.

However, to continue the discussion and respond directly to your queries, I believe we agree on the definition of an interactive transaction.

Two protocols were loosely outlined. The first, using timestamps for optimistic concurrency control, would indeed involve the possibility of aborts. It would not, however, inherently inherit the LWT problem in which no transaction is able to make progress. Whether or not progress is guaranteed (in a livelock-free sense) would depend on the structure of the interfering transactions.

This approach has the advantage of being very simple to implement, so that we could realistically support interactive transactions quite quickly. It has the additional advantage that transactions would execute very quickly by avoiding the WAN during construction, and as a result may in practice experience fewer aborts than protocols that guarantee livelock-freedom.
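
As a rough illustration of this first protocol, the sketch below buffers an interactive transaction's reads and writes on the coordinator and applies them only if no partition it read has a newer write timestamp at commit. It is a single-process toy, not Accord code: `Store`, `InteractiveTxn`, and the integer clock are hypothetical names, and in the design discussed here the final `commit` would presumably be submitted as one conditional transaction rather than run as a local method call.

```python
class Store:
    """Hypothetical store: each partition key maps to (value, last_write_ts)."""

    def __init__(self):
        self.data = {}
        self.clock = 0

    def read(self, key):
        return self.data.get(key, (None, 0))

    def commit(self, read_versions, writes):
        """Apply all writes only if every partition read is still unchanged."""
        for key, seen_ts in read_versions.items():
            if self.read(key)[1] != seen_ts:
                return False  # a more recent write exists: abort
        self.clock += 1
        for key, value in writes.items():
            self.data[key] = (value, self.clock)
        return True


class InteractiveTxn:
    """Buffers reads and writes on the coordinator until commit."""

    def __init__(self, store):
        self.store = store
        self.read_versions = {}  # key -> timestamp observed at first read
        self.writes = {}         # key -> buffered value, invisible to others

    def get(self, key):
        if key in self.writes:  # read-your-own-writes from the buffer
            return self.writes[key]
        value, ts = self.store.read(key)
        self.read_versions.setdefault(key, ts)
        return value

    def put(self, key, value):
        self.writes[key] = value

    def commit(self):
        return self.store.commit(self.read_versions, self.writes)
```

If two such transactions read the same partition and both try to update it, the first to commit succeeds and the second aborts and must retry, which is the abort behaviour described above. Blind writes are not validated in this sketch.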

The second protocol proposed using read/write intents and would be able to support almost any behaviour you want. We could even utilise pessimistic concurrency control, or anything in-between. This is its own huge design space, and discussion of this approach and the trade-offs that could be made is (in my opinion) entirely out of scope for this CEP.


From: Jonathan Ellis <jb...@gmail.com>
Date: Friday, 1 October 2021 at 05:00
To: dev <de...@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
The obstacle for me is you've provided a protocol but not a fully fleshed
out architecture, so it's hard to fill in some of the blanks.  But it looks
to me like optimistic concurrency control for interactive transactions
applied to Accord would leave you in a LWT-like situation under fairly
light contention where nobody actually makes progress due to retries.

To make sure we're talking about the same thing, as Henrik pointed out,
interactive transactions mean multiple round trips from the client within a
transaction.  For example, here
<https://github.com/apavlo/py-tpcc/blob/master/pytpcc/drivers/sqlitedriver.py#L213>
is a simple implementation of the TPC-C New Order transaction.  The high
level logic (via
<https://courses.cs.washington.edu/courses/csep545/01wi/lectures/class1/tsld039.htm>)
is,

   1. Get records describing a warehouse, customer, & district
   2. Update the district
   3. Increment next available order number
   4. Insert record into Order and New-Order tables
   5. For 5-15 items, get Item record, get/update Stock record
   6. Insert Order-Line Record

As you can see, this requires a lot of client-side logic mixed in with the
actual SQL commands.



Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by Jonathan Ellis <jb...@gmail.com>.

The obstacle for me is that you've provided a protocol but not a fully
fleshed-out architecture, so it's hard to fill in some of the blanks.  But it
looks to me like optimistic concurrency control for interactive transactions
applied to Accord would leave you in an LWT-like situation where, even under
fairly light contention, nobody actually makes progress due to retries.

To make sure we're talking about the same thing, as Henrik pointed out,
interactive transactions mean multiple round trips from the client within a
transaction.  For example, here
<https://github.com/apavlo/py-tpcc/blob/master/pytpcc/drivers/sqlitedriver.py#L213>
is a simple implementation of the TPC-C New Order transaction.  The
high-level logic (via
<https://courses.cs.washington.edu/courses/csep545/01wi/lectures/class1/tsld039.htm>)
is:

   1. Get records describing a warehouse, customer, & district
   2. Update the district
   3. Increment next available order number
   4. Insert record into Order and New-Order tables
   5. For 5-15 items, get Item record, get/update Stock record
   6. Insert Order-Line Record

As you can see, this requires a lot of client-side logic mixed in with the
actual SQL commands.


On Thu, Sep 30, 2021 at 2:30 AM benedict@apache.org <be...@apache.org>
wrote:

> Essentially this, although I think in practice we will need to track each
> partition’s timestamp separately (or optionally for reduced conflicts, each
> row or datum’s), and make them all part of the conditional application of
> the transaction - at least for strict-serializability.
>
> The alternative is to insert read/write intents for the transaction during
> each step, and to confirm they are still valid on commit, but this approach
> would require a WAN round-trip for each step in the interactive
> transaction, whereas the timestamp-validating approach can use a LAN
> round-trip for each step besides the final one, and is also much simpler to
> implement.
>
>


-- 
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by "benedict@apache.org" <be...@apache.org>.

Essentially this, although I think in practice we will need to track each partition’s timestamp separately (or optionally for reduced conflicts, each row or datum’s), and make them all part of the conditional application of the transaction - at least for strict-serializability.

The alternative is to insert read/write intents for the transaction during each step, and to confirm they are still valid on commit, but this approach would require a WAN round-trip for each step in the interactive transaction, whereas the timestamp-validating approach can use a LAN round-trip for each step besides the final one, and is also much simpler to implement.
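
As a back-of-the-envelope illustration of the round-trip difference described
above, the following sketch compares the two approaches.  The function names
and the round-trip times used in the example are assumptions chosen for
illustration, not measurements.

```python
def intent_latency_ms(steps, wan_rtt_ms):
    # Intent-based approach: every step of the interactive transaction
    # pays a WAN round trip to insert its read/write intents.
    return steps * wan_rtt_ms

def timestamp_validation_latency_ms(steps, lan_rtt_ms, wan_rtt_ms):
    # Timestamp-validating approach: every step except the last is a LAN
    # round trip; the final step (validate and commit) is one WAN round trip.
    return (steps - 1) * lan_rtt_ms + wan_rtt_ms
```

With an assumed 100 ms WAN round trip and 1 ms LAN round trip, a five-step
transaction costs 500 ms with intents and 104 ms with timestamp validation.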


From: Blake Eggleston <be...@apple.com.INVALID>
Date: Thursday, 30 September 2021 at 05:47
To: dev@cassandra.apache.org <de...@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
You could establish a lower timestamp bound and buffer transaction state on the coordinator, then make the commit an operation that only applies if all partitions involved haven’t been changed by a more recent timestamp. You could also implement mvcc either in the storage layer or for some period of time by buffering commits on each replica before applying.

> On Sep 29, 2021, at 6:18 PM, Jonathan Ellis <jb...@gmail.com> wrote:
>
> How are interactive transactions possible with Accord?
>
>
>
> On Tue, Sep 21, 2021 at 11:56 PM benedict@apache.org <be...@apache.org>
> wrote:
>
>> Could you explain why you believe this trade-off is necessary? We can
>> support full SQL just fine with Accord, and I hope that we eventually do so.
>>
>> This domain is incredibly complex, so it is easy to reach wrong
>> conclusions. I would invite you again to propose a system for discussion
>> that you think offers something Accord is unable to, and that you consider
>> desirable, and we can work from there.
>>
>> To pre-empt some possible discussions, I am not aware of anything we
>> cannot do with Accord that we could do with either Calvin or Spanner.
>> Interactive transactions are possible on top of Accord, as are transactions
>> with an unknown read/write set. In each case the only cost is that they
>> would use optimistic concurrency control, which is no worse than Spanner
>> derivatives anyway (which I have to assume is your benchmark in this
>> regard). I do not expect to deliver either functionality initially, but
>> Accord takes us most of the way there for both.
>>
>>
>> From: Jonathan Ellis <jb...@gmail.com>
>> Date: Wednesday, 22 September 2021 at 05:36
>> To: dev <de...@cassandra.apache.org>
>> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
>> Right, I'm looking for exactly a discussion on the high level goals.
>> Instead of saying "here's the goals and we ruled out X because Y" we should
>> start with a discussion around, "Approach A allows X and W, approach B
>> allows Y and Z" and decide together what the goals should be and what
>> we are willing to trade to get those goals, e.g., are we willing to give up
>> global strict serializability to get the ability to support full SQL.  Both
>> of these are nice to have!
>>
>> On Tue, Sep 21, 2021 at 9:52 PM benedict@apache.org <be...@apache.org>
>> wrote:
>>
>>> Hi Jonathan,
>>>
>>> These other systems are incompatible with the goals of the CEP. I do
>>> discuss them (besides 2PC) in both the whitepaper and the CEP, and will
>>> summarise that discussion below. A true and accurate comparison of these
>>> other systems is essentially intractable, as there are complex subtleties
>>> to each flavour, and those who are interested would be better served by
>>> performing their own research.
>>>
>>> I think it is more productive to focus on what we want to achieve as a
>>> community. If you believe the goals of this CEP are wrong for the
>> project,
>>> let’s focus on that. If you want to compare and contrast specific facets
>> of
>>> alternative systems that you consider to be preferable in some dimension,
>>> let’s do that here or in a Q&A as proposed by Joey.
>>>
>>> The relevant goals are that we:
>>>
>>>
>>>  1.  Guarantee strict serializable isolation on commodity hardware
>>>  2.  Scale to any cluster size
>>>  3.  Achieve optimal latency
>>>
>>> The approach taken by Spanner derivatives is rejected by (1) because they
>>> guarantee only Serializable isolation (they additionally fail (3)). From
>>> watching talks by YugaByte, and inferring from Cockroach’s
>>> panic-cluster-death under clock skew, this is clearly considered by
>>> everyone to be undesirable but necessary to achieve scalability.
>>>
>>> The approach taken by FaunaDB (Calvin) is rejected by (2) because its
>>> sequencing layer requires a global leader process for the cluster, which
>> is
>>> incompatible with Cassandra’s scalability requirements. It additionally
>>> fails (3) for global clients.
>>>
>>> Two phase commit fails (3). As an aside, AFAICT DynamoDB is today a
>>> Spanner clone for its multi-key transaction functionality, not 2PC.
>>>
>>> Systems such as RAMP with even weaker isolation are not considered for
>> the
>>> simple reason that they do not even claim to meet (1).
>>>
>>> If we want to additionally offer weaker isolation levels than
>>> Serializable, such as that provided by the recent RAMP-TAO paper,
>> Cassandra
>>> is likely able to support multiple distinct transaction layers that
>> operate
>>> independently. I would encourage you to file a CEP to explore how we can
>>> meet these distinct use cases, but I consider them to be niche. I expect
>>> that a majority of our user base desire strict serializable isolation,
>> and
>>> certainly no less than serializable isolation, to augment the existing
>>> weaker isolation offered by quorum reads and writes.
>>>
>>> I would tangentially note that we are not an AP database under normal
>>> recommended operation. A minority in any network partition cannot reach
>>> QUORUM, so under recommended usage we are a high-availability leaderless
>> CP
>>> database.
>>>
>>>
>>> From: Jonathan Ellis <jb...@gmail.com>
>>> Date: Tuesday, 21 September 2021 at 23:45
>>> To: dev <de...@cassandra.apache.org>
>>> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
>>> Benedict, thanks for taking the lead in putting this together. Since
>>> Cassandra is the only relevant database today designed around a
>> leaderless
>>> architecture, it's quite likely that we'll be better served with a custom
>>> transaction design instead of trying to retrofit one from CP systems.
>>>
>>> The whitepaper here is a good description of the consensus algorithm
>> itself
>>> as well as its robustness and stability characteristics, and its
>> comparison
>>> with other state-of-the-art consensus algorithms is very useful.  In the
>>> context of Cassandra, where a consensus algorithm is only part of what
>> will
>>> be implemented, I'd like to see a more complete evaluation of the
>>> transactional side of things as well, including performance
>> characteristics
>>> as well as the types of transactions that can be supported and at least a
>>> general idea of what it would look like applied to Cassandra. This will
>>> allow the PMC to make a more informed decision about what tradeoffs are
>>> best for the entire long-term project of first supplementing and
>> ultimately
>>> replacing LWT.
>>>
>>> (Allowing users to mix LWT and AP Cassandra operations against the same
>>> rows was probably a mistake, so in contrast with LWT we’re not looking
>> for
>>> something fast enough for occasional use but rather something within a
>>> reasonable factor of AP operations, appropriate to being the only way to
>>> interact with tables declared as such.)
>>>
>>> Besides Accord, this should cover
>>>
>>> - Calvin and FaunaDB
>>> - A Spanner derivative (no opinion on whether that should be Cockroach or
>>> Yugabyte, I don’t think it’s necessary to cover both)
>>> - A 2PC implementation (the Accord paper mentions DynamoDB but I suspect
>>> there is more public information about MongoDB)
>>> - RAMP
>>>
>>> Here’s an example of what I mean:
>>>
>>> =Calvin=
>>>
>>> Approach: global consensus (Paxos in Calvin, Raft in FaunaDB) to order
>>> transactions, then replicas execute the transactions independently with
>> no
>>> further coordination.  No SPOF.  Transactions are batched by each
>> sequencer
>>> to keep this from becoming a bottleneck.
>>>
>>> Performance: Calvin paper (published 2012) reports linear scaling of
>> TPC-C
>>> New Order up to 500,000 transactions/s on 100 machines (EC2 XL machines
>>> with 7GB ram and 8 virtual cores).  Note that TPC-C New Order is composed
>>> of four reads and four writes, so this is effectively 2M reads and 2M
>>> writes as we normally measure them in C*.
>>>
>>> Calvin supports mixed read/write transactions, but because the
>> transaction
>>> execution logic requires knowing all partition keys in advance to ensure
>>> that all replicas can reproduce the same results with no coordination,
>>> reads against non-PK predicates must be done ahead of time
>> (transparently,
>>> by the server) to determine the set of keys, and this must be retried if
>>> the set of rows affected is updated before the actual transaction
>> executes.
>>>
>>> Batching and global consensus add latency -- 100ms in the Calvin paper
>> and
>>> apparently about 50ms in FaunaDB.  Glass half full: all transactions
>>> (including multi-partition updates) are equally performant in Calvin
>> since
>>> the coordination is handled up front in the sequencing step.  Glass half
>>> empty: even single-row reads and writes have to pay the full coordination
>>> cost.  Fauna has optimized this away for reads but I am not aware of a
>>> description of how they changed the design to allow this.
>>>
>>> Functionality and limitations: since the entire transaction must be known
>>> in advance to allow coordination-less execution at the replicas, Calvin
>>> cannot support interactive transactions at all.  FaunaDB mitigates this
>> by
>>> allowing server-side logic to be included, but a Calvin approach will
>> never
>>> be able to offer SQL compatibility.
>>>
>>> Guarantees: Calvin transactions are strictly serializable.  There is no
>>> additional complexity or performance hit to generalizing to multiple
>>> regions, apart from the speed of light.  And since Calvin is already
>> paying
>>> a batching latency penalty, this is less painful than for other systems.
>>>
>>> Application to Cassandra: B-.  Distributed transactions are handled by
>> the
>>> sequencing and scheduling layers, which are leaderless, and Calvin’s
>>> requirements for the storage layer are easily met by C*.  But Calvin also
>>> requires a global consensus protocol and LWT is almost certainly not
>>> sufficiently performant, so this would require ZK or etcd (reasonable
>> for a
>>> library approach but not for replacing LWT in C* itself), or an
>>> implementation of Accord.  I don’t believe Calvin would require
>> additional
>>> table-level metadata in Cassandra.
>>>
>>> On Sun, Sep 5, 2021 at 9:33 AM benedict@apache.org <be...@apache.org>
>>> wrote:
>>>
>>>> Wiki:
>>>>
>>>
>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-15%3A+General+Purpose+Transactions
>>>> Whitepaper:
>>>>
>>>
>> https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf
>>>> <
>>>>
>>>
>> https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf?version=1&modificationDate=1630847736966&api=v2
>>>>>
>>>> Prototype: https://github.com/belliottsmith/accord
>>>>
>>>> Hi everyone, I’d like to propose this CEP for adoption by the
>> community.
>>>>
>>>> Cassandra has benefitted from LWTs for many years, but application
>>>> developers that want to ensure consistency for complex operations must
>>>> either accept the scalability bottleneck of serializing all related
>> state
>>>> through a single partition, or layer a complex state machine on top of
>>> the
>>>> database. These are sophisticated and costly activities that our users
>>>> should not be expected to undertake. Since distributed databases are
>>>> beginning to offer distributed transactions with fewer caveats, it is
>>> past
>>>> time for Cassandra to do so as well.
>>>>
>>>> This CEP proposes the use of several novel techniques that build upon
>>>> research (that followed EPaxos) to deliver (non-interactive) general
>>>> purpose distributed transactions. The approach is outlined in the
>>> wikipage
>>>> and in more detail in the linked whitepaper. Importantly, by adopting
>>> this
>>>> approach we will be the _only_ distributed database to offer global,
>>>> scalable, strict serializable transactions in one wide area round-trip.
>>>> This would represent a significant improvement in the state of the art,
>>>> both in the academic literature and in commercial or open source
>>> offerings.
>>>>
>>>> This work has been partially realised in a prototype. This partial
>>>> prototype has been verified against Jepsen.io’s Maelstrom library and
>>>> dedicated in-tree strict serializability verification tools, but much
>>> work
>>>> remains for the work to be production capable and integrated into
>>> Cassandra.
>>>>
>>>> I propose including the prototype in the project as a new source
>>>> repository, to be developed as a standalone library for integration
>> into
>>>> Cassandra. I hope the community sees the important value proposition of
>>>> this proposal, and will adopt the CEP after this discussion, so that
>> the
>>>> library and its integration into Cassandra can be developed in parallel
>>> and
>>>> with the involvement of the wider community.
>>>>
>>>
>>>

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by Blake Eggleston <be...@apple.com.INVALID>.

You could establish a lower timestamp bound and buffer transaction state on the coordinator, then make the commit an operation that only applies if all partitions involved haven’t been changed by a more recent timestamp. You could also implement mvcc either in the storage layer or for some period of time by buffering commits on each replica before applying.
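
A minimal single-process sketch of the first idea, assuming a toy in-memory
store with one logical clock.  The class and method names (Store,
InteractiveTxn, commit_if_unchanged) are hypothetical and not part of Accord or
Cassandra; in a real system the conditional commit would itself be the
transaction that Accord orders.

```python
import itertools

class Store:
    """Toy store: each partition key maps to (value, last_write_timestamp)."""
    def __init__(self):
        self.partitions = {}
        self._clock = itertools.count(1)

    def now(self):
        return next(self._clock)

    def read(self, key):
        return self.partitions.get(key, (None, 0))[0]

    def commit_if_unchanged(self, lower_bound, touched, writes):
        # Apply only if no touched partition was written after the lower bound.
        for key in touched:
            if self.partitions.get(key, (None, 0))[1] > lower_bound:
                return False
        ts = self.now()
        for key, value in writes.items():
            self.partitions[key] = (value, ts)
        return True

class InteractiveTxn:
    """Coordinator-side state: lower timestamp bound plus buffered writes."""
    def __init__(self, store):
        self.store = store
        self.lower_bound = store.now()  # established when the transaction starts
        self.touched = set()
        self.buffered = {}

    def read(self, key):
        self.touched.add(key)
        # Read your own buffered writes first, otherwise read the store.
        if key in self.buffered:
            return self.buffered[key]
        return self.store.read(key)

    def write(self, key, value):
        self.touched.add(key)
        self.buffered[key] = value  # nothing reaches the store until commit

    def commit(self):
        return self.store.commit_if_unchanged(
            self.lower_bound, self.touched, self.buffered)
```

If two transactions start, both read and write the same key, and the first
commits, the second's commit returns False and the client must retry with a new
lower bound.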

> On Sep 29, 2021, at 6:18 PM, Jonathan Ellis <jb...@gmail.com> wrote:
> 
> How are interactive transactions possible with Accord?
> 



Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by Jonathan Ellis <jb...@gmail.com>.

How are interactive transactions possible with Accord?



On Tue, Sep 21, 2021 at 11:56 PM benedict@apache.org <be...@apache.org>
wrote:

> Could you explain why you believe this trade-off is necessary? We can
> support full SQL just fine with Accord, and I hope that we eventually do so.
>
> This domain is incredibly complex, so it is easy to reach wrong
> conclusions. I would invite you again to propose a system for discussion
> that you think offers something Accord is unable to, and that you consider
> desirable, and we can work from there.
>
> To pre-empt some possible discussions, I am not aware of anything we
> cannot do with Accord that we could do with either Calvin or Spanner.
> Interactive transactions are possible on top of Accord, as are transactions
> with an unknown read/write set. In each case the only cost is that they
> would use optimistic concurrency control, which is no worse than Spanner
> derivatives anyway (which I have to assume is your benchmark in this
> regard). I do not expect to deliver either functionality initially, but
> Accord takes us most of the way there for both.
>
>
> From: Jonathan Ellis <jb...@gmail.com>
> Date: Wednesday, 22 September 2021 at 05:36
> To: dev <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> Right, I'm looking for exactly a discussion on the high level goals.
> Instead of saying "here's the goals and we ruled out X because Y" we should
> start with a discussion around, "Approach A allows X and W, approach B
> allows Y and Z" and decide together what the goals should be and what
> we are willing to trade to get those goals, e.g., are we willing to give up
> global strict serializability to get the ability to support full SQL.  Both
> of these are nice to have!
>
> On Tue, Sep 21, 2021 at 9:52 PM benedict@apache.org <be...@apache.org>
> wrote:
>
> > Hi Jonathan,
> >
> > These other systems are incompatible with the goals of the CEP. I do
> > discuss them (besides 2PC) in both the whitepaper and the CEP, and will
> > summarise that discussion below. A true and accurate comparison of these
> > other systems is essentially intractable, as there are complex subtleties
> > to each flavour, and those who are interested would be better served by
> > performing their own research.
> >
> > I think it is more productive to focus on what we want to achieve as a
> > community. If you believe the goals of this CEP are wrong for the
> project,
> > let’s focus on that. If you want to compare and contrast specific facets
> of
> > alternative systems that you consider to be preferable in some dimension,
> > let’s do that here or in a Q&A as proposed by Joey.
> >
> > The relevant goals are that we:
> >
> >
> >   1.  Guarantee strict serializable isolation on commodity hardware
> >   2.  Scale to any cluster size
> >   3.  Achieve optimal latency
> >
> > The approach taken by Spanner derivatives is rejected by (1) because they
> > guarantee only Serializable isolation (they additionally fail (3)). From
> > watching talks by YugaByte, and inferring from Cockroach’s
> > panic-cluster-death under clock skew, this is clearly considered by
> > everyone to be undesirable but necessary to achieve scalability.
> >
> > The approach taken by FaunaDB (Calvin) is rejected by (2) because its
> > sequencing layer requires a global leader process for the cluster, which
> is
> > incompatible with Cassandra’s scalability requirements. It additionally
> > fails (3) for global clients.
> >
> > Two phase commit fails (3). As an aside, AFAICT DynamoDB is today a
> > Spanner clone for its multi-key transaction functionality, not 2PC.
> >
> > Systems such as RAMP with even weaker isolation are not considered for
> the
> > simple reason that they do not even claim to meet (1).
> >
> > If we want to additionally offer weaker isolation levels than
> > Serializable, such as that provided by the recent RAMP-TAO paper,
> Cassandra
> > is likely able to support multiple distinct transaction layers that
> operate
> > independently. I would encourage you to file a CEP to explore how we can
> > meet these distinct use cases, but I consider them to be niche. I expect
> > that a majority of our user base desire strict serializable isolation,
> and
> > certainly no less than serializable isolation, to augment the existing
> > weaker isolation offered by quorum reads and writes.
> >
> > I would tangentially note that we are not an AP database under normal
> > recommended operation. A minority in any network partition cannot reach
> > QUORUM, so under recommended usage we are a high-availability leaderless
> CP
> > database.
> >
> >
> > From: Jonathan Ellis <jb...@gmail.com>
> > Date: Tuesday, 21 September 2021 at 23:45
> > To: dev <de...@cassandra.apache.org>
> > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > Benedict, thanks for taking the lead in putting this together. Since
> > Cassandra is the only relevant database today designed around a
> leaderless
> > architecture, it's quite likely that we'll be better served with a custom
> > transaction design instead of trying to retrofit one from CP systems.
> >
> > The whitepaper here is a good description of the consensus algorithm
> itself
> > as well as its robustness and stability characteristics, and its
> comparison
> > with other state-of-the-art consensus algorithms is very useful.  In the
> > context of Cassandra, where a consensus algorithm is only part of what
> will
> > be implemented, I'd like to see a more complete evaluation of the
> > transactional side of things as well, including performance
> characteristics
> > as well as the types of transactions that can be supported and at least a
> > general idea of what it would look like applied to Cassandra. This will
> > allow the PMC to make a more informed decision about what tradeoffs are
> > best for the entire long-term project of first supplementing and
> ultimately
> > replacing LWT.
> >
> > (Allowing users to mix LWT and AP Cassandra operations against the same
> > rows was probably a mistake, so in contrast with LWT we’re not looking
> for
> > something fast enough for occasional use but rather something within a
> > reasonable factor of AP operations, appropriate to being the only way to
> > interact with tables declared as such.)
> >
> > Besides Accord, this should cover
> >
> > - Calvin and FaunaDB
> > - A Spanner derivative (no opinion on whether that should be Cockroach or
> > Yugabyte, I don’t think it’s necessary to cover both)
> > - A 2PC implementation (the Accord paper mentions DynamoDB but I suspect
> > there is more public information about MongoDB)
> > - RAMP
> >
> > Here’s an example of what I mean:
> >
> > =Calvin=
> >
> > Approach: global consensus (Paxos in Calvin, Raft in FaunaDB) to order
> > transactions, then replicas execute the transactions independently with
> no
> > further coordination.  No SPOF.  Transactions are batched by each
> sequencer
> > to keep this from becoming a bottleneck.
> >
> > Performance: Calvin paper (published 2012) reports linear scaling of
> TPC-C
> > New Order up to 500,000 transactions/s on 100 machines (EC2 XL machines
> > with 7GB ram and 8 virtual cores).  Note that TPC-C New Order is composed
> > of four reads and four writes, so this is effectively 2M reads and 2M
> > writes as we normally measure them in C*.
> >
> > Calvin supports mixed read/write transactions, but because the
> transaction
> > execution logic requires knowing all partition keys in advance to ensure
> > that all replicas can reproduce the same results with no coordination,
> > reads against non-PK predicates must be done ahead of time
> (transparently,
> > by the server) to determine the set of keys, and this must be retried if
> > the set of rows affected is updated before the actual transaction
> executes.
> >
> > Batching and global consensus adds latency -- 100ms in the Calvin paper
> and
> > apparently about 50ms in FaunaDB.  Glass half full: all transactions
> > (including multi-partition updates) are equally performant in Calvin
> since
> > the coordination is handled up front in the sequencing step.  Glass half
> > empty: even single-row reads and writes have to pay the full coordination
> > cost.  Fauna has optimized this away for reads but I am not aware of a
> > description of how they changed the design to allow this.
> >
> > Functionality and limitations: since the entire transaction must be known
> > in advance to allow coordination-less execution at the replicas, Calvin
> > cannot support interactive transactions at all.  FaunaDB mitigates this
> by
> > allowing server-side logic to be included, but a Calvin approach will
> never
> > be able to offer SQL compatibility.
> >
> > Guarantees: Calvin transactions are strictly serializable.  There is no
> > additional complexity or performance hit to generalizing to multiple
> > regions, apart from the speed of light.  And since Calvin is already
> paying
> > a batching latency penalty, this is less painful than for other systems.
> >
> > Application to Cassandra: B-.  Distributed transactions are handled by
> the
> > sequencing and scheduling layers, which are leaderless, and Calvin’s
> > requirements for the storage layer are easily met by C*.  But Calvin also
> > requires a global consensus protocol and LWT is almost certainly not
> > sufficiently performant, so this would require ZK or etcd (reasonable
> for a
> > library approach but not for replacing LWT in C* itself), or an
> > implementation of Accord.  I don’t believe Calvin would require
> additional
> > table-level metadata in Cassandra.
> >
> > On Sun, Sep 5, 2021 at 9:33 AM benedict@apache.org <be...@apache.org>
> > wrote:
> >
> > > Wiki:
> > >
> >
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-15%3A+General+Purpose+Transactions
> > > Whitepaper:
> > >
> >
> https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf
> > > <
> > >
> >
> https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf?version=1&modificationDate=1630847736966&api=v2
> > > >
> > > Prototype: https://github.com/belliottsmith/accord
> > >
> > > Hi everyone, I’d like to propose this CEP for adoption by the
> community.
> > >
> > > Cassandra has benefitted from LWTs for many years, but application
> > > developers that want to ensure consistency for complex operations must
> > > either accept the scalability bottleneck of serializing all related
> state
> > > through a single partition, or layer a complex state machine on top of
> > the
> > > database. These are sophisticated and costly activities that our users
> > > should not be expected to undertake. Since distributed databases are
> > > beginning to offer distributed transactions with fewer caveats, it is
> > past
> > > time for Cassandra to do so as well.
> > >
> > > This CEP proposes the use of several novel techniques that build upon
> > > research (that followed EPaxos) to deliver (non-interactive) general
> > > purpose distributed transactions. The approach is outlined in the
> > wikipage
> > > and in more detail in the linked whitepaper. Importantly, by adopting
> > this
> > > approach we will be the _only_ distributed database to offer global,
> > > scalable, strict serializable transactions in one wide area round-trip.
> > > This would represent a significant improvement in the state of the art,
> > > both in the academic literature and in commercial or open source
> > offerings.
> > >
> > > This work has been partially realised in a prototype. This partial
> > > prototype has been verified against Jepsen.io’s Maelstrom library and
> > > dedicated in-tree strict serializability verification tools, but much
> > work
> > > remains for the work to be production capable and integrated into
> > Cassandra.
> > >
> > > I propose including the prototype in the project as a new source
> > > repository, to be developed as a standalone library for integration
> into
> > > Cassandra. I hope the community sees the important value proposition of
> > > this proposal, and will adopt the CEP after this discussion, so that
> the
> > > library and its integration into Cassandra can be developed in parallel
> > and
> > > with the involvement of the wider community.
> > >
> >
> >
> > --
> > Jonathan Ellis
> > co-founder, http://www.datastax.com
> > @spyced
> >
>
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced
>


-- 
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by Henrik Ingo <he...@datastax.com>.

On Fri, Oct 1, 2021 at 4:37 PM Henrik Ingo <he...@datastax.com> wrote:

> A known optimization for the hot rows problem is to "hint" or manually
> force clients to direct all updates to the hot row to the same node,
> essentially making the system leader based. This allows the database to
> start processing new updates even while the first one is still committing.
> (See Galera for an example implementing this
> <https://galeracluster.com/library/documentation/using-sr.html#usr-hot-records>.)
> This makes me wonder whether there is a similar optimization for Accord
> where transactions from the same coordinator can be allowed to commit
> within the SkewMax window, because we can assume that the trx timestamps
> originating at the same coordinator cannot arrive out of order when using
> TPC?
>
>
TCP

-- 

Henrik Ingo

+358 40 569 7354 <358405697354>


Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by Paulo Motta <pa...@gmail.com>.

I don’t have any objection to calling a vote. I think we have had ample time
to discuss, and I’m satisfied with the clarifications to my questions.

Thanks Benedict, Blake and Scott for detailing the proposal and answering
questions.

I think everyone is excited and looking forward to this groundbreaking work
that will enable the next generation of features and improvements in
Cassandra! :-)

On Mon, 4 Oct 2021 at 03:03 benedict@apache.org <be...@apache.org> wrote:

> Hi everyone,
>
> It’s been a month since I brought this proposal forward. I think we’re
> ready for a vote, and I’d like to get a show of hands to see if others
> agree.
>
> I don’t intend for this to curtail any further questions or suggestions.
> I’m grateful for the continued healthy discussion, but from my point of
> view the topics we are now covering are not core to the proposal’s adoption.
>
> If anyone thinks this proposal is not ready for a vote, I would really
> appreciate it if that sentiment could be accompanied by a brief statement
> of what is wrong with the substance of the proposal, so that we can address
> these issues directly to move things forward.
>
> Thanks!
>
>

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by "benedict@apache.org" <be...@apache.org>.

Hi everyone,

It’s been a month since I brought this proposal forward. I think we’re ready for a vote, and I’d like to get a show of hands to see if others agree.

I don’t intend for this to curtail any further questions or suggestions. I’m grateful for the continued healthy discussion, but from my point of view the topics we are now covering are not core to the proposal’s adoption.

If anyone thinks this proposal is not ready for a vote, I would really appreciate it if that sentiment could be accompanied by a brief statement of what is wrong with the substance of the proposal, so that we can address these issues directly to move things forward.

Thanks!

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by Henrik Ingo <he...@datastax.com>.

On Fri, Oct 1, 2021 at 7:20 PM benedict@apache.org <be...@apache.org>
wrote:

> I haven’t encountered Galera – do you have any technical papers to hand?
>
>
Yes, but it's a whole thesis :-)

https://www.inf.usi.ch/faculty/pedone/Paper/199x/These-2090-Pedone.pdf

I guess parts of that were presented in conference papers.

Pedone's work implements a protocol with Snapshot Isolation. More recent
work from down under describes a similar system providing Serializable
Snapshot Isolation:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.228.185&rep=rep1&type=pdf

The best known implementation of Pedone's work would be Galera Cluster,
which hooks the "Galera" replication library into MySQL. It's also included
with MariaDB Cluster and Percona XtraDB Cluster. Oracle later did an
independent implementation (for IPR ownership reasons) which is known as
InnoDB Cluster.

This page in the Galera docs has a great diagram to get you started:
https://galeracluster.com/library/documentation/certification-based-replication.html

For an end user oriented beginner lecture, search conference video
recordings for Seppo Jaakola:
https://www.youtube.com/watch?v=5e3unwy_OVs

Worth calling out that we are in RDBMS land now, and the above is just a
replication solution: there is no sharding anywhere. For the Serializable
paper, I struggle to even imagine how it could scale to multiple shards.
For SI it's kind of easier as only write conflicts need to be checked.
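
To make "only write conflicts need to be checked" concrete, here is a small Python sketch of a first-committer-wins check over write sets. It is a simplification for illustration, not Galera's implementation: the Certifier class, its method names and the use of a single sequence number as both snapshot and commit position are all assumptions.

```python
class Certifier:
    """Hypothetical write-set certifier for Snapshot Isolation.

    Write sets are assumed to arrive in one agreed total order, so every
    node running this check reaches the same commit/abort decision.
    """

    def __init__(self):
        self.seqno = 0        # position of the last committed write set
        self.last_write = {}  # key -> seqno of the last committed write

    def certify(self, snapshot_seqno, write_keys):
        # Abort if any key in the write set was written by a transaction
        # that committed after this transaction took its snapshot.
        for key in write_keys:
            if self.last_write.get(key, 0) > snapshot_seqno:
                return None
        # No write-write conflict: commit and record the new versions.
        self.seqno += 1
        for key in write_keys:
            self.last_write[key] = self.seqno
        return self.seqno
```

Note that reads never enter the check, which is why this is Snapshot Isolation rather than serializability; a serializable variant would also have to track what each transaction read.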

henrik

-- 

Henrik Ingo

+358 40 569 7354 <358405697354>


Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by "benedict@apache.org" <be...@apache.org>.

> If I'm reading you correctly, then Accord does / could do exactly what I was asking for: two round trips in a single DC cluster, and one roundtrip + SkewMax when network roundtrips are >> SkewMax.

Yes, in fact it’s even better than that. Even in this setup *most* transactions will still take only one round-trip, and in the worst case (under conflicts) two round-trips.

> assuming I got it correct...

As far as I can tell your understanding is correct, yes - though worth noting of course that the WAN round-trip on write is asynchronous.

I haven’t encountered Galera – do you have any technical papers to hand?

From: Henrik Ingo <he...@datastax.com>
Date: Friday, 1 October 2021 at 16:24
To: dev@cassandra.apache.org <de...@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
On Fri, Oct 1, 2021 at 5:30 PM benedict@apache.org <be...@apache.org>
wrote:

> > Typical value for SkewMax in e.g. the Spanner paper, some CockroachDB
> discussions = 7 ms
>
> I think skew max is likely to be much lower than this, even on commodity
> hardware. Bear in mind that unlike Cockroach and Spanner correctness does
> not depend on this value, only performance. So we can pick the real number,
> not some p100 outlier value.
>
> Also bear in mind that this is an optimisation. In clusters where it makes
> no sense we can simply use the raw protocol and accept transactions will
> very infrequently take two round-trips (which is fine, because in this
> scenario round-trips are cheap).
>
>
Oh, this was not at all obvious :-D

If I'm reading you correctly, then Accord does / could do exactly what I
was asking for: two round trips in a single DC cluster, and one roundtrip +
SkewMax when network roundtrips are >> SkewMax.



> > A known optimization for the hot rows problem is to "hint" or manually
> force clients to direct all updates to the hot row to the same node
>
> So, with a leaderless protocol like Accord the ordering decisions are
> never really bottlenecked - no matter how many are in-flight, a new
> transaction will experience no additional latency determining its execution
> order. The only bottleneck will be execution. For this it is absolutely
> possible to funnel everything to a single coordinator, but I don’t know
> that this would in practice achieve much – the important bottleneck would
> be that the coordinators are all within the same DC, so that the
> _replicas_ may all respond to them with their data dependencies with
> minimal delay. This is something we discussed in the
> ApacheCon call as it happens. If a significant number of transactions are
> pending, and they are in different DCs, it would be quite straightforward
> to nominate a coordinator within the DC serving the majority of operations
> to serve the remainder, and to forward the results to the original
> coordinators.
>
>
Thanks for explaining. This is really interesting. I now reread section 2.2
of the paper and realize it says exactly this.

So in Accord:

Step 1: One network round trip + SkewMax to establish a global ordering.

Step 2: a) One (local) network round trip for read phase, One (wan) round
trip for writes.
             b) In addition, before either reading or writing, the node
must first commit and apply all previous transactions that are in the
"deps" set of this transaction.

In addition, if we implement interactive transactions, or support for
secondary indexes, or other "complex" transactions, then that work would
happen before Step 1.

Ok, now that I spelled this out... assuming I got it correct... Then this
actually resembles Galera more than Spanner. The wall clock time is not
actually the transaction id, it's just a step in the consensus dialogue
where nodes agree on a global ordering.



> I don’t anticipate this optimisation being a high priority until we have
> user reports of this bottleneck in the wild, however. Since clients for
> many workloads will naturally be geo-partitioned so that related state is
> being updated from the same region, it might simply not be needed – at
> least any time soon.
>
>
For sure. I think we're all just trying to understand the landscape of what we
are talking about here, not trying to say everything should be implemented
in v1.


henrik

--

Henrik Ingo

+358 40 569 7354 <358405697354>


Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by Henrik Ingo <he...@datastax.com>.

On Fri, Oct 1, 2021 at 5:30 PM benedict@apache.org <be...@apache.org>
wrote:

> > Typical value for SkewMax in e.g. the Spanner paper, some CockroachDB
> discussions = 7 ms
>
> I think skew max is likely to be much lower than this, even on commodity
> hardware. Bear in mind that unlike Cockroach and Spanner correctness does
> not depend on this value, only performance. So we can pick the real number,
> not some p100 outlier value.
>
> Also bear in mind that this is an optimisation. In clusters where it makes
> no sense we can simply use the raw protocol and accept transactions will
> very infrequently take two round-trips (which is fine, because in this
> scenario round-trips are cheap).
>
>
Oh, this was not at all obvious :-D

If I'm reading you correctly, then Accord does / could do exactly what I
was asking for: two round trips in a single DC cluster, and one roundtrip +
SkewMax when network roundtrips are >> SkewMax.



> > A known optimization for the hot rows problem is to "hint" or manually
> force clients to direct all updates to the hot row to the same node
>
> So, with a leaderless protocol like Accord the ordering decisions are
> never really bottlenecked - no matter how many are in-flight, a new
> transaction will experience no additional latency determining its execution
> order. The only bottleneck will be execution. For this it is absolutely
> possible to funnel everything to a single coordinator, but I don’t know
> that this would in practice achieve much – the important bottleneck would
> be that the coordinators are all within the same DC, so that the
> _replicas_ may all respond to them with their data dependencies with
> minimal delay. This is something we discussed in the
> ApacheCon call as it happens. If a significant number of transactions are
> pending, and they are in different DCs, it would be quite straightforward
> to nominate a coordinator within the DC serving the majority of operations
> to serve the remainder, and to forward the results to the original
> coordinators.
>
>
Thanks for explaining. This is really interesting. I now reread section 2.2
of the paper and realize it says exactly this.

So in Accord:

Step 1: One network round trip + SkewMax to establish a global ordering.

Step 2: a) One (local) network round trip for read phase, One (wan) round
trip for writes.
             b) In addition, before either reading or writing, the node
must first commit and apply all previous transactions that are in the
"deps" set of this transaction.

In addition, if we implement interactive transactions, or support for
secondary indexes, or other "complex" transactions, then that work would
happen before Step 1.

Ok, now that I spelled this out... assuming I got it correct... Then this
actually resembles Galera more than Spanner. The wall clock time is not
actually the transaction id, it's just a step in the consensus dialogue
where nodes agree on a global ordering.



> I don’t anticipate this optimisation being a high priority until we have
> user reports of this bottleneck in the wild, however. Since clients for
> many workloads will naturally be geo-partitioned so that related state is
> being updated from the same region, it might simply not be needed – at
> least any time soon.
>
>
For sure. I think we're all just trying to understand the landscape of what we
are talking about here, not trying to say everything should be implemented
in v1.


henrik

-- 

Henrik Ingo

+358 40 569 7354 <358405697354>


Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by "benedict@apache.org" <be...@apache.org>.

Hi Henrik,

> While I understand they are out of scope, do you happen to have already some idea what it would require to support secondary indexes?

Yes, it is likely that the approach will be the same taken by Calvin-like systems where a “reconnaissance” round is taken within the local DC to construct a transaction involving the secondary index. This would be the reverse if reading from a secondary index, where the primary keys would be determined via a reconnaissance round and the transaction updated to include them.

If we choose to implement one of the more sophisticated interactive transaction proposals then it would of course be possible to implement secondary indexes on top of these.

Note that all of this is entirely independent of SAI – since these indexes are built per-partition they will be easily transactional within a partition key, or probably never transactional if you perform a scatter gather across the whole cluster. I’m not sufficiently well versed in SAI to really consider this well as yet, and I will update the CEP to note that they are out of scope.
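
To make the reconnaissance idea above a little more concrete, here is a minimal sketch. It is a toy in-memory model, not Accord or Cassandra code: the `Store`, `Txn`, `reconnaissance`, `execute` and `run` names are all hypothetical, and the retry-on-change check is an assumption borrowed from how Calvin-like systems are usually described.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Txn:
    keys: frozenset    # primary keys declared upfront for the transaction
    index_value: str   # secondary index value the keys were derived from


class Store:
    """Toy table with a secondary index on 'city' (hypothetical)."""

    def __init__(self, rows):
        self.rows = dict(rows)  # primary key -> city

    def lookup_index(self, city):
        return frozenset(pk for pk, c in self.rows.items() if c == city)


def reconnaissance(store, city):
    """Local, non-transactional read that discovers the primary keys."""
    return Txn(keys=store.lookup_index(city), index_value=city)


def execute(store, txn):
    """Stand-in for the transactional step: re-check the index and
    abort (return None) if the key set changed since reconnaissance."""
    if store.lookup_index(txn.index_value) != txn.keys:
        return None
    return sorted(txn.keys)


def run(store, city, max_attempts=3, interfere=None):
    """Retry with a fresh reconnaissance round whenever validation fails."""
    for attempt in range(max_attempts):
        txn = reconnaissance(store, city)
        if interfere is not None and attempt == 0:
            interfere(store)  # simulate a concurrent write after reconnaissance
        result = execute(store, txn)
        if result is not None:
            return result, attempt + 1
    raise RuntimeError("too many retries")


if __name__ == "__main__":
    store = Store({1: "Helsinki", 2: "London", 3: "Helsinki"})
    print(run(store, "Helsinki"))
    print(run(store, "Helsinki",
              interfere=lambda s: s.rows.update({4: "Helsinki"})))
```

The point of the sketch is only the shape of the flow: the keys are discovered cheaply and locally, the transaction is then submitted with those keys declared upfront, and a changed index simply costs another attempt.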

> Typical value for SkewMax in e.g. the Spanner paper, some CockroachDB discussions = 7 ms

I think skew max is likely to be much lower than this, even on commodity hardware. Bear in mind that unlike Cockroach and Spanner correctness does not depend on this value, only performance. So we can pick the real number, not some p100 outlier value.

Also bear in mind that this is an optimisation. In clusters where it makes no sense we can simply use the raw protocol and accept transactions will very infrequently take two round-trips (which is fine, because in this scenario round-trips are cheap).

> A known optimization for the hot rows problem is to "hint" or manually force clients to direct all updates to the hot row to the same node

So, with a leaderless protocol like Accord the ordering decisions are never really bottlenecked - no matter how many are in-flight, a new transaction will experience no additional latency determining its execution order. The only bottleneck will be execution. For this it is absolutely possible to funnel everything to a single coordinator, but I don’t know that this would in practice achieve much – the important bottleneck would be that the coordinators are all within the same DC, so that the _replicas_ may all respond to them with their data dependencies with minimal delay. This is something we discussed in the ApacheCon call as it happens. If a significant number of transactions are pending, and they are in different DCs, it would be quite straightforward to nominate a coordinator within the DC serving the majority of operations to serve the remainder, and to forward the results to the original coordinators.

I don’t anticipate this optimisation being a high priority until we have user reports of this bottleneck in the wild, however. Since clients for many workloads will naturally be geo-partitioned so that related state is being updated from the same region, it might simply not be needed – at least any time soon.

From: Henrik Ingo <he...@datastax.com>
Date: Friday, 1 October 2021 at 14:38
To: dev@cassandra.apache.org <de...@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
Hi Benedict

Since you asked, I reviewed the thread a bit and found this...

*secondary indexes*

>> What I would like to understand better and without guessing is, what do
these transactions look like from a client/user point of view?

> This is a fair question, and perhaps something I should pinpoint more
directly for the reader. The CEP does stipulate non-interactive
transactions, i.e. those that are one-shot. The only other limitation is
that the partition keys must be known upfront, however I expect we will
follow-up soon after with some weaker semantics that build on top (probably
using optimistic concurrency control) to support transactions where only
some partition keys are known upfront, so that we may support global
secondary indexes with proper isolation and consistency.

The CEP doesn't actually mention lack of support for secondary index
queries. Probably good to add as a limitation. (I realize currently using
secondary indexes isn't mainstream in Cassandra anyway, but with SASI in
4.0 and SAI being a separate CEP in discussion, it's good to call out
Accord wouldn't automatically support them.)

While I understand they are out of scope, do you happen to have already
some idea what it would require to support secondary indexes? Is it
sufficient to just include the secondary index keys (or a range of such) in
the "deps" of the transaction? Of course, still needing to also include the
partitions or rows actually read as a result of scanning the secondary
index. Similarly then for mutations, deps would have to include changes to
index keys in the transaction?

*commit latency*

A topic on some off-list discussions has been to understand the
implications of using a Spanner-inspired approach where the clock skew
between cluster nodes is a necessary part of the commit latency:

Deadline(t0, C, P) = t0 + SkewMax + max(Latency(C′, P) | C′ ∈ C) − Latency(C, P)

In the white paper you even explicitly mention the trade off you have
chosen: *"This technique trades wide area round-trips for an additional
latency penalty equal to the bounds on clock synchrony."*

If we try to quantify what this trade off means in practice, I get:

Typical value for SkewMax in e.g. the Spanner paper, some CockroachDB
discussions = 7 ms. Maybe 10 - 20 ms if you don't have Google-level
hardware.
Common network latencies in a globally distributed cluster:
US West - East = 60 ms
US East - EU Central = 100 ms
US/EU to APAC, Africa, LATAM = 100-200 ms
Source: https://www.cloudping.co/grid

The conclusion is that this tradeoff definitely makes sense for globally
distributed transactions. This resembles QUORUM writes in current Cassandra.

However, users commonly prefer LOCAL_QUORUM in current Cassandra. I read
that this was discussed in the phone call, but haven't read about a
specific proposal. Just for the sake of completing my math, let's assume
that some LOCAL_QUORUM style Accord commit is invented. A naive example
could be to simply deploy a Cassandra cluster *with Accord transactions* in
a single geographical region, and other geographical regions would be
served by some external replication mechanism and would have to be
read-only.

Whatever the (hypothetical) solution, for LOCAL_QUORUM style or just single
region commits we end up with:

Typical SkewMax = 7 - 20 ms
Network latency < 1 ms.

It seems the SkewMax is quite high for a cluster deployed in a single
region, and what's worse there's no way to avoid it or make it much smaller
than 7 ms?

The only solution that comes to mind while writing this is to design Accord
to be pluggable such that the consensus part could be switched to something
that uses a logical clock for the transaction id. The user would choose one
or the other depending on what they optimize for.

I'll finish with a few notes:

Commit latency in itself isn't categorically bad for performance. I've
worked with several implementations of distributed databases that provide
good throughput even when a single write has high latency due to
geography/speed of light.

However, the duration of a commit is the window during which other
transactions may conflict with the committing transaction. Thus commit
latency will either increase the likelihood of aborted transactions, or in
other concurrency mechanisms block and impose a max throughput for hot rows.

A known optimization for the hot rows problem is to "hint" or manually
force clients to direct all updates to the hot row to the same node,
essentially making the system leader based. This allows the database to
start processing new updates even while the first one is still committing.
(See Galera for an example implementing this
<https://galeracluster.com/library/documentation/using-sr.html#usr-hot-records>.)
This makes me wonder whether there is a similar optimization for Accord
where transactions from the same coordinator can be allowed to commit
within the SkewMax window, because we can assume that the trx timestamps
originating at the same coordinator cannot arrive out of order when using
TPC?

henrik

On Mon, Sep 27, 2021 at 11:59 PM benedict@apache.org <be...@apache.org>
wrote:

> Ok, it’s time for the weekly poking of the hornet’s nest.
>
> Any more thoughts, questions or criticisms, anyone?
>
> From: benedict@apache.org <be...@apache.org>
> Date: Friday, 24 September 2021 at 22:41
> To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> I’m not aware of anybody having taken any notes, but somebody please chime
> in if I’m wrong.
>
> From my recollection, re Accord:
>
>
>   *   Q: Will batches now support rollbacks?
>      *   Batches would apply atomically or not, but unlikely to have a
> concept of rollback. Timeouts remain unknown, but hope to have some
> mechanism to provide clients a definitive answer about such transactions
> after the fact.
>   *   Q: Can stale replicas participate in transactions?
>      *   Accord applies conflicting transactions in-order at every
> replica, so only nodes that are up-to-date may participate in the execution
> of a transaction, but any replica may participate in agreeing a
> transaction. To ensure replicas remain up-to-date I anticipate introducing
> a real-time repair facility at the transactional message level, with peers
> reconciling recently processed messages and cross-delivering any that are
> missing.
>   *   Possible UX directions in very vague terms: CQL atomic and
> conditional batches initially; going forwards interactive transactions?
> Complex user defined functions? SQL?
>   *   Discussed possibility of LOCAL_QUORUM reads for globally replicated
> transactional tables, as this is an important use case
>      *   Simple stale reads to transactional tables
>      *   Brainstormed a bit about serializable reads to a single DC
> without (normally) crossing WAN
>      *   Discussed possibility of multiple ACKs providing separate LAN and
> WAN persistence notifications to clients
>   *   Discussed size of fast path quorums in Accord, and how this might
> affect global latency in high RF clusters (i.e. not optimal, and in some
> cases may need every DC to participate) and how this can be modified by
> biasing fast path electorate so that 2 of the 3 DCs may reach fast-path
> decisions with each other (remaining DC having to reach both those DCs to
> reach fast path). Also discussed Calvin-like modes of operation that would
> offer optimal global latency for sufficiently small clusters at RF=3 or
> RF=5.
>
> I’m sure there were other discussions I can’t remember, perhaps others can
> fill in the blanks.
>
>
> From: Jonathan Ellis <jb...@gmail.com>
> Date: Friday, 24 September 2021 at 20:28
> To: dev <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> Does anyone have notes for those of us who couldn't make the call?
>
> On Wed, Sep 22, 2021 at 1:35 PM benedict@apache.org <be...@apache.org>
> wrote:
>
> > Hi everyone,
> >
> > Joey has helpfully arranged a call for tomorrow at 8am PST / 10am CST /
> > 4pm BST to discuss Accord and other things in the community. There are no
> > plans to make any kind of project decisions. Everyone is welcome to drop
> in
> > to discuss Accord or whatever else might be on your mind.
> >
> >
> https://gather.town/app/2UKSboSjqKXIXliE/ac2021-cass-social
> >
> >
> > From: benedict@apache.org <be...@apache.org>
> > Date: Wednesday, 22 September 2021 at 16:22
> > To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > No, I would expect to deliver strict serializable interactive
> transactions
> > using Accord. These would simply corroborate that the participating keys
> > had not modified their write timestamps during the final transaction.
> These
> > could even be undertaken with still only a single wide area round-trip,
> > using local copies of the data to assemble the transaction (though this
> > would marginally increase the chance of aborts)
> >
> > My goal for MVCC is parallelism, not additional isolation levels (though
> > snapshot isolation is useful and we’ll probably also want to offer that
> > eventually)
> >
> > From: Henrik Ingo <he...@datastax.com>
> > Date: Wednesday, 22 September 2021 at 15:15
> > To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > On Wed, Sep 22, 2021 at 7:56 AM benedict@apache.org <benedict@apache.org
> >
> > wrote:
> >
> > > Could you explain why you believe this trade-off is necessary? We can
> > > support full SQL just fine with Accord, and I hope that we eventually
> do
> > so.
> > >
> >
> > I assume this is really referring to interactive transactions = multiple
> > round trips to the client within a transaction.
> >
> > You mentioned previously we could later build a more MVCC like
> transaction
> > semantic on top of Accord. (Independent reads from a single snapshot,
> > followed by a commit using Accord.) In this case I think the relevant
> > discussion is whether Accord is still the optimal building block
> > performance wise to do so, or whether users would then have lower
> > consistency level but still pay the performance cost of a stricter
> > consistency level.
> >
> > henrik
> > --
> >
> > Henrik Ingo
> >
> > +358 40 569 7354 <358405697354>
> >
> >
>
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced
>

--

Henrik Ingo

+358 40 569 7354 <358405697354>


Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by Henrik Ingo <he...@datastax.com>.

Hi Benedict

Since you asked, I reviewed the thread a bit and found this...

*secondary indexes*

>> What I would like to understand better and without guessing is, what do
these transactions look like from a client/user point of view?

> This is a fair question, and perhaps something I should pinpoint more
directly for the reader. The CEP does stipulate non-interactive
transactions, i.e. those that are one-shot. The only other limitation is
that the partition keys must be known upfront, however I expect we will
follow-up soon after with some weaker semantics that build on top (probably
using optimistic concurrency control) to support transactions where only
some partition keys are known upfront, so that we may support global
secondary indexes with proper isolation and consistency.

The CEP doesn't actually mention lack of support for secondary index
queries. Probably good to add as a limitation. (I realize currently using
secondary indexes isn't mainstream in Cassandra anyway, but with SASI in
4.0 and SAI being a separate CEP in discussion, it's good to call out
Accord wouldn't automatically support them.)

While I understand they are out of scope, do you happen to have already
some idea what it would require to support secondary indexes? Is it
sufficient to just include the secondary index keys (or a range of such) in
the "deps" of the transaction? Of course, still needing to also include the
partitions or rows actually read as a result of scanning the secondary
index. Similarly then for mutations, deps would have to include changes to
index keys in the transaction?

*commit latency*

A topic on some off-list discussions has been to understand the
implications of using a Spanner-inspired approach where the clock skew
between cluster nodes is a necessary part of the commit latency:

Deadline(t0, C, P) = t0 + SkewMax + max(Latency(C′, P) | C′ ∈ C) − Latency(C, P)

In the white paper you even explicitly mention the trade off you have
chosen: *"This technique trades wide area round-trips for an additional
latency penalty equal to the bounds on clock synchrony."*

If we try to quantify what this trade off means in practice, I get:

Typical value for SkewMax in e.g. the Spanner paper, some CockroachDB
discussions = 7 ms. Maybe 10 - 20 ms if you don't have Google-level
hardware.
Common network latencies in a globally distributed cluster:
US West - East = 60 ms
US East - EU Central = 100 ms
US/EU to APAC, Africa, LATAM = 100-200 ms
Source: https://www.cloudping.co/grid

The conclusion is that this tradeoff definitely makes sense for globally
distributed transactions. This resembles QUORUM writes in current Cassandra.

However, users commonly prefer LOCAL_QUORUM in current Cassandra. I read
that this was discussed in the phone call, but haven't read about a
specific proposal. Just for the sake of completing my math, let's assume
that some LOCAL_QUORUM style Accord commit is invented. A naive example
could be to simply deploy a Cassandra cluster *with Accord transactions* in
a single geographical region, and other geographical regions would be
served by some external replication mechanism and would have to be
read-only.

Whatever the (hypothetical) solution, for LOCAL_QUORUM style or just single
region commits we end up with:

Typical SkewMax = 7 - 20 ms
Network latency < 1 ms.

It seems the SkewMax is quite high for a cluster deployed in a single
region, and what's worse there's no way to avoid it or make it much smaller
than 7 ms?
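
To put numbers on this, here is a small sketch that evaluates the Deadline formula quoted above. It assumes one reading of the notation: C in the max term is the set of coordinators, C in the final term is the submitting coordinator, and P is the replica. The node names and the one-way latency values are hypothetical; only the 7 ms SkewMax comes from the figures above.

```python
def deadline(t0_ms, coordinator, replica, coordinators, latency_ms, skew_max_ms):
    """Deadline(t0, C, P) = t0 + SkewMax + max(Latency(C', P)) - Latency(C, P)."""
    slowest = max(latency_ms[(c, replica)] for c in coordinators)
    return t0_ms + skew_max_ms + slowest - latency_ms[(coordinator, replica)]


SKEW_MAX_MS = 7.0

# Hypothetical one-way latencies (ms) from each coordinator to replica "p".
global_latency = {("us-east", "p"): 0.5, ("eu-central", "p"): 50.0}
local_latency = {("node-a", "p"): 0.5, ("node-b", "p"): 0.5}

# With t0 = 0 the result is the offset from the transaction's timestamp.
global_wait = deadline(0.0, "us-east", "p",
                       ["us-east", "eu-central"], global_latency, SKEW_MAX_MS)
local_wait = deadline(0.0, "node-a", "p",
                      ["node-a", "node-b"], local_latency, SKEW_MAX_MS)

if __name__ == "__main__":
    print(global_wait, local_wait)
```

Under these assumed numbers the single-region deadline collapses to SkewMax alone, because the latency terms cancel, which is the case the question above is about.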

The only solution that comes to mind while writing this is to design Accord
to be pluggable such that the consensus part could be switched to something
that uses a logical clock for the transaction id. The user would choose one
or the other depending on what they optimize for.

I'll finish with a few notes:

Commit latency in itself isn't categorically bad for performance. I've
worked with several implementations of distributed databases that provide
good throughput even when a single write has high latency due to
geography/speed of light.

However, the duration of a commit is the window during which other
transactions may conflict with the committing transaction. Thus commit
latency will either increase the likelihood of aborted transactions, or in
other concurrency mechanisms block and impose a max throughput for hot rows.

A known optimization for the hot rows problem is to "hint" or manually
force clients to direct all updates to the hot row to the same node,
essentially making the system leader based. This allows the database to
start processing new updates even while the first one is still committing.
(See Galera for an example implementing this
<https://galeracluster.com/library/documentation/using-sr.html#usr-hot-records>.)
This makes me wonder whether there is a similar optimization for Accord
where transactions from the same coordinator can be allowed to commit
within the SkewMax window, because we can assume that the trx timestamps
originating at the same coordinator cannot arrive out of order when using
TPC?

henrik

On Mon, Sep 27, 2021 at 11:59 PM benedict@apache.org <be...@apache.org>
wrote:

> Ok, it’s time for the weekly poking of the hornet’s nest.
>
> Any more thoughts, questions or criticisms, anyone?
>
> From: benedict@apache.org <be...@apache.org>
> Date: Friday, 24 September 2021 at 22:41
> To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> I’m not aware of anybody having taken any notes, but somebody please chime
> in if I’m wrong.
>
> From my recollection, re Accord:
>
>
>   *   Q: Will batches now support rollbacks?
>      *   Batches would apply atomically or not, but unlikely to have a
> concept of rollback. Timeouts remain unknown, but hope to have some
> mechanism to provide clients a definitive answer about such transactions
> after the fact.
>   *   Q: Can stale replicas participate in transactions?
>      *   Accord applies conflicting transactions in-order at every
> replica, so only nodes that are up-to-date may participate in the execution
> of a transaction, but any replica may participate in agreeing a
> transaction. To ensure replicas remain up-to-date I anticipate introducing
> a real-time repair facility at the transactional message level, with peers
> reconciling recently processed messages and cross-delivering any that are
> missing.
>   *   Possible UX directions in very vague terms: CQL atomic and
> conditional batches initially; going forwards interactive transactions?
> Complex user defined functions? SQL?
>   *   Discussed possibility of LOCAL_QUORUM reads for globally replicated
> transactional tables, as this is an important use case
>      *   Simple stale reads to transactional tables
>      *   Brainstormed a bit about serializable reads to a single DC
> without (normally) crossing WAN
>      *   Discussed possibility of multiple ACKs providing separate LAN and
> WAN persistence notifications to clients
>   *   Discussed size of fast path quorums in Accord, and how this might
> affect global latency in high RF clusters (i.e. not optimal, and in some
> cases may need every DC to participate) and how this can be modified by
> biasing fast path electorate so that 2 of the 3 DCs may reach fast-path
> decisions with each other (remaining DC having to reach both those DCs to
> reach fast path). Also discussed Calvin-like modes of operation that would
> offer optimal global latency for sufficiently small clusters at RF=3 or
> RF=5.
>
> I’m sure there were other discussions I can’t remember, perhaps others can
> fill in the blanks.
>
>
> From: Jonathan Ellis <jb...@gmail.com>
> Date: Friday, 24 September 2021 at 20:28
> To: dev <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> Does anyone have notes for those of us who couldn't make the call?
>
> On Wed, Sep 22, 2021 at 1:35 PM benedict@apache.org <be...@apache.org>
> wrote:
>
> > Hi everyone,
> >
> > Joey has helpfully arranged a call for tomorrow at 8am PST / 10am CST /
> > 4pm BST to discuss Accord and other things in the community. There are no
> > plans to make any kind of project decisions. Everyone is welcome to drop
> in
> > to discuss Accord or whatever else might be on your mind.
> >
> >
> https://gather.town/app/2UKSboSjqKXIXliE/ac2021-cass-social
> >
> >
> > From: benedict@apache.org <be...@apache.org>
> > Date: Wednesday, 22 September 2021 at 16:22
> > To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > No, I would expect to deliver strict serializable interactive
> transactions
> > using Accord. These would simply corroborate that the participating keys
> > had not modified their write timestamps during the final transaction.
> These
> > could even be undertaken with still only a single wide area round-trip,
> > using local copies of the data to assemble the transaction (though this
> > would marginally increase the chance of aborts)
> >
> > My goal for MVCC is parallelism, not additional isolation levels (though
> > snapshot isolation is useful and we’ll probably also want to offer that
> > eventually)
> >
> > From: Henrik Ingo <he...@datastax.com>
> > Date: Wednesday, 22 September 2021 at 15:15
> > To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > On Wed, Sep 22, 2021 at 7:56 AM benedict@apache.org <benedict@apache.org
> >
> > wrote:
> >
> > > Could you explain why you believe this trade-off is necessary? We can
> > > support full SQL just fine with Accord, and I hope that we eventually
> do
> > so.
> > >
> >
> > I assume this is really referring to interactive transactions = multiple
> > round trips to the client within a transaction.
> >
> > You mentioned previously we could later build a more MVCC like
> transaction
> > semantic on top of Accord. (Independent reads from a single snapshot,
> > followed by a commit using Accord.) In this case I think the relevant
> > discussion is whether Accord is still the optimal building block
> > performance wise to do so, or whether users would then have lower
> > consistency level but still pay the performance cost of a stricter
> > consistency level.
> >
> > henrik
> > --
> >
> > Henrik Ingo
> >
> > +358 40 569 7354 <358405697354>
> >
> > [image: Visit us online.] <https://www.datastax.com/>  [image: Visit us
> on
> > Twitter.] <https://twitter.com/DataStaxEng>  [image: Visit us on
> YouTube.]
> > <
> >
> https://urldefense.proofpoint.com/v2/url?u=https-3A__www.youtube.com_channel_UCqA6zOSMpQ55vvguq4Y0jAg&d=DwMFaQ&c=adz96Xi0w1RHqtPMowiL2g&r=IFj3MdIKYLLXIUhYdUGB0cTzTlxyCb7_VUmICBaYilU&m=bmIfaie9O3fWJAu6lESvWj3HajV4VFwgwgVuKmxKZmE&s=16sY48_kvIb7sRQORknZrr3V8iLTfemFKbMVNZhdwgw&e=
> > >
> >   [image: Visit my LinkedIn profile.] <
> https://www.linkedin.com/in/heingo/
> > >
> >
>
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced
>

-- 

Henrik Ingo

+358 40 569 7354 <358405697354>


Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by Benjamin Lerer <b....@gmail.com>.

Did I bite someone?  😂

Thanks for your patience with all the questions and comments Benedict. I
believe that everybody is pretty excited by this CEP. At least I am :-)

On Mon, 27 Sept 2021 at 22:59, benedict@apache.org <be...@apache.org>
wrote:

> Ok, it’s time for the weekly poking of the hornet’s nest.
>
> Any more thoughts, questions or criticisms, anyone?
>
> From: benedict@apache.org <be...@apache.org>
> Date: Friday, 24 September 2021 at 22:41
> To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> I’m not aware of anybody having taken any notes, but somebody please chime
> in if I’m wrong.
>
> From my recollection, re Accord:
>
>
>   *   Q: Will batches now support rollbacks?
>      *   Batches would apply atomically or not, but unlikely to have a
> concept of rollback. Timeouts remain unknown, but hope to have some
> mechanism to provide clients a definitive answer about such transactions
> after the fact.
>   *   Q: Can stale replicas participate in transactions?
>      *   Accord applies conflicting transactions in-order at every
> replica, so only nodes that are up-to-date may participate in the execution
> of a transaction, but any replica may participate in agreeing a
> transaction. To ensure replicas remain up-to-date I anticipate introducing
> a real-time repair facility at the transactional message level, with peers
> reconciling recently processed messages and cross-delivering any that are
> missing.
>   *   Possible UX directions in very vague terms: CQL atomic and
> conditional batches initially; going forwards interactive transactions?
> Complex user defined functions? SQL?
>   *   Discussed possibility of LOCAL_QUORUM reads for globally replicated
> transactional tables, as this is an important use case
>      *   Simple stale reads to transactional tables
>      *   Brainstormed a bit about serializable reads to a single DC
> without (normally) crossing WAN
>      *   Discussed possibility of multiple ACKs providing separate LAN and
> WAN persistence notifications to clients
>   *   Discussed size of fast path quorums in Accord, and how this might
> affect global latency in high RF clusters (i.e. not optimal, and in some
> cases may need every DC to participate) and how this can be modified by
> biasing fast path electorate so that 2 of the 3 DCs may reach fast-path
> decisions with each other (remaining DC having to reach both those DCs to
> reach fast path). Also discussed Calvin-like modes of operation that would
> offer optimal global latency for sufficiently small clusters at RF=3 or
> RF=5.
>
> I’m sure there were other discussions I can’t remember, perhaps others can
> fill in the blanks.
>
>
> From: Jonathan Ellis <jb...@gmail.com>
> Date: Friday, 24 September 2021 at 20:28
> To: dev <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> Does anyone have notes for those of us who couldn't make the call?
>
> On Wed, Sep 22, 2021 at 1:35 PM benedict@apache.org <be...@apache.org>
> wrote:
>
> > Hi everyone,
> >
> > Joey has helpfully arranged a call for tomorrow at 8am PST / 10am CST /
> > 4pm BST to discuss Accord and other things in the community. There are no
> > plans to make any kind of project decisions. Everyone is welcome to drop
> in
> > to discuss Accord or whatever else might be on your mind.
> >
> > https://gather.town/app/2UKSboSjqKXIXliE/ac2021-cass-social
> >
> >
> > From: benedict@apache.org <be...@apache.org>
> > Date: Wednesday, 22 September 2021 at 16:22
> > To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > No, I would expect to deliver strict serializable interactive
> transactions
> > using Accord. These would simply corroborate that the participating keys
> > had not modified their write timestamps during the final transaction.
> These
> > could even be undertaken with still only a single wide area round-trip,
> > using local copies of the data to assemble the transaction (though this
> > would marginally increase the chance of aborts)
> >
> > My goal for MVCC is parallelism, not additional isolation levels (though
> > snapshot isolation is useful and we’ll probably also want to offer that
> > eventually)
> >
> > From: Henrik Ingo <he...@datastax.com>
> > Date: Wednesday, 22 September 2021 at 15:15
> > To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > On Wed, Sep 22, 2021 at 7:56 AM benedict@apache.org <benedict@apache.org
> >
> > wrote:
> >
> > > Could you explain why you believe this trade-off is necessary? We can
> > > support full SQL just fine with Accord, and I hope that we eventually
> do
> > so.
> > >
> >
> > I assume this is really referring to interactive transactions = multiple
> > round trips to the client within a transaction.
> >
> > You mentioned previously we could later build a more MVCC like
> transaction
> > semantic on top of Accord. (Independent reads from a single snapshot,
> > followed by a commit using Accord.) In this case I think the relevant
> > discussion is whether Accord is still the optimal building block
> > performance wise to do so, or whether users would then have lower
> > consistency level but still pay the performance cost of a stricter
> > consistency level.
> >
> > henrik
> > --
> >
> > Henrik Ingo
> >
> > +358 40 569 7354 <358405697354>
> >
> > [image: Visit us online.] <https://www.datastax.com/>  [image: Visit us
> on
> > Twitter.] <https://twitter.com/DataStaxEng>  [image: Visit us on
> YouTube.]
> > <
> >
> https://urldefense.proofpoint.com/v2/url?u=https-3A__www.youtube.com_channel_UCqA6zOSMpQ55vvguq4Y0jAg&d=DwMFaQ&c=adz96Xi0w1RHqtPMowiL2g&r=IFj3MdIKYLLXIUhYdUGB0cTzTlxyCb7_VUmICBaYilU&m=bmIfaie9O3fWJAu6lESvWj3HajV4VFwgwgVuKmxKZmE&s=16sY48_kvIb7sRQORknZrr3V8iLTfemFKbMVNZhdwgw&e=
> > >
> >   [image: Visit my LinkedIn profile.] <
> https://www.linkedin.com/in/heingo/
> > >
> >
>
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced
>

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by "benedict@apache.org" <be...@apache.org>.

Ok, it’s time for the weekly poking of the hornet’s nest.

Any more thoughts, questions or criticisms, anyone?

From: benedict@apache.org <be...@apache.org>
Date: Friday, 24 September 2021 at 22:41
To: dev@cassandra.apache.org <de...@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
I’m not aware of anybody having taken any notes, but somebody please chime in if I’m wrong.

From my recollection, re Accord:


  *   Q: Will batches now support rollbacks?
     *   Batches would apply atomically or not, but unlikely to have a concept of rollback. Timeouts remain unknown, but hope to have some mechanism to provide clients a definitive answer about such transactions after the fact.
  *   Q: Can stale replicas participate in transactions?
     *   Accord applies conflicting transactions in-order at every replica, so only nodes that are up-to-date may participate in the execution of a transaction, but any replica may participate in agreeing a transaction. To ensure replicas remain up-to-date I anticipate introducing a real-time repair facility at the transactional message level, with peers reconciling recently processed messages and cross-delivering any that are missing.
  *   Possible UX directions in very vague terms: CQL atomic and conditional batches initially; going forwards interactive transactions? Complex user defined functions? SQL?
  *   Discussed possibility of LOCAL_QUORUM reads for globally replicated transactional tables, as this is an important use case
     *   Simple stale reads to transactional tables
     *   Brainstormed a bit about serializable reads to a single DC without (normally) crossing WAN
     *   Discussed possibility of multiple ACKs providing separate LAN and WAN persistence notifications to clients
  *   Discussed size of fast path quorums in Accord, and how this might affect global latency in high RF clusters (i.e. not optimal, and in some cases may need every DC to participate) and how this can be modified by biasing fast path electorate so that 2 of the 3 DCs may reach fast-path decisions with each other (remaining DC having to reach both those DCs to reach fast path). Also discussed Calvin-like modes of operation that would offer optimal global latency for sufficiently small clusters at RF=3 or RF=5.

I’m sure there were other discussions I can’t remember, perhaps others can fill in the blanks.


From: Jonathan Ellis <jb...@gmail.com>
Date: Friday, 24 September 2021 at 20:28
To: dev <de...@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
Does anyone have notes for those of us who couldn't make the call?

On Wed, Sep 22, 2021 at 1:35 PM benedict@apache.org <be...@apache.org>
wrote:

> Hi everyone,
>
> Joey has helpfully arranged a call for tomorrow at 8am PST / 10am CST /
> 4pm BST to discuss Accord and other things in the community. There are no
> plans to make any kind of project decisions. Everyone is welcome to drop in
> to discuss Accord or whatever else might be on your mind.
>
> https://gather.town/app/2UKSboSjqKXIXliE/ac2021-cass-social
>
>
> From: benedict@apache.org <be...@apache.org>
> Date: Wednesday, 22 September 2021 at 16:22
> To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> No, I would expect to deliver strict serializable interactive transactions
> using Accord. These would simply corroborate that the participating keys
> had not modified their write timestamps during the final transaction. These
> could even be undertaken with still only a single wide area round-trip,
> using local copies of the data to assemble the transaction (though this
> would marginally increase the chance of aborts)
>
> My goal for MVCC is parallelism, not additional isolation levels (though
> snapshot isolation is useful and we’ll probably also want to offer that
> eventually)
>
> From: Henrik Ingo <he...@datastax.com>
> Date: Wednesday, 22 September 2021 at 15:15
> To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> On Wed, Sep 22, 2021 at 7:56 AM benedict@apache.org <be...@apache.org>
> wrote:
>
> > Could you explain why you believe this trade-off is necessary? We can
> > support full SQL just fine with Accord, and I hope that we eventually do
> so.
> >
>
> I assume this is really referring to interactive transactions = multiple
> round trips to the client within a transaction.
>
> You mentioned previously we could later build a more MVCC like transaction
> semantic on top of Accord. (Independent reads from a single snapshot,
> followed by a commit using Accord.) In this case I think the relevant
> discussion is whether Accord is still the optimal building block
> performance wise to do so, or whether users would then have lower
> consistency level but still pay the performance cost of a stricter
> consistency level.
>
> henrik
> --
>
> Henrik Ingo
>
> +358 40 569 7354 <358405697354>
>
> [image: Visit us online.] <https://www.datastax.com/>  [image: Visit us on
> Twitter.] <https://twitter.com/DataStaxEng>  [image: Visit us on YouTube.]
> <
> https://urldefense.proofpoint.com/v2/url?u=https-3A__www.youtube.com_channel_UCqA6zOSMpQ55vvguq4Y0jAg&d=DwMFaQ&c=adz96Xi0w1RHqtPMowiL2g&r=IFj3MdIKYLLXIUhYdUGB0cTzTlxyCb7_VUmICBaYilU&m=bmIfaie9O3fWJAu6lESvWj3HajV4VFwgwgVuKmxKZmE&s=16sY48_kvIb7sRQORknZrr3V8iLTfemFKbMVNZhdwgw&e=
> >
>   [image: Visit my LinkedIn profile.] <https://www.linkedin.com/in/heingo/
> >
>


--
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by "benedict@apache.org" <be...@apache.org>.

I’m not aware of anybody having taken any notes, but somebody please chime in if I’m wrong.

From my recollection, re Accord:


  *   Q: Will batches now support rollbacks?
     *   Batches would apply atomically or not at all, but are unlikely to have a concept of rollback. The outcome of a timed-out transaction remains unknown, but we hope to have some mechanism to provide clients a definitive answer about such transactions after the fact.
  *   Q: Can stale replicas participate in transactions?
     *   Accord applies conflicting transactions in-order at every replica, so only nodes that are up-to-date may participate in the execution of a transaction, but any replica may participate in agreeing a transaction. To ensure replicas remain up-to-date I anticipate introducing a real-time repair facility at the transactional message level, with peers reconciling recently processed messages and cross-delivering any that are missing.
  *   Possible UX directions in very vague terms: CQL atomic and conditional batches initially; going forwards interactive transactions? Complex user defined functions? SQL?
  *   Discussed possibility of LOCAL_QUORUM reads for globally replicated transactional tables, as this is an important use case
     *   Simple stale reads to transactional tables
     *   Brainstormed a bit about serializable reads to a single DC without (normally) crossing WAN
     *   Discussed possibility of multiple ACKs providing separate LAN and WAN persistence notifications to clients
  *   Discussed size of fast path quorums in Accord, and how this might affect global latency in high RF clusters (i.e. not optimal, and in some cases may need every DC to participate) and how this can be modified by biasing fast path electorate so that 2 of the 3 DCs may reach fast-path decisions with each other (remaining DC having to reach both those DCs to reach fast path). Also discussed Calvin-like modes of operation that would offer optimal global latency for sufficiently small clusters at RF=3 or RF=5.
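
The real-time repair idea in the notes above (peers reconciling recently processed messages and cross-delivering any that are missing) can be pictured with a small sketch. This is only an illustration under stated assumptions: the Replica class, its "recent" map and the reconcile function are hypothetical names invented here, not part of Accord or Cassandra, and a real facility would have to bound the window of messages and handle ordering and failures.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical replica that remembers recently processed messages by id."""
    name: str
    recent: dict = field(default_factory=dict)  # message id -> payload

    def receive(self, msg_id, payload):
        # Record a transactional message as processed on this replica.
        self.recent[msg_id] = payload


def reconcile(a, b):
    """Cross-deliver messages that one peer has seen and the other has not.

    Returns the sorted ids delivered to `a` and to `b` respectively.
    """
    missing_on_a = sorted(b.recent.keys() - a.recent.keys())
    missing_on_b = sorted(a.recent.keys() - b.recent.keys())
    for msg_id in missing_on_a:
        a.receive(msg_id, b.recent[msg_id])
    for msg_id in missing_on_b:
        b.receive(msg_id, a.recent[msg_id])
    return missing_on_a, missing_on_b
```

After a call to reconcile, both hypothetical replicas hold the same set of recent messages, which is the property a stale replica needs before it can take part in executing a transaction.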

I’m sure there were other discussions I can’t remember, perhaps others can fill in the blanks.


From: Jonathan Ellis <jb...@gmail.com>
Date: Friday, 24 September 2021 at 20:28
To: dev <de...@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
Does anyone have notes for those of us who couldn't make the call?

On Wed, Sep 22, 2021 at 1:35 PM benedict@apache.org <be...@apache.org>
wrote:

> Hi everyone,
>
> Joey has helpfully arranged a call for tomorrow at 8am PST / 10am CST /
> 4pm BST to discuss Accord and other things in the community. There are no
> plans to make any kind of project decisions. Everyone is welcome to drop in
> to discuss Accord or whatever else might be on your mind.
>
> https://gather.town/app/2UKSboSjqKXIXliE/ac2021-cass-social
>
>
> From: benedict@apache.org <be...@apache.org>
> Date: Wednesday, 22 September 2021 at 16:22
> To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> No, I would expect to deliver strict serializable interactive transactions
> using Accord. These would simply corroborate that the participating keys
> had not modified their write timestamps during the final transaction. These
> could even be undertaken with still only a single wide area round-trip,
> using local copies of the data to assemble the transaction (though this
> would marginally increase the chance of aborts)
>
> My goal for MVCC is parallelism, not additional isolation levels (though
> snapshot isolation is useful and we’ll probably also want to offer that
> eventually)
>
> From: Henrik Ingo <he...@datastax.com>
> Date: Wednesday, 22 September 2021 at 15:15
> To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> On Wed, Sep 22, 2021 at 7:56 AM benedict@apache.org <be...@apache.org>
> wrote:
>
> > Could you explain why you believe this trade-off is necessary? We can
> > support full SQL just fine with Accord, and I hope that we eventually do
> so.
> >
>
> I assume this is really referring to interactive transactions = multiple
> round trips to the client within a transaction.
>
> You mentioned previously we could later build a more MVCC like transaction
> semantic on top of Accord. (Independent reads from a single snapshot,
> followed by a commit using Accord.) In this case I think the relevant
> discussion is whether Accord is still the optimal building block
> performance wise to do so, or whether users would then have lower
> consistency level but still pay the performance cost of a stricter
> consistency level.
>
> henrik
> --
>
> Henrik Ingo
>
> +358 40 569 7354 <358405697354>
>
> [image: Visit us online.] <https://www.datastax.com/>  [image: Visit us on
> Twitter.] <https://twitter.com/DataStaxEng>  [image: Visit us on YouTube.]
> <
> https://urldefense.proofpoint.com/v2/url?u=https-3A__www.youtube.com_channel_UCqA6zOSMpQ55vvguq4Y0jAg&d=DwMFaQ&c=adz96Xi0w1RHqtPMowiL2g&r=IFj3MdIKYLLXIUhYdUGB0cTzTlxyCb7_VUmICBaYilU&m=bmIfaie9O3fWJAu6lESvWj3HajV4VFwgwgVuKmxKZmE&s=16sY48_kvIb7sRQORknZrr3V8iLTfemFKbMVNZhdwgw&e=
> >
>   [image: Visit my LinkedIn profile.] <https://www.linkedin.com/in/heingo/
> >
>


--
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by Jonathan Ellis <jb...@gmail.com>.

Does anyone have notes for those of us who couldn't make the call?

On Wed, Sep 22, 2021 at 1:35 PM benedict@apache.org <be...@apache.org>
wrote:

> Hi everyone,
>
> Joey has helpfully arranged a call for tomorrow at 8am PST / 10am CST /
> 4pm BST to discuss Accord and other things in the community. There are no
> plans to make any kind of project decisions. Everyone is welcome to drop in
> to discuss Accord or whatever else might be on your mind.
>
> https://gather.town/app/2UKSboSjqKXIXliE/ac2021-cass-social
>
>
> From: benedict@apache.org <be...@apache.org>
> Date: Wednesday, 22 September 2021 at 16:22
> To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> No, I would expect to deliver strict serializable interactive transactions
> using Accord. These would simply corroborate that the participating keys
> had not modified their write timestamps during the final transaction. These
> could even be undertaken with still only a single wide area round-trip,
> using local copies of the data to assemble the transaction (though this
> would marginally increase the chance of aborts)
>
> My goal for MVCC is parallelism, not additional isolation levels (though
> snapshot isolation is useful and we’ll probably also want to offer that
> eventually)
>
> From: Henrik Ingo <he...@datastax.com>
> Date: Wednesday, 22 September 2021 at 15:15
> To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> On Wed, Sep 22, 2021 at 7:56 AM benedict@apache.org <be...@apache.org>
> wrote:
>
> > Could you explain why you believe this trade-off is necessary? We can
> > support full SQL just fine with Accord, and I hope that we eventually do
> so.
> >
>
> I assume this is really referring to interactive transactions = multiple
> round trips to the client within a transaction.
>
> You mentioned previously we could later build a more MVCC like transaction
> semantic on top of Accord. (Independent reads from a single snapshot,
> followed by a commit using Accord.) In this case I think the relevant
> discussion is whether Accord is still the optimal building block
> performance wise to do so, or whether users would then have lower
> consistency level but still pay the performance cost of a stricter
> consistency level.
>
> henrik
> --
>
> Henrik Ingo
>
> +358 40 569 7354 <358405697354>
>
> [image: Visit us online.] <https://www.datastax.com/>  [image: Visit us on
> Twitter.] <https://twitter.com/DataStaxEng>  [image: Visit us on YouTube.]
> <
> https://urldefense.proofpoint.com/v2/url?u=https-3A__www.youtube.com_channel_UCqA6zOSMpQ55vvguq4Y0jAg&d=DwMFaQ&c=adz96Xi0w1RHqtPMowiL2g&r=IFj3MdIKYLLXIUhYdUGB0cTzTlxyCb7_VUmICBaYilU&m=bmIfaie9O3fWJAu6lESvWj3HajV4VFwgwgVuKmxKZmE&s=16sY48_kvIb7sRQORknZrr3V8iLTfemFKbMVNZhdwgw&e=
> >
>   [image: Visit my LinkedIn profile.] <https://www.linkedin.com/in/heingo/
> >
>


-- 
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by "benedict@apache.org" <be...@apache.org>.

Hi everyone,

Joey has helpfully arranged a call for tomorrow at 8am PST / 10am CST / 4pm BST to discuss Accord and other things in the community. There are no plans to make any kind of project decisions. Everyone is welcome to drop in to discuss Accord or whatever else might be on your mind.

https://gather.town/app/2UKSboSjqKXIXliE/ac2021-cass-social

From: benedict@apache.org <be...@apache.org>
Date: Wednesday, 22 September 2021 at 16:22
To: dev@cassandra.apache.org <de...@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
No, I would expect to deliver strict serializable interactive transactions using Accord. These would simply corroborate that the participating keys had not modified their write timestamps during the final transaction. These could even be undertaken with still only a single wide area round-trip, using local copies of the data to assemble the transaction (though this would marginally increase the chance of aborts)

My goal for MVCC is parallelism, not additional isolation levels (though snapshot isolation is useful and we’ll probably also want to offer that eventually)

From: Henrik Ingo <he...@datastax.com>
Date: Wednesday, 22 September 2021 at 15:15
To: dev@cassandra.apache.org <de...@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
On Wed, Sep 22, 2021 at 7:56 AM benedict@apache.org <be...@apache.org>
wrote:

> Could you explain why you believe this trade-off is necessary? We can
> support full SQL just fine with Accord, and I hope that we eventually do so.
>

I assume this is really referring to interactive transactions = multiple
round trips to the client within a transaction.

You mentioned previously we could later build a more MVCC like transaction
semantic on top of Accord. (Independent reads from a single snapshot,
followed by a commit using Accord.) In this case I think the relevant
discussion is whether Accord is still the optimal building block
performance wise to do so, or whether users would then have lower
consistency level but still pay the performance cost of a stricter
consistency level.

henrik
--

Henrik Ingo

+358 40 569 7354 <358405697354>


Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by "benedict@apache.org" <be...@apache.org>.

No, I would expect to deliver strict serializable interactive transactions using Accord. These would simply corroborate that the participating keys had not modified their write timestamps during the final transaction. These could even be undertaken with still only a single wide area round-trip, using local copies of the data to assemble the transaction (though this would marginally increase the chance of aborts)

My goal for MVCC is parallelism, not additional isolation levels (though snapshot isolation is useful and we’ll probably also want to offer that eventually)
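
The check described above (the final transaction confirms that the participating keys have not changed their write timestamps) is a form of optimistic concurrency control. A minimal single-process sketch follows, assuming a hypothetical Store class with a per-key write timestamp; none of these names come from Accord, and in the real design the validation and the writes would be carried out together by one Accord transaction.

```python
class Store:
    """Hypothetical key-value store tracking a write timestamp per key."""

    def __init__(self):
        self.data = {}   # key -> (value, write_timestamp)
        self.clock = 0   # logical clock, advanced on each successful commit

    def read(self, key):
        # Unwritten keys read as (None, 0).
        return self.data.get(key, (None, 0))

    def commit(self, read_timestamps, writes):
        """Apply `writes` only if no key that was read has been written since."""
        for key, seen_ts in read_timestamps.items():
            if self.read(key)[1] != seen_ts:
                return False  # a participating key changed: abort
        self.clock += 1
        for key, value in writes.items():
            self.data[key] = (value, self.clock)
        return True
```

A client would read values and their timestamps (possibly from local copies), compute its writes, and then submit both to commit; an abort simply means it must re-read and retry, which is why assembling the transaction from possibly stale local copies raises the chance of aborts.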

From: Henrik Ingo <he...@datastax.com>
Date: Wednesday, 22 September 2021 at 15:15
To: dev@cassandra.apache.org <de...@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
On Wed, Sep 22, 2021 at 7:56 AM benedict@apache.org <be...@apache.org>
wrote:

> Could you explain why you believe this trade-off is necessary? We can
> support full SQL just fine with Accord, and I hope that we eventually do so.
>

I assume this is really referring to interactive transactions = multiple
round trips to the client within a transaction.

You mentioned previously we could later build a more MVCC like transaction
semantic on top of Accord. (Independent reads from a single snapshot,
followed by a commit using Accord.) In this case I think the relevant
discussion is whether Accord is still the optimal building block
performance wise to do so, or whether users would then have lower
consistency level but still pay the performance cost of a stricter
consistency level.

henrik
--

Henrik Ingo

+358 40 569 7354 <358405697354>


Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by Henrik Ingo <he...@datastax.com>.

On Wed, Sep 22, 2021 at 7:56 AM benedict@apache.org <be...@apache.org>
wrote:

> Could you explain why you believe this trade-off is necessary? We can
> support full SQL just fine with Accord, and I hope that we eventually do so.
>

I assume this is really referring to interactive transactions = multiple
round trips to the client within a transaction.

You mentioned previously we could later build a more MVCC like transaction
semantic on top of Accord. (Independent reads from a single snapshot,
followed by a commit using Accord.) In this case I think the relevant
discussion is whether Accord is still the optimal building block
performance wise to do so, or whether users would then have lower
consistency level but still pay the performance cost of a stricter
consistency level.

henrik
-- 

Henrik Ingo

+358 40 569 7354 <358405697354>


Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by "benedict@apache.org" <be...@apache.org>.

Could you explain why you believe this trade-off is necessary? We can support full SQL just fine with Accord, and I hope that we eventually do so.

This domain is incredibly complex, so it is easy to reach wrong conclusions. I would invite you again to propose a system for discussion that you think offers something Accord is unable to, and that you consider desirable, and we can work from there.

To pre-empt some possible discussions, I am not aware of anything we cannot do with Accord that we could do with either Calvin or Spanner. Interactive transactions are possible on top of Accord, as are transactions with an unknown read/write set. In each case the only cost is that they would use optimistic concurrency control, which is no worse than the Spanner derivatives anyway (which I have to assume is your benchmark in this regard). I do not expect to deliver either functionality initially, but Accord takes us most of the way there for both.


From: Jonathan Ellis <jb...@gmail.com>
Date: Wednesday, 22 September 2021 at 05:36
To: dev <de...@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
Right, I'm looking for exactly a discussion on the high level goals.
Instead of saying "here's the goals and we ruled out X because Y" we should
start with a discussion around, "Approach A allows X and W, approach B
allows Y and Z" and decide together what the goals should be and what
we are willing to trade to get those goals, e.g., are we willing to give up
global strict serializability to get the ability to support full SQL.  Both
of these are nice to have!

On Tue, Sep 21, 2021 at 9:52 PM benedict@apache.org <be...@apache.org>
wrote:

> Hi Jonathan,
>
> These other systems are incompatible with the goals of the CEP. I do
> discuss them (besides 2PC) in both the whitepaper and the CEP, and will
> summarise that discussion below. A true and accurate comparison of these
> other systems is essentially intractable, as there are complex subtleties
> to each flavour, and those who are interested would be better served by
> performing their own research.
>
> I think it is more productive to focus on what we want to achieve as a
> community. If you believe the goals of this CEP are wrong for the project,
> let’s focus on that. If you want to compare and contrast specific facets of
> alternative systems that you consider to be preferable in some dimension,
> let’s do that here or in a Q&A as proposed by Joey.
>
> The relevant goals are that we:
>
>
>   1.  Guarantee strict serializable isolation on commodity hardware
>   2.  Scale to any cluster size
>   3.  Achieve optimal latency
>
> The approach taken by Spanner derivatives is rejected by (1) because they
> guarantee only Serializable isolation (they additionally fail (3)). From
> watching talks by YugaByte, and inferring from Cockroach’s
> panic-cluster-death under clock skew, this is clearly considered by
> everyone to be undesirable but necessary to achieve scalability.
>
> The approach taken by FaunaDB (Calvin) is rejected by (2) because its
> sequencing layer requires a global leader process for the cluster, which is
> incompatible with Cassandra’s scalability requirements. It additionally
> fails (3) for global clients.
>
> Two phase commit fails (3). As an aside, AFAICT DynamoDB is today a
> Spanner clone for its multi-key transaction functionality, not 2PC.
>
> Systems such as RAMP with even weaker isolation are not considered for the
> simple reason that they do not even claim to meet (1).
>
> If we want to additionally offer weaker isolation levels than
> Serializable, such as that provided by the recent RAMP-TAO paper, Cassandra
> is likely able to support multiple distinct transaction layers that operate
> independently. I would encourage you to file a CEP to explore how we can
> meet these distinct use cases, but I consider them to be niche. I expect
> that a majority of our user base desire strict serializable isolation, and
> certainly no less than serializable isolation, to augment the existing
> weaker isolation offered by quorum reads and writes.
>
> I would tangentially note that we are not an AP database under normal
> recommended operation. A minority in any network partition cannot reach
> QUORUM, so under recommended usage we are a high-availability leaderless CP
> database.
>
>
> From: Jonathan Ellis <jb...@gmail.com>
> Date: Tuesday, 21 September 2021 at 23:45
> To: dev <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> Benedict, thanks for taking the lead in putting this together. Since
> Cassandra is the only relevant database today designed around a leaderless
> architecture, it's quite likely that we'll be better served with a custom
> transaction design instead of trying to retrofit one from CP systems.
>
> The whitepaper here is a good description of the consensus algorithm itself
> as well as its robustness and stability characteristics, and its comparison
> with other state-of-the-art consensus algorithms is very useful.  In the
> context of Cassandra, where a consensus algorithm is only part of what will
> be implemented, I'd like to see a more complete evaluation of the
> transactional side of things as well, including performance characteristics
> as well as the types of transactions that can be supported and at least a
> general idea of what it would look like applied to Cassandra. This will
> allow the PMC to make a more informed decision about what tradeoffs are
> best for the entire long-term project of first supplementing and ultimately
> replacing LWT.
>
> (Allowing users to mix LWT and AP Cassandra operations against the same
> rows was probably a mistake, so in contrast with LWT we’re not looking for
> something fast enough for occasional use but rather something within a
> reasonable factor of AP operations, appropriate to being the only way to
> interact with tables declared as such.)
>
> Besides Accord, this should cover
>
> - Calvin and FaunaDB
> - A Spanner derivative (no opinion on whether that should be Cockroach or
> Yugabyte, I don’t think it’s necessary to cover both)
> - A 2PC implementation (the Accord paper mentions DynamoDB but I suspect
> there is more public information about MongoDB)
> - RAMP
>
> Here’s an example of what I mean:
>
> =Calvin=
>
> Approach: global consensus (Paxos in Calvin, Raft in FaunaDB) to order
> transactions, then replicas execute the transactions independently with no
> further coordination.  No SPOF.  Transactions are batched by each sequencer
> to keep this from becoming a bottleneck.
>
> Performance: Calvin paper (published 2012) reports linear scaling of TPC-C
> New Order up to 500,000 transactions/s on 100 machines (EC2 XL machines
> with 7GB ram and 8 virtual cores).  Note that TPC-C New Order is composed
> of four reads and four writes, so this is effectively 2M reads and 2M
> writes as we normally measure them in C*.
>
> Calvin supports mixed read/write transactions, but because the transaction
> execution logic requires knowing all partition keys in advance to ensure
> that all replicas can reproduce the same results with no coordination,
> reads against non-PK predicates must be done ahead of time (transparently,
> by the server) to determine the set of keys, and this must be retried if
> the set of rows affected is updated before the actual transaction executes.
>
> Batching and global consensus add latency -- 100ms in the Calvin paper and
> apparently about 50ms in FaunaDB.  Glass half full: all transactions
> (including multi-partition updates) are equally performant in Calvin since
> the coordination is handled up front in the sequencing step.  Glass half
> empty: even single-row reads and writes have to pay the full coordination
> cost.  Fauna has optimized this away for reads but I am not aware of a
> description of how they changed the design to allow this.
>
> Functionality and limitations: since the entire transaction must be known
> in advance to allow coordination-less execution at the replicas, Calvin
> cannot support interactive transactions at all.  FaunaDB mitigates this by
> allowing server-side logic to be included, but a Calvin approach will never
> be able to offer SQL compatibility.
>
> Guarantees: Calvin transactions are strictly serializable.  There is no
> additional complexity or performance hit to generalizing to multiple
> regions, apart from the speed of light.  And since Calvin is already paying
> a batching latency penalty, this is less painful than for other systems.
>
> Application to Cassandra: B-.  Distributed transactions are handled by the
> sequencing and scheduling layers, which are leaderless, and Calvin’s
> requirements for the storage layer are easily met by C*.  But Calvin also
> requires a global consensus protocol and LWT is almost certainly not
> sufficiently performant, so this would require ZK or etcd (reasonable for a
> library approach but not for replacing LWT in C* itself), or an
> implementation of Accord.  I don’t believe Calvin would require additional
> table-level metadata in Cassandra.
>
> On Sun, Sep 5, 2021 at 9:33 AM benedict@apache.org <be...@apache.org>
> wrote:
>
> > Wiki:
> > https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-15%3A+General+Purpose+Transactions
> > Whitepaper:
> > https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf
> > <
> > https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf?version=1&modificationDate=1630847736966&api=v2
> > >
> > Prototype: https://github.com/belliottsmith/accord
> >
> > Hi everyone, I’d like to propose this CEP for adoption by the community.
> >
> > Cassandra has benefitted from LWTs for many years, but application
> > developers that want to ensure consistency for complex operations must
> > either accept the scalability bottleneck of serializing all related state
> > through a single partition, or layer a complex state machine on top of the
> > database. These are sophisticated and costly activities that our users
> > should not be expected to undertake. Since distributed databases are
> > beginning to offer distributed transactions with fewer caveats, it is past
> > time for Cassandra to do so as well.
> >
> > This CEP proposes the use of several novel techniques that build upon
> > research (that followed EPaxos) to deliver (non-interactive) general
> > purpose distributed transactions. The approach is outlined in the wikipage
> > and in more detail in the linked whitepaper. Importantly, by adopting this
> > approach we will be the _only_ distributed database to offer global,
> > scalable, strict serializable transactions in one wide area round-trip.
> > This would represent a significant improvement in the state of the art,
> > both in the academic literature and in commercial or open source offerings.
> >
> > This work has been partially realised in a prototype. This partial
> > prototype has been verified against Jepsen.io’s Maelstrom library and
> > dedicated in-tree strict serializability verification tools, but much work
> > remains for the work to be production capable and integrated into Cassandra.
> >
> > I propose including the prototype in the project as a new source
> > repository, to be developed as a standalone library for integration into
> > Cassandra. I hope the community sees the important value proposition of
> > this proposal, and will adopt the CEP after this discussion, so that the
> > library and its integration into Cassandra can be developed in parallel and
> > with the involvement of the wider community.
> >
>
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced
>


--
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by Jonathan Ellis <jb...@gmail.com>.

Right, I'm looking for exactly a discussion on the high level goals.
Instead of saying "here's the goals and we ruled out X because Y" we should
start with a discussion around, "Approach A allows X and W, approach B
allows Y and Z" and decide together what the goals should be and what
we are willing to trade to get those goals, e.g., are we willing to give up
global strict serializability to get the ability to support full SQL.  Both
of these are nice to have!

On Tue, Sep 21, 2021 at 9:52 PM benedict@apache.org <be...@apache.org>
wrote:

> Hi Jonathan,
>
> These other systems are incompatible with the goals of the CEP. I do
> discuss them (besides 2PC) in both the whitepaper and the CEP, and will
> summarise that discussion below. A true and accurate comparison of these
> other systems is essentially intractable, as there are complex subtleties
> to each flavour, and those who are interested would be better served by
> performing their own research.
>
> I think it is more productive to focus on what we want to achieve as a
> community. If you believe the goals of this CEP are wrong for the project,
> let’s focus on that. If you want to compare and contrast specific facets of
> alternative systems that you consider to be preferable in some dimension,
> let’s do that here or in a Q&A as proposed by Joey.
>
> The relevant goals are that we:
>
>
>   1.  Guarantee strict serializable isolation on commodity hardware
>   2.  Scale to any cluster size
>   3.  Achieve optimal latency
>
> The approach taken by Spanner derivatives is rejected by (1) because they
> guarantee only Serializable isolation (they additionally fail (3)). From
> watching talks by YugaByte, and inferring from Cockroach’s
> panic-cluster-death under clock skew, this is clearly considered by
> everyone to be undesirable but necessary to achieve scalability.
>
> The approach taken by FaunaDB (Calvin) is rejected by (2) because its
> sequencing layer requires a global leader process for the cluster, which is
> incompatible with Cassandra’s scalability requirements. It additionally
> fails (3) for global clients.
>
> Two phase commit fails (3). As an aside, AFAICT DynamoDB is today a
> Spanner clone for its multi-key transaction functionality, not 2PC.
>
> Systems such as RAMP with even weaker isolation are not considered for the
> simple reason that they do not even claim to meet (1).
>
> If we want to additionally offer weaker isolation levels than
> Serializable, such as that provided by the recent RAMP-TAO paper, Cassandra
> is likely able to support multiple distinct transaction layers that operate
> independently. I would encourage you to file a CEP to explore how we can
> meet these distinct use cases, but I consider them to be niche. I expect
> that a majority of our user base desire strict serializable isolation, and
> certainly no less than serializable isolation, to augment the existing
> weaker isolation offered by quorum reads and writes.
>
> I would tangentially note that we are not an AP database under normal
> recommended operation. A minority in any network partition cannot reach
> QUORUM, so under recommended usage we are a high-availability leaderless CP
> database.
>
>
> From: Jonathan Ellis <jb...@gmail.com>
> Date: Tuesday, 21 September 2021 at 23:45
> To: dev <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> Benedict, thanks for taking the lead in putting this together. Since
> Cassandra is the only relevant database today designed around a leaderless
> architecture, it's quite likely that we'll be better served with a custom
> transaction design instead of trying to retrofit one from CP systems.
>
> The whitepaper here is a good description of the consensus algorithm itself
> as well as its robustness and stability characteristics, and its comparison
> with other state-of-the-art consensus algorithms is very useful.  In the
> context of Cassandra, where a consensus algorithm is only part of what will
> be implemented, I'd like to see a more complete evaluation of the
> transactional side of things as well, including performance characteristics
> as well as the types of transactions that can be supported and at least a
> general idea of what it would look like applied to Cassandra. This will
> allow the PMC to make a more informed decision about what tradeoffs are
> best for the entire long-term project of first supplementing and ultimately
> replacing LWT.
>
> (Allowing users to mix LWT and AP Cassandra operations against the same
> rows was probably a mistake, so in contrast with LWT we’re not looking for
> something fast enough for occasional use but rather something within a
> reasonable factor of AP operations, appropriate to being the only way to
> interact with tables declared as such.)
>
> Besides Accord, this should cover
>
> - Calvin and FaunaDB
> - A Spanner derivative (no opinion on whether that should be Cockroach or
> Yugabyte, I don’t think it’s necessary to cover both)
> - A 2PC implementation (the Accord paper mentions DynamoDB but I suspect
> there is more public information about MongoDB)
> - RAMP
>
> Here’s an example of what I mean:
>
> =Calvin=
>
> Approach: global consensus (Paxos in Calvin, Raft in FaunaDB) to order
> transactions, then replicas execute the transactions independently with no
> further coordination.  No SPOF.  Transactions are batched by each sequencer
> to keep this from becoming a bottleneck.
>
> Performance: Calvin paper (published 2012) reports linear scaling of TPC-C
> New Order up to 500,000 transactions/s on 100 machines (EC2 XL machines
> with 7GB ram and 8 virtual cores).  Note that TPC-C New Order is composed
> of four reads and four writes, so this is effectively 2M reads and 2M
> writes as we normally measure them in C*.
>
> Calvin supports mixed read/write transactions, but because the transaction
> execution logic requires knowing all partition keys in advance to ensure
> that all replicas can reproduce the same results with no coordination,
> reads against non-PK predicates must be done ahead of time (transparently,
> by the server) to determine the set of keys, and this must be retried if
> the set of rows affected is updated before the actual transaction executes.
>
> Batching and global consensus add latency -- 100ms in the Calvin paper and
> apparently about 50ms in FaunaDB.  Glass half full: all transactions
> (including multi-partition updates) are equally performant in Calvin since
> the coordination is handled up front in the sequencing step.  Glass half
> empty: even single-row reads and writes have to pay the full coordination
> cost.  Fauna has optimized this away for reads but I am not aware of a
> description of how they changed the design to allow this.
>
> Functionality and limitations: since the entire transaction must be known
> in advance to allow coordination-less execution at the replicas, Calvin
> cannot support interactive transactions at all.  FaunaDB mitigates this by
> allowing server-side logic to be included, but a Calvin approach will never
> be able to offer SQL compatibility.
>
> Guarantees: Calvin transactions are strictly serializable.  There is no
> additional complexity or performance hit to generalizing to multiple
> regions, apart from the speed of light.  And since Calvin is already paying
> a batching latency penalty, this is less painful than for other systems.
>
> Application to Cassandra: B-.  Distributed transactions are handled by the
> sequencing and scheduling layers, which are leaderless, and Calvin’s
> requirements for the storage layer are easily met by C*.  But Calvin also
> requires a global consensus protocol and LWT is almost certainly not
> sufficiently performant, so this would require ZK or etcd (reasonable for a
> library approach but not for replacing LWT in C* itself), or an
> implementation of Accord.  I don’t believe Calvin would require additional
> table-level metadata in Cassandra.
>
> On Sun, Sep 5, 2021 at 9:33 AM benedict@apache.org <be...@apache.org>
> wrote:
>
> > Wiki:
> > https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-15%3A+General+Purpose+Transactions
> > Whitepaper:
> > https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf
> > <
> > https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf?version=1&modificationDate=1630847736966&api=v2
> > >
> > Prototype: https://github.com/belliottsmith/accord
> >
> > Hi everyone, I’d like to propose this CEP for adoption by the community.
> >
> > Cassandra has benefitted from LWTs for many years, but application
> > developers that want to ensure consistency for complex operations must
> > either accept the scalability bottleneck of serializing all related state
> > through a single partition, or layer a complex state machine on top of the
> > database. These are sophisticated and costly activities that our users
> > should not be expected to undertake. Since distributed databases are
> > beginning to offer distributed transactions with fewer caveats, it is past
> > time for Cassandra to do so as well.
> >
> > This CEP proposes the use of several novel techniques that build upon
> > research (that followed EPaxos) to deliver (non-interactive) general
> > purpose distributed transactions. The approach is outlined in the wikipage
> > and in more detail in the linked whitepaper. Importantly, by adopting this
> > approach we will be the _only_ distributed database to offer global,
> > scalable, strict serializable transactions in one wide area round-trip.
> > This would represent a significant improvement in the state of the art,
> > both in the academic literature and in commercial or open source offerings.
> >
> > This work has been partially realised in a prototype. This partial
> > prototype has been verified against Jepsen.io’s Maelstrom library and
> > dedicated in-tree strict serializability verification tools, but much work
> > remains for the work to be production capable and integrated into Cassandra.
> >
> > I propose including the prototype in the project as a new source
> > repository, to be developed as a standalone library for integration into
> > Cassandra. I hope the community sees the important value proposition of
> > this proposal, and will adopt the CEP after this discussion, so that the
> > library and its integration into Cassandra can be developed in parallel and
> > with the involvement of the wider community.
> >
>
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced
>


-- 
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by "benedict@apache.org" <be...@apache.org>.

Hi Jonathan,

These other systems are incompatible with the goals of the CEP. I do discuss them (besides 2PC) in both the whitepaper and the CEP, and will summarise that discussion below. A true and accurate comparison of these other systems is essentially intractable, as there are complex subtleties to each flavour, and those who are interested would be better served by performing their own research.

I think it is more productive to focus on what we want to achieve as a community. If you believe the goals of this CEP are wrong for the project, let’s focus on that. If you want to compare and contrast specific facets of alternative systems that you consider to be preferable in some dimension, let’s do that here or in a Q&A as proposed by Joey.

The relevant goals are that we:

  1.  Guarantee strict serializable isolation on commodity hardware
  2.  Scale to any cluster size
  3.  Achieve optimal latency

The approach taken by Spanner derivatives is rejected by (1) because they guarantee only Serializable isolation (they additionally fail (3)). From watching talks by YugaByte, and inferring from Cockroach’s panic-cluster-death under clock skew, this is clearly considered by everyone to be undesirable but necessary to achieve scalability.

The approach taken by FaunaDB (Calvin) is rejected by (2) because its sequencing layer requires a global leader process for the cluster, which is incompatible with Cassandra’s scalability requirements. It additionally fails (3) for global clients.

Two phase commit fails (3). As an aside, AFAICT DynamoDB is today a Spanner clone for its multi-key transaction functionality, not 2PC.

Systems such as RAMP with even weaker isolation are not considered for the simple reason that they do not even claim to meet (1).

If we want to additionally offer weaker isolation levels than Serializable, such as that provided by the recent RAMP-TAO paper, Cassandra is likely able to support multiple distinct transaction layers that operate independently. I would encourage you to file a CEP to explore how we can meet these distinct use cases, but I consider them to be niche. I expect that a majority of our user base desire strict serializable isolation, and certainly no less than serializable isolation, to augment the existing weaker isolation offered by quorum reads and writes.

I would tangentially note that we are not an AP database under normal recommended operation. A minority in any network partition cannot reach QUORUM, so under recommended usage we are a high-availability leaderless CP database.
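
To make the quorum argument concrete, here is a minimal sketch. It assumes QUORUM means floor(RF / 2) + 1 replicas, and the function names are illustrative only.

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def can_reach_quorum(reachable_replicas: int, replication_factor: int) -> bool:
    """True if this side of a partition can still serve QUORUM requests."""
    return reachable_replicas >= quorum(replication_factor)


def sides_with_quorum(replication_factor: int) -> list:
    """For every two-way split of the replicas, count the sides with quorum."""
    counts = []
    for left in range(replication_factor + 1):
        right = replication_factor - left
        counts.append(
            int(can_reach_quorum(left, replication_factor))
            + int(can_reach_quorum(right, replication_factor))
        )
    return counts


# RF=3 split 2|1: only the side holding two replicas can proceed.
majority_ok = can_reach_quorum(2, 3)
minority_ok = can_reach_quorum(1, 3)
```

Because two disjoint majorities cannot exist, `sides_with_quorum` never reports more than one side able to make progress, which is the CP behaviour described above.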

From: Jonathan Ellis <jb...@gmail.com>
Date: Tuesday, 21 September 2021 at 23:45
To: dev <de...@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
Benedict, thanks for taking the lead in putting this together. Since
Cassandra is the only relevant database today designed around a leaderless
architecture, it's quite likely that we'll be better served with a custom
transaction design instead of trying to retrofit one from CP systems.

The whitepaper here is a good description of the consensus algorithm itself
as well as its robustness and stability characteristics, and its comparison
with other state-of-the-art consensus algorithms is very useful.  In the
context of Cassandra, where a consensus algorithm is only part of what will
be implemented, I'd like to see a more complete evaluation of the
transactional side of things as well, including performance characteristics
as well as the types of transactions that can be supported and at least a
general idea of what it would look like applied to Cassandra. This will
allow the PMC to make a more informed decision about what tradeoffs are
best for the entire long-term project of first supplementing and ultimately
replacing LWT.

(Allowing users to mix LWT and AP Cassandra operations against the same
rows was probably a mistake, so in contrast with LWT we’re not looking for
something fast enough for occasional use but rather something within a
reasonable factor of AP operations, appropriate to being the only way to
interact with tables declared as such.)

Besides Accord, this should cover

- Calvin and FaunaDB
- A Spanner derivative (no opinion on whether that should be Cockroach or
Yugabyte, I don’t think it’s necessary to cover both)
- A 2PC implementation (the Accord paper mentions DynamoDB but I suspect
there is more public information about MongoDB)
- RAMP

Here’s an example of what I mean:

=Calvin=

Approach: global consensus (Paxos in Calvin, Raft in FaunaDB) to order
transactions, then replicas execute the transactions independently with no
further coordination.  No SPOF.  Transactions are batched by each sequencer
to keep this from becoming a bottleneck.

Performance: Calvin paper (published 2012) reports linear scaling of TPC-C
New Order up to 500,000 transactions/s on 100 machines (EC2 XL machines
with 7GB ram and 8 virtual cores).  Note that TPC-C New Order is composed
of four reads and four writes, so this is effectively 2M reads and 2M
writes as we normally measure them in C*.

Calvin supports mixed read/write transactions, but because the transaction
execution logic requires knowing all partition keys in advance to ensure
that all replicas can reproduce the same results with no coordination,
reads against non-PK predicates must be done ahead of time (transparently,
by the server) to determine the set of keys, and this must be retried if
the set of rows affected is updated before the actual transaction executes.

Batching and global consensus add latency -- 100ms in the Calvin paper and
apparently about 50ms in FaunaDB.  Glass half full: all transactions
(including multi-partition updates) are equally performant in Calvin since
the coordination is handled up front in the sequencing step.  Glass half
empty: even single-row reads and writes have to pay the full coordination
cost.  Fauna has optimized this away for reads but I am not aware of a
description of how they changed the design to allow this.

Functionality and limitations: since the entire transaction must be known
in advance to allow coordination-less execution at the replicas, Calvin
cannot support interactive transactions at all.  FaunaDB mitigates this by
allowing server-side logic to be included, but a Calvin approach will never
be able to offer SQL compatibility.

Guarantees: Calvin transactions are strictly serializable.  There is no
additional complexity or performance hit to generalizing to multiple
regions, apart from the speed of light.  And since Calvin is already paying
a batching latency penalty, this is less painful than for other systems.

Application to Cassandra: B-.  Distributed transactions are handled by the
sequencing and scheduling layers, which are leaderless, and Calvin’s
requirements for the storage layer are easily met by C*.  But Calvin also
requires a global consensus protocol and LWT is almost certainly not
sufficiently performant, so this would require ZK or etcd (reasonable for a
library approach but not for replacing LWT in C* itself), or an
implementation of Accord.  I don’t believe Calvin would require additional
table-level metadata in Cassandra.

On Sun, Sep 5, 2021 at 9:33 AM benedict@apache.org <be...@apache.org>
wrote:

> Wiki:
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-15%3A+General+Purpose+Transactions
> Whitepaper:
> https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf
> <
> https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf?version=1&modificationDate=1630847736966&api=v2
> >
> Prototype: https://github.com/belliottsmith/accord
>
> Hi everyone, I’d like to propose this CEP for adoption by the community.
>
> Cassandra has benefitted from LWTs for many years, but application
> developers that want to ensure consistency for complex operations must
> either accept the scalability bottleneck of serializing all related state
> through a single partition, or layer a complex state machine on top of the
> database. These are sophisticated and costly activities that our users
> should not be expected to undertake. Since distributed databases are
> beginning to offer distributed transactions with fewer caveats, it is past
> time for Cassandra to do so as well.
>
> This CEP proposes the use of several novel techniques that build upon
> research (that followed EPaxos) to deliver (non-interactive) general
> purpose distributed transactions. The approach is outlined in the wikipage
> and in more detail in the linked whitepaper. Importantly, by adopting this
> approach we will be the _only_ distributed database to offer global,
> scalable, strict serializable transactions in one wide area round-trip.
> This would represent a significant improvement in the state of the art,
> both in the academic literature and in commercial or open source offerings.
>
> This work has been partially realised in a prototype. This partial
> prototype has been verified against Jepsen.io’s Maelstrom library and
> dedicated in-tree strict serializability verification tools, but much work
> remains for the work to be production capable and integrated into Cassandra.
>
> I propose including the prototype in the project as a new source
> repository, to be developed as a standalone library for integration into
> Cassandra. I hope the community sees the important value proposition of
> this proposal, and will adopt the CEP after this discussion, so that the
> library and its integration into Cassandra can be developed in parallel and
> with the involvement of the wider community.
>

--
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by Jonathan Ellis <jb...@gmail.com>.

Benedict, thanks for taking the lead in putting this together. Since
Cassandra is the only relevant database today designed around a leaderless
architecture, it's quite likely that we'll be better served with a custom
transaction design instead of trying to retrofit one from CP systems.

The whitepaper here is a good description of the consensus algorithm itself
as well as its robustness and stability characteristics, and its comparison
with other state-of-the-art consensus algorithms is very useful.  In the
context of Cassandra, where a consensus algorithm is only part of what will
be implemented, I'd like to see a more complete evaluation of the
transactional side of things as well, including performance characteristics
as well as the types of transactions that can be supported and at least a
general idea of what it would look like applied to Cassandra. This will
allow the PMC to make a more informed decision about what tradeoffs are
best for the entire long-term project of first supplementing and ultimately
replacing LWT.

(Allowing users to mix LWT and AP Cassandra operations against the same
rows was probably a mistake, so in contrast with LWT we’re not looking for
something fast enough for occasional use but rather something within a
reasonable factor of AP operations, appropriate to being the only way to
interact with tables declared as such.)

Besides Accord, this should cover

- Calvin and FaunaDB
- A Spanner derivative (no opinion on whether that should be Cockroach or
Yugabyte, I don’t think it’s necessary to cover both)
- A 2PC implementation (the Accord paper mentions DynamoDB but I suspect
there is more public information about MongoDB)
- RAMP

Here’s an example of what I mean:

=Calvin=

Approach: global consensus (Paxos in Calvin, Raft in FaunaDB) to order
transactions, then replicas execute the transactions independently with no
further coordination.  No SPOF.  Transactions are batched by each sequencer
to keep this from becoming a bottleneck.
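
A minimal sketch of the "order first, then execute independently" idea, assuming transactions are deterministic functions of the state. The helper names (`apply_log`, `deposit`, `transfer`) are hypothetical and not taken from Calvin or FaunaDB.

```python
def apply_log(ordered_log, initial_state):
    """Execute an agreed-upon sequence of transactions against a local copy."""
    state = dict(initial_state)
    for txn in ordered_log:
        txn(state)
    return state


def deposit(key, amount):
    def txn(state):
        state[key] = state.get(key, 0) + amount
    return txn


def transfer(src, dst, amount):
    def txn(state):
        # Deterministic: the outcome depends only on the state and the order.
        if state.get(src, 0) >= amount:
            state[src] -= amount
            state[dst] = state.get(dst, 0) + amount
    return txn


# The log stands in for the output of the global consensus step.
log = [deposit("a", 100), transfer("a", "b", 30), transfer("b", "c", 50)]

# Each replica replays the same log with no further coordination.
replica_1 = apply_log(log, {})
replica_2 = apply_log(log, {})
```

Both replicas end in the same state because every input to execution was fixed by the ordering step; the last transfer is a no-op on both, since "b" holds only 30.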

Performance: Calvin paper (published 2012) reports linear scaling of TPC-C
New Order up to 500,000 transactions/s on 100 machines (EC2 XL machines
with 7GB RAM and 8 virtual cores).  Note that TPC-C New Order is composed
of four reads and four writes, so this is effectively 2M reads and 2M
writes as we normally measure them in C*.
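
The conversion is simple arithmetic; the Python snippet below just spells it out using the figures quoted above. The per-machine numbers are my own division and assume load is spread evenly across the 100 machines.

```python
txns_per_sec = 500_000   # TPC-C New Order throughput quoted above
machines = 100
reads_per_txn = 4        # composition of New Order as described above
writes_per_txn = 4

reads_per_sec = txns_per_sec * reads_per_txn
writes_per_sec = txns_per_sec * writes_per_txn

# Assumption: even load across machines.
txns_per_machine = txns_per_sec // machines
reads_per_machine = reads_per_sec // machines
writes_per_machine = writes_per_sec // machines

print(reads_per_sec, writes_per_sec)
print(txns_per_machine, reads_per_machine, writes_per_machine)
```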

Calvin supports mixed read/write transactions, but because the transaction
execution logic requires knowing all partition keys in advance to ensure
that all replicas can reproduce the same results with no coordination,
reads against non-PK predicates must be done ahead of time (transparently,
by the server) to determine the set of keys, and this must be retried if
the set of rows affected is updated before the actual transaction executes.
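
A rough Python sketch of that read-ahead-and-retry loop follows. It is an illustration under assumptions, not Calvin's implementation: recon and run_dependent_txn are hypothetical names, the database is a plain dict, and the interfere hook exists only to simulate a concurrent write landing between the reconnaissance read and execution.

```python
def recon(db, predicate):
    """Reconnaissance read: resolve a non-PK predicate to a set of keys."""
    return frozenset(k for k, row in db.items() if predicate(row))


def run_dependent_txn(db, predicate, update, interfere=None, max_attempts=5):
    """Declare the key set up front; retry if it is stale at execution."""
    for attempt in range(1, max_attempts + 1):
        declared = recon(db, predicate)   # done ahead of sequencing
        if interfere is not None:
            interfere(db, attempt)        # simulated concurrent writer
        # The transaction would be sequenced here with `declared` as its
        # full key set; at execution the set is re-checked.
        if recon(db, predicate) != declared:
            continue                      # key set changed: abort and retry
        for key in declared:
            db[key] = update(db[key])
        return declared, attempt
    raise RuntimeError("key set kept changing; giving up")


db = {
    1: {"city": "Austin", "visits": 0},
    2: {"city": "Boston", "visits": 0},
}


def add_austin_row_once(db, attempt):
    """Insert a matching row during the first attempt only."""
    if attempt == 1:
        db[3] = {"city": "Austin", "visits": 0}


keys, attempts = run_dependent_txn(
    db,
    predicate=lambda row: row["city"] == "Austin",
    update=lambda row: {**row, "visits": row["visits"] + 1},
    interfere=add_austin_row_once,
)
print(sorted(keys), attempts)
```

The first attempt declares only key 1, sees the new row at execution time, and retries; the second attempt declares keys 1 and 3 and commits.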

Batching and global consensus add latency -- 100ms in the Calvin paper and
apparently about 50ms in FaunaDB.  Glass half full: all transactions
(including multi-partition updates) are equally performant in Calvin since
the coordination is handled up front in the sequencing step.  Glass half
empty: even single-row reads and writes have to pay the full coordination
cost.  Fauna has optimized this away for reads but I am not aware of a
description of how they changed the design to allow this.

Functionality and limitations: since the entire transaction must be known
in advance to allow coordination-less execution at the replicas, Calvin
cannot support interactive transactions at all.  FaunaDB mitigates this by
allowing server-side logic to be included, but a Calvin approach will never
be able to offer SQL compatibility.

Guarantees: Calvin transactions are strictly serializable.  There is no
additional complexity or performance hit to generalizing to multiple
regions, apart from the speed of light.  And since Calvin is already paying
a batching latency penalty, this is less painful than for other systems.

Application to Cassandra: B-.  Distributed transactions are handled by the
sequencing and scheduling layers, which are leaderless, and Calvin’s
requirements for the storage layer are easily met by C*.  But Calvin also
requires a global consensus protocol and LWT is almost certainly not
sufficiently performant, so this would require ZK or etcd (reasonable for a
library approach but not for replacing LWT in C* itself), or an
implementation of Accord.  I don’t believe Calvin would require additional
table-level metadata in Cassandra.

-- 
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced