Posted to dev@cassandra.apache.org by Jonathan Ellis <jb...@gmail.com> on 2021/10/01 04:00:12 UTC

Re: [DISCUSS] CEP-15: General Purpose Transactions

The obstacle for me is that you've provided a protocol but not a fully
fleshed-out architecture, so it's hard to fill in some of the blanks.  But it
looks to me like optimistic concurrency control for interactive transactions
applied to Accord would leave you in an LWT-like situation under fairly
light contention, where nobody actually makes progress due to retries.
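
As a rough illustration of the retry concern, here is a minimal Python sketch
of timestamp-validated optimistic concurrency control. It is a toy
single-process model, not Accord or Cassandra code: `Store`, `Txn`, and the
per-key version counter are all names assumed for the example.

```python
# Hypothetical in-memory model; none of these names come from Accord or Cassandra.
class Store:
    def __init__(self):
        self.data = {}      # key -> committed value
        self.version = {}   # key -> logical timestamp of the last committed write
        self.clock = 0      # logical commit clock

    def read(self, key):
        """Return (value, version) for a key; version 0 means never written."""
        return self.data.get(key), self.version.get(key, 0)

    def commit(self, read_versions, writes):
        """Apply writes only if every key read is still at the version seen."""
        for key, seen in read_versions.items():
            if self.version.get(key, 0) != seen:
                return False  # a newer write landed; the caller must retry
        self.clock += 1
        for key, value in writes.items():
            self.data[key] = value
            self.version[key] = self.clock
        return True


class Txn:
    """Buffers reads and writes client-side, validates them at commit."""

    def __init__(self, store):
        self.store = store
        self.read_versions = {}
        self.writes = {}

    def get(self, key):
        if key in self.writes:          # read-your-own-writes
            return self.writes[key]
        value, version = self.store.read(key)
        self.read_versions.setdefault(key, version)
        return value

    def put(self, key, value):
        self.writes[key] = value

    def commit(self):
        return self.store.commit(self.read_versions, self.writes)


store = Store()
store.commit({}, {"next_order_id": 1})

# Two clients read the same counter, then both try to increment it.
t1, t2 = Txn(store), Txn(store)
t1.put("next_order_id", t1.get("next_order_id") + 1)
t2.put("next_order_id", t2.get("next_order_id") + 1)

first = t1.commit()    # validation passes, write is applied
second = t2.commit()   # validation fails, t2 has to start over
```

In this toy model the first committer wins and the loser has to re-run its
whole transaction, so contention on a hot key turns directly into client
retries.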

To make sure we're talking about the same thing, as Henrik pointed out,
interactive transactions mean multiple round trips from the client within a
transaction.  For example, here
<https://github.com/apavlo/py-tpcc/blob/master/pytpcc/drivers/sqlitedriver.py#L213>
is a simple implementation of the TPC-C New Order transaction.  The high
level logic (via
<https://courses.cs.washington.edu/courses/csep545/01wi/lectures/class1/tsld039.htm>)
is,

   1. Get records describing a warehouse, customer, & district
   2. Update the district
   3. Increment next available order number
   4. Insert record into Order and New-Order tables
   5. For 5-15 items, get Item record, get/update Stock record
   6. Insert Order-Line Record

As you can see, this requires a lot of client-side logic mixed in with the
actual SQL commands.
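
A condensed sketch of that shape, using Python's built-in `sqlite3` module.
The schema is a heavily simplified stand-in for TPC-C (a handful of columns,
no warehouse or customer tables), and the `new_order` helper and its restock
branch are illustrative assumptions rather than the linked driver's code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE district (d_id INTEGER PRIMARY KEY, d_next_o_id INTEGER);
    CREATE TABLE stock (s_i_id INTEGER PRIMARY KEY, s_quantity INTEGER);
    CREATE TABLE orders (o_id INTEGER, o_d_id INTEGER);
    CREATE TABLE order_line (ol_o_id INTEGER, ol_i_id INTEGER, ol_quantity INTEGER);
    INSERT INTO district VALUES (1, 3001);
    INSERT INTO stock VALUES (10, 50), (11, 12);
""")


def new_order(conn, d_id, items):
    """Place an order; items is a list of (item_id, quantity) pairs."""
    with conn:  # commits on success, rolls back if an exception escapes
        # Read the next order number, then use it in later statements.
        (o_id,) = conn.execute(
            "SELECT d_next_o_id FROM district WHERE d_id = ?", (d_id,)
        ).fetchone()
        conn.execute(
            "UPDATE district SET d_next_o_id = ? WHERE d_id = ?", (o_id + 1, d_id)
        )
        conn.execute("INSERT INTO orders VALUES (?, ?)", (o_id, d_id))

        for i_id, qty in items:
            (s_qty,) = conn.execute(
                "SELECT s_quantity FROM stock WHERE s_i_id = ?", (i_id,)
            ).fetchone()
            # Client-side decision between statements (assumed restock rule,
            # in the spirit of TPC-C): top up when stock would run low.
            new_qty = s_qty - qty if s_qty - qty >= 10 else s_qty - qty + 91
            conn.execute(
                "UPDATE stock SET s_quantity = ? WHERE s_i_id = ?", (new_qty, i_id)
            )
            conn.execute(
                "INSERT INTO order_line VALUES (?, ?, ?)", (o_id, i_id, qty)
            )
    return o_id


order_id = new_order(conn, 1, [(10, 5), (11, 5)])
```

Each `SELECT` result feeds a decision made in client code before the next
statement is issued, which is what makes the transaction interactive.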


On Thu, Sep 30, 2021 at 2:30 AM benedict@apache.org <be...@apache.org>
wrote:

> Essentially this, although I think in practice we will need to track each
> partition’s timestamp separately (or optionally for reduced conflicts, each
> row or datum’s), and make them all part of the conditional application of
> the transaction - at least for strict-serializability.
>
> The alternative is to insert read/write intents for the transaction during
> each step, and to confirm they are still valid on commit, but this approach
> would require a WAN round-trip for each step in the interactive
> transaction, whereas the timestamp-validating approach can use a LAN
> round-trip for each step besides the final one, and is also much simpler to
> implement.
>
>
> From: Blake Eggleston <be...@apple.com.INVALID>
> Date: Thursday, 30 September 2021 at 05:47
> To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> You could establish a lower timestamp bound and buffer transaction state
> on the coordinator, then make the commit an operation that only applies if
> all partitions involved haven’t been changed by a more recent timestamp.
> You could also implement mvcc either in the storage layer or for some
> period of time by buffering commits on each replica before applying.
>
> > On Sep 29, 2021, at 6:18 PM, Jonathan Ellis <jb...@gmail.com> wrote:
> >
> > How are interactive transactions possible with Accord?
> >
> >
> >
> > On Tue, Sep 21, 2021 at 11:56 PM benedict@apache.org <
> benedict@apache.org>
> > wrote:
> >
> >> Could you explain why you believe this trade-off is necessary? We can
> >> support full SQL just fine with Accord, and I hope that we eventually
> do so.
> >>
> >> This domain is incredibly complex, so it is easy to reach wrong
> >> conclusions. I would invite you again to propose a system for discussion
> >> that you think offers something Accord is unable to, and that you
> consider
> >> desirable, and we can work from there.
> >>
> >> To pre-empt some possible discussions, I am not aware of anything we
> >> cannot do with Accord that we could do with either Calvin or Spanner.
> >> Interactive transactions are possible on top of Accord, as are
> transactions
> >> with an unknown read/write set. In each case the only cost is that they
> >> would use optimistic concurrency control, which is no worse than Spanner
> >> derivatives anyway (which I have to assume is your benchmark in this
> >> regard). I do not expect to deliver either functionality initially, but
> >> Accord takes us most of the way there for both.
> >>
> >>
> >> From: Jonathan Ellis <jb...@gmail.com>
> >> Date: Wednesday, 22 September 2021 at 05:36
> >> To: dev <de...@cassandra.apache.org>
> >> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> >> Right, I'm looking for exactly a discussion on the high level goals.
> >> Instead of saying "here's the goals and we ruled out X because Y" we
> should
> >> start with a discussion around, "Approach A allows X and W, approach B
> >> allows Y and Z" and decide together what the goals should be and what
> >> we are willing to trade to get those goals, e.g., are we willing to
> give up
> >> global strict serializability to get the ability to support full SQL.
> Both
> >> of these are nice to have!
> >>
> >> On Tue, Sep 21, 2021 at 9:52 PM benedict@apache.org <
> benedict@apache.org>
> >> wrote:
> >>
> >>> Hi Jonathan,
> >>>
> >>> These other systems are incompatible with the goals of the CEP. I do
> >>> discuss them (besides 2PC) in both the whitepaper and the CEP, and will
> >>> summarise that discussion below. A true and accurate comparison of
> these
> >>> other systems is essentially intractable, as there are complex
> subtleties
> >>> to each flavour, and those who are interested would be better served by
> >>> performing their own research.
> >>>
> >>> I think it is more productive to focus on what we want to achieve as a
> >>> community. If you believe the goals of this CEP are wrong for the
> >> project,
> >>> let’s focus on that. If you want to compare and contrast specific
> facets
> >> of
> >>> alternative systems that you consider to be preferable in some
> dimension,
> >>> let’s do that here or in a Q&A as proposed by Joey.
> >>>
> >>> The relevant goals are that we:
> >>>
> >>>
> >>>  1.  Guarantee strict serializable isolation on commodity hardware
> >>>  2.  Scale to any cluster size
> >>>  3.  Achieve optimal latency
> >>>
> >>> The approach taken by Spanner derivatives is rejected by (1) because
> they
> >>> guarantee only Serializable isolation (they additionally fail (3)).
> From
> >>> watching talks by YugaByte, and inferring from Cockroach’s
> >>> panic-cluster-death under clock skew, this is clearly considered by
> >>> everyone to be undesirable but necessary to achieve scalability.
> >>>
> >>> The approach taken by FaunaDB (Calvin) is rejected by (2) because its
> >>> sequencing layer requires a global leader process for the cluster,
> which
> >> is
> >>> incompatible with Cassandra’s scalability requirements. It additionally
> >>> fails (3) for global clients.
> >>>
> >>> Two phase commit fails (3). As an aside, AFAICT DynamoDB is today a
> >>> Spanner clone for its multi-key transaction functionality, not 2PC.
> >>>
> >>> Systems such as RAMP with even weaker isolation are not considered for
> >> the
> >>> simple reason that they do not even claim to meet (1).
> >>>
> >>> If we want to additionally offer weaker isolation levels than
> >>> Serializable, such as that provided by the recent RAMP-TAO paper,
> >> Cassandra
> >>> is likely able to support multiple distinct transaction layers that
> >> operate
> >>> independently. I would encourage you to file a CEP to explore how we
> can
> >>> meet these distinct use cases, but I consider them to be niche. I
> expect
> >>> that a majority of our user base desire strict serializable isolation,
> >> and
> >>> certainly no less than serializable isolation, to augment the existing
> >>> weaker isolation offered by quorum reads and writes.
> >>>
> >>> I would tangentially note that we are not an AP database under normal
> >>> recommended operation. A minority in any network partition cannot reach
> >>> QUORUM, so under recommended usage we are a high-availability
> leaderless
> >> CP
> >>> database.
> >>>
> >>>
> >>> From: Jonathan Ellis <jb...@gmail.com>
> >>> Date: Tuesday, 21 September 2021 at 23:45
> >>> To: dev <de...@cassandra.apache.org>
> >>> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> >>> Benedict, thanks for taking the lead in putting this together. Since
> >>> Cassandra is the only relevant database today designed around a
> >> leaderless
> >>> architecture, it's quite likely that we'll be better served with a
> custom
> >>> transaction design instead of trying to retrofit one from CP systems.
> >>>
> >>> The whitepaper here is a good description of the consensus algorithm
> >> itself
> >>> as well as its robustness and stability characteristics, and its
> >> comparison
> >>> with other state-of-the-art consensus algorithms is very useful.  In
> the
> >>> context of Cassandra, where a consensus algorithm is only part of what
> >> will
> >>> be implemented, I'd like to see a more complete evaluation of the
> >>> transactional side of things as well, including performance
> >> characteristics
> >>> as well as the types of transactions that can be supported and at
> least a
> >>> general idea of what it would look like applied to Cassandra. This will
> >>> allow the PMC to make a more informed decision about what tradeoffs are
> >>> best for the entire long-term project of first supplementing and
> >> ultimately
> >>> replacing LWT.
> >>>
> >>> (Allowing users to mix LWT and AP Cassandra operations against the same
> >>> rows was probably a mistake, so in contrast with LWT we’re not looking
> >> for
> >>> something fast enough for occasional use but rather something within a
> >>> reasonable factor of AP operations, appropriate to being the only way
> to
> >>> interact with tables declared as such.)
> >>>
> >>> Besides Accord, this should cover
> >>>
> >>> - Calvin and FaunaDB
> >>> - A Spanner derivative (no opinion on whether that should be Cockroach
> or
> >>> Yugabyte, I don’t think it’s necessary to cover both)
> >>> - A 2PC implementation (the Accord paper mentions DynamoDB but I
> suspect
> >>> there is more public information about MongoDB)
> >>> - RAMP
> >>>
> >>> Here’s an example of what I mean:
> >>>
> >>> =Calvin=
> >>>
> >>> Approach: global consensus (Paxos in Calvin, Raft in FaunaDB) to order
> >>> transactions, then replicas execute the transactions independently with
> >> no
> >>> further coordination.  No SPOF.  Transactions are batched by each
> >> sequencer
> >>> to keep this from becoming a bottleneck.
> >>>
> >>> Performance: Calvin paper (published 2012) reports linear scaling of
> >> TPC-C
> >>> New Order up to 500,000 transactions/s on 100 machines (EC2 XL machines
> >>> with 7GB ram and 8 virtual cores).  Note that TPC-C New Order is
> composed
> >>> of four reads and four writes, so this is effectively 2M reads and 2M
> >>> writes as we normally measure them in C*.
> >>>
> >>> Calvin supports mixed read/write transactions, but because the
> >> transaction
> >>> execution logic requires knowing all partition keys in advance to
> ensure
> >>> that all replicas can reproduce the same results with no coordination,
> >>> reads against non-PK predicates must be done ahead of time
> >> (transparently,
> >>> by the server) to determine the set of keys, and this must be retried
> if
> >>> the set of rows affected is updated before the actual transaction
> >> executes.
> >>>
> >>> Batching and global consensus add latency -- 100ms in the Calvin paper
> >> and
> >>> apparently about 50ms in FaunaDB.  Glass half full: all transactions
> >>> (including multi-partition updates) are equally performant in Calvin
> >> since
> >>> the coordination is handled up front in the sequencing step.  Glass
> half
> >>> empty: even single-row reads and writes have to pay the full
> coordination
> >>> cost.  Fauna has optimized this away for reads but I am not aware of a
> >>> description of how they changed the design to allow this.
> >>>
> >>> Functionality and limitations: since the entire transaction must be
> known
> >>> in advance to allow coordination-less execution at the replicas, Calvin
> >>> cannot support interactive transactions at all.  FaunaDB mitigates this
> >> by
> >>> allowing server-side logic to be included, but a Calvin approach will
> >> never
> >>> be able to offer SQL compatibility.
> >>>
> >>> Guarantees: Calvin transactions are strictly serializable.  There is no
> >>> additional complexity or performance hit to generalizing to multiple
> >>> regions, apart from the speed of light.  And since Calvin is already
> >> paying
> >>> a batching latency penalty, this is less painful than for other
> systems.
> >>>
> >>> Application to Cassandra: B-.  Distributed transactions are handled by
> >> the
> >>> sequencing and scheduling layers, which are leaderless, and Calvin’s
> >>> requirements for the storage layer are easily met by C*.  But Calvin
> also
> >>> requires a global consensus protocol and LWT is almost certainly not
> >>> sufficiently performant, so this would require ZK or etcd (reasonable
> >> for a
> >>> library approach but not for replacing LWT in C* itself), or an
> >>> implementation of Accord.  I don’t believe Calvin would require
> >> additional
> >>> table-level metadata in Cassandra.
> >>>
> >>> On Sun, Sep 5, 2021 at 9:33 AM benedict@apache.org <
> benedict@apache.org>
> >>> wrote:
> >>>
> >>>> Wiki:
> >>>>
> >>>
> >>
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-15%3A+General+Purpose+Transactions
> >>>> Whitepaper:
> >>>>
> >>>
> >>
> https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf
> >>>> <
> >>>>
> >>>
> >>
> https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf?version=1&modificationDate=1630847736966&api=v2
> >>>>>
> >>>> Prototype: https://github.com/belliottsmith/accord
> >>>>
> >>>> Hi everyone, I’d like to propose this CEP for adoption by the
> >> community.
> >>>>
> >>>> Cassandra has benefitted from LWTs for many years, but application
> >>>> developers that want to ensure consistency for complex operations must
> >>>> either accept the scalability bottleneck of serializing all related
> >> state
> >>>> through a single partition, or layer a complex state machine on top of
> >>> the
> >>>> database. These are sophisticated and costly activities that our users
> >>>> should not be expected to undertake. Since distributed databases are
> >>>> beginning to offer distributed transactions with fewer caveats, it is
> >>> past
> >>>> time for Cassandra to do so as well.
> >>>>
> >>>> This CEP proposes the use of several novel techniques that build upon
> >>>> research (that followed EPaxos) to deliver (non-interactive) general
> >>>> purpose distributed transactions. The approach is outlined in the
> >>> wikipage
> >>>> and in more detail in the linked whitepaper. Importantly, by adopting
> >>> this
> >>>> approach we will be the _only_ distributed database to offer global,
> >>>> scalable, strict serializable transactions in one wide area
> round-trip.
> >>>> This would represent a significant improvement in the state of the
> art,
> >>>> both in the academic literature and in commercial or open source
> >>> offerings.
> >>>>
> >>>> This work has been partially realised in a prototype. This partial
> >>>> prototype has been verified against Jepsen.io’s Maelstrom library and
> >>>> dedicated in-tree strict serializability verification tools, but much
> >>> work
> >>>> remains for the work to be production capable and integrated into
> >>> Cassandra.
> >>>>
> >>>> I propose including the prototype in the project as a new source
> >>>> repository, to be developed as a standalone library for integration
> >> into
> >>>> Cassandra. I hope the community sees the important value proposition
> of
> >>>> this proposal, and will adopt the CEP after this discussion, so that
> >> the
> >>>> library and its integration into Cassandra can be developed in
> parallel
> >>> and
> >>>> with the involvement of the wider community.
> >>>>
> >>>
> >>>
> >>> --
> >>> Jonathan Ellis
> >>> co-founder, http://www.datastax.com
> >>> @spyced
> >>>
> >>
> >>
> >> --
> >> Jonathan Ellis
> >> co-founder, http://www.datastax.com
> >> @spyced
> >>
> >
> >
> > --
> > Jonathan Ellis
> > co-founder, http://www.datastax.com
> > @spyced
>
>
>


-- 
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by "benedict@apache.org" <be...@apache.org>.
I am of course more than happy to continue discussing CEP-15 with respect to the proposed goals, and queries about the proposed protocol. I hope people feel free to continue raising queries. If anybody disagrees with the goals or any specific part of the proposal on substantive (rather than aesthetic/structural) grounds I also remain very open to further discussion.

However, I think at this point it is reasonable to request that we engage with the proposal as defined, and in particular the goals that have been proposed. Those who wish for a different proposal can produce one so that it may be engaged with on the same terms.

From: benedict@apache.org <be...@apache.org>
Date: Friday, 1 October 2021 at 14:19
To: dev@cassandra.apache.org <de...@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
I think this is getting circular and unproductive. I am inclined to leave basic disagreements about whether the CEP specifies a feature to a vote. In my view the CEP specifies several features, both immediate ones for the user (ACID batches and multi-key LWTs) and developer-focused ones around ground-breaking semantics that will be enabled.

The proposal as it stands today is exceptionally thorough, more so than any other CEP to date and more so than any CEP is likely to be in the near future.

This is a Cassandra Enhancement *Proposal*, and at some point we have to engage with what is proposed, not what you might like to be proposed. Since it remains unclear to me what either you or Jonathan wants to see as an alternative, at this point it would seem more productive to produce your own proposals for the community to consider. It is possible for multiple transaction systems to co-exist, if you feel this is necessary.



From: Paulo Motta <pa...@gmail.com>
Date: Friday, 1 October 2021 at 13:58
To: Cassandra DEV <de...@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
I share jbellis's feeling that this proposal focuses on the protocol itself
but lacks the actual feature that will use the protocol, which IMO is a key
element to discuss in a CEP.

It's similar to saying: hey, I want to add this Trie Serialization Protocol
to Cassandra, without providing specific details of how this protocol is
going to be used.

I think the right route for a CEP is to describe the feature that will be
added to the database and the protocol is a mere requirement of the
high-level feature, for example:

CEP: Add Trie-backed memtable
- Trie Serialization Protocol: implementation detail of the above CEP

What is the difficulty of taking this approach, picking one of the myriad
of features that will be enabled by Accord and using that as the initial
CEP to introduce the protocol to the database?

On Fri, Oct 1, 2021 at 8:37 AM benedict@apache.org <
benedict@apache.org> wrote:

> Actually, thinking about it again, the simple optimistic protocol would in
> fact guarantee system forward progress (i.e. independent of transaction
> formulation).
>
>
> From: benedict@apache.org <be...@apache.org>
> Date: Friday, 1 October 2021 at 09:14
> To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> Hi Jonathan,
>
> It would be great if we could achieve a bandwidth higher than 1-2 short
> emails per week. It remains unclear to me what your goal is, and it would
> help if you could make a statement like “I want Cassandra to be able to do
> X” so that we can respond directly to it. I am also available to have
> another call, in which we can have a back and forth, please feel free to
> propose a London-compatible time within the next week that is suitable for
> you.
>
> In my opinion we are at risk of veering off-topic, though. This CEP is not
> to deliver interactive transactions, and to my knowledge nobody is
> proposing a CEP for interactive transactions. So, for the CEP at hand the
> salient question seems: does this CEP prevent us from implementing
> interactive transactions with properties X, Y, Z in future? To which the
> answer is almost certainly no.
>
> However, to continue the discussion and respond directly to your queries,
> I believe we agree on the definition of an interactive transaction.
>
> Two protocols were loosely outlined. The first, using timestamps for
> optimistic concurrency control, would indeed involve the possibility of
> aborts. It would not, however, inherently suffer from the LWT problem where no
> transaction is able to make progress. Whether or not progress is guaranteed
> (in a livelock-free sense) would depend on the structure of the
> transactions that were interfering.
>
> This approach has the advantage of being very simple to implement, so that
> we could realistically support interactive transactions quite quickly. It
> has the additional advantage that transactions would execute very quickly
> by avoiding the WAN during construction, and as a result may in practice
> experience fewer aborts than protocols that guarantee livelock-freedom.
>
> The second protocol proposed using read/write intents and would be able to
> support almost any behaviour you want. We could even utilise pessimistic
> concurrency control, or anything in-between. This is its own huge design
> space, and discussion of this approach and the trade-offs that could be
> made is (in my opinion) entirely out of scope for this CEP.
>
>
> From: Jonathan Ellis <jb...@gmail.com>
> Date: Friday, 1 October 2021 at 05:00
> To: dev <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> The obstacle for me is that you've provided a protocol but not a fully
> fleshed-out architecture, so it's hard to fill in some of the blanks.  But it
> looks to me like optimistic concurrency control for interactive transactions
> applied to Accord would leave you in an LWT-like situation under fairly
> light contention, where nobody actually makes progress due to retries.
>
> To make sure we're talking about the same thing, as Henrik pointed out,
> interactive transactions mean multiple round trips from the client within a
> transaction.  For example, here
> <
> https://github.com/apavlo/py-tpcc/blob/master/pytpcc/drivers/sqlitedriver.py#L213
> >
> is a simple implementation of the TPC-C New Order transaction.  The high
> level logic (via
> <
> https://courses.cs.washington.edu/courses/csep545/01wi/lectures/class1/tsld039.htm
> >)
> is,
>
>    1. Get records describing a warehouse, customer, & district
>    2. Update the district
>    3. Increment next available order number
>    4. Insert record into Order and New-Order tables
>    5. For 5-15 items, get Item record, get/update Stock record
>    6. Insert Order-Line Record
>
> As you can see, this requires a lot of client-side logic mixed in with the
> actual SQL commands.
>
>
> On Thu, Sep 30, 2021 at 2:30 AM benedict@apache.org <be...@apache.org>
> wrote:
>
> > Essentially this, although I think in practice we will need to track each
> > partition’s timestamp separately (or optionally for reduced conflicts,
> each
> > row or datum’s), and make them all part of the conditional application of
> > the transaction - at least for strict-serializability.
> >
> > The alternative is to insert read/write intents for the transaction
> during
> > each step, and to confirm they are still valid on commit, but this
> approach
> > would require a WAN round-trip for each step in the interactive
> > transaction, whereas the timestamp-validating approach can use a LAN
> > round-trip for each step besides the final one, and is also much simpler
> to
> > implement.
> >
> >
> > From: Blake Eggleston <be...@apple.com.INVALID>
> > Date: Thursday, 30 September 2021 at 05:47
> > To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > You could establish a lower timestamp bound and buffer transaction state
> > on the coordinator, then make the commit an operation that only applies
> if
> > all partitions involved haven’t been changed by a more recent timestamp.
> > You could also implement mvcc either in the storage layer or for some
> > period of time by buffering commits on each replica before applying.
> >
> > > On Sep 29, 2021, at 6:18 PM, Jonathan Ellis <jb...@gmail.com> wrote:
> > >
> > > How are interactive transactions possible with Accord?
> > >
> > >
> > >
> > > On Tue, Sep 21, 2021 at 11:56 PM benedict@apache.org <
> > benedict@apache.org>
> > > wrote:
> > >
> > >> Could you explain why you believe this trade-off is necessary? We can
> > >> support full SQL just fine with Accord, and I hope that we eventually
> > do so.
> > >>
> > >> This domain is incredibly complex, so it is easy to reach wrong
> > >> conclusions. I would invite you again to propose a system for
> discussion
> > >> that you think offers something Accord is unable to, and that you
> > consider
> > >> desirable, and we can work from there.
> > >>
> > >> To pre-empt some possible discussions, I am not aware of anything we
> > >> cannot do with Accord that we could do with either Calvin or Spanner.
> > >> Interactive transactions are possible on top of Accord, as are
> > transactions
> > >> with an unknown read/write set. In each case the only cost is that
> they
> > >> would use optimistic concurrency control, which is no worse than Spanner
> > >> derivatives anyway (which I have to assume is your benchmark in this
> > >> regard). I do not expect to deliver either functionality initially,
> but
> > >> Accord takes us most of the way there for both.
> > >>
> > >>
> > >> From: Jonathan Ellis <jb...@gmail.com>
> > >> Date: Wednesday, 22 September 2021 at 05:36
> > >> To: dev <de...@cassandra.apache.org>
> > >> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > >> Right, I'm looking for exactly a discussion on the high level goals.
> > >> Instead of saying "here's the goals and we ruled out X because Y" we
> > should
> > >> start with a discussion around, "Approach A allows X and W, approach B
> > >> allows Y and Z" and decide together what the goals should be and what
> > >> we are willing to trade to get those goals, e.g., are we willing to
> > give up
> > >> global strict serializability to get the ability to support full SQL.
> > Both
> > >> of these are nice to have!
> > >>
> > >> On Tue, Sep 21, 2021 at 9:52 PM benedict@apache.org <
> > benedict@apache.org>
> > >> wrote:
> > >>
> > >>> Hi Jonathan,
> > >>>
> > >>> These other systems are incompatible with the goals of the CEP. I do
> > >>> discuss them (besides 2PC) in both the whitepaper and the CEP, and
> will
> > >>> summarise that discussion below. A true and accurate comparison of
> > these
> > >>> other systems is essentially intractable, as there are complex
> > subtleties
> > >>> to each flavour, and those who are interested would be better served
> by
> > >>> performing their own research.
> > >>>
> > >>> I think it is more productive to focus on what we want to achieve as
> a
> > >>> community. If you believe the goals of this CEP are wrong for the
> > >> project,
> > >>> let’s focus on that. If you want to compare and contrast specific
> > facets
> > >> of
> > >>> alternative systems that you consider to be preferable in some
> > dimension,
> > >>> let’s do that here or in a Q&A as proposed by Joey.
> > >>>
> > >>> The relevant goals are that we:
> > >>>
> > >>>
> > >>>  1.  Guarantee strict serializable isolation on commodity hardware
> > >>>  2.  Scale to any cluster size
> > >>>  3.  Achieve optimal latency
> > >>>
> > >>> The approach taken by Spanner derivatives is rejected by (1) because
> > they
> > >>> guarantee only Serializable isolation (they additionally fail (3)).
> > From
> > >>> watching talks by YugaByte, and inferring from Cockroach’s
> > >>> panic-cluster-death under clock skew, this is clearly considered by
> > >>> everyone to be undesirable but necessary to achieve scalability.
> > >>>
> > >>> The approach taken by FaunaDB (Calvin) is rejected by (2) because its
> > >>> sequencing layer requires a global leader process for the cluster,
> > which
> > >> is
> > >>> incompatible with Cassandra’s scalability requirements. It
> additionally
> > >>> fails (3) for global clients.
> > >>>
> > >>> Two phase commit fails (3). As an aside, AFAICT DynamoDB is today a
> > >>> Spanner clone for its multi-key transaction functionality, not 2PC.
> > >>>
> > >>> Systems such as RAMP with even weaker isolation are not considered
> for
> > >> the
> > >>> simple reason that they do not even claim to meet (1).
> > >>>
> > >>> If we want to additionally offer weaker isolation levels than
> > >>> Serializable, such as that provided by the recent RAMP-TAO paper,
> > >> Cassandra
> > >>> is likely able to support multiple distinct transaction layers that
> > >> operate
> > >>> independently. I would encourage you to file a CEP to explore how we
> > can
> > >>> meet these distinct use cases, but I consider them to be niche. I
> > expect
> > >>> that a majority of our user base desire strict serializable
> isolation,
> > >> and
> > >>> certainly no less than serializable isolation, to augment the
> existing
> > >>> weaker isolation offered by quorum reads and writes.
> > >>>
> > >>> I would tangentially note that we are not an AP database under normal
> > >>> recommended operation. A minority in any network partition cannot
> reach
> > >>> QUORUM, so under recommended usage we are a high-availability
> > leaderless
> > >> CP
> > >>> database.
> > >>>
> > >>>
> > >>> From: Jonathan Ellis <jb...@gmail.com>
> > >>> Date: Tuesday, 21 September 2021 at 23:45
> > >>> To: dev <de...@cassandra.apache.org>
> > >>> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > >>> Benedict, thanks for taking the lead in putting this together. Since
> > >>> Cassandra is the only relevant database today designed around a leaderless
> > >>> architecture, it's quite likely that we'll be better served with a custom
> > >>> transaction design instead of trying to retrofit one from CP systems.
> > >>>
> > >>> The whitepaper here is a good description of the consensus algorithm itself
> > >>> as well as its robustness and stability characteristics, and its comparison
> > >>> with other state-of-the-art consensus algorithms is very useful.  In the
> > >>> context of Cassandra, where a consensus algorithm is only part of what will
> > >>> be implemented, I'd like to see a more complete evaluation of the
> > >>> transactional side of things as well, including performance characteristics
> > >>> as well as the types of transactions that can be supported and at least a
> > >>> general idea of what it would look like applied to Cassandra. This will
> > >>> allow the PMC to make a more informed decision about what tradeoffs are
> > >>> best for the entire long-term project of first supplementing and ultimately
> > >>> replacing LWT.
> > >>>
> > >>> (Allowing users to mix LWT and AP Cassandra operations against the same
> > >>> rows was probably a mistake, so in contrast with LWT we’re not looking for
> > >>> something fast enough for occasional use but rather something within a
> > >>> reasonable factor of AP operations, appropriate to being the only way to
> > >>> interact with tables declared as such.)
> > >>>
> > >>> Besides Accord, this should cover
> > >>>
> > >>> - Calvin and FaunaDB
> > >>> - A Spanner derivative (no opinion on whether that should be Cockroach or
> > >>> Yugabyte, I don’t think it’s necessary to cover both)
> > >>> - A 2PC implementation (the Accord paper mentions DynamoDB but I suspect
> > >>> there is more public information about MongoDB)
> > >>> - RAMP
> > >>>
> > >>> Here’s an example of what I mean:
> > >>>
> > >>> =Calvin=
> > >>>
> > >>> Approach: global consensus (Paxos in Calvin, Raft in FaunaDB) to order
> > >>> transactions, then replicas execute the transactions independently with no
> > >>> further coordination.  No SPOF.  Transactions are batched by each sequencer
> > >>> to keep this from becoming a bottleneck.
> > >>>
> > >>> Performance: Calvin paper (published 2012) reports linear scaling of TPC-C
> > >>> New Order up to 500,000 transactions/s on 100 machines (EC2 XL machines
> > >>> with 7GB RAM and 8 virtual cores).  Note that TPC-C New Order is composed
> > >>> of four reads and four writes, so this is effectively 2M reads and 2M
> > >>> writes as we normally measure them in C*.
> > >>>
> > >>> Calvin supports mixed read/write transactions, but because the transaction
> > >>> execution logic requires knowing all partition keys in advance to ensure
> > >>> that all replicas can reproduce the same results with no coordination,
> > >>> reads against non-PK predicates must be done ahead of time (transparently,
> > >>> by the server) to determine the set of keys, and this must be retried if
> > >>> the set of rows affected is updated before the actual transaction executes.
> > >>>
> > >>> Batching and global consensus add latency -- 100ms in the Calvin paper and
> > >>> apparently about 50ms in FaunaDB.  Glass half full: all transactions
> > >>> (including multi-partition updates) are equally performant in Calvin since
> > >>> the coordination is handled up front in the sequencing step.  Glass half
> > >>> empty: even single-row reads and writes have to pay the full coordination
> > >>> cost.  Fauna has optimized this away for reads but I am not aware of a
> > >>> description of how they changed the design to allow this.
> > >>>
> > >>> Functionality and limitations: since the entire transaction must be known
> > >>> in advance to allow coordination-less execution at the replicas, Calvin
> > >>> cannot support interactive transactions at all.  FaunaDB mitigates this by
> > >>> allowing server-side logic to be included, but a Calvin approach will never
> > >>> be able to offer SQL compatibility.
> > >>>
> > >>> Guarantees: Calvin transactions are strictly serializable.  There is no
> > >>> additional complexity or performance hit to generalizing to multiple
> > >>> regions, apart from the speed of light.  And since Calvin is already paying
> > >>> a batching latency penalty, this is less painful than for other systems.
> > >>>
> > >>> Application to Cassandra: B-.  Distributed transactions are handled by the
> > >>> sequencing and scheduling layers, which are leaderless, and Calvin’s
> > >>> requirements for the storage layer are easily met by C*.  But Calvin also
> > >>> requires a global consensus protocol and LWT is almost certainly not
> > >>> sufficiently performant, so this would require ZK or etcd (reasonable for a
> > >>> library approach but not for replacing LWT in C* itself), or an
> > >>> implementation of Accord.  I don’t believe Calvin would require additional
> > >>> table-level metadata in Cassandra.
> > >>>
> > >>> On Sun, Sep 5, 2021 at 9:33 AM benedict@apache.org <benedict@apache.org>
> > >>> wrote:
> > >>>
> > >>>> Wiki:
> > >>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-15%3A+General+Purpose+Transactions
> > >>>> Whitepaper:
> > >>>> https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf
> > >>>> Prototype: https://github.com/belliottsmith/accord
> > >>>>
> > >>>> Hi everyone, I’d like to propose this CEP for adoption by the community.
> > >>>>
> > >>>> Cassandra has benefitted from LWTs for many years, but application
> > >>>> developers that want to ensure consistency for complex operations must
> > >>>> either accept the scalability bottleneck of serializing all related state
> > >>>> through a single partition, or layer a complex state machine on top of the
> > >>>> database. These are sophisticated and costly activities that our users
> > >>>> should not be expected to undertake. Since distributed databases are
> > >>>> beginning to offer distributed transactions with fewer caveats, it is past
> > >>>> time for Cassandra to do so as well.
> > >>>>
> > >>>> This CEP proposes the use of several novel techniques that build upon
> > >>>> research (that followed EPaxos) to deliver (non-interactive) general
> > >>>> purpose distributed transactions. The approach is outlined in the wikipage
> > >>>> and in more detail in the linked whitepaper. Importantly, by adopting this
> > >>>> approach we will be the _only_ distributed database to offer global,
> > >>>> scalable, strict serializable transactions in one wide area round-trip.
> > >>>> This would represent a significant improvement in the state of the art,
> > >>>> both in the academic literature and in commercial or open source offerings.
> > >>>>
> > >>>> This work has been partially realised in a prototype. This partial
> > >>>> prototype has been verified against Jepsen.io’s Maelstrom library and
> > >>>> dedicated in-tree strict serializability verification tools, but much work
> > >>>> remains for the work to be production capable and integrated into Cassandra.
> > >>>>
> > >>>> I propose including the prototype in the project as a new source
> > >>>> repository, to be developed as a standalone library for integration into
> > >>>> Cassandra. I hope the community sees the important value proposition of
> > >>>> this proposal, and will adopt the CEP after this discussion, so that the
> > >>>> library and its integration into Cassandra can be developed in parallel and
> > >>>> with the involvement of the wider community.
> > >>>>
> > >>>
> > >>>
> > >>> --
> > >>> Jonathan Ellis
> > >>> co-founder, http://www.datastax.com
> > >>> @spyced
> > >>>

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by "benedict@apache.org" <be...@apache.org>.
Jonathan,

This work will only determine Cassandra’s future if no other contributors choose to take a different route in future. If in future the community decides this work is incompatible with its direction, it remains in the community’s power to remove the facility, or to make it optional.

OSS is a living thing, and this CEP will shape the future of the community only by virtue of the work that I and others will do. You are equally capable of investing this time and effort.

Today, this is the only CEP of the kind on offer. If another competing proposal were to be made, we could either work to reconcile them, or to ensure they may co-exist. You cannot, however, expect to impose your _goals_ on the work that I and others will undertake. That is not how the community works.

Since we are going around in circles, I propose a simple majority vote to establish if the community endorses the stated goals of the CEP.


From: Jonathan Ellis <jb...@gmail.com>
Date: Wednesday, 6 October 2021 at 16:05
To: dev <de...@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
The problem that I keep pointing out is that you've created this CEP for
Accord without first getting consensus that the goals and the tradeoffs it
makes to achieve those goals (and that it will impose on future work around
transactions) are the right ones for Cassandra long term.

At this point I'm done repeating myself.  For the convenience of anyone
following this thread intermittently, I'll quote my first reply on this
thread to illustrate the kind of discussion I'd like to have.

-----

The whitepaper here is a good description of the consensus algorithm itself
as well as its robustness and stability characteristics, and its comparison
with other state-of-the-art consensus algorithms is very useful.  In the
context of Cassandra, where a consensus algorithm is only part of what will
be implemented, I'd like to see a more complete evaluation of the
transactional side of things as well, including performance characteristics
as well as the types of transactions that can be supported and at least a
general idea of what it would look like applied to Cassandra. This will
allow the PMC to make a more informed decision about what tradeoffs are
best for the entire long-term project of first supplementing and ultimately
replacing LWT.

(Allowing users to mix LWT and AP Cassandra operations against the same
rows was probably a mistake, so in contrast with LWT we’re not looking for
something fast enough for occasional use but rather something within a
reasonable factor of AP operations, appropriate to being the only way to
interact with tables declared as such.)

Besides Accord, this should cover

- Calvin and FaunaDB
- A Spanner derivative (no opinion on whether that should be Cockroach or
Yugabyte, I don’t think it’s necessary to cover both)
- A 2PC implementation (the Accord paper mentions DynamoDB but I suspect
there is more public information about MongoDB)
- RAMP

Here’s an example of what I mean:

=Calvin=

Approach: global consensus (Paxos in Calvin, Raft in FaunaDB) to order
transactions, then replicas execute the transactions independently with no
further coordination.  No SPOF.  Transactions are batched by each sequencer
to keep this from becoming a bottleneck.

Performance: Calvin paper (published 2012) reports linear scaling of TPC-C
New Order up to 500,000 transactions/s on 100 machines (EC2 XL machines
with 7GB RAM and 8 virtual cores).  Note that TPC-C New Order is composed
of four reads and four writes, so this is effectively 2M reads and 2M
writes as we normally measure them in C*.

Calvin supports mixed read/write transactions, but because the transaction
execution logic requires knowing all partition keys in advance to ensure
that all replicas can reproduce the same results with no coordination,
reads against non-PK predicates must be done ahead of time (transparently,
by the server) to determine the set of keys, and this must be retried if
the set of rows affected is updated before the actual transaction executes.

Batching and global consensus add latency -- 100ms in the Calvin paper and
apparently about 50ms in FaunaDB.  Glass half full: all transactions
(including multi-partition updates) are equally performant in Calvin since
the coordination is handled up front in the sequencing step.  Glass half
empty: even single-row reads and writes have to pay the full coordination
cost.  Fauna has optimized this away for reads but I am not aware of a
description of how they changed the design to allow this.

Functionality and limitations: since the entire transaction must be known
in advance to allow coordination-less execution at the replicas, Calvin
cannot support interactive transactions at all.  FaunaDB mitigates this by
allowing server-side logic to be included, but a Calvin approach will never
be able to offer SQL compatibility.

Guarantees: Calvin transactions are strictly serializable.  There is no
additional complexity or performance hit to generalizing to multiple
regions, apart from the speed of light.  And since Calvin is already paying
a batching latency penalty, this is less painful than for other systems.

Application to Cassandra: B-.  Distributed transactions are handled by the
sequencing and scheduling layers, which are leaderless, and Calvin’s
requirements for the storage layer are easily met by C*.  But Calvin also
requires a global consensus protocol and LWT is almost certainly not
sufficiently performant, so this would require ZK or etcd (reasonable for a
library approach but not for replacing LWT in C* itself), or an
implementation of Accord.  I don’t believe Calvin would require additional
table-level metadata in Cassandra.

On Wed, Oct 6, 2021 at 9:53 AM benedict@apache.org <be...@apache.org>
wrote:

> The problem with dropping a patch on Jira is that there is no opportunity
> to point out problems, either with the fundamental approach or with the
> specific implementation. So please point out some problems I can engage
> with!
>
>
> From: Jonathan Ellis <jb...@gmail.com>
> Date: Wednesday, 6 October 2021 at 15:48
> To: dev <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> On Wed, Oct 6, 2021 at 9:21 AM benedict@apache.org <be...@apache.org>
> wrote:
>
> > The goals of the CEP are stated clearly, and these were the goals we had
> > going into the (multi-month) research project we undertook before proposing
> > this CEP. These goals are necessarily value judgements, so we cannot expect
> > that everyone will agree that they are optimal.
> >
>
> Right, so I'm saying that this is exactly the most important thing to get
> consensus on, and creating a CEP for a protocol to achieve goals that you
> have not discussed with the community is the CEP equivalent of dropping a
> patch on Jira without discussing its goals either.
>
> That's why our conversations haven't gone anywhere, because I keep saying
> "we need to discuss the goals and tradeoffs", and I'll give an example of what
> I mean, and you keep addressing the examples (sometimes very shallowly, "it
> would be possible to X" or "Y could be done as an optimization") while
> ignoring the request to open a discussion around the big picture.
>


--
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by Jonathan Ellis <jb...@gmail.com>.
The problem that I keep pointing out is that you've created this CEP for
Accord without first getting consensus that the goals and the tradeoffs it
makes to achieve those goals (and that it will impose on future work around
transactions) are the right ones for Cassandra long term.

At this point I'm done repeating myself.  For the convenience of anyone
following this thread intermittently, I'll quote my first reply on this
thread to illustrate the kind of discussion I'd like to have.

-----

The whitepaper here is a good description of the consensus algorithm itself
as well as its robustness and stability characteristics, and its comparison
with other state-of-the-art consensus algorithms is very useful.  In the
context of Cassandra, where a consensus algorithm is only part of what will
be implemented, I'd like to see a more complete evaluation of the
transactional side of things as well, including performance characteristics
as well as the types of transactions that can be supported and at least a
general idea of what it would look like applied to Cassandra. This will
allow the PMC to make a more informed decision about what tradeoffs are
best for the entire long-term project of first supplementing and ultimately
replacing LWT.

(Allowing users to mix LWT and AP Cassandra operations against the same
rows was probably a mistake, so in contrast with LWT we’re not looking for
something fast enough for occasional use but rather something within a
reasonable factor of AP operations, appropriate to being the only way to
interact with tables declared as such.)

Besides Accord, this should cover

- Calvin and FaunaDB
- A Spanner derivative (no opinion on whether that should be Cockroach or
Yugabyte, I don’t think it’s necessary to cover both)
- A 2PC implementation (the Accord paper mentions DynamoDB but I suspect
there is more public information about MongoDB)
- RAMP

Here’s an example of what I mean:

=Calvin=

Approach: global consensus (Paxos in Calvin, Raft in FaunaDB) to order
transactions, then replicas execute the transactions independently with no
further coordination.  No SPOF.  Transactions are batched by each sequencer
to keep this from becoming a bottleneck.
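
As a rough illustration of the approach just described, the Java sketch below models a sequencing step that fixes one global order over batches of transactions, followed by replicas that each apply that ordered log on their own. It is not code from Calvin or FaunaDB: the names CalvinSketch, Txn, sequence and execute are hypothetical, and the consensus round (Paxos or Raft in the real systems) is assumed to have already agreed on the batch order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.UnaryOperator;

final class CalvinSketch {
    /** Hypothetical transaction whose key is known before execution starts. */
    static final class Txn {
        final String key;
        final UnaryOperator<Integer> update;

        Txn(String key, UnaryOperator<Integer> update) {
            this.key = key;
            this.update = update;
        }
    }

    /** Stand-in for the sequencing layer: flattens batches in the order assumed to be agreed by consensus. */
    static List<Txn> sequence(List<List<Txn>> batchesInAgreedOrder) {
        List<Txn> log = new ArrayList<>();
        for (List<Txn> batch : batchesInAgreedOrder) {
            log.addAll(batch);
        }
        return log;
    }

    /** A replica applies the log in order, starting from empty state, with no further coordination. */
    static Map<String, Integer> execute(List<Txn> log) {
        Map<String, Integer> state = new TreeMap<>();
        for (Txn txn : log) {
            state.put(txn.key, txn.update.apply(state.getOrDefault(txn.key, 0)));
        }
        return state;
    }

    public static void main(String[] args) {
        List<Txn> batchA = List.of(new Txn("x", v -> v + 1), new Txn("y", v -> v + 10));
        List<Txn> batchB = List.of(new Txn("x", v -> v * 2));
        List<Txn> log = sequence(List.of(batchA, batchB));

        // Two replicas executing the same log independently end up with identical state.
        Map<String, Integer> replica1 = execute(log);
        Map<String, Integer> replica2 = execute(log);
        System.out.println(replica1 + " identical=" + replica1.equals(replica2));
    }
}
```

The point of the sketch is only that determinism plus a shared order is enough for replicas to converge; batching shows up as the unit handed to the sequencing step.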

Performance: Calvin paper (published 2012) reports linear scaling of TPC-C
New Order up to 500,000 transactions/s on 100 machines (EC2 XL machines
with 7GB RAM and 8 virtual cores).  Note that TPC-C New Order is composed
of four reads and four writes, so this is effectively 2M reads and 2M
writes as we normally measure them in C*.
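
The conversion above is simple multiplication; a small Java sketch spells it out using only the figures quoted in this message. The class name TpccThroughput is hypothetical, and the per-machine figure is a derived average that assumes load is spread evenly across the 100 machines.

```java
final class TpccThroughput {
    /** Individual operations per second implied by a transaction rate. */
    static long opsPerSecond(long txnPerSecond, int opsPerTxn) {
        return txnPerSecond * opsPerTxn;
    }

    public static void main(String[] args) {
        long txnPerSecond = 500_000; // New Order rate quoted for the 100-machine cluster
        int machines = 100;
        int readsPerTxn = 4;         // breakdown as stated in the message
        int writesPerTxn = 4;

        System.out.println("reads/s:  " + opsPerSecond(txnPerSecond, readsPerTxn));
        System.out.println("writes/s: " + opsPerSecond(txnPerSecond, writesPerTxn));
        // Assumes an even spread of load across machines.
        System.out.println("txn/s per machine: " + txnPerSecond / machines);
    }
}
```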

Calvin supports mixed read/write transactions, but because the transaction
execution logic requires knowing all partition keys in advance to ensure
that all replicas can reproduce the same results with no coordination,
reads against non-PK predicates must be done ahead of time (transparently,
by the server) to determine the set of keys, and this must be retried if
the set of rows affected is updated before the actual transaction executes.
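
A minimal Java sketch of that read-ahead-and-retry pattern follows, assuming a single in-memory table. All names (ReconnaissanceSketch, matchingKeys, executeIfStillValid, updateWhere) are hypothetical, and the Runnable hook merely simulates a concurrent writer; it is not how Calvin itself is implemented.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.function.Predicate;

final class ReconnaissanceSketch {
    /** Read ahead of time: resolve a non-PK predicate to a concrete key set. */
    static Set<String> matchingKeys(Map<String, Integer> table, Predicate<Integer> predicate) {
        Set<String> keys = new TreeSet<>();
        for (Map.Entry<String, Integer> row : table.entrySet()) {
            if (predicate.test(row.getValue())) {
                keys.add(row.getKey());
            }
        }
        return keys;
    }

    /** Apply the write only if the declared key set still matches the predicate. */
    static boolean executeIfStillValid(Map<String, Integer> table, Set<String> declaredKeys,
                                       Predicate<Integer> predicate, int newValue) {
        if (!matchingKeys(table, predicate).equals(declaredKeys)) {
            return false; // affected rows changed since the read: caller must retry
        }
        for (String key : declaredKeys) {
            table.put(key, newValue);
        }
        return true;
    }

    /** Retry loop; returns the number of attempts. The hook runs once, after the first read. */
    static int updateWhere(Map<String, Integer> table, Predicate<Integer> predicate,
                           int newValue, Runnable betweenReadAndExecute) {
        int attempts = 0;
        while (true) {
            attempts++;
            Set<String> declaredKeys = matchingKeys(table, predicate);
            if (attempts == 1) {
                betweenReadAndExecute.run();
            }
            if (executeIfStillValid(table, declaredKeys, predicate, newValue)) {
                return attempts;
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> stock = new HashMap<>(Map.of("a", 0, "b", 5, "c", 0));
        // A concurrent write makes row "b" match "quantity == 0" after the first read.
        int attempts = updateWhere(stock, q -> q == 0, 100, () -> stock.put("b", 0));
        System.out.println(attempts + " attempts, stock = " + new TreeMap<>(stock));
    }
}
```

In this run the first attempt is rejected because the key set changed between the read and the execution, and the second attempt succeeds with the refreshed key set.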

Batching and global consensus add latency -- 100ms in the Calvin paper and
apparently about 50ms in FaunaDB.  Glass half full: all transactions
(including multi-partition updates) are equally performant in Calvin since
the coordination is handled up front in the sequencing step.  Glass half
empty: even single-row reads and writes have to pay the full coordination
cost.  Fauna has optimized this away for reads but I am not aware of a
description of how they changed the design to allow this.

Functionality and limitations: since the entire transaction must be known
in advance to allow coordination-less execution at the replicas, Calvin
cannot support interactive transactions at all.  FaunaDB mitigates this by
allowing server-side logic to be included, but a Calvin approach will never
be able to offer SQL compatibility.

Guarantees: Calvin transactions are strictly serializable.  There is no
additional complexity or performance hit to generalizing to multiple
regions, apart from the speed of light.  And since Calvin is already paying
a batching latency penalty, this is less painful than for other systems.

Application to Cassandra: B-.  Distributed transactions are handled by the
sequencing and scheduling layers, which are leaderless, and Calvin’s
requirements for the storage layer are easily met by C*.  But Calvin also
requires a global consensus protocol and LWT is almost certainly not
sufficiently performant, so this would require ZK or etcd (reasonable for a
library approach but not for replacing LWT in C* itself), or an
implementation of Accord.  I don’t believe Calvin would require additional
table-level metadata in Cassandra.

On Wed, Oct 6, 2021 at 9:53 AM benedict@apache.org <be...@apache.org>
wrote:

> The problem with dropping a patch on Jira is that there is no opportunity
> to point out problems, either with the fundamental approach or with the
> specific implementation. So please point out some problems I can engage
> with!
>
>
> From: Jonathan Ellis <jb...@gmail.com>
> Date: Wednesday, 6 October 2021 at 15:48
> To: dev <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> On Wed, Oct 6, 2021 at 9:21 AM benedict@apache.org <be...@apache.org>
> wrote:
>
> > The goals of the CEP are stated clearly, and these were the goals we had
> > going into the (multi-month) research project we undertook before proposing
> > this CEP. These goals are necessarily value judgements, so we cannot expect
> > that everyone will agree that they are optimal.
> >
>
> Right, so I'm saying that this is exactly the most important thing to get
> consensus on, and creating a CEP for a protocol to achieve goals that you
> have not discussed with the community is the CEP equivalent of dropping a
> patch on Jira without discussing its goals either.
>
> That's why our conversations haven't gone anywhere, because I keep saying
> "we need to discuss the goals and tradeoffs", and I'll give an example of what
> I mean, and you keep addressing the examples (sometimes very shallowly, "it
> would be possible to X" or "Y could be done as an optimization") while
> ignoring the request to open a discussion around the big picture.
>


-- 
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by "benedict@apache.org" <be...@apache.org>.
The problem with dropping a patch on Jira is that there is no opportunity to point out problems, either with the fundamental approach or with the specific implementation. So please point out some problems I can engage with!


From: Jonathan Ellis <jb...@gmail.com>
Date: Wednesday, 6 October 2021 at 15:48
To: dev <de...@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
On Wed, Oct 6, 2021 at 9:21 AM benedict@apache.org <be...@apache.org>
wrote:

> The goals of the CEP are stated clearly, and these were the goals we had
> going into the (multi-month) research project we undertook before proposing
> this CEP. These goals are necessarily value judgements, so we cannot expect
> that everyone will agree that they are optimal.
>

Right, so I'm saying that this is exactly the most important thing to get
consensus on, and creating a CEP for a protocol to achieve goals that you
have not discussed with the community is the CEP equivalent of dropping a
patch on Jira without discussing its goals either.

That's why our conversations haven't gone anywhere, because I keep saying
"we need to discuss the goals and tradeoffs", and I'll give an example of what
I mean, and you keep addressing the examples (sometimes very shallowly, "it
would be possible to X" or "Y could be done as an optimization") while
ignoring the request to open a discussion around the big picture.

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by Jonathan Ellis <jb...@gmail.com>.
On Wed, Oct 6, 2021 at 9:21 AM benedict@apache.org <be...@apache.org>
wrote:

> The goals of the CEP are stated clearly, and these were the goals we had
> going into the (multi-month) research project we undertook before proposing
> this CEP. These goals are necessarily value judgements, so we cannot expect
> that everyone will agree that they are optimal.
>

Right, so I'm saying that this is exactly the most important thing to get
consensus on, and creating a CEP for a protocol to achieve goals that you
have not discussed with the community is the CEP equivalent of dropping a
patch on Jira without discussing its goals either.

That's why our conversations haven't gone anywhere, because I keep saying
"we need to discuss the goals and tradeoffs", and I'll give an example of what
I mean, and you keep addressing the examples (sometimes very shallowly, "it
would be possible to X" or "Y could be done as an optimization") while
ignoring the request to open a discussion around the big picture.

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by "benedict@apache.org" <be...@apache.org>.
The goals of the CEP are stated clearly, and these were the goals we had going into the (multi-month) research project we undertook before proposing this CEP. These goals are necessarily value judgements, so we cannot expect that everyone will agree that they are optimal.

So far you have not engaged with these goals to state any specific disagreement. I have engaged with all of the trade-offs you imagined, and every specific concern you have raised. Despite a month having elapsed and a great deal of time spent answering your emails, this is the first confirmation I have that you are dissatisfied with my responses to you.

The role of the CEP is to advertise a project, allowing people to register their interest in collaborating, and for technical concerns to be stated in advance. So far you have expressed no specific technical concerns that I have not engaged with, and yet I have received no response to my engagements.

The role of the CEP is *not* to permit members of the community to dictate their preferences on the proposers, or to declare that the CEP is inadequate because it doesn’t meet their goals, or to demand additional work to explore others’ preferred research avenues on the topic.

You have to do some of the work here, Jonathan.

If you have an alternative approach, I continue to ask you to propose it so we may compare and contrast in a specific and technical manner.  If you have any specific technical concerns I exhort you to raise them, so we may discuss them. If you dispute the goals, please make an argument as to why. If our goals are irreconcilable, file another CEP.



From: Jonathan Ellis <jb...@gmail.com>
Date: Wednesday, 6 October 2021 at 14:41
To: dev <de...@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
I've repeatedly explained why I'm unhappy: instead of starting with a
discussion of what API and tradeoffs we should make to get that, this CEP
starts with a protocol and asks us to figure out what API we can build with
it.

Of course by API I mean, what kinds of CQL and SQL operations we can
perform, with what kinds of ACID semantics and what kinds of performance,
not "Result perform(Transaction transaction)".  And it's not simply SQL
syntax, either.  I realize that this could sound a little vague, but that's
why I gave an example of the kind of analysis I'm talking about in my first
reply.  Your responses have been to attempt to avoid the discussion
entirely ("the relevant goals are [mine]") or to declare it to be out of
scope.

The CEP process is intended to help get to alignment across the community
of PMC members, committers, and contributors on goals and outcomes before
starting to write code, not simply to bless a completed design.  That's
why we're going in circles here.

On Wed, Oct 6, 2021 at 2:12 AM benedict@apache.org <be...@apache.org>
wrote:

> We have discussed the API at length in this thread. The API primarily
> involves the semantics of the transactions, as besides this the API of a
> transaction is simply:
>
> Result perform(Transaction transaction)
>
> As discussed in follow-up to that email, a prototype API is specified
> alongside the prototype protocol. I am unsure what more you want than this,
> or the above, or the prior semantic discussions.
>
> It seems clear that you’re unhappy with the proposal, but it remains
> ambiguous as to why. Your emails are terse, infrequent and unclear. My
> responses receive no follow up from you, even to clarify if I have answered
> your query. Sometime later I seem to be able to expect a new unrelated
> problem that you are unhappy about. You have not yet responded to even one
> of my repeated offers to hop on a call to hash out any of your concerns,
> even if only to decline.
>
> This does not feel like constructive and respectful engagement to me, and
> I am losing interest.
>
>
>
> From: Jonathan Ellis <jb...@gmail.com>
> Date: Wednesday, 6 October 2021 at 00:02
> To: dev <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> I honestly can't understand the perspective that on the one hand, you're
> asking for approval of a specific protocol as part of the CEP, but on the
> other, you think discussion of the APIs this will enable is not warranted.
> Surely we need agreement on what APIs we're trying to build, before we
> discuss the protocols and architectures with which to build them.
>
> On Fri, Oct 1, 2021 at 9:34 AM benedict@apache.org <be...@apache.org>
> wrote:
>
> > > The current document details thoroughly the protocol but in my view
> > > fails to illustrate what specific API, methods, modules will become
> > > available to developers
> >
> > With respect to this, in my view this kind of detail is not warranted
> > within a CEP. Software development is an exploratory process with respect
> > to structure, and these decisions will be made as the CEP progresses. If
> > these need to be specified upfront, then the purpose of a CEP – seeking
> > buy-in – is invalidated, because the work must be complete before you know
> > the answers.
> >

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by Jonathan Ellis <jb...@gmail.com>.
I've repeatedly explained why I'm unhappy: instead of starting with a
discussion of what API and tradeoffs we should make to get that, this CEP
starts with a protocol and asks us to figure out what API we can build with
it.

Of course by API I mean, what kinds of CQL and SQL operations we can
perform, with what kinds of ACID semantics and what kinds of performance,
not "Result perform(Transaction transaction)".  And it's not simply SQL
syntax, either.  I realize that this could sound a little vague, but that's
why I gave an example of the kind of analysis I'm talking about in my first
reply.  Your responses have been to attempt to avoid the discussion
entirely ("the relevant goals are [mine]") or to declare it to be out of
scope.

The CEP process is intended to help get to alignment across the community
of PMC members, committers, and contributors on goals and outcomes before
starting to write code, not simply to bless a completed design.  That's
why we're going in circles here.

On Wed, Oct 6, 2021 at 2:12 AM benedict@apache.org <be...@apache.org>
wrote:

> We have discussed the API at length in this thread. The API primarily
> involves the semantics of the transactions, as besides this the API of a
> transaction is simply:
>
> Result perform(Transaction transaction)
>
> As discussed in follow-up to that email, a prototype API is specified
> alongside the prototype protocol. I am unsure what more you want than this,
> or the above, or the prior semantic discussions.
>
> It seems clear that you’re unhappy with the proposal, but it remains
> ambiguous as to why. Your emails are terse, infrequent and unclear. My
> responses receive no follow up from you, even to clarify if I have answered
> your query. Sometime later I seem to be able to expect a new unrelated
> problem that you are unhappy about. You have not yet responded to even one
> of my repeated offers to hop on a call to hash out any of your concerns,
> even if only to decline.
>
> This does not feel like constructive and respectful engagement to me, and
> I am losing interest.
>
>
>
> From: Jonathan Ellis <jb...@gmail.com>
> Date: Wednesday, 6 October 2021 at 00:02
> To: dev <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> I honestly can't understand the perspective that on the one hand, you're
> asking for approval of a specific protocol as part of the CEP, but on the
> other, you think discussion of the APIs this will enable is not warranted.
> Surely we need agreement on what APIs we're trying to build, before we
> discuss the protocols and architectures with which to build them.
>
> On Fri, Oct 1, 2021 at 9:34 AM benedict@apache.org <be...@apache.org>
> wrote:
>
> > > The current document details thoroughly the protocol but in my view
> > > fails to illustrate what specific API, methods, modules will become
> > > available to developers
> >
> > With respect to this, in my view this kind of detail is not warranted
> > within a CEP. Software development is an exploratory process with respect
> > to structure, and these decisions will be made as the CEP progresses. If
> > these need to be specified upfront, then the purpose of a CEP – seeking
> > buy-in – is invalidated, because the work must be complete before you know
> > the answers.
> >

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by "benedict@apache.org" <be...@apache.org>.
We have discussed the API at length in this thread. The API primarily involves the semantics of the transactions, as besides this the API of a transaction is simply:

Result perform(Transaction transaction)

As discussed in follow-up to that email, a prototype API is specified alongside the prototype protocol. I am unsure what more you want than this, or the above, or the prior semantic discussions.
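
For illustration only, here is a minimal single-process sketch of what a one-shot call of this shape could look like for a conditional multi-key batch. Only the perform signature comes from this thread; the fields of Transaction and Result, the expect/write builder methods, and InMemoryStore are hypothetical names invented for the example and are not taken from the prototype.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical types for illustration only; not taken from the Accord prototype.
final class Transaction {
    final Map<String, String> expected = new LinkedHashMap<>(); // conditions: key -> required value (null = absent)
    final Map<String, String> writes = new LinkedHashMap<>();   // updates applied only if every condition holds

    Transaction expect(String key, String value) { expected.put(key, value); return this; }
    Transaction write(String key, String value) { writes.put(key, value); return this; }
}

final class Result {
    final boolean applied;
    Result(boolean applied) { this.applied = applied; }
}

// Single-process stand-in for the cluster: the whole transaction arrives in one call,
// so there is no client round trip between its reads and its writes.
final class InMemoryStore {
    private final Map<String, String> data = new LinkedHashMap<>();

    synchronized Result perform(Transaction transaction) {
        for (Map.Entry<String, String> e : transaction.expected.entrySet()) {
            if (!Objects.equals(data.get(e.getKey()), e.getValue())) {
                return new Result(false); // a condition failed: nothing is written
            }
        }
        data.putAll(transaction.writes); // all conditions held: apply every write together
        return new Result(true);
    }

    synchronized String get(String key) { return data.get(key); }
}
```

A caller would build the whole transaction up front, e.g. new Transaction().expect("a", null).write("a", "1").write("b", "2"), and submit it in a single perform call.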

It seems clear that you’re unhappy with the proposal, but it remains ambiguous as to why. Your emails are terse, infrequent and unclear. My responses receive no follow up from you, even to clarify if I have answered your query. Sometime later I seem to be able to expect a new unrelated problem that you are unhappy about. You have not yet responded to even one of my repeated offers to hop on a call to hash out any of your concerns, even if only to decline.

This does not feel like constructive and respectful engagement to me, and I am losing interest.



From: Jonathan Ellis <jb...@gmail.com>
Date: Wednesday, 6 October 2021 at 00:02
To: dev <de...@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
I honestly can't understand the perspective that on the one hand, you're
asking for approval of a specific protocol as part of the CEP, but on the
other, you think discussion of the APIs this will enable is not warranted.
Surely we need agreement on what APIs we're trying to build, before we
discuss the protocols and architectures with which to build them.

On Fri, Oct 1, 2021 at 9:34 AM benedict@apache.org <be...@apache.org>
wrote:

> > The current document details thoroughly the protocol but in my view
> lacks to illustrate what specific API, methods, modules will become
> available to developers
>
> With respect to this, in my view this kind of detail is not warranted
> within a CEP. Software development is an exploratory process with respect
> to structure, and these decisions will be made as the CEP progresses. If
> these need to be specified upfront, then the purpose of a CEP – seeking buy
> in – is invalidated, because the work must be complete before you know the
> answers.
>
>
> From: benedict@apache.org <be...@apache.org>
> Date: Friday, 1 October 2021 at 15:31
> To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> From the CEP:
>
> Batches (including unconditional batches) on transactional tables will
> receive ACID properties, and grammatically correct conditional batch
> operations that would be rejected for operating over multiple CQL
> partitions will now be supported
>
>
> From: Paulo Motta <pa...@gmail.com>
> Date: Friday, 1 October 2021 at 15:30
> To: Cassandra DEV <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> Can you just answer what palpable feature will be available once this CEP
> lands because this is still not clear to me (and perhaps to others) from
> the current CEP structure. The current document details thoroughly the
> protocol but in my view lacks to illustrate what specific API, methods,
> modules will become available to developers, how it fits into the larger
> picture and interacts with existing modules if at all and perhaps a few
> examples of how it can be used to build features on top.
>
> On Fri, Oct 1, 2021 at 11:10 AM benedict@apache.org <benedict@apache.org>
> wrote:
>
> > I’m not, though it might seem that way. I disagree with your views about
> > how CEPs should be structured. Since the CEP process was itself codified via
> > the CEP process, if you want to recodify how CEPs work, the correct way is
> > via the CEP process itself.
> >
> > The discussion is being drawn in multiple directions away from the CEP
> > itself, and I am trying to keep this particular thread focused on the
> > business at hand, not meta discussions around CEP structure that will no
> > doubt be unproductive given likely irreconcilable views about the topic,
> > nor discussions about other CEP that could have been.
> >
> > If you want to start a separate exploratory discussion thread about CEP
> > structure without filing a CEP feel free to do so.
> >
> >
> > From: Paulo Motta <pa...@gmail.com>
> > Date: Friday, 1 October 2021 at 15:04
> > To: Cassandra DEV <de...@cassandra.apache.org>
> > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > If you want to impose your views on CEP structure on others, please file
> > > a CEP with the additional restrictions and guidance you want to impose and
> > > start a discussion thread. I can then respond in detail to why I perceive
> > > this approach to be flawed, in a dedicated context.
> >
> > This sounds very kafkaesque. You know I won't file a meta-CEP to change the
> > structure of CEP so you're just using this as an excuse to shut down the
> > discussion on the lack of clarity on what actual palpable feature will be
> > available once the CEP lands. :-)
> >
> > I'm just providing my humble feedback on how a CEP could be more digestible
> > and easier to consume from an external point of view, and this seems like
> > an appropriate and contextualized place to voice this opinion which is
> > perhaps shared by others.
> >
> > On Fri, Oct 1, 2021 at 10:55 AM benedict@apache.org <benedict@apache.org>
> > wrote:
> >
> > > I disagree with you. However, this is the wrong forum to have a meta
> > > discussion about how CEP should be structured.
> > >
> > > If you want to impose your views on CEP structure on others, please file a
> > > CEP with the additional restrictions and guidance you want to impose and
> > > start a discussion thread. I can then respond in detail to why I perceive
> > > this approach to be flawed, in a dedicated context.
> > >
> > >
> > > From: Paulo Motta <pa...@gmail.com>
> > > Date: Friday, 1 October 2021 at 14:48
> > > To: Cassandra DEV <de...@cassandra.apache.org>
> > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > >  The proposal as it stands today is exceptionally thorough, more so than
> > > > any other CEP to date, or any CEP is likely to be in the near future.
> > >
> > > The protocol is thoroughly described, but in my view CEP is a forum to
> > > discuss the high level architecture and plan for adding a full end-to-end
> > > enhancement to the database, breaking it into sub-CEPs if needed, as long
> > > as the full plan is known in advance, otherwise the community will not have
> > > the context to judge the full extent and impact of the proposed
> > > enhancement.
> > >
> > > > Since it remains unclear to me what either yourself or Jonathan want to
> > > > see as an alternative
> > >
> > > I would personally like to see something along these lines:
> > >
> > > CEP1: Add ACID-compliant atomic batches
> > > - UX changes needed: none, CQL provides the grammar we need.
> > > - Distributed transaction protocol needed: Accord (link to white paper if
> > > you want specific details about the protocol)
> > > - High-level architecture: what new components will be added, how existing
> > > components will be modified, what new messages will be added, what new
> > > configuration knobs will be introduced, what are the milestones of the
> > > project, etc.
> > >
> > > CEP2: Make LWT faster and more reliable
> > > - UX changes needed: none
> > > - Distributed transaction protocol needed: Accord, already added by
> > > previous CEP.
> > > - High-level architecture: blablabla... and so on.
> > >
> > > On Fri, Oct 1, 2021 at 10:19 AM benedict@apache.org <benedict@apache.org>
> > > wrote:
> > >
> > > > I think this is getting circular and unproductive. Basic disagreements
> > > > about whether the CEP specifies a feature I am inclined to leave for a
> > > > vote. In my view the CEP specifies several features, both immediate ones
> > > > for the user (ACID batches and multi-key LWTs) and developer-focused ones
> > > > around ground-breaking semantics that will be enabled.
> > > >
> > > > The proposal as it stands today is exceptionally thorough, more so than
> > > > any other CEP to date, or any CEP is likely to be in the near future.
> > > >
> > > > This is a Cassandra Enhancement *Proposal*, and at some point we have to
> > > > engage with what is proposed, not what you might like to be proposed. Since
> > > > it remains unclear to me what either yourself or Jonathan want to see as an
> > > > alternative, at this point it would seem more productive to produce your
> > > > own proposals for the community to consider. It is possible for multiple
> > > > transaction systems to co-exist, if you feel this is necessary.
> > > >
> > > >
> > > >
> > > > From: Paulo Motta <pa...@gmail.com>
> > > > Date: Friday, 1 October 2021 at 13:58
> > > > To: Cassandra DEV <de...@cassandra.apache.org>
> > > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > I share similar feelings as jbellis that this proposal seems to be focusing
> > > > on the protocol itself but lacking the actual feature that will use the
> > > > protocol, which IMO is a key element to discuss on a CEP.
> > > >
> > > > It's similar to saying: hey I want to add this Tries Serialization
> > > Protocol
> > > > to Cassandra, but not providing specific details of how this protocol
> > is
> > > > going to be used.
> > > >
> > > > I think the right route for a CEP is to describe the feature that
> will
> > be
> > > > added to the database and the protocol is a mere requirement of the
> > > > high-level feature, for example:
> > > >
> > > > CEP: Add Trie-backed memtable
> > > > - Trie Serialization Protocol: implementation detail of the above CEP
> > > >
> > > > What is the difficulty of taking this approach, picking one of the
> > myriad
> > > > of features that will be enabled by Accord and using that as the
> > initial
> > > > CEP to introduce the protocol to the database?
> > > >
> > > > On Fri, Oct 1, 2021 at 8:37 AM benedict@apache.org <benedict@apache.org>
> > > > wrote:
> > > >
> > > > > Actually, thinking about it again, the simple optimistic protocol would in
> > > > > fact guarantee system forward progress (i.e. independent of transaction
> > > > > formulation).
> > > > >
> > > > >
> > > > > From: benedict@apache.org <be...@apache.org>
> > > > > Date: Friday, 1 October 2021 at 09:14
> > > > > To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> > > > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > Hi Jonathan,
> > > > >
> > > > > It would be great if we could achieve a bandwidth higher than 1-2
> > short
> > > > > emails per week. It remains unclear to me what your goal is, and it
> > > would
> > > > > help if you could make a statement like “I want Cassandra to be
> able
> > to
> > > > do
> > > > > X” so that we can respond directly to it. I am also available to
> have
> > > > > another call, in which we can have a back and forth, please feel
> free
> > > to
> > > > > propose a London-compatible time within the next week that is
> > suitable
> > > > for
> > > > > you.
> > > > >
> > > > > In my opinion we are at risk of veering off-topic, though. This CEP
> > is
> > > > not
> > > > > to deliver interactive transactions, and to my knowledge nobody is
> > > > > proposing a CEP for interactive transactions. So, for the CEP at
> hand
> > > the
> > > > > salient question seems: does this CEP prevent us from implementing
> > > > > interactive transactions with properties X, Y, Z in future? To
> which
> > > the
> > > > > answer is almost certainly no.
> > > > >
> > > > > However, to continue the discussion and respond directly to your
> > > queries,
> > > > > I believe we agree on the definition of an interactive transaction.
> > > > >
> > > > > Two protocols were loosely outlined. The first, using timestamps
> for
> > > > > optimistic concurrency control, would indeed involve the
> possibility
> > of
> > > > > aborts. It would not however inherently adopt the issue of LWTs
> where
> > > no
> > > > > transaction is able to make progress. Whether or not progress is
> > > > guaranteed
> > > > > (in a livelock-free sense) would depend on the structure of the
> > > > > transactions that were interfering.
> > > > >
> > > > > This approach has the advantage of being very simple to implement,
> so
> > > > that
> > > > > we could realistically support interactive transactions quite
> > quickly.
> > > It
> > > > > has the additional advantage that transactions would execute very
> > > quickly
> > > > > by avoiding the WAN during construction, and as a result may in
> > > practice
> > > > > experience fewer aborts than protocols that guarantee
> > livelock-freedom.
> > > > >
> > > > > The second protocol proposed using read/write intents and would be
> > able
> > > > to
> > > > > support almost any behaviour you want. We could even utilise
> > > pessimistic
> > > > > concurrency control, or anything in-between. This is its own huge
> > > design
> > > > > space, and discussion of this approach and the trade-offs that
> could
> > be
> > > > > made is (in my opinion) entirely out of scope for this CEP.
> > > > >
> > > > >
> > > > > From: Jonathan Ellis <jb...@gmail.com>
> > > > > Date: Friday, 1 October 2021 at 05:00
> > > > > To: dev <de...@cassandra.apache.org>
> > > > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > The obstacle for me is you've provided a protocol but not a fully
> > > fleshed
> > > > > out architecture, so it's hard to fill in some of the blanks.  But
> it
> > > > looks
> > > > > to me like optimistic concurrency control for interactive
> > transactions
> > > > > applied to Accord would leave you in a LWT-like situation under
> > fairly
> > > > > light contention where nobody actually makes progress due to
> retries.
> > > > >
> > > > > To make sure we're talking about the same thing, as Henrik pointed out,
> > > > > interactive transactions mean multiple round trips from the client within a
> > > > > transaction.  For example, here
> > > > > <https://github.com/apavlo/py-tpcc/blob/master/pytpcc/drivers/sqlitedriver.py#L213>
> > > > > is a simple implementation of the TPC-C New Order transaction.  The high
> > > > > level logic (via
> > > > > <https://courses.cs.washington.edu/courses/csep545/01wi/lectures/class1/tsld039.htm>)
> > > > > is,
> > > > >
> > > > >    1. Get records describing a warehouse, customer, & district
> > > > >    2. Update the district
> > > > >    3. Increment next available order number
> > > > >    4. Insert record into Order and New-Order tables
> > > > >    5. For 5-15 items, get Item record, get/update Stock record
> > > > >    6. Insert Order-Line Record
> > > > >
> > > > > As you can see, this requires a lot of client-side logic mixed in
> > with
> > > > the
> > > > > actual SQL commands.
> > > > >
> > > > >
> > > > > On Thu, Sep 30, 2021 at 2:30 AM benedict@apache.org <
> > > benedict@apache.org
> > > > >
> > > > > wrote:
> > > > >
> > > > > > Essentially this, although I think in practice we will need to track each
> > > > > > partition’s timestamp separately (or optionally for reduced conflicts, each
> > > > > > row or datum’s), and make them all part of the conditional application of
> > > > > > the transaction - at least for strict-serializability.
> > > > > >
> > > > > > The alternative is to insert read/write intents for the transaction during
> > > > > > each step, and to confirm they are still valid on commit, but this approach
> > > > > > would require a WAN round-trip for each step in the interactive
> > > > > > transaction, whereas the timestamp-validating approach can use a LAN
> > > > > > round-trip for each step besides the final one, and is also much simpler to
> > > > > > implement.
> > > > > >
> > > > > >
> > > > > > From: Blake Eggleston <be...@apple.com.INVALID>
> > > > > > Date: Thursday, 30 September 2021 at 05:47
> > > > > > To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> > > > > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > > You could establish a lower timestamp bound and buffer transaction state
> > > > > > on the coordinator, then make the commit an operation that only applies if
> > > > > > all partitions involved haven’t been changed by a more recent timestamp.
> > > > > > You could also implement MVCC either in the storage layer or for some
> > > > > > period of time by buffering commits on each replica before applying.
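
A rough sketch of the timestamp-validated commit described in the quoted paragraph above, reduced to a single process. It assumes a logical clock and one last-modified timestamp per partition; VersionedStore, InteractiveTxn and all method names are hypothetical and do not correspond to anything in Cassandra or the Accord prototype. Reads and writes are buffered on the "coordinator" (the InteractiveTxn object), and the commit applies only if no touched partition has a timestamp newer than the lower bound taken when the transaction began.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical single-process model; names are illustrative, not from Cassandra or Accord.
final class VersionedStore {
    private final Map<String, String> values = new HashMap<>();
    private final Map<String, Long> lastModified = new HashMap<>(); // per-partition write timestamp
    private long clock = 0;                                          // logical clock, bumped on each commit

    synchronized long now() { return clock; }
    synchronized String read(String partition) { return values.get(partition); }

    // The commit is conditional: it applies only if no touched partition changed after lowerBound.
    synchronized boolean commit(long lowerBound, Set<String> touched, Map<String, String> writes) {
        for (String p : touched) {
            if (lastModified.getOrDefault(p, 0L) > lowerBound) {
                return false; // a more recent write exists: abort, the client must retry
            }
        }
        long ts = ++clock;
        for (Map.Entry<String, String> w : writes.entrySet()) {
            values.put(w.getKey(), w.getValue());
            lastModified.put(w.getKey(), ts);
        }
        return true;
    }
}

// Coordinator-side state for one interactive transaction.
final class InteractiveTxn {
    private final VersionedStore store;
    private final long lowerBound;                                // timestamp bound taken at begin
    private final Set<String> touched = new HashSet<>();          // every partition read or written
    private final Map<String, String> buffered = new HashMap<>(); // writes held until commit

    InteractiveTxn(VersionedStore store) {
        this.store = store;
        this.lowerBound = store.now();
    }

    String read(String partition) {
        touched.add(partition);
        return buffered.containsKey(partition) ? buffered.get(partition) : store.read(partition);
    }

    void write(String partition, String value) {
        touched.add(partition);
        buffered.put(partition, value);
    }

    boolean commit() { return store.commit(lowerBound, touched, buffered); }
}
```

With two transactions begun at the same bound and touching the same partition, the first to commit succeeds and the second is rejected and must be retried from the start, which is the abort behaviour debated elsewhere in this thread.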
> > > > > >
> > > > > > > On Sep 29, 2021, at 6:18 PM, Jonathan Ellis <jbellis@gmail.com
> >
> > > > wrote:
> > > > > > >
> > > > > > > How are interactive transactions possible with Accord?
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Tue, Sep 21, 2021 at 11:56 PM benedict@apache.org <
> > > > > > benedict@apache.org>
> > > > > > > wrote:
> > > > > > >
> > > > > > >> Could you explain why you believe this trade-off is necessary?
> > We
> > > > can
> > > > > > >> support full SQL just fine with Accord, and I hope that we
> > > > eventually
> > > > > > do so.
> > > > > > >>
> > > > > > >> This domain is incredibly complex, so it is easy to reach
> wrong
> > > > > > >> conclusions. I would invite you again to propose a system for
> > > > > discussion
> > > > > > >> that you think offers something Accord is unable to, and that
> > you
> > > > > > consider
> > > > > > >> desirable, and we can work from there.
> > > > > > >>
> > > > > > >> To pre-empt some possible discussions, I am not aware of anything we
> > > > > > >> cannot do with Accord that we could do with either Calvin or Spanner.
> > > > > > >> Interactive transactions are possible on top of Accord, as are transactions
> > > > > > >> with an unknown read/write set. In each case the only cost is that they
> > > > > > >> would use optimistic concurrency control, which is no worse than the Spanner
> > > > > > >> derivatives anyway (which I have to assume is your benchmark in this
> > > > > > >> regard). I do not expect to deliver either functionality initially, but
> > > > > > >> Accord takes us most of the way there for both.
> > > > > > >>
> > > > > > >>
> > > > > > >> From: Jonathan Ellis <jb...@gmail.com>
> > > > > > >> Date: Wednesday, 22 September 2021 at 05:36
> > > > > > >> To: dev <de...@cassandra.apache.org>
> > > > > > >> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > > >> Right, I'm looking for exactly a discussion on the high level goals.
> > > > > > >> Instead of saying "here's the goals and we ruled out X because Y" we should
> > > > > > >> start with a discussion around, "Approach A allows X and W, approach B
> > > > > > >> allows Y and Z" and decide together what the goals should be and what
> > > > > > >> we are willing to trade to get those goals, e.g., are we willing to give up
> > > > > > >> global strict serializability to get the ability to support full SQL. Both
> > > > > > >> of these are nice to have!
> > > > > > >>
> > > > > > >> On Tue, Sep 21, 2021 at 9:52 PM benedict@apache.org <
> > > > > > benedict@apache.org>
> > > > > > >> wrote:
> > > > > > >>
> > > > > > >>> Hi Jonathan,
> > > > > > >>>
> > > > > > >>> These other systems are incompatible with the goals of the
> > CEP. I
> > > > do
> > > > > > >>> discuss them (besides 2PC) in both the whitepaper and the
> CEP,
> > > and
> > > > > will
> > > > > > >>> summarise that discussion below. A true and accurate
> comparison
> > > of
> > > > > > these
> > > > > > >>> other systems is essentially intractable, as there are
> complex
> > > > > > subtleties
> > > > > > >>> to each flavour, and those who are interested would be better
> > > > served
> > > > > by
> > > > > > >>> performing their own research.
> > > > > > >>>
> > > > > > >>> I think it is more productive to focus on what we want to
> > achieve
> > > > as
> > > > > a
> > > > > > >>> community. If you believe the goals of this CEP are wrong for
> > the
> > > > > > >> project,
> > > > > > >>> let’s focus on that. If you want to compare and contrast
> > specific
> > > > > > facets
> > > > > > >> of
> > > > > > >>> alternative systems that you consider to be preferable in
> some
> > > > > > dimension,
> > > > > > >>> let’s do that here or in a Q&A as proposed by Joey.
> > > > > > >>>
> > > > > > >>> The relevant goals are that we:
> > > > > > >>>
> > > > > > >>>
> > > > > > >>>  1.  Guarantee strict serializable isolation on commodity
> > > hardware
> > > > > > >>>  2.  Scale to any cluster size
> > > > > > >>>  3.  Achieve optimal latency
> > > > > > >>>
> > > > > > >>> The approach taken by Spanner derivatives is rejected by (1)
> > > > because
> > > > > > they
> > > > > > >>> guarantee only Serializable isolation (they additionally fail
> > > (3)).
> > > > > > From
> > > > > > >>> watching talks by YugaByte, and inferring from Cockroach’s
> > > > > > >>> panic-cluster-death under clock skew, this is clearly
> > considered
> > > by
> > > > > > >>> everyone to be undesirable but necessary to achieve
> > scalability.
> > > > > > >>>
> > > > > > >>> The approach taken by FaunaDB (Calvin) is rejected by (2)
> > because
> > > > its
> > > > > > >>> sequencing layer requires a global leader process for the
> > > cluster,
> > > > > > which
> > > > > > >> is
> > > > > > >>> incompatible with Cassandra’s scalability requirements. It
> > > > > additionally
> > > > > > >>> fails (3) for global clients.
> > > > > > >>>
> > > > > > >>> Two phase commit fails (3). As an aside, AFAICT DynamoDB is
> > > today a
> > > > > > >>> Spanner clone for its multi-key transaction functionality,
> not
> > > 2PC.
> > > > > > >>>
> > > > > > >>> Systems such as RAMP with even weaker isolation are not
> > > considered
> > > > > for
> > > > > > >> the
> > > > > > >>> simple reason that they do not even claim to meet (1).
> > > > > > >>>
> > > > > > >>> If we want to additionally offer weaker isolation levels than
> > > > > > >>> Serializable, such as that provided by the recent RAMP-TAO
> > paper,
> > > > > > >> Cassandra
> > > > > > >>> is likely able to support multiple distinct transaction
> layers
> > > that
> > > > > > >> operate
> > > > > > >>> independently. I would encourage you to file a CEP to explore
> > how
> > > > we
> > > > > > can
> > > > > > >>> meet these distinct use cases, but I consider them to be
> > niche. I
> > > > > > expect
> > > > > > >>> that a majority of our user base desire strict serializable
> > > > > isolation,
> > > > > > >> and
> > > > > > >>> certainly no less than serializable isolation, to augment the
> > > > > existing
> > > > > > >>> weaker isolation offered by quorum reads and writes.
> > > > > > >>>
> > > > > > >>> I would tangentially note that we are not an AP database
> under
> > > > normal
> > > > > > >>> recommended operation. A minority in any network partition
> > cannot
> > > > > reach
> > > > > > >>> QUORUM, so under recommended usage we are a high-availability
> > > > > > leaderless
> > > > > > >> CP
> > > > > > >>> database.
> > > > > > >>>
> > > > > > >>>
> > > > > > >>> From: Jonathan Ellis <jb...@gmail.com>
> > > > > > >>> Date: Tuesday, 21 September 2021 at 23:45
> > > > > > >>> To: dev <de...@cassandra.apache.org>
> > > > > > >>> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > > >>> Benedict, thanks for taking the lead in putting this
> together.
> > > > Since
> > > > > > >>> Cassandra is the only relevant database today designed
> around a
> > > > > > >> leaderless
> > > > > > >>> architecture, it's quite likely that we'll be better served
> > with
> > > a
> > > > > > custom
> > > > > > >>> transaction design instead of trying to retrofit one from CP
> > > > systems.
> > > > > > >>>
> > > > > > >>> The whitepaper here is a good description of the consensus
> > > > algorithm
> > > > > > >> itself
> > > > > > >>> as well as its robustness and stability characteristics, and
> > its
> > > > > > >> comparison
> > > > > > >>> with other state-of-the-art consensus algorithms is very
> > useful.
> > > > In
> > > > > > the
> > > > > > >>> context of Cassandra, where a consensus algorithm is only
> part
> > of
> > > > > what
> > > > > > >> will
> > > > > > >>> be implemented, I'd like to see a more complete evaluation of
> > the
> > > > > > >>> transactional side of things as well, including performance
> > > > > > >> characteristics
> > > > > > >>> as well as the types of transactions that can be supported
> and
> > at
> > > > > > least a
> > > > > > >>> general idea of what it would look like applied to Cassandra.
> > > This
> > > > > will
> > > > > > >>> allow the PMC to make a more informed decision about what
> > > tradeoffs
> > > > > are
> > > > > > >>> best for the entire long-term project of first supplementing
> > and
> > > > > > >> ultimately
> > > > > > >>> replacing LWT.
> > > > > > >>>
> > > > > > >>> (Allowing users to mix LWT and AP Cassandra operations
> against
> > > the
> > > > > same
> > > > > > >>> rows was probably a mistake, so in contrast with LWT we’re
> not
> > > > > looking
> > > > > > >> for
> > > > > > >>> something fast enough for occasional use but rather something
> > > > within
> > > > > a
> > > > > > >>> reasonable factor of AP operations, appropriate to being the
> > only
> > > > way
> > > > > > to
> > > > > > >>> interact with tables declared as such.)
> > > > > > >>>
> > > > > > >>> Besides Accord, this should cover
> > > > > > >>>
> > > > > > >>> - Calvin and FaunaDB
> > > > > > >>> - A Spanner derivative (no opinion on whether that should be
> > > > > Cockroach
> > > > > > or
> > > > > > >>> Yugabyte, I don’t think it’s necessary to cover both)
> > > > > > >>> - A 2PC implementation (the Accord paper mentions DynamoDB
> but
> > I
> > > > > > suspect
> > > > > > >>> there is more public information about MongoDB)
> > > > > > >>> - RAMP
> > > > > > >>>
> > > > > > >>> Here’s an example of what I mean:
> > > > > > >>>
> > > > > > >>> =Calvin=
> > > > > > >>>
> > > > > > >>> Approach: global consensus (Paxos in Calvin, Raft in FaunaDB)
> > to
> > > > > order
> > > > > > >>> transactions, then replicas execute the transactions
> > > independently
> > > > > with
> > > > > > >> no
> > > > > > >>> further coordination.  No SPOF.  Transactions are batched by
> > each
> > > > > > >> sequencer
> > > > > > >>> to keep this from becoming a bottleneck.
> > > > > > >>>
> > > > > > >>> Performance: Calvin paper (published 2012) reports linear
> > scaling
> > > > of
> > > > > > >> TPC-C
> > > > > > >>> New Order up to 500,000 transactions/s on 100 machines (EC2
> XL
> > > > > machines
> > > > > > >>> with 7GB ram and 8 virtual cores).  Note that TPC-C New Order
> > is
> > > > > > composed
> > > > > > >>> of four reads and four writes, so this is effectively 2M
> reads
> > > and
> > > > 2M
> > > > > > >>> writes as we normally measure them in C*.
> > > > > > >>>
> > > > > > >>> Calvin supports mixed read/write transactions, but because
> the
> > > > > > >> transaction
> > > > > > >>> execution logic requires knowing all partition keys in
> advance
> > to
> > > > > > ensure
> > > > > > >>> that all replicas can reproduce the same results with no
> > > > > coordination,
> > > > > > >>> reads against non-PK predicates must be done ahead of time
> > > > > > >> (transparently,
> > > > > > >>> by the server) to determine the set of keys, and this must be
> > > > retried
> > > > > > if
> > > > > > >>> the set of rows affected is updated before the actual
> > transaction
> > > > > > >> executes.
> > > > > > >>>
> > > > > > >>> Batching and global consensus adds latency -- 100ms in the
> > Calvin
> > > > > paper
> > > > > > >> and
> > > > > > >>> apparently about 50ms in FaunaDB.  Glass half full: all
> > > > transactions
> > > > > > >>> (including multi-partition updates) are equally performant in
> > > > Calvin
> > > > > > >> since
> > > > > > >>> the coordination is handled up front in the sequencing step.
> > > Glass
> > > > > > half
> > > > > > >>> empty: even single-row reads and writes have to pay the full
> > > > > > coordination
> > > > > > >>> cost.  Fauna has optimized this away for reads but I am not
> > aware
> > > > of
> > > > > a
> > > > > > >>> description of how they changed the design to allow this.
> > > > > > >>>
> > > > > > >>> Functionality and limitations: since the entire transaction
> > must
> > > be
> > > > > > known
> > > > > > >>> in advance to allow coordination-less execution at the
> > replicas,
> > > > > Calvin
> > > > > > >>> cannot support interactive transactions at all.  FaunaDB
> > > mitigates
> > > > > this
> > > > > > >> by
> > > > > > >>> allowing server-side logic to be included, but a Calvin
> > approach
> > > > will
> > > > > > >> never
> > > > > > >>> be able to offer SQL compatibility.
> > > > > > >>>
> > > > > > >>> Guarantees: Calvin transactions are strictly serializable.
> > There
> > > > is
> > > > > no
> > > > > > >>> additional complexity or performance hit to generalizing to
> > > > multiple
> > > > > > >>> regions, apart from the speed of light.  And since Calvin is
> > > > already
> > > > > > >> paying
> > > > > > >>> a batching latency penalty, this is less painful than for
> other
> > > > > > systems.
> > > > > > >>>
> > > > > > >>> Application to Cassandra: B-.  Distributed transactions are
> > > handled
> > > > > by
> > > > > > >> the
> > > > > > >>> sequencing and scheduling layers, which are leaderless, and
> > > > Calvin’s
> > > > > > >>> requirements for the storage layer are easily met by C*.  But
> > > > Calvin
> > > > > > also
> > > > > > >>> requires a global consensus protocol and LWT is almost
> > certainly
> > > > not
> > > > > > >>> sufficiently performant, so this would require ZK or etcd
> > > > (reasonable
> > > > > > >> for a
> > > > > > >>> library approach but not for replacing LWT in C* itself), or
> an
> > > > > > >>> implementation of Accord.  I don’t believe Calvin would
> require
> > > > > > >> additional
> > > > > > >>> table-level metadata in Cassandra.
> > > > > > >>>
> > > > > > >>> On Sun, Sep 5, 2021 at 9:33 AM benedict@apache.org <
> > > > > > benedict@apache.org>
> > > > > > >>> wrote:
> > > > > > >>>
> > > > > > >>>> Wiki:
> > > > > > >>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-15%3A+General+Purpose+Transactions
> > > > > > >>>> Whitepaper:
> > > > > > >>>> https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf
> > > > > > >>>> <https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf?version=1&modificationDate=1630847736966&api=v2>
> > > > > > >>>> Prototype: https://github.com/belliottsmith/accord
> > > > > > >>>>
> > > > > > >>>> Hi everyone, I’d like to propose this CEP for adoption by
> the
> > > > > > >> community.
> > > > > > >>>>
> > > > > > >>>> Cassandra has benefitted from LWTs for many years, but
> > > application
> > > > > > >>>> developers that want to ensure consistency for complex
> > > operations
> > > > > must
> > > > > > >>>> either accept the scalability bottleneck of serializing all
> > > > related
> > > > > > >> state
> > > > > > >>>> through a single partition, or layer a complex state machine
> > on
> > > > top
> > > > > of
> > > > > > >>> the
> > > > > > >>>> database. These are sophisticated and costly activities that
> > our
> > > > > users
> > > > > > >>>> should not be expected to undertake. Since distributed
> > databases
> > > > are
> > > > > > >>>> beginning to offer distributed transactions with fewer
> > caveats,
> > > it
> > > > > is
> > > > > > >>> past
> > > > > > >>>> time for Cassandra to do so as well.
> > > > > > >>>>
> > > > > > >>>> This CEP proposes the use of several novel techniques that
> > build
> > > > > upon
> > > > > > >>>> research (that followed EPaxos) to deliver (non-interactive)
> > > > general
> > > > > > >>>> purpose distributed transactions. The approach is outlined
> in
> > > the
> > > > > > >>> wikipage
> > > > > > >>>> and in more detail in the linked whitepaper. Importantly, by
> > > > > adopting
> > > > > > >>> this
> > > > > > >>>> approach we will be the _only_ distributed database to offer
> > > > global,
> > > > > > >>>> scalable, strict serializable transactions in one wide area
> > > > > > round-trip.
> > > > > > >>>> This would represent a significant improvement in the state
> of
> > > the
> > > > > > art,
> > > > > > >>>> both in the academic literature and in commercial or open
> > source
> > > > > > >>> offerings.
> > > > > > >>>>
> > > > > > >>>> This work has been partially realised in a prototype. This
> > > partial
> > > > > > >>>> prototype has been verified against Jepsen.io’s Maelstrom
> > > library
> > > > > and
> > > > > > >>>> dedicated in-tree strict serializability verification tools,
> > but
> > > > > much
> > > > > > >>> work
> > > > > > >>>> remains for the work to be production capable and integrated
> > > into
> > > > > > >>> Cassandra.
> > > > > > >>>>
> > > > > > >>>> I propose including the prototype in the project as a new
> > source
> > > > > > >>>> repository, to be developed as a standalone library for
> > > > integration
> > > > > > >> into
> > > > > > >>>> Cassandra. I hope the community sees the important value
> > > > proposition
> > > > > > of
> > > > > > >>>> this proposal, and will adopt the CEP after this discussion,
> > so
> > > > that
> > > > > > >> the
> > > > > > >>>> library and its integration into Cassandra can be developed
> in
> > > > > > parallel
> > > > > > >>> and
> > > > > > >>>> with the involvement of the wider community.
> > > > > > >>>>
> > > > > > >>>
> > > > > > >>>
> > > > > > >>> --
> > > > > > >>> Jonathan Ellis
> > > > > > >>> co-founder, http://www.datastax.com
> > > > > > >>> @spyced
> > > > > > >>>
> > > > > > >>
> > > > > > >>
> > > > > > >> --
> > > > > > >> Jonathan Ellis
> > > > > > >> co-founder, http://www.datastax.com
> > > > > > >> @spyced
> > > > > > >>
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Jonathan Ellis
> > > > > > > co-founder, http://www.datastax.com
> > > > > > > @spyced
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Jonathan Ellis
> > > > > co-founder, http://www.datastax.com
> > > > > @spyced
> > > > >
> > > >
> > >
> >
>


--
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by "benedict@apache.org" <be...@apache.org>.
You can take a look at the Accord library, as linked in the CEP: https://github.com/belliottsmith/accord

It will of course be modified extensively over time, but this is the basic shape of the API that is envisaged. You can take a look at the Maelstrom implementation for how this will be integrated with Cassandra (which of course will be much more involved).

There will be a function for describing atomic transactions involving some combination of reads and writes, and it will be possible to submit these operations and receive an answer back. The relevant point of integration for this is accord.local.Node#coordinate.

There will likely be separate APIs for providing the system with topology changes, which it will ensure are linearized correctly with respect to ongoing transactions.

But when it boils down to it, we are providing a single point of entry for one-shot transactions. So the API from the perspective of a developer building features on top is pretty simple.
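
To make the "single point of entry" idea concrete, here is a minimal sketch in Python rather than the library's own Java. Everything in it is an assumption made for illustration: the Txn, Result and ToyNode classes are hypothetical, and only the method name coordinate is borrowed from accord.local.Node#coordinate, whose real signature is not reproduced here. The toy runs in one process with no replication or consensus; it only shows the shape of a one-shot transaction, where the reads, the condition and the writes are all declared before submission and a single answer comes back.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Reads = Dict[str, Optional[int]]


@dataclass
class Txn:
    """Hypothetical one-shot transaction: fully declared before it is submitted."""
    read_keys: List[str]
    condition: Callable[[Reads], bool]       # should the writes be applied?
    update: Callable[[Reads], Dict[str, int]]  # writes computed from the reads


@dataclass
class Result:
    applied: bool
    reads: Reads


class ToyNode:
    """Single-process stand-in; a real node would agree on an order with its peers."""

    def __init__(self) -> None:
        self.store: Dict[str, int] = {}

    def coordinate(self, txn: Txn) -> Result:
        # In this toy, read + check + write form one indivisible step.
        reads = {k: self.store.get(k) for k in txn.read_keys}
        if not txn.condition(reads):
            return Result(applied=False, reads=reads)
        self.store.update(txn.update(reads))
        return Result(applied=True, reads=reads)


def transfer(src: str, dst: str, amount: int) -> Txn:
    """Conditional update spanning two keys, which may live in different partitions."""
    return Txn(
        read_keys=[src, dst],
        condition=lambda r: (r[src] or 0) >= amount,
        update=lambda r: {src: (r[src] or 0) - amount,
                          dst: (r[dst] or 0) + amount},
    )


node = ToyNode()
node.store.update({"alice": 100, "bob": 20})
ok = node.coordinate(transfer("alice", "bob", 70))        # condition holds
rejected = node.coordinate(transfer("alice", "bob", 70))  # balance now too low
print(ok.applied, rejected.applied, node.store)
```

The second call is rejected because its condition is evaluated against the state left by the first, which is the behaviour a multi-partition conditional batch would rely on. No client round trip happens between the read and the write, which is what distinguishes this from an interactive transaction.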


From: Paulo Motta <pa...@gmail.com>
Date: Friday, 1 October 2021 at 15:40
To: Cassandra DEV <de...@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> With respect to this, in my view this kind of detail is not warranted
> within a CEP. Software development is an exploratory process with respect
> to structure, and these decisions will be made as the CEP progresses. If
> these need to be specified upfront, then the purpose of a CEP – seeking buy
> in – is invalidated, because the work must be complete before you know the
> answers.

These need not be set in stone; they're just a rough sketch of what the
end product will look like, to make it easier to build a mental model of the
project, especially for those not directly involved with it, as well as to
guide its development for those involved. At least for me it's much easier
to visualize a project top-down (from how it's going to be used to its
particular implementation details) than the other way around.

On Fri, 1 Oct 2021 at 11:33, benedict@apache.org <
benedict@apache.org> wrote:

> > The current document thoroughly details the protocol but in my view
> fails to illustrate which specific APIs, methods, and modules will become
> available to developers
>
> With respect to this, in my view this kind of detail is not warranted
> within a CEP. Software development is an exploratory process with respect
> to structure, and these decisions will be made as the CEP progresses. If
> these need to be specified upfront, then the purpose of a CEP – seeking buy
> in – is invalidated, because the work must be complete before you know the
> answers.
>
>
> From: benedict@apache.org <be...@apache.org>
> Date: Friday, 1 October 2021 at 15:31
> To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> From the CEP:
>
> Batches (including unconditional batches) on transactional tables will
> receive ACID properties, and grammatically correct conditional batch
> operations that would be rejected for operating over multiple CQL
> partitions will now be supported
>
>
> From: Paulo Motta <pa...@gmail.com>
> Date: Friday, 1 October 2021 at 15:30
> To: Cassandra DEV <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> Can you just answer what palpable feature will be available once this CEP
> lands? This is still not clear to me (and perhaps to others) from
> the current CEP structure. The current document thoroughly details the
> protocol but in my view fails to illustrate which specific APIs, methods,
> and modules will become available to developers, how it fits into the larger
> picture and interacts with existing modules (if at all), or to offer a few
> examples of how it can be used to build features on top.
>
> On Fri, 1 Oct 2021 at 11:10, benedict@apache.org <
> benedict@apache.org> wrote:
>
> > I’m not, though it might seem that way. I disagree with your views about
> > how CEPs should be structured. Since the CEP process was itself codified
> > via the CEP process, if you want to recodify how CEPs work, the correct way
> > is via the CEP process itself.
> >
> > The discussion is being drawn in multiple directions away from the CEP
> > itself, and I am trying to keep this particular thread focused on the
> > business at hand, not meta discussions around CEP structure that will no
> > doubt be unproductive given likely irreconcilable views about the topic,
> > nor discussions about other CEPs that could have been.
> >
> > If you want to start a separate exploratory discussion thread about CEP
> > structure without filing a CEP feel free to do so.
> >
> >
> > From: Paulo Motta <pa...@gmail.com>
> > Date: Friday, 1 October 2021 at 15:04
> > To: Cassandra DEV <de...@cassandra.apache.org>
> > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > If you want to impose your views on CEP structure on others, please
> file
> > a CEP with the additional restrictions and guidance you want to impose
> and
> > start a discussion thread. I can then respond in detail to why I perceive
> > this approach to be flawed, in a dedicated context.
> >
> > This sounds very Kafkaesque. You know I won't file a meta-CEP to change the
> > structure of CEPs, so you're just using this as an excuse to shut down the
> > discussion of the lack of clarity on what actual palpable feature will be
> > available once the CEP lands. :-)
> >
> > I'm just providing my humble feedback on how a CEP could be more
> digestible
> > and easier to consume from an external point of view, and this seems like
> > an appropriate and contextualized place to voice this opinion which is
> > perhaps shared by others.
> >
> > On Fri, 1 Oct 2021 at 10:55, benedict@apache.org <
> > benedict@apache.org> wrote:
> >
> > > I disagree with you. However, this is the wrong forum to have a meta
> > > discussion about how CEP should be structured.
> > >
> > > If you want to impose your views on CEP structure on others, please
> file
> > a
> > > CEP with the additional restrictions and guidance you want to impose
> and
> > > start a discussion thread. I can then respond in detail to why I
> perceive
> > > this approach to be flawed, in a dedicated context.
> > >
> > >
> > > From: Paulo Motta <pa...@gmail.com>
> > > Date: Friday, 1 October 2021 at 14:48
> > > To: Cassandra DEV <de...@cassandra.apache.org>
> > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > >  The proposal as it stands today is exceptionally thorough, more so
> > than
> > > any other CEP to date, or any CEP is likely to be in the near future.
> > >
> > > The protocol is thoroughly described, but in my view a CEP is a forum to
> > > discuss the high level architecture and plan for adding a full
> end-to-end
> > > enhancement to the database, breaking it into sub-CEPs if needed, as
> long
> > > as the full plan is known in advance, otherwise the community will not
> > have
> > > the context to judge the full extent and impact of the proposed
> > > enhancement.
> > >
> > > > Since it remains unclear to me what either yourself or Jonathan want
> to
> > > see as an alternative
> > >
> > > I would personally like to see something along these lines:
> > >
> > > CEP1: Add ACID-compliant atomic batches
> > > - UX changes needed: none, CQL provides the grammar we need.
> > > - Distributed transaction protocol needed: Accord (link to white paper
> if
> > > you want specific details about the protocol)
> > > - High-level architecture: what new components will be added, how
> > existing
> > > components will be modified, what new messages will be added, what new
> > > configuration knobs will be introduced, what are the milestones of the
> > > project, etc.
> > >
> > > CEP2: Make LWT faster and more reliable
> > > - UX changes needed: none
> > > - Distributed transaction protocol needed: Accord, already added by
> > > previous CEP.
> > > - High-level architecture: blablabla... and so on.
> > >
> > > On Fri, 1 Oct 2021 at 10:19, benedict@apache.org <
> > > benedict@apache.org> wrote:
> > >
> > > > I think this is getting circular and unproductive. Basic
> disagreements
> > > > about whether the CEP specifies a feature I am inclined to leave for
> a
> > > > vote. In my view the CEP specifies several features, both immediate
> > ones
> > > > for the user (ACID batches and multi-key LWTs) and developer-focused
> > ones
> > > > around ground-breaking semantics that will be enabled.
> > > >
> > > > The proposal as it stands today is exceptionally thorough, more so
> than
> > > > any other CEP to date, or any CEP is likely to be in the near future.
> > > >
> > > > This is a Cassandra Enhancement *Proposal*, and at some point we have
> > to
> > > > engage with what is proposed, not what you might like to be proposed.
> > > Since
> > > > it remains unclear to me what either yourself or Jonathan want to see
> > as
> > > an
> > > > alternative, at this point it would seem more productive to produce
> > your
> > > > own proposals for the community to consider. It is possible for
> > multiple
> > > > transaction systems to co-exist, if you feel this is necessary.
> > > >
> > > >
> > > >
> > > > From: Paulo Motta <pa...@gmail.com>
> > > > Date: Friday, 1 October 2021 at 13:58
> > > > To: Cassandra DEV <de...@cassandra.apache.org>
> > > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > I share jbellis's feeling that this proposal seems to be focusing
> > > > on the protocol itself but lacks the actual feature that will use the
> > > > protocol, which IMO is a key element to discuss in a CEP.
> > > >
> > > > It's similar to saying: hey I want to add this Tries Serialization
> > > Protocol
> > > > to Cassandra, but not providing specific details of how this protocol
> > is
> > > > going to be used.
> > > >
> > > > I think the right route for a CEP is to describe the feature that
> will
> > be
> > > > added to the database and the protocol is a mere requirement of the
> > > > high-level feature, for example:
> > > >
> > > > CEP: Add Trie-backed memtable
> > > > - Trie Serialization Protocol: implementation detail of the above CEP
> > > >
> > > > What is the difficulty of taking this approach, picking one of the
> > myriad
> > > > of features that will be enabled by Accord and using that as the
> > initial
> > > > CEP to introduce the protocol to the database?
> > > >
> > > > On Fri, 1 Oct 2021 at 08:37, benedict@apache.org <
> > > > benedict@apache.org> wrote:
> > > >
> > > > > Actually, thinking about it again, the simple optimistic protocol
> > would
> > > > in
> > > > > fact guarantee system forward progress (i.e. independent of
> > transaction
> > > > > formulation).
> > > > >
> > > > >
> > > > > From: benedict@apache.org <be...@apache.org>
> > > > > Date: Friday, 1 October 2021 at 09:14
> > > > > To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> > > > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > Hi Jonathan,
> > > > >
> > > > > It would be great if we could achieve a bandwidth higher than 1-2
> > short
> > > > > emails per week. It remains unclear to me what your goal is, and it
> > > would
> > > > > help if you could make a statement like “I want Cassandra to be
> able
> > to
> > > > do
> > > > > X” so that we can respond directly to it. I am also available to
> have
> > > > > another call, in which we can have a back and forth, please feel
> free
> > > to
> > > > > propose a London-compatible time within the next week that is
> > suitable
> > > > for
> > > > > you.
> > > > >
> > > > > In my opinion we are at risk of veering off-topic, though. This CEP
> > is
> > > > not
> > > > > to deliver interactive transactions, and to my knowledge nobody is
> > > > > proposing a CEP for interactive transactions. So, for the CEP at
> hand
> > > the
> > > > > salient question seems: does this CEP prevent us from implementing
> > > > > interactive transactions with properties X, Y, Z in future? To
> which
> > > the
> > > > > answer is almost certainly no.
> > > > >
> > > > > However, to continue the discussion and respond directly to your
> > > queries,
> > > > > I believe we agree on the definition of an interactive transaction.
> > > > >
> > > > > Two protocols were loosely outlined. The first, using timestamps
> for
> > > > > optimistic concurrency control, would indeed involve the
> possibility
> > of
> > > > > aborts. It would not however inherently adopt the issue of LWTs
> where
> > > no
> > > > > transaction is able to make progress. Whether or not progress is
> > > > guaranteed
> > > > > (in a livelock-free sense) would depend on the structure of the
> > > > > transactions that were interfering.
> > > > >
> > > > > This approach has the advantage of being very simple to implement,
> so
> > > > that
> > > > > we could realistically support interactive transactions quite
> > quickly.
> > > It
> > > > > has the additional advantage that transactions would execute very
> > > quickly
> > > > > by avoiding the WAN during construction, and as a result may in
> > > practice
> > > > > experience fewer aborts than protocols that guarantee
> > livelock-freedom.
> > > > >
> > > > > The second protocol proposed using read/write intents and would be
> > able
> > > > to
> > > > > support almost any behaviour you want. We could even utilise
> > > pessimistic
> > > > > concurrency control, or anything in-between. This is its own huge
> > > design
> > > > > space, and discussion of this approach and the trade-offs that
> could
> > be
> > > > > made is (in my opinion) entirely out of scope for this CEP.
> > > > >
> > > > >
> > > > > From: Jonathan Ellis <jb...@gmail.com>
> > > > > Date: Friday, 1 October 2021 at 05:00
> > > > > To: dev <de...@cassandra.apache.org>
> > > > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > The obstacle for me is you've provided a protocol but not a fully
> > > fleshed
> > > > > out architecture, so it's hard to fill in some of the blanks.  But
> it
> > > > looks
> > > > > to me like optimistic concurrency control for interactive
> > transactions
> > > > > applied to Accord would leave you in a LWT-like situation under
> > fairly
> > > > > light contention where nobody actually makes progress due to
> retries.
> > > > >
> > > > > To make sure we're talking about the same thing, as Henrik pointed
> > out,
> > > > > interactive transactions mean multiple round trips from the client
> > > > within a
> > > > > > transaction.  For example, here
> > > > > > <https://github.com/apavlo/py-tpcc/blob/master/pytpcc/drivers/sqlitedriver.py#L213>
> > > > > > is a simple implementation of the TPC-C New Order transaction.  The
> > > > > > high level logic (via
> > > > > > <https://courses.cs.washington.edu/courses/csep545/01wi/lectures/class1/tsld039.htm>)
> > > > > > is,
> > > > >
> > > > >    1. Get records describing a warehouse, customer, & district
> > > > >    2. Update the district
> > > > >    3. Increment next available order number
> > > > >    4. Insert record into Order and New-Order tables
> > > > >    5. For 5-15 items, get Item record, get/update Stock record
> > > > >    6. Insert Order-Line Record
> > > > >
> > > > > As you can see, this requires a lot of client-side logic mixed in
> > with
> > > > the
> > > > > actual SQL commands.
> > > > >
> > > > >
> > > > > On Thu, Sep 30, 2021 at 2:30 AM benedict@apache.org <
> > > benedict@apache.org
> > > > >
> > > > > wrote:
> > > > >
> > > > > > Essentially this, although I think in practice we will need to
> > track
> > > > each
> > > > > > partition’s timestamp separately (or optionally for reduced
> > > conflicts,
> > > > > each
> > > > > > row or datum’s), and make them all part of the conditional
> > > application
> > > > of
> > > > > > the transaction - at least for strict-serializability.
> > > > > >
> > > > > > The alternative is to insert read/write intents for the
> transaction
> > > > > during
> > > > > > each step, and to confirm they are still valid on commit, but
> this
> > > > > approach
> > > > > > would require a WAN round-trip for each step in the interactive
> > > > > > transaction, whereas the timestamp-validating approach can use a
> > LAN
> > > > > > round-trip for each step besides the final one, and is also much
> > > > simpler
> > > > > to
> > > > > > implement.
> > > > > >
> > > > > >
> > > > > > From: Blake Eggleston <be...@apple.com.INVALID>
> > > > > > Date: Thursday, 30 September 2021 at 05:47
> > > > > > To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> > > > > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > > You could establish a lower timestamp bound and buffer
> transaction
> > > > state
> > > > > > on the coordinator, then make the commit an operation that only
> > > applies
> > > > > if
> > > > > > all partitions involved haven’t been changed by a more recent
> > > > timestamp.
> > > > > > You could also implement mvcc either in the storage layer or for
> > some
> > > > > > period of time by buffering commits on each replica before
> > applying.
> > > > > >
> > > > > > > On Sep 29, 2021, at 6:18 PM, Jonathan Ellis <jbellis@gmail.com
> >
> > > > wrote:
> > > > > > >
> > > > > > > How are interactive transactions possible with Accord?
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Tue, Sep 21, 2021 at 11:56 PM benedict@apache.org <
> > > > > > benedict@apache.org>
> > > > > > > wrote:
> > > > > > >
> > > > > > >> Could you explain why you believe this trade-off is necessary?
> > We
> > > > can
> > > > > > >> support full SQL just fine with Accord, and I hope that we
> > > > eventually
> > > > > > do so.
> > > > > > >>
> > > > > > >> This domain is incredibly complex, so it is easy to reach
> wrong
> > > > > > >> conclusions. I would invite you again to propose a system for
> > > > > discussion
> > > > > > >> that you think offers something Accord is unable to, and that
> > you
> > > > > > consider
> > > > > > >> desirable, and we can work from there.
> > > > > > >>
> > > > > > >> To pre-empt some possible discussions, I am not aware of
> > anything
> > > we
> > > > > > >> cannot do with Accord that we could do with either Calvin or
> > > > Spanner.
> > > > > > >> Interactive transactions are possible on top of Accord, as are
> > > > > > transactions
> > > > > > >> with an unknown read/write set. In each case the only cost is
> > that
> > > > > they
> > > > > > >> would use optimistic concurrency control, which is no worse
> than the
> > > Spanner
> > > > > > >> derivatives anyway (which I have to assume is your benchmark
> in
> > > this
> > > > > > >> regard). I do not expect to deliver either functionality
> > > initially,
> > > > > but
> > > > > > >> Accord takes us most of the way there for both.
> > > > > > >>
> > > > > > >>
> > > > > > >> From: Jonathan Ellis <jb...@gmail.com>
> > > > > > >> Date: Wednesday, 22 September 2021 at 05:36
> > > > > > >> To: dev <de...@cassandra.apache.org>
> > > > > > >> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > > >> Right, I'm looking for exactly a discussion on the high level
> > > goals.
> > > > > > >> Instead of saying "here's the goals and we ruled out X because
> > Y"
> > > we
> > > > > > should
> > > > > > >> start with a discussion around, "Approach A allows X and W,
> > > > approach B
> > > > > > >> allows Y and Z" and decide together what the goals should be
> and
> > > > > > what
> > > > > > >> we are willing to trade to get those goals, e.g., are we
> willing
> > > to
> > > > > > give up
> > > > > > >> global strict serializability to get the ability to support
> full
> > > > SQL.
> > > > > > Both
> > > > > > >> of these are nice to have!
> > > > > > >>
> > > > > > >> On Tue, Sep 21, 2021 at 9:52 PM benedict@apache.org <
> > > > > > benedict@apache.org>
> > > > > > >> wrote:
> > > > > > >>
> > > > > > >>> Hi Jonathan,
> > > > > > >>>
> > > > > > >>> These other systems are incompatible with the goals of the
> > CEP. I
> > > > do
> > > > > > >>> discuss them (besides 2PC) in both the whitepaper and the
> CEP,
> > > and
> > > > > will
> > > > > > >>> summarise that discussion below. A true and accurate
> comparison
> > > of
> > > > > > these
> > > > > > >>> other systems is essentially intractable, as there are
> complex
> > > > > > subtleties
> > > > > > >>> to each flavour, and those who are interested would be better
> > > > served
> > > > > by
> > > > > > >>> performing their own research.
> > > > > > >>>
> > > > > > >>> I think it is more productive to focus on what we want to
> > achieve
> > > > as
> > > > > a
> > > > > > >>> community. If you believe the goals of this CEP are wrong for
> > the
> > > > > > >> project,
> > > > > > >>> let’s focus on that. If you want to compare and contrast
> > specific
> > > > > > facets
> > > > > > >> of
> > > > > > >>> alternative systems that you consider to be preferable in
> some
> > > > > > dimension,
> > > > > > >>> let’s do that here or in a Q&A as proposed by Joey.
> > > > > > >>>
> > > > > > >>> The relevant goals are that we:
> > > > > > >>>
> > > > > > >>>
> > > > > > >>>  1.  Guarantee strict serializable isolation on commodity
> > > hardware
> > > > > > >>>  2.  Scale to any cluster size
> > > > > > >>>  3.  Achieve optimal latency
> > > > > > >>>
> > > > > > >>> The approach taken by Spanner derivatives is rejected by (1)
> > > > because
> > > > > > they
> > > > > > >>> guarantee only Serializable isolation (they additionally fail
> > > (3)).
> > > > > > From
> > > > > > >>> watching talks by YugaByte, and inferring from Cockroach’s
> > > > > > >>> panic-cluster-death under clock skew, this is clearly
> > considered
> > > by
> > > > > > >>> everyone to be undesirable but necessary to achieve
> > scalability.
> > > > > > >>>
> > > > > > >>> The approach taken by FaunaDB (Calvin) is rejected by (2)
> > because
> > > > its
> > > > > > >>> sequencing layer requires a global leader process for the
> > > cluster,
> > > > > > which
> > > > > > >> is
> > > > > > >>> incompatible with Cassandra’s scalability requirements. It
> > > > > additionally
> > > > > > >>> fails (3) for global clients.
> > > > > > >>>
> > > > > > >>> Two phase commit fails (3). As an aside, AFAICT DynamoDB is
> > > today a
> > > > > > >>> Spanner clone for its multi-key transaction functionality,
> not
> > > 2PC.
> > > > > > >>>
> > > > > > >>> Systems such as RAMP with even weaker isolation are not
> > > considered
> > > > > for
> > > > > > >> the
> > > > > > >>> simple reason that they do not even claim to meet (1).
> > > > > > >>>
> > > > > > >>> If we want to additionally offer weaker isolation levels than
> > > > > > >>> Serializable, such as that provided by the recent RAMP-TAO
> > paper,
> > > > > > >> Cassandra
> > > > > > >>> is likely able to support multiple distinct transaction
> layers
> > > that
> > > > > > >> operate
> > > > > > >>> independently. I would encourage you to file a CEP to explore
> > how
> > > > we
> > > > > > can
> > > > > > >>> meet these distinct use cases, but I consider them to be
> > niche. I
> > > > > > expect
> > > > > > >>> that a majority of our user base desire strict serializable
> > > > > isolation,
> > > > > > >> and
> > > > > > >>> certainly no less than serializable isolation, to augment the
> > > > > existing
> > > > > > >>> weaker isolation offered by quorum reads and writes.
> > > > > > >>>
> > > > > > >>> I would tangentially note that we are not an AP database
> under
> > > > normal
> > > > > > >>> recommended operation. A minority in any network partition
> > cannot
> > > > > reach
> > > > > > >>> QUORUM, so under recommended usage we are a high-availability
> > > > > > leaderless
> > > > > > >> CP
> > > > > > >>> database.
> > > > > > >>>
> > > > > > >>>
> > > > > > >>> From: Jonathan Ellis <jb...@gmail.com>
> > > > > > >>> Date: Tuesday, 21 September 2021 at 23:45
> > > > > > >>> To: dev <de...@cassandra.apache.org>
> > > > > > >>> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > > >>> Benedict, thanks for taking the lead in putting this
> together.
> > > > Since
> > > > > > >>> Cassandra is the only relevant database today designed
> around a
> > > > > > >> leaderless
> > > > > > >>> architecture, it's quite likely that we'll be better served
> > with
> > > a
> > > > > > custom
> > > > > > >>> transaction design instead of trying to retrofit one from CP
> > > > systems.
> > > > > > >>>
> > > > > > >>> The whitepaper here is a good description of the consensus
> > > > algorithm
> > > > > > >> itself
> > > > > > >>> as well as its robustness and stability characteristics, and
> > its
> > > > > > >> comparison
> > > > > > >>> with other state-of-the-art consensus algorithms is very
> > useful.
> > > > In
> > > > > > the
> > > > > > >>> context of Cassandra, where a consensus algorithm is only
> part
> > of
> > > > > what
> > > > > > >> will
> > > > > > >>> be implemented, I'd like to see a more complete evaluation of
> > the
> > > > > > >>> transactional side of things as well, including performance
> > > > > > >> characteristics
> > > > > > >>> as well as the types of transactions that can be supported
> and
> > at
> > > > > > least a
> > > > > > >>> general idea of what it would look like applied to Cassandra.
> > > This
> > > > > will
> > > > > > >>> allow the PMC to make a more informed decision about what
> > > tradeoffs
> > > > > are
> > > > > > >>> best for the entire long-term project of first supplementing
> > and
> > > > > > >> ultimately
> > > > > > >>> replacing LWT.
> > > > > > >>>
> > > > > > >>> (Allowing users to mix LWT and AP Cassandra operations
> against
> > > the
> > > > > same
> > > > > > >>> rows was probably a mistake, so in contrast with LWT we’re
> not
> > > > > looking
> > > > > > >> for
> > > > > > >>> something fast enough for occasional use but rather something
> > > > within
> > > > > a
> > > > > > >>> reasonable factor of AP operations, appropriate to being the
> > only
> > > > way
> > > > > > to
> > > > > > >>> interact with tables declared as such.)
> > > > > > >>>
> > > > > > >>> Besides Accord, this should cover
> > > > > > >>>
> > > > > > >>> - Calvin and FaunaDB
> > > > > > >>> - A Spanner derivative (no opinion on whether that should be
> > > > > Cockroach
> > > > > > or
> > > > > > >>> Yugabyte, I don’t think it’s necessary to cover both)
> > > > > > >>> - A 2PC implementation (the Accord paper mentions DynamoDB
> but
> > I
> > > > > > suspect
> > > > > > >>> there is more public information about MongoDB)
> > > > > > >>> - RAMP
> > > > > > >>>
> > > > > > >>> Here’s an example of what I mean:
> > > > > > >>>
> > > > > > >>> =Calvin=
> > > > > > >>>
> > > > > > >>> Approach: global consensus (Paxos in Calvin, Raft in FaunaDB)
> > to
> > > > > order
> > > > > > >>> transactions, then replicas execute the transactions
> > > independently
> > > > > with
> > > > > > >> no
> > > > > > >>> further coordination.  No SPOF.  Transactions are batched by
> > each
> > > > > > >> sequencer
> > > > > > >>> to keep this from becoming a bottleneck.
> > > > > > >>>
> > > > > > >>> Performance: Calvin paper (published 2012) reports linear
> > scaling
> > > > of
> > > > > > >> TPC-C
> > > > > > >>> New Order up to 500,000 transactions/s on 100 machines (EC2
> XL
> > > > > machines
> > > > > > >>> with 7GB ram and 8 virtual cores).  Note that TPC-C New Order
> > is
> > > > > > composed
> > > > > > >>> of four reads and four writes, so this is effectively 2M
> reads
> > > and
> > > > 2M
> > > > > > >>> writes as we normally measure them in C*.
> > > > > > >>>
> > > > > > >>> Calvin supports mixed read/write transactions, but because
> the
> > > > > > >> transaction
> > > > > > >>> execution logic requires knowing all partition keys in
> advance
> > to
> > > > > > ensure
> > > > > > >>> that all replicas can reproduce the same results with no
> > > > > coordination,
> > > > > > >>> reads against non-PK predicates must be done ahead of time
> > > > > > >> (transparently,
> > > > > > >>> by the server) to determine the set of keys, and this must be
> > > > retried
> > > > > > if
> > > > > > >>> the set of rows affected is updated before the actual
> > transaction
> > > > > > >> executes.
> > > > > > >>>
> > > > > > > >>> Batching and global consensus add latency -- 100ms in the
> > Calvin
> > > > > paper
> > > > > > >> and
> > > > > > >>> apparently about 50ms in FaunaDB.  Glass half full: all
> > > > transactions
> > > > > > >>> (including multi-partition updates) are equally performant in
> > > > Calvin
> > > > > > >> since
> > > > > > >>> the coordination is handled up front in the sequencing step.
> > > Glass
> > > > > > half
> > > > > > >>> empty: even single-row reads and writes have to pay the full
> > > > > > coordination
> > > > > > >>> cost.  Fauna has optimized this away for reads but I am not
> > aware
> > > > of
> > > > > a
> > > > > > >>> description of how they changed the design to allow this.
> > > > > > >>>
> > > > > > > >>> Functionality and limitations: since the entire transaction must
> > > > > > > >>> be known in advance to allow coordination-less execution at the
> > > > > > > >>> replicas, Calvin cannot support interactive transactions at all.
> > > > > > > >>> FaunaDB mitigates this by allowing server-side logic to be
> > > > > > > >>> included, but a Calvin approach will never be able to offer SQL
> > > > > > > >>> compatibility.
> > > > > > >>>
> > > > > > > >>> Guarantees: Calvin transactions are strictly serializable.
> > > > > > > >>> There is no additional complexity or performance hit to
> > > > > > > >>> generalizing to multiple regions, apart from the speed of light.
> > > > > > > >>> And since Calvin is already paying a batching latency penalty,
> > > > > > > >>> this is less painful than for other systems.
> > > > > > >>>
> > > > > > > >>> Application to Cassandra: B-.  Distributed transactions are
> > > > > > > >>> handled by the sequencing and scheduling layers, which are
> > > > > > > >>> leaderless, and Calvin’s requirements for the storage layer are
> > > > > > > >>> easily met by C*.  But Calvin also requires a global consensus
> > > > > > > >>> protocol and LWT is almost certainly not sufficiently
> > > > > > > >>> performant, so this would require ZK or etcd (reasonable for a
> > > > > > > >>> library approach but not for replacing LWT in C* itself), or an
> > > > > > > >>> implementation of Accord.  I don’t believe Calvin would require
> > > > > > > >>> additional table-level metadata in Cassandra.
> > > > > > >>>
> > > > > > > >>> On Sun, Sep 5, 2021 at 9:33 AM benedict@apache.org
> > > > > > > >>> <benedict@apache.org> wrote:
> > > > > > > >>>
> > > > > > > >>>> Wiki:
> > > > > > > >>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-15%3A+General+Purpose+Transactions
> > > > > > > >>>> Whitepaper:
> > > > > > > >>>> https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf
> > > > > > > >>>> <https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf?version=1&modificationDate=1630847736966&api=v2>
> > > > > > > >>>> Prototype: https://github.com/belliottsmith/accord
> > > > > > >>>>
> > > > > > > >>>> Hi everyone, I’d like to propose this CEP for adoption by the
> > > > > > > >>>> community.
> > > > > > > >>>>
> > > > > > > >>>> Cassandra has benefitted from LWTs for many years, but
> > > > > > > >>>> application developers that want to ensure consistency for
> > > > > > > >>>> complex operations must either accept the scalability
> > > > > > > >>>> bottleneck of serializing all related state through a single
> > > > > > > >>>> partition, or layer a complex state machine on top of the
> > > > > > > >>>> database. These are sophisticated and costly activities that
> > > > > > > >>>> our users should not be expected to undertake. Since
> > > > > > > >>>> distributed databases are beginning to offer distributed
> > > > > > > >>>> transactions with fewer caveats, it is past time for Cassandra
> > > > > > > >>>> to do so as well.
> > > > > > > >>>>
> > > > > > > >>>> This CEP proposes the use of several novel techniques that
> > > > > > > >>>> build upon research (that followed EPaxos) to deliver
> > > > > > > >>>> (non-interactive) general purpose distributed transactions.
> > > > > > > >>>> The approach is outlined in the wikipage and in more detail in
> > > > > > > >>>> the linked whitepaper. Importantly, by adopting this approach
> > > > > > > >>>> we will be the _only_ distributed database to offer global,
> > > > > > > >>>> scalable, strict serializable transactions in one wide area
> > > > > > > >>>> round-trip. This would represent a significant improvement in
> > > > > > > >>>> the state of the art, both in the academic literature and in
> > > > > > > >>>> commercial or open source offerings.
> > > > > > > >>>>
> > > > > > > >>>> This work has been partially realised in a prototype. This
> > > > > > > >>>> partial prototype has been verified against Jepsen.io’s
> > > > > > > >>>> Maelstrom library and dedicated in-tree strict serializability
> > > > > > > >>>> verification tools, but much work remains for the work to be
> > > > > > > >>>> production capable and integrated into Cassandra.
> > > > > > > >>>>
> > > > > > > >>>> I propose including the prototype in the project as a new
> > > > > > > >>>> source repository, to be developed as a standalone library for
> > > > > > > >>>> integration into Cassandra. I hope the community sees the
> > > > > > > >>>> important value proposition of this proposal, and will adopt
> > > > > > > >>>> the CEP after this discussion, so that the library and its
> > > > > > > >>>> integration into Cassandra can be developed in parallel and
> > > > > > > >>>> with the involvement of the wider community.
> > > > > > >>>>
> > > > > > >>>
> > > > > > >>>
> > > > > > >>> --
> > > > > > >>> Jonathan Ellis
> > > > > > >>> co-founder, http://www.datastax.com
> > > > > > >>> @spyced
> > > > > > >>>
> > > > > > >>
> > > > > > >>
> > > > > > >> --
> > > > > > >> Jonathan Ellis
> > > > > > >> co-founder, http://www.datastax.com
> > > > > > >> @spyced
> > > > > > >>
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Jonathan Ellis
> > > > > > > co-founder, http://www.datastax.com
> > > > > > > @spyced
> > > > > >
> > > > > >
> > > > > >
> > ---------------------------------------------------------------------
> > > > > > To unsubscribe, e-mail: dev-unsubscribe@cassandra.apache.org
> > > > > > For additional commands, e-mail: dev-help@cassandra.apache.org
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Jonathan Ellis
> > > > > co-founder, http://www.datastax.com
> > > > > @spyced
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by Jonathan Ellis <jb...@gmail.com>.
I honestly can't understand the perspective that on the one hand, you're
asking for approval of a specific protocol as part of the CEP, but on the
other, you think discussion of the APIs this will enable is not warranted.
Surely we need agreement on what APIs we're trying to build, before we
discuss the protocols and architectures with which to build them.

On Fri, Oct 1, 2021 at 9:34 AM benedict@apache.org <be...@apache.org>
wrote:

> > The current document details the protocol thoroughly but in my view
> > fails to illustrate what specific APIs, methods, and modules will become
> > available to developers
>
> With respect to this, in my view this kind of detail is not warranted
> within a CEP. Software development is an exploratory process with respect
> to structure, and these decisions will be made as the CEP progresses. If
> these need to be specified upfront, then the purpose of a CEP – seeking buy
> in – is invalidated, because the work must be complete before you know the
> answers.
>
>
> From: benedict@apache.org <be...@apache.org>
> Date: Friday, 1 October 2021 at 15:31
> To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> From the CEP:
>
> Batches (including unconditional batches) on transactional tables will
> receive ACID properties, and grammatically correct conditional batch
> operations that would be rejected for operating over multiple CQL
> partitions will now be supported
>
>
> From: Paulo Motta <pa...@gmail.com>
> Date: Friday, 1 October 2021 at 15:30
> To: Cassandra DEV <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> Can you just say what tangible feature will be available once this CEP
> lands? This is still not clear to me (and perhaps to others) from the
> current CEP structure. The current document details the protocol
> thoroughly but in my view fails to illustrate what specific APIs, methods,
> and modules will become available to developers, how it fits into the
> larger picture and interacts with existing modules (if at all), and perhaps
> a few examples of how it can be used to build features on top.
>
> Em sex., 1 de out. de 2021 às 11:10, benedict@apache.org <
> benedict@apache.org> escreveu:
>
> > I’m not, though it might seem that way. I disagree with your views about
> > how CEPs should be structured. Since the CEP process was itself codified
> > via the CEP process, if you want to recodify how CEPs work, the correct
> > way is via the CEP process itself.
> >
> > The discussion is being drawn in multiple directions away from the CEP
> > itself, and I am trying to keep this particular thread focused on the
> > business at hand, not meta discussions around CEP structure that will no
> > doubt be unproductive given likely irreconcilable views about the topic,
> > nor discussions about other CEPs that could have been.
> >
> > If you want to start a separate exploratory discussion thread about CEP
> > structure without filing a CEP feel free to do so.
> >
> >
> > From: Paulo Motta <pa...@gmail.com>
> > Date: Friday, 1 October 2021 at 15:04
> > To: Cassandra DEV <de...@cassandra.apache.org>
> > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > If you want to impose your views on CEP structure on others, please
> > > file a CEP with the additional restrictions and guidance you want to
> > > impose and start a discussion thread. I can then respond in detail to
> > > why I perceive this approach to be flawed, in a dedicated context.
> >
> > This sounds very Kafkaesque. You know I won't file a meta-CEP to change
> > the structure of CEPs, so you're just using this as an excuse to shut
> > down the discussion on the lack of clarity about what actual tangible
> > feature will be available once the CEP lands. :-)
> >
> > I'm just providing my humble feedback on how a CEP could be more
> > digestible and easier to consume from an external point of view, and this
> > seems like an appropriate and contextualized place to voice this opinion,
> > which is perhaps shared by others.
> >
> > Em sex., 1 de out. de 2021 às 10:55, benedict@apache.org <
> > benedict@apache.org> escreveu:
> >
> > > I disagree with you. However, this is the wrong forum to have a meta
> > > discussion about how CEPs should be structured.
> > >
> > > If you want to impose your views on CEP structure on others, please
> > > file a CEP with the additional restrictions and guidance you want to
> > > impose and start a discussion thread. I can then respond in detail to
> > > why I perceive this approach to be flawed, in a dedicated context.
> > >
> > >
> > > From: Paulo Motta <pa...@gmail.com>
> > > Date: Friday, 1 October 2021 at 14:48
> > > To: Cassandra DEV <de...@cassandra.apache.org>
> > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > The proposal as it stands today is exceptionally thorough, more so
> > > > than any other CEP to date, or any CEP is likely to be in the near
> > > > future.
> > >
> > > The protocol is thoroughly described, but in my view a CEP is a forum to
> > > discuss the high-level architecture and plan for adding a full
> > > end-to-end enhancement to the database, breaking it into sub-CEPs if
> > > needed, as long as the full plan is known in advance. Otherwise the
> > > community will not have the context to judge the full extent and impact
> > > of the proposed enhancement.
> > >
> > > > Since it remains unclear to me what either yourself or Jonathan want
> > > > to see as an alternative
> > >
> > > I would personally like to see something along these lines:
> > >
> > > CEP1: Add ACID-compliant atomic batches
> > > - UX changes needed: none, CQL provides the grammar we need.
> > > - Distributed transaction protocol needed: Accord (link to white paper
> > > if you want specific details about the protocol)
> > > - High-level architecture: what new components will be added, how
> > > existing components will be modified, what new messages will be added,
> > > what new configuration knobs will be introduced, what are the milestones
> > > of the project, etc.
> > >
> > > CEP2: Make LWT faster and more reliable
> > > - UX changes needed: none
> > > - Distributed transaction protocol needed: Accord, already added by
> > > previous CEP.
> > > - High-level architecture: blablabla... and so on.
> > >
> > > Em sex., 1 de out. de 2021 às 10:19, benedict@apache.org <
> > > benedict@apache.org> escreveu:
> > >
> > > > I think this is getting circular and unproductive. Basic disagreements
> > > > about whether the CEP specifies a feature I am inclined to leave for a
> > > > vote. In my view the CEP specifies several features, both immediate
> > > > ones for the user (ACID batches and multi-key LWTs) and
> > > > developer-focused ones around ground-breaking semantics that will be
> > > > enabled.
> > > >
> > > > The proposal as it stands today is exceptionally thorough, more so
> > > > than any other CEP to date, or any CEP is likely to be in the near
> > > > future.
> > > >
> > > > This is a Cassandra Enhancement *Proposal*, and at some point we have
> > > > to engage with what is proposed, not what you might like to be
> > > > proposed. Since it remains unclear to me what either yourself or
> > > > Jonathan want to see as an alternative, at this point it would seem
> > > > more productive to produce your own proposals for the community to
> > > > consider. It is possible for multiple transaction systems to co-exist,
> > > > if you feel this is necessary.
> > > >
> > > >
> > > >
> > > > From: Paulo Motta <pa...@gmail.com>
> > > > Date: Friday, 1 October 2021 at 13:58
> > > > To: Cassandra DEV <de...@cassandra.apache.org>
> > > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > I share similar feelings as jbellis that this proposal seems to be
> > > > focusing on the protocol itself but lacking the actual feature that
> > > > will use the protocol, which IMO is a key element to discuss in a CEP.
> > > >
> > > > It's similar to saying: hey, I want to add this Tries Serialization
> > > > Protocol to Cassandra, but not providing specific details of how this
> > > > protocol is going to be used.
> > > >
> > > > I think the right route for a CEP is to describe the feature that will
> > > > be added to the database and the protocol is a mere requirement of the
> > > > high-level feature, for example:
> > > >
> > > > CEP: Add Trie-backed memtable
> > > > - Trie Serialization Protocol: implementation detail of the above CEP
> > > >
> > > > What is the difficulty of taking this approach, picking one of the
> > > > myriad of features that will be enabled by Accord and using that as
> > > > the initial CEP to introduce the protocol to the database?
> > > >
> > > > Em sex., 1 de out. de 2021 às 08:37, benedict@apache.org <
> > > > benedict@apache.org> escreveu:
> > > >
> > > > > > Actually, thinking about it again, the simple optimistic protocol
> > > > > > would in fact guarantee system forward progress (i.e. independent
> > > > > > of transaction formulation).
> > > > >
> > > > >
> > > > > From: benedict@apache.org <be...@apache.org>
> > > > > Date: Friday, 1 October 2021 at 09:14
> > > > > To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> > > > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > Hi Jonathan,
> > > > >
> > > > > > It would be great if we could achieve a bandwidth higher than 1-2
> > > > > > short emails per week. It remains unclear to me what your goal is,
> > > > > > and it would help if you could make a statement like “I want
> > > > > > Cassandra to be able to do X” so that we can respond directly to
> > > > > > it. I am also available to have another call, in which we can have
> > > > > > a back and forth; please feel free to propose a London-compatible
> > > > > > time within the next week that is suitable for you.
> > > > > >
> > > > > > In my opinion we are at risk of veering off-topic, though. This
> > > > > > CEP is not to deliver interactive transactions, and to my
> > > > > > knowledge nobody is proposing a CEP for interactive transactions.
> > > > > > So, for the CEP at hand the salient question seems: does this CEP
> > > > > > prevent us from implementing interactive transactions with
> > > > > > properties X, Y, Z in future? To which the answer is almost
> > > > > > certainly no.
> > > > > >
> > > > > > However, to continue the discussion and respond directly to your
> > > > > > queries, I believe we agree on the definition of an interactive
> > > > > > transaction.
> > > > > >
> > > > > > Two protocols were loosely outlined. The first, using timestamps
> > > > > > for optimistic concurrency control, would indeed involve the
> > > > > > possibility of aborts. It would not however inherently adopt the
> > > > > > issue of LWTs where no transaction is able to make progress.
> > > > > > Whether or not progress is guaranteed (in a livelock-free sense)
> > > > > > would depend on the structure of the transactions that were
> > > > > > interfering.
> > > > > >
> > > > > > This approach has the advantage of being very simple to implement,
> > > > > > so that we could realistically support interactive transactions
> > > > > > quite quickly. It has the additional advantage that transactions
> > > > > > would execute very quickly by avoiding the WAN during
> > > > > > construction, and as a result may in practice experience fewer
> > > > > > aborts than protocols that guarantee livelock-freedom.
> > > > > >
> > > > > > The second protocol proposed using read/write intents and would be
> > > > > > able to support almost any behaviour you want. We could even
> > > > > > utilise pessimistic concurrency control, or anything in-between.
> > > > > > This is its own huge design space, and discussion of this approach
> > > > > > and the trade-offs that could be made is (in my opinion) entirely
> > > > > > out of scope for this CEP.
> > > > >
> > > > >
> > > > > From: Jonathan Ellis <jb...@gmail.com>
> > > > > Date: Friday, 1 October 2021 at 05:00
> > > > > To: dev <de...@cassandra.apache.org>
> > > > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > > The obstacle for me is you've provided a protocol but not a fully
> > > > > > fleshed out architecture, so it's hard to fill in some of the
> > > > > > blanks.  But it looks to me like optimistic concurrency control
> > > > > > for interactive transactions applied to Accord would leave you in
> > > > > > a LWT-like situation under fairly light contention where nobody
> > > > > > actually makes progress due to retries.
> > > > > >
> > > > > > To make sure we're talking about the same thing, as Henrik pointed
> > > > > > out, interactive transactions mean multiple round trips from the
> > > > > > client within a transaction.  For example, here
> > > > > > <https://github.com/apavlo/py-tpcc/blob/master/pytpcc/drivers/sqlitedriver.py#L213>
> > > > > > is a simple implementation of the TPC-C New Order transaction.  The
> > > > > > high level logic (via
> > > > > > <https://courses.cs.washington.edu/courses/csep545/01wi/lectures/class1/tsld039.htm>)
> > > > > > is,
> > > > > >
> > > > > >    1. Get records describing a warehouse, customer, & district
> > > > > >    2. Update the district
> > > > > >    3. Increment next available order number
> > > > > >    4. Insert record into Order and New-Order tables
> > > > > >    5. For 5-15 items, get Item record, get/update Stock record
> > > > > >    6. Insert Order-Line Record
> > > > > >
> > > > > > As you can see, this requires a lot of client-side logic mixed in
> > > > > > with the actual SQL commands.
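
A heavily simplified sketch of what "client-side logic mixed in with the
SQL" looks like, using Python's sqlite3 module. This is not the py-tpcc
code linked above: the four-table schema, the `new_order` function and
the restock rule (in the style of TPC-C) are assumptions made for
illustration. Each `execute` stands for one client round trip, and the
values read in one step decide what is written in the next.

```python
import sqlite3

def new_order(conn, d_id, items):
    """items: list of (item_id, quantity). Returns the allocated order id."""
    cur = conn.cursor()
    cur.execute("BEGIN")
    try:
        # Read the next order number, then write back the incremented value.
        (o_id,) = cur.execute(
            "SELECT next_o_id FROM district WHERE d_id = ?", (d_id,)
        ).fetchone()
        cur.execute(
            "UPDATE district SET next_o_id = ? WHERE d_id = ?", (o_id + 1, d_id)
        )
        cur.execute("INSERT INTO orders (o_id, d_id) VALUES (?, ?)", (o_id, d_id))
        for i_id, qty in items:
            (stock,) = cur.execute(
                "SELECT quantity FROM stock WHERE i_id = ?", (i_id,)
            ).fetchone()
            # Client-side decision based on a value read inside the transaction.
            remaining = stock - qty
            if remaining < 10:
                remaining += 91
            cur.execute(
                "UPDATE stock SET quantity = ? WHERE i_id = ?", (remaining, i_id)
            )
            cur.execute(
                "INSERT INTO order_line (o_id, i_id, qty) VALUES (?, ?, ?)",
                (o_id, i_id, qty),
            )
        cur.execute("COMMIT")
        return o_id
    except Exception:
        cur.execute("ROLLBACK")
        raise

# isolation_level=None lets us issue BEGIN/COMMIT ourselves.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.executescript("""
CREATE TABLE district (d_id INTEGER PRIMARY KEY, next_o_id INTEGER);
CREATE TABLE stock (i_id INTEGER PRIMARY KEY, quantity INTEGER);
CREATE TABLE orders (o_id INTEGER, d_id INTEGER);
CREATE TABLE order_line (o_id INTEGER, i_id INTEGER, qty INTEGER);
INSERT INTO district VALUES (1, 3001);
INSERT INTO stock VALUES (10, 50), (11, 12);
""")
order_id = new_order(conn, 1, [(10, 5), (11, 5)])
print(order_id)
```
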
> > > > >
> > > > >
> > > > > > On Thu, Sep 30, 2021 at 2:30 AM benedict@apache.org
> > > > > > <benedict@apache.org> wrote:
> > > > > >
> > > > > > > Essentially this, although I think in practice we will need to
> > > > > > > track each partition’s timestamp separately (or optionally for
> > > > > > > reduced conflicts, each row or datum’s), and make them all part
> > > > > > > of the conditional application of the transaction - at least for
> > > > > > > strict-serializability.
> > > > > > >
> > > > > > > The alternative is to insert read/write intents for the
> > > > > > > transaction during each step, and to confirm they are still valid
> > > > > > > on commit, but this approach would require a WAN round-trip for
> > > > > > > each step in the interactive transaction, whereas the
> > > > > > > timestamp-validating approach can use a LAN round-trip for each
> > > > > > > step besides the final one, and is also much simpler to
> > > > > > > implement.
> > > > > >
> > > > > >
> > > > > > From: Blake Eggleston <be...@apple.com.INVALID>
> > > > > > Date: Thursday, 30 September 2021 at 05:47
> > > > > > To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> > > > > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > > > You could establish a lower timestamp bound and buffer
> > > > > > > transaction state on the coordinator, then make the commit an
> > > > > > > operation that only applies if all partitions involved haven’t
> > > > > > > been changed by a more recent timestamp. You could also implement
> > > > > > > MVCC either in the storage layer or for some period of time by
> > > > > > > buffering commits on each replica before applying.
> > > > > >
> > > > > > > On Sep 29, 2021, at 6:18 PM, Jonathan Ellis <jbellis@gmail.com
> >
> > > > wrote:
> > > > > > >
> > > > > > > How are interactive transactions possible with Accord?
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Tue, Sep 21, 2021 at 11:56 PM benedict@apache.org <
> > > > > > benedict@apache.org>
> > > > > > > wrote:
> > > > > > >
> > > > > > >> Could you explain why you believe this trade-off is necessary?
> > We
> > > > can
> > > > > > >> support full SQL just fine with Accord, and I hope that we
> > > > eventually
> > > > > > do so.
> > > > > > >>
> > > > > > >> This domain is incredibly complex, so it is easy to reach
> wrong
> > > > > > >> conclusions. I would invite you again to propose a system for
> > > > > discussion
> > > > > > >> that you think offers something Accord is unable to, and that
> > you
> > > > > > consider
> > > > > > >> desirable, and we can work from there.
> > > > > > >>
> > > > > > > >> To pre-empt some possible discussions, I am not aware of
> > > > > > > >> anything we cannot do with Accord that we could do with either
> > > > > > > >> Calvin or Spanner. Interactive transactions are possible on top
> > > > > > > >> of Accord, as are transactions with an unknown read/write set.
> > > > > > > >> In each case the only cost is that they would use optimistic
> > > > > > > >> concurrency control, which is no worse than the Spanner
> > > > > > > >> derivatives anyway (which I have to assume is your benchmark in
> > > > > > > >> this regard). I do not expect to deliver either functionality
> > > > > > > >> initially, but Accord takes us most of the way there for both.
> > > > > > >>
> > > > > > >>
> > > > > > >> From: Jonathan Ellis <jb...@gmail.com>
> > > > > > >> Date: Wednesday, 22 September 2021 at 05:36
> > > > > > >> To: dev <de...@cassandra.apache.org>
> > > > > > >> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > > > >> Right, I'm looking for exactly a discussion on the high level
> > > > > > > >> goals.  Instead of saying "here's the goals and we ruled out X
> > > > > > > >> because Y" we should start with a discussion around, "Approach
> > > > > > > >> A allows X and W, approach B allows Y and Z" and decide together
> > > > > > > >> what the goals should be and what we are willing to trade to
> > > > > > > >> get those goals, e.g., are we willing to give up global strict
> > > > > > > >> serializability to get the ability to support full SQL.  Both of
> > > > > > > >> these are nice to have!
> > > > > > >>
> > > > > > >> On Tue, Sep 21, 2021 at 9:52 PM benedict@apache.org <
> > > > > > benedict@apache.org>
> > > > > > >> wrote:
> > > > > > >>
> > > > > > >>> Hi Jonathan,
> > > > > > >>>
> > > > > > >>> These other systems are incompatible with the goals of the
> > CEP. I
> > > > do
> > > > > > >>> discuss them (besides 2PC) in both the whitepaper and the
> CEP,
> > > and
> > > > > will
> > > > > > >>> summarise that discussion below. A true and accurate
> comparison
> > > of
> > > > > > these
> > > > > > >>> other systems is essentially intractable, as there are
> complex
> > > > > > subtleties
> > > > > > >>> to each flavour, and those who are interested would be better
> > > > served
> > > > > by
> > > > > > >>> performing their own research.
> > > > > > >>>
> > > > > > >>> I think it is more productive to focus on what we want to
> > achieve
> > > > as
> > > > > a
> > > > > > >>> community. If you believe the goals of this CEP are wrong for
> > the
> > > > > > >> project,
> > > > > > >>> let’s focus on that. If you want to compare and contrast
> > specific
> > > > > > facets
> > > > > > >> of
> > > > > > >>> alternative systems that you consider to be preferable in
> some
> > > > > > dimension,
> > > > > > >>> let’s do that here or in a Q&A as proposed by Joey.
> > > > > > >>>
> > > > > > >>> The relevant goals are that we:
> > > > > > >>>
> > > > > > >>>
> > > > > > >>>  1.  Guarantee strict serializable isolation on commodity
> > > hardware
> > > > > > >>>  2.  Scale to any cluster size
> > > > > > >>>  3.  Achieve optimal latency
> > > > > > >>>
> > > > > > >>> The approach taken by Spanner derivatives is rejected by (1)
> > > > because
> > > > > > they
> > > > > > >>> guarantee only Serializable isolation (they additionally fail
> > > (3)).
> > > > > > From
> > > > > > >>> watching talks by YugaByte, and inferring from Cockroach’s
> > > > > > >>> panic-cluster-death under clock skew, this is clearly
> > considered
> > > by
> > > > > > >>> everyone to be undesirable but necessary to achieve
> > scalability.
> > > > > > >>>
> > > > > > >>> The approach taken by FaunaDB (Calvin) is rejected by (2)
> > because
> > > > its
> > > > > > >>> sequencing layer requires a global leader process for the
> > > cluster,
> > > > > > which
> > > > > > >> is
> > > > > > >>> incompatible with Cassandra’s scalability requirements. It
> > > > > additionally
> > > > > > >>> fails (3) for global clients.
> > > > > > >>>
> > > > > > >>> Two phase commit fails (3). As an aside, AFAICT DynamoDB is
> > > today a
> > > > > > >>> Spanner clone for its multi-key transaction functionality,
> not
> > > 2PC.
> > > > > > >>>
> > > > > > >>> Systems such as RAMP with even weaker isolation are not
> > > considered
> > > > > for
> > > > > > >> the
> > > > > > >>> simple reason that they do not even claim to meet (1).
> > > > > > >>>
> > > > > > >>> If we want to additionally offer weaker isolation levels than
> > > > > > >>> Serializable, such as that provided by the recent RAMP-TAO
> > paper,
> > > > > > >> Cassandra
> > > > > > >>> is likely able to support multiple distinct transaction
> layers
> > > that
> > > > > > >> operate
> > > > > > >>> independently. I would encourage you to file a CEP to explore
> > how
> > > > we
> > > > > > can
> > > > > > >>> meet these distinct use cases, but I consider them to be
> > niche. I
> > > > > > expect
> > > > > > >>> that a majority of our user base desire strict serializable
> > > > > isolation,
> > > > > > >> and
> > > > > > >>> certainly no less than serializable isolation, to augment the
> > > > > existing
> > > > > > >>> weaker isolation offered by quorum reads and writes.
> > > > > > >>>
> > > > > > > >>> I would tangentially note that we are not an AP database under
> > > > > > > >>> normal recommended operation. A minority in any network
> > > > > > > >>> partition cannot reach QUORUM, so under recommended usage we are
> > > > > > > >>> a high-availability leaderless CP database.
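
A small sketch of the quorum arithmetic behind the point above. The
helper names are made up for illustration; the formula is the usual
majority rule, floor(RF / 2) + 1.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that must respond for a QUORUM operation."""
    return replication_factor // 2 + 1


def can_reach_quorum(reachable_replicas: int, replication_factor: int) -> bool:
    return reachable_replicas >= quorum(replication_factor)


# RF=3 split 2|1 and RF=5 split 3|2: only the majority side can proceed.
for rf, majority, minority in [(3, 2, 1), (5, 3, 2)]:
    print(
        rf,
        quorum(rf),
        can_reach_quorum(majority, rf),
        can_reach_quorum(minority, rf),
    )
```
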
> > > > > > >>>
> > > > > > >>>
> > > > > > >>> From: Jonathan Ellis <jb...@gmail.com>
> > > > > > >>> Date: Tuesday, 21 September 2021 at 23:45
> > > > > > >>> To: dev <de...@cassandra.apache.org>
> > > > > > >>> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > > >>> Benedict, thanks for taking the lead in putting this
> together.
> > > > Since
> > > > > > >>> Cassandra is the only relevant database today designed
> around a
> > > > > > >> leaderless
> > > > > > >>> architecture, it's quite likely that we'll be better served
> > with
> > > a
> > > > > > custom
> > > > > > >>> transaction design instead of trying to retrofit one from CP
> > > > systems.
> > > > > > >>>
> > > > > > >>> The whitepaper here is a good description of the consensus
> > > > algorithm
> > > > > > >> itself
> > > > > > >>> as well as its robustness and stability characteristics, and
> > its
> > > > > > >> comparison
> > > > > > >>> with other state-of-the-art consensus algorithms is very
> > useful.
> > > > In
> > > > > > the
> > > > > > >>> context of Cassandra, where a consensus algorithm is only
> part
> > of
> > > > > what
> > > > > > >> will
> > > > > > >>> be implemented, I'd like to see a more complete evaluation of
> > the
> > > > > > >>> transactional side of things as well, including performance
> > > > > > >> characteristics
> > > > > > >>> as well as the types of transactions that can be supported
> and
> > at
> > > > > > least a
> > > > > > >>> general idea of what it would look like applied to Cassandra.
> > > This
> > > > > will
> > > > > > >>> allow the PMC to make a more informed decision about what
> > > tradeoffs
> > > > > are
> > > > > > >>> best for the entire long-term project of first supplementing
> > and
> > > > > > >> ultimately
> > > > > > >>> replacing LWT.
> > > > > > >>>
> > > > > > >>> (Allowing users to mix LWT and AP Cassandra operations
> against
> > > the
> > > > > same
> > > > > > >>> rows was probably a mistake, so in contrast with LWT we’re
> not
> > > > > looking
> > > > > > >> for
> > > > > > >>> something fast enough for occasional use but rather something
> > > > within
> > > > > a
> > > > > > >>> reasonable factor of AP operations, appropriate to being the
> > only
> > > > way
> > > > > > to
> > > > > > >>> interact with tables declared as such.)
> > > > > > >>>
> > > > > > >>> Besides Accord, this should cover
> > > > > > >>>
> > > > > > >>> - Calvin and FaunaDB
> > > > > > >>> - A Spanner derivative (no opinion on whether that should be
> > > > > Cockroach
> > > > > > or
> > > > > > >>> Yugabyte, I don’t think it’s necessary to cover both)
> > > > > > >>> - A 2PC implementation (the Accord paper mentions DynamoDB
> but
> > I
> > > > > > suspect
> > > > > > >>> there is more public information about MongoDB)
> > > > > > >>> - RAMP
> > > > > > >>>
> > > > > > >>> Here’s an example of what I mean:
> > > > > > >>>
> > > > > > >>> =Calvin=
> > > > > > >>>
> > > > > > >>> Approach: global consensus (Paxos in Calvin, Raft in FaunaDB)
> > to
> > > > > order
> > > > > > >>> transactions, then replicas execute the transactions
> > > independently
> > > > > with
> > > > > > >> no
> > > > > > >>> further coordination.  No SPOF.  Transactions are batched by
> > each
> > > > > > >> sequencer
> > > > > > >>> to keep this from becoming a bottleneck.
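
As a rough illustration of the approach described above (not Calvin's or FaunaDB's actual code), the sketch below uses a hypothetical `sequence` function as a stand-in for the consensus step that fixes one global order. Two replicas then apply the same ordered log independently and arrive at the same state. All names here are assumptions made for the example.

```python
from typing import Callable, Dict, List

State = Dict[str, int]
Txn = Callable[[State], None]  # a deterministic function of the current state


def sequence(batches: List[List[Txn]]) -> List[Txn]:
    """Stand-in for global consensus: flatten sequencer batches into one agreed order."""
    return [txn for batch in batches for txn in batch]


def execute(log: List[Txn]) -> State:
    """A replica applies the agreed log in order, with no further coordination."""
    state: State = {}
    for txn in log:
        txn(state)
    return state


def deposit(key: str, amount: int) -> Txn:
    def run(state: State) -> None:
        state[key] = state.get(key, 0) + amount
    return run


def double(key: str) -> Txn:
    def run(state: State) -> None:
        state[key] = state.get(key, 0) * 2
    return run


# Two sequencer batches; the order only matters because it is the same everywhere.
log = sequence([[deposit("a", 5), double("a")], [deposit("a", 1)]])
replica_1 = execute(log)
replica_2 = execute(log)
```

Because every transaction is a deterministic function of the state and the order is fixed up front, the replicas never need to talk to each other during execution.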
> > > > > > >>>
> > > > > > >>> Performance: Calvin paper (published 2012) reports linear
> > scaling
> > > > of
> > > > > > >> TPC-C
> > > > > > >>> New Order up to 500,000 transactions/s on 100 machines (EC2
> XL
> > > > > machines
> > > > > > >>> with 7GB ram and 8 virtual cores).  Note that TPC-C New Order
> > is
> > > > > > composed
> > > > > > >>> of four reads and four writes, so this is effectively 2M
> reads
> > > and
> > > > 2M
> > > > > > >>> writes as we normally measure them in C*.
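
The conversion above is simple arithmetic, spelled out here for clarity. The inputs are the figures quoted in the message; the variable names and the per-machine figure are only for illustration.

```python
txns_per_second = 500_000  # reported TPC-C New Order throughput on 100 machines
machines = 100
reads_per_txn = 4          # per the description of New Order above
writes_per_txn = 4

reads_per_second = txns_per_second * reads_per_txn
writes_per_second = txns_per_second * writes_per_txn
txns_per_machine = txns_per_second // machines
```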
> > > > > > >>>
> > > > > > >>> Calvin supports mixed read/write transactions, but because
> the
> > > > > > >> transaction
> > > > > > >>> execution logic requires knowing all partition keys in
> advance
> > to
> > > > > > ensure
> > > > > > >>> that all replicas can reproduce the same results with no
> > > > > coordination,
> > > > > > >>> reads against non-PK predicates must be done ahead of time
> > > > > > >> (transparently,
> > > > > > >>> by the server) to determine the set of keys, and this must be
> > > > retried
> > > > > > if
> > > > > > >>> the set of rows affected is updated before the actual
> > transaction
> > > > > > >> executes.
> > > > > > >>>
> > > > > > > >>> Batching and global consensus add latency -- 100ms in the
> > Calvin
> > > > > paper
> > > > > > >> and
> > > > > > >>> apparently about 50ms in FaunaDB.  Glass half full: all
> > > > transactions
> > > > > > >>> (including multi-partition updates) are equally performant in
> > > > Calvin
> > > > > > >> since
> > > > > > >>> the coordination is handled up front in the sequencing step.
> > > Glass
> > > > > > half
> > > > > > >>> empty: even single-row reads and writes have to pay the full
> > > > > > coordination
> > > > > > >>> cost.  Fauna has optimized this away for reads but I am not
> > aware
> > > > of
> > > > > a
> > > > > > >>> description of how they changed the design to allow this.
> > > > > > >>>
> > > > > > >>> Functionality and limitations: since the entire transaction
> > must
> > > be
> > > > > > known
> > > > > > >>> in advance to allow coordination-less execution at the
> > replicas,
> > > > > Calvin
> > > > > > >>> cannot support interactive transactions at all.  FaunaDB
> > > mitigates
> > > > > this
> > > > > > >> by
> > > > > > >>> allowing server-side logic to be included, but a Calvin
> > approach
> > > > will
> > > > > > >> never
> > > > > > >>> be able to offer SQL compatibility.
> > > > > > >>>
> > > > > > >>> Guarantees: Calvin transactions are strictly serializable.
> > There
> > > > is
> > > > > no
> > > > > > >>> additional complexity or performance hit to generalizing to
> > > > multiple
> > > > > > >>> regions, apart from the speed of light.  And since Calvin is
> > > > already
> > > > > > >> paying
> > > > > > >>> a batching latency penalty, this is less painful than for
> other
> > > > > > systems.
> > > > > > >>>
> > > > > > >>> Application to Cassandra: B-.  Distributed transactions are
> > > handled
> > > > > by
> > > > > > >> the
> > > > > > >>> sequencing and scheduling layers, which are leaderless, and
> > > > Calvin’s
> > > > > > >>> requirements for the storage layer are easily met by C*.  But
> > > > Calvin
> > > > > > also
> > > > > > >>> requires a global consensus protocol and LWT is almost
> > certainly
> > > > not
> > > > > > >>> sufficiently performant, so this would require ZK or etcd
> > > > (reasonable
> > > > > > >> for a
> > > > > > >>> library approach but not for replacing LWT in C* itself), or
> an
> > > > > > >>> implementation of Accord.  I don’t believe Calvin would
> require
> > > > > > >> additional
> > > > > > >>> table-level metadata in Cassandra.
> > > > > > >>>
> > > > > > >>> On Sun, Sep 5, 2021 at 9:33 AM benedict@apache.org <
> > > > > > benedict@apache.org>
> > > > > > >>> wrote:
> > > > > > >>>
> > > > > > >>>> Wiki:
> > > > > > >>>>
> > > > > > >>>
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-15%3A+General+Purpose+Transactions
> > > > > > >>>> Whitepaper:
> > > > > > >>>>
> > > > > > >>>
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf
> > > > > > >>>> <
> > > > > > >>>>
> > > > > > >>>
> > > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf?version=1&modificationDate=1630847736966&api=v2
> > > > > > >>>>>
> > > > > > >>>> Prototype: https://github.com/belliottsmith/accord
> > > > > > >>>>
> > > > > > >>>> Hi everyone, I’d like to propose this CEP for adoption by
> the
> > > > > > >> community.
> > > > > > >>>>
> > > > > > >>>> Cassandra has benefitted from LWTs for many years, but
> > > application
> > > > > > >>>> developers that want to ensure consistency for complex
> > > operations
> > > > > must
> > > > > > >>>> either accept the scalability bottleneck of serializing all
> > > > related
> > > > > > >> state
> > > > > > >>>> through a single partition, or layer a complex state machine
> > on
> > > > top
> > > > > of
> > > > > > >>> the
> > > > > > >>>> database. These are sophisticated and costly activities that
> > our
> > > > > users
> > > > > > >>>> should not be expected to undertake. Since distributed
> > databases
> > > > are
> > > > > > >>>> beginning to offer distributed transactions with fewer
> > caveats,
> > > it
> > > > > is
> > > > > > >>> past
> > > > > > >>>> time for Cassandra to do so as well.
> > > > > > >>>>
> > > > > > >>>> This CEP proposes the use of several novel techniques that
> > build
> > > > > upon
> > > > > > >>>> research (that followed EPaxos) to deliver (non-interactive)
> > > > general
> > > > > > >>>> purpose distributed transactions. The approach is outlined
> in
> > > the
> > > > > > >>> wikipage
> > > > > > >>>> and in more detail in the linked whitepaper. Importantly, by
> > > > > adopting
> > > > > > >>> this
> > > > > > >>>> approach we will be the _only_ distributed database to offer
> > > > global,
> > > > > > >>>> scalable, strict serializable transactions in one wide area
> > > > > > round-trip.
> > > > > > >>>> This would represent a significant improvement in the state
> of
> > > the
> > > > > > art,
> > > > > > >>>> both in the academic literature and in commercial or open
> > source
> > > > > > >>> offerings.
> > > > > > >>>>
> > > > > > >>>> This work has been partially realised in a prototype. This
> > > partial
> > > > > > >>>> prototype has been verified against Jepsen.io’s Maelstrom
> > > library
> > > > > and
> > > > > > >>>> dedicated in-tree strict serializability verification tools,
> > but
> > > > > much
> > > > > > >>> work
> > > > > > >>>> remains for the work to be production capable and integrated
> > > into
> > > > > > >>> Cassandra.
> > > > > > >>>>
> > > > > > >>>> I propose including the prototype in the project as a new
> > source
> > > > > > >>>> repository, to be developed as a standalone library for
> > > > integration
> > > > > > >> into
> > > > > > >>>> Cassandra. I hope the community sees the important value
> > > > proposition
> > > > > > of
> > > > > > >>>> this proposal, and will adopt the CEP after this discussion,
> > so
> > > > that
> > > > > > >> the
> > > > > > >>>> library and its integration into Cassandra can be developed
> in
> > > > > > parallel
> > > > > > >>> and
> > > > > > >>>> with the involvement of the wider community.
> > > > > > >>>>
> > > > > > >>>
> > > > > > >>>
> > > > > > >>> --
> > > > > > >>> Jonathan Ellis
> > > > > > >>> co-founder, http://www.datastax.com
> > > > > > >>> @spyced
> > > > > > >>>
> > > > > > >>
> > > > > > >>
> > > > > > >> --
> > > > > > >> Jonathan Ellis
> > > > > > >> co-founder, http://www.datastax.com
> > > > > > >> @spyced
> > > > > > >>
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Jonathan Ellis
> > > > > > > co-founder, http://www.datastax.com
> > > > > > > @spyced
> > > > > >
> > > > > >
> > > > > >
> > ---------------------------------------------------------------------
> > > > > > To unsubscribe, e-mail: dev-unsubscribe@cassandra.apache.org
> > > > > > For additional commands, e-mail: dev-help@cassandra.apache.org
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Jonathan Ellis
> > > > > co-founder, http://www.datastax.com
> > > > > @spyced
> > > > >
> > > >
> > >
> >
>


-- 
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by Paulo Motta <pa...@gmail.com>.
> With respect to this, in my view this kind of detail is not warranted
within a CEP. Software development is an exploratory process with respect
to structure, and these decisions will be made as the CEP progresses. If
these need to be specified upfront, then the purpose of a CEP – seeking buy
in – is invalidated, because the work must be complete before you know the
answers.

These need not be set in stone; they're just a rough sketch of what the
end product will look like, to make it easier to build a mental model of the
project, especially for those not directly involved with it, and to
guide its development for those who are. At least for me, it's much easier
to visualize a project top-down (from how it's going to be used to its
particular implementation details) than the other way around.

On Fri, Oct 1, 2021 at 11:33, benedict@apache.org <
benedict@apache.org> wrote:

> > The current document describes the protocol thoroughly but, in my view,
> fails to illustrate what specific APIs, methods, and modules will become
> available to developers
>
> With respect to this, in my view this kind of detail is not warranted
> within a CEP. Software development is an exploratory process with respect
> to structure, and these decisions will be made as the CEP progresses. If
> these need to be specified upfront, then the purpose of a CEP – seeking buy
> in – is invalidated, because the work must be complete before you know the
> answers.
>
>
> From: benedict@apache.org <be...@apache.org>
> Date: Friday, 1 October 2021 at 15:31
> To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> From the CEP:
>
> Batches (including unconditional batches) on transactional tables will
> receive ACID properties, and grammatically correct conditional batch
> operations that would be rejected for operating over multiple CQL
> partitions will now be supported
>
>
> From: Paulo Motta <pa...@gmail.com>
> Date: Friday, 1 October 2021 at 15:30
> To: Cassandra DEV <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> Can you just say what palpable feature will be available once this CEP
> lands? This is still not clear to me (and perhaps to others) from
> the current CEP structure. The current document describes the protocol
> thoroughly but, in my view, fails to illustrate what specific APIs, methods,
> and modules will become available to developers, how it fits into the larger
> picture and interacts with existing modules, if at all, and perhaps a few
> examples of how it can be used to build features on top.
>
> On Fri, Oct 1, 2021 at 11:10, benedict@apache.org <
> benedict@apache.org> wrote:
>
> > I’m not, though it might seem that way. I disagree with your views about
> > how CEP should be structured. Since the CEP process was itself codified
> via
> > the CEP process, if you want to recodify how CEPs work, the correct way is
> > via the CEP process itself.
> >
> > The discussion is being drawn in multiple directions away from the CEP
> > itself, and I am trying to keep this particular thread focused on the
> > business at hand, not meta discussions around CEP structure that will no
> > doubt be unproductive given likely irreconcilable views about the topic,
> > nor discussions about other CEP that could have been.
> >
> > If you want to start a separate exploratory discussion thread about CEP
> > structure without filing a CEP feel free to do so.
> >
> >
> > From: Paulo Motta <pa...@gmail.com>
> > Date: Friday, 1 October 2021 at 15:04
> > To: Cassandra DEV <de...@cassandra.apache.org>
> > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > If you want to impose your views on CEP structure on others, please
> file
> > a CEP with the additional restrictions and guidance you want to impose
> and
> > start a discussion thread. I can then respond in detail to why I perceive
> > this approach to be flawed, in a dedicated context.
> >
> > This sounds very kafkaesque. You know I won't file a meta-CEP to change
> the
> > structure of CEPs, so you're just using this as an excuse to shut down the
> > discussion on the lack of clarity on what actual palpable feature will be
> > available once the CEP lands. :-)
> >
> > I'm just providing my humble feedback on how a CEP could be more
> digestible
> > and easier to consume from an external point of view, and this seems like
> > an appropriate and contextualized place to voice this opinion which is
> > perhaps shared by others.
> >
> > On Fri, Oct 1, 2021 at 10:55, benedict@apache.org <
> > benedict@apache.org> wrote:
> >
> > > I disagree with you. However, this is the wrong forum to have a meta
> > > discussion about how CEP should be structured.
> > >
> > > If you want to impose your views on CEP structure on others, please
> file
> > a
> > > CEP with the additional restrictions and guidance you want to impose
> and
> > > start a discussion thread. I can then respond in detail to why I
> perceive
> > > this approach to be flawed, in a dedicated context.
> > >
> > >
> > > From: Paulo Motta <pa...@gmail.com>
> > > Date: Friday, 1 October 2021 at 14:48
> > > To: Cassandra DEV <de...@cassandra.apache.org>
> > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > >  The proposal as it stands today is exceptionally thorough, more so
> > than
> > > any other CEP to date, or any CEP is likely to be in the near future.
> > >
> > > The protocol is thoroughly described, but in my view CEP is a forum to
> > > discuss the high level architecture and plan for adding a full
> end-to-end
> > > enhancement to the database, breaking it into sub-CEPs if needed, as
> long
> > > as the full plan is known in advance, otherwise the community will not
> > have
> > > the context to judge the full extent and impact of the proposed
> > > enhancement.
> > >
> > > > Since it remains unclear to me what either yourself or Jonathan want
> to
> > > see as an alternative
> > >
> > > I would personally like to see something along these lines:
> > >
> > > CEP1: Add ACID-compliant atomic batches
> > > - UX changes needed: none, CQL provides the grammar we need.
> > > - Distributed transaction protocol needed: Accord (link to white paper
> if
> > > > you want specific details about the protocol)
> > > - High-level architecture: what new components will be added, how
> > existing
> > > components will be modified, what new messages will be added, what new
> > > configuration knobs will be introduced, what are the milestones of the
> > > project, etc.
> > >
> > > CEP2: Make LWT faster and more reliable
> > > - UX changes needed: none
> > > - Distributed transaction protocol needed: Accord, already added by
> > > previous CEP.
> > > - High-level architecture: blablabla... and so on.
> > >
> > > On Fri, Oct 1, 2021 at 10:19, benedict@apache.org <
> > > benedict@apache.org> wrote:
> > >
> > > > I think this is getting circular and unproductive. Basic
> disagreements
> > > > about whether the CEP specifies a feature I am inclined to leave for
> a
> > > > vote. In my view the CEP specifies several features, both immediate
> > ones
> > > > > for the user (ACID batches and multi-key LWTs) and developer-focused
> > ones
> > > > around ground-breaking semantics that will be enabled.
> > > >
> > > > The proposal as it stands today is exceptionally thorough, more so
> than
> > > > any other CEP to date, or any CEP is likely to be in the near future.
> > > >
> > > > This is a Cassandra Enhancement *Proposal*, and at some point we have
> > to
> > > > engage with what is proposed, not what you might like to be proposed.
> > > Since
> > > > it remains unclear to me what either yourself or Jonathan want to see
> > as
> > > an
> > > > alternative, at this point it would seem more productive to produce
> > your
> > > > own proposals for the community to consider. It is possible for
> > multiple
> > > > transaction systems to co-exist, if you feel this is necessary.
> > > >
> > > >
> > > >
> > > > From: Paulo Motta <pa...@gmail.com>
> > > > Date: Friday, 1 October 2021 at 13:58
> > > > To: Cassandra DEV <de...@cassandra.apache.org>
> > > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > I share similar feelings as jbellis that this proposal seems to be
> > > focusing
> > > > on the protocol itself but lacking the actual feature that will use
> the
> > > > > protocol, which IMO is a key element to discuss in a CEP.
> > > >
> > > > It's similar to saying: hey I want to add this Tries Serialization
> > > Protocol
> > > > to Cassandra, but not providing specific details of how this protocol
> > is
> > > > going to be used.
> > > >
> > > > I think the right route for a CEP is to describe the feature that
> will
> > be
> > > > added to the database and the protocol is a mere requirement of the
> > > > high-level feature, for example:
> > > >
> > > > CEP: Add Trie-backed memtable
> > > > - Trie Serialization Protocol: implementation detail of the above CEP
> > > >
> > > > What is the difficulty of taking this approach, picking one of the
> > myriad
> > > > of features that will be enabled by Accord and using that as the
> > initial
> > > > CEP to introduce the protocol to the database?
> > > >
> > > > On Fri, Oct 1, 2021 at 08:37, benedict@apache.org <
> > > > benedict@apache.org> wrote:
> > > >
> > > > > Actually, thinking about it again, the simple optimistic protocol
> > would
> > > > in
> > > > > fact guarantee system forward progress (i.e. independent of
> > transaction
> > > > > formulation).
> > > > >
> > > > >
> > > > > From: benedict@apache.org <be...@apache.org>
> > > > > Date: Friday, 1 October 2021 at 09:14
> > > > > To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> > > > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > Hi Jonathan,
> > > > >
> > > > > It would be great if we could achieve a bandwidth higher than 1-2
> > short
> > > > > emails per week. It remains unclear to me what your goal is, and it
> > > would
> > > > > help if you could make a statement like “I want Cassandra to be
> able
> > to
> > > > do
> > > > > X” so that we can respond directly to it. I am also available to
> have
> > > > > another call, in which we can have a back and forth, please feel
> free
> > > to
> > > > > propose a London-compatible time within the next week that is
> > suitable
> > > > for
> > > > > you.
> > > > >
> > > > > In my opinion we are at risk of veering off-topic, though. This CEP
> > is
> > > > not
> > > > > to deliver interactive transactions, and to my knowledge nobody is
> > > > > proposing a CEP for interactive transactions. So, for the CEP at
> hand
> > > the
> > > > > salient question seems: does this CEP prevent us from implementing
> > > > > interactive transactions with properties X, Y, Z in future? To
> which
> > > the
> > > > > answer is almost certainly no.
> > > > >
> > > > > However, to continue the discussion and respond directly to your
> > > queries,
> > > > > I believe we agree on the definition of an interactive transaction.
> > > > >
> > > > > Two protocols were loosely outlined. The first, using timestamps
> for
> > > > > optimistic concurrency control, would indeed involve the
> possibility
> > of
> > > > > aborts. It would not however inherently adopt the issue of LWTs
> where
> > > no
> > > > > transaction is able to make progress. Whether or not progress is
> > > > guaranteed
> > > > > (in a livelock-free sense) would depend on the structure of the
> > > > > transactions that were interfering.
> > > > >
> > > > > This approach has the advantage of being very simple to implement,
> so
> > > > that
> > > > > we could realistically support interactive transactions quite
> > quickly.
> > > It
> > > > > has the additional advantage that transactions would execute very
> > > quickly
> > > > > by avoiding the WAN during construction, and as a result may in
> > > practice
> > > > > experience fewer aborts than protocols that guarantee
> > livelock-freedom.
> > > > >
> > > > > The second protocol proposed using read/write intents and would be
> > able
> > > > to
> > > > > support almost any behaviour you want. We could even utilise
> > > pessimistic
> > > > > concurrency control, or anything in-between. This is its own huge
> > > design
> > > > > space, and discussion of this approach and the trade-offs that
> could
> > be
> > > > > made is (in my opinion) entirely out of scope for this CEP.
> > > > >
> > > > >
> > > > > From: Jonathan Ellis <jb...@gmail.com>
> > > > > Date: Friday, 1 October 2021 at 05:00
> > > > > To: dev <de...@cassandra.apache.org>
> > > > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > The obstacle for me is you've provided a protocol but not a fully
> > > fleshed
> > > > > out architecture, so it's hard to fill in some of the blanks.  But
> it
> > > > looks
> > > > > to me like optimistic concurrency control for interactive
> > transactions
> > > > > applied to Accord would leave you in a LWT-like situation under
> > fairly
> > > > > light contention where nobody actually makes progress due to
> retries.
> > > > >
> > > > > To make sure we're talking about the same thing, as Henrik pointed
> > out,
> > > > > interactive transactions mean multiple round trips from the client
> > > > within a
> > > > > transaction.  For example, here
> > > > > <
> > > > >
> > > >
> > >
> >
> https://github.com/apavlo/py-tpcc/blob/master/pytpcc/drivers/sqlitedriver.py#L213
> > > > > >
> > > > > is a simple implementation of the TPC-C New Order transaction.  The
> > > high
> > > > > level logic (via
> > > > > <
> > > > >
> > > >
> > >
> >
> https://courses.cs.washington.edu/courses/csep545/01wi/lectures/class1/tsld039.htm
> > > > > >)
> > > > > is,
> > > > >
> > > > >    1. Get records describing a warehouse, customer, & district
> > > > >    2. Update the district
> > > > >    3. Increment next available order number
> > > > >    4. Insert record into Order and New-Order tables
> > > > >    5. For 5-15 items, get Item record, get/update Stock record
> > > > >    6. Insert Order-Line Record
> > > > >
> > > > > As you can see, this requires a lot of client-side logic mixed in
> > with
> > > > the
> > > > > actual SQL commands.
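
To make the "multiple round trips" point concrete, here is a heavily simplified sketch of the New Order flow against a hypothetical in-memory session that just counts round trips. The `CountingSession` class, the key names, and the reduced set of steps are all assumptions for illustration; this is not the py-tpcc driver code.

```python
class CountingSession:
    """Hypothetical in-memory stand-in for a database session; counts round trips."""

    def __init__(self):
        self.rows = {"district:next_o_id": 1, "stock:widget": 10, "stock:gadget": 3}
        self.round_trips = 0

    def get(self, key):
        self.round_trips += 1
        return self.rows[key]

    def put(self, key, value):
        self.round_trips += 1
        self.rows[key] = value


def new_order(session, items):
    # Read the next order number, then write back the incremented value.
    order_id = session.get("district:next_o_id")
    session.put("district:next_o_id", order_id + 1)
    # Insert the order record.
    session.put(f"order:{order_id}", sorted(items))
    # Client-side loop: each item needs its own read before the dependent writes.
    for item, quantity in items.items():
        stock = session.get(f"stock:{item}")
        if stock < quantity:
            raise ValueError(f"insufficient stock for {item}")
        session.put(f"stock:{item}", stock - quantity)
        session.put(f"order_line:{order_id}:{item}", quantity)
    return order_id


session = CountingSession()
order_id = new_order(session, {"widget": 2, "gadget": 1})
```

Even this two-item order costs nine round trips, and the writes depend on values the client has just read, which is what makes the transaction interactive.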
> > > > >
> > > > >
> > > > > On Thu, Sep 30, 2021 at 2:30 AM benedict@apache.org <
> > > benedict@apache.org
> > > > >
> > > > > wrote:
> > > > >
> > > > > > Essentially this, although I think in practice we will need to
> > track
> > > > each
> > > > > > partition’s timestamp separately (or optionally for reduced
> > > conflicts,
> > > > > each
> > > > > > row or datum’s), and make them all part of the conditional
> > > application
> > > > of
> > > > > > the transaction - at least for strict-serializability.
> > > > > >
> > > > > > The alternative is to insert read/write intents for the
> transaction
> > > > > during
> > > > > > each step, and to confirm they are still valid on commit, but
> this
> > > > > approach
> > > > > > would require a WAN round-trip for each step in the interactive
> > > > > > transaction, whereas the timestamp-validating approach can use a
> > LAN
> > > > > > round-trip for each step besides the final one, and is also much
> > > > simpler
> > > > > to
> > > > > > implement.
> > > > > >
> > > > > >
> > > > > > From: Blake Eggleston <be...@apple.com.INVALID>
> > > > > > Date: Thursday, 30 September 2021 at 05:47
> > > > > > To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> > > > > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > > You could establish a lower timestamp bound and buffer
> transaction
> > > > state
> > > > > > on the coordinator, then make the commit an operation that only
> > > applies
> > > > > if
> > > > > > all partitions involved haven’t been changed by a more recent
> > > > timestamp.
> > > > > > You could also implement mvcc either in the storage layer or for
> > some
> > > > > > period of time by buffering commits on each replica before
> > applying.
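
A minimal single-process sketch of the first idea above, assuming a logical clock and one timestamp per partition: the transaction records a lower timestamp bound when it starts, buffers its writes, and the commit applies only if none of the partitions involved has a more recent write. The `Store` and `InteractiveTxn` names are hypothetical and this is not Accord's implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple


@dataclass
class Store:
    # partition key -> (value, timestamp of the last committed write)
    data: Dict[str, Tuple[int, int]] = field(default_factory=dict)
    clock: int = 0

    def read(self, key: str) -> Tuple[int, int]:
        return self.data.get(key, (0, 0))

    def commit_if_unchanged(self, start_ts: int, keys: Set[str],
                            writes: Dict[str, int]) -> bool:
        # Abort if any partition involved was written after the lower bound.
        if any(self.read(key)[1] > start_ts for key in keys):
            return False
        self.clock += 1
        for key, value in writes.items():
            self.data[key] = (value, self.clock)
        return True


class InteractiveTxn:
    def __init__(self, store: Store):
        self.store = store
        self.start_ts = store.clock       # lower timestamp bound
        self.keys: Set[str] = set()       # partitions read or written
        self.writes: Dict[str, int] = {}  # state buffered on the "coordinator"

    def get(self, key: str) -> int:
        self.keys.add(key)
        if key in self.writes:
            return self.writes[key]
        return self.store.read(key)[0]

    def put(self, key: str, value: int) -> None:
        self.keys.add(key)
        self.writes[key] = value

    def commit(self) -> bool:
        return self.store.commit_if_unchanged(self.start_ts, self.keys, self.writes)


store = Store()
t1, t2 = InteractiveTxn(store), InteractiveTxn(store)
t1.put("stock", t1.get("stock") + 1)
t2.put("stock", t2.get("stock") + 1)
first = t1.commit()    # succeeds: nothing changed since t1 started
second = t2.commit()   # aborts: "stock" was written after t2's lower bound

retry = InteractiveTxn(store)
retry.put("stock", retry.get("stock") + 1)
third = retry.commit()
```

The aborted transaction has to be retried by the client, which is the cost of optimistic concurrency control being discussed in this thread.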
> > > > > >
> > > > > > > On Sep 29, 2021, at 6:18 PM, Jonathan Ellis <jbellis@gmail.com
> >
> > > > wrote:
> > > > > > >
> > > > > > > How are interactive transactions possible with Accord?
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Tue, Sep 21, 2021 at 11:56 PM benedict@apache.org <
> > > > > > benedict@apache.org>
> > > > > > > wrote:
> > > > > > >
> > > > > > >> Could you explain why you believe this trade-off is necessary?
> > We
> > > > can
> > > > > > >> support full SQL just fine with Accord, and I hope that we
> > > > eventually
> > > > > > do so.
> > > > > > >>
> > > > > > >> This domain is incredibly complex, so it is easy to reach
> wrong
> > > > > > >> conclusions. I would invite you again to propose a system for
> > > > > discussion
> > > > > > >> that you think offers something Accord is unable to, and that
> > you
> > > > > > consider
> > > > > > >> desirable, and we can work from there.
> > > > > > >>
> > > > > > >> To pre-empt some possible discussions, I am not aware of
> > anything
> > > we
> > > > > > >> cannot do with Accord that we could do with either Calvin or
> > > > Spanner.
> > > > > > >> Interactive transactions are possible on top of Accord, as are
> > > > > > transactions
> > > > > > >> with an unknown read/write set. In each case the only cost is
> > that
> > > > > they
> > > > > > >> would use optimistic concurrency control, which is no worse
> than the
> > > > > spanner
> > > > > > >> derivatives anyway (which I have to assume is your benchmark
> in
> > > this
> > > > > > >> regard). I do not expect to deliver either functionality
> > > initially,
> > > > > but
> > > > > > >> Accord takes us most of the way there for both.
> > > > > > >>
> > > > > > >>
> > > > > > >> From: Jonathan Ellis <jb...@gmail.com>
> > > > > > >> Date: Wednesday, 22 September 2021 at 05:36
> > > > > > >> To: dev <de...@cassandra.apache.org>
> > > > > > >> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > > >> Right, I'm looking for exactly a discussion on the high level
> > > goals.
> > > > > > >> Instead of saying "here's the goals and we ruled out X because
> > Y"
> > > we
> > > > > > should
> > > > > > >> start with a discussion around, "Approach A allows X and W,
> > > > approach B
> > > > > > >> allows Y and Z" and decide together what the goals should be
> and
> and
> > > > > > >> we are willing to trade to get those goals, e.g., are we
> willing
> > > to
> > > > > > give up
> > > > > > >> global strict serializability to get the ability to support
> full
> > > > SQL.
> > > > > > Both
> > > > > > >> of these are nice to have!
> > > > > > >>
> > > > > > >> On Tue, Sep 21, 2021 at 9:52 PM benedict@apache.org <
> > > > > > benedict@apache.org>
> > > > > > >> wrote:
> > > > > > >>
> > > > > > >>> Hi Jonathan,
> > > > > > >>>
> > > > > > >>> These other systems are incompatible with the goals of the
> > CEP. I
> > > > do
> > > > > > >>> discuss them (besides 2PC) in both the whitepaper and the
> CEP,
> > > and
> > > > > will
> > > > > > >>> summarise that discussion below. A true and accurate
> comparison
> > > of
> > > > > > these
> > > > > > >>> other systems is essentially intractable, as there are
> complex
> > > > > > subtleties
> > > > > > >>> to each flavour, and those who are interested would be better
> > > > served
> > > > > by
> > > > > > >>> performing their own research.
> > > > > > >>>
> > > > > > >>> I think it is more productive to focus on what we want to
> > achieve
> > > > as
> > > > > a
> > > > > > >>> community. If you believe the goals of this CEP are wrong for
> > the
> > > > > > >> project,
> > > > > > >>> let’s focus on that. If you want to compare and contrast
> > specific
> > > > > > facets
> > > > > > >> of
> > > > > > >>> alternative systems that you consider to be preferable in
> some
> > > > > > dimension,
> > > > > > >>> let’s do that here or in a Q&A as proposed by Joey.
> > > > > > >>>
> > > > > > >>> The relevant goals are that we:
> > > > > > >>>
> > > > > > >>>
> > > > > > >>>  1.  Guarantee strict serializable isolation on commodity
> > > hardware
> > > > > > >>>  2.  Scale to any cluster size
> > > > > > >>>  3.  Achieve optimal latency
> > > > > > >>>
> > > > > > >>> The approach taken by Spanner derivatives is rejected by (1)
> > > > because
> > > > > > they
> > > > > > >>> guarantee only Serializable isolation (they additionally fail
> > > (3)).
> > > > > > From
> > > > > > >>> watching talks by YugaByte, and inferring from Cockroach’s
> > > > > > >>> panic-cluster-death under clock skew, this is clearly
> > considered
> > > by
> > > > > > >>> everyone to be undesirable but necessary to achieve
> > scalability.
> > > > > > >>>
> > > > > > >>> The approach taken by FaunaDB (Calvin) is rejected by (2)
> > because
> > > > its
> > > > > > >>> sequencing layer requires a global leader process for the
> > > cluster,
> > > > > > which
> > > > > > >> is
> > > > > > >>> incompatible with Cassandra’s scalability requirements. It
> > > > > additionally
> > > > > > >>> fails (3) for global clients.
> > > > > > >>>
> > > > > > >>> Two phase commit fails (3). As an aside, AFAICT DynamoDB is
> > > today a
> > > > > > >>> Spanner clone for its multi-key transaction functionality,
> not
> > > 2PC.
> > > > > > >>>
> > > > > > >>> Systems such as RAMP with even weaker isolation are not
> > > considered
> > > > > for
> > > > > > >> the
> > > > > > >>> simple reason that they do not even claim to meet (1).
> > > > > > >>>
> > > > > > >>> If we want to additionally offer weaker isolation levels than
> > > > > > >>> Serializable, such as that provided by the recent RAMP-TAO
> > paper,
> > > > > > >> Cassandra
> > > > > > >>> is likely able to support multiple distinct transaction
> layers
> > > that
> > > > > > >> operate
> > > > > > >>> independently. I would encourage you to file a CEP to explore
> > how
> > > > we
> > > > > > can
> > > > > > >>> meet these distinct use cases, but I consider them to be
> > niche. I
> > > > > > expect
> > > > > > >>> that a majority of our user base desire strict serializable
> > > > > isolation,
> > > > > > >> and
> > > > > > >>> certainly no less than serializable isolation, to augment the
> > > > > existing
> > > > > > >>> weaker isolation offered by quorum reads and writes.
> > > > > > >>>
> > > > > > >>> I would tangentially note that we are not an AP database
> under
> > > > normal
> > > > > > >>> recommended operation. A minority in any network partition
> > cannot
> > > > > reach
> > > > > > >>> QUORUM, so under recommended usage we are a high-availability
> > > > > > leaderless
> > > > > > >> CP
> > > > > > >>> database.
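
The QUORUM point above can be checked with a little arithmetic. A minimal sketch, assuming the usual majority definition of QUORUM (floor(RF / 2) + 1); the function names are made up for the example.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that must respond for a QUORUM read or write."""
    return replication_factor // 2 + 1


def can_reach_quorum(reachable_replicas: int, replication_factor: int) -> bool:
    """True if one side of a network partition can still serve QUORUM operations."""
    return reachable_replicas >= quorum(replication_factor)


# With RF=3 split 1 | 2, only the majority side can make progress.
minority_side = can_reach_quorum(1, 3)
majority_side = can_reach_quorum(2, 3)
```

At most one side of any partition can hold a majority of replicas, so two sides can never both accept QUORUM writes for the same data.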
> > > > > > >>>
> > > > > > >>>
> > > > > > >>> From: Jonathan Ellis <jb...@gmail.com>
> > > > > > >>> Date: Tuesday, 21 September 2021 at 23:45
> > > > > > >>> To: dev <de...@cassandra.apache.org>
> > > > > > >>> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > > >>> Benedict, thanks for taking the lead in putting this
> together.
> > > > Since
> > > > > > >>> Cassandra is the only relevant database today designed
> around a
> > > > > > >> leaderless
> > > > > > >>> architecture, it's quite likely that we'll be better served
> > with
> > > a
> > > > > > custom
> > > > > > >>> transaction design instead of trying to retrofit one from CP
> > > > systems.
> > > > > > >>>
> > > > > > >>> The whitepaper here is a good description of the consensus
> > > > algorithm
> > > > > > >> itself
> > > > > > >>> as well as its robustness and stability characteristics, and
> > its
> > > > > > >> comparison
> > > > > > >>> with other state-of-the-art consensus algorithms is very
> > useful.
> > > > In
> > > > > > the
> > > > > > >>> context of Cassandra, where a consensus algorithm is only
> part
> > of
> > > > > what
> > > > > > >> will
> > > > > > >>> be implemented, I'd like to see a more complete evaluation of
> > the
> > > > > > >>> transactional side of things as well, including performance
> > > > > > >> characteristics
> > > > > > >>> as well as the types of transactions that can be supported
> and
> > at
> > > > > > least a
> > > > > > >>> general idea of what it would look like applied to Cassandra.
> > > This
> > > > > will
> > > > > > >>> allow the PMC to make a more informed decision about what
> > > tradeoffs
> > > > > are
> > > > > > >>> best for the entire long-term project of first supplementing
> > and
> > > > > > >> ultimately
> > > > > > >>> replacing LWT.
> > > > > > >>>
> > > > > > >>> (Allowing users to mix LWT and AP Cassandra operations
> against
> > > the
> > > > > same
> > > > > > >>> rows was probably a mistake, so in contrast with LWT we’re
> not
> > > > > looking
> > > > > > >> for
> > > > > > >>> something fast enough for occasional use but rather something
> > > > within
> > > > > a
> > > > > > >>> reasonable factor of AP operations, appropriate to being the
> > only
> > > > way
> > > > > > to
> > > > > > >>> interact with tables declared as such.)
> > > > > > >>>
> > > > > > >>> Besides Accord, this should cover
> > > > > > >>>
> > > > > > >>> - Calvin and FaunaDB
> > > > > > >>> - A Spanner derivative (no opinion on whether that should be
> > > > > Cockroach
> > > > > > or
> > > > > > >>> Yugabyte, I don’t think it’s necessary to cover both)
> > > > > > >>> - A 2PC implementation (the Accord paper mentions DynamoDB
> but
> > I
> > > > > > suspect
> > > > > > >>> there is more public information about MongoDB)
> > > > > > >>> - RAMP
> > > > > > >>>
> > > > > > >>> Here’s an example of what I mean:
> > > > > > >>>
> > > > > > >>> =Calvin=
> > > > > > >>>
> > > > > > >>> Approach: global consensus (Paxos in Calvin, Raft in FaunaDB)
> > to
> > > > > order
> > > > > > >>> transactions, then replicas execute the transactions
> > > independently
> > > > > with
> > > > > > >> no
> > > > > > >>> further coordination.  No SPOF.  Transactions are batched by
> > each
> > > > > > >> sequencer
> > > > > > >>> to keep this from becoming a bottleneck.
> > > > > > >>>
> > > > > > > >>> Performance: the Calvin paper (published 2012) reports linear
> > > > > > > >>> scaling of TPC-C New Order up to 500,000 transactions/s on 100
> > > > > > > >>> machines (EC2 XL machines with 7GB RAM and 8 virtual cores).
> > > > > > > >>> Note that TPC-C New Order is composed of four reads and four
> > > > > > > >>> writes, so this is effectively 2M reads and 2M writes as we
> > > > > > > >>> normally measure them in C*.
> > > > > > >>>
> > > > > > >>> Calvin supports mixed read/write transactions, but because
> the
> > > > > > >> transaction
> > > > > > >>> execution logic requires knowing all partition keys in
> advance
> > to
> > > > > > ensure
> > > > > > >>> that all replicas can reproduce the same results with no
> > > > > coordination,
> > > > > > >>> reads against non-PK predicates must be done ahead of time
> > > > > > >> (transparently,
> > > > > > >>> by the server) to determine the set of keys, and this must be
> > > > retried
> > > > > > if
> > > > > > >>> the set of rows affected is updated before the actual
> > transaction
> > > > > > >> executes.
> > > > > > >>>
> > > > > > > >>> Batching and global consensus add latency -- 100ms in the
> > > > > > > >>> Calvin paper and apparently about 50ms in FaunaDB.  Glass half
> > > > > > > >>> full: all transactions (including multi-partition updates) are
> > > > > > > >>> equally performant in Calvin since the coordination is handled
> > > > > > > >>> up front in the sequencing step.  Glass half empty: even
> > > > > > > >>> single-row reads and writes have to pay the full coordination
> > > > > > > >>> cost.  Fauna has optimized this away for reads but I am not
> > > > > > > >>> aware of a description of how they changed the design to allow
> > > > > > > >>> this.
> > > > > > >>>
> > > > > > >>> Functionality and limitations: since the entire transaction
> > must
> > > be
> > > > > > known
> > > > > > >>> in advance to allow coordination-less execution at the
> > replicas,
> > > > > Calvin
> > > > > > >>> cannot support interactive transactions at all.  FaunaDB
> > > mitigates
> > > > > this
> > > > > > >> by
> > > > > > >>> allowing server-side logic to be included, but a Calvin
> > approach
> > > > will
> > > > > > >> never
> > > > > > >>> be able to offer SQL compatibility.
> > > > > > >>>
> > > > > > >>> Guarantees: Calvin transactions are strictly serializable.
> > There
> > > > is
> > > > > no
> > > > > > >>> additional complexity or performance hit to generalizing to
> > > > multiple
> > > > > > >>> regions, apart from the speed of light.  And since Calvin is
> > > > already
> > > > > > >> paying
> > > > > > >>> a batching latency penalty, this is less painful than for
> other
> > > > > > systems.
> > > > > > >>>
> > > > > > >>> Application to Cassandra: B-.  Distributed transactions are
> > > handled
> > > > > by
> > > > > > >> the
> > > > > > >>> sequencing and scheduling layers, which are leaderless, and
> > > > Calvin’s
> > > > > > >>> requirements for the storage layer are easily met by C*.  But
> > > > Calvin
> > > > > > also
> > > > > > >>> requires a global consensus protocol and LWT is almost
> > certainly
> > > > not
> > > > > > >>> sufficiently performant, so this would require ZK or etcd
> > > > (reasonable
> > > > > > >> for a
> > > > > > >>> library approach but not for replacing LWT in C* itself), or
> an
> > > > > > >>> implementation of Accord.  I don’t believe Calvin would
> require
> > > > > > >> additional
> > > > > > >>> table-level metadata in Cassandra.
> > > > > > >>>
> > > > > > >>> On Sun, Sep 5, 2021 at 9:33 AM benedict@apache.org <
> > > > > > benedict@apache.org>
> > > > > > >>> wrote:
> > > > > > >>>
> > > > > > > >>>> Wiki:
> > > > > > > >>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-15%3A+General+Purpose+Transactions
> > > > > > > >>>> Whitepaper:
> > > > > > > >>>> https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf
> > > > > > > >>>> <https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf?version=1&modificationDate=1630847736966&api=v2>
> > > > > > >>>> Prototype: https://github.com/belliottsmith/accord
> > > > > > >>>>
> > > > > > >>>> Hi everyone, I’d like to propose this CEP for adoption by
> the
> > > > > > >> community.
> > > > > > >>>>
> > > > > > >>>> Cassandra has benefitted from LWTs for many years, but
> > > application
> > > > > > >>>> developers that want to ensure consistency for complex
> > > operations
> > > > > must
> > > > > > >>>> either accept the scalability bottleneck of serializing all
> > > > related
> > > > > > >> state
> > > > > > >>>> through a single partition, or layer a complex state machine
> > on
> > > > top
> > > > > of
> > > > > > >>> the
> > > > > > >>>> database. These are sophisticated and costly activities that
> > our
> > > > > users
> > > > > > >>>> should not be expected to undertake. Since distributed
> > databases
> > > > are
> > > > > > >>>> beginning to offer distributed transactions with fewer
> > caveats,
> > > it
> > > > > is
> > > > > > >>> past
> > > > > > >>>> time for Cassandra to do so as well.
> > > > > > >>>>
> > > > > > >>>> This CEP proposes the use of several novel techniques that
> > build
> > > > > upon
> > > > > > >>>> research (that followed EPaxos) to deliver (non-interactive)
> > > > general
> > > > > > >>>> purpose distributed transactions. The approach is outlined
> in
> > > the
> > > > > > >>> wikipage
> > > > > > >>>> and in more detail in the linked whitepaper. Importantly, by
> > > > > adopting
> > > > > > >>> this
> > > > > > >>>> approach we will be the _only_ distributed database to offer
> > > > global,
> > > > > > >>>> scalable, strict serializable transactions in one wide area
> > > > > > round-trip.
> > > > > > >>>> This would represent a significant improvement in the state
> of
> > > the
> > > > > > art,
> > > > > > >>>> both in the academic literature and in commercial or open
> > source
> > > > > > >>> offerings.
> > > > > > >>>>
> > > > > > >>>> This work has been partially realised in a prototype. This
> > > partial
> > > > > > >>>> prototype has been verified against Jepsen.io’s Maelstrom
> > > library
> > > > > and
> > > > > > >>>> dedicated in-tree strict serializability verification tools,
> > but
> > > > > much
> > > > > > >>> work
> > > > > > >>>> remains for the work to be production capable and integrated
> > > into
> > > > > > >>> Cassandra.
> > > > > > >>>>
> > > > > > >>>> I propose including the prototype in the project as a new
> > source
> > > > > > >>>> repository, to be developed as a standalone library for
> > > > integration
> > > > > > >> into
> > > > > > >>>> Cassandra. I hope the community sees the important value
> > > > proposition
> > > > > > of
> > > > > > >>>> this proposal, and will adopt the CEP after this discussion,
> > so
> > > > that
> > > > > > >> the
> > > > > > >>>> library and its integration into Cassandra can be developed
> in
> > > > > > parallel
> > > > > > >>> and
> > > > > > >>>> with the involvement of the wider community.
> > > > > > >>>>
> > > > > > >>>
> > > > > > >>>
> > > > > > >>> --
> > > > > > >>> Jonathan Ellis
> > > > > > >>> co-founder, http://www.datastax.com
> > > > > > >>> @spyced
> > > > > > >>>

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by "benedict@apache.org" <be...@apache.org>.
> The current document details thoroughly the protocol but in my view fails to illustrate what specific API, methods, modules will become available to developers

With respect to this, in my view this kind of detail is not warranted within a CEP. Software development is an exploratory process with respect to structure, and these decisions will be made as the CEP progresses. If these need to be specified upfront, then the purpose of a CEP – seeking buy in – is invalidated, because the work must be complete before you know the answers.


From: benedict@apache.org <be...@apache.org>
Date: Friday, 1 October 2021 at 15:31
To: dev@cassandra.apache.org <de...@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
From the CEP:

Batches (including unconditional batches) on transactional tables will receive ACID properties, and grammatically correct conditional batch operations that would be rejected for operating over multiple CQL partitions will now be supported
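
As a rough illustration of what that sentence implies for users, here is a minimal Python sketch of an all-or-nothing conditional batch that spans two partitions. It is only a toy model under stated assumptions: `Table` and `apply_conditional_batch` are hypothetical names, an in-memory dict stands in for CQL partitions, and nothing here is the CEP's or Accord's actual API. Isolation between concurrent batches is not modelled at all.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    # Hypothetical stand-in for a transactional table:
    # partition key -> {column name: value}
    partitions: dict = field(default_factory=dict)


def apply_conditional_batch(table, conditions, writes):
    """Apply every write only if every condition holds; otherwise apply none.

    conditions: list of (partition_key, column, expected_value)
    writes:     list of (partition_key, column, new_value)
    Returns True if the batch was applied.
    """
    for key, column, expected in conditions:
        if table.partitions.get(key, {}).get(column) != expected:
            return False  # one failed condition rejects the whole batch
    for key, column, value in writes:
        table.partitions.setdefault(key, {})[column] = value
    return True


accounts = Table({"alice": {"balance": 100}, "bob": {"balance": 20}})

# Conditions and writes touch two different partitions ("alice" and "bob").
applied = apply_conditional_batch(
    accounts,
    conditions=[("alice", "balance", 100), ("bob", "balance", 20)],
    writes=[("alice", "balance", 70), ("bob", "balance", 50)],
)

# Re-running with the now-stale conditions changes nothing.
stale = apply_conditional_batch(
    accounts,
    conditions=[("alice", "balance", 100), ("bob", "balance", 20)],
    writes=[("alice", "balance", 40), ("bob", "balance", 80)],
)
```

The point of the sketch is only the shape of the guarantee: the conditions are evaluated across partitions and the writes land together or not at all.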


From: Paulo Motta <pa...@gmail.com>
Date: Friday, 1 October 2021 at 15:30
To: Cassandra DEV <de...@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
Can you just answer what palpable feature will be available once this CEP
lands? This is still not clear to me (and perhaps to others) from
the current CEP structure. The current document details thoroughly the
protocol but in my view fails to illustrate what specific API, methods,
modules will become available to developers, how it fits into the larger
picture and interacts with existing modules (if at all), or to offer a few
examples of how it can be used to build features on top.

On Fri, 1 Oct 2021 at 11:10, benedict@apache.org <benedict@apache.org> wrote:

> I’m not, though it might seem that way. I disagree with your views about
> how CEPs should be structured. Since the CEP process was itself codified via
> the CEP process, if you want to recodify how CEPs work, the correct way is
> via the CEP process itself.
>
> The discussion is being drawn in multiple directions away from the CEP
> itself, and I am trying to keep this particular thread focused on the
> business at hand, not meta discussions around CEP structure that will no
> doubt be unproductive given likely irreconcilable views about the topic,
> nor discussions about other CEPs that could have been.
>
> If you want to start a separate exploratory discussion thread about CEP
> structure without filing a CEP feel free to do so.
>
>
> From: Paulo Motta <pa...@gmail.com>
> Date: Friday, 1 October 2021 at 15:04
> To: Cassandra DEV <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > If you want to impose your views on CEP structure on others, please file
> > a CEP with the additional restrictions and guidance you want to impose and
> > start a discussion thread. I can then respond in detail to why I perceive
> > this approach to be flawed, in a dedicated context.
>
> This sounds very Kafkaesque. You know I won't file a meta-CEP to change the
> structure of CEPs, so you're just using this as an excuse to shut down the
> discussion on the lack of clarity about what actual palpable feature will be
> available once the CEP lands. :-)
>
> I'm just providing my humble feedback on how a CEP could be more digestible
> and easier to consume from an external point of view, and this seems like
> an appropriate and contextualized place to voice this opinion which is
> perhaps shared by others.
>
> On Fri, 1 Oct 2021 at 10:55, benedict@apache.org <benedict@apache.org> wrote:
>
> > I disagree with you. However, this is the wrong forum to have a meta
> > discussion about how CEPs should be structured.
> >
> > If you want to impose your views on CEP structure on others, please file
> > a CEP with the additional restrictions and guidance you want to impose and
> > start a discussion thread. I can then respond in detail to why I perceive
> > this approach to be flawed, in a dedicated context.
> >
> >
> > From: Paulo Motta <pa...@gmail.com>
> > Date: Friday, 1 October 2021 at 14:48
> > To: Cassandra DEV <de...@cassandra.apache.org>
> > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > The proposal as it stands today is exceptionally thorough, more so than
> > > any other CEP to date, or any CEP is likely to be in the near future.
> >
> > The protocol is thoroughly described, but in my view a CEP is a forum to
> > discuss the high level architecture and plan for adding a full end-to-end
> > enhancement to the database, breaking it into sub-CEPs if needed, as long
> > as the full plan is known in advance. Otherwise the community will not
> > have the context to judge the full extent and impact of the proposed
> > enhancement.
> >
> > > Since it remains unclear to me what either yourself or Jonathan want to
> > > see as an alternative
> >
> > I would personally like to see something along these lines:
> >
> > CEP1: Add ACID-compliant atomic batches
> > - UX changes needed: none, CQL provides the grammar we need.
> > - Distributed transaction protocol needed: Accord (link to white paper if
> > you want specific details about the protocol)
> > - High-level architecture: what new components will be added, how existing
> > components will be modified, what new messages will be added, what new
> > configuration knobs will be introduced, what are the milestones of the
> > project, etc.
> >
> > CEP2: Make LWT faster and more reliable
> > - UX changes needed: none
> > - Distributed transaction protocol needed: Accord, already added by
> > previous CEP.
> > - High-level architecture: blablabla... and so on.
> >
> > On Fri, 1 Oct 2021 at 10:19, benedict@apache.org <benedict@apache.org> wrote:
> >
> > > I think this is getting circular and unproductive. Basic disagreements
> > > about whether the CEP specifies a feature I am inclined to leave for a
> > > vote. In my view the CEP specifies several features, both immediate ones
> > > for the user (ACID batches and multi-key LWTs) and developer-focused ones
> > > around ground-breaking semantics that will be enabled.
> > >
> > > The proposal as it stands today is exceptionally thorough, more so than
> > > any other CEP to date, or any CEP is likely to be in the near future.
> > >
> > > This is a Cassandra Enhancement *Proposal*, and at some point we have to
> > > engage with what is proposed, not what you might like to be proposed.
> > > Since it remains unclear to me what either yourself or Jonathan want to
> > > see as an alternative, at this point it would seem more productive to
> > > produce your own proposals for the community to consider. It is possible
> > > for multiple transaction systems to co-exist, if you feel this is
> > > necessary.
> > >
> > >
> > >
> > > From: Paulo Motta <pa...@gmail.com>
> > > Date: Friday, 1 October 2021 at 13:58
> > > To: Cassandra DEV <de...@cassandra.apache.org>
> > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > I share similar feelings to jbellis: this proposal seems to be focusing
> > > on the protocol itself but lacking the actual feature that will use the
> > > protocol, which IMO is a key element to discuss in a CEP.
> > >
> > > It's similar to saying: hey, I want to add this Tries Serialization
> > > Protocol to Cassandra, but not providing specific details of how this
> > > protocol is going to be used.
> > >
> > > I think the right route for a CEP is to describe the feature that will
> > > be added to the database, with the protocol as a mere requirement of the
> > > high-level feature, for example:
> > >
> > > CEP: Add Trie-backed memtable
> > > - Trie Serialization Protocol: implementation detail of the above CEP
> > >
> > > What is the difficulty of taking this approach, picking one of the
> > > myriad features that will be enabled by Accord and using that as the
> > > initial CEP to introduce the protocol to the database?
> > >
> > > On Fri, 1 Oct 2021 at 08:37, benedict@apache.org <benedict@apache.org> wrote:
> > >
> > > > Actually, thinking about it again, the simple optimistic protocol would
> > > > in fact guarantee system forward progress (i.e. independent of
> > > > transaction formulation).
> > > >
> > > >
> > > > From: benedict@apache.org <be...@apache.org>
> > > > Date: Friday, 1 October 2021 at 09:14
> > > > To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> > > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > Hi Jonathan,
> > > >
> > > > It would be great if we could achieve a bandwidth higher than 1-2 short
> > > > emails per week. It remains unclear to me what your goal is, and it
> > > > would help if you could make a statement like “I want Cassandra to be
> > > > able to do X” so that we can respond directly to it. I am also available
> > > > to have another call, in which we can have a back and forth, please feel
> > > > free to propose a London-compatible time within the next week that is
> > > > suitable for you.
> > > >
> > > > In my opinion we are at risk of veering off-topic, though. This CEP is
> > > > not to deliver interactive transactions, and to my knowledge nobody is
> > > > proposing a CEP for interactive transactions. So, for the CEP at hand
> > > > the salient question seems: does this CEP prevent us from implementing
> > > > interactive transactions with properties X, Y, Z in future? To which
> > > > the answer is almost certainly no.
> > > >
> > > > However, to continue the discussion and respond directly to your
> > > > queries, I believe we agree on the definition of an interactive
> > > > transaction.
> > > >
> > > > Two protocols were loosely outlined. The first, using timestamps for
> > > > optimistic concurrency control, would indeed involve the possibility of
> > > > aborts. It would not however inherently adopt the issue of LWTs where no
> > > > transaction is able to make progress. Whether or not progress is
> > > > guaranteed (in a livelock-free sense) would depend on the structure of
> > > > the transactions that were interfering.
> > > >
> > > > This approach has the advantage of being very simple to implement, so
> > > > that we could realistically support interactive transactions quite
> > > > quickly. It has the additional advantage that transactions would execute
> > > > very quickly by avoiding the WAN during construction, and as a result
> > > > may in practice experience fewer aborts than protocols that guarantee
> > > > livelock-freedom.
> > > >
> > > > The second protocol proposed using read/write intents and would be able
> > > > to support almost any behaviour you want. We could even utilise
> > > > pessimistic concurrency control, or anything in-between. This is its own
> > > > huge design space, and discussion of this approach and the trade-offs
> > > > that could be made is (in my opinion) entirely out of scope for this
> > > > CEP.
> > > >
> > > >
> > > > From: Jonathan Ellis <jb...@gmail.com>
> > > > Date: Friday, 1 October 2021 at 05:00
> > > > To: dev <de...@cassandra.apache.org>
> > > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > The obstacle for me is you've provided a protocol but not a fully
> > > > fleshed out architecture, so it's hard to fill in some of the blanks.
> > > > But it looks to me like optimistic concurrency control for interactive
> > > > transactions applied to Accord would leave you in a LWT-like situation
> > > > under fairly light contention where nobody actually makes progress due
> > > > to retries.
> > > >
> > > > To make sure we're talking about the same thing, as Henrik pointed out,
> > > > interactive transactions mean multiple round trips from the client
> > > > within a transaction.  For example, here
> > > > <https://github.com/apavlo/py-tpcc/blob/master/pytpcc/drivers/sqlitedriver.py#L213>
> > > > is a simple implementation of the TPC-C New Order transaction.  The
> > > > high level logic (via
> > > > <https://courses.cs.washington.edu/courses/csep545/01wi/lectures/class1/tsld039.htm>)
> > > > is,
> > > >
> > > >    1. Get records describing a warehouse, customer, & district
> > > >    2. Update the district
> > > >    3. Increment next available order number
> > > >    4. Insert record into Order and New-Order tables
> > > >    5. For 5-15 items, get Item record, get/update Stock record
> > > >    6. Insert Order-Line Record
> > > >
> > > > As you can see, this requires a lot of client-side logic mixed in with
> > > > the actual SQL commands.
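
To make the "multiple round trips with client-side logic in between" point concrete, here is a minimal sketch of an interactive transaction in Python using the standard-library sqlite3 module. It is loosely modelled on the New Order steps listed above, but the three-table schema, the `new_order` function and the restock threshold are simplified assumptions made for illustration; this is not the linked py-tpcc driver and not the full TPC-C specification.

```python
import sqlite3

# Hypothetical, heavily simplified schema; real TPC-C has many more tables.
conn = sqlite3.connect(":memory:", isolation_level=None)  # explicit BEGIN/COMMIT
conn.executescript("""
    CREATE TABLE district (id INTEGER PRIMARY KEY, next_order_id INTEGER);
    CREATE TABLE stock (item_id INTEGER PRIMARY KEY, quantity INTEGER);
    CREATE TABLE order_line (order_id INTEGER, item_id INTEGER, quantity INTEGER);
    INSERT INTO district VALUES (1, 3001);
    INSERT INTO stock VALUES (10, 50);
    INSERT INTO stock VALUES (11, 5);
""")


def new_order(conn, district_id, items):
    """items: list of (item_id, quantity). Returns the allocated order id."""
    conn.execute("BEGIN")
    try:
        # Round trip: read the next available order number.
        (order_id,) = conn.execute(
            "SELECT next_order_id FROM district WHERE id = ?", (district_id,)
        ).fetchone()
        # Round trip: the value written was computed by the client.
        conn.execute(
            "UPDATE district SET next_order_id = ? WHERE id = ?",
            (order_id + 1, district_id),
        )
        for item_id, qty in items:
            # Round trips per item: read stock, decide client-side, write back.
            (in_stock,) = conn.execute(
                "SELECT quantity FROM stock WHERE item_id = ?", (item_id,)
            ).fetchone()
            remaining = in_stock - qty
            if remaining < 10:
                remaining += 91  # simplified restock rule, for illustration only
            conn.execute(
                "UPDATE stock SET quantity = ? WHERE item_id = ?",
                (remaining, item_id),
            )
            conn.execute(
                "INSERT INTO order_line VALUES (?, ?, ?)", (order_id, item_id, qty)
            )
        conn.execute("COMMIT")
        return order_id
    except Exception:
        conn.execute("ROLLBACK")
        raise


order_id = new_order(conn, district_id=1, items=[(10, 3), (11, 2)])
```

Every statement between BEGIN and COMMIT depends on a value the client has just read, which is why the whole transaction cannot be submitted to the database in a single request.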
> > > >
> > > >
> > > > On Thu, Sep 30, 2021 at 2:30 AM benedict@apache.org <benedict@apache.org>
> > > > wrote:
> > > >
> > > > > Essentially this, although I think in practice we will need to track
> > > > > each partition’s timestamp separately (or optionally for reduced
> > > > > conflicts, each row or datum’s), and make them all part of the
> > > > > conditional application of the transaction - at least for
> > > > > strict-serializability.
> > > > >
> > > > > The alternative is to insert read/write intents for the transaction
> > > > > during each step, and to confirm they are still valid on commit, but
> > > > > this approach would require a WAN round-trip for each step in the
> > > > > interactive transaction, whereas the timestamp-validating approach can
> > > > > use a LAN round-trip for each step besides the final one, and is also
> > > > > much simpler to implement.
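
The timestamp-validating approach described above can be sketched roughly as follows. This is a single-process Python toy under loud assumptions: `Store`, `InteractiveTxn` and the integer logical clock are hypothetical names invented for the example and are not part of Accord or Cassandra. Reads record each partition's last-write timestamp, writes are buffered by the transaction, and the commit is applied only if none of the partitions read has been written since.

```python
class Store:
    """Hypothetical stand-in for replicas: values plus last-write timestamps."""

    def __init__(self):
        self.data = {}     # partition key -> value
        self.last_ts = {}  # partition key -> timestamp of last committed write
        self.clock = 0     # toy logical clock

    def read(self, key):
        # Return the value together with the timestamp it was written at.
        return self.data.get(key), self.last_ts.get(key, 0)

    def commit(self, read_ts, writes):
        """Apply writes only if every partition read is unchanged since the read."""
        for key, ts in read_ts.items():
            if self.last_ts.get(key, 0) != ts:
                return False  # a more recent write intervened: abort
        self.clock += 1
        for key, value in writes.items():
            self.data[key] = value
            self.last_ts[key] = self.clock
        return True


class InteractiveTxn:
    """Buffers writes and remembers the timestamp seen for each partition read."""

    def __init__(self, store):
        self.store = store
        self.read_ts = {}
        self.writes = {}

    def read(self, key):
        if key in self.writes:  # read your own buffered write
            return self.writes[key]
        value, ts = self.store.read(key)
        self.read_ts.setdefault(key, ts)
        return value

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        return self.store.commit(self.read_ts, self.writes)


store = Store()
store.commit({}, {"stock:1": 10})  # seed one partition

t1, t2 = InteractiveTxn(store), InteractiveTxn(store)
t1.write("stock:1", t1.read("stock:1") - 1)
t2.write("stock:1", t2.read("stock:1") - 1)

first = t1.commit()   # validates against the timestamp t1 read
second = t2.commit()  # t1 has since written the partition, so this aborts
```

In this toy, one of the two conflicting transactions always commits, so the pair as a whole makes progress while the loser has to retry. Whether that carries over to a real deployment depends on details the sketch leaves out, such as how timestamps are assigned and how the conditional commit is ordered across replicas.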
> > > > >
> > > > >
> > > > > From: Blake Eggleston <be...@apple.com.INVALID>
> > > > > Date: Thursday, 30 September 2021 at 05:47
> > > > > To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> > > > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > You could establish a lower timestamp bound and buffer transaction
> > > > > state on the coordinator, then make the commit an operation that only
> > > > > applies if all partitions involved haven’t been changed by a more
> > > > > recent timestamp. You could also implement MVCC either in the storage
> > > > > layer or for some period of time by buffering commits on each replica
> > > > > before applying.
> > > > >
> > > > > > On Sep 29, 2021, at 6:18 PM, Jonathan Ellis <jb...@gmail.com> wrote:
> > > > > >
> > > > > > How are interactive transactions possible with Accord?
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Tue, Sep 21, 2021 at 11:56 PM benedict@apache.org <benedict@apache.org>
> > > > > > wrote:
> > > > > >
> > > > > >> Could you explain why you believe this trade-off is necessary? We
> > > > > >> can support full SQL just fine with Accord, and I hope that we
> > > > > >> eventually do so.
> > > > > >>
> > > > > >> This domain is incredibly complex, so it is easy to reach wrong
> > > > > >> conclusions. I would invite you again to propose a system for
> > > > > >> discussion that you think offers something Accord is unable to, and
> > > > > >> that you consider desirable, and we can work from there.
> > > > > >>
> > > > > >> To pre-empt some possible discussions, I am not aware of anything we
> > > > > >> cannot do with Accord that we could do with either Calvin or Spanner.
> > > > > >> Interactive transactions are possible on top of Accord, as are
> > > > > >> transactions with an unknown read/write set. In each case the only
> > > > > >> cost is that they would use optimistic concurrency control, which is
> > > > > >> no worse than Spanner derivatives anyway (which I have to assume is
> > > > > >> your benchmark in this regard). I do not expect to deliver either
> > > > > >> functionality initially, but Accord takes us most of the way there
> > > > > >> for both.
> > > > > >>
> > > > > >>
> > > > > >> From: Jonathan Ellis <jb...@gmail.com>
> > > > > >> Date: Wednesday, 22 September 2021 at 05:36
> > > > > >> To: dev <de...@cassandra.apache.org>
> > > > > >> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > >> Right, I'm looking for exactly a discussion on the high level goals.
> > > > > >> Instead of saying "here's the goals and we ruled out X because Y" we
> > > > > >> should start with a discussion around, "Approach A allows X and W,
> > > > > >> approach B allows Y and Z" and decide together what the goals should
> > > > > >> be and what we are willing to trade to get those goals, e.g., are we
> > > > > >> willing to give up global strict serializability to get the ability
> > > > > >> to support full SQL. Both of these are nice to have!
> > > > > >>
> > > > > >> On Tue, Sep 21, 2021 at 9:52 PM benedict@apache.org <benedict@apache.org>
> > > > > >> wrote:
> > > > > >>
> > > > > >>> Hi Jonathan,
> > > > > >>>
> > > > > >>> These other systems are incompatible with the goals of the
> CEP. I
> > > do
> > > > > >>> discuss them (besides 2PC) in both the whitepaper and the CEP,
> > and
> > > > will
> > > > > >>> summarise that discussion below. A true and accurate comparison
> > of
> > > > > these
> > > > > >>> other systems is essentially intractable, as there are complex
> > > > > subtleties
> > > > > >>> to each flavour, and those who are interested would be better
> > > served
> > > > by
> > > > > >>> performing their own research.
> > > > > >>>
> > > > > >>> I think it is more productive to focus on what we want to
> achieve
> > > as
> > > > a
> > > > > >>> community. If you believe the goals of this CEP are wrong for
> the
> > > > > >> project,
> > > > > >>> let’s focus on that. If you want to compare and contrast
> specific
> > > > > facets
> > > > > >> of
> > > > > >>> alternative systems that you consider to be preferable in some
> > > > > dimension,
> > > > > >>> let’s do that here or in a Q&A as proposed by Joey.
> > > > > >>>
> > > > > >>> The relevant goals are that we:
> > > > > >>>
> > > > > >>>
> > > > > >>>  1.  Guarantee strict serializable isolation on commodity
> > hardware
> > > > > >>>  2.  Scale to any cluster size
> > > > > >>>  3.  Achieve optimal latency
> > > > > >>>
> > > > > >>> The approach taken by Spanner derivatives is rejected by (1)
> > > because
> > > > > they
> > > > > >>> guarantee only Serializable isolation (they additionally fail
> > (3)).
> > > > > From
> > > > > >>> watching talks by YugaByte, and inferring from Cockroach’s
> > > > > >>> panic-cluster-death under clock skew, this is clearly
> considered
> > by
> > > > > >>> everyone to be undesirable but necessary to achieve
> scalability.
> > > > > >>>
> > > > > >>> The approach taken by FaunaDB (Calvin) is rejected by (2)
> because
> > > its
> > > > > >>> sequencing layer requires a global leader process for the
> > cluster,
> > > > > which
> > > > > >> is
> > > > > >>> incompatible with Cassandra’s scalability requirements. It
> > > > additionally
> > > > > >>> fails (3) for global clients.
> > > > > >>>
> > > > > >>> Two phase commit fails (3). As an aside, AFAICT DynamoDB is
> > today a
> > > > > >>> Spanner clone for its multi-key transaction functionality, not
> > 2PC.
> > > > > >>>
> > > > > >>> Systems such as RAMP with even weaker isolation are not
> > considered
> > > > for
> > > > > >> the
> > > > > >>> simple reason that they do not even claim to meet (1).
> > > > > >>>
> > > > > >>> If we want to additionally offer weaker isolation levels than
> > > > > >>> Serializable, such as that provided by the recent RAMP-TAO
> paper,
> > > > > >> Cassandra
> > > > > >>> is likely able to support multiple distinct transaction layers
> > that
> > > > > >> operate
> > > > > >>> independently. I would encourage you to file a CEP to explore
> how
> > > we
> > > > > can
> > > > > >>> meet these distinct use cases, but I consider them to be
> niche. I
> > > > > expect
> > > > > >>> that a majority of our user base desire strict serializable
> > > > isolation,
> > > > > >> and
> > > > > >>> certainly no less than serializable isolation, to augment the
> > > > existing
> > > > > >>> weaker isolation offered by quorum reads and writes.
> > > > > >>>
> > > > > >>> I would tangentially note that we are not an AP database under
> > > normal
> > > > > >>> recommended operation. A minority in any network partition
> cannot
> > > > reach
> > > > > >>> QUORUM, so under recommended usage we are a high-availability
> > > > > leaderless
> > > > > >> CP
> > > > > >>> database.
> > > > > >>>
> > > > > >>>
> > > > > >>> From: Jonathan Ellis <jb...@gmail.com>
> > > > > >>> Date: Tuesday, 21 September 2021 at 23:45
> > > > > >>> To: dev <de...@cassandra.apache.org>
> > > > > >>> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > >>> Benedict, thanks for taking the lead in putting this together.
> > > Since
> > > > > >>> Cassandra is the only relevant database today designed around a
> > > > > >> leaderless
> > > > > >>> architecture, it's quite likely that we'll be better served
> with
> > a
> > > > > custom
> > > > > >>> transaction design instead of trying to retrofit one from CP
> > > systems.
> > > > > >>>
> > > > > >>> The whitepaper here is a good description of the consensus
> > > algorithm
> > > > > >> itself
> > > > > >>> as well as its robustness and stability characteristics, and
> its
> > > > > >> comparison
> > > > > >>> with other state-of-the-art consensus algorithms is very
> useful.
> > > In
> > > > > the
> > > > > >>> context of Cassandra, where a consensus algorithm is only part
> of
> > > > what
> > > > > >> will
> > > > > >>> be implemented, I'd like to see a more complete evaluation of
> the
> > > > > >>> transactional side of things as well, including performance
> > > > > >> characteristics
> > > > > >>> as well as the types of transactions that can be supported and
> at
> > > > > least a
> > > > > >>> general idea of what it would look like applied to Cassandra.
> > This
> > > > will
> > > > > >>> allow the PMC to make a more informed decision about what
> > tradeoffs
> > > > are
> > > > > >>> best for the entire long-term project of first supplementing
> and
> > > > > >> ultimately
> > > > > >>> replacing LWT.
> > > > > >>>
> > > > > >>> (Allowing users to mix LWT and AP Cassandra operations against
> > the
> > > > same
> > > > > >>> rows was probably a mistake, so in contrast with LWT we’re not
> > > > looking
> > > > > >> for
> > > > > >>> something fast enough for occasional use but rather something
> > > within
> > > > a
> > > > > >>> reasonable factor of AP operations, appropriate to being the
> only
> > > way
> > > > > to
> > > > > >>> interact with tables declared as such.)
> > > > > >>>
> > > > > >>> Besides Accord, this should cover
> > > > > >>>
> > > > > >>> - Calvin and FaunaDB
> > > > > >>> - A Spanner derivative (no opinion on whether that should be
> > > > Cockroach
> > > > > or
> > > > > >>> Yugabyte, I don’t think it’s necessary to cover both)
> > > > > >>> - A 2PC implementation (the Accord paper mentions DynamoDB but
> I
> > > > > suspect
> > > > > >>> there is more public information about MongoDB)
> > > > > >>> - RAMP
> > > > > >>>
> > > > > >>> Here’s an example of what I mean:
> > > > > >>>
> > > > > >>> =Calvin=
> > > > > >>>
> > > > > >>> Approach: global consensus (Paxos in Calvin, Raft in FaunaDB)
> to
> > > > order
> > > > > >>> transactions, then replicas execute the transactions
> > independently
> > > > with
> > > > > >> no
> > > > > >>> further coordination.  No SPOF.  Transactions are batched by
> each
> > > > > >> sequencer
> > > > > >>> to keep this from becoming a bottleneck.
> > > > > >>>
> > > > > >>> Performance: Calvin paper (published 2012) reports linear
> scaling
> > > of
> > > > > >> TPC-C
> > > > > >>> New Order up to 500,000 transactions/s on 100 machines (EC2 XL
> > > > machines
> > > > > >>> with 7GB ram and 8 virtual cores).  Note that TPC-C New Order
> is
> > > > > composed
> > > > > >>> of four reads and four writes, so this is effectively 2M reads
> > and
> > > 2M
> > > > > >>> writes as we normally measure them in C*.
> > > > > >>>
> > > > > >>> Calvin supports mixed read/write transactions, but because the
> > > > > >> transaction
> > > > > >>> execution logic requires knowing all partition keys in advance
> to
> > > > > ensure
> > > > > >>> that all replicas can reproduce the same results with no
> > > > coordination,
> > > > > >>> reads against non-PK predicates must be done ahead of time
> > > > > >> (transparently,
> > > > > >>> by the server) to determine the set of keys, and this must be
> > > retried
> > > > > if
> > > > > >>> the set of rows affected is updated before the actual
> transaction
> > > > > >> executes.
> > > > > >>>
> > > > > > >>> Batching and global consensus add latency -- 100ms in the
> Calvin
> > > > paper
> > > > > >> and
> > > > > >>> apparently about 50ms in FaunaDB.  Glass half full: all
> > > transactions
> > > > > >>> (including multi-partition updates) are equally performant in
> > > Calvin
> > > > > >> since
> > > > > >>> the coordination is handled up front in the sequencing step.
> > Glass
> > > > > half
> > > > > >>> empty: even single-row reads and writes have to pay the full
> > > > > coordination
> > > > > >>> cost.  Fauna has optimized this away for reads but I am not
> aware
> > > of
> > > > a
> > > > > >>> description of how they changed the design to allow this.
> > > > > >>>
> > > > > >>> Functionality and limitations: since the entire transaction
> must
> > be
> > > > > known
> > > > > >>> in advance to allow coordination-less execution at the
> replicas,
> > > > Calvin
> > > > > >>> cannot support interactive transactions at all.  FaunaDB
> > mitigates
> > > > this
> > > > > >> by
> > > > > >>> allowing server-side logic to be included, but a Calvin
> approach
> > > will
> > > > > >> never
> > > > > >>> be able to offer SQL compatibility.
> > > > > >>>
> > > > > >>> Guarantees: Calvin transactions are strictly serializable.
> There
> > > is
> > > > no
> > > > > >>> additional complexity or performance hit to generalizing to
> > > multiple
> > > > > >>> regions, apart from the speed of light.  And since Calvin is
> > > already
> > > > > >> paying
> > > > > >>> a batching latency penalty, this is less painful than for other
> > > > > systems.
> > > > > >>>
> > > > > >>> Application to Cassandra: B-.  Distributed transactions are
> > handled
> > > > by
> > > > > >> the
> > > > > >>> sequencing and scheduling layers, which are leaderless, and
> > > Calvin’s
> > > > > >>> requirements for the storage layer are easily met by C*.  But
> > > Calvin
> > > > > also
> > > > > >>> requires a global consensus protocol and LWT is almost
> certainly
> > > not
> > > > > >>> sufficiently performant, so this would require ZK or etcd
> > > (reasonable
> > > > > >> for a
> > > > > >>> library approach but not for replacing LWT in C* itself), or an
> > > > > >>> implementation of Accord.  I don’t believe Calvin would require
> > > > > >> additional
> > > > > >>> table-level metadata in Cassandra.
> > > > > >>>
> > > > > >>> On Sun, Sep 5, 2021 at 9:33 AM benedict@apache.org <
> > > > > benedict@apache.org>
> > > > > >>> wrote:
> > > > > >>>
> > > > > >>>> Wiki:
> > > > > >>>>
> > > > > >>>
> > > > > >>
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-15%3A+General+Purpose+Transactions
> > > > > >>>> Whitepaper:
> > > > > >>>>
> > > > > >>>
> > > > > >>
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf
> > > > > >>>> <
> > > > > >>>>
> > > > > >>>
> > > > > >>
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf?version=1&modificationDate=1630847736966&api=v2
> > > > > >>>>>
> > > > > >>>> Prototype: https://github.com/belliottsmith/accord
> > > > > >>>>
> > > > > >>>> Hi everyone, I’d like to propose this CEP for adoption by the
> > > > > >> community.
> > > > > >>>>
> > > > > >>>> Cassandra has benefitted from LWTs for many years, but
> > application
> > > > > >>>> developers that want to ensure consistency for complex
> > operations
> > > > must
> > > > > >>>> either accept the scalability bottleneck of serializing all
> > > related
> > > > > >> state
> > > > > >>>> through a single partition, or layer a complex state machine
> on
> > > top
> > > > of
> > > > > >>> the
> > > > > >>>> database. These are sophisticated and costly activities that
> our
> > > > users
> > > > > >>>> should not be expected to undertake. Since distributed
> databases
> > > are
> > > > > >>>> beginning to offer distributed transactions with fewer
> caveats,
> > it
> > > > is
> > > > > >>> past
> > > > > >>>> time for Cassandra to do so as well.
> > > > > >>>>
> > > > > >>>> This CEP proposes the use of several novel techniques that
> build
> > > > upon
> > > > > >>>> research (that followed EPaxos) to deliver (non-interactive)
> > > general
> > > > > >>>> purpose distributed transactions. The approach is outlined in
> > the
> > > > > >>> wikipage
> > > > > >>>> and in more detail in the linked whitepaper. Importantly, by
> > > > adopting
> > > > > >>> this
> > > > > >>>> approach we will be the _only_ distributed database to offer
> > > global,
> > > > > >>>> scalable, strict serializable transactions in one wide area
> > > > > round-trip.
> > > > > >>>> This would represent a significant improvement in the state of
> > the
> > > > > art,
> > > > > >>>> both in the academic literature and in commercial or open
> source
> > > > > >>> offerings.
> > > > > >>>>
> > > > > >>>> This work has been partially realised in a prototype. This
> > partial
> > > > > >>>> prototype has been verified against Jepsen.io’s Maelstrom
> > library
> > > > and
> > > > > >>>> dedicated in-tree strict serializability verification tools,
> but
> > > > much
> > > > > >>> work
> > > > > >>>> remains for the work to be production capable and integrated
> > into
> > > > > >>> Cassandra.
> > > > > >>>>
> > > > > >>>> I propose including the prototype in the project as a new
> source
> > > > > >>>> repository, to be developed as a standalone library for
> > > integration
> > > > > >> into
> > > > > >>>> Cassandra. I hope the community sees the important value
> > > proposition
> > > > > of
> > > > > >>>> this proposal, and will adopt the CEP after this discussion,
> so
> > > that
> > > > > >> the
> > > > > >>>> library and its integration into Cassandra can be developed in
> > > > > parallel
> > > > > >>> and
> > > > > >>>> with the involvement of the wider community.
> > > > > >>>>
> > > > > >>>
> > > > > >>>
> > > > > >>> --
> > > > > >>> Jonathan Ellis
> > > > > >>> co-founder, http://www.datastax.com
> > > > > >>> @spyced
> > > > > >>>
> > > > > >>
> > > > > >>
> > > > > >> --
> > > > > >> Jonathan Ellis
> > > > > >> co-founder, http://www.datastax.com
> > > > > >> @spyced
> > > > > >>
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Jonathan Ellis
> > > > > > co-founder, http://www.datastax.com
> > > > > > @spyced
> > > > >
> > > > >
> > > > >
> ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: dev-unsubscribe@cassandra.apache.org
> > > > > For additional commands, e-mail: dev-help@cassandra.apache.org
> > > > >
> > > >
> > > >
> > > > --
> > > > Jonathan Ellis
> > > > co-founder, http://www.datastax.com
> > > > @spyced
> > > >
> > >
> >
>

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by "benedict@apache.org" <be...@apache.org>.
From the CEP:

Batches (including unconditional batches) on transactional tables will receive ACID properties, and grammatically correct conditional batch operations that would be rejected for operating over multiple CQL partitions will now be supported
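
As a rough illustration of the semantics quoted above, the following Python sketch models a conditional batch that spans two partitions and is applied all-or-nothing. It is not the CEP's API: the function apply_conditional_batch, the dict-based "table", and the account data are all hypothetical, and the model is a single process with no replication.

```python
# Toy in-memory model of a conditional batch spanning multiple partitions.
# All names here (apply_conditional_batch, the dict-based "table") are
# hypothetical; they only illustrate all-or-nothing application.
from typing import Any, Callable, Dict, List, Tuple

Table = Dict[str, Dict[str, Any]]  # partition key -> row
Condition = Tuple[str, Callable[[Dict[str, Any]], bool]]
Write = Tuple[str, Dict[str, Any]]


def apply_conditional_batch(table: Table,
                            conditions: List[Condition],
                            writes: List[Write]) -> bool:
    """Apply every write if every condition holds; otherwise apply none."""
    for key, predicate in conditions:
        if not predicate(table.get(key, {})):
            return False  # nothing has been modified at this point
    for key, columns in writes:
        table.setdefault(key, {}).update(columns)
    return True


accounts: Table = {"alice": {"balance": 100}, "bob": {"balance": 20}}

# Move 30 from alice to bob only if alice can afford it.
ok = apply_conditional_batch(
    accounts,
    conditions=[("alice", lambda row: row.get("balance", 0) >= 30)],
    writes=[("alice", {"balance": 70}), ("bob", {"balance": 50})],
)

# A batch whose condition fails leaves both partitions untouched.
rejected = apply_conditional_batch(
    accounts,
    conditions=[("alice", lambda row: row.get("balance", 0) >= 500)],
    writes=[("alice", {"balance": 0}), ("bob", {"balance": 550})],
)
```

The sketch only shows the atomicity of the condition check plus writes; a real implementation would also have to provide isolation and durability across replicas, which this toy model does not attempt.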


From: Paulo Motta <pa...@gmail.com>
Date: Friday, 1 October 2021 at 15:30
To: Cassandra DEV <de...@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
Can you just say what tangible feature will be available once this CEP
lands? This is still not clear to me (and perhaps to others) from the
current CEP structure. The current document describes the protocol
thoroughly but, in my view, fails to illustrate which specific APIs,
methods, and modules will become available to developers, and how it fits
into the larger picture and interacts with existing modules (if at all).
It could perhaps also give a few examples of how it can be used to build
features on top.

On Fri, 1 Oct 2021 at 11:10, benedict@apache.org <
benedict@apache.org> wrote:

> I’m not, though it might seem that way. I disagree with your views about
> how CEPs should be structured. Since the CEP process was itself codified via
> the CEP process, if you want to recodify how CEPs work, the correct way is
> via the CEP process itself.
>
> The discussion is being drawn in multiple directions away from the CEP
> itself, and I am trying to keep this particular thread focused on the
> business at hand, not meta discussions around CEP structure that will no
> doubt be unproductive given likely irreconcilable views about the topic,
> nor discussions about other CEPs that could have been.
>
> If you want to start a separate exploratory discussion thread about CEP
> structure without filing a CEP, feel free to do so.
>
>
> From: Paulo Motta <pa...@gmail.com>
> Date: Friday, 1 October 2021 at 15:04
> To: Cassandra DEV <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > If you want to impose your views on CEP structure on others, please file
> a CEP with the additional restrictions and guidance you want to impose and
> start a discussion thread. I can then respond in detail to why I perceive
> this approach to be flawed, in a dedicated context.
>
> This sounds very Kafkaesque. You know I won't file a meta-CEP to change the
> structure of CEPs, so you're just using this as an excuse to shut down the
> discussion about the lack of clarity on what tangible feature will be
> available once the CEP lands. :-)
>
> I'm just providing my humble feedback on how a CEP could be more digestible
> and easier to consume from an external point of view, and this seems like
> an appropriate and contextualized place to voice this opinion which is
> perhaps shared by others.
>
> On Fri, 1 Oct 2021 at 10:55, benedict@apache.org <
> benedict@apache.org> wrote:
>
> > I disagree with you. However, this is the wrong forum to have a meta
> > discussion about how CEPs should be structured.
> >
> > If you want to impose your views on CEP structure on others, please file
> a
> > CEP with the additional restrictions and guidance you want to impose and
> > start a discussion thread. I can then respond in detail to why I perceive
> > this approach to be flawed, in a dedicated context.
> >
> >
> > From: Paulo Motta <pa...@gmail.com>
> > Date: Friday, 1 October 2021 at 14:48
> > To: Cassandra DEV <de...@cassandra.apache.org>
> > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > >  The proposal as it stands today is exceptionally thorough, more so
> than
> > any other CEP to date, or any CEP is likely to be in the near future.
> >
> > The protocol is thoroughly described, but in my view a CEP is a forum to
> > discuss the high level architecture and plan for adding a full end-to-end
> > enhancement to the database, breaking it into sub-CEPs if needed, as long
> > as the full plan is known in advance, otherwise the community will not
> have
> > the context to judge the full extent and impact of the proposed
> > enhancement.
> >
> > > Since it remains unclear to me what either yourself or Jonathan want to
> > see as an alternative
> >
> > I would personally like to see something along these lines:
> >
> > CEP1: Add ACID-compliant atomic batches
> > - UX changes needed: none, CQL provides the grammar we need.
> > - Distributed transaction protocol needed: Accord (link to white paper if
> > you want specific details about the protocol)
> > - High-level architecture: what new components will be added, how
> existing
> > components will be modified, what new messages will be added, what new
> > configuration knobs will be introduced, what are the milestones of the
> > project, etc.
> >
> > CEP2: Make LWT faster and more reliable
> > - UX changes needed: none
> > - Distributed transaction protocol needed: Accord, already added by
> > previous CEP.
> > - High-level architecture: blablabla... and so on.
> >
> > On Fri, 1 Oct 2021 at 10:19, benedict@apache.org <
> > benedict@apache.org> wrote:
> >
> > > I think this is getting circular and unproductive. Basic disagreements
> > > about whether the CEP specifies a feature I am inclined to leave for a
> > > vote. In my view the CEP specifies several features, both immediate
> ones
> > > for the user (ACID batches and multi-key LWTs) and developer-focused
> ones
> > > around ground-breaking semantics that will be enabled.
> > >
> > > The proposal as it stands today is exceptionally thorough, more so than
> > > any other CEP to date, or any CEP is likely to be in the near future.
> > >
> > > This is a Cassandra Enhancement *Proposal*, and at some point we have
> to
> > > engage with what is proposed, not what you might like to be proposed.
> > Since
> > > it remains unclear to me what either yourself or Jonathan want to see
> as
> > an
> > > alternative, at this point it would seem more productive to produce
> your
> > > own proposals for the community to consider. It is possible for
> multiple
> > > transaction systems to co-exist, if you feel this is necessary.
> > >
> > >
> > >
> > > From: Paulo Motta <pa...@gmail.com>
> > > Date: Friday, 1 October 2021 at 13:58
> > > To: Cassandra DEV <de...@cassandra.apache.org>
> > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > I share similar feelings as jbellis that this proposal seems to be
> > focusing
> > > on the protocol itself but lacking the actual feature that will use the
> > > protocol, which IMO is a key element to discuss in a CEP.
> > >
> > > It's similar to saying: hey I want to add this Tries Serialization
> > Protocol
> > > to Cassandra, but not providing specific details of how this protocol
> is
> > > going to be used.
> > >
> > > I think the right route for a CEP is to describe the feature that will
> be
> > > added to the database and the protocol is a mere requirement of the
> > > high-level feature, for example:
> > >
> > > CEP: Add Trie-backed memtable
> > > - Trie Serialization Protocol: implementation detail of the above CEP
> > >
> > > What is the difficulty of taking this approach, picking one of the
> myriad
> > > of features that will be enabled by Accord and using that as the
> initial
> > > CEP to introduce the protocol to the database?
> > >
> > > On Fri, 1 Oct 2021 at 08:37, benedict@apache.org <
> > > benedict@apache.org> wrote:
> > >
> > > > Actually, thinking about it again, the simple optimistic protocol
> would
> > > in
> > > > fact guarantee system forward progress (i.e. independent of
> transaction
> > > > formulation).
> > > >
> > > >
> > > > From: benedict@apache.org <be...@apache.org>
> > > > Date: Friday, 1 October 2021 at 09:14
> > > > To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> > > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > Hi Jonathan,
> > > >
> > > > It would be great if we could achieve a bandwidth higher than 1-2
> short
> > > > emails per week. It remains unclear to me what your goal is, and it
> > would
> > > > help if you could make a statement like “I want Cassandra to be able
> to
> > > do
> > > > X” so that we can respond directly to it. I am also available to have
> > > > another call, in which we can have a back and forth, please feel free
> > to
> > > > propose a London-compatible time within the next week that is
> suitable
> > > for
> > > > you.
> > > >
> > > > In my opinion we are at risk of veering off-topic, though. This CEP
> is
> > > not
> > > > to deliver interactive transactions, and to my knowledge nobody is
> > > > proposing a CEP for interactive transactions. So, for the CEP at hand
> > the
> > > > salient question seems: does this CEP prevent us from implementing
> > > > interactive transactions with properties X, Y, Z in future? To which
> > the
> > > > answer is almost certainly no.
> > > >
> > > > However, to continue the discussion and respond directly to your
> > queries,
> > > > I believe we agree on the definition of an interactive transaction.
> > > >
> > > > Two protocols were loosely outlined. The first, using timestamps for
> > > > optimistic concurrency control, would indeed involve the possibility
> of
> > > > aborts. It would not however inherently adopt the issue of LWTs where
> > no
> > > > transaction is able to make progress. Whether or not progress is
> > > guaranteed
> > > > (in a livelock-free sense) would depend on the structure of the
> > > > transactions that were interfering.
> > > >
> > > > This approach has the advantage of being very simple to implement, so
> > > that
> > > > we could realistically support interactive transactions quite
> quickly.
> > It
> > > > has the additional advantage that transactions would execute very
> > quickly
> > > > by avoiding the WAN during construction, and as a result may in
> > practice
> > > > experience fewer aborts than protocols that guarantee
> livelock-freedom.
> > > >
> > > > The second protocol proposed using read/write intents and would be
> able
> > > to
> > > > support almost any behaviour you want. We could even utilise
> > pessimistic
> > > > concurrency control, or anything in-between. This is its own huge
> > design
> > > > space, and discussion of this approach and the trade-offs that could
> be
> > > > made is (in my opinion) entirely out of scope for this CEP.
> > > >
> > > >
> > > > From: Jonathan Ellis <jb...@gmail.com>
> > > > Date: Friday, 1 October 2021 at 05:00
> > > > To: dev <de...@cassandra.apache.org>
> > > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > The obstacle for me is you've provided a protocol but not a fully
> > fleshed
> > > > out architecture, so it's hard to fill in some of the blanks.  But it
> > > looks
> > > > to me like optimistic concurrency control for interactive
> transactions
> > > > applied to Accord would leave you in a LWT-like situation under
> fairly
> > > > light contention where nobody actually makes progress due to retries.
> > > >
> > > > To make sure we're talking about the same thing, as Henrik pointed
> out,
> > > > interactive transactions mean multiple round trips from the client
> > > within a
> > > > transaction.  For example, here
> > > > <
> > > >
> > >
> >
> https://github.com/apavlo/py-tpcc/blob/master/pytpcc/drivers/sqlitedriver.py#L213
> > > > >
> > > > is a simple implementation of the TPC-C New Order transaction.  The
> > high
> > > > level logic (via
> > > > <
> > > >
> > >
> >
> https://courses.cs.washington.edu/courses/csep545/01wi/lectures/class1/tsld039.htm
> > > > >)
> > > > is,
> > > >
> > > >    1. Get records describing a warehouse, customer, & district
> > > >    2. Update the district
> > > >    3. Increment next available order number
> > > >    4. Insert record into Order and New-Order tables
> > > >    5. For 5-15 items, get Item record, get/update Stock record
> > > >    6. Insert Order-Line Record
> > > >
> > > > As you can see, this requires a lot of client-side logic mixed in
> with
> > > the
> > > > actual SQL commands.
> > > >
> > > >
> > > > On Thu, Sep 30, 2021 at 2:30 AM benedict@apache.org <
> > benedict@apache.org
> > > >
> > > > wrote:
> > > >
> > > > > Essentially this, although I think in practice we will need to
> track
> > > each
> > > > > partition’s timestamp separately (or optionally for reduced
> > conflicts,
> > > > each
> > > > > row or datum’s), and make them all part of the conditional
> > application
> > > of
> > > > > the transaction - at least for strict-serializability.
> > > > >
> > > > > The alternative is to insert read/write intents for the transaction
> > > > during
> > > > > each step, and to confirm they are still valid on commit, but this
> > > > approach
> > > > > would require a WAN round-trip for each step in the interactive
> > > > > transaction, whereas the timestamp-validating approach can use a
> LAN
> > > > > round-trip for each step besides the final one, and is also much
> > > simpler
> > > > to
> > > > > implement.
> > > > >
> > > > >
> > > > > From: Blake Eggleston <be...@apple.com.INVALID>
> > > > > Date: Thursday, 30 September 2021 at 05:47
> > > > > To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> > > > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > You could establish a lower timestamp bound and buffer transaction
> > > state
> > > > > on the coordinator, then make the commit an operation that only
> > applies
> > > > if
> > > > > all partitions involved haven’t been changed by a more recent
> > > timestamp.
> > > > > You could also implement mvcc either in the storage layer or for
> some
> > > > > period of time by buffering commits on each replica before
> applying.
> > > > >
> > > > > > On Sep 29, 2021, at 6:18 PM, Jonathan Ellis <jb...@gmail.com>
> > > wrote:
> > > > > >
> > > > > > How are interactive transactions possible with Accord?
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Tue, Sep 21, 2021 at 11:56 PM benedict@apache.org <
> > > > > benedict@apache.org>
> > > > > > wrote:
> > > > > >
> > > > > >> Could you explain why you believe this trade-off is necessary?
> We
> > > can
> > > > > >> support full SQL just fine with Accord, and I hope that we
> > > eventually
> > > > > do so.
> > > > > >>
> > > > > >> This domain is incredibly complex, so it is easy to reach wrong
> > > > > >> conclusions. I would invite you again to propose a system for
> > > > discussion
> > > > > >> that you think offers something Accord is unable to, and that
> you
> > > > > consider
> > > > > >> desirable, and we can work from there.
> > > > > >>
> > > > > >> To pre-empt some possible discussions, I am not aware of
> anything
> > we
> > > > > >> cannot do with Accord that we could do with either Calvin or
> > > Spanner.
> > > > > >> Interactive transactions are possible on top of Accord, as are
> > > > > transactions
> > > > > >> with an unknown read/write set. In each case the only cost is
> that
> > > > they
> > > > > >> would use optimistic concurrency control, which is no worse than
> > > > Spanner
> > > > > >> derivatives anyway (which I have to assume is your benchmark in
> > this
> > > > > >> regard). I do not expect to deliver either functionality
> > initially,
> > > > but
> > > > > >> Accord takes us most of the way there for both.
> > > > > >>
> > > > > >>
> > > > > >> From: Jonathan Ellis <jb...@gmail.com>
> > > > > >> Date: Wednesday, 22 September 2021 at 05:36
> > > > > >> To: dev <de...@cassandra.apache.org>
> > > > > >> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > >> Right, I'm looking for exactly a discussion on the high level
> > goals.
> > > > > >> Instead of saying "here's the goals and we ruled out X because
> Y"
> > we
> > > > > should
> > > > > >> start with a discussion around, "Approach A allows X and W,
> > > approach B
> > > > > >> allows Y and Z" and decide together what the goals should be and
> > > > > what
> > > > > >> we are willing to trade to get those goals, e.g., are we willing
> > to
> > > > > give up
> > > > > >> global strict serializability to get the ability to support full
> > > SQL.
> > > > > Both
> > > > > >> of these are nice to have!
> > > > > >>
> > > > > >> On Tue, Sep 21, 2021 at 9:52 PM benedict@apache.org <
> > > > > benedict@apache.org>
> > > > > >> wrote:
> > > > > >>
> > > > > >>> Hi Jonathan,
> > > > > >>>
> > > > > >>> These other systems are incompatible with the goals of the
> CEP. I
> > > do
> > > > > >>> discuss them (besides 2PC) in both the whitepaper and the CEP,
> > and
> > > > will
> > > > > >>> summarise that discussion below. A true and accurate comparison
> > of
> > > > > these
> > > > > >>> other systems is essentially intractable, as there are complex
> > > > > subtleties
> > > > > >>> to each flavour, and those who are interested would be better
> > > served
> > > > by
> > > > > >>> performing their own research.
> > > > > >>>
> > > > > >>> I think it is more productive to focus on what we want to
> achieve
> > > as
> > > > a
> > > > > >>> community. If you believe the goals of this CEP are wrong for
> the
> > > > > >> project,
> > > > > >>> let’s focus on that. If you want to compare and contrast
> specific
> > > > > facets
> > > > > >> of
> > > > > >>> alternative systems that you consider to be preferable in some
> > > > > dimension,
> > > > > >>> let’s do that here or in a Q&A as proposed by Joey.
> > > > > >>>
> > > > > >>> The relevant goals are that we:
> > > > > >>>
> > > > > >>>
> > > > > >>>  1.  Guarantee strict serializable isolation on commodity
> > hardware
> > > > > >>>  2.  Scale to any cluster size
> > > > > >>>  3.  Achieve optimal latency
> > > > > >>>
> > > > > >>> The approach taken by Spanner derivatives is rejected by (1)
> > > because
> > > > > they
> > > > > >>> guarantee only Serializable isolation (they additionally fail
> > (3)).
> > > > > From
> > > > > >>> watching talks by YugaByte, and inferring from Cockroach’s
> > > > > >>> panic-cluster-death under clock skew, this is clearly
> considered
> > by
> > > > > >>> everyone to be undesirable but necessary to achieve
> scalability.
> > > > > >>>
> > > > > >>> The approach taken by FaunaDB (Calvin) is rejected by (2)
> because
> > > its
> > > > > >>> sequencing layer requires a global leader process for the
> > cluster,
> > > > > which
> > > > > >> is
> > > > > >>> incompatible with Cassandra’s scalability requirements. It
> > > > additionally
> > > > > >>> fails (3) for global clients.
> > > > > >>>
> > > > > >>> Two phase commit fails (3). As an aside, AFAICT DynamoDB is
> > today a
> > > > > >>> Spanner clone for its multi-key transaction functionality, not
> > 2PC.
> > > > > >>>
> > > > > >>> Systems such as RAMP with even weaker isolation are not
> > considered
> > > > for
> > > > > >> the
> > > > > >>> simple reason that they do not even claim to meet (1).
> > > > > >>>
> > > > > >>> If we want to additionally offer weaker isolation levels than
> > > > > >>> Serializable, such as that provided by the recent RAMP-TAO
> paper,
> > > > > >> Cassandra
> > > > > >>> is likely able to support multiple distinct transaction layers
> > that
> > > > > >> operate
> > > > > >>> independently. I would encourage you to file a CEP to explore
> how
> > > we
> > > > > can
> > > > > >>> meet these distinct use cases, but I consider them to be
> niche. I
> > > > > expect
> > > > > >>> that a majority of our user base desire strict serializable
> > > > isolation,
> > > > > >> and
> > > > > >>> certainly no less than serializable isolation, to augment the
> > > > existing
> > > > > >>> weaker isolation offered by quorum reads and writes.
> > > > > >>>
> > > > > >>> I would tangentially note that we are not an AP database under
> > > normal
> > > > > >>> recommended operation. A minority in any network partition
> cannot
> > > > reach
> > > > > >>> QUORUM, so under recommended usage we are a high-availability
> > > > > leaderless
> > > > > >> CP
> > > > > >>> database.
> > > > > >>>
> > > > > >>>
> > > > > >>> From: Jonathan Ellis <jb...@gmail.com>
> > > > > >>> Date: Tuesday, 21 September 2021 at 23:45
> > > > > >>> To: dev <de...@cassandra.apache.org>
> > > > > >>> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > >>> Benedict, thanks for taking the lead in putting this together.
> > > Since
> > > > > >>> Cassandra is the only relevant database today designed around a
> > > > > >> leaderless
> > > > > >>> architecture, it's quite likely that we'll be better served
> with
> > a
> > > > > custom
> > > > > >>> transaction design instead of trying to retrofit one from CP
> > > systems.
> > > > > >>>
> > > > > >>> The whitepaper here is a good description of the consensus
> > > algorithm
> > > > > >> itself
> > > > > >>> as well as its robustness and stability characteristics, and
> its
> > > > > >> comparison
> > > > > >>> with other state-of-the-art consensus algorithms is very
> useful.
> > > In
> > > > > the
> > > > > >>> context of Cassandra, where a consensus algorithm is only part
> of
> > > > what
> > > > > >> will
> > > > > >>> be implemented, I'd like to see a more complete evaluation of
> the
> > > > > >>> transactional side of things as well, including performance
> > > > > >> characteristics
> > > > > >>> as well as the types of transactions that can be supported and
> at
> > > > > least a
> > > > > >>> general idea of what it would look like applied to Cassandra.
> > This
> > > > will
> > > > > >>> allow the PMC to make a more informed decision about what
> > tradeoffs
> > > > are
> > > > > >>> best for the entire long-term project of first supplementing
> and
> > > > > >> ultimately
> > > > > >>> replacing LWT.
> > > > > >>>
> > > > > >>> (Allowing users to mix LWT and AP Cassandra operations against
> > the
> > > > same
> > > > > >>> rows was probably a mistake, so in contrast with LWT we’re not
> > > > looking
> > > > > >> for
> > > > > >>> something fast enough for occasional use but rather something
> > > within
> > > > a
> > > > > >>> reasonable factor of AP operations, appropriate to being the
> only
> > > way
> > > > > to
> > > > > >>> interact with tables declared as such.)
> > > > > >>>
> > > > > >>> Besides Accord, this should cover
> > > > > >>>
> > > > > >>> - Calvin and FaunaDB
> > > > > >>> - A Spanner derivative (no opinion on whether that should be
> > > > Cockroach
> > > > > or
> > > > > >>> Yugabyte, I don’t think it’s necessary to cover both)
> > > > > >>> - A 2PC implementation (the Accord paper mentions DynamoDB but
> I
> > > > > suspect
> > > > > >>> there is more public information about MongoDB)
> > > > > >>> - RAMP
> > > > > >>>
> > > > > >>> Here’s an example of what I mean:
> > > > > >>>
> > > > > >>> =Calvin=
> > > > > >>>
> > > > > >>> Approach: global consensus (Paxos in Calvin, Raft in FaunaDB)
> to
> > > > order
> > > > > >>> transactions, then replicas execute the transactions
> > independently
> > > > with
> > > > > >> no
> > > > > >>> further coordination.  No SPOF.  Transactions are batched by
> each
> > > > > >> sequencer
> > > > > >>> to keep this from becoming a bottleneck.
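
The "order first, then execute independently" idea described above can be sketched as follows. This is a toy illustration and not Calvin's or FaunaDB's actual implementation: the log, `deposit`, and `transfer` are hypothetical names, and the only assumption is that every replica receives the same ordered log of deterministic transactions.

```python
def apply_log(log):
    # Each replica starts from the same state and applies the agreed
    # order of transactions with no further coordination.
    state = {}
    for txn in log:
        txn(state)
    return state


def deposit(account, amount):
    def txn(state):
        state[account] = state.get(account, 0) + amount
    return txn


def transfer(src, dst, amount):
    def txn(state):
        # Deterministic logic: the outcome depends only on the state
        # produced by earlier transactions in the log.
        if state.get(src, 0) >= amount:
            state[src] = state.get(src, 0) - amount
            state[dst] = state.get(dst, 0) + amount
    return txn
```

Because the transactions are deterministic, two replicas that call `apply_log` on the same log end in the same state, which is why the full transaction has to be known before it is sequenced.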
> > > > > >>>
> > > > > >>> Performance: Calvin paper (published 2012) reports linear
> scaling
> > > of
> > > > > >> TPC-C
> > > > > >>> New Order up to 500,000 transactions/s on 100 machines (EC2 XL
> > > > machines
> > > > > >>> with 7GB ram and 8 virtual cores).  Note that TPC-C New Order
> is
> > > > > composed
> > > > > >>> of four reads and four writes, so this is effectively 2M reads
> > and
> > > 2M
> > > > > >>> writes as we normally measure them in C*.
> > > > > >>>
> > > > > >>> Calvin supports mixed read/write transactions, but because the
> > > > > >> transaction
> > > > > >>> execution logic requires knowing all partition keys in advance
> to
> > > > > ensure
> > > > > >>> that all replicas can reproduce the same results with no
> > > > coordination,
> > > > > >>> reads against non-PK predicates must be done ahead of time
> > > > > >> (transparently,
> > > > > >>> by the server) to determine the set of keys, and this must be
> > > retried
> > > > > if
> > > > > >>> the set of rows affected is updated before the actual
> transaction
> > > > > >> executes.
> > > > > >>>
> > > > > >>> Batching and global consensus adds latency -- 100ms in the
> Calvin
> > > > paper
> > > > > >> and
> > > > > >>> apparently about 50ms in FaunaDB.  Glass half full: all
> > > transactions
> > > > > >>> (including multi-partition updates) are equally performant in
> > > Calvin
> > > > > >> since
> > > > > >>> the coordination is handled up front in the sequencing step.
> > Glass
> > > > > half
> > > > > >>> empty: even single-row reads and writes have to pay the full
> > > > > coordination
> > > > > >>> cost.  Fauna has optimized this away for reads but I am not
> aware
> > > of
> > > > a
> > > > > >>> description of how they changed the design to allow this.
> > > > > >>>
> > > > > >>> Functionality and limitations: since the entire transaction
> must
> > be
> > > > > known
> > > > > >>> in advance to allow coordination-less execution at the
> replicas,
> > > > Calvin
> > > > > >>> cannot support interactive transactions at all.  FaunaDB
> > mitigates
> > > > this
> > > > > >> by
> > > > > >>> allowing server-side logic to be included, but a Calvin
> approach
> > > will
> > > > > >> never
> > > > > >>> be able to offer SQL compatibility.
> > > > > >>>
> > > > > >>> Guarantees: Calvin transactions are strictly serializable.
> There
> > > is
> > > > no
> > > > > >>> additional complexity or performance hit to generalizing to
> > > multiple
> > > > > >>> regions, apart from the speed of light.  And since Calvin is
> > > already
> > > > > >> paying
> > > > > >>> a batching latency penalty, this is less painful than for other
> > > > > systems.
> > > > > >>>
> > > > > >>> Application to Cassandra: B-.  Distributed transactions are
> > handled
> > > > by
> > > > > >> the
> > > > > >>> sequencing and scheduling layers, which are leaderless, and
> > > Calvin’s
> > > > > >>> requirements for the storage layer are easily met by C*.  But
> > > Calvin
> > > > > also
> > > > > >>> requires a global consensus protocol and LWT is almost
> certainly
> > > not
> > > > > >>> sufficiently performant, so this would require ZK or etcd
> > > (reasonable
> > > > > >> for a
> > > > > >>> library approach but not for replacing LWT in C* itself), or an
> > > > > >>> implementation of Accord.  I don’t believe Calvin would require
> > > > > >> additional
> > > > > >>> table-level metadata in Cassandra.
> > > > > >>>
> > > > > >>> On Sun, Sep 5, 2021 at 9:33 AM benedict@apache.org <
> > > > > benedict@apache.org>
> > > > > >>> wrote:
> > > > > >>>
> > > > > > >>>> Wiki:
> > > > > > >>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-15%3A+General+Purpose+Transactions
> > > > > > >>>> Whitepaper:
> > > > > > >>>> https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf
> > > > > > >>>> <https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf?version=1&modificationDate=1630847736966&api=v2>
> > > > > >>>> Prototype: https://github.com/belliottsmith/accord
> > > > > >>>>
> > > > > >>>> Hi everyone, I’d like to propose this CEP for adoption by the
> > > > > >> community.
> > > > > >>>>
> > > > > >>>> Cassandra has benefitted from LWTs for many years, but
> > application
> > > > > >>>> developers that want to ensure consistency for complex
> > operations
> > > > must
> > > > > >>>> either accept the scalability bottleneck of serializing all
> > > related
> > > > > >> state
> > > > > >>>> through a single partition, or layer a complex state machine
> on
> > > top
> > > > of
> > > > > >>> the
> > > > > >>>> database. These are sophisticated and costly activities that
> our
> > > > users
> > > > > >>>> should not be expected to undertake. Since distributed
> databases
> > > are
> > > > > >>>> beginning to offer distributed transactions with fewer
> caveats,
> > it
> > > > is
> > > > > >>> past
> > > > > >>>> time for Cassandra to do so as well.
> > > > > >>>>
> > > > > >>>> This CEP proposes the use of several novel techniques that
> build
> > > > upon
> > > > > >>>> research (that followed EPaxos) to deliver (non-interactive)
> > > general
> > > > > >>>> purpose distributed transactions. The approach is outlined in
> > the
> > > > > >>> wikipage
> > > > > >>>> and in more detail in the linked whitepaper. Importantly, by
> > > > adopting
> > > > > >>> this
> > > > > >>>> approach we will be the _only_ distributed database to offer
> > > global,
> > > > > >>>> scalable, strict serializable transactions in one wide area
> > > > > round-trip.
> > > > > >>>> This would represent a significant improvement in the state of
> > the
> > > > > art,
> > > > > >>>> both in the academic literature and in commercial or open
> source
> > > > > >>> offerings.
> > > > > >>>>
> > > > > >>>> This work has been partially realised in a prototype. This
> > partial
> > > > > >>>> prototype has been verified against Jepsen.io’s Maelstrom
> > library
> > > > and
> > > > > >>>> dedicated in-tree strict serializability verification tools,
> but
> > > > much
> > > > > >>> work
> > > > > >>>> remains for the work to be production capable and integrated
> > into
> > > > > >>> Cassandra.
> > > > > >>>>
> > > > > >>>> I propose including the prototype in the project as a new
> source
> > > > > >>>> repository, to be developed as a standalone library for
> > > integration
> > > > > >> into
> > > > > >>>> Cassandra. I hope the community sees the important value
> > > proposition
> > > > > of
> > > > > >>>> this proposal, and will adopt the CEP after this discussion,
> so
> > > that
> > > > > >> the
> > > > > >>>> library and its integration into Cassandra can be developed in
> > > > > parallel
> > > > > >>> and
> > > > > >>>> with the involvement of the wider community.
> > > > > >>>>
> > > > > >>>
> > > > > >>>
> > > > > >>> --
> > > > > >>> Jonathan Ellis
> > > > > >>> co-founder, http://www.datastax.com
> > > > > >>> @spyced
> > > > > >>>
> > > > > >>
> > > > > >>
> > > > > >> --
> > > > > >> Jonathan Ellis
> > > > > >> co-founder, http://www.datastax.com
> > > > > >> @spyced
> > > > > >>
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Jonathan Ellis
> > > > > > co-founder, http://www.datastax.com
> > > > > > @spyced
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Jonathan Ellis
> > > > co-founder, http://www.datastax.com
> > > > @spyced
> > > >
> > >
> >
>

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by Paulo Motta <pa...@gmail.com>.
Can you just answer what palpable feature will be available once this CEP
lands? This is still not clear to me (and perhaps to others) from the
current CEP structure. The current document thoroughly details the
protocol, but in my view it fails to illustrate what specific APIs, methods,
and modules will become available to developers, how it fits into the larger
picture and interacts with existing modules (if at all), and perhaps a few
examples of how it can be used to build features on top.

On Fri, 1 Oct 2021 at 11:10, benedict@apache.org <
benedict@apache.org> wrote:

> I’m not, though it might seem that way. I disagree with your views about
> how CEPs should be structured. Since the CEP process was itself codified via
> the CEP process, if you want to recodify how CEPs work, the correct way is
> via the CEP process itself.
>
> The discussion is being drawn in multiple directions away from the CEP
> itself, and I am trying to keep this particular thread focused on the
> business at hand, not meta discussions around CEP structure that will no
> doubt be unproductive given likely irreconcilable views about the topic,
> nor discussions about other CEPs that could have been.
>
> If you want to start a separate exploratory discussion thread about CEP
> structure without filing a CEP feel free to do so.
>
>
> From: Paulo Motta <pa...@gmail.com>
> Date: Friday, 1 October 2021 at 15:04
> To: Cassandra DEV <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > If you want to impose your views on CEP structure on others, please file
> a CEP with the additional restrictions and guidance you want to impose and
> start a discussion thread. I can then respond in detail to why I perceive
> this approach to be flawed, in a dedicated context.
>
> This sounds very kafkaesque. You know I won't file a meta-CEP to change the
> structure of CEPs, so you're just using this as an excuse to shut down the
> discussion on the lack of clarity on what actual palpable feature will be
> available once the CEP lands. :-)
>
> I'm just providing my humble feedback on how a CEP could be more digestible
> and easier to consume from an external point of view, and this seems like
> an appropriate and contextualized place to voice this opinion which is
> perhaps shared by others.
>
> On Fri, 1 Oct 2021 at 10:55, benedict@apache.org <
> benedict@apache.org> wrote:
>
> > I disagree with you. However, this is the wrong forum to have a meta
> > discussion about how CEPs should be structured.
> >
> > If you want to impose your views on CEP structure on others, please file
> a
> > CEP with the additional restrictions and guidance you want to impose and
> > start a discussion thread. I can then respond in detail to why I perceive
> > this approach to be flawed, in a dedicated context.
> >
> >
> > From: Paulo Motta <pa...@gmail.com>
> > Date: Friday, 1 October 2021 at 14:48
> > To: Cassandra DEV <de...@cassandra.apache.org>
> > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > >  The proposal as it stands today is exceptionally thorough, more so
> than
> > any other CEP to date, or any CEP is likely to be in the near future.
> >
> > The protocol is thoroughly described, but in my view CEP is a forum to
> > discuss the high level architecture and plan for adding a full end-to-end
> > enhancement to the database, breaking it into sub-CEPs if needed, as long
> > as the full plan is known in advance, otherwise the community will not
> have
> > the context to judge the full extent and impact of the proposed
> > enhancement.
> >
> > > Since it remains unclear to me what either yourself or Jonathan want to
> > see as an alternative
> >
> > I would personally like to see something along these lines:
> >
> > CEP1: Add ACID-compliant atomic batches
> > - UX changes needed: none, CQL provides the grammar we need.
> > - Distributed transaction protocol needed: Accord (link to white paper if
> > you want specific details about the protocol)
> > - High-level architecture: what new components will be added, how
> existing
> > components will be modified, what new messages will be added, what new
> > configuration knobs will be introduced, what are the milestones of the
> > project, etc.
> >
> > CEP2: Make LWT faster and more reliable
> > - UX changes needed: none
> > - Distributed transaction protocol needed: Accord, already added by
> > previous CEP.
> > - High-level architecture: blablabla... and so on.
> >
> > On Fri, 1 Oct 2021 at 10:19, benedict@apache.org <
> > benedict@apache.org> wrote:
> >
> > > I think this is getting circular and unproductive. Basic disagreements
> > > about whether the CEP specifies a feature I am inclined to leave for a
> > > vote. In my view the CEP specifies several features, both immediate
> ones
> > > for the user (ACID batches and multi-key LWTs) and developer-focused
> ones
> > > around ground-breaking semantics that will be enabled.
> > >
> > > The proposal as it stands today is exceptionally thorough, more so than
> > > any other CEP to date, or any CEP is likely to be in the near future.
> > >
> > > This is a Cassandra Enhancement *Proposal*, and at some point we have
> to
> > > engage with what is proposed, not what you might like to be proposed.
> > Since
> > > it remains unclear to me what either yourself or Jonathan want to see
> as
> > an
> > > alternative, at this point it would seem more productive to produce
> your
> > > own proposals for the community to consider. It is possible for
> multiple
> > > transaction systems to co-exist, if you feel this is necessary.
> > >
> > >
> > >
> > > From: Paulo Motta <pa...@gmail.com>
> > > Date: Friday, 1 October 2021 at 13:58
> > > To: Cassandra DEV <de...@cassandra.apache.org>
> > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > I share similar feelings as jbellis that this proposal seems to be focusing
> > > on the protocol itself but lacking the actual feature that will use the
> > > protocol, which IMO is a key element to discuss in a CEP.
> > >
> > > It's similar to saying: hey I want to add this Tries Serialization
> > Protocol
> > > to Cassandra, but not providing specific details of how this protocol
> is
> > > going to be used.
> > >
> > > I think the right route for a CEP is to describe the feature that will
> be
> > > added to the database and the protocol is a mere requirement of the
> > > high-level feature, for example:
> > >
> > > CEP: Add Trie-backed memtable
> > > - Trie Serialization Protocol: implementation detail of the above CEP
> > >
> > > What is the difficulty of taking this approach, picking one of the
> myriad
> > > of features that will be enabled by Accord and using that as the
> initial
> > > CEP to introduce the protocol to the database?
> > >
> > > On Fri, 1 Oct 2021 at 08:37, benedict@apache.org <
> > > benedict@apache.org> wrote:
> > >
> > > > Actually, thinking about it again, the simple optimistic protocol
> would
> > > in
> > > > fact guarantee system forward progress (i.e. independent of
> transaction
> > > > formulation).
> > > >
> > > >
> > > > From: benedict@apache.org <be...@apache.org>
> > > > Date: Friday, 1 October 2021 at 09:14
> > > > To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> > > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > Hi Jonathan,
> > > >
> > > > It would be great if we could achieve a bandwidth higher than 1-2
> short
> > > > emails per week. It remains unclear to me what your goal is, and it
> > would
> > > > help if you could make a statement like “I want Cassandra to be able
> to
> > > do
> > > > X” so that we can respond directly to it. I am also available to have
> > > > another call, in which we can have a back and forth, please feel free
> > to
> > > > propose a London-compatible time within the next week that is
> suitable
> > > for
> > > > you.
> > > >
> > > > In my opinion we are at risk of veering off-topic, though. This CEP
> is
> > > not
> > > > to deliver interactive transactions, and to my knowledge nobody is
> > > > proposing a CEP for interactive transactions. So, for the CEP at hand
> > the
> > > > salient question seems: does this CEP prevent us from implementing
> > > > interactive transactions with properties X, Y, Z in future? To which
> > the
> > > > answer is almost certainly no.
> > > >
> > > > However, to continue the discussion and respond directly to your
> > queries,
> > > > I believe we agree on the definition of an interactive transaction.
> > > >
> > > > Two protocols were loosely outlined. The first, using timestamps for
> > > > optimistic concurrency control, would indeed involve the possibility
> of
> > > > aborts. It would not however inherently adopt the issue of LWTs where
> > no
> > > > transaction is able to make progress. Whether or not progress is
> > > guaranteed
> > > > (in a livelock-free sense) would depend on the structure of the
> > > > transactions that were interfering.
> > > >
> > > > This approach has the advantage of being very simple to implement, so
> > > that
> > > > we could realistically support interactive transactions quite
> quickly.
> > It
> > > > has the additional advantage that transactions would execute very
> > quickly
> > > > by avoiding the WAN during construction, and as a result may in
> > practice
> > > > experience fewer aborts than protocols that guarantee
> livelock-freedom.
> > > >
> > > > The second protocol proposed using read/write intents and would be
> able
> > > to
> > > > support almost any behaviour you want. We could even utilise
> > pessimistic
> > > > concurrency control, or anything in-between. This is its own huge
> > design
> > > > space, and discussion of this approach and the trade-offs that could
> be
> > > > made is (in my opinion) entirely out of scope for this CEP.
> > > >
> > > >
> > > > From: Jonathan Ellis <jb...@gmail.com>
> > > > Date: Friday, 1 October 2021 at 05:00
> > > > To: dev <de...@cassandra.apache.org>
> > > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > The obstacle for me is you've provided a protocol but not a fully
> > fleshed
> > > > out architecture, so it's hard to fill in some of the blanks.  But it
> > > looks
> > > > to me like optimistic concurrency control for interactive
> transactions
> > > > applied to Accord would leave you in a LWT-like situation under
> fairly
> > > > light contention where nobody actually makes progress due to retries.
> > > >
> > > > To make sure we're talking about the same thing, as Henrik pointed
> out,
> > > > interactive transactions mean multiple round trips from the client
> > > within a
> > > > transaction.  For example, here
> > > > <https://github.com/apavlo/py-tpcc/blob/master/pytpcc/drivers/sqlitedriver.py#L213>
> > > > is a simple implementation of the TPC-C New Order transaction.  The
> > > > high level logic (via
> > > > <https://courses.cs.washington.edu/courses/csep545/01wi/lectures/class1/tsld039.htm>)
> > > > is,
> > > >
> > > >    1. Get records describing a warehouse, customer, & district
> > > >    2. Update the district
> > > >    3. Increment next available order number
> > > >    4. Insert record into Order and New-Order tables
> > > >    5. For 5-15 items, get Item record, get/update Stock record
> > > >    6. Insert Order-Line Record
> > > >
> > > > As you can see, this requires a lot of client-side logic mixed in
> with
> > > the
> > > > actual SQL commands.
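
To make "client-side logic mixed in with the SQL" concrete, here is a heavily simplified sketch of an interactive New Order style transaction using Python's built-in sqlite3 module. It is not the linked py-tpcc driver and not the TPC-C schema: the tables, columns, and the `make_db` and `new_order` helpers are hypothetical and chosen only to show reads whose results decide the later writes.

```python
import sqlite3


def make_db():
    # isolation_level=None puts sqlite3 in autocommit mode, so the explicit
    # BEGIN / COMMIT / ROLLBACK statements below control the transaction.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.executescript("""
        CREATE TABLE district (id INTEGER PRIMARY KEY, next_o_id INTEGER);
        CREATE TABLE stock (item_id INTEGER PRIMARY KEY, quantity INTEGER);
        CREATE TABLE orders (id INTEGER, d_id INTEGER);
        CREATE TABLE order_line (o_id INTEGER, line_no INTEGER,
                                 item_id INTEGER, qty INTEGER);
        INSERT INTO district VALUES (1, 3001);
        INSERT INTO stock VALUES (10, 50), (11, 5);
    """)
    return conn


def new_order(conn, d_id, items):
    """items is a list of (item_id, quantity); each execute() is a round trip."""
    cur = conn.cursor()
    cur.execute("BEGIN")
    try:
        # Read the next order number, then use it in later statements.
        (o_id,) = cur.execute(
            "SELECT next_o_id FROM district WHERE id = ?", (d_id,)).fetchone()
        cur.execute("UPDATE district SET next_o_id = ? WHERE id = ?",
                    (o_id + 1, d_id))
        cur.execute("INSERT INTO orders (id, d_id) VALUES (?, ?)", (o_id, d_id))
        for line_no, (item_id, qty) in enumerate(items, start=1):
            (in_stock,) = cur.execute(
                "SELECT quantity FROM stock WHERE item_id = ?",
                (item_id,)).fetchone()
            # Client-side decision that depends on a value just read.
            if in_stock < qty:
                raise ValueError("insufficient stock")
            cur.execute("UPDATE stock SET quantity = ? WHERE item_id = ?",
                        (in_stock - qty, item_id))
            cur.execute("INSERT INTO order_line VALUES (?, ?, ?, ?)",
                        (o_id, line_no, item_id, qty))
        cur.execute("COMMIT")
        return o_id
    except Exception:
        cur.execute("ROLLBACK")
        raise
```

The point of the sketch is that the set of statements, and their parameters, is not known until earlier statements in the same transaction have returned to the client.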
> > > >
> > > >
> > > > On Thu, Sep 30, 2021 at 2:30 AM benedict@apache.org <
> > benedict@apache.org
> > > >
> > > > wrote:
> > > >
> > > > > Essentially this, although I think in practice we will need to
> track
> > > each
> > > > > partition’s timestamp separately (or optionally for reduced
> > conflicts,
> > > > each
> > > > > row or datum’s), and make them all part of the conditional
> > application
> > > of
> > > > > the transaction - at least for strict-serializability.
> > > > >
> > > > > The alternative is to insert read/write intents for the transaction
> > > > during
> > > > > each step, and to confirm they are still valid on commit, but this
> > > > approach
> > > > > would require a WAN round-trip for each step in the interactive
> > > > > transaction, whereas the timestamp-validating approach can use a
> LAN
> > > > > round-trip for each step besides the final one, and is also much
> > > simpler
> > > > to
> > > > > implement.
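
The round-trip comparison above can be put into rough numbers. This is a back-of-the-envelope sketch only: the function names are hypothetical, it counts network round-trips and nothing else, and any RTT values passed in are assumptions rather than measurements.

```python
def intent_latency_ms(steps, wan_rtt_ms):
    # Read/write intents: one WAN round-trip for every step.
    return steps * wan_rtt_ms


def timestamp_latency_ms(steps, lan_rtt_ms, wan_rtt_ms):
    # Timestamp validation: LAN round-trips for every step except the
    # final one, which pays a single WAN round-trip to commit.
    return (steps - 1) * lan_rtt_ms + wan_rtt_ms
```

For example, with an assumed 1 ms LAN RTT and 100 ms WAN RTT, a six-step transaction costs 600 ms of round-trips under the intent approach and 105 ms under the timestamp-validating approach.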
> > > > >
> > > > >
> > > > > From: Blake Eggleston <be...@apple.com.INVALID>
> > > > > Date: Thursday, 30 September 2021 at 05:47
> > > > > To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> > > > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > You could establish a lower timestamp bound and buffer transaction
> > > state
> > > > > on the coordinator, then make the commit an operation that only
> > applies
> > > > if
> > > > > all partitions involved haven’t been changed by a more recent
> > > timestamp.
> > > > > You could also implement mvcc either in the storage layer or for
> some
> > > > > period of time by buffering commits on each replica before
> applying.
> > > > >
> > > > > > On Sep 29, 2021, at 6:18 PM, Jonathan Ellis <jb...@gmail.com>
> > > wrote:
> > > > > >
> > > > > > How are interactive transactions possible with Accord?
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Tue, Sep 21, 2021 at 11:56 PM benedict@apache.org <
> > > > > benedict@apache.org>
> > > > > > wrote:
> > > > > >
> > > > > >> Could you explain why you believe this trade-off is necessary?
> We
> > > can
> > > > > >> support full SQL just fine with Accord, and I hope that we
> > > eventually
> > > > > do so.
> > > > > >>
> > > > > >> This domain is incredibly complex, so it is easy to reach wrong
> > > > > >> conclusions. I would invite you again to propose a system for
> > > > discussion
> > > > > >> that you think offers something Accord is unable to, and that
> you
> > > > > consider
> > > > > >> desirable, and we can work from there.
> > > > > >>
> > > > > >> To pre-empt some possible discussions, I am not aware of
> anything
> > we
> > > > > >> cannot do with Accord that we could do with either Calvin or
> > > Spanner.
> > > > > >> Interactive transactions are possible on top of Accord, as are
> > > > > transactions
> > > > > >> with an unknown read/write set. In each case the only cost is
> that
> > > > they
> > > > > > >> would use optimistic concurrency control, which is no worse than
> > > > > > >> Spanner derivatives anyway (which I have to assume is your benchmark in
> > this
> > > > > >> regard). I do not expect to deliver either functionality
> > initially,
> > > > but
> > > > > >> Accord takes us most of the way there for both.
> > > > > >>
> > > > > >>
> > > > > >> From: Jonathan Ellis <jb...@gmail.com>
> > > > > >> Date: Wednesday, 22 September 2021 at 05:36
> > > > > >> To: dev <de...@cassandra.apache.org>
> > > > > >> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > >> Right, I'm looking for exactly a discussion on the high level
> > goals.
> > > > > >> Instead of saying "here's the goals and we ruled out X because
> Y"
> > we
> > > > > should
> > > > > >> start with a discussion around, "Approach A allows X and W,
> > > approach B
> > > > > > >> allows Y and Z" and decide together what the goals should be and what
> > > > > >> we are willing to trade to get those goals, e.g., are we willing
> > to
> > > > > give up
> > > > > >> global strict serializability to get the ability to support full
> > > SQL.
> > > > > Both
> > > > > >> of these are nice to have!
> > > > > >>
> > > > > >> On Tue, Sep 21, 2021 at 9:52 PM benedict@apache.org <
> > > > > benedict@apache.org>
> > > > > >> wrote:
> > > > > >>
> > > > > >>> Hi Jonathan,
> > > > > >>>
> > > > > >>> These other systems are incompatible with the goals of the
> CEP. I
> > > do
> > > > > >>> discuss them (besides 2PC) in both the whitepaper and the CEP,
> > and
> > > > will
> > > > > >>> summarise that discussion below. A true and accurate comparison
> > of
> > > > > these
> > > > > >>> other systems is essentially intractable, as there are complex
> > > > > subtleties
> > > > > >>> to each flavour, and those who are interested would be better
> > > served
> > > > by
> > > > > >>> performing their own research.
> > > > > >>>
> > > > > >>> I think it is more productive to focus on what we want to
> achieve
> > > as
> > > > a
> > > > > >>> community. If you believe the goals of this CEP are wrong for
> the
> > > > > >> project,
> > > > > >>> let’s focus on that. If you want to compare and contrast
> specific
> > > > > facets
> > > > > >> of
> > > > > >>> alternative systems that you consider to be preferable in some
> > > > > dimension,
> > > > > >>> let’s do that here or in a Q&A as proposed by Joey.
> > > > > >>>
> > > > > >>> The relevant goals are that we:
> > > > > >>>
> > > > > >>>
> > > > > >>>  1.  Guarantee strict serializable isolation on commodity
> > hardware
> > > > > >>>  2.  Scale to any cluster size
> > > > > >>>  3.  Achieve optimal latency
> > > > > >>>
> > > > > >>> The approach taken by Spanner derivatives is rejected by (1)
> > > because
> > > > > they
> > > > > >>> guarantee only Serializable isolation (they additionally fail
> > (3)).
> > > > > From
> > > > > >>> watching talks by YugaByte, and inferring from Cockroach’s
> > > > > >>> panic-cluster-death under clock skew, this is clearly
> considered
> > by
> > > > > >>> everyone to be undesirable but necessary to achieve
> scalability.
> > > > > >>>
> > > > > >>> The approach taken by FaunaDB (Calvin) is rejected by (2)
> because
> > > its
> > > > > >>> sequencing layer requires a global leader process for the
> > cluster,
> > > > > which
> > > > > >> is
> > > > > >>> incompatible with Cassandra’s scalability requirements. It
> > > > additionally
> > > > > >>> fails (3) for global clients.
> > > > > >>>
> > > > > >>> Two phase commit fails (3). As an aside, AFAICT DynamoDB is
> > today a
> > > > > >>> Spanner clone for its multi-key transaction functionality, not
> > 2PC.
> > > > > >>>
> > > > > >>> Systems such as RAMP with even weaker isolation are not
> > considered
> > > > for
> > > > > >> the
> > > > > >>> simple reason that they do not even claim to meet (1).
> > > > > >>>
> > > > > >>> If we want to additionally offer weaker isolation levels than
> > > > > >>> Serializable, such as that provided by the recent RAMP-TAO
> paper,
> > > > > >> Cassandra
> > > > > >>> is likely able to support multiple distinct transaction layers
> > that
> > > > > >> operate
> > > > > >>> independently. I would encourage you to file a CEP to explore
> how
> > > we
> > > > > can
> > > > > >>> meet these distinct use cases, but I consider them to be
> niche. I
> > > > > expect
> > > > > >>> that a majority of our user base desire strict serializable
> > > > isolation,
> > > > > >> and
> > > > > >>> certainly no less than serializable isolation, to augment the
> > > > existing
> > > > > >>> weaker isolation offered by quorum reads and writes.
> > > > > >>>
> > > > > >>> I would tangentially note that we are not an AP database under
> > > normal
> > > > > >>> recommended operation. A minority in any network partition
> cannot
> > > > reach
> > > > > >>> QUORUM, so under recommended usage we are a high-availability
> > > > > leaderless
> > > > > >> CP
> > > > > >>> database.
> > > > > >>>
> > > > > >>>
> > > > > >>> From: Jonathan Ellis <jb...@gmail.com>
> > > > > >>> Date: Tuesday, 21 September 2021 at 23:45
> > > > > >>> To: dev <de...@cassandra.apache.org>
> > > > > >>> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > >>> Benedict, thanks for taking the lead in putting this together.
> > > Since
> > > > > >>> Cassandra is the only relevant database today designed around a
> > > > > >> leaderless
> > > > > >>> architecture, it's quite likely that we'll be better served
> with
> > a
> > > > > custom
> > > > > >>> transaction design instead of trying to retrofit one from CP
> > > systems.
> > > > > >>>
> > > > > >>> The whitepaper here is a good description of the consensus
> > > algorithm
> > > > > >> itself
> > > > > >>> as well as its robustness and stability characteristics, and
> its
> > > > > >> comparison
> > > > > >>> with other state-of-the-art consensus algorithms is very
> useful.
> > > In
> > > > > the
> > > > > >>> context of Cassandra, where a consensus algorithm is only part
> of
> > > > what
> > > > > >> will
> > > > > >>> be implemented, I'd like to see a more complete evaluation of
> the
> > > > > >>> transactional side of things as well, including performance
> > > > > >> characteristics
> > > > > >>> as well as the types of transactions that can be supported and
> at
> > > > > least a
> > > > > >>> general idea of what it would look like applied to Cassandra.
> > This
> > > > will
> > > > > >>> allow the PMC to make a more informed decision about what
> > tradeoffs
> > > > are
> > > > > >>> best for the entire long-term project of first supplementing
> and
> > > > > >> ultimately
> > > > > >>> replacing LWT.
> > > > > >>>
> > > > > >>> (Allowing users to mix LWT and AP Cassandra operations against
> > the
> > > > same
> > > > > >>> rows was probably a mistake, so in contrast with LWT we’re not
> > > > looking
> > > > > >> for
> > > > > >>> something fast enough for occasional use but rather something
> > > within
> > > > a
> > > > > >>> reasonable factor of AP operations, appropriate to being the
> only
> > > way
> > > > > to
> > > > > >>> interact with tables declared as such.)
> > > > > >>>
> > > > > >>> Besides Accord, this should cover
> > > > > >>>
> > > > > >>> - Calvin and FaunaDB
> > > > > >>> - A Spanner derivative (no opinion on whether that should be
> > > > Cockroach
> > > > > or
> > > > > >>> Yugabyte, I don’t think it’s necessary to cover both)
> > > > > >>> - A 2PC implementation (the Accord paper mentions DynamoDB but
> I
> > > > > suspect
> > > > > >>> there is more public information about MongoDB)
> > > > > >>> - RAMP
> > > > > >>>
> > > > > >>> Here’s an example of what I mean:
> > > > > >>>
> > > > > >>> =Calvin=
> > > > > >>>
> > > > > >>> Approach: global consensus (Paxos in Calvin, Raft in FaunaDB)
> to
> > > > order
> > > > > >>> transactions, then replicas execute the transactions
> > independently
> > > > with
> > > > > >> no
> > > > > >>> further coordination.  No SPOF.  Transactions are batched by
> each
> > > > > >> sequencer
> > > > > >>> to keep this from becoming a bottleneck.
> > > > > >>>
> > > > > >>> Performance: Calvin paper (published 2012) reports linear
> scaling
> > > of
> > > > > >> TPC-C
> > > > > >>> New Order up to 500,000 transactions/s on 100 machines (EC2 XL
> > > > machines
> > > > > >>> with 7GB ram and 8 virtual cores).  Note that TPC-C New Order
> is
> > > > > composed
> > > > > >>> of four reads and four writes, so this is effectively 2M reads
> > and
> > > 2M
> > > > > >>> writes as we normally measure them in C*.
> > > > > >>>
> > > > > >>> Calvin supports mixed read/write transactions, but because the
> > > > > >> transaction
> > > > > >>> execution logic requires knowing all partition keys in advance
> to
> > > > > ensure
> > > > > >>> that all replicas can reproduce the same results with no
> > > > coordination,
> > > > > >>> reads against non-PK predicates must be done ahead of time
> > > > > >> (transparently,
> > > > > >>> by the server) to determine the set of keys, and this must be
> > > retried
> > > > > if
> > > > > >>> the set of rows affected is updated before the actual
> transaction
> > > > > >> executes.
> > > > > >>>
> > > > > >>> Batching and global consensus adds latency -- 100ms in the
> Calvin
> > > > paper
> > > > > >> and
> > > > > >>> apparently about 50ms in FaunaDB.  Glass half full: all
> > > transactions
> > > > > >>> (including multi-partition updates) are equally performant in
> > > Calvin
> > > > > >> since
> > > > > >>> the coordination is handled up front in the sequencing step.
> > Glass
> > > > > half
> > > > > >>> empty: even single-row reads and writes have to pay the full
> > > > > coordination
> > > > > >>> cost.  Fauna has optimized this away for reads but I am not
> aware
> > > of
> > > > a
> > > > > >>> description of how they changed the design to allow this.
> > > > > >>>
> > > > > >>> Functionality and limitations: since the entire transaction
> must
> > be
> > > > > known
> > > > > >>> in advance to allow coordination-less execution at the
> replicas,
> > > > Calvin
> > > > > >>> cannot support interactive transactions at all.  FaunaDB
> > mitigates
> > > > this
> > > > > >> by
> > > > > >>> allowing server-side logic to be included, but a Calvin
> approach
> > > will
> > > > > >> never
> > > > > >>> be able to offer SQL compatibility.
> > > > > >>>
> > > > > >>> Guarantees: Calvin transactions are strictly serializable.
> There
> > > is
> > > > no
> > > > > >>> additional complexity or performance hit to generalizing to
> > > multiple
> > > > > >>> regions, apart from the speed of light.  And since Calvin is
> > > already
> > > > > >> paying
> > > > > >>> a batching latency penalty, this is less painful than for other
> > > > > systems.
> > > > > >>>
> > > > > >>> Application to Cassandra: B-.  Distributed transactions are
> > handled
> > > > by
> > > > > >> the
> > > > > >>> sequencing and scheduling layers, which are leaderless, and
> > > Calvin’s
> > > > > >>> requirements for the storage layer are easily met by C*.  But
> > > Calvin
> > > > > also
> > > > > >>> requires a global consensus protocol and LWT is almost
> certainly
> > > not
> > > > > >>> sufficiently performant, so this would require ZK or etcd
> > > (reasonable
> > > > > >> for a
> > > > > >>> library approach but not for replacing LWT in C* itself), or an
> > > > > >>> implementation of Accord.  I don’t believe Calvin would require
> > > > > >> additional
> > > > > >>> table-level metadata in Cassandra.
> > > > > >>>
> > > > > >>> On Sun, Sep 5, 2021 at 9:33 AM benedict@apache.org <
> > > > > benedict@apache.org>
> > > > > >>> wrote:
> > > > > >>>
> > > > > > >>>> Wiki:
> > > > > > >>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-15%3A+General+Purpose+Transactions
> > > > > > >>>> Whitepaper:
> > > > > > >>>> https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf
> > > > > > >>>> <https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf?version=1&modificationDate=1630847736966&api=v2>
> > > > > > >>>> Prototype: https://github.com/belliottsmith/accord
> > > > > >>>>
> > > > > >>>> Hi everyone, I’d like to propose this CEP for adoption by the
> > > > > >> community.
> > > > > >>>>
> > > > > >>>> Cassandra has benefitted from LWTs for many years, but
> > application
> > > > > >>>> developers that want to ensure consistency for complex
> > operations
> > > > must
> > > > > >>>> either accept the scalability bottleneck of serializing all
> > > related
> > > > > >> state
> > > > > >>>> through a single partition, or layer a complex state machine
> on
> > > top
> > > > of
> > > > > >>> the
> > > > > >>>> database. These are sophisticated and costly activities that
> our
> > > > users
> > > > > >>>> should not be expected to undertake. Since distributed
> databases
> > > are
> > > > > >>>> beginning to offer distributed transactions with fewer
> caveats,
> > it
> > > > is
> > > > > >>> past
> > > > > >>>> time for Cassandra to do so as well.
> > > > > >>>>
> > > > > >>>> This CEP proposes the use of several novel techniques that
> build
> > > > upon
> > > > > >>>> research (that followed EPaxos) to deliver (non-interactive)
> > > general
> > > > > >>>> purpose distributed transactions. The approach is outlined in
> > the
> > > > > >>> wikipage
> > > > > >>>> and in more detail in the linked whitepaper. Importantly, by
> > > > adopting
> > > > > >>> this
> > > > > >>>> approach we will be the _only_ distributed database to offer
> > > global,
> > > > > >>>> scalable, strict serializable transactions in one wide area
> > > > > round-trip.
> > > > > >>>> This would represent a significant improvement in the state of
> > the
> > > > > art,
> > > > > >>>> both in the academic literature and in commercial or open
> source
> > > > > >>> offerings.
> > > > > >>>>
> > > > > >>>> This work has been partially realised in a prototype. This
> > partial
> > > > > >>>> prototype has been verified against Jepsen.io’s Maelstrom
> > library
> > > > and
> > > > > >>>> dedicated in-tree strict serializability verification tools,
> but
> > > > much
> > > > > >>> work
> > > > > >>>> remains for the work to be production capable and integrated
> > into
> > > > > >>> Cassandra.
> > > > > >>>>
> > > > > >>>> I propose including the prototype in the project as a new
> source
> > > > > >>>> repository, to be developed as a standalone library for
> > > integration
> > > > > >> into
> > > > > >>>> Cassandra. I hope the community sees the important value
> > > proposition
> > > > > of
> > > > > >>>> this proposal, and will adopt the CEP after this discussion,
> so
> > > that
> > > > > >> the
> > > > > >>>> library and its integration into Cassandra can be developed in
> > > > > parallel
> > > > > >>> and
> > > > > >>>> with the involvement of the wider community.
> > > > > >>>>
> > > > > >>>
> > > > > >>>
> > > > > >>> --
> > > > > >>> Jonathan Ellis
> > > > > >>> co-founder, http://www.datastax.com
> > > > > >>> @spyced
> > > > > >>>
> > > > > >>
> > > > > >>
> > > > > >> --
> > > > > >> Jonathan Ellis
> > > > > >> co-founder, http://www.datastax.com
> > > > > >> @spyced
> > > > > >>
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Jonathan Ellis
> > > > > > co-founder, http://www.datastax.com
> > > > > > @spyced
> > > > >
> > > > >
> > > > >
> ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: dev-unsubscribe@cassandra.apache.org
> > > > > For additional commands, e-mail: dev-help@cassandra.apache.org
> > > > >
> > > >
> > > >
> > > > --
> > > > Jonathan Ellis
> > > > co-founder, http://www.datastax.com
> > > > @spyced
> > > >
> > >
> >
>

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by "benedict@apache.org" <be...@apache.org>.
I’m not, though it might seem that way. I disagree with your views about how CEPs should be structured. Since the CEP process was itself codified via the CEP process, if you want to recodify how CEPs work, the correct way is via the CEP process itself.

The discussion is being drawn in multiple directions away from the CEP itself, and I am trying to keep this particular thread focused on the business at hand, not meta discussions around CEP structure that will no doubt be unproductive given likely irreconcilable views about the topic, nor discussions about other CEPs that could have been.

If you want to start a separate exploratory discussion thread about CEP structure without filing a CEP feel free to do so.


From: Paulo Motta <pa...@gmail.com>
Date: Friday, 1 October 2021 at 15:04
To: Cassandra DEV <de...@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> If you want to impose your views on CEP structure on others, please file
a CEP with the additional restrictions and guidance you want to impose and
start a discussion thread. I can then respond in detail to why I perceive
this approach to be flawed, in a dedicated context.

This sounds very Kafkaesque. You know I won't file a meta-CEP to change the
structure of CEPs, so you're just using this as an excuse to shut down the
discussion on the lack of clarity about what actual palpable feature will be
available once the CEP lands. :-)

I'm just providing my humble feedback on how a CEP could be more digestible
and easier to consume from an external point of view, and this seems like
an appropriate and contextualized place to voice this opinion which is
perhaps shared by others.

On Fri, 1 Oct 2021 at 10:55, benedict@apache.org <
benedict@apache.org> wrote:

> I disagree with you. However, this is the wrong forum to have a meta
> discussion about how CEP should be structured.
>
> If you want to impose your views on CEP structure on others, please file a
> CEP with the additional restrictions and guidance you want to impose and
> start a discussion thread. I can then respond in detail to why I perceive
> this approach to be flawed, in a dedicated context.
>
>
> From: Paulo Motta <pa...@gmail.com>
> Date: Friday, 1 October 2021 at 14:48
> To: Cassandra DEV <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> >  The proposal as it stands today is exceptionally thorough, more so than
> any other CEP to date, or any CEP is likely to be in the near future.
>
> The protocol is thoroughly described, but in my view CEP is a forum to
> discuss the high level architecture and plan for adding a full end-to-end
> enhancement to the database, breaking it into sub-CEPs if needed, as long
> as the full plan is known in advance, otherwise the community will not have
> the context to judge the full extent and impact of the proposed
> enhancement.
>
> > Since it remains unclear to me what either yourself or Jonathan want to
> see as an alternative
>
> I would personally like to see something along these lines:
>
> CEP1: Add ACID-compliant atomic batches
> - UX changes needed: none, CQL provides the grammar we need.
> - Distributed transaction protocol needed: Accord (link to white paper if
> you want specific details about the protocol)
> - High-level architecture: what new components will be added, how existing
> components will be modified, what new messages will be added, what new
> configuration knobs will be introduced, what are the milestones of the
> project, etc.
>
> CEP2: Make LWT faster and more reliable
> - UX changes needed: none
> - Distributed transaction protocol needed: Accord, already added by
> previous CEP.
> - High-level architecture: blablabla... and so on.
>
> On Fri, 1 Oct 2021 at 10:19, benedict@apache.org <
> benedict@apache.org> wrote:
>
> > I think this is getting circular and unproductive. Basic disagreements
> > about whether the CEP specifies a feature I am inclined to leave for a
> > vote. In my view the CEP specifies several features, both immediate ones
> > for the user (ACID batches and multi-key LWTs) and developer-focused ones
> > around ground-breaking semantics that will be enabled.
> >
> > The proposal as it stands today is exceptionally thorough, more so than
> > any other CEP to date, or any CEP is likely to be in the near future.
> >
> > This is a Cassandra Enhancement *Proposal*, and at some point we have to
> > engage with what is proposed, not what you might like to be proposed.
> Since
> > it remains unclear to me what either yourself or Jonathan want to see as
> an
> > alternative, at this point it would seem more productive to produce your
> > own proposals for the community to consider. It is possible for multiple
> > transaction systems to co-exist, if you feel this is necessary.
> >
> >
> >
> > From: Paulo Motta <pa...@gmail.com>
> > Date: Friday, 1 October 2021 at 13:58
> > To: Cassandra DEV <de...@cassandra.apache.org>
> > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > I share similar feelings as jbellis that this proposal seems to be focusing
> > on the protocol itself but lacking the actual feature that will use the
> > protocol, which IMO is a key element to discuss in a CEP.
> >
> > It's similar to saying: hey, I want to add this Trie Serialization Protocol
> > to Cassandra, but not providing specific details of how this protocol is
> > going to be used.
> >
> > I think the right route for a CEP is to describe the feature that will be
> > added to the database and the protocol is a mere requirement of the
> > high-level feature, for example:
> >
> > CEP: Add Trie-backed memtable
> > - Trie Serialization Protocol: implementation detail of the above CEP
> >
> > What is the difficulty of taking this approach, picking one of the myriad
> > of features that will be enabled by Accord and using that as the initial
> > CEP to introduce the protocol to the database?
> >
> > On Fri, 1 Oct 2021 at 08:37, benedict@apache.org <
> > benedict@apache.org> wrote:
> >
> > > Actually, thinking about it again, the simple optimistic protocol would
> > in
> > > fact guarantee system forward progress (i.e. independent of transaction
> > > formulation).
> > >
> > >
> > > From: benedict@apache.org <be...@apache.org>
> > > Date: Friday, 1 October 2021 at 09:14
> > > To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > Hi Jonathan,
> > >
> > > It would be great if we could achieve a bandwidth higher than 1-2 short
> > > emails per week. It remains unclear to me what your goal is, and it
> would
> > > help if you could make a statement like “I want Cassandra to be able to
> > do
> > > X” so that we can respond directly to it. I am also available to have
> > > another call, in which we can have a back and forth, please feel free
> to
> > > propose a London-compatible time within the next week that is suitable
> > for
> > > you.
> > >
> > > In my opinion we are at risk of veering off-topic, though. This CEP is
> > not
> > > to deliver interactive transactions, and to my knowledge nobody is
> > > proposing a CEP for interactive transactions. So, for the CEP at hand
> the
> > > salient question seems: does this CEP prevent us from implementing
> > > interactive transactions with properties X, Y, Z in future? To which
> the
> > > answer is almost certainly no.
> > >
> > > However, to continue the discussion and respond directly to your
> queries,
> > > I believe we agree on the definition of an interactive transaction.
> > >
> > > Two protocols were loosely outlined. The first, using timestamps for
> > > optimistic concurrency control, would indeed involve the possibility of
> > > aborts. It would not however inherently adopt the issue of LWTs where
> no
> > > transaction is able to make progress. Whether or not progress is
> > guaranteed
> > > (in a livelock-free sense) would depend on the structure of the
> > > transactions that were interfering.
> > >
> > > This approach has the advantage of being very simple to implement, so
> > that
> > > we could realistically support interactive transactions quite quickly.
> It
> > > has the additional advantage that transactions would execute very
> quickly
> > > by avoiding the WAN during construction, and as a result may in
> practice
> > > experience fewer aborts than protocols that guarantee livelock-freedom.
> > >
> > > The second protocol proposed using read/write intents and would be able
> > to
> > > support almost any behaviour you want. We could even utilise
> pessimistic
> > > concurrency control, or anything in-between. This is its own huge
> design
> > > space, and discussion of this approach and the trade-offs that could be
> > > made is (in my opinion) entirely out of scope for this CEP.
> > >
> > >
> > > From: Jonathan Ellis <jb...@gmail.com>
> > > Date: Friday, 1 October 2021 at 05:00
> > > To: dev <de...@cassandra.apache.org>
> > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > The obstacle for me is you've provided a protocol but not a fully fleshed
> > > > out architecture, so it's hard to fill in some of the blanks.  But it looks
> > > > to me like optimistic concurrency control for interactive transactions
> > > > applied to Accord would leave you in a LWT-like situation under fairly
> > > > light contention where nobody actually makes progress due to retries.
> > > >
> > > > To make sure we're talking about the same thing, as Henrik pointed out,
> > > > interactive transactions mean multiple round trips from the client within a
> > > > transaction.  For example, here
> > > > <https://github.com/apavlo/py-tpcc/blob/master/pytpcc/drivers/sqlitedriver.py#L213>
> > > > is a simple implementation of the TPC-C New Order transaction.  The high
> > > > level logic (via
> > > > <https://courses.cs.washington.edu/courses/csep545/01wi/lectures/class1/tsld039.htm>)
> > > > is,
> > > >
> > > >    1. Get records describing a warehouse, customer, & district
> > > >    2. Update the district
> > > >    3. Increment next available order number
> > > >    4. Insert record into Order and New-Order tables
> > > >    5. For 5-15 items, get Item record, get/update Stock record
> > > >    6. Insert Order-Line Record
> > > >
> > > > As you can see, this requires a lot of client-side logic mixed in with the
> > > > actual SQL commands.
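
To illustrate what "client-side logic mixed in with the SQL" looks like, here is a heavily simplified sketch of a New Order style interactive transaction against SQLite. The schema, the `make_demo_db` and `new_order` helpers, and the restock rule are assumptions made up for this illustration; they are not the TPC-C schema and not the linked py-tpcc driver. Each `execute()` stands in for one client round trip, and the value written to `stock` depends on what the client just read.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create a tiny in-memory schema (hypothetical, far simpler than TPC-C)."""
    # isolation_level=None puts sqlite3 in autocommit mode, so the
    # BEGIN / COMMIT / ROLLBACK statements below are issued explicitly.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.executescript(
        """
        CREATE TABLE district (d_id INTEGER PRIMARY KEY, next_o_id INTEGER);
        CREATE TABLE stock (i_id INTEGER PRIMARY KEY, quantity INTEGER);
        CREATE TABLE orders (o_id INTEGER, d_id INTEGER);
        CREATE TABLE order_line (o_id INTEGER, i_id INTEGER, quantity INTEGER);
        INSERT INTO district VALUES (1, 3001);
        INSERT INTO stock VALUES (10, 50), (11, 12);
        """
    )
    return conn


def new_order(conn: sqlite3.Connection, d_id: int, items: dict) -> int:
    """Place an order; every execute() stands in for one client round trip."""
    conn.execute("BEGIN")
    try:
        # Read the next order number, then bump it.
        (o_id,) = conn.execute(
            "SELECT next_o_id FROM district WHERE d_id = ?", (d_id,)
        ).fetchone()
        conn.execute(
            "UPDATE district SET next_o_id = next_o_id + 1 WHERE d_id = ?", (d_id,)
        )
        conn.execute("INSERT INTO orders VALUES (?, ?)", (o_id, d_id))
        for i_id, qty in items.items():
            (stock,) = conn.execute(
                "SELECT quantity FROM stock WHERE i_id = ?", (i_id,)
            ).fetchone()
            # Client-side logic between statements: an illustrative restock
            # rule decides the value written back, based on what was just read.
            remaining = stock - qty
            if remaining < 10:
                remaining += 91
            conn.execute(
                "UPDATE stock SET quantity = ? WHERE i_id = ?", (remaining, i_id)
            )
            conn.execute(
                "INSERT INTO order_line VALUES (?, ?, ?)", (o_id, i_id, qty)
            )
        conn.execute("COMMIT")
        return o_id
    except Exception:
        # Any failure (for example an unknown item) undoes the whole order.
        conn.execute("ROLLBACK")
        raise
```

Because the set of statements and their arguments is only known as the client works through its reads, a transaction like this cannot be submitted to the database in one piece, which is the property under discussion here.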
> > >
> > >
> > > On Thu, Sep 30, 2021 at 2:30 AM benedict@apache.org <
> benedict@apache.org
> > >
> > > wrote:
> > >
> > > > > Essentially this, although I think in practice we will need to track each
> > > > > partition’s timestamp separately (or optionally for reduced conflicts, each
> > > > > row or datum’s), and make them all part of the conditional application of
> > > > > the transaction - at least for strict-serializability.
> > > > >
> > > > > The alternative is to insert read/write intents for the transaction during
> > > > > each step, and to confirm they are still valid on commit, but this approach
> > > > > would require a WAN round-trip for each step in the interactive
> > > > > transaction, whereas the timestamp-validating approach can use a LAN
> > > > > round-trip for each step besides the final one, and is also much simpler to
> > > > > implement.
> > > >
> > > >
> > > > From: Blake Eggleston <be...@apple.com.INVALID>
> > > > Date: Thursday, 30 September 2021 at 05:47
> > > > To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> > > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > You could establish a lower timestamp bound and buffer transaction state
> > > > > on the coordinator, then make the commit an operation that only applies if
> > > > > all partitions involved haven’t been changed by a more recent timestamp.
> > > > > You could also implement MVCC either in the storage layer or for some
> > > > > period of time by buffering commits on each replica before applying.
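
A minimal sketch of the timestamp-validating commit discussed above, assuming a single-process toy store rather than anything resembling Accord or Cassandra internals. The `Store` and `Transaction` classes and all of their methods are hypothetical names invented for this illustration. Reads and writes are buffered on the "coordinator" (the `Transaction` object), the last-write timestamp of each partition is recorded when it is first touched, and the commit applies only if none of those partitions has been written since.

```python
from dataclasses import dataclass, field


@dataclass
class Store:
    """Toy store: each partition has a value and a last-write timestamp."""
    data: dict = field(default_factory=dict)        # partition key -> value
    written_at: dict = field(default_factory=dict)  # partition key -> timestamp
    clock: int = 0

    def next_timestamp(self) -> int:
        self.clock += 1
        return self.clock


class Transaction:
    """Buffers transaction state until commit (hypothetical API)."""

    def __init__(self, store: Store):
        self.store = store
        self.touched = {}  # partition key -> timestamp seen at first touch
        self.writes = {}   # partition key -> buffered new value

    def _touch(self, key):
        # Remember the partition's timestamp the first time we see it.
        self.touched.setdefault(key, self.store.written_at.get(key, 0))

    def read(self, key):
        self._touch(key)
        if key in self.writes:
            return self.writes[key]  # read our own buffered write
        return self.store.data.get(key)

    def write(self, key, value):
        self._touch(key)
        self.writes[key] = value

    def commit(self) -> bool:
        # Conditional application: abort if any partition involved has been
        # changed by a more recent timestamp than the one we recorded.
        for key, seen in self.touched.items():
            if self.store.written_at.get(key, 0) != seen:
                return False
        ts = self.store.next_timestamp()
        for key, value in self.writes.items():
            self.store.data[key] = value
            self.store.written_at[key] = ts
        return True
```

In this toy model, when two transactions race on the same partition the first to commit succeeds and the second aborts and must retry, so some transaction always makes progress; whether that carries over to a distributed implementation is exactly the question being debated in this thread.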
> > > >
> > > > > On Sep 29, 2021, at 6:18 PM, Jonathan Ellis <jb...@gmail.com>
> > wrote:
> > > > >
> > > > > How are interactive transactions possible with Accord?
> > > > >
> > > > >
> > > > >
> > > > > On Tue, Sep 21, 2021 at 11:56 PM benedict@apache.org <
> > > > benedict@apache.org>
> > > > > wrote:
> > > > >
> > > > >> Could you explain why you believe this trade-off is necessary? We
> > can
> > > > >> support full SQL just fine with Accord, and I hope that we
> > eventually
> > > > do so.
> > > > >>
> > > > >> This domain is incredibly complex, so it is easy to reach wrong
> > > > >> conclusions. I would invite you again to propose a system for
> > > discussion
> > > > >> that you think offers something Accord is unable to, and that you
> > > > consider
> > > > >> desirable, and we can work from there.
> > > > >>
> > > > > >> To pre-empt some possible discussions, I am not aware of anything we
> > > > > >> cannot do with Accord that we could do with either Calvin or Spanner.
> > > > > >> Interactive transactions are possible on top of Accord, as are transactions
> > > > > >> with an unknown read/write set. In each case the only cost is that they
> > > > > >> would use optimistic concurrency control, which is no worse than the Spanner
> > > > > >> derivatives anyway (which I have to assume is your benchmark in this
> > > > > >> regard). I do not expect to deliver either functionality initially, but
> > > > > >> Accord takes us most of the way there for both.
> > > > >>
> > > > >>
> > > > >> From: Jonathan Ellis <jb...@gmail.com>
> > > > >> Date: Wednesday, 22 September 2021 at 05:36
> > > > >> To: dev <de...@cassandra.apache.org>
> > > > >> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > > >> Right, I'm looking for exactly a discussion on the high level goals.
> > > > > >> Instead of saying "here's the goals and we ruled out X because Y" we should
> > > > > >> start with a discussion around, "Approach A allows X and W, approach B
> > > > > >> allows Y and Z" and decide together what the goals should be and what
> > > > > >> we are willing to trade to get those goals, e.g., are we willing to give up
> > > > > >> global strict serializability to get the ability to support full SQL.  Both
> > > > > >> of these are nice to have!
> > > > >>
> > > > >> On Tue, Sep 21, 2021 at 9:52 PM benedict@apache.org <
> > > > benedict@apache.org>
> > > > >> wrote:
> > > > >>
> > > > >>> Hi Jonathan,
> > > > >>>
> > > > >>> These other systems are incompatible with the goals of the CEP. I
> > do
> > > > >>> discuss them (besides 2PC) in both the whitepaper and the CEP,
> and
> > > will
> > > > >>> summarise that discussion below. A true and accurate comparison
> of
> > > > these
> > > > >>> other systems is essentially intractable, as there are complex
> > > > subtleties
> > > > >>> to each flavour, and those who are interested would be better
> > served
> > > by
> > > > >>> performing their own research.
> > > > >>>
> > > > >>> I think it is more productive to focus on what we want to achieve
> > as
> > > a
> > > > >>> community. If you believe the goals of this CEP are wrong for the
> > > > >> project,
> > > > >>> let’s focus on that. If you want to compare and contrast specific
> > > > facets
> > > > >> of
> > > > >>> alternative systems that you consider to be preferable in some
> > > > dimension,
> > > > >>> let’s do that here or in a Q&A as proposed by Joey.
> > > > >>>
> > > > >>> The relevant goals are that we:
> > > > >>>
> > > > >>>
> > > > >>>  1.  Guarantee strict serializable isolation on commodity
> hardware
> > > > >>>  2.  Scale to any cluster size
> > > > >>>  3.  Achieve optimal latency
> > > > >>>
> > > > >>> The approach taken by Spanner derivatives is rejected by (1)
> > because
> > > > they
> > > > >>> guarantee only Serializable isolation (they additionally fail
> (3)).
> > > > From
> > > > >>> watching talks by YugaByte, and inferring from Cockroach’s
> > > > >>> panic-cluster-death under clock skew, this is clearly considered
> by
> > > > >>> everyone to be undesirable but necessary to achieve scalability.
> > > > >>>
> > > > >>> The approach taken by FaunaDB (Calvin) is rejected by (2) because
> > its
> > > > >>> sequencing layer requires a global leader process for the
> cluster,
> > > > which
> > > > >> is
> > > > >>> incompatible with Cassandra’s scalability requirements. It
> > > additionally
> > > > >>> fails (3) for global clients.
> > > > >>>
> > > > >>> Two phase commit fails (3). As an aside, AFAICT DynamoDB is
> today a
> > > > >>> Spanner clone for its multi-key transaction functionality, not
> 2PC.
> > > > >>>
> > > > >>> Systems such as RAMP with even weaker isolation are not
> considered
> > > for
> > > > >> the
> > > > >>> simple reason that they do not even claim to meet (1).
> > > > >>>
> > > > >>> If we want to additionally offer weaker isolation levels than
> > > > >>> Serializable, such as that provided by the recent RAMP-TAO paper,
> > > > >> Cassandra
> > > > >>> is likely able to support multiple distinct transaction layers
> that
> > > > >> operate
> > > > >>> independently. I would encourage you to file a CEP to explore how
> > we
> > > > can
> > > > >>> meet these distinct use cases, but I consider them to be niche. I
> > > > expect
> > > > >>> that a majority of our user base desire strict serializable
> > > isolation,
> > > > >> and
> > > > >>> certainly no less than serializable isolation, to augment the
> > > existing
> > > > >>> weaker isolation offered by quorum reads and writes.
> > > > >>>
> > > > > >>> I would tangentially note that we are not an AP database under normal
> > > > > >>> recommended operation. A minority in any network partition cannot reach
> > > > > >>> QUORUM, so under recommended usage we are a high-availability leaderless CP
> > > > > >>> database.
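
The arithmetic behind the remark about minorities and QUORUM can be sketched as follows. This assumes a single replica set with a given replication factor and the usual majority definition of QUORUM (half the replication factor, rounded down, plus one); the function names are hypothetical and chosen only for this illustration.

```python
def quorum(replication_factor: int) -> int:
    """Replicas required for QUORUM: a strict majority of the replicas."""
    return replication_factor // 2 + 1


def sides_with_quorum(replication_factor: int, replicas_on_side_a: int) -> tuple:
    """For a two-way partition, report whether each side can reach QUORUM."""
    replicas_on_side_b = replication_factor - replicas_on_side_a
    needed = quorum(replication_factor)
    return (replicas_on_side_a >= needed, replicas_on_side_b >= needed)
```

For example, with a replication factor of 3 split one replica against two, only the two-replica side can serve QUORUM operations, and no split ever lets both sides do so.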
> > > > >>>
> > > > >>>
> > > > >>> From: Jonathan Ellis <jb...@gmail.com>
> > > > >>> Date: Tuesday, 21 September 2021 at 23:45
> > > > >>> To: dev <de...@cassandra.apache.org>
> > > > >>> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > >>> Benedict, thanks for taking the lead in putting this together.
> > Since
> > > > >>> Cassandra is the only relevant database today designed around a
> > > > >> leaderless
> > > > >>> architecture, it's quite likely that we'll be better served with
> a
> > > > custom
> > > > >>> transaction design instead of trying to retrofit one from CP
> > systems.
> > > > >>>
> > > > >>> The whitepaper here is a good description of the consensus
> > algorithm
> > > > >> itself
> > > > >>> as well as its robustness and stability characteristics, and its
> > > > >> comparison
> > > > >>> with other state-of-the-art consensus algorithms is very useful.
> > In
> > > > the
> > > > >>> context of Cassandra, where a consensus algorithm is only part of
> > > what
> > > > >> will
> > > > >>> be implemented, I'd like to see a more complete evaluation of the
> > > > >>> transactional side of things as well, including performance
> > > > >> characteristics
> > > > >>> as well as the types of transactions that can be supported and at
> > > > least a
> > > > >>> general idea of what it would look like applied to Cassandra.
> This
> > > will
> > > > >>> allow the PMC to make a more informed decision about what
> tradeoffs
> > > are
> > > > >>> best for the entire long-term project of first supplementing and
> > > > >> ultimately
> > > > >>> replacing LWT.
> > > > >>>
> > > > >>> (Allowing users to mix LWT and AP Cassandra operations against
> the
> > > same
> > > > >>> rows was probably a mistake, so in contrast with LWT we’re not
> > > looking
> > > > >> for
> > > > >>> something fast enough for occasional use but rather something
> > within
> > > a
> > > > >>> reasonable factor of AP operations, appropriate to being the only
> > way
> > > > to
> > > > >>> interact with tables declared as such.)
> > > > >>>
> > > > >>> Besides Accord, this should cover
> > > > >>>
> > > > >>> - Calvin and FaunaDB
> > > > >>> - A Spanner derivative (no opinion on whether that should be
> > > Cockroach
> > > > or
> > > > >>> Yugabyte, I don’t think it’s necessary to cover both)
> > > > >>> - A 2PC implementation (the Accord paper mentions DynamoDB but I
> > > > suspect
> > > > >>> there is more public information about MongoDB)
> > > > >>> - RAMP
> > > > >>>
> > > > >>> Here’s an example of what I mean:
> > > > >>>
> > > > >>> =Calvin=
> > > > >>>
> > > > >>> Approach: global consensus (Paxos in Calvin, Raft in FaunaDB) to
> > > order
> > > > >>> transactions, then replicas execute the transactions
> independently
> > > with
> > > > >> no
> > > > >>> further coordination.  No SPOF.  Transactions are batched by each
> > > > >> sequencer
> > > > >>> to keep this from becoming a bottleneck.
> > > > >>>
> > > > >>> Performance: Calvin paper (published 2012) reports linear scaling
> > of
> > > > >> TPC-C
> > > > >>> New Order up to 500,000 transactions/s on 100 machines (EC2 XL
> > > machines
> > > > >>> with 7GB ram and 8 virtual cores).  Note that TPC-C New Order is
> > > > composed
> > > > >>> of four reads and four writes, so this is effectively 2M reads
> and
> > 2M
> > > > >>> writes as we normally measure them in C*.
> > > > >>>
> > > > >>> Calvin supports mixed read/write transactions, but because the
> > > > >> transaction
> > > > >>> execution logic requires knowing all partition keys in advance to
> > > > ensure
> > > > >>> that all replicas can reproduce the same results with no
> > > coordination,
> > > > >>> reads against non-PK predicates must be done ahead of time
> > > > >> (transparently,
> > > > >>> by the server) to determine the set of keys, and this must be
> > retried
> > > > if
> > > > >>> the set of rows affected is updated before the actual transaction
> > > > >> executes.
> > > > >>>
> > > > >>> Batching and global consensus adds latency -- 100ms in the Calvin
> > > paper
> > > > >> and
> > > > >>> apparently about 50ms in FaunaDB.  Glass half full: all
> > transactions
> > > > >>> (including multi-partition updates) are equally performant in
> > Calvin
> > > > >> since
> > > > >>> the coordination is handled up front in the sequencing step.
> Glass
> > > > half
> > > > >>> empty: even single-row reads and writes have to pay the full
> > > > coordination
> > > > >>> cost.  Fauna has optimized this away for reads but I am not aware
> > of
> > > a
> > > > >>> description of how they changed the design to allow this.
> > > > >>>
> > > > >>> Functionality and limitations: since the entire transaction must
> be
> > > > known
> > > > >>> in advance to allow coordination-less execution at the replicas,
> > > Calvin
> > > > >>> cannot support interactive transactions at all.  FaunaDB
> mitigates
> > > this
> > > > >> by
> > > > >>> allowing server-side logic to be included, but a Calvin approach
> > will
> > > > >> never
> > > > >>> be able to offer SQL compatibility.
> > > > >>>
> > > > >>> Guarantees: Calvin transactions are strictly serializable.  There
> > is
> > > no
> > > > >>> additional complexity or performance hit to generalizing to
> > multiple
> > > > >>> regions, apart from the speed of light.  And since Calvin is
> > already
> > > > >> paying
> > > > >>> a batching latency penalty, this is less painful than for other
> > > > systems.
> > > > >>>
> > > > >>> Application to Cassandra: B-.  Distributed transactions are
> handled
> > > by
> > > > >> the
> > > > >>> sequencing and scheduling layers, which are leaderless, and
> > Calvin’s
> > > > >>> requirements for the storage layer are easily met by C*.  But
> > Calvin
> > > > also
> > > > >>> requires a global consensus protocol and LWT is almost certainly
> > not
> > > > >>> sufficiently performant, so this would require ZK or etcd
> > (reasonable
> > > > >> for a
> > > > >>> library approach but not for replacing LWT in C* itself), or an
> > > > >>> implementation of Accord.  I don’t believe Calvin would require
> > > > >> additional
> > > > >>> table-level metadata in Cassandra.
> > > > >>>
> > > > >>> On Sun, Sep 5, 2021 at 9:33 AM benedict@apache.org <
> > > > benedict@apache.org>
> > > > >>> wrote:
> > > > >>>
> > > > >>>> Wiki:
> > > > >>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-15%3A+General+Purpose+Transactions
> > > > >>>> Whitepaper:
> > > > >>>> https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf
> > > > >>>> <https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf?version=1&modificationDate=1630847736966&api=v2>
> > > > >>>> Prototype: https://github.com/belliottsmith/accord
> > > > >>>>
> > > > >>>> Hi everyone, I’d like to propose this CEP for adoption by the
> > > > >> community.
> > > > >>>>
> > > > >>>> Cassandra has benefitted from LWTs for many years, but
> application
> > > > >>>> developers that want to ensure consistency for complex
> operations
> > > must
> > > > >>>> either accept the scalability bottleneck of serializing all
> > related
> > > > >> state
> > > > >>>> through a single partition, or layer a complex state machine on
> > top
> > > of
> > > > >>> the
> > > > >>>> database. These are sophisticated and costly activities that our
> > > users
> > > > >>>> should not be expected to undertake. Since distributed databases
> > are
> > > > >>>> beginning to offer distributed transactions with fewer caveats,
> it
> > > is
> > > > >>> past
> > > > >>>> time for Cassandra to do so as well.
> > > > >>>>
> > > > >>>> This CEP proposes the use of several novel techniques that build
> > > upon
> > > > >>>> research (that followed EPaxos) to deliver (non-interactive)
> > general
> > > > >>>> purpose distributed transactions. The approach is outlined in
> the
> > > > >>> wikipage
> > > > >>>> and in more detail in the linked whitepaper. Importantly, by
> > > adopting
> > > > >>> this
> > > > >>>> approach we will be the _only_ distributed database to offer
> > global,
> > > > >>>> scalable, strict serializable transactions in one wide area
> > > > round-trip.
> > > > >>>> This would represent a significant improvement in the state of
> the
> > > > art,
> > > > >>>> both in the academic literature and in commercial or open source
> > > > >>> offerings.
> > > > >>>>
> > > > >>>> This work has been partially realised in a prototype. This
> partial
> > > > >>>> prototype has been verified against Jepsen.io’s Maelstrom
> library
> > > and
> > > > >>>> dedicated in-tree strict serializability verification tools, but
> > > much
> > > > >>> work
> > > > >>>> remains for the work to be production capable and integrated
> into
> > > > >>> Cassandra.
> > > > >>>>
> > > > >>>> I propose including the prototype in the project as a new source
> > > > >>>> repository, to be developed as a standalone library for
> > integration
> > > > >> into
> > > > >>>> Cassandra. I hope the community sees the important value
> > proposition
> > > > of
> > > > >>>> this proposal, and will adopt the CEP after this discussion, so
> > that
> > > > >> the
> > > > >>>> library and its integration into Cassandra can be developed in
> > > > parallel
> > > > >>> and
> > > > >>>> with the involvement of the wider community.
> > > > >>>>
> > > > >>>
> > > > >>>
> > > > >>> --
> > > > >>> Jonathan Ellis
> > > > >>> co-founder, http://www.datastax.com
> > > > >>> @spyced
> > > > >>>
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: dev-unsubscribe@cassandra.apache.org
> > > > For additional commands, e-mail: dev-help@cassandra.apache.org

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by Paulo Motta <pa...@gmail.com>.
> If you want to impose your views on CEP structure on others, please file
> a CEP with the additional restrictions and guidance you want to impose and
> start a discussion thread. I can then respond in detail to why I perceive
> this approach to be flawed, in a dedicated context.

This sounds very Kafkaesque. You know I won't file a meta-CEP to change the
structure of CEPs, so you're just using this as an excuse to shut down the
discussion on the lack of clarity about what actual, tangible feature will be
available once the CEP lands. :-)

I'm just providing my humble feedback on how a CEP could be more digestible
and easier to consume from an external point of view, and this seems like
an appropriate and contextualized place to voice this opinion which is
perhaps shared by others.

Em sex., 1 de out. de 2021 às 10:55, benedict@apache.org <
benedict@apache.org> escreveu:

> I disagree with you. However, this is the wrong forum to have a meta
> discussion about how CEP should be structured.
>
> If you want to impose your views on CEP structure on others, please file a
> CEP with the additional restrictions and guidance you want to impose and
> start a discussion thread. I can then respond in detail to why I perceive
> this approach to be flawed, in a dedicated context.
>
>
> From: Paulo Motta <pa...@gmail.com>
> Date: Friday, 1 October 2021 at 14:48
> To: Cassandra DEV <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > >  The proposal as it stands today is exceptionally thorough, more so than
> > > any other CEP to date, or any CEP is likely to be in the near future.
>
> The protocol is thoroughly described, but in my view CEP is a forum to
> discuss the high level architecture and plan for adding a full end-to-end
> enhancement to the database, breaking it into sub-CEPs if needed, as long
> as the full plan is known in advance, otherwise the community will not have
> the context to judge the full extent and impact of the proposed
> enhancement.
>
> > > Since it remains unclear to me what either yourself or Jonathan want to
> > > see as an alternative
>
> I would personally like to see something along these lines:
>
> CEP1: Add ACID-compliant atomic batches
> - UX changes needed: none, CQL provides the grammar we need.
> - Distributed transaction protocol needed: Accord (link to white paper if
> you want specific details about the protocol)
> - High-level architecture: what new components will be added, how existing
> components will be modified, what new messages will be added, what new
> configuration knobs will be introduced, what are the milestones of the
> project, etc.
>
> CEP2: Make LWT faster and more reliable
> - UX changes needed: none
> - Distributed transaction protocol needed: Accord, already added by
> previous CEP.
> - High-level architecture: blablabla... and so on.
>
> Em sex., 1 de out. de 2021 às 10:19, benedict@apache.org <
> benedict@apache.org> escreveu:
>
> > I think this is getting circular and unproductive. Basic disagreements
> > about whether the CEP specifies a feature I am inclined to leave for a
> > vote. In my view the CEP specifies several features, both immediate ones
> > for the user (ACID batches and multi-key LWTs) and developer-focused ones
> > around ground-breaking semantics that will be enabled.
> >
> > The proposal as it stands today is exceptionally thorough, more so than
> > any other CEP to date, or any CEP is likely to be in the near future.
> >
> > This is a Cassandra Enhancement *Proposal*, and at some point we have to
> > engage with what is proposed, not what you might like to be proposed.
> Since
> > it remains unclear to me what either yourself or Jonathan want to see as
> an
> > alternative, at this point it would seem more productive to produce your
> > own proposals for the community to consider. It is possible for multiple
> > transaction systems to co-exist, if you feel this is necessary.
> >
> >
> >
> > From: Paulo Motta <pa...@gmail.com>
> > Date: Friday, 1 October 2021 at 13:58
> > To: Cassandra DEV <de...@cassandra.apache.org>
> > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > I share similar feelings as jbellis that this proposal seems to be
> focusing
> > on the protocol itself but lacking the actual feature that will use the
> > protocol, which IMO is a key element to discuss in a CEP.
> >
> > It's similar to saying: hey I want to add this Tries Serialization
> Protocol
> > to Cassandra, but not providing specific details of how this protocol is
> > going to be used.
> >
> > I think the right route for a CEP is to describe the feature that will be
> > added to the database and the protocol is a mere requirement of the
> > high-level feature, for example:
> >
> > CEP: Add Trie-backed memtable
> > - Trie Serialization Protocol: implementation detail of the above CEP
> >
> > What is the difficulty of taking this approach, picking one of the myriad
> > of features that will be enabled by Accord and using that as the initial
> > CEP to introduce the protocol to the database?
> >
> > Em sex., 1 de out. de 2021 às 08:37, benedict@apache.org <
> > benedict@apache.org> escreveu:
> >
> > > Actually, thinking about it again, the simple optimistic protocol would
> > in
> > > fact guarantee system forward progress (i.e. independent of transaction
> > > formulation).
> > >
> > >
> > > From: benedict@apache.org <be...@apache.org>
> > > Date: Friday, 1 October 2021 at 09:14
> > > To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > Hi Jonathan,
> > >
> > > It would be great if we could achieve a bandwidth higher than 1-2 short
> > > emails per week. It remains unclear to me what your goal is, and it
> would
> > > help if you could make a statement like “I want Cassandra to be able to
> > do
> > > X” so that we can respond directly to it. I am also available to have
> > > another call, in which we can have a back and forth, please feel free
> to
> > > propose a London-compatible time within the next week that is suitable
> > for
> > > you.
> > >
> > > In my opinion we are at risk of veering off-topic, though. This CEP is
> > not
> > > to deliver interactive transactions, and to my knowledge nobody is
> > > proposing a CEP for interactive transactions. So, for the CEP at hand
> the
> > > salient question seems: does this CEP prevent us from implementing
> > > interactive transactions with properties X, Y, Z in future? To which
> the
> > > answer is almost certainly no.
> > >
> > > However, to continue the discussion and respond directly to your
> queries,
> > > I believe we agree on the definition of an interactive transaction.
> > >
> > > Two protocols were loosely outlined. The first, using timestamps for
> > > optimistic concurrency control, would indeed involve the possibility of
> > > aborts. It would not however inherently adopt the issue of LWTs where
> no
> > > transaction is able to make progress. Whether or not progress is
> > guaranteed
> > > (in a livelock-free sense) would depend on the structure of the
> > > transactions that were interfering.
> > >
> > > This approach has the advantage of being very simple to implement, so
> > that
> > > we could realistically support interactive transactions quite quickly.
> It
> > > has the additional advantage that transactions would execute very
> quickly
> > > by avoiding the WAN during construction, and as a result may in
> practice
> > > experience fewer aborts than protocols that guarantee livelock-freedom.
> > >
> > > The second protocol proposed using read/write intents and would be able
> > to
> > > support almost any behaviour you want. We could even utilise
> pessimistic
> > > concurrency control, or anything in-between. This is its own huge
> design
> > > space, and discussion of this approach and the trade-offs that could be
> > > made is (in my opinion) entirely out of scope for this CEP.
> > >
> > >
> > > From: Jonathan Ellis <jb...@gmail.com>
> > > Date: Friday, 1 October 2021 at 05:00
> > > To: dev <de...@cassandra.apache.org>
> > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > The obstacle for me is you've provided a protocol but not a fully
> fleshed
> > > out architecture, so it's hard to fill in some of the blanks.  But it
> > looks
> > > to me like optimistic concurrency control for interactive transactions
> > > applied to Accord would leave you in a LWT-like situation under fairly
> > > light contention where nobody actually makes progress due to retries.
> > >
> > > To make sure we're talking about the same thing, as Henrik pointed out,
> > > interactive transactions mean multiple round trips from the client
> > within a
> > > > transaction.  For example, here
> > > > <https://github.com/apavlo/py-tpcc/blob/master/pytpcc/drivers/sqlitedriver.py#L213>
> > > > is a simple implementation of the TPC-C New Order transaction.  The high
> > > > level logic (via
> > > > <https://courses.cs.washington.edu/courses/csep545/01wi/lectures/class1/tsld039.htm>)
> > > > is,
> > >
> > >    1. Get records describing a warehouse, customer, & district
> > >    2. Update the district
> > >    3. Increment next available order number
> > >    4. Insert record into Order and New-Order tables
> > >    5. For 5-15 items, get Item record, get/update Stock record
> > >    6. Insert Order-Line Record
> > >
> > > As you can see, this requires a lot of client-side logic mixed in with
> > the
> > > actual SQL commands.
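
To make the round trips concrete, here is a minimal sketch of the New Order
steps above as an interactive transaction. It is an illustration only:
`FakeSession`, `new_order` and the row layout are hypothetical stand-ins, not
a real driver API or the py-tpcc code, and each `read` or `write` call models
one client round trip.

```python
# Illustrative only: FakeSession is a hypothetical in-memory stand-in for a
# database session. Every read() or write() counts as one client round trip.
class FakeSession:
    def __init__(self, rows):
        self.rows = dict(rows)
        self.round_trips = 0

    def read(self, key):
        self.round_trips += 1
        return self.rows[key]

    def write(self, key, value):
        self.round_trips += 1
        self.rows[key] = value


def new_order(session, district, item_ids):
    # Steps 1-3: read the district, then allocate the next order number.
    order_id = session.read(("district", district))["next_order_id"]
    session.write(("district", district), {"next_order_id": order_id + 1})
    # Step 4: insert the order record.
    session.write(("order", order_id), {"items": list(item_ids)})
    # Steps 5-6: client-side loop; each stock write depends on a prior read.
    for item in item_ids:
        stock = session.read(("stock", item))
        session.write(("stock", item), {"qty": stock["qty"] - 1})
        session.write(("order_line", order_id, item), {"qty": 1})
    return order_id


session = FakeSession({
    ("district", 1): {"next_order_id": 7},
    ("stock", "a"): {"qty": 10},
    ("stock", "b"): {"qty": 3},
})
order_id = new_order(session, 1, ["a", "b"])
```

The point of the sketch is that the writes are computed by the client from
values it read earlier in the same transaction, so the full read/write set is
not known when the transaction starts.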
> > >
> > >
> > > On Thu, Sep 30, 2021 at 2:30 AM benedict@apache.org <
> benedict@apache.org
> > >
> > > wrote:
> > >
> > > > Essentially this, although I think in practice we will need to track
> > each
> > > > partition’s timestamp separately (or optionally for reduced
> conflicts,
> > > each
> > > > row or datum’s), and make them all part of the conditional
> application
> > of
> > > > the transaction - at least for strict-serializability.
> > > >
> > > > The alternative is to insert read/write intents for the transaction
> > > during
> > > > each step, and to confirm they are still valid on commit, but this
> > > approach
> > > > would require a WAN round-trip for each step in the interactive
> > > > transaction, whereas the timestamp-validating approach can use a LAN
> > > > round-trip for each step besides the final one, and is also much
> > simpler
> > > to
> > > > implement.
> > > >
> > > >
> > > > From: Blake Eggleston <be...@apple.com.INVALID>
> > > > Date: Thursday, 30 September 2021 at 05:47
> > > > To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> > > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > You could establish a lower timestamp bound and buffer transaction
> > state
> > > > on the coordinator, then make the commit an operation that only
> applies
> > > if
> > > > all partitions involved haven’t been changed by a more recent
> > timestamp.
> > > > You could also implement mvcc either in the storage layer or for some
> > > > period of time by buffering commits on each replica before applying.
> > > >
> > > > > On Sep 29, 2021, at 6:18 PM, Jonathan Ellis <jb...@gmail.com>
> > wrote:
> > > > >
> > > > > How are interactive transactions possible with Accord?
> > > > >
> > > > >
> > > > >
> > > > > On Tue, Sep 21, 2021 at 11:56 PM benedict@apache.org <
> > > > benedict@apache.org>
> > > > > wrote:
> > > > >
> > > > >> Could you explain why you believe this trade-off is necessary? We
> > can
> > > > >> support full SQL just fine with Accord, and I hope that we
> > eventually
> > > > do so.
> > > > >>
> > > > >> This domain is incredibly complex, so it is easy to reach wrong
> > > > >> conclusions. I would invite you again to propose a system for
> > > discussion
> > > > >> that you think offers something Accord is unable to, and that you
> > > > consider
> > > > >> desirable, and we can work from there.
> > > > >>
> > > > >> To pre-empt some possible discussions, I am not aware of anything
> we
> > > > >> cannot do with Accord that we could do with either Calvin or
> > Spanner.
> > > > >> Interactive transactions are possible on top of Accord, as are
> > > > transactions
> > > > >> with an unknown read/write set. In each case the only cost is that
> > > they
> > > > >> would use optimistic concurrency control, which is no worse than the
> > > Spanner
> > > > >> derivatives anyway (which I have to assume is your benchmark in
> this
> > > > >> regard). I do not expect to deliver either functionality
> initially,
> > > but
> > > > >> Accord takes us most of the way there for both.
> > > > >>
> > > > >>
> > > > >> From: Jonathan Ellis <jb...@gmail.com>
> > > > >> Date: Wednesday, 22 September 2021 at 05:36
> > > > >> To: dev <de...@cassandra.apache.org>
> > > > >> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > >> Right, I'm looking for exactly a discussion on the high level
> goals.
> > > > >> Instead of saying "here's the goals and we ruled out X because Y"
> we
> > > > should
> > > > >> start with a discussion around, "Approach A allows X and W,
> > approach B
> > > > >> allows Y and Z" and decide together what the goals should be and
> and
> > > > what
> > > > >> we are willing to trade to get those goals, e.g., are we willing
> to
> > > > give up
> > > > >> global strict serializability to get the ability to support full
> > SQL.
> > > > Both
> > > > >> of these are nice to have!
> > > > >>
> > > > >> On Tue, Sep 21, 2021 at 9:52 PM benedict@apache.org <
> > > > benedict@apache.org>
> > > > >> wrote:
> > > > >>
> > > > >>> Hi Jonathan,
> > > > >>>
> > > > >>> These other systems are incompatible with the goals of the CEP. I
> > do
> > > > >>> discuss them (besides 2PC) in both the whitepaper and the CEP,
> and
> > > will
> > > > >>> summarise that discussion below. A true and accurate comparison
> of
> > > > these
> > > > >>> other systems is essentially intractable, as there are complex
> > > > subtleties
> > > > >>> to each flavour, and those who are interested would be better
> > served
> > > by
> > > > >>> performing their own research.
> > > > >>>
> > > > >>> I think it is more productive to focus on what we want to achieve
> > as
> > > a
> > > > >>> community. If you believe the goals of this CEP are wrong for the
> > > > >> project,
> > > > >>> let’s focus on that. If you want to compare and contrast specific
> > > > facets
> > > > >> of
> > > > >>> alternative systems that you consider to be preferable in some
> > > > dimension,
> > > > >>> let’s do that here or in a Q&A as proposed by Joey.
> > > > >>>
> > > > >>> The relevant goals are that we:
> > > > >>>
> > > > >>>
> > > > >>>  1.  Guarantee strict serializable isolation on commodity
> hardware
> > > > >>>  2.  Scale to any cluster size
> > > > >>>  3.  Achieve optimal latency
> > > > >>>
> > > > >>> The approach taken by Spanner derivatives is rejected by (1)
> > because
> > > > they
> > > > >>> guarantee only Serializable isolation (they additionally fail
> (3)).
> > > > From
> > > > >>> watching talks by YugaByte, and inferring from Cockroach’s
> > > > >>> panic-cluster-death under clock skew, this is clearly considered
> by
> > > > >>> everyone to be undesirable but necessary to achieve scalability.
> > > > >>>
> > > > >>> The approach taken by FaunaDB (Calvin) is rejected by (2) because
> > its
> > > > >>> sequencing layer requires a global leader process for the
> cluster,
> > > > which
> > > > >> is
> > > > >>> incompatible with Cassandra’s scalability requirements. It
> > > additionally
> > > > >>> fails (3) for global clients.
> > > > >>>
> > > > >>> Two phase commit fails (3). As an aside, AFAICT DynamoDB is
> today a
> > > > >>> Spanner clone for its multi-key transaction functionality, not
> 2PC.
> > > > >>>
> > > > >>> Systems such as RAMP with even weaker isolation are not
> considered
> > > for
> > > > >> the
> > > > >>> simple reason that they do not even claim to meet (1).
> > > > >>>
> > > > >>> If we want to additionally offer weaker isolation levels than
> > > > >>> Serializable, such as that provided by the recent RAMP-TAO paper,
> > > > >> Cassandra
> > > > >>> is likely able to support multiple distinct transaction layers
> that
> > > > >> operate
> > > > >>> independently. I would encourage you to file a CEP to explore how
> > we
> > > > can
> > > > >>> meet these distinct use cases, but I consider them to be niche. I
> > > > expect
> > > > >>> that a majority of our user base desire strict serializable
> > > isolation,
> > > > >> and
> > > > >>> certainly no less than serializable isolation, to augment the
> > > existing
> > > > >>> weaker isolation offered by quorum reads and writes.
> > > > >>>
> > > > >>> I would tangentially note that we are not an AP database under
> > normal
> > > > >>> recommended operation. A minority in any network partition cannot
> > > reach
> > > > >>> QUORUM, so under recommended usage we are a high-availability
> > > > leaderless
> > > > >> CP
> > > > >>> database.
> > > > >>>
> > > > >>>
> > > > >>> From: Jonathan Ellis <jb...@gmail.com>
> > > > >>> Date: Tuesday, 21 September 2021 at 23:45
> > > > >>> To: dev <de...@cassandra.apache.org>
> > > > >>> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > > >>> Benedict, thanks for taking the lead in putting this together.
> > Since
> > > > >>> Cassandra is the only relevant database today designed around a
> > > > >> leaderless
> > > > >>> architecture, it's quite likely that we'll be better served with
> a
> > > > custom
> > > > >>> transaction design instead of trying to retrofit one from CP
> > systems.
> > > > >>>
> > > > >>> The whitepaper here is a good description of the consensus
> > algorithm
> > > > >> itself
> > > > >>> as well as its robustness and stability characteristics, and its
> > > > >> comparison
> > > > >>> with other state-of-the-art consensus algorithms is very useful.
> > In
> > > > the
> > > > >>> context of Cassandra, where a consensus algorithm is only part of
> > > what
> > > > >> will
> > > > >>> be implemented, I'd like to see a more complete evaluation of the
> > > > >>> transactional side of things as well, including performance
> > > > >> characteristics
> > > > >>> as well as the types of transactions that can be supported and at
> > > > least a
> > > > >>> general idea of what it would look like applied to Cassandra.
> This
> > > will
> > > > >>> allow the PMC to make a more informed decision about what
> tradeoffs
> > > are
> > > > >>> best for the entire long-term project of first supplementing and
> > > > >> ultimately
> > > > >>> replacing LWT.
> > > > >>>
> > > > >>> (Allowing users to mix LWT and AP Cassandra operations against
> the
> > > same
> > > > >>> rows was probably a mistake, so in contrast with LWT we’re not
> > > looking
> > > > >> for
> > > > >>> something fast enough for occasional use but rather something
> > within
> > > a
> > > > >>> reasonable factor of AP operations, appropriate to being the only
> > way
> > > > to
> > > > >>> interact with tables declared as such.)
> > > > >>>
> > > > >>> Besides Accord, this should cover
> > > > >>>
> > > > >>> - Calvin and FaunaDB
> > > > >>> - A Spanner derivative (no opinion on whether that should be
> > > Cockroach
> > > > or
> > > > >>> Yugabyte, I don’t think it’s necessary to cover both)
> > > > >>> - A 2PC implementation (the Accord paper mentions DynamoDB but I
> > > > suspect
> > > > >>> there is more public information about MongoDB)
> > > > >>> - RAMP
> > > > >>>
> > > > >>> Here’s an example of what I mean:
> > > > >>>
> > > > >>> =Calvin=
> > > > >>>
> > > > >>> Approach: global consensus (Paxos in Calvin, Raft in FaunaDB) to
> > > order
> > > > >>> transactions, then replicas execute the transactions
> independently
> > > with
> > > > >> no
> > > > >>> further coordination.  No SPOF.  Transactions are batched by each
> > > > >> sequencer
> > > > >>> to keep this from becoming a bottleneck.
> > > > >>>
> > > > >>> Performance: Calvin paper (published 2012) reports linear scaling
> > of
> > > > >> TPC-C
> > > > >>> New Order up to 500,000 transactions/s on 100 machines (EC2 XL
> > > machines
> > > > >>> with 7GB ram and 8 virtual cores).  Note that TPC-C New Order is
> > > > composed
> > > > >>> of four reads and four writes, so this is effectively 2M reads
> and
> > 2M
> > > > >>> writes as we normally measure them in C*.
> > > > >>>
> > > > >>> Calvin supports mixed read/write transactions, but because the
> > > > >> transaction
> > > > >>> execution logic requires knowing all partition keys in advance to
> > > > ensure
> > > > >>> that all replicas can reproduce the same results with no
> > > coordination,
> > > > >>> reads against non-PK predicates must be done ahead of time
> > > > >> (transparently,
> > > > >>> by the server) to determine the set of keys, and this must be
> > retried
> > > > if
> > > > >>> the set of rows affected is updated before the actual transaction
> > > > >> executes.
> > > > >>>
> > > > >>> Batching and global consensus adds latency -- 100ms in the Calvin
> > > paper
> > > > >> and
> > > > >>> apparently about 50ms in FaunaDB.  Glass half full: all
> > transactions
> > > > >>> (including multi-partition updates) are equally performant in
> > Calvin
> > > > >> since
> > > > >>> the coordination is handled up front in the sequencing step.
> Glass
> > > > half
> > > > >>> empty: even single-row reads and writes have to pay the full
> > > > coordination
> > > > >>> cost.  Fauna has optimized this away for reads but I am not aware
> > of
> > > a
> > > > >>> description of how they changed the design to allow this.
> > > > >>>
> > > > >>> Functionality and limitations: since the entire transaction must
> be
> > > > known
> > > > >>> in advance to allow coordination-less execution at the replicas,
> > > Calvin
> > > > >>> cannot support interactive transactions at all.  FaunaDB
> mitigates
> > > this
> > > > >> by
> > > > >>> allowing server-side logic to be included, but a Calvin approach
> > will
> > > > >> never
> > > > >>> be able to offer SQL compatibility.
> > > > >>>
> > > > >>> Guarantees: Calvin transactions are strictly serializable.  There
> > is
> > > no
> > > > >>> additional complexity or performance hit to generalizing to
> > multiple
> > > > >>> regions, apart from the speed of light.  And since Calvin is
> > already
> > > > >> paying
> > > > >>> a batching latency penalty, this is less painful than for other
> > > > systems.
> > > > >>>
> > > > >>> Application to Cassandra: B-.  Distributed transactions are
> handled
> > > by
> > > > >> the
> > > > >>> sequencing and scheduling layers, which are leaderless, and
> > Calvin’s
> > > > >>> requirements for the storage layer are easily met by C*.  But
> > Calvin
> > > > also
> > > > >>> requires a global consensus protocol and LWT is almost certainly
> > not
> > > > >>> sufficiently performant, so this would require ZK or etcd
> > (reasonable
> > > > >> for a
> > > > >>> library approach but not for replacing LWT in C* itself), or an
> > > > >>> implementation of Accord.  I don’t believe Calvin would require
> > > > >> additional
> > > > >>> table-level metadata in Cassandra.
> > > > >>>
> > > > >>> On Sun, Sep 5, 2021 at 9:33 AM benedict@apache.org <
> > > > benedict@apache.org>
> > > > >>> wrote:
> > > > >>>
> > > > >>>> Wiki:
> > > > >>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-15%3A+General+Purpose+Transactions
> > > > >>>> Whitepaper:
> > > > >>>> https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf
> > > > >>>> <https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf?version=1&modificationDate=1630847736966&api=v2>
> > > > >>>> Prototype: https://github.com/belliottsmith/accord
> > > > >>>>
> > > > >>>> Hi everyone, I’d like to propose this CEP for adoption by the
> > > > >> community.
> > > > >>>>
> > > > >>>> Cassandra has benefitted from LWTs for many years, but
> application
> > > > >>>> developers that want to ensure consistency for complex
> operations
> > > must
> > > > >>>> either accept the scalability bottleneck of serializing all
> > related
> > > > >> state
> > > > >>>> through a single partition, or layer a complex state machine on
> > top
> > > of
> > > > >>> the
> > > > >>>> database. These are sophisticated and costly activities that our
> > > users
> > > > >>>> should not be expected to undertake. Since distributed databases
> > are
> > > > >>>> beginning to offer distributed transactions with fewer caveats,
> it
> > > is
> > > > >>> past
> > > > >>>> time for Cassandra to do so as well.
> > > > >>>>
> > > > >>>> This CEP proposes the use of several novel techniques that build
> > > upon
> > > > >>>> research (that followed EPaxos) to deliver (non-interactive)
> > general
> > > > >>>> purpose distributed transactions. The approach is outlined in
> the
> > > > >>> wikipage
> > > > >>>> and in more detail in the linked whitepaper. Importantly, by
> > > adopting
> > > > >>> this
> > > > >>>> approach we will be the _only_ distributed database to offer
> > global,
> > > > >>>> scalable, strict serializable transactions in one wide area
> > > > round-trip.
> > > > >>>> This would represent a significant improvement in the state of
> the
> > > > art,
> > > > >>>> both in the academic literature and in commercial or open source
> > > > >>> offerings.
> > > > >>>>
> > > > >>>> This work has been partially realised in a prototype. This
> partial
> > > > >>>> prototype has been verified against Jepsen.io’s Maelstrom
> library
> > > and
> > > > >>>> dedicated in-tree strict serializability verification tools, but
> > > much
> > > > >>> work
> > > > >>>> remains for the work to be production capable and integrated
> into
> > > > >>> Cassandra.
> > > > >>>>
> > > > >>>> I propose including the prototype in the project as a new source
> > > > >>>> repository, to be developed as a standalone library for
> > integration
> > > > >> into
> > > > >>>> Cassandra. I hope the community sees the important value
> > proposition
> > > > of
> > > > >>>> this proposal, and will adopt the CEP after this discussion, so
> > that
> > > > >> the
> > > > >>>> library and its integration into Cassandra can be developed in
> > > > parallel
> > > > >>> and
> > > > >>>> with the involvement of the wider community.
> > > > >>>>
> > > > >>>

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by "benedict@apache.org" <be...@apache.org>.
I disagree with you. However, this is the wrong forum to have a meta discussion about how CEPs should be structured.

If you want to impose your views on CEP structure on others, please file a CEP with the additional restrictions and guidance you want to impose and start a discussion thread. I can then respond in detail to why I perceive this approach to be flawed, in a dedicated context.


From: Paulo Motta <pa...@gmail.com>
Date: Friday, 1 October 2021 at 14:48
To: Cassandra DEV <de...@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
>  The proposal as it stands today is exceptionally thorough, more so than
any other CEP to date, or any CEP is likely to be in the near future.

The protocol is thoroughly described, but in my view a CEP is a forum to
discuss the high level architecture and plan for adding a full end-to-end
enhancement to the database, breaking it into sub-CEPs if needed, as long
as the full plan is known in advance. Otherwise the community will not have
the context to judge the full extent and impact of the proposed enhancement.

> Since it remains unclear to me what either yourself or Jonathan want to
see as an alternative

I would personally like to see something along these lines:

CEP1: Add ACID-compliant atomic batches
- UX changes needed: none, CQL provides the grammar we need.
- Distributed transaction protocol needed: Accord (link to white paper if
you want specific details about the protocol)
- High-level architecture: what new components will be added, how existing
components will be modified, what new messages will be added, what new
configuration knobs will be introduced, what are the milestones of the
project, etc.

CEP2: Make LWT faster and more reliable
- UX changes needed: none
- Distributed transaction protocol needed: Accord, already added by
previous CEP.
- High-level architecture: blablabla... and so on.

On Fri, 1 Oct 2021 at 10:19, benedict@apache.org <
benedict@apache.org> wrote:

> I think this is getting circular and unproductive. Basic disagreements
> about whether the CEP specifies a feature I am inclined to leave for a
> vote. In my view the CEP specifies several features, both immediate ones
> for the user (ACID batches and multi-key LWTs) and developer-focused ones
> around ground-breaking semantics that will be enabled.
>
> The proposal as it stands today is exceptionally thorough, more so than
> any other CEP to date, or any CEP is likely to be in the near future.
>
> This is a Cassandra Enhancement *Proposal*, and at some point we have to
> engage with what is proposed, not what you might like to be proposed. Since
> it remains unclear to me what either yourself or Jonathan want to see as an
> alternative, at this point it would seem more productive to produce your
> own proposals for the community to consider. It is possible for multiple
> transaction systems to co-exist, if you feel this is necessary.
>
>
>
> From: Paulo Motta <pa...@gmail.com>
> Date: Friday, 1 October 2021 at 13:58
> To: Cassandra DEV <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> I share similar feelings to jbellis: this proposal seems to be focusing
> on the protocol itself but lacking the actual feature that will use the
> protocol, which IMO is a key element to discuss in a CEP.
>
> It's similar to saying: hey I want to add this Tries Serialization Protocol
> to Cassandra, but not providing specific details of how this protocol is
> going to be used.
>
> I think the right route for a CEP is to describe the feature that will be
> added to the database and the protocol is a mere requirement of the
> high-level feature, for example:
>
> CEP: Add Trie-backed memtable
> - Trie Serialization Protocol: implementation detail of the above CEP
>
> What is the difficulty of taking this approach, picking one of the myriad
> of features that will be enabled by Accord and using that as the initial
> CEP to introduce the protocol to the database?
>
> On Fri, 1 Oct 2021 at 08:37, benedict@apache.org <
> benedict@apache.org> wrote:
>
> > Actually, thinking about it again, the simple optimistic protocol would
> in
> > fact guarantee system forward progress (i.e. independent of transaction
> > formulation).
> >
> >
> > From: benedict@apache.org <be...@apache.org>
> > Date: Friday, 1 October 2021 at 09:14
> > To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > Hi Jonathan,
> >
> > It would be great if we could achieve a bandwidth higher than 1-2 short
> > emails per week. It remains unclear to me what your goal is, and it would
> > help if you could make a statement like “I want Cassandra to be able to
> do
> > X” so that we can respond directly to it. I am also available to have
> > another call, in which we can have a back and forth, please feel free to
> > propose a London-compatible time within the next week that is suitable
> for
> > you.
> >
> > In my opinion we are at risk of veering off-topic, though. This CEP is
> not
> > to deliver interactive transactions, and to my knowledge nobody is
> > proposing a CEP for interactive transactions. So, for the CEP at hand the
> > salient question seems: does this CEP prevent us from implementing
> > interactive transactions with properties X, Y, Z in future? To which the
> > answer is almost certainly no.
> >
> > However, to continue the discussion and respond directly to your queries,
> > I believe we agree on the definition of an interactive transaction.
> >
> > Two protocols were loosely outlined. The first, using timestamps for
> > optimistic concurrency control, would indeed involve the possibility of
> > aborts. It would not however inherently adopt the issue of LWTs where no
> > transaction is able to make progress. Whether or not progress is
> guaranteed
> > (in a livelock-free sense) would depend on the structure of the
> > transactions that were interfering.
> >
> > This approach has the advantage of being very simple to implement, so
> that
> > we could realistically support interactive transactions quite quickly. It
> > has the additional advantage that transactions would execute very quickly
> > by avoiding the WAN during construction, and as a result may in practice
> > experience fewer aborts than protocols that guarantee livelock-freedom.
> >
> > The second protocol proposed using read/write intents and would be able
> to
> > support almost any behaviour you want. We could even utilise pessimistic
> > concurrency control, or anything in-between. This is its own huge design
> > space, and discussion of this approach and the trade-offs that could be
> > made is (in my opinion) entirely out of scope for this CEP.
> >
> >
> > From: Jonathan Ellis <jb...@gmail.com>
> > Date: Friday, 1 October 2021 at 05:00
> > To: dev <de...@cassandra.apache.org>
> > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > The obstacle for me is you've provided a protocol but not a fully fleshed
> > out architecture, so it's hard to fill in some of the blanks.  But it
> looks
> > to me like optimistic concurrency control for interactive transactions
> > applied to Accord would leave you in a LWT-like situation under fairly
> > light contention where nobody actually makes progress due to retries.
> >
> > To make sure we're talking about the same thing, as Henrik pointed out,
> > interactive transactions mean multiple round trips from the client
> within a
> > transaction.  For example, here
> > <
> >
> https://github.com/apavlo/py-tpcc/blob/master/pytpcc/drivers/sqlitedriver.py#L213
> > >
> > is a simple implementation of the TPC-C New Order transaction.  The high
> > level logic (via
> > <
> >
> https://courses.cs.washington.edu/courses/csep545/01wi/lectures/class1/tsld039.htm
> > >)
> > is,
> >
> >    1. Get records describing a warehouse, customer, & district
> >    2. Update the district
> >    3. Increment next available order number
> >    4. Insert record into Order and New-Order tables
> >    5. For 5-15 items, get Item record, get/update Stock record
> >    6. Insert Order-Line Record
> >
> > As you can see, this requires a lot of client-side logic mixed in with
> the
> > actual SQL commands.
> >
> >
> > On Thu, Sep 30, 2021 at 2:30 AM benedict@apache.org <benedict@apache.org
> >
> > wrote:
> >
> > > Essentially this, although I think in practice we will need to track
> each
> > > partition’s timestamp separately (or optionally for reduced conflicts,
> > each
> > > row or datum’s), and make them all part of the conditional application
> of
> > > the transaction - at least for strict-serializability.
> > >
> > > The alternative is to insert read/write intents for the transaction
> > during
> > > each step, and to confirm they are still valid on commit, but this
> > approach
> > > would require a WAN round-trip for each step in the interactive
> > > transaction, whereas the timestamp-validating approach can use a LAN
> > > round-trip for each step besides the final one, and is also much
> simpler
> > to
> > > implement.
> > >
> > >
> > > From: Blake Eggleston <be...@apple.com.INVALID>
> > > Date: Thursday, 30 September 2021 at 05:47
> > > To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> > > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > You could establish a lower timestamp bound and buffer transaction
> state
> > > on the coordinator, then make the commit an operation that only applies
> > if
> > > all partitions involved haven’t been changed by a more recent
> timestamp.
> > > You could also implement mvcc either in the storage layer or for some
> > > period of time by buffering commits on each replica before applying.
> > >
> > > > On Sep 29, 2021, at 6:18 PM, Jonathan Ellis <jb...@gmail.com>
> wrote:
> > > >
> > > > How are interactive transactions possible with Accord?
> > > >
> > > >
> > > >
> > > > On Tue, Sep 21, 2021 at 11:56 PM benedict@apache.org <
> > > benedict@apache.org>
> > > > wrote:
> > > >
> > > >> Could you explain why you believe this trade-off is necessary? We
> can
> > > >> support full SQL just fine with Accord, and I hope that we
> eventually
> > > do so.
> > > >>
> > > >> This domain is incredibly complex, so it is easy to reach wrong
> > > >> conclusions. I would invite you again to propose a system for
> > discussion
> > > >> that you think offers something Accord is unable to, and that you
> > > consider
> > > >> desirable, and we can work from there.
> > > >>
> > > >> To pre-empt some possible discussions, I am not aware of anything we
> > > >> cannot do with Accord that we could do with either Calvin or
> Spanner.
> > > >> Interactive transactions are possible on top of Accord, as are
> > > transactions
> > > >> with an unknown read/write set. In each case the only cost is that
> > they
> > > > >> would use optimistic concurrency control, which is no worse than the
> > Spanner
> > > >> derivatives anyway (which I have to assume is your benchmark in this
> > > >> regard). I do not expect to deliver either functionality initially,
> > but
> > > >> Accord takes us most of the way there for both.
> > > >>
> > > >>
> > > >> From: Jonathan Ellis <jb...@gmail.com>
> > > >> Date: Wednesday, 22 September 2021 at 05:36
> > > >> To: dev <de...@cassandra.apache.org>
> > > >> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > >> Right, I'm looking for exactly a discussion on the high level goals.
> > > >> Instead of saying "here's the goals and we ruled out X because Y" we
> > > should
> > > >> start with a discussion around, "Approach A allows X and W,
> approach B
> > > > >> allows Y and Z" and decide together what the goals should be and
> > > what
> > > >> we are willing to trade to get those goals, e.g., are we willing to
> > > give up
> > > >> global strict serializability to get the ability to support full
> SQL.
> > > Both
> > > >> of these are nice to have!
> > > >>
> > > >> On Tue, Sep 21, 2021 at 9:52 PM benedict@apache.org <
> > > benedict@apache.org>
> > > >> wrote:
> > > >>
> > > >>> Hi Jonathan,
> > > >>>
> > > >>> These other systems are incompatible with the goals of the CEP. I
> do
> > > >>> discuss them (besides 2PC) in both the whitepaper and the CEP, and
> > will
> > > >>> summarise that discussion below. A true and accurate comparison of
> > > these
> > > >>> other systems is essentially intractable, as there are complex
> > > subtleties
> > > >>> to each flavour, and those who are interested would be better
> served
> > by
> > > >>> performing their own research.
> > > >>>
> > > >>> I think it is more productive to focus on what we want to achieve
> as
> > a
> > > >>> community. If you believe the goals of this CEP are wrong for the
> > > >> project,
> > > >>> let’s focus on that. If you want to compare and contrast specific
> > > facets
> > > >> of
> > > >>> alternative systems that you consider to be preferable in some
> > > dimension,
> > > >>> let’s do that here or in a Q&A as proposed by Joey.
> > > >>>
> > > >>> The relevant goals are that we:
> > > >>>
> > > >>>
> > > >>>  1.  Guarantee strict serializable isolation on commodity hardware
> > > >>>  2.  Scale to any cluster size
> > > >>>  3.  Achieve optimal latency
> > > >>>
> > > >>> The approach taken by Spanner derivatives is rejected by (1)
> because
> > > they
> > > >>> guarantee only Serializable isolation (they additionally fail (3)).
> > > From
> > > >>> watching talks by YugaByte, and inferring from Cockroach’s
> > > >>> panic-cluster-death under clock skew, this is clearly considered by
> > > >>> everyone to be undesirable but necessary to achieve scalability.
> > > >>>
> > > >>> The approach taken by FaunaDB (Calvin) is rejected by (2) because
> its
> > > >>> sequencing layer requires a global leader process for the cluster,
> > > which
> > > >> is
> > > >>> incompatible with Cassandra’s scalability requirements. It
> > additionally
> > > >>> fails (3) for global clients.
> > > >>>
> > > >>> Two phase commit fails (3). As an aside, AFAICT DynamoDB is today a
> > > >>> Spanner clone for its multi-key transaction functionality, not 2PC.
> > > >>>
> > > >>> Systems such as RAMP with even weaker isolation are not considered
> > for
> > > >> the
> > > >>> simple reason that they do not even claim to meet (1).
> > > >>>
> > > >>> If we want to additionally offer weaker isolation levels than
> > > >>> Serializable, such as that provided by the recent RAMP-TAO paper,
> > > >> Cassandra
> > > >>> is likely able to support multiple distinct transaction layers that
> > > >> operate
> > > >>> independently. I would encourage you to file a CEP to explore how
> we
> > > can
> > > >>> meet these distinct use cases, but I consider them to be niche. I
> > > expect
> > > >>> that a majority of our user base desire strict serializable
> > isolation,
> > > >> and
> > > >>> certainly no less than serializable isolation, to augment the
> > existing
> > > >>> weaker isolation offered by quorum reads and writes.
> > > >>>
> > > >>> I would tangentially note that we are not an AP database under
> normal
> > > >>> recommended operation. A minority in any network partition cannot
> > reach
> > > >>> QUORUM, so under recommended usage we are a high-availability
> > > leaderless
> > > >> CP
> > > >>> database.
> > > >>>
> > > >>>
> > > >>> From: Jonathan Ellis <jb...@gmail.com>
> > > >>> Date: Tuesday, 21 September 2021 at 23:45
> > > >>> To: dev <de...@cassandra.apache.org>
> > > >>> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > > >>> Benedict, thanks for taking the lead in putting this together.
> Since
> > > >>> Cassandra is the only relevant database today designed around a
> > > >> leaderless
> > > >>> architecture, it's quite likely that we'll be better served with a
> > > custom
> > > >>> transaction design instead of trying to retrofit one from CP
> systems.
> > > >>>
> > > >>> The whitepaper here is a good description of the consensus
> algorithm
> > > >> itself
> > > >>> as well as its robustness and stability characteristics, and its
> > > >> comparison
> > > >>> with other state-of-the-art consensus algorithms is very useful.
> In
> > > the
> > > >>> context of Cassandra, where a consensus algorithm is only part of
> > what
> > > >> will
> > > >>> be implemented, I'd like to see a more complete evaluation of the
> > > >>> transactional side of things as well, including performance
> > > >> characteristics
> > > >>> as well as the types of transactions that can be supported and at
> > > least a
> > > >>> general idea of what it would look like applied to Cassandra. This
> > will
> > > >>> allow the PMC to make a more informed decision about what tradeoffs
> > are
> > > >>> best for the entire long-term project of first supplementing and
> > > >> ultimately
> > > >>> replacing LWT.
> > > >>>
> > > >>> (Allowing users to mix LWT and AP Cassandra operations against the
> > same
> > > >>> rows was probably a mistake, so in contrast with LWT we’re not
> > looking
> > > >> for
> > > >>> something fast enough for occasional use but rather something
> within
> > a
> > > >>> reasonable factor of AP operations, appropriate to being the only
> way
> > > to
> > > >>> interact with tables declared as such.)
> > > >>>
> > > >>> Besides Accord, this should cover
> > > >>>
> > > >>> - Calvin and FaunaDB
> > > >>> - A Spanner derivative (no opinion on whether that should be
> > Cockroach
> > > or
> > > >>> Yugabyte, I don’t think it’s necessary to cover both)
> > > >>> - A 2PC implementation (the Accord paper mentions DynamoDB but I
> > > suspect
> > > >>> there is more public information about MongoDB)
> > > >>> - RAMP
> > > >>>
> > > >>> Here’s an example of what I mean:
> > > >>>
> > > >>> =Calvin=
> > > >>>
> > > >>> Approach: global consensus (Paxos in Calvin, Raft in FaunaDB) to
> > order
> > > >>> transactions, then replicas execute the transactions independently
> > with
> > > >> no
> > > >>> further coordination.  No SPOF.  Transactions are batched by each
> > > >> sequencer
> > > >>> to keep this from becoming a bottleneck.
> > > >>>
> > > >>> Performance: Calvin paper (published 2012) reports linear scaling
> of
> > > >> TPC-C
> > > >>> New Order up to 500,000 transactions/s on 100 machines (EC2 XL
> > machines
> > > >>> with 7GB ram and 8 virtual cores).  Note that TPC-C New Order is
> > > composed
> > > >>> of four reads and four writes, so this is effectively 2M reads and
> 2M
> > > >>> writes as we normally measure them in C*.
> > > >>>
> > > >>> Calvin supports mixed read/write transactions, but because the
> > > >> transaction
> > > >>> execution logic requires knowing all partition keys in advance to
> > > ensure
> > > >>> that all replicas can reproduce the same results with no
> > coordination,
> > > >>> reads against non-PK predicates must be done ahead of time
> > > >> (transparently,
> > > >>> by the server) to determine the set of keys, and this must be
> retried
> > > if
> > > >>> the set of rows affected is updated before the actual transaction
> > > >> executes.
> > > >>>
> > > > >>> Batching and global consensus add latency -- 100ms in the Calvin
> > paper
> > > >> and
> > > >>> apparently about 50ms in FaunaDB.  Glass half full: all
> transactions
> > > >>> (including multi-partition updates) are equally performant in
> Calvin
> > > >> since
> > > >>> the coordination is handled up front in the sequencing step.  Glass
> > > half
> > > >>> empty: even single-row reads and writes have to pay the full
> > > coordination
> > > >>> cost.  Fauna has optimized this away for reads but I am not aware
> of
> > a
> > > >>> description of how they changed the design to allow this.
> > > >>>
> > > >>> Functionality and limitations: since the entire transaction must be
> > > known
> > > >>> in advance to allow coordination-less execution at the replicas,
> > Calvin
> > > >>> cannot support interactive transactions at all.  FaunaDB mitigates
> > this
> > > >> by
> > > >>> allowing server-side logic to be included, but a Calvin approach
> will
> > > >> never
> > > >>> be able to offer SQL compatibility.
> > > >>>
> > > >>> Guarantees: Calvin transactions are strictly serializable.  There
> is
> > no
> > > >>> additional complexity or performance hit to generalizing to
> multiple
> > > >>> regions, apart from the speed of light.  And since Calvin is
> already
> > > >> paying
> > > >>> a batching latency penalty, this is less painful than for other
> > > systems.
> > > >>>
> > > >>> Application to Cassandra: B-.  Distributed transactions are handled
> > by
> > > >> the
> > > >>> sequencing and scheduling layers, which are leaderless, and
> Calvin’s
> > > >>> requirements for the storage layer are easily met by C*.  But
> Calvin
> > > also
> > > >>> requires a global consensus protocol and LWT is almost certainly
> not
> > > >>> sufficiently performant, so this would require ZK or etcd
> (reasonable
> > > >> for a
> > > >>> library approach but not for replacing LWT in C* itself), or an
> > > >>> implementation of Accord.  I don’t believe Calvin would require
> > > >> additional
> > > >>> table-level metadata in Cassandra.
> > > >>>
> > > >>> On Sun, Sep 5, 2021 at 9:33 AM benedict@apache.org <
> > > benedict@apache.org>
> > > >>> wrote:
> > > >>>
> > > >>>> Wiki:
> > > >>>>
> > > >>>
> > > >>
> > >
> >
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-15%3A+General+Purpose+Transactions
> > > >>>> Whitepaper:
> > > >>>>
> > > >>>
> > > >>
> > >
> >
> https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf
> > > >>>> <
> > > >>>>
> > > >>>
> > > >>
> > >
> >
> https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf?version=1&modificationDate=1630847736966&api=v2
> > > >>>>>
> > > >>>> Prototype: https://github.com/belliottsmith/accord
> > > >>>>
> > > >>>> Hi everyone, I’d like to propose this CEP for adoption by the
> > > >> community.
> > > >>>>
> > > >>>> Cassandra has benefitted from LWTs for many years, but application
> > > >>>> developers that want to ensure consistency for complex operations
> > must
> > > >>>> either accept the scalability bottleneck of serializing all
> related
> > > >> state
> > > >>>> through a single partition, or layer a complex state machine on
> top
> > of
> > > >>> the
> > > >>>> database. These are sophisticated and costly activities that our
> > users
> > > >>>> should not be expected to undertake. Since distributed databases
> are
> > > >>>> beginning to offer distributed transactions with fewer caveats, it
> > is
> > > >>> past
> > > >>>> time for Cassandra to do so as well.
> > > >>>>
> > > >>>> This CEP proposes the use of several novel techniques that build
> > upon
> > > >>>> research (that followed EPaxos) to deliver (non-interactive)
> general
> > > >>>> purpose distributed transactions. The approach is outlined in the
> > > >>> wikipage
> > > >>>> and in more detail in the linked whitepaper. Importantly, by
> > adopting
> > > >>> this
> > > >>>> approach we will be the _only_ distributed database to offer
> global,
> > > >>>> scalable, strict serializable transactions in one wide area
> > > round-trip.
> > > >>>> This would represent a significant improvement in the state of the
> > > art,
> > > >>>> both in the academic literature and in commercial or open source
> > > >>> offerings.
> > > >>>>
> > > >>>> This work has been partially realised in a prototype. This partial
> > > >>>> prototype has been verified against Jepsen.io’s Maelstrom library
> > and
> > > >>>> dedicated in-tree strict serializability verification tools, but
> > much
> > > >>> work
> > > >>>> remains for the work to be production capable and integrated into
> > > >>> Cassandra.
> > > >>>>
> > > >>>> I propose including the prototype in the project as a new source
> > > >>>> repository, to be developed as a standalone library for
> integration
> > > >> into
> > > >>>> Cassandra. I hope the community sees the important value
> proposition
> > > of
> > > >>>> this proposal, and will adopt the CEP after this discussion, so
> that
> > > >> the
> > > >>>> library and its integration into Cassandra can be developed in
> > > parallel
> > > >>> and
> > > >>>> with the involvement of the wider community.
> > > >>>>
> > > >>>
> > > >>>
> > > >>> --
> > > >>> Jonathan Ellis
> > > >>> co-founder, http://www.datastax.com
> > > >>> @spyced
> > > >>>
> > > >>
> > > >>
> > > >> --
> > > >> Jonathan Ellis
> > > >> co-founder, http://www.datastax.com
> > > >> @spyced
> > > >>
> > > >
> > > >
> > > > --
> > > > Jonathan Ellis
> > > > co-founder, http://www.datastax.com
> > > > @spyced
> > >
> > >
> > >
> >
> >
> > --
> > Jonathan Ellis
> > co-founder, http://www.datastax.com
> > @spyced
> >
>
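
The timestamp-validating approach quoted in this thread (buffer transaction
state on the coordinator, then commit only if none of the partitions involved
has been changed by a more recent timestamp) can be illustrated with a small
toy model. This is only a sketch under stated assumptions: the names Store and
InteractiveTxn, the single in-memory dict and the integer logical clock are all
hypothetical, and nothing here reflects Accord's actual design or any
Cassandra API. Writes to keys that were never read are not validated in this
sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Store:
    """Toy in-memory store (hypothetical): key -> (value, last-write timestamp)."""
    data: dict = field(default_factory=dict)
    clock: int = 0

    def read(self, key):
        # Unknown keys read as (None, 0), i.e. "never written".
        return self.data.get(key, (None, 0))

    def commit(self, read_versions, writes):
        """Apply all writes only if every partition read is still unchanged."""
        for key, seen_ts in read_versions.items():
            if self.read(key)[1] != seen_ts:
                return False  # a more recent write landed; the caller must retry
        self.clock += 1
        for key, value in writes.items():
            self.data[key] = (value, self.clock)
        return True


class InteractiveTxn:
    """Buffers reads and writes on the coordinator until commit (hypothetical)."""

    def __init__(self, store):
        self.store = store
        self.read_versions = {}  # key -> timestamp observed at first read
        self.writes = {}         # key -> buffered new value

    def get(self, key):
        if key in self.writes:   # read-your-own-writes from the buffer
            return self.writes[key]
        value, ts = self.store.read(key)
        self.read_versions.setdefault(key, ts)
        return value

    def put(self, key, value):
        self.writes[key] = value

    def commit(self):
        return self.store.commit(self.read_versions, self.writes)


store = Store()
store.commit({}, {"district:next_order_id": 1})

# Two interactive transactions race to increment the same counter.
t1, t2 = InteractiveTxn(store), InteractiveTxn(store)
for txn in (t1, t2):
    txn.put("district:next_order_id", txn.get("district:next_order_id") + 1)

print(t1.commit())  # True: nothing changed since t1 read the counter
print(t2.commit())  # False: t1's write is newer than what t2 observed
```

In this toy model at least one of any set of conflicting transactions commits,
because the first to reach commit sees unchanged timestamps; the losers abort
and must be retried by the client.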

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by Paulo Motta <pa...@gmail.com>.
>  The proposal as it stands today is exceptionally thorough, more so than
any other CEP to date, or any CEP is likely to be in the near future.

The protocol is thoroughly described, but in my view a CEP is a forum to
discuss the high level architecture and plan for adding a full end-to-end
enhancement to the database, breaking it into sub-CEPs if needed, as long
as the full plan is known in advance. Otherwise the community will not have
the context to judge the full extent and impact of the proposed enhancement.

> Since it remains unclear to me what either yourself or Jonathan want to
see as an alternative

I would personally like to see something along these lines:

CEP1: Add ACID-compliant atomic batches
- UX changes needed: none, CQL provides the grammar we need.
- Distributed transaction protocol needed: Accord (link to white paper if
you want specific details about the protocol)
- High-level architecture: what new components will be added, how existing
components will be modified, what new messages will be added, what new
configuration knobs will be introduced, what are the milestones of the
project, etc.

CEP2: Make LWT faster and more reliable
- UX changes needed: none
- Distributed transaction protocol needed: Accord, already added by
previous CEP.
- High-level architecture: blablabla... and so on.

On Fri, 1 Oct 2021 at 10:19, benedict@apache.org <
benedict@apache.org> wrote:

> I think this is getting circular and unproductive. Basic disagreements
> about whether the CEP specifies a feature I am inclined to leave for a
> vote. In my view the CEP specifies several features, both immediate ones
> for the user (ACID batches and multi-key LWTs) and developer-focused ones
> around ground-breaking semantics that will be enabled.
>
> The proposal as it stands today is exceptionally thorough, more so than
> any other CEP to date, or any CEP is likely to be in the near future.
>
> This is a Cassandra Enhancement *Proposal*, and at some point we have to
> engage with what is proposed, not what you might like to be proposed. Since
> it remains unclear to me what either yourself or Jonathan want to see as an
> alternative, at this point it would seem more productive to produce your
> own proposals for the community to consider. It is possible for multiple
> transaction systems to co-exist, if you feel this is necessary.
>
>
>
> From: Paulo Motta <pa...@gmail.com>
> Date: Friday, 1 October 2021 at 13:58
> To: Cassandra DEV <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> I share similar feelings to jbellis: this proposal seems to be focusing
> on the protocol itself but lacking the actual feature that will use the
> protocol, which IMO is a key element to discuss in a CEP.
>
> It's similar to saying: hey I want to add this Tries Serialization Protocol
> to Cassandra, but not providing specific details of how this protocol is
> going to be used.
>
> I think the right route for a CEP is to describe the feature that will be
> added to the database and the protocol is a mere requirement of the
> high-level feature, for example:
>
> CEP: Add Trie-backed memtable
> - Trie Serialization Protocol: implementation detail of the above CEP
>
> What is the difficulty of taking this approach, picking one of the myriad
> of features that will be enabled by Accord and using that as the initial
> CEP to introduce the protocol to the database?
>
> On Fri, 1 Oct 2021 at 08:37, benedict@apache.org <
> benedict@apache.org> wrote:
>
> > Actually, thinking about it again, the simple optimistic protocol would
> in
> > fact guarantee system forward progress (i.e. independent of transaction
> > formulation).
> >
> >
> > From: benedict@apache.org <be...@apache.org>
> > Date: Friday, 1 October 2021 at 09:14
> > To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > Hi Jonathan,
> >
> > It would be great if we could achieve a bandwidth higher than 1-2 short
> > emails per week. It remains unclear to me what your goal is, and it would
> > help if you could make a statement like “I want Cassandra to be able to
> do
> > X” so that we can respond directly to it. I am also available to have
> > another call, in which we can have a back and forth, please feel free to
> > propose a London-compatible time within the next week that is suitable
> for
> > you.
> >
> > In my opinion we are at risk of veering off-topic, though. This CEP is
> not
> > to deliver interactive transactions, and to my knowledge nobody is
> > proposing a CEP for interactive transactions. So, for the CEP at hand the
> > salient question seems: does this CEP prevent us from implementing
> > interactive transactions with properties X, Y, Z in future? To which the
> > answer is almost certainly no.
> >
> > However, to continue the discussion and respond directly to your queries,
> > I believe we agree on the definition of an interactive transaction.
> >
> > Two protocols were loosely outlined. The first, using timestamps for
> > optimistic concurrency control, would indeed involve the possibility of
> > aborts. It would not however inherently adopt the issue of LWTs where no
> > transaction is able to make progress. Whether or not progress is
> guaranteed
> > (in a livelock-free sense) would depend on the structure of the
> > transactions that were interfering.
> >
> > This approach has the advantage of being very simple to implement, so
> that
> > we could realistically support interactive transactions quite quickly. It
> > has the additional advantage that transactions would execute very quickly
> > by avoiding the WAN during construction, and as a result may in practice
> > experience fewer aborts than protocols that guarantee livelock-freedom.
> >
> > The second protocol proposed using read/write intents and would be able
> to
> > support almost any behaviour you want. We could even utilise pessimistic
> > concurrency control, or anything in-between. This is its own huge design
> > space, and discussion of this approach and the trade-offs that could be
> > made is (in my opinion) entirely out of scope for this CEP.
> >

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by "benedict@apache.org" <be...@apache.org>.
I think this is getting circular and unproductive. I am inclined to leave basic disagreements about whether the CEP specifies a feature to a vote. In my view the CEP specifies several features, both immediate ones for the user (ACID batches and multi-key LWTs) and developer-focused ones around the ground-breaking semantics that will be enabled.

The proposal as it stands today is exceptionally thorough, more so than any other CEP to date, and more so than any CEP is likely to be in the near future.

This is a Cassandra Enhancement *Proposal*, and at some point we have to engage with what is proposed, not what you might like to be proposed. Since it remains unclear to me what either you or Jonathan want to see as an alternative, at this point it would seem more productive for you to produce your own proposals for the community to consider. It is possible for multiple transaction systems to co-exist, if you feel this is necessary.



From: Paulo Motta <pa...@gmail.com>
Date: Friday, 1 October 2021 at 13:58
To: Cassandra DEV <de...@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
I share jbellis's feeling that this proposal focuses on the protocol itself
but lacks the actual feature that will use the protocol, which IMO is a key
element to discuss in a CEP.

It's similar to saying: hey, I want to add this Trie Serialization Protocol
to Cassandra, without providing specific details of how this protocol is
going to be used.

I think the right route for a CEP is to describe the feature that will be
added to the database, with the protocol being a mere requirement of the
high-level feature, for example:

CEP: Add Trie-backed memtable
- Trie Serialization Protocol: implementation detail of the above CEP

What is the difficulty of taking this approach, picking one of the myriad
of features that will be enabled by Accord and using that as the initial
CEP to introduce the protocol to the database?

On Fri, Oct 1, 2021 at 08:37, benedict@apache.org <benedict@apache.org>
wrote:

> Actually, thinking about it again, the simple optimistic protocol would in
> fact guarantee system forward progress (i.e. independent of transaction
> formulation).
>
>
> From: benedict@apache.org <be...@apache.org>
> Date: Friday, 1 October 2021 at 09:14
> To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> Hi Jonathan,
>
> It would be great if we could achieve a bandwidth higher than 1-2 short
> emails per week. It remains unclear to me what your goal is, and it would
> help if you could make a statement like “I want Cassandra to be able to do
> X” so that we can respond directly to it. I am also available to have
> another call, in which we can have a back and forth; please feel free to
> propose a London-compatible time within the next week that is suitable for
> you.
>
> In my opinion we are at risk of veering off-topic, though. This CEP is not
> intended to deliver interactive transactions, and to my knowledge nobody is
> proposing a CEP for interactive transactions. So, for the CEP at hand the
> salient question seems to be: does this CEP prevent us from implementing
> interactive transactions with properties X, Y, Z in future? To which the
> answer is almost certainly no.
>
> However, to continue the discussion and respond directly to your queries,
> I believe we agree on the definition of an interactive transaction.
>
> Two protocols were loosely outlined. The first, using timestamps for
> optimistic concurrency control, would indeed involve the possibility of
> aborts. It would not however inherently adopt the issue of LWTs where no
> transaction is able to make progress. Whether or not progress is guaranteed
> (in a livelock-free sense) would depend on the structure of the
> transactions that were interfering.
>
> This approach has the advantage of being very simple to implement, so that
> we could realistically support interactive transactions quite quickly. It
> has the additional advantage that transactions would execute very quickly
> by avoiding the WAN during construction, and as a result may in practice
> experience fewer aborts than protocols that guarantee livelock-freedom.
>
> The second protocol proposed using read/write intents and would be able to
> support almost any behaviour you want. We could even utilise pessimistic
> concurrency control, or anything in-between. This is its own huge design
> space, and discussion of this approach and the trade-offs that could be
> made is (in my opinion) entirely out of scope for this CEP.
>
>
> From: Jonathan Ellis <jb...@gmail.com>
> Date: Friday, 1 October 2021 at 05:00
> To: dev <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> The obstacle for me is you've provided a protocol but not a fully fleshed
> out architecture, so it's hard to fill in some of the blanks.  But it looks
> to me like optimistic concurrency control for interactive transactions
> applied to Accord would leave you in a LWT-like situation under fairly
> light contention where nobody actually makes progress due to retries.
>
> To make sure we're talking about the same thing, as Henrik pointed out,
> interactive transactions mean multiple round trips from the client within a
> transaction.  For example, here
> <https://github.com/apavlo/py-tpcc/blob/master/pytpcc/drivers/sqlitedriver.py#L213>
> is a simple implementation of the TPC-C New Order transaction.  The high
> level logic (via
> <https://courses.cs.washington.edu/courses/csep545/01wi/lectures/class1/tsld039.htm>)
> is,
>
>    1. Get records describing a warehouse, customer, & district
>    2. Update the district
>    3. Increment next available order number
>    4. Insert record into Order and New-Order tables
>    5. For 5-15 items, get Item record, get/update Stock record
>    6. Insert Order-Line Record
>
> As you can see, this requires a lot of client-side logic mixed in with the
> actual SQL commands.
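
To make the "client-side logic mixed in with SQL" point concrete, here is a
minimal sketch of a heavily simplified New Order flow against an in-memory
SQLite database. The schema, the function names (setup, new_order) and the
restock rule are hypothetical simplifications for illustration, not the
py-tpcc driver linked above. What it is meant to show is that each statement
is a separate round trip and that later writes depend on values the client
read earlier in the same transaction.

```python
import sqlite3


def setup():
    # isolation_level=None: the module issues no implicit BEGIN/COMMIT,
    # so the transaction boundaries below are fully explicit.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.executescript("""
        CREATE TABLE district (id INTEGER PRIMARY KEY, next_o_id INTEGER);
        CREATE TABLE stock (item_id INTEGER PRIMARY KEY, quantity INTEGER);
        CREATE TABLE orders (id INTEGER, district_id INTEGER);
        CREATE TABLE order_line (order_id INTEGER, line_no INTEGER, item_id INTEGER);
        INSERT INTO district VALUES (1, 3001);
        INSERT INTO stock VALUES (10, 50), (11, 5);
    """)
    return conn


def new_order(conn, district_id, item_ids):
    """Simplified New Order: every execute() is one client round trip."""
    cur = conn.cursor()
    cur.execute("BEGIN")
    # Read the next available order number for the district.
    (order_id,) = cur.execute(
        "SELECT next_o_id FROM district WHERE id = ?", (district_id,)
    ).fetchone()
    # The client computes the increment and writes it back.
    cur.execute(
        "UPDATE district SET next_o_id = ? WHERE id = ?",
        (order_id + 1, district_id),
    )
    cur.execute(
        "INSERT INTO orders (id, district_id) VALUES (?, ?)",
        (order_id, district_id),
    )
    for line_no, item_id in enumerate(item_ids, start=1):
        (qty,) = cur.execute(
            "SELECT quantity FROM stock WHERE item_id = ?", (item_id,)
        ).fetchone()
        # Client-side branch on a value read inside the transaction
        # (hypothetical, simplified restock rule).
        new_qty = qty - 1 if qty > 10 else qty + 90
        cur.execute(
            "UPDATE stock SET quantity = ? WHERE item_id = ?", (new_qty, item_id)
        )
        cur.execute(
            "INSERT INTO order_line (order_id, line_no, item_id) VALUES (?, ?, ?)",
            (order_id, line_no, item_id),
        )
    cur.execute("COMMIT")
    return order_id
```

Because the set of keys written and the values written are only known after
the reads return, a transaction like this cannot be submitted as a single
pre-declared unit without moving the branching logic to the server.
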
>
>
> On Thu, Sep 30, 2021 at 2:30 AM benedict@apache.org <be...@apache.org>
> wrote:
>
> > Essentially this, although I think in practice we will need to track each
> > partition’s timestamp separately (or optionally for reduced conflicts,
> > each row or datum’s), and make them all part of the conditional
> > application of the transaction - at least for strict-serializability.
> >
> > The alternative is to insert read/write intents for the transaction
> > during each step, and to confirm they are still valid on commit, but this
> > approach would require a WAN round-trip for each step in the interactive
> > transaction, whereas the timestamp-validating approach can use a LAN
> > round-trip for each step besides the final one, and is also much simpler
> > to implement.
> >
> >
> > From: Blake Eggleston <be...@apple.com.INVALID>
> > Date: Thursday, 30 September 2021 at 05:47
> > To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> > Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > You could establish a lower timestamp bound and buffer transaction state
> > on the coordinator, then make the commit an operation that only applies
> > if all partitions involved haven’t been changed by a more recent
> > timestamp. You could also implement MVCC either in the storage layer or
> > for some period of time by buffering commits on each replica before
> > applying.
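
The timestamp-validating idea described above can be sketched in a few lines.
This is a single-process toy under stated assumptions, not Accord and not a
proposed implementation: Store and Txn are hypothetical names, the integer
clock stands in for whatever timestamp source the real system would use, and
in the design under discussion the final conditional application would itself
be one transaction in the underlying protocol. Reads record the timestamp
seen per partition, writes are buffered on the coordinator, and commit applies
only if none of the partitions read has been changed by a more recent
timestamp.

```python
class Store:
    """Toy store: each partition holds (value, timestamp of last write)."""

    def __init__(self):
        self.data = {}   # partition key -> (value, last_write_ts)
        self.clock = 0   # stand-in for a logical timestamp source

    def read(self, key):
        return self.data.get(key, (None, 0))

    def commit(self, read_versions, writes):
        """Apply writes only if every partition read is still unchanged."""
        for key, seen_ts in read_versions.items():
            if self.read(key)[1] > seen_ts:
                return False  # changed by a more recent timestamp: abort
        self.clock += 1
        for key, value in writes.items():
            self.data[key] = (value, self.clock)
        return True


class Txn:
    """Coordinator-side buffer for one interactive transaction."""

    def __init__(self, store):
        self.store = store
        self.read_versions = {}  # partition key -> timestamp first observed
        self.writes = {}         # buffered writes, invisible until commit

    def get(self, key):
        if key in self.writes:   # read-your-own-writes from the buffer
            return self.writes[key]
        value, ts = self.store.read(key)
        self.read_versions.setdefault(key, ts)
        return value

    def put(self, key, value):
        self.writes[key] = value

    def commit(self):
        return self.store.commit(self.read_versions, self.writes)
```

If two such transactions both read and then increment the same key, whichever
commits first succeeds and the other fails validation and has to retry. In
this toy model an abort only ever happens because some other transaction
committed, so conflicts cost retries but never stop the system as a whole
from making progress.
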
> >
> > > On Sep 29, 2021, at 6:18 PM, Jonathan Ellis <jb...@gmail.com> wrote:
> > >
> > > How are interactive transactions possible with Accord?
> > >
> > >
> > >
> > > On Tue, Sep 21, 2021 at 11:56 PM benedict@apache.org
> > > <benedict@apache.org> wrote:
> > >
> > >> Could you explain why you believe this trade-off is necessary? We can
> > >> support full SQL just fine with Accord, and I hope that we eventually
> > >> do so.
> > >>
> > >> This domain is incredibly complex, so it is easy to reach wrong
> > >> conclusions. I would invite you again to propose a system for
> > >> discussion that you think offers something Accord is unable to, and
> > >> that you consider desirable, and we can work from there.
> > >>
> > >> To pre-empt some possible discussions, I am not aware of anything we
> > >> cannot do with Accord that we could do with either Calvin or Spanner.
> > >> Interactive transactions are possible on top of Accord, as are
> > >> transactions with an unknown read/write set. In each case the only cost
> > >> is that they would use optimistic concurrency control, which is no
> > >> worse than Spanner derivatives anyway (which I have to assume is your
> > >> benchmark in this regard). I do not expect to deliver either
> > >> functionality initially, but Accord takes us most of the way there for
> > >> both.
> > >>
> > >>
> > >> From: Jonathan Ellis <jb...@gmail.com>
> > >> Date: Wednesday, 22 September 2021 at 05:36
> > >> To: dev <de...@cassandra.apache.org>
> > >> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > >> Right, I'm looking for exactly a discussion on the high level goals.
> > >> Instead of saying "here are the goals and we ruled out X because Y" we
> > >> should start with a discussion around, "Approach A allows X and W,
> > >> approach B allows Y and Z" and decide together what the goals should be
> > >> and what we are willing to trade to get those goals, e.g., are we
> > >> willing to give up global strict serializability to get the ability to
> > >> support full SQL. Both of these are nice to have!
> > >>
> > >> On Tue, Sep 21, 2021 at 9:52 PM benedict@apache.org
> > >> <benedict@apache.org> wrote:
> > >>
> > >>> Hi Jonathan,
> > >>>
> > >>> These other systems are incompatible with the goals of the CEP. I do
> > >>> discuss them (besides 2PC) in both the whitepaper and the CEP, and
> > >>> will summarise that discussion below. A true and accurate comparison
> > >>> of these other systems is essentially intractable, as there are
> > >>> complex subtleties to each flavour, and those who are interested would
> > >>> be better served by performing their own research.
> > >>>
> > >>> I think it is more productive to focus on what we want to achieve as a
> > >>> community. If you believe the goals of this CEP are wrong for the
> > >>> project, let’s focus on that. If you want to compare and contrast
> > >>> specific facets of alternative systems that you consider to be
> > >>> preferable in some dimension, let’s do that here or in a Q&A as
> > >>> proposed by Joey.
> > >>>
> > >>> The relevant goals are that we:
> > >>>
> > >>>
> > >>>  1.  Guarantee strict serializable isolation on commodity hardware
> > >>>  2.  Scale to any cluster size
> > >>>  3.  Achieve optimal latency
> > >>>
> > >>> The approach taken by Spanner derivatives is rejected by (1) because
> > >>> they guarantee only Serializable isolation (they additionally fail
> > >>> (3)). From watching talks by YugaByte, and inferring from Cockroach’s
> > >>> panic-cluster-death under clock skew, this is clearly considered by
> > >>> everyone to be undesirable but necessary to achieve scalability.
> > >>>
> > >>> The approach taken by FaunaDB (Calvin) is rejected by (2) because its
> > >>> sequencing layer requires a global leader process for the cluster,
> > >>> which is incompatible with Cassandra’s scalability requirements. It
> > >>> additionally fails (3) for global clients.
> > >>>
> > >>> Two phase commit fails (3). As an aside, AFAICT DynamoDB is today a
> > >>> Spanner clone for its multi-key transaction functionality, not 2PC.
> > >>>
> > >>> Systems such as RAMP with even weaker isolation are not considered for
> > >>> the simple reason that they do not even claim to meet (1).
> > >>>
> > >>> If we want to additionally offer weaker isolation levels than
> > >>> Serializable, such as that provided by the recent RAMP-TAO paper,
> > >>> Cassandra is likely able to support multiple distinct transaction
> > >>> layers that operate independently. I would encourage you to file a CEP
> > >>> to explore how we can meet these distinct use cases, but I consider
> > >>> them to be niche. I expect that a majority of our user base desire
> > >>> strict serializable isolation, and certainly no less than serializable
> > >>> isolation, to augment the existing weaker isolation offered by quorum
> > >>> reads and writes.
> > >>>
> > >>> I would tangentially note that we are not an AP database under normal
> > >>> recommended operation. A minority in any network partition cannot
> > >>> reach QUORUM, so under recommended usage we are a high-availability
> > >>> leaderless CP database.
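
The QUORUM remark above comes down to simple majority arithmetic, sketched
here. The helper names (quorum, can_reach_quorum) are hypothetical, and the
model only counts how many replicas of a given key each side of a partition
can reach; it ignores multi-datacenter consistency levels.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of the replicas for a key."""
    return replication_factor // 2 + 1


def can_reach_quorum(reachable_replicas: int, replication_factor: int) -> bool:
    """True if this side of a partition can still serve QUORUM operations."""
    return reachable_replicas >= quorum(replication_factor)
```

With a replication factor of 3, a side of the partition that can reach only
one replica fails can_reach_quorum(1, 3), while the side with two replicas
passes. Since two disjoint sides can never both hold a majority, at most one
side keeps accepting QUORUM reads and writes.
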
> > >>>
> > >>>
> > >>> From: Jonathan Ellis <jb...@gmail.com>
> > >>> Date: Tuesday, 21 September 2021 at 23:45
> > >>> To: dev <de...@cassandra.apache.org>
> > >>> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> > >>> Benedict, thanks for taking the lead in putting this together. Since
> > >>> Cassandra is the only relevant database today designed around a
> > >>> leaderless architecture, it's quite likely that we'll be better served
> > >>> with a custom transaction design instead of trying to retrofit one
> > >>> from CP systems.
> > >>>
> > >>> The whitepaper here is a good description of the consensus algorithm
> > >>> itself as well as its robustness and stability characteristics, and
> > >>> its comparison with other state-of-the-art consensus algorithms is
> > >>> very useful.  In the context of Cassandra, where a consensus algorithm
> > >>> is only part of what will be implemented, I'd like to see a more
> > >>> complete evaluation of the transactional side of things as well,
> > >>> including performance characteristics as well as the types of
> > >>> transactions that can be supported and at least a general idea of what
> > >>> it would look like applied to Cassandra. This will allow the PMC to
> > >>> make a more informed decision about what tradeoffs are best for the
> > >>> entire long-term project of first supplementing and ultimately
> > >>> replacing LWT.
> > >>>
> > >>> (Allowing users to mix LWT and AP Cassandra operations against the
> > >>> same rows was probably a mistake, so in contrast with LWT we’re not
> > >>> looking for something fast enough for occasional use but rather
> > >>> something within a reasonable factor of AP operations, appropriate to
> > >>> being the only way to interact with tables declared as such.)
> > >>>
> > >>> Besides Accord, this should cover
> > >>>
> > >>> - Calvin and FaunaDB
> > >>> - A Spanner derivative (no opinion on whether that should be
> Cockroach
> > or
> > >>> Yugabyte, I don’t think it’s necessary to cover both)
> > >>> - A 2PC implementation (the Accord paper mentions DynamoDB but I
> > suspect
> > >>> there is more public information about MongoDB)
> > >>> - RAMP
> > >>>
> > >>> Here’s an example of what I mean:
> > >>>
> > >>> =Calvin=
> > >>>
> > >>> Approach: global consensus (Paxos in Calvin, Raft in FaunaDB) to
> order
> > >>> transactions, then replicas execute the transactions independently
> with
> > >> no
> > >>> further coordination.  No SPOF.  Transactions are batched by each
> > >> sequencer
> > >>> to keep this from becoming a bottleneck.
> > >>>
> > >>> Performance: Calvin paper (published 2012) reports linear scaling of
> > >> TPC-C
> > >>> New Order up to 500,000 transactions/s on 100 machines (EC2 XL
> machines
> > >>> with 7GB ram and 8 virtual cores).  Note that TPC-C New Order is
> > composed
> > >>> of four reads and four writes, so this is effectively 2M reads and 2M
> > >>> writes as we normally measure them in C*.
> > >>>
> > >>> Calvin supports mixed read/write transactions, but because the
> > >> transaction
> > >>> execution logic requires knowing all partition keys in advance to
> > ensure
> > >>> that all replicas can reproduce the same results with no
> coordination,
> > >>> reads against non-PK predicates must be done ahead of time
> > >> (transparently,
> > >>> by the server) to determine the set of keys, and this must be retried
> > if
> > >>> the set of rows affected is updated before the actual transaction
> > >> executes.
> > >>>
> > >>> Batching and global consensus adds latency -- 100ms in the Calvin
> paper
> > >> and
> > >>> apparently about 50ms in FaunaDB.  Glass half full: all transactions
> > >>> (including multi-partition updates) are equally performant in Calvin
> > >> since
> > >>> the coordination is handled up front in the sequencing step.  Glass
> > half
> > >>> empty: even single-row reads and writes have to pay the full
> > coordination
> > >>> cost.  Fauna has optimized this away for reads but I am not aware of
> a
> > >>> description of how they changed the design to allow this.
> > >>>
> > >>> Functionality and limitations: since the entire transaction must be
> > known
> > >>> in advance to allow coordination-less execution at the replicas,
> Calvin
> > >>> cannot support interactive transactions at all.  FaunaDB mitigates
> this
> > >> by
> > >>> allowing server-side logic to be included, but a Calvin approach will
> > >> never
> > >>> be able to offer SQL compatibility.
> > >>>
> > >>> Guarantees: Calvin transactions are strictly serializable.  There is
> no
> > >>> additional complexity or performance hit to generalizing to multiple
> > >>> regions, apart from the speed of light.  And since Calvin is already
> > >> paying
> > >>> a batching latency penalty, this is less painful than for other
> > systems.
> > >>>
> > >>> Application to Cassandra: B-.  Distributed transactions are handled
> by
> > >> the
> > >>> sequencing and scheduling layers, which are leaderless, and Calvin’s
> > >>> requirements for the storage layer are easily met by C*.  But Calvin
> > also
> > >>> requires a global consensus protocol and LWT is almost certainly not
> > >>> sufficiently performant, so this would require ZK or etcd (reasonable
> > >> for a
> > >>> library approach but not for replacing LWT in C* itself), or an
> > >>> implementation of Accord.  I don’t believe Calvin would require
> > >> additional
> > >>> table-level metadata in Cassandra.
> > >>>
> > >>> On Sun, Sep 5, 2021 at 9:33 AM benedict@apache.org <
> > benedict@apache.org>
> > >>> wrote:
> > >>>
> > >>>> Wiki:
> > >>>>
> > >>>
> > >>
> >
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-15%3A+General+Purpose+Transactions
> > >>>> Whitepaper:
> > >>>>
> > >>>
> > >>
> >
> https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf
> > >>>> <
> > >>>>
> > >>>
> > >>
> >
> https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf?version=1&modificationDate=1630847736966&api=v2
> > >>>>>
> > >>>> Prototype: https://github.com/belliottsmith/accord
> > >>>>
> > >>>> Hi everyone, I’d like to propose this CEP for adoption by the
> > >> community.
> > >>>>
> > >>>> Cassandra has benefitted from LWTs for many years, but application
> > >>>> developers that want to ensure consistency for complex operations
> must
> > >>>> either accept the scalability bottleneck of serializing all related
> > >> state
> > >>>> through a single partition, or layer a complex state machine on top
> of
> > >>> the
> > >>>> database. These are sophisticated and costly activities that our
> users
> > >>>> should not be expected to undertake. Since distributed databases are
> > >>>> beginning to offer distributed transactions with fewer caveats, it
> is
> > >>> past
> > >>>> time for Cassandra to do so as well.
> > >>>>
> > >>>> This CEP proposes the use of several novel techniques that build
> upon
> > >>>> research (that followed EPaxos) to deliver (non-interactive) general
> > >>>> purpose distributed transactions. The approach is outlined in the
> > >>> wikipage
> > >>>> and in more detail in the linked whitepaper. Importantly, by
> adopting
> > >>> this
> > >>>> approach we will be the _only_ distributed database to offer global,
> > >>>> scalable, strict serializable transactions in one wide area
> > round-trip.
> > >>>> This would represent a significant improvement in the state of the
> > art,
> > >>>> both in the academic literature and in commercial or open source
> > >>> offerings.
> > >>>>
> > >>>> This work has been partially realised in a prototype. This partial
> > >>>> prototype has been verified against Jepsen.io’s Maelstrom library
> and
> > >>>> dedicated in-tree strict serializability verification tools, but
> much
> > >>> work
> > >>>> remains for the work to be production capable and integrated into
> > >>> Cassandra.
> > >>>>
> > >>>> I propose including the prototype in the project as a new source
> > >>>> repository, to be developed as a standalone library for integration
> > >> into
> > >>>> Cassandra. I hope the community sees the important value proposition
> > of
> > >>>> this proposal, and will adopt the CEP after this discussion, so that
> > >> the
> > >>>> library and its integration into Cassandra can be developed in
> > parallel
> > >>> and
> > >>>> with the involvement of the wider community.
> > >>>>
> > >>>
> > >>>
> > >>> --
> > >>> Jonathan Ellis
> > >>> co-founder, http://www.datastax.com
> > >>> @spyced
> > >>>
> > >>
> > >>
> > >> --
> > >> Jonathan Ellis
> > >> co-founder, http://www.datastax.com
> > >> @spyced
> > >>
> > >
> > >
> > > --
> > > Jonathan Ellis
> > > co-founder, http://www.datastax.com
> > > @spyced
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@cassandra.apache.org
> > For additional commands, e-mail: dev-help@cassandra.apache.org
> >
>
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced
>

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by Paulo Motta <pa...@gmail.com>.
I share jbellis's feeling that this proposal focuses on the protocol itself
but lacks the actual feature that will use the protocol, which IMO is a key
element to discuss in a CEP.

It's similar to saying: hey, I want to add this Trie Serialization Protocol
to Cassandra, without providing specific details of how this protocol is
going to be used.

I think the right route for a CEP is to describe the feature that will be
added to the database, with the protocol being a mere requirement of the
high-level feature, for example:

CEP: Add Trie-backed memtable
- Trie Serialization Protocol: implementation detail of the above CEP

What is the difficulty of taking this approach: picking one of the myriad
features that will be enabled by Accord and using that as the initial CEP
to introduce the protocol to the database?


Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by "benedict@apache.org" <be...@apache.org>.
Actually, thinking about it again, the simple optimistic protocol would in fact guarantee system forward progress (i.e. independent of transaction formulation).


From: benedict@apache.org <be...@apache.org>
Date: Friday, 1 October 2021 at 09:14
To: dev@cassandra.apache.org <de...@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
Hi Jonathan,

It would be great if we could achieve a bandwidth higher than 1-2 short emails per week. It remains unclear to me what your goal is, and it would help if you could make a statement like “I want Cassandra to be able to do X” so that we can respond directly to it. I am also available to have another call, in which we can have a back and forth. Please feel free to propose a London-compatible time within the next week that is suitable for you.

In my opinion we are at risk of veering off-topic, though. This CEP is not to deliver interactive transactions, and to my knowledge nobody is proposing a CEP for interactive transactions. So, for the CEP at hand the salient question seems: does this CEP prevent us from implementing interactive transactions with properties X, Y, Z in future? To which the answer is almost certainly no.

However, to continue the discussion and respond directly to your queries, I believe we agree on the definition of an interactive transaction.

Two protocols were loosely outlined. The first, using timestamps for optimistic concurrency control, would indeed involve the possibility of aborts. It would not, however, inherently share the LWT problem where no transaction is able to make progress. Whether or not progress is guaranteed (in a livelock-free sense) would depend on the structure of the transactions that were interfering.
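
As a rough illustration of the first protocol, here is a minimal single-process Python sketch of a timestamp-validated optimistic commit. It assumes one last-write timestamp per partition, as suggested elsewhere in this thread. The Store and Txn classes and all of their methods are hypothetical names made up for this sketch; they are not Accord or Cassandra APIs, and a real implementation would have to perform the validation through consensus rather than in local memory.

```python
import itertools


class Store:
    """Toy in-memory store: partition key -> (value, last-write timestamp)."""

    def __init__(self):
        self.data = {}
        self.clock = itertools.count(1)  # stand-in for a real timestamp source

    def read(self, key):
        # Unknown partitions read as (None, 0).
        return self.data.get(key, (None, 0))

    def commit(self, observed, writes):
        """Apply writes only if every observed partition is unchanged."""
        for key, seen_ts in observed.items():
            if self.read(key)[1] != seen_ts:
                return False  # a newer write intervened: abort
        ts = next(self.clock)
        for key, value in writes.items():
            self.data[key] = (value, ts)
        return True


class Txn:
    """Hypothetical interactive transaction, buffered on the coordinator."""

    def __init__(self, store):
        self.store = store
        self.observed = {}  # key -> timestamp seen on first access
        self.writes = {}    # buffered writes, invisible until commit

    def read(self, key):
        if key in self.writes:
            return self.writes[key]
        value, ts = self.store.read(key)
        self.observed.setdefault(key, ts)
        return value

    def write(self, key, value):
        # Record the current timestamp so blind writes are validated too.
        self.observed.setdefault(key, self.store.read(key)[1])
        self.writes[key] = value

    def commit(self):
        return self.store.commit(self.observed, self.writes)


store = Store()
t1, t2 = Txn(store), Txn(store)
# Both transactions read the same counter and try to increment it.
t1.write("counter", (t1.read("counter") or 0) + 1)
t2.write("counter", (t2.read("counter") or 0) + 1)
results = [t1.commit(), t2.commit()]
print(results)  # [True, False]
```

In this toy model a transaction only aborts because some conflicting transaction committed first, so an abort always coincides with progress elsewhere. Whether that carries over to a distributed implementation depends on details this sketch does not model.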

This approach has the advantage of being very simple to implement, so that we could realistically support interactive transactions quite quickly. It has the additional advantage that transactions would execute very quickly by avoiding the WAN during construction, and as a result may in practice experience fewer aborts than protocols that guarantee livelock-freedom.
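
To make the latency argument concrete, the following Python sketch compares the round-trip cost of the two approaches as they are described in this thread: timestamp validation pays a LAN round trip for every step except the final one, while read/write intents pay a WAN round trip for every step. The function name, the step count and the round-trip times (1 ms LAN, 100 ms WAN) are illustrative assumptions, not measurements.

```python
def interactive_txn_latency_ms(steps, lan_rtt_ms, wan_rtt_ms, validate_on_commit_only):
    """Estimate round-trip latency for an interactive transaction.

    validate_on_commit_only=True models the timestamp-validating approach:
    every step but the final one costs a LAN round trip, and the final
    (commit) step costs a WAN round trip.
    validate_on_commit_only=False models read/write intents:
    every step costs a WAN round trip.
    """
    if steps < 1:
        raise ValueError("a transaction needs at least one step")
    if validate_on_commit_only:
        return (steps - 1) * lan_rtt_ms + wan_rtt_ms
    return steps * wan_rtt_ms


# Illustrative numbers only: 6 steps, 1 ms LAN RTT, 100 ms WAN RTT.
optimistic = interactive_txn_latency_ms(6, 1, 100, validate_on_commit_only=True)
intents = interactive_txn_latency_ms(6, 1, 100, validate_on_commit_only=False)
print(optimistic, intents)  # 105 600
```

A shorter transaction is exposed to conflicting writes for less time, which is the intuition behind the claim that the optimistic approach may see fewer aborts in practice.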

The second protocol proposed using read/write intents and would be able to support almost any behaviour you want. We could even utilise pessimistic concurrency control, or anything in-between. This is its own huge design space, and discussion of this approach and the trade-offs that could be made is (in my opinion) entirely out of scope for this CEP.


From: Jonathan Ellis <jb...@gmail.com>
Date: Friday, 1 October 2021 at 05:00
To: dev <de...@cassandra.apache.org>
Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
The obstacle for me is you've provided a protocol but not a fully fleshed
out architecture, so it's hard to fill in some of the blanks.  But it looks
to me like optimistic concurrency control for interactive transactions
applied to Accord would leave you in a LWT-like situation under fairly
light contention where nobody actually makes progress due to retries.

To make sure we're talking about the same thing, as Henrik pointed out,
interactive transactions mean multiple round trips from the client within a
transaction.  For example, here
<https://github.com/apavlo/py-tpcc/blob/master/pytpcc/drivers/sqlitedriver.py#L213>
is a simple implementation of the TPC-C New Order transaction.  The high
level logic (via
<https://courses.cs.washington.edu/courses/csep545/01wi/lectures/class1/tsld039.htm>)
is,

   1. Get records describing a warehouse, customer, & district
   2. Update the district
   3. Increment next available order number
   4. Insert record into Order and New-Order tables
   5. For 5-15 items, get Item record, get/update Stock record
   6. Insert Order-Line Record

As you can see, this requires a lot of client-side logic mixed in with the
actual SQL commands.


On Thu, Sep 30, 2021 at 2:30 AM benedict@apache.org <be...@apache.org>
wrote:

> Essentially this, although I think in practice we will need to track each
> partition’s timestamp separately (or optionally for reduced conflicts, each
> row or datum’s), and make them all part of the conditional application of
> the transaction - at least for strict-serializability.
>
> The alternative is to insert read/write intents for the transaction during
> each step, and to confirm they are still valid on commit, but this approach
> would require a WAN round-trip for each step in the interactive
> transaction, whereas the timestamp-validating approach can use a LAN
> round-trip for each step besides the final one, and is also much simpler to
> implement.
>
>
> From: Blake Eggleston <be...@apple.com.INVALID>
> Date: Thursday, 30 September 2021 at 05:47
> To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> You could establish a lower timestamp bound and buffer transaction state
> on the coordinator, then make the commit an operation that only applies if
> all partitions involved haven’t been changed by a more recent timestamp.
> You could also implement mvcc either in the storage layer or for some
> period of time by buffering commits on each replica before applying.
>
> > On Sep 29, 2021, at 6:18 PM, Jonathan Ellis <jb...@gmail.com> wrote:
> >
> > How are interactive transactions possible with Accord?
> >
> >
> >
> > On Tue, Sep 21, 2021 at 11:56 PM benedict@apache.org <benedict@apache.org>
> > wrote:
> >
> >> Could you explain why you believe this trade-off is necessary? We can
> >> support full SQL just fine with Accord, and I hope that we eventually
> >> do so.
> >>
> >> This domain is incredibly complex, so it is easy to reach wrong
> >> conclusions. I would invite you again to propose a system for discussion
> >> that you think offers something Accord is unable to, and that you
> >> consider desirable, and we can work from there.
> >>
> >> To pre-empt some possible discussions, I am not aware of anything we
> >> cannot do with Accord that we could do with either Calvin or Spanner.
> >> Interactive transactions are possible on top of Accord, as are
> >> transactions with an unknown read/write set. In each case the only cost
> >> is that they would use optimistic concurrency control, which is no worse
> >> than Spanner derivatives anyway (which I have to assume is your benchmark
> >> in this regard). I do not expect to deliver either functionality
> >> initially, but Accord takes us most of the way there for both.
> >>
> >>
> >> From: Jonathan Ellis <jb...@gmail.com>
> >> Date: Wednesday, 22 September 2021 at 05:36
> >> To: dev <de...@cassandra.apache.org>
> >> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> >> Right, I'm looking for exactly a discussion on the high level goals.
> >> Instead of saying "here's the goals and we ruled out X because Y" we
> >> should start with a discussion around, "Approach A allows X and W,
> >> approach B allows Y and Z" and decide together what the goals should be
> >> and what we are willing to trade to get those goals, e.g., are we willing
> >> to give up global strict serializability to get the ability to support
> >> full SQL? Both of these are nice to have!
> >>
> >> On Tue, Sep 21, 2021 at 9:52 PM benedict@apache.org <benedict@apache.org>
> >> wrote:
> >>
> >>> Hi Jonathan,
> >>>
> >>> These other systems are incompatible with the goals of the CEP. I do
> >>> discuss them (besides 2PC) in both the whitepaper and the CEP, and will
> >>> summarise that discussion below. A true and accurate comparison of
> these
> >>> other systems is essentially intractable, as there are complex
> subtleties
> >>> to each flavour, and those who are interested would be better served by
> >>> performing their own research.
> >>>
> >>> I think it is more productive to focus on what we want to achieve as a
> >>> community. If you believe the goals of this CEP are wrong for the
> >> project,
> >>> let’s focus on that. If you want to compare and contrast specific
> facets
> >> of
> >>> alternative systems that you consider to be preferable in some
> dimension,
> >>> let’s do that here or in a Q&A as proposed by Joey.
> >>>
> >>> The relevant goals are that we:
> >>>
> >>>
> >>>  1.  Guarantee strict serializable isolation on commodity hardware
> >>>  2.  Scale to any cluster size
> >>>  3.  Achieve optimal latency
> >>>
> >>> The approach taken by Spanner derivatives is rejected by (1) because
> they
> >>> guarantee only Serializable isolation (they additionally fail (3)).
> From
> >>> watching talks by YugaByte, and inferring from Cockroach’s
> >>> panic-cluster-death under clock skew, this is clearly considered by
> >>> everyone to be undesirable but necessary to achieve scalability.
> >>>
> >>> The approach taken by FaunaDB (Calvin) is rejected by (2) because its
> >>> sequencing layer requires a global leader process for the cluster,
> which
> >> is
> >>> incompatible with Cassandra’s scalability requirements. It additionally
> >>> fails (3) for global clients.
> >>>
> >>> Two phase commit fails (3). As an aside, AFAICT DynamoDB is today a
> >>> Spanner clone for its multi-key transaction functionality, not 2PC.
> >>>
> >>> Systems such as RAMP with even weaker isolation are not considered for
> >> the
> >>> simple reason that they do not even claim to meet (1).
> >>>
> >>> If we want to additionally offer weaker isolation levels than
> >>> Serializable, such as that provided by the recent RAMP-TAO paper,
> >> Cassandra
> >>> is likely able to support multiple distinct transaction layers that
> >> operate
> >>> independently. I would encourage you to file a CEP to explore how we
> can
> >>> meet these distinct use cases, but I consider them to be niche. I
> expect
> >>> that a majority of our user base desire strict serializable isolation,
> >> and
> >>> certainly no less than serializable isolation, to augment the existing
> >>> weaker isolation offered by quorum reads and writes.
> >>>
> >>> I would tangentially note that we are not an AP database under normal
> >>> recommended operation. A minority in any network partition cannot reach
> >>> QUORUM, so under recommended usage we are a high-availability
> leaderless
> >> CP
> >>> database.
> >>>
> >>>
> >>> From: Jonathan Ellis <jb...@gmail.com>
> >>> Date: Tuesday, 21 September 2021 at 23:45
> >>> To: dev <de...@cassandra.apache.org>
> >>> Subject: Re: [DISCUSS] CEP-15: General Purpose Transactions
> >>> Benedict, thanks for taking the lead in putting this together. Since
> >>> Cassandra is the only relevant database today designed around a
> >> leaderless
> >>> architecture, it's quite likely that we'll be better served with a
> custom
> >>> transaction design instead of trying to retrofit one from CP systems.
> >>>
> >>> The whitepaper here is a good description of the consensus algorithm
> >> itself
> >>> as well as its robustness and stability characteristics, and its
> >> comparison
> >>> with other state-of-the-art consensus algorithms is very useful.  In
> the
> >>> context of Cassandra, where a consensus algorithm is only part of what
> >> will
> >>> be implemented, I'd like to see a more complete evaluation of the
> >>> transactional side of things as well, including performance
> >> characteristics
> >>> as well as the types of transactions that can be supported and at
> least a
> >>> general idea of what it would look like applied to Cassandra. This will
> >>> allow the PMC to make a more informed decision about what tradeoffs are
> >>> best for the entire long-term project of first supplementing and
> >> ultimately
> >>> replacing LWT.
> >>>
> >>> (Allowing users to mix LWT and AP Cassandra operations against the same
> >>> rows was probably a mistake, so in contrast with LWT we’re not looking
> >> for
> >>> something fast enough for occasional use but rather something within a
> >>> reasonable factor of AP operations, appropriate to being the only way
> to
> >>> interact with tables declared as such.)
> >>>
> >>> Besides Accord, this should cover
> >>>
> >>> - Calvin and FaunaDB
> >>> - A Spanner derivative (no opinion on whether that should be Cockroach
> or
> >>> Yugabyte, I don’t think it’s necessary to cover both)
> >>> - A 2PC implementation (the Accord paper mentions DynamoDB but I
> suspect
> >>> there is more public information about MongoDB)
> >>> - RAMP
> >>>
> >>> Here’s an example of what I mean:
> >>>
> >>> =Calvin=
> >>>
> >>> Approach: global consensus (Paxos in Calvin, Raft in FaunaDB) to order
> >>> transactions, then replicas execute the transactions independently with
> >> no
> >>> further coordination.  No SPOF.  Transactions are batched by each
> >> sequencer
> >>> to keep this from becoming a bottleneck.
> >>>
> >>> Performance: Calvin paper (published 2012) reports linear scaling of
> >>> TPC-C New Order up to 500,000 transactions/s on 100 machines (EC2 XL
> >>> machines with 7GB RAM and 8 virtual cores).  Note that TPC-C New Order is
> >>> composed of four reads and four writes, so this is effectively 2M reads
> >>> and 2M writes as we normally measure them in C*.
> >>>
> >>> Calvin supports mixed read/write transactions, but because the
> >> transaction
> >>> execution logic requires knowing all partition keys in advance to
> ensure
> >>> that all replicas can reproduce the same results with no coordination,
> >>> reads against non-PK predicates must be done ahead of time
> >> (transparently,
> >>> by the server) to determine the set of keys, and this must be retried
> if
> >>> the set of rows affected is updated before the actual transaction
> >> executes.
> >>>
> >>> Batching and global consensus add latency -- 100ms in the Calvin paper
> >>> and apparently about 50ms in FaunaDB.  Glass half full: all transactions
> >>> (including multi-partition updates) are equally performant in Calvin
> >>> since the coordination is handled up front in the sequencing step.  Glass
> >>> half empty: even single-row reads and writes have to pay the full
> >>> coordination cost.  Fauna has optimized this away for reads but I am not
> >>> aware of a description of how they changed the design to allow this.
> >>>
> >>> Functionality and limitations: since the entire transaction must be
> known
> >>> in advance to allow coordination-less execution at the replicas, Calvin
> >>> cannot support interactive transactions at all.  FaunaDB mitigates this
> >> by
> >>> allowing server-side logic to be included, but a Calvin approach will
> >> never
> >>> be able to offer SQL compatibility.
> >>>
> >>> Guarantees: Calvin transactions are strictly serializable.  There is no
> >>> additional complexity or performance hit to generalizing to multiple
> >>> regions, apart from the speed of light.  And since Calvin is already
> >> paying
> >>> a batching latency penalty, this is less painful than for other
> systems.
> >>>
> >>> Application to Cassandra: B-.  Distributed transactions are handled by
> >> the
> >>> sequencing and scheduling layers, which are leaderless, and Calvin’s
> >>> requirements for the storage layer are easily met by C*.  But Calvin
> also
> >>> requires a global consensus protocol and LWT is almost certainly not
> >>> sufficiently performant, so this would require ZK or etcd (reasonable
> >> for a
> >>> library approach but not for replacing LWT in C* itself), or an
> >>> implementation of Accord.  I don’t believe Calvin would require
> >> additional
> >>> table-level metadata in Cassandra.
> >>>
> >>> On Sun, Sep 5, 2021 at 9:33 AM benedict@apache.org <benedict@apache.org>
> >>> wrote:
> >>>
> >>>> Wiki: https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-15%3A+General+Purpose+Transactions
> >>>> Whitepaper: https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf
> >>>> <https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf?version=1&modificationDate=1630847736966&api=v2>
> >>>> Prototype: https://github.com/belliottsmith/accord
> >>>>
> >>>> Hi everyone, I’d like to propose this CEP for adoption by the
> >> community.
> >>>>
> >>>> Cassandra has benefitted from LWTs for many years, but application
> >>>> developers that want to ensure consistency for complex operations must
> >>>> either accept the scalability bottleneck of serializing all related
> >> state
> >>>> through a single partition, or layer a complex state machine on top of
> >>> the
> >>>> database. These are sophisticated and costly activities that our users
> >>>> should not be expected to undertake. Since distributed databases are
> >>>> beginning to offer distributed transactions with fewer caveats, it is
> >>> past
> >>>> time for Cassandra to do so as well.
> >>>>
> >>>> This CEP proposes the use of several novel techniques that build upon
> >>>> research (that followed EPaxos) to deliver (non-interactive) general
> >>>> purpose distributed transactions. The approach is outlined in the
> >>> wikipage
> >>>> and in more detail in the linked whitepaper. Importantly, by adopting
> >>> this
> >>>> approach we will be the _only_ distributed database to offer global,
> >>>> scalable, strict serializable transactions in one wide area
> round-trip.
> >>>> This would represent a significant improvement in the state of the
> art,
> >>>> both in the academic literature and in commercial or open source
> >>> offerings.
> >>>>
> >>>> This work has been partially realised in a prototype. This partial
> >>>> prototype has been verified against Jepsen.io’s Maelstrom library and
> >>>> dedicated in-tree strict serializability verification tools, but much
> >>> work
> >>>> remains for the work to be production capable and integrated into
> >>> Cassandra.
> >>>>
> >>>> I propose including the prototype in the project as a new source
> >>>> repository, to be developed as a standalone library for integration
> >> into
> >>>> Cassandra. I hope the community sees the important value proposition
> of
> >>>> this proposal, and will adopt the CEP after this discussion, so that
> >> the
> >>>> library and its integration into Cassandra can be developed in
> parallel
> >>> and
> >>>> with the involvement of the wider community.
> >>>>
> >>>
> >>>
> >>> --
> >>> Jonathan Ellis
> >>> co-founder, http://www.datastax.com
> >>> @spyced
> >>>
> >>
> >>
> >> --
> >> Jonathan Ellis
> >> co-founder, http://www.datastax.com
> >> @spyced
> >>
> >
> >
> > --
> > Jonathan Ellis
> > co-founder, http://www.datastax.com
> > @spyced
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@cassandra.apache.org
> For additional commands, e-mail: dev-help@cassandra.apache.org
>


--
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced

Re: [DISCUSS] CEP-15: General Purpose Transactions

Posted by "benedict@apache.org" <be...@apache.org>.
Hi Jonathan,

It would be great if we could achieve a bandwidth higher than 1-2 short emails per week. It remains unclear to me what your goal is, and it would help if you could make a statement like “I want Cassandra to be able to do X” so that we can respond directly to it. I am also available for another call, in which we can have a back and forth; please feel free to propose a London-compatible time within the next week that suits you.

In my opinion we are at risk of veering off-topic, though. This CEP does not aim to deliver interactive transactions, and to my knowledge nobody is proposing a CEP for interactive transactions. So, for the CEP at hand the salient question seems to be: does this CEP prevent us from implementing interactive transactions with properties X, Y, Z in future? To which the answer is almost certainly no.

However, to continue the discussion and respond directly to your queries, I believe we agree on the definition of an interactive transaction.

Two protocols were loosely outlined. The first, using timestamps for optimistic concurrency control, would indeed involve the possibility of aborts. It would not, however, inherently suffer the LWT problem in which no transaction is able to make progress. Whether progress is guaranteed (in the livelock-free sense) would depend on the structure of the interfering transactions.

This approach has the advantage of being very simple to implement, so that we could realistically support interactive transactions quite quickly. It has the additional advantage that transactions would execute very quickly by avoiding the WAN during construction, and as a result may in practice experience fewer aborts than protocols that guarantee livelock-freedom.

The second protocol proposed using read/write intents and would be able to support almost any behaviour you want. We could even utilise pessimistic concurrency control, or anything in-between. This is its own huge design space, and discussion of this approach and the trade-offs that could be made is (in my opinion) entirely out of scope for this CEP.

