Posted to dev@cassandra.apache.org by Jonathan Ellis <jb...@gmail.com> on 2021/10/09 16:54:10 UTC

Tradeoffs for Cassandra transaction management

Hi all,

After calling several times for a broader discussion of goals and tradeoffs
around transaction management in the CEP-15 thread, I've put together a
short analysis to kick that off.

Here is a table that summarizes the state of the art for distributed
transactions that offer serializability, i.e., a superset of what you can
get with LWT.  (The most interesting option that this eliminates is RAMP.)
Since I'm not sure how this will render outside gmail, I've also uploaded
it here:
https://imgur.com/a/SCZ8jex
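(For concreteness: the LWT baseline being generalized here is essentially
per-partition compare-and-set, where each conditional write either applies
atomically or reports back the current value.  This is a toy single-process
model of that contract, my own illustration rather than driver code, with a
`cas` method standing in for what is really a Paxos round over replicas:

```python
# Toy model of LWT-style compare-and-set (CAS). In real Cassandra each
# cas() call is a Paxos round over the partition's replicas; here a dict
# stands in for the partition so the client retry pattern can be shown.
class Partition:
    def __init__(self):
        self.rows = {}

    def cas(self, key, expected, new):
        """Apply `new` only if the current value equals `expected`.
        Returns (applied, current_value), like LWT's [applied] column."""
        current = self.rows.get(key)
        if current == expected:
            self.rows[key] = new
            return True, new
        return False, current

def decrement_inventory(part, item_id):
    """Retry loop a client runs to serialize a decrement via CAS."""
    while True:
        count = part.rows.get(item_id)
        if count is None or count <= 0:
            return False              # out of stock; give up
        applied, _ = part.cas(item_id, count, count - 1)
        if applied:
            return True               # our read was still current

part = Partition()
part.rows["widget"] = 2
print(decrement_inventory(part, "widget"))  # prints True; count is now 1
```

Everything the comparison below covers is, in effect, a different strategy
for extending this single-partition contract to multi-partition,
multi-region transactions.)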
Comparison: Spanner / Cockroach / Calvin-Fauna / SLOG (on SLOG, see below).

Write latency
- Spanner: Global Paxos, plus 2PC for multi-partition transactions.  For
  intercontinental replication this is 100+ms.  Cloud Spanner does not
  allow truly global deployments for this reason.
- Cockroach: Single-region Paxos, plus 2PC.  I'm not very clear on how
  this works, but it results in non-strict serializability.  I didn't find
  actual numbers for CockroachDB other than "2ms in a single AZ," which is
  not a typical scenario.
- Calvin/Fauna: Global Raft.  Fauna posts actual numbers of ~70ms in
  production, which I assume corresponds to a multi-region deployment with
  all regions in the USA.  The SLOG paper says true global Calvin is
  200+ms.
- SLOG: Single-region Paxos (common case) with fallback to multi-region
  Paxos.  Under 10ms.

Scalability bottlenecks
- Spanner: Locks held during cross-region replication.
- Cockroach: Same as Spanner.
- Calvin/Fauna: OLLP approach required when PKs are not known in advance
  (mostly for indexed queries) -- results in retries under contention.
- SLOG: Same as Calvin.

Read latency at serial consistency
- Spanner: Timestamp from Paxos leader (may be cross-region), then read
  from local replica.
- Cockroach: Same as Spanner, I think.
- Calvin/Fauna: Same as writes.
- SLOG: Same as writes.

Maximum serializability flavor
- Spanner: Strict.
- Cockroach: Non-strict.
- Calvin/Fauna: Strict.
- SLOG: Strict.

Support for other isolation levels?
- Spanner: Snapshot.
- Cockroach: No.
- Calvin/Fauna: Snapshot (in Fauna).
- SLOG: The paper mentions dropping from strict-serializable to only
  serializable.  Probably could also support Snapshot like Fauna.

Interactive transaction support (required for SQL)
- Spanner: Yes.
- Cockroach: Yes.
- Calvin/Fauna: No.
- SLOG: No.

Potential for grafting onto C*
- Spanner: Nightmare.
- Cockroach: Nightmare.
- Calvin/Fauna: Reasonable; Calvin is relatively simple and the storage
  assumptions it makes are minimal.
- SLOG: I haven't thought about this enough.  SLOG may require versioned
  storage, e.g. see this comment
  <http://dbmsmusings.blogspot.com/2019/10/introducing-slog-cheating-low-latency.html?showComment=1570497003296#c5976719429355924873>.

(I have not included Accord here because it's not sufficiently clear to me
how to create a full transaction manager from the Accord protocol, so I
can't analyze many of the properties such a system would have.  The most
obvious solution would be "Calvin but with Accord instead of Raft," but
since Accord already does some Calvin-like things, that seems like it would
result in some suboptimal redundancy.)

After putting the above together it seems to
me that the two main areas of tradeoff are:

1. Is it worth giving up local latencies to get full global consistency?
Most LWT use cases use LOCAL_SERIAL.  While all of the above have more
efficient designs than LWT, it's still true that global serialization will
require 100+ms in the general case due to physical transmission latency.
So a design that allows local serialization with eventual consistency
between regions, or a design (like SLOG) that automatically infers a "home"
region that can do local consensus in the common case without giving up
global serializability, is desirable.

2. Is it worth giving up the possibility of SQL support to get the benefits
of deterministic transaction design?  To be clear, these benefits include
very significant ones around simplicity of design, higher write throughput,
and (in SLOG) lower read and write latencies.

I'll double-click on #2, because it
was asserted in the CEP-15 thread that Accord could support SQL by applying
known techniques on top.  This is mistaken.  Deterministic systems like
Calvin, SLOG, or Accord can support queries where the rows affected are not
known in advance, using a technique that Abadi calls OLLP (Optimistic Lock
Location Prediction), but this does not help when the transaction logic
itself is not known in advance.

Here is Daniel Abadi's explanation of OLLP from "An Overview of
Deterministic Database Systems"
<https://cacm.acm.org/magazines/2018/9/230601-an-overview-of-deterministic-database-systems/fulltext?mobile=false>:

"In practice, deterministic database systems that use ordered locking do
not wait until runtime for transactions to determine their access-sets.
Instead, they use a technique called OLLP where if a transaction does not
know its access-sets in advance, it is not inserted into the input log.
Instead, it is run in a trial mode that does not write to the database
state, but determines what it would have read or written to if it was
actually being processed. It is then annotated with the access-sets
determined during the trial run, and submitted to the input log for actual
processing. In the actual run, every replica processes the transaction
deterministically, acquiring locks for the transaction based on the
estimate from the trial run. In some cases, database state may have changed
in a way that the access-set estimates are now incorrect. Since a
transaction cannot read or write data for which it does not have a lock, it
must abort as soon as it realizes that it acquired the wrong set of locks.
But since the transaction is being processed deterministically at this
point, every replica will independently come to the same conclusion that
the wrong set of locks were acquired, and will all independently decide to
abort the transaction. The transaction then gets resubmitted to the input
log with the new access-set estimates annotated."

Clearly this does not work
if the server-visible logic changes between runs.  For instance, consider
this simple interactive transaction:

    cursor.execute("BEGIN TRANSACTION")
    count = cursor.execute("SELECT count FROM inventory WHERE id = 1").result[0]
    if count > 0:
        cursor.execute("UPDATE inventory SET count = count - 1 WHERE id = 1")
    cursor.execute("COMMIT TRANSACTION")

The first problem is that it's far from clear how to do a "trial run" of a
transaction that the server only knows pieces of at a time.  But even
worse, the server only knows that it received either a SELECT, or a SELECT
followed by an UPDATE.  It knows nothing about the client-side logic that
decides which statements to send.  So if the value read changes between the
trial run and the actual execution, there is no possibility of
transparently retrying; you're stuck reporting failure to the client.

So Abadi concludes,

[A]ll
recent [deterministic database] implementations have limited or no support
for interactive transactions, thereby preventing their use in many existing
deployments. If the advantages of deterministic database systems will be
realized in the coming years, one of two things must occur: either database
users must accept a stored procedure interface to the system [instead of
client-side SQL], or additional research must be performed in order to
enable improved support for interactive transactions.

TLDR:

We need to decide
if we want to give users local transaction latencies, either with an
approach inspired by SLOG or with tuneable serializability like LWT
(trading away global consistency).  I think the answer here is clearly
yes: we have abundant evidence from LWT that people care a great deal about
latency, and specifically that they are willing to live with
cross-datacenter eventual consistency to get low local latencies.

We also need to decide if we eventually want to support full SQL.  I think
this one is less clear; there are strong arguments both ways.

P.S. SLOG deserves more attention.  Here are links to the paper
<http://www.vldb.org/pvldb/vol12/p1747-ren.pdf>, Abadi's writeup
<http://dbmsmusings.blogspot.com/2019/10/introducing-slog-cheating-low-latency.html>,
and Murat Demirbas's reading-group writeup, which compares SLOG to
something called Ocean Vista that I'd never heard of but which reminds me
of Accord
<http://muratbuffalo.blogspot.com/2020/11/ocean-vista-gossip-based-visibility.html>.
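(To make the OLLP mechanics above concrete, here is a minimal
single-process sketch -- my own illustration, not code from any of the
systems discussed.  A transaction whose access set depends on data is
first run in trial mode to estimate which keys it touches, then executed
deterministically holding only that estimate, aborting and resubmitting if
the estimate turns out stale.  Note this only works because the full
transaction logic is available server-side as a function; the interactive
SELECT-then-maybe-UPDATE example above has no such function to re-run:

```python
# Minimal sketch of OLLP (Optimistic Lock Location Prediction).
# A transaction is a function txn(read) -> dict of writes; the keys it
# touches (its access set) can depend on the data it reads.
db = {"index:hot": "item7", "item7": 10, "item3": 5}

def txn(read):
    """Decrement whichever item the index currently points at."""
    target = read("index:hot")          # access set depends on this read
    return {target: read(target) - 1}

def trial_run(txn):
    """Run against current state without writing; record keys touched."""
    touched = set()
    def read(k):
        touched.add(k)
        return db[k]
    txn(read)
    return touched

def deterministic_run(txn, locked):
    """Re-execute holding only `locked`; abort if any other key is needed."""
    def read(k):
        if k not in locked:
            raise LookupError(k)        # wrong lock set -> abort
        return db[k]
    try:
        writes = txn(read)
    except LookupError:
        return False                    # every replica aborts identically
    db.update(writes)
    return True

locks = trial_run(txn)                  # estimates {"index:hot", "item7"}
db["index:hot"] = "item3"               # state changes before the real run
while not deterministic_run(txn, locks):
    locks = trial_run(txn)              # resubmit with fresh access set
print(db["item3"])                      # prints 4
```

The first deterministic run aborts because the index moved, and because the
abort decision depends only on the lock set and the log order, every
replica reaches it independently -- which is exactly the property that
evaporates when part of the logic lives in the client.)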
-- 
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced

Re: Tradeoffs for Cassandra transaction management

Posted by Jonathan Ellis <jb...@gmail.com>.
Thanks for flagging that, Alex.  Here it is without trying to include the
inline table:


On Mon, Oct 11, 2021 at 9:37 AM Oleksandr Petrov <ol...@gmail.com>
wrote:

> I realise this is not contributing to this discussion, but this email is
> very difficult to read because it seems like something has happened with
> formatting. For me it gets displayed as a single paragraph with no line
> breaks.
>
> There seems to be some overlap between the image uploaded to imgur and this
> email, but some things are only present in the email and not on the image.
>
> --
> alex p
>


-- 
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced

Re: Tradeoffs for Cassandra transaction management

Posted by Oleksandr Petrov <ol...@gmail.com>.
I realise this is not contributing to this discussion, but this email is
very difficult to read because it seems like something has happened with
formatting. For me it gets displayed as a single paragraph with no line
breaks.

There seems to be some overlap between the image uploaded to imgur and this
email, but some things are only present in the email and not on the image.

On Sat, Oct 9, 2021 at 6:54 PM Jonathan Ellis <jb...@gmail.com> wrote:

> * Hi all,After calling several times for a broader discussion of goals and
> tradeoffs around transaction management in the CEP-15 thread, I’ve put
> together a short analysis to kick that off.Here is a table that summarizes
> the state of the art for distributed transactions that offer
> serializability, i.e., a superset of what you can get with LWT.  (The most
> interesting option that this eliminates is RAMP.)Since I'm not sure how
> this will render outside gmail, I've also uploaded it here:
> https://imgur.com/a/SCZ8jex
> <https://imgur.com/a/SCZ8jex>SpannerCockroachCalvin/FaunaSLOG (see
> below)Write latencyGlobal Paxos, plus 2pc for multi-partition.For
> intercontinental replication this is 100+ms.  Cloud Spanner does not allow
> truly global deployments for this reason.Single-region Paxos, plus 2pc.
> I’m not very clear on how this works but it results in non-strict
> serializability.I didn’t find actual numbers for CR other than “2ms in a
> single AZ” which is not a typical scenario.Global Raft.  Fauna posts actual
> numbers of ~70ms in production which I assume corresponds to a multi-region
> deployment with all regions in the USA.  SLOG paper says true global Calvin
> is 200+ms.Single-region Paxos (common case) with fallback to multi-region
> Paxos.Under 10ms.Scalability bottlenecksLocks held during cross-region
> replicationSame as SpannerOLLP approach required when PKs are not known in
> advance (mostly for indexed queries) -- results in retries under
> contentionSame as CalvinRead latency at serial consistencyTimestamp from
> Paxos leader (may be cross-region), then read from local replica.Same as
> Spanner, I thinkSame as writesSame as writesMaximum serializability
> flavorStrictUn-strictStrictStrictSupport for other isolation
> levels?SnapshotNoSnapshot (in Fauna)Paper mentions dropping from
> strict-serializable to only serializable.  Probably could also support
> Snapshot like Fauna.Interactive transaction support (req’d for
> SQL)YesYesNoNoPotential for grafting onto C*NightmareNightmareReasonable,
> Calvin is relatively simple and the storage assumptions it makes are
> minimalI haven’t thought about this enough. SLOG may require versioned
> storage, e.g. see this comment
> <
> http://dbmsmusings.blogspot.com/2019/10/introducing-slog-cheating-low-latency.html?showComment=1570497003296#c5976719429355924873
> >.(I
> have not included Accord here because it’s not sufficiently clear to me how
> to create a full transaction manager from the Accord protocol, so I can’t
> analyze many of the properties such a system would have.  The most obvious
> solution would be “Calvin but with Accord instead of Raft”, but since
> Accord already does some Calvin-like things that seems like it would result
> in some suboptimal redundancy.)After putting the above together it seems to
> me that the two main areas of tradeoff are, 1. Is it worth giving up local
> latencies to get full global consistency?  Most LWT use cases use
> LOCAL_SERIAL.  While all of the above have more efficient designs than LWT,
> it’s still true that global serialization will require 100+ms in the
> general case due to physical transmission latency.  So a design that allows
> local serialization with EC between regions, or a design (like SLOG) that
> automatically infers a “home” region that can do local consensus in the
> common case without giving up global serializability, is desirable.2. Is it
> worth giving up the possibility of SQL support, to get the benefits of
> deterministic transaction design?  To be clear, these benefits include very
> significant ones around simplicity of design, higher write throughput, and
> (in SLOG) lower read and write latencies.I’ll doubleclick on #2 because it
> was asserted in the CEP-15 thread that Accord could support SQL by applying
> known techniques on top.  This is mistaken.  Deterministic systems like
> Calvin or SLOG or Accord can support queries where the rows affected are
> not known in advance using a technique that Abadi calls OLLP (Optimistic
> Lock Location Prediction), but this does not help when the transaction
> logic is not known in advance.Here is Daniel Abadi’s explanation of OLLP
> from “An Overview of Deterministic Database Systems
> <
> https://cacm.acm.org/magazines/2018/9/230601-an-overview-of-deterministic-database-systems/fulltext?mobile=false
> >:”In
> practice, deterministic database systems that use ordered locking do not
> wait until runtime for transactions to determine their access-sets.
> Instead, they use a technique called OLLP where if a transaction does not
> know its access-sets in advance, it is not inserted into the input log.
> Instead, it is run in a trial mode that does not write to the database
> state, but determines what it would have read or written to if it was
> actually being processed. It is then annotated with the access-sets
> determined during the trial run, and submitted to the input log for actual
> processing. In the actual run, every replica processes the transaction
> deterministically, acquiring locks for the transaction based on the
> estimate from the trial run. In some cases, database state may have changed
> in a way that the access sets estimates are now incorrect. Since a
> transaction cannot read or write data for which it does not have a lock, it
> must abort as soon as it realizes that it acquired the wrong set of locks.
> But since the transaction is being processed deterministically at this
> point, every replica will independently come to the same conclusion that
> the wrong set of locks were acquired, and will all independently decide to
> abort the transaction. The transaction then gets resubmitted to the input
> log with the new access-set estimates annotated.
>
> Clearly this does not work if the server-visible logic changes between
> runs.  For instance, consider this simple interactive transaction:
>
>     cursor.execute("BEGIN TRANSACTION")
>     count = cursor.execute("SELECT count FROM inventory WHERE id = 1").result[0]
>     if count > 0:
>         cursor.execute("UPDATE inventory SET count = count - 1 WHERE id = 1")
>     cursor.execute("COMMIT TRANSACTION")
>
> The first problem is that it’s far from clear how to do a “trial run” of a
> transaction that the server only knows pieces of at a time.  But even
> worse, the server only knows that it got either a SELECT, or a SELECT
> followed by an UPDATE.  It doesn’t know anything about the logic that would
> drive a change in those statements.  So if the value read changes between
> trial run and execution, there is no possibility of transparently retrying;
> you’re just screwed and have to report failure.
>
> So Abadi concludes,
>
>     [A]ll recent [deterministic database] implementations have limited or
>     no support for interactive transactions, thereby preventing their use
>     in many existing deployments. If the advantages of deterministic
>     database systems will be realized in the coming years, one of two
>     things must occur: either database users must accept a stored
>     procedure interface to the system [instead of client-side SQL], or
>     additional research must be performed in order to enable improved
>     support for interactive transactions.
>
> TL;DR:
>
> We need to decide if we want to give users local transaction latencies,
> either with an approach inspired by SLOG or with tuneable serializability
> like LWT (trading away global consistency).  I think the answer here is
> clearly yes: we have abundant evidence from LWT that people care a great
> deal about latency, and specifically that they are willing to live with
> cross-datacenter eventual consistency to get low local latencies.
>
> We also need to decide if we eventually want to support full SQL.  I think
> this one is less clear; there are strong arguments both ways.
>
> P.S. SLOG deserves more attention. Here are links to the paper
> <http://www.vldb.org/pvldb/vol12/p1747-ren.pdf>, Abadi’s writeup
> <http://dbmsmusings.blogspot.com/2019/10/introducing-slog-cheating-low-latency.html>,
> and Murat Demirbas’s reading group compares SLOG to something called Ocean
> Vista that I’ve never heard of but which reminds me of Accord
> <http://muratbuffalo.blogspot.com/2020/11/ocean-vista-gossip-based-visibility.html>.*
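To make the quoted point concrete, here is a toy Python sketch (the dict-backed store and the names `decrement_stock` and `run_with_retry` are invented for illustration, not any real deterministic-database API): a server can do a trial run and transparently retry only when it holds the whole transaction as a function; an interactive session never gives it that.

```python
# Toy sketch: a server can retry a transaction only when it holds the whole
# logic as a function (stored-procedure style), not just the statements it
# happened to receive from an interactive session.

inventory = {1: 1}  # id -> count

def decrement_stock(db, item_id):
    """Stored-procedure form: the branch is server-visible, so re-running is safe."""
    count = db[item_id]            # the reconnaissance read
    if count > 0:
        db[item_id] = count - 1
        return True
    return False

def run_with_retry(proc, db, *args, attempts=3):
    """Deterministic-DB style: trial-run, then re-execute if the reads went stale."""
    for _ in range(attempts):
        snapshot = dict(db)
        proc(dict(snapshot), *args)   # trial run against a copy of the snapshot
        if snapshot == db:            # read set unchanged -> execute for real
            return proc(db, *args)
    raise RuntimeError("could not serialize access")

print(run_with_retry(decrement_stock, inventory, 1))  # True; inventory[1] is now 0
```

An interactive client, by contrast, only ever sends the server a SELECT and (sometimes) an UPDATE; there is no `proc` for `run_with_retry` to replay, which is exactly the limitation Abadi describes.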
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced
>


-- 
alex p

Re: [IDEA] Read committed transaction with Accord

Posted by Henrik Ingo <he...@datastax.com>.
Thanks for clarifying Jonathan. I agree with your example.

It seems we have now moved into discussing specific requirements/semantics
for an interactive transaction implementation. Which is interesting, but
beyond what I will have time to think about tonight. At least off the top
of my head I can't say I have any data or experience to say how important
it is to satisfy the use case you are outlining.

As a gut feeling, I believe the alternative proposal outlined by Alex and
Benedict would take such locks in the database nodes that you describe. But
again, we'll have to return to this another day as my today is almost over.

henrik

On Thu, Oct 14, 2021 at 7:39 PM Jonathan Ellis <jb...@gmail.com> wrote:

> ... which is a long way of saying, in postgresql those errors are there as
> part of checking for correctness -- when you see them it means you did not
> ask for the appropriate locks.  It's not expected that you should write
> try/catch/retry loops to work around this.
>
> On Thu, Oct 14, 2021 at 11:13 AM Jonathan Ellis <jb...@gmail.com> wrote:
>
> > [Moving followup here from the other thread]
> >
> > I think there is in fact a difference here.
> >
> > Consider a workload consisting of two clients.  One of them is submitting
> > a stream of TPC-C new order transactions (new order client = NOC), and
> the
> > other is performing a simple increment of district next order ids
> > (increment district client = IDC).
> >
> > If we run these two workloads in postgresql under READ COMMITTED, both
> > clients will proceed happily (although we will get serialization
> anomalies).
> >
> > If we run them in pg under SERIALIZABLE, then the NOC client will get the
> > "could not serialize access" error whenever the IDC client updates the
> > district concurrently, which will be effectively every time since the IDC
> > transaction is much simpler.  But, SQL gives you a tool to allow NOC to
> > make progress, which is SELECT FOR UPDATE.  If the NOC performs its first
> > read with FOR UPDATE then it will (1) block until the current IDC
> > transaction completes and then (2) grab a lock that prevents further
> > updates from happening concurrently, allowing NOC to make progress.
> > Neither NOC nor IDC will ever get a "could not serialize access" error.
> >
> > It looks to me like the proposed design here would (1) not allow NOC to
> > make progress at READ COMMITTED, but also (2) does not provide the tools
> to
> > achieve progress with SERIALIZABLE either since locking outside of the
> > global consensus does not make sense.
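As a toy illustration of Jonathan's point, here is a Python simulation (a `threading.Lock` standing in for the row lock that SELECT FOR UPDATE acquires; none of this is real postgres behavior): with an explicit lock, both clients always make progress and neither ever sees a serialization error.

```python
# Toy simulation of the NOC/IDC example: SELECT FOR UPDATE behaves like taking
# an explicit lock on the district row, so the new-order client blocks briefly
# instead of failing with "could not serialize access".
import threading

district_lock = threading.Lock()   # stands in for the row lock FOR UPDATE takes
district = {"next_order_id": 1}
orders = []

def new_order_client():            # NOC: reads next_order_id "FOR UPDATE", uses it
    with district_lock:            # blocks until any concurrent IDC txn commits
        oid = district["next_order_id"]
        district["next_order_id"] = oid + 1
        orders.append(oid)

def increment_district_client():   # IDC: simple increment of next_order_id
    with district_lock:
        district["next_order_id"] += 1

threads = [threading.Thread(target=f)
           for f in [new_order_client, increment_district_client] * 50]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Both clients always made progress; no transaction ever had to be retried.
print(len(orders), district["next_order_id"])  # -> 50 101
```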
> >
> > On Wed, Oct 13, 2021 at 1:59 PM Henrik Ingo <he...@datastax.com>
> > wrote:
> >
> >> Sorry Jonathan, didn't see this reply earlier today.
> >>
> >> That would be common behaviour for many MVCC databases, including
> MongoDB,
> >> MySQL Galera Cluster, PostgreSQL...
> >>
> >>
> https://www.postgresql.org/docs/9.5/transaction-iso.html
> >>
> >> *"Applications using this level must be prepared to retry transactions
> due
> >> to serialization failures."*
> >>
> >> On Wed, Oct 13, 2021 at 3:19 AM Jonathan Ellis <jb...@gmail.com>
> wrote:
> >>
> >> > Hi Henrik,
> >> >
> >> > I don't see how this resolves the fundamental problem that I outlined
> to
> >> > start with, namely, that without having the entire logic of the
> >> transaction
> >> > available to it, the server cannot retry the transaction when
> concurrent
> >> > changes are found to have been applied after the reconnaissance reads
> >> (what
> >> > you call the conversational phase).
> >
> >
> > On Wed, Oct 13, 2021 at 5:00 AM Henrik Ingo <he...@datastax.com>
> > wrote:
> >
> >> On Wed, Oct 13, 2021 at 1:26 AM Blake Eggleston
> >> <be...@apple.com.invalid> wrote:
> >>
> >> > Hi Henrik,
> >> >
> >> > I would agree that the local serial experience for valid use cases
> >> should
> >> > be supported in some form before legacy LWT is replaced by Accord.
> >> >
> >> >
> >> Great! It seems there's a seed of consensus on this point.
> >>
> >>
> >> > Regarding your read committed proposal, I think this CEP discussion
> has
> >> > already spent too much time talking about hypothetical SQL
> >> implementations,
> >> > and I’d like to avoid veering off course again. However, since you’ve
> >> asked
> >> > a well thought out question with concrete goals and implementation
> >> ideas,
> >> > I’m happy to answer it. I just ask that if you want to discuss it
> >> beyond my
> >> > reply, you start a separate ‘[IDEA] Read committed transaction with
> >> Accord’
> >> > thread where we could talk about it a bit more without it feeling like
> >> we
> >> > need to delay a vote.
> >> >
> >> >
> >> This is a reasonable request. We were already in a side thread I guess,
> >> but
> >> I like organizing discussions into separate threads...
> >>
> >> Let's see if I manage to break the thread correctly simply by editing
> the
> >> subject...
> >>
> >> FWIW, my hope for this discussion was that by providing a simple yet
> >> concrete example, it would facilitate the discussion toward a CEP-15
> vote,
> >> not distract from it. As it happened, Alex Miller was writing a hugely
> >> helpful email concurrently with mine, which improves details in CEP-15,
> so
> >> I don't know if expecting the discussion to die out just yet is ignoring
> >> people who may be working off list to still understand this rather
> advanced
> >> reading material.
> >>
> >>
> >>
> >> > So I think it could work with some modifications.
> >> >
> >> > First you’d need to perform your select statements as accord reads,
> not
> >> > quorum reads. Otherwise you may not see writes that have been (or
> could
> >> > have been) committed. A multi-partition write could also appear to
> >> become
> >> > undone, if a write commit has not reached one of the keys or needs to
> be
> >> > recovered.
> >> >
> >>
> >> Ah right. I think we established early on that tables should be either
> >> Accord-only, or legacy C* only. I was too fixated on the "no other
> >> changes"
> >> and forgot this.
> >>
> >> This is then a very interesting detail you point out! It seems like
> >> potentially every statement now needs to go through the Accord consensus
> >> protocol, and this could become expensive, where my goal was to design
> the
> >> simplest and most lightweight example thinkable. BUT for read-only
> Accord
> >> transactions, where I specifically also don't care about
> serializability,
> >> wouldn't this be precisely the case where I can simply pick my own
> >> timestamp and do a stale read from a nearby replica?
> >>
> >>
> >> >
> >> > Second, when you talk about transforming mutations, I’m assuming
> you’re
> >> > talking about confirming primary keys do or do not exist,
> >>
> >>
> >> No, I was thinking more broadly of operations like `UPDATE table1 SET
> >> column1=x WHERE pk >= 10 and pk <= 20`
> >>
> >> My thinking was that I need to know the exact primary keys touched both
> >> during the conversational phase and the commit phase. In essence, this
> is
> >> an interactive reconnaisance phase.
> >>
> >> You make a great point that for statements where the PK is explicit,
> they
> >> can just be directly added to the write set and transaction state. Ex:
> >> `UPDATE table1 SET column1=x WHERE pk IN (1,2,3)`
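A rough sketch of the reconnaissance step being described, with an in-memory dict standing in for the table (all names are illustrative): the range predicate is resolved to an explicit key set, which is what the commit-phase transaction then carries as its write set.

```python
# Sketch of resolving a range predicate to explicit primary keys, so the
# commit-time transaction can operate on a concrete write set.

def reconnoiter_write_set(rows, lo, hi):
    """rows: pk -> row dict; returns the explicit PKs a range UPDATE touches."""
    return sorted(pk for pk in rows if lo <= pk <= hi)

table1 = {5: {"column1": "a"}, 12: {"column1": "b"},
          18: {"column1": "c"}, 25: {"column1": "d"}}

# UPDATE table1 SET column1 = 'x' WHERE pk >= 10 AND pk <= 20
write_set = reconnoiter_write_set(table1, 10, 20)
print(write_set)  # [12, 18] -- these keys are what the commit phase re-checks
```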
> >>
> >>
> >>
> >> > and supporting auto-incrementing primary keys. To confirm primary keys
> >> do
> >> > or do not exist, you’d also need to perform an accord read also.
> >>
> >>
> >> For sure.
> >>
> >>
> >> > For auto-incrementing primary keys, you’d need to do an accord
> >> read/write
> >> > operation to increment a counter somewhere (or just use uuids).
> >> >
> >> >
> >> I had not considered auto-increment at all, but if that would be a
> >> requirement, then I tend to translate "auto-increment" into "any service
> >> that can hand out unique integers". (In practice, no database can force
> me
> >> to commit the integers in the order that they're actually monotonically
> >> increasing, so "auto-increment" is an illusion, I realized at some point
> >> in
> >> my career.)
> >>
> >>
> >> > Finally, read committed does lock rows, so you’d still need to
> perform a
> >> > read on commit to confirm that the rows being written to haven’t been
> >> > modified since the transaction began.
> >> >
> >>
> >> Hmm...
> >>
> >> As a separate discussion is already diving into this, it seems
> >> like at least the SQL 1992 standard only says read committed must
> protect
> >> against P1 and that's it. My suspicion is that since most modern
> databases
> >> start from MVCC, they essentially "over deliver" when providing read
> >> committed, since the implementation naturally provides snapshot reads
> and
> >> in fact it would be complicated to do something less consistent.
> >>
> >> For this discussion it's not really important which interpretation is
> >> correct, since either is a reasonable semantic. For my purposes I'll
> just
> >> note that needing to re-execute all reads during the Accord phase
> (commit
> >> phase) would make the design more expensive, since the transaction is
> now
> >> executed twice. The goal of a simplistic lightweight semantic is
> achieved
> >> by not doing so and claiming the weaker interpretation of read committed
> >> is
> >> "correct".
> >>
> >> henrik
> >>
> >> --
> >>
> >> Henrik Ingo
> >>
> >> +358 40 569 7354 <358405697354>
> >>
> >> [image: Visit us online.] <https://www.datastax.com/>  [image: Visit us
> >> on
> >> Twitter.] <https://twitter.com/DataStaxEng>  [image: Visit us on
> >> YouTube.]
> >> <https://www.youtube.com/channel/UCqA6zOSMpQ55vvguq4Y0jAg>
> >>   [image: Visit my LinkedIn profile.] <https://www.linkedin.com/in/heingo/>
> >>
> >
> >
> > --
> > Jonathan Ellis
> > co-founder, http://www.datastax.com
> > @spyced
> >
>
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced
>


-- 

Henrik Ingo

+358 40 569 7354 <358405697354>

[image: Visit us online.] <https://www.datastax.com/>  [image: Visit us on
Twitter.] <https://twitter.com/DataStaxEng>  [image: Visit us on YouTube.]
<https://www.youtube.com/channel/UCqA6zOSMpQ55vvguq4Y0jAg>
  [image: Visit my LinkedIn profile.] <https://www.linkedin.com/in/heingo/>

Re: [IDEA] Read committed transaction with Accord

Posted by Jonathan Ellis <jb...@gmail.com>.
... which is a long way of saying, in postgresql those errors are there as
part of checking for correctness -- when you see them it means you did not
ask for the appropriate locks.  It's not expected that you should write
try/catch/retry loops to work around this.

On Thu, Oct 14, 2021 at 11:13 AM Jonathan Ellis <jb...@gmail.com> wrote:

> [Moving followup here from the other thread]
>
> I think there is in fact a difference here.
>
> Consider a workload consisting of two clients.  One of them is submitting
> a stream of TPC-C new order transactions (new order client = NOC), and the
> other is performing a simple increment of district next order ids
> (increment district client = IDC).
>
> If we run these two workloads in postgresql under READ COMMITTED, both
> clients will proceed happily (although we will get serialization anomalies).
>
> If we run them in pg under SERIALIZABLE, then the NOC client will get the
> "could not serialize access" error whenever the IDC client updates the
> district concurrently, which will be effectively every time since the IDC
> transaction is much simpler.  But, SQL gives you a tool to allow NOC to
> make progress, which is SELECT FOR UPDATE.  If the NOC performs its first
> read with FOR UPDATE then it will (1) block until the current IDC
> transaction completes and then (2) grab a lock that prevents further
> updates from happening concurrently, allowing NOC to make progress.
> Neither NOC nor IDC will ever get a "could not serialize access" error.
>
> It looks to me like the proposed design here would (1) not allow NOC to
> make progress at READ COMMITTED, but also (2) does not provide the tools to
> achieve progress with SERIALIZABLE either since locking outside of the
> global consensus does not make sense.
>
> On Wed, Oct 13, 2021 at 1:59 PM Henrik Ingo <he...@datastax.com>
> wrote:
>
>> Sorry Jonathan, didn't see this reply earlier today.
>>
>> That would be common behaviour for many MVCC databases, including MongoDB,
>> MySQL Galera Cluster, PostgreSQL...
>>
>> https://www.postgresql.org/docs/9.5/transaction-iso.html
>>
>> *"Applications using this level must be prepared to retry transactions due
>> to serialization failures."*
>>
>> On Wed, Oct 13, 2021 at 3:19 AM Jonathan Ellis <jb...@gmail.com> wrote:
>>
>> > Hi Henrik,
>> >
>> > I don't see how this resolves the fundamental problem that I outlined to
>> > start with, namely, that without having the entire logic of the
>> transaction
>> > available to it, the server cannot retry the transaction when concurrent
>> > changes are found to have been applied after the reconnaissance reads
>> (what
>> > you call the conversational phase).
>
>
> On Wed, Oct 13, 2021 at 5:00 AM Henrik Ingo <he...@datastax.com>
> wrote:
>
>> On Wed, Oct 13, 2021 at 1:26 AM Blake Eggleston
>> <be...@apple.com.invalid> wrote:
>>
>> > Hi Henrik,
>> >
>> > I would agree that the local serial experience for valid use cases
>> should
>> > be supported in some form before legacy LWT is replaced by Accord.
>> >
>> >
>> Great! It seems there's a seed of consensus on this point.
>>
>>
>> > Regarding your read committed proposal, I think this CEP discussion has
>> > already spent too much time talking about hypothetical SQL
>> implementations,
>> > and I’d like to avoid veering off course again. However, since you’ve
>> asked
>> > a well thought out question with concrete goals and implementation
>> ideas,
>> > I’m happy to answer it. I just ask that if you want to discuss it
>> beyond my
>> > reply, you start a separate ‘[IDEA] Read committed transaction with
>> Accord’
>> > thread where we could talk about it a bit more without it feeling like
>> we
>> > need to delay a vote.
>> >
>> >
>> This is a reasonable request. We were already in a side thread I guess,
>> but
>> I like organizing discussions into separate threads...
>>
>> Let's see if I manage to break the thread correctly simply by editing the
>> subject...
>>
>> FWIW, my hope for this discussion was that by providing a simple yet
>> concrete example, it would facilitate the discussion toward a CEP-15 vote,
>> not distract from it. As it happened, Alex Miller was writing a hugely
>> helpful email concurrently with mine, which improves details in CEP-15, so
>> I don't know if expecting the discussion to die out just yet is ignoring
>> people who may be working off list to still understand this rather advanced
>> reading material.
>>
>>
>>
>> > So I think it could work with some modifications.
>> >
>> > First you’d need to perform your select statements as accord reads, not
>> > quorum reads. Otherwise you may not see writes that have been (or could
>> > have been) committed. A multi-partition write could also appear to
>> become
>> > undone, if a write commit has not reached one of the keys or needs to be
>> > recovered.
>> >
>>
>> Ah right. I think we established early on that tables should be either
>> Accord-only, or legacy C* only. I was too fixated on the "no other
>> changes"
>> and forgot this.
>>
>> This is then a very interesting detail you point out! It seems like
>> potentially every statement now needs to go through the Accord consensus
>> protocol, and this could become expensive, where my goal was to design the
>> simplest and most lightweight example thinkable. BUT for read-only Accord
>> transactions, where I specifically also don't care about serializability,
>> wouldn't this be precisely the case where I can simply pick my own
>> timestamp and do a stale read from a nearby replica?
>>
>>
>> >
>> > Second, when you talk about transforming mutations, I’m assuming you’re
>> > talking about confirming primary keys do or do not exist,
>>
>>
>> No, I was thinking more broadly of operations like `UPDATE table1 SET
>> column1=x WHERE pk >= 10 and pk <= 20`
>>
>> My thinking was that I need to know the exact primary keys touched both
>> during the conversational phase and the commit phase. In essence, this is
>> an interactive reconnaissance phase.
>>
>> You make a great point that for statements where the PK is explicit, they
>> can just be directly added to the write set and transaction state. Ex:
>> `UPDATE table1 SET column1=x WHERE pk IN (1,2,3)`
>>
>>
>>
>> > and supporting auto-incrementing primary keys. To confirm primary keys
>> do
>> > or do not exist, you’d also need to perform an accord read also.
>>
>>
>> For sure.
>>
>>
>> > For auto-incrementing primary keys, you’d need to do an accord
>> read/write
>> > operation to increment a counter somewhere (or just use uuids).
>> >
>> >
>> I had not considered auto-increment at all, but if that would be a
>> requirement, then I tend to translate "auto-increment" into "any service
>> that can hand out unique integers". (In practice, no database can force me
>> to commit the integers in the order that they're actually monotonically
>> increasing, so "auto-increment" is an illusion, I realized at some point
>> in
>> my career.)
>>
>>
>> > Finally, read committed does lock rows, so you’d still need to perform a
>> > read on commit to confirm that the rows being written to haven’t been
>> > modified since the transaction began.
>> >
>>
>> Hmm...
>>
>> As a separate discussion is already diving into this, it seems
>> like at least the SQL 1992 standard only says read committed must protect
>> against P1 and that's it. My suspicion is that since most modern databases
>> start from MVCC, they essentially "over deliver" when providing read
>> committed, since the implementation naturally provides snapshot reads and
>> in fact it would be complicated to do something less consistent.
>>
>> For this discussion it's not really important which interpretation is
>> correct, since either is a reasonable semantic. For my purposes I'll just
>> note that needing to re-execute all reads during the Accord phase (commit
>> phase) would make the design more expensive, since the transaction is now
>> executed twice. The goal of a simplistic lightweight semantic is achieved
>> by not doing so and claiming the weaker interpretation of read committed
>> is
>> "correct".
>>
>> henrik
>>
>> --
>>
>> Henrik Ingo
>>
>> +358 40 569 7354 <358405697354>
>>
>> [image: Visit us online.] <https://www.datastax.com/>  [image: Visit us
>> on
>> Twitter.] <https://twitter.com/DataStaxEng>  [image: Visit us on
>> YouTube.]
>> <https://www.youtube.com/channel/UCqA6zOSMpQ55vvguq4Y0jAg>
>>   [image: Visit my LinkedIn profile.] <
>> https://www.linkedin.com/in/heingo/>
>>
>
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced
>


-- 
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced

Re: [IDEA] Read committed transaction with Accord

Posted by Jonathan Ellis <jb...@gmail.com>.
 [Moving followup here from the other thread]

I think there is in fact a difference here.

Consider a workload consisting of two clients.  One of them is submitting a
stream of TPC-C new order transactions (new order client = NOC), and the
other is performing a simple increment of district next order ids
(increment district client = IDC).

If we run these two workloads in postgresql under READ COMMITTED, both
clients will proceed happily (although we will get serialization anomalies).

If we run them in pg under SERIALIZABLE, then the NOC client will get the
"could not serialize access" error whenever the IDC client updates the
district concurrently, which will be effectively every time since the IDC
transaction is much simpler.  But, SQL gives you a tool to allow NOC to
make progress, which is SELECT FOR UPDATE.  If the NOC performs its first
read with FOR UPDATE then it will (1) block until the current IDC
transaction completes and then (2) grab a lock that prevents further
updates from happening concurrently, allowing NOC to make progress.
Neither NOC nor IDC will ever get a "could not serialize access" error.

It looks to me like the proposed design here would (1) not allow NOC to
make progress at READ COMMITTED, but also (2) does not provide the tools to
achieve progress with SERIALIZABLE either since locking outside of the
global consensus does not make sense.

On Wed, Oct 13, 2021 at 1:59 PM Henrik Ingo <he...@datastax.com>
wrote:

> Sorry Jonathan, didn't see this reply earlier today.
>
> That would be common behaviour for many MVCC databases, including MongoDB,
> MySQL Galera Cluster, PostgreSQL...
>
> https://www.postgresql.org/docs/9.5/transaction-iso.html
>
> *"Applications using this level must be prepared to retry transactions due
> to serialization failures."*
>
> On Wed, Oct 13, 2021 at 3:19 AM Jonathan Ellis <jb...@gmail.com> wrote:
>
> > Hi Henrik,
> >
> > I don't see how this resolves the fundamental problem that I outlined to
> > start with, namely, that without having the entire logic of the
> transaction
> > available to it, the server cannot retry the transaction when concurrent
> > changes are found to have been applied after the reconnaissance reads
> (what
> > you call the conversational phase).


On Wed, Oct 13, 2021 at 5:00 AM Henrik Ingo <he...@datastax.com>
wrote:

> On Wed, Oct 13, 2021 at 1:26 AM Blake Eggleston
> <be...@apple.com.invalid> wrote:
>
> > Hi Henrik,
> >
> > I would agree that the local serial experience for valid use cases should
> > be supported in some form before legacy LWT is replaced by Accord.
> >
> >
> Great! It seems there's a seed of consensus on this point.
>
>
> > Regarding your read committed proposal, I think this CEP discussion has
> > already spent too much time talking about hypothetical SQL
> implementations,
> > and I’d like to avoid veering off course again. However, since you’ve
> asked
> > a well thought out question with concrete goals and implementation ideas,
> > I’m happy to answer it. I just ask that if you want to discuss it beyond
> my
> > reply, you start a separate ‘[IDEA] Read committed transaction with
> Accord’
> > thread where we could talk about it a bit more without it feeling like we
> > need to delay a vote.
> >
> >
> This is a reasonable request. We were already in a side thread I guess, but
> I like organizing discussions into separate threads...
>
> Let's see if I manage to break the thread correctly simply by editing the
> subject...
>
> FWIW, my hope for this discussion was that by providing a simple yet
> concrete example, it would facilitate the discussion toward a CEP-15 vote,
> not distract from it. As it happened, Alex Miller was writing a hugely
> helpful email concurrently with mine, which improves details in CEP-15, so
> I don't know if expecting the discussion to die out just yet is ignoring
> people who may be working off list to still understand this rather advanced
> reading material.
>
>
>
> > So I think it could work with some modifications.
> >
> > First you’d need to perform your select statements as accord reads, not
> > quorum reads. Otherwise you may not see writes that have been (or could
> > have been) committed. A multi-partition write could also appear to become
> > undone, if a write commit has not reached one of the keys or needs to be
> > recovered.
> >
>
> Ah right. I think we established early on that tables should be either
> Accord-only, or legacy C* only. I was too fixated on the "no other changes"
> and forgot this.
>
> This is then a very interesting detail you point out! It seems like
> potentially every statement now needs to go through the Accord consensus
> protocol, and this could become expensive, where my goal was to design the
> simplest and most lightweight example thinkable. BUT for read-only Accord
> transactions, where I specifically also don't care about serializability,
> wouldn't this be precisely the case where I can simply pick my own
> timestamp and do a stale read from a nearby replica?
>
>
> >
> > Second, when you talk about transforming mutations, I’m assuming you’re
> > talking about confirming primary keys do or do not exist,
>
>
> No, I was thinking more broadly of operations like `UPDATE table1 SET
> column1=x WHERE pk >= 10 and pk <= 20`
>
> My thinking was that I need to know the exact primary keys touched both
> during the conversational phase and the commit phase. In essence, this is
> an interactive reconnaissance phase.
>
> You make a great point that for statements where the PK is explicit, they
> can just be directly added to the write set and transaction state. Ex:
> `UPDATE table1 SET column1=x WHERE pk IN (1,2,3)`
>
>
>
> > and supporting auto-incrementing primary keys. To confirm primary keys do
> > or do not exist, you’d also need to perform an accord read also.
>
>
> For sure.
>
>
> > For auto-incrementing primary keys, you’d need to do an accord read/write
> > operation to increment a counter somewhere (or just use uuids).
> >
> >
> I had not considered auto-increment at all, but if that would be a
> requirement, then I tend to translate "auto-increment" into "any service
> that can hand out unique integers". (In practice, no database can force me
> to commit the integers in the order that they're actually monotonically
> increasing, so "auto-increment" is an illusion, I realized at some point in
> my career.)
>
>
> > Finally, read committed does lock rows, so you’d still need to perform a
> > read on commit to confirm that the rows being written to haven’t been
> > modified since the transaction began.
> >
>
> Hmm...
>
> As we see in a separate discussion is already diving into this, it seems
> like at least the SQL 1992 standard only says read committed must protect
> against P1 and that's it. My suspicion is that since most modern databases
> start from MVCC, they essentially "over deliver" when providing read
> committed, since the implementation naturally provides snapshot reads and
> in fact it would be complicated to do something less consistent.
>
> For this discussion it's not really important which interpretation is
> correct, since either is a reasonable semantic. For my purposes I'll just
> note that needing to re-execute all reads during the Accord phase (commit
> phase) would make the design more expensive, since the transaction is now
> executed twice. The goal of a simplistic lightweight semantic is achieved
> by not doing so and claiming the weaker interpretation of read committed is
> "correct".
>
> henrik
>
> --
>
> Henrik Ingo
>
> +358 40 569 7354 <358405697354>
>
> [image: Visit us online.] <https://www.datastax.com/>  [image: Visit us on
> Twitter.] <https://twitter.com/DataStaxEng>  [image: Visit us on YouTube.]
> <https://www.youtube.com/channel/UCqA6zOSMpQ55vvguq4Y0jAg>
>   [image: Visit my LinkedIn profile.] <https://www.linkedin.com/in/heingo/
> >
>


-- 
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced

Re: [IDEA] Read committed transaction with Accord

Posted by Henrik Ingo <he...@datastax.com>.
On Wed, Oct 13, 2021 at 3:38 PM benedict@apache.org <be...@apache.org>
wrote:

> It may have been lost in the back and forth (there’s been a lot of
> emails), but I outlined an approach for READ COMMITTED (and SERIALIZABLE)
> read isolation that does not require a WAN trip.


Yes, thank you. I knew I had read it but was looking for this in the
CEP/paper itself and couldn't find it when searching. Thanks for
re-explaining here.

> However, for any multi-shard protocol it is unsafe to perform an
> uncoordinated read as each shard may have applied different transaction
> state – the timestamp you pick may be in the future on some shards.
>
>
Ok, fair enough.

Ah, I realize now we're talking past each other. With what I had in mind, I
did not even expect that different rows or partitions need to reflect the same
snapshot/timestamp. Just that each row independently reflects some state
that was committed. "Per row read committed", if you will. I realize now
that a) that requirement is probably trivially true for any read to any
node managed by Accord, and b) maybe it's not appropriate to call this read
committed after all. I agree that the spirit of read committed requires
that if 2 rows were modified by the same transaction, then a future read
including both rows, must show a consistent result of both the rows.


> For my purposes I'll just note that needing to re-execute all reads
> during the Accord phase (commit phase) would make the design more expensive
>
>
I realize now that my response above incorporated the same flawed thinking
about "per row read committed" isolation. Thanks for
persisting in educating me.



> As noted by Alex, the only thing that needs to be corroborated in this
> approach is the timestamps. However, this is only necessary for
> SERIALIZABLE isolation or above. For READ COMMITTED it would be enough to
> perform reads with the above properties, and to buffer writes until a final
> Accord transaction round. However, as also noted by Alex, the difficulty
> here is read-your-writes. This is a general problem for interactive
> transactions and – like many of these properties - orthogonal to Accord. A
> simple approach would be to nominate a coordinator for the transaction that
> buffers writes and integrates them into the transaction execution. This
> might impose some restrictions on the size of the transaction we want to
> support, and of course means if the coordinator fails the transaction also
> fails.
>

But that is true for Cassandra today as well. (For a shorter time window,
anyway.)
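A minimal sketch of the buffering coordinator described in the quote above (the class and method names are invented for illustration): writes go into a local buffer that the transaction's own reads consult first, and commit applies the buffer in a single operation.

```python
# Sketch of a coordinator that buffers writes for read-your-writes: reads
# consult the local buffer first, writes stay invisible to other transactions
# until a single atomic commit.

class BufferedTxn:
    def __init__(self, store):
        self.store = store          # stands in for the replicated, committed state
        self.write_buffer = {}      # key -> pending value (the "write intents")

    def read(self, key):
        if key in self.write_buffer:        # read-your-writes
            return self.write_buffer[key]
        return self.store.get(key)          # otherwise a committed value

    def write(self, key, value):
        self.write_buffer[key] = value      # invisible to other transactions

    def commit(self):
        self.store.update(self.write_buffer)  # one atomic multi-key operation
        self.write_buffer.clear()

store = {"k1": "old"}
txn = BufferedTxn(store)
txn.write("k1", "new")
print(txn.read("k1"), store["k1"])  # new old  (own write visible, store unchanged)
txn.commit()
print(store["k1"])                  # new
```

In the real design the final `commit` would be the Accord operation that installs the buffered writes atomically across all shards; here a plain dict update stands in for it.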


>
> If we want to remove all restrictions, we are back at the monoculture of
> Cockroach, YugaByte et al. Which, again, may both be implemented by, and
> co-exist with, Accord.
>
> In this world, complex transactions would insert read and write intents
> using an Accord operation, along with a transaction state record. If
> transactions insert conflicting intents, the concurrency control
> arbitration mechanism decides what happens (whether one transaction blocks,
> aborts, or what have you). There is a bunch of literature on this that is
> orthogonal to the Accord discussion.
>
> In the case of READ COMMITTED, I believe this can be particularly simple –
> we don’t need any read intents, only write intents which are essentially a
> distributed write buffer. The storage system uses these to answer reads
> from the transaction that inserted the write intents, but they are ignored
> for all other transactions. In this case, there is no arbitration needed as
> there is never a need for one transaction to prevent another’s progress. A
> final Accord operation commits these write intents atomically across all
> shards, so that reads that occur after this operation integrate these
> writes, and those that executed before do not.
>
> Note that in this world, one-shot transactions may still execute without
> participating in the complex transaction system. In the READ COMMITTED
> world this is particularly simple, they may simply execute immediately
> using normal Accord operations. But it remains true even for SERIALIZABLE
> isolation, so long as there are no read or write intents to these keys. In
> this case we must validate there are no such intents, and if any are found
> we may need to upgrade to a complex transaction. This can be done
> atomically as part of the Accord operation, and then the transaction
> concurrency control arbitration mechanism kicks in.
>

And now I teased you to outline a different read committed transaction
approach :-)

henrik
-- 

Henrik Ingo

+358 40 569 7354 <358405697354>

[image: Visit us online.] <https://www.datastax.com/>  [image: Visit us on
Twitter.] <https://twitter.com/DataStaxEng>  [image: Visit us on YouTube.]
<https://urldefense.proofpoint.com/v2/url?u=https-3A__www.youtube.com_channel_UCqA6zOSMpQ55vvguq4Y0jAg&d=DwMFaQ&c=adz96Xi0w1RHqtPMowiL2g&r=IFj3MdIKYLLXIUhYdUGB0cTzTlxyCb7_VUmICBaYilU&m=bmIfaie9O3fWJAu6lESvWj3HajV4VFwgwgVuKmxKZmE&s=16sY48_kvIb7sRQORknZrr3V8iLTfemFKbMVNZhdwgw&e=>
  [image: Visit my LinkedIn profile.] <https://www.linkedin.com/in/heingo/>

Re: [IDEA] Read committed transaction with Accord

Posted by "benedict@apache.org" <be...@apache.org>.
> It seems like potentially every statement now needs to go through the Accord consensus
> protocol, and this could become expensive, where my goal was to design the
> simplest and most lightweight example thinkable. BUT for read-only Accord
> transactions, where I specifically also don't care about serializability,
> wouldn't this be precisely the case where I can simply pick my own
> timestamp and do a stale read from a nearby replica?

It may have been lost in the back and forth (there’s been a lot of emails), but I outlined an approach for READ COMMITTED (and SERIALIZABLE) read isolation that does not require a WAN trip. However, for any multi-shard protocol it is unsafe to perform an uncoordinated read as each shard may have applied different transaction state – the timestamp you pick may be in the future on some shards.

So, initially, a sequence of ordinary Accord transactions offers READ COMMITTED isolation. To improve performance further it will be possible to offer stale reads from the local DC that meet the isolation level.

For single shard operations, picking a timestamp known to the shard is perfectly safe.
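The safety condition here can be illustrated with a toy model (the dict shape and function names are illustrative, not Accord APIs): an uncoordinated read at timestamp t is only safe if every shard involved has already applied all transactions up to t, which is trivially arrangeable for a single shard but not for an uncoordinated multi-shard read.

```python
# Toy model of the multi-shard read-timestamp hazard described above.
# Each shard is modeled as {"applied_up_to": <watermark timestamp>}.

def safe_read_timestamp(shards):
    """A timestamp no greater than every shard's applied watermark."""
    return min(s["applied_up_to"] for s in shards)

def is_safe(ts, shards):
    # ts is "in the future" on any shard whose watermark it exceeds,
    # which is exactly the unsafe case for an uncoordinated read.
    return all(ts <= s["applied_up_to"] for s in shards)
```

For a single shard, any timestamp the shard already knows passes this check by construction; across shards, a timestamp safe on one shard may fail on another, which is why the read must be coordinated.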

> For my purposes I'll just note that needing to re-execute all reads during the Accord phase (commit phase) would make the design more expensive

As noted by Alex, the only thing that needs to be corroborated in this approach is the timestamps. However, this is only necessary for SERIALIZABLE isolation or above. For READ COMMITTED it would be enough to perform reads with the above properties, and to buffer writes until a final Accord transaction round. However, as also noted by Alex, the difficulty here is read-your-writes. This is a general problem for interactive transactions and – like many of these properties - orthogonal to Accord. A simple approach would be to nominate a coordinator for the transaction that buffers writes and integrates them into the transaction execution. This might impose some restrictions on the size of the transaction we want to support, and of course means if the coordinator fails the transaction also fails.
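The nominated-coordinator approach sketched above can be shown in miniature (class and method names are illustrative, not Accord or Cassandra APIs): the coordinator buffers writes locally, overlays them on the transaction's own reads, and applies the whole buffer in one final round.

```python
# Minimal sketch of a coordinator that buffers a transaction's writes and
# integrates them into its reads (read-your-writes), assuming the final
# atomic apply would be a single Accord transaction round.

class TxnCoordinator:
    def __init__(self, storage):
        self.storage = storage       # committed state, e.g. a dict-like store
        self.write_buffer = {}       # this transaction's uncommitted writes

    def write(self, key, value):
        # Buffered on the coordinator; the database is not modified yet.
        self.write_buffer[key] = value

    def read(self, key):
        # Read-your-writes: the local buffer wins over committed state;
        # other transactions never see this buffer.
        if key in self.write_buffer:
            return self.write_buffer[key]
        return self.storage.get(key)

    def commit(self):
        # Stand-in for the final Accord round applying the buffer atomically.
        self.storage.update(self.write_buffer)
        self.write_buffer.clear()
```

The buffer living on one coordinator is also what imposes the size restriction and single-point-of-failure behavior mentioned above.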

If we want to remove all restrictions, we are back at the monoculture of Cockroach, YugaByte et al. Which, again, may both be implemented by, and co-exist with, Accord.

In this world, complex transactions would insert read and write intents using an Accord operation, along with a transaction state record. If transactions insert conflicting intents, the concurrency control arbitration mechanism decides what happens (whether one transaction blocks, aborts, or what have you). There is a bunch of literature on this that is orthogonal to the Accord discussion.

In the case of READ COMMITTED, I believe this can be particularly simple – we don’t need any read intents, only write intents which are essentially a distributed write buffer. The storage system uses these to answer reads from the transaction that inserted the write intents, but they are ignored for all other transactions. In this case, there is no arbitration needed as there is never a need for one transaction to prevent another’s progress. A final Accord operation commits these write intents atomically across all shards, so that reads that occur after this operation integrate these writes, and those that executed before do not.
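A single-shard toy version of these write intents (illustrative names, not real Accord or Cockroach code) makes the visibility rule concrete: an intent is tagged with its owning transaction and ignored by everyone else until the final commit operation folds it into committed state.

```python
# Sketch of write intents as a distributed write buffer: visible only to
# the owning transaction until an atomic commit, per the description above.

class Shard:
    def __init__(self):
        self.committed = {}          # {key: value}
        self.intents = {}            # {key: (owning_txn_id, value)}

    def put_intent(self, txn_id, key, value):
        self.intents[key] = (txn_id, value)

    def read(self, key, txn_id=None):
        # Only the transaction that wrote an intent can see it; all other
        # readers fall through to committed state, so no arbitration is needed.
        if key in self.intents and self.intents[key][0] == txn_id:
            return self.intents[key][1]
        return self.committed.get(key)

    def commit(self, txn_id):
        # The final Accord operation would run this step on every shard
        # atomically, so reads after it see the writes and reads before do not.
        for key, (owner, value) in list(self.intents.items()):
            if owner == txn_id:
                self.committed[key] = value
                del self.intents[key]
```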

Note that in this world, one-shot transactions may still execute without participating in the complex transaction system. In the READ COMMITTED world this is particularly simple, they may simply execute immediately using normal Accord operations. But it remains true even for SERIALIZABLE isolation, so long as there are no read or write intents to these keys. In this case we must validate there are no such intents, and if any are found we may need to upgrade to a complex transaction. This can be done atomically as part of the Accord operation, and then the transaction concurrency control arbitration mechanism kicks in.


[IDEA] Read committed transaction with Accord

Posted by Henrik Ingo <he...@datastax.com>.
On Wed, Oct 13, 2021 at 1:26 AM Blake Eggleston
<be...@apple.com.invalid> wrote:

> Hi Henrik,
>
> I would agree that the local serial experience for valid use cases should
> be supported in some form before legacy LWT is replaced by Accord.
>
>
Great! It seems there's a seed of consensus on this point.


> Regarding your read committed proposal, I think this CEP discussion has
> already spent too much time talking about hypothetical SQL implementations,
> and I’d like to avoid veering off course again. However, since you’ve asked
> a well thought out question with concrete goals and implementation ideas,
> I’m happy to answer it. I just ask that if you want to discuss it beyond my
> reply, you start a separate ‘[IDEA] Read committed transaction with Accord’
> thread where we could talk about it a bit more without it feeling like we
> need to delay a vote.
>
>
This is a reasonable request. We were already in a side thread I guess, but
I like organizing discussions into separate threads...

Let's see if I manage to break the thread correctly simply by editing the
subject...

FWIW, my hope for this discussion was that by providing a simple yet
concrete example, it would facilitate the discussion toward a CEP-15 vote,
not distract from it. As it happened, Alex Miller was writing a hugely
helpful email concurrently with mine, which improves details in CEP-15, so
I don't know whether expecting the discussion to die out just yet would
ignore people who may be working off-list to understand this rather
advanced reading material.



> So I think it could work with some modifications.
>
> First you’d need to perform your select statements as accord reads, not
> quorum reads. Otherwise you may not see writes that have been (or could
> have been) committed. A multi-partition write could also appear to become
> undone, if a write commit has not reached one of the keys or needs to be
> recovered.
>

Ah right. I think we established early on that tables should be either
Accord-only, or legacy C* only. I was too fixated on the "no other changes"
and forgot this.

This is then a very interesting detail you point out! It seems like
potentially every statement now needs to go through the Accord consensus
protocol, and this could become expensive, where my goal was to design the
simplest and most lightweight example thinkable. BUT for read-only Accord
transactions, where I specifically also don't care about serializability,
wouldn't this be precisely the case where I can simply pick my own
timestamp and do a stale read from a nearby replica?


>
> Second, when you talk about transforming mutations, I’m assuming you’re
> talking about confirming primary keys do or do not exist,


No, I was thinking more broadly of operations like `UPDATE table1 SET
column1=x WHERE pk >= 10 and pk <= 20`

My thinking was that I need to know the exact primary keys touched both
during the conversational phase and the commit phase. In essence, this is
an interactive reconnaissance phase.

You make a great point that for statements where the PK is explicit, they
can just be directly added to the write set and transaction state. Ex:
`UPDATE table1 SET column1=x WHERE pk IN (1,2,3)`
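A toy sketch of this split (function names and statement shape are illustrative, not Cassandra APIs): a mutation with an explicit primary-key list goes straight into the write set, while a ranged predicate is first executed as a reconnaissance SELECT of primary keys.

```python
# Sketch of "mutation as reconnaissance read": collect the exact primary
# keys a statement will touch, reading only when the keys aren't explicit.

def collect_write_set(statement, run_select):
    """Return the set of primary keys a mutation will touch.

    statement: dict with either 'pks' (explicit keys) or 'predicate'.
    run_select: callable that executes SELECT pk WHERE <predicate>.
    """
    if "pks" in statement:
        # e.g. UPDATE ... WHERE pk IN (1,2,3): keys known, no read needed.
        return set(statement["pks"])
    # e.g. UPDATE ... WHERE pk >= 10 AND pk <= 20: reconnaissance SELECT.
    return set(run_select(statement["predicate"]))
```

At commit time these keys become the Accord write set, along with the original mutation to apply to each.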



> and supporting auto-incrementing primary keys. To confirm primary keys do
> or do not exist, you’d need to perform an accord read as well.


For sure.


> For auto-incrementing primary keys, you’d need to do an accord read/write
> operation to increment a counter somewhere (or just use uuids).
>
>
I had not considered auto-increment at all, but if that would be a
requirement, then I tend to translate "auto-increment" into "any service
that can hand out unique integers". (In practice, no database can force me
to commit the integers in the order they were handed out, so a strictly
monotonically increasing "auto-increment" is an illusion, I realized at
some point in my career.)


> Finally, read committed does lock rows, so you’d still need to perform a
> read on commit to confirm that the rows being written to haven’t been
> modified since the transaction began.
>

Hmm...

As a separate discussion is already diving into this, it seems that at
least the SQL-92 standard only says read committed must protect against P1,
and that's it. My suspicion is that since most modern databases
start from MVCC, they essentially "over deliver" when providing read
committed, since the implementation naturally provides snapshot reads and
in fact it would be complicated to do something less consistent.

For this discussion it's not really important which interpretation is
correct, since either is a reasonable semantic. For my purposes I'll just
note that needing to re-execute all reads during the Accord phase (commit
phase) would make the design more expensive, since the transaction is now
executed twice. The goal of a simple, lightweight semantic is achieved by
not doing so and taking the weaker interpretation of read committed to be
"correct".

henrik


Re: Tradeoffs for Cassandra transaction management

Posted by Blake Eggleston <be...@apple.com.INVALID>.
Hi Henrik,

I would agree that the local serial experience for valid use cases should be supported in some form before legacy LWT is replaced by Accord.

Regarding your read committed proposal, I think this CEP discussion has already spent too much time talking about hypothetical SQL implementations, and I’d like to avoid veering off course again. However, since you’ve asked a well thought out question with concrete goals and implementation ideas, I’m happy to answer it. I just ask that if you want to discuss it beyond my reply, you start a separate ‘[IDEA] Read committed transaction with Accord’ thread where we could talk about it a bit more without it feeling like we need to delay a vote.

So I think it could work with some modifications. 

First you’d need to perform your select statements as accord reads, not quorum reads. Otherwise you may not see writes that have been (or could have been) committed. A multi-partition write could also appear to become undone, if a write commit has not reached one of the keys or needs to be recovered.

Second, when you talk about transforming mutations, I’m assuming you’re talking about confirming primary keys do or do not exist, and supporting auto-incrementing primary keys. To confirm primary keys do or do not exist, you’d need to perform an accord read as well. For auto-incrementing primary keys, you’d need to do an accord read/write operation to increment a counter somewhere (or just use uuids).

Finally, read committed does lock rows, so you’d still need to perform a read on commit to confirm that the rows being written to haven’t been modified since the transaction began.

Thanks,

Blake


> On Oct 12, 2021, at 1:54 PM, Henrik Ingo <he...@datastax.com> wrote:
> 
> Hi all
> 
> I was expecting to stay out of the way while a vote on CEP-15 seemed
> imminent. But discussing this tradeoffs thread with Jonathan, he encouraged
> me to say these points in my own words, so here we are.
> 
> 
> On Sun, Oct 10, 2021 at 7:17 AM Blake Eggleston
> <beggleston@apple.com.invalid <ma...@apple.com.invalid>> wrote:
> 
>> 1. Is it worth giving up local latencies to get full global consistency?
>> Most LWT use cases use
>> LOCAL_SERIAL.
>> 
>> This isn’t a tradeoff that needs to be made. There’s nothing about Accord
>> that prevents performing consensus in one DC and replicating the writes to
>> others. That’s not in scope for the initial work, but there’s no reason it
>> couldn’t be handled as a follow on if needed. I agree with Jeff that
>> LOCAL_SERIAL and LWTs are not usually done with a full understanding of the
>> implications, but there are some valid use cases. For instance, you can
>> enable an OLAP service to operate against another DC without impacting the
>> primary, assuming the service can tolerate inconsistency for data written
>> since the last repair, and there are some others.
>> 
>> 
> Let's start with the stated goal that CEP-15 is intended to be a better
> version of LWT.
> 
> Reading all the discussion, I feel like addressing the LOCAL_SERIAL /
> LOCAL_QUORUM use case is the one thing where Accord isn't strictly an
> improvement over LWT. I don't agree that Accord will just be so much faster
> anyway, that it would compensate a single network roundtrip around the
> world. Four LWT round-trips with LOCAL_SERIAL will still only be on the
> order of 10 ms, but global latencies for just a single round trip are
> hundreds of ms.
> 
> So, my suggestion to resolve this discussion would be that "local quorum
> latency experience" should be included in CEP-15 to meet its stated goal.
> If I have understood the CEP process correctly, this merely means that we
> agree this is a valid and significant use case in the Cassandra ecosystem.
> It doesn't mean that everything in the CEP must be released in a single v1
> release. At least personally I don't necessarily need to see a very
> detailed design for the implementation. But I'm optimistic it would resolve
> one open discussion if it was codified in the CEP that this is a use case
> that needs to be addressed.
> 
> 
>> 2. Is it worth giving up the possibility of SQL support, to get the
>> benefits of deterministic transaction design?
>> 
>> This is a false dilemma. Today, we’re proposing a deterministic
>> transaction design that addresses some very common user pain points. SQL
>> addresses different user pain point. If someone wants to add an sql
>> implementation in the future they can a) build it on top of accord b)
>> extend or improve accord or c) implement a separate system. The right
>> choice will depend on their goals, but accord won’t prevent work on it, the
>> same way the original lwt design isn’t preventing work on multi-partition
>> transactions. In the worst case, if the goals of a hypothetical sql project
>> are different enough to make them incompatible with accord, I don’t see any
>> reason why we couldn’t have 2 separate consensus systems, so long as people
>> are willing to maintain them and the use cases and available technologies
>> justify it.
>> 
> 
> 
> 
> The part of the discussion that's hard to deal with is "SQL support",
> "interactive transactions", or "complex transactions". Even if this is out
> of scope for CEP-15, it's a valid question to ask whether Accord would
> possibly help, but at least not prevent such future work. (The context
> being, Jonathan and myself both think of this as an important long term
> goal. You may have figured this out already!)
> 
> There are various ways we can get more insight into this question, but
> realistically writing a complete CEP (or a dozen CEPs) on "full SQL
> support" isn't one of them. On the other hand it seems CEP-15 itself
> proposes a conservative approach of developing first version(s) in a
> separate repository, from where it could then prove its usefulness! I feel
> like the authors have already proposed a conservative approach there that
> we can probably work with even without perfect knowledge of the future.
> 
> 
> 
> An idea I've been thinking about for a few days is, what would it take to
> implement interactive READ COMMITTED transactions on top of Accord? Now,
> this may not be an isolation level we want to market as the cool flagship
> feature. BUT this exercise does feel meaningful in a few ways:
> 
> * First of all, READ COMMITTED *is* a real isolation level in the SQL
> standard. So arguably this would be an existence proof of interactive SQL
> transactions built on top of Accord.
> 
> * It's even the default isolation level in PostgreSQL still today.
> 
> * An implementation of such transactions could even be used to benchmark
> the performance of such transactions and would give an approximation of how
> well Accord is suited for this task. This performance would be "best case"
> in the sense that I would expect Snapshot and Serializable to have worse
> performance, but that overhead can be considered as inherent in the
> isolation level rather than a fault of Accord.
> 
> * Implementing READ COMMITTED transactions on top of Accord is rather
> straightforward and can be described and discussed in this email thread,
> which could hopefully contribute to our understanding of the problem space.
> (Could also be a real CEP, if we think it's a useful first step for
> interactive transactions, but for now I'm dumping it here just to try to
> bring a concrete example into the discussion.)
> 
> 
> 
> Goal: READ COMMITTED interactive transactions
> 
> Dependency: Assume a Cassandra database with CEP-15 implemented.
> 
> 
> Approach: The conversational part of the transaction is a sequence of
> regular Cassandra reads and writes. Mutations are however executed as
> read-queries toward the database nodes. Database state isn't modified
> during the conversational phase, rather the primary keys of the
> to-be-mutated rows are stored for later use. Accord is essentially the
> commit phase of the transaction. All primary keys to be updated are the
> write set of the Accord transaction. There's no need to re-execute the
> reads, so the read set is empty.
> 
> We define READ COMMITTED as "whatever is returned by Cassandra when
> executing the query (with QUORUM consistency)". In other words, this
> functionality doesn't require any changes to the storage engine or other
> fundamental changes to Cassandra. The Accord commit is guaranteed to
> succeed per design and the READ COMMITTED transaction doesn't add any
> additional checks for conflicts. As such, this functionality remains
> abort-free.
> 
> 
> Proposed Changes: A transaction manager is added to the coordinator, with
> the following functionality:
> 
> BEGIN - initialize transaction state in the coordinator. After a BEGIN
> statement, the following commands are modified as follows:
> 
> INSERT, UPDATE, DELETE: Transform to an equivalent SELECT, returning the
> primary key columns. Store the original command (INSERT, etc…) and the
> returned primary keys into write set.
> 
> SELECT - no changes, except for read your own writes. The results of a
> SELECT query are returned to the client, but there's no need to store the
> results in the transaction state.
> 
> Transaction reads its own writes - For each SELECT the coordinator will
> overlay the current write set onto the query results. You can think of the
> write set as another memtable at Level -1.
> 
> Secondary indexes are supported without any additional work needed.
> 
> COMMIT - Perform a regular Accord transaction, using the above write set as
> the Accord write set. The read set is empty. The commit is guaranteed to
> succeed. In the end, clear state on the coordinator.
> 
> New interfaces: BEGIN and COMMIT. ROLLBACK. Maybe some command to declare
> READ COMMITTED isolation level and to get the current isolation level.
> 
> 
> Future work: A motivation for the above proposal is that the same scheme
> could be extended to support SNAPSHOT ISOLATION transactions. This would
> require MVCC support from the storage engine.
> 
> 
> 
> ---
> 
> It would be interesting to hear from list members whether the above appears
> to understand Accord (and SQL) correctly or whether I'm missing something?
> 
> henrik
> 
> 


Re: Tradeoffs for Cassandra transaction management

Posted by Henrik Ingo <he...@datastax.com>.
On Tue, Oct 12, 2021 at 11:54 PM Henrik Ingo <he...@datastax.com>
wrote:

> Secondary indexes are supported without any additional work needed.
>
> Correction: The "transaction reads its own writes" feature would require
to also store secondary index keys in the transaction state. These of
course needn't be part of the write set in the commit.


Re: Tradeoffs for Cassandra transaction management

Posted by "benedict@apache.org" <be...@apache.org>.
I just realised my email client hid a lot of your email, so I now realise I must have misunderstood your statement. I realise now you must have meant per-statement snapshot isolation. However, I believe that MVCC is an optimisation for such an isolation level, not a requirement – it is possible (like Calvin, and for distributed consensus protocols) to serialize their execution, unless I’m missing something.

Otherwise I agree entirely with your email, now I’ve read it 😊


From: benedict@apache.org <be...@apache.org>
Date: Wednesday, 13 October 2021 at 08:52
To: dev@cassandra.apache.org <de...@cassandra.apache.org>
Subject: Re: Tradeoffs for Cassandra transaction management
Hi Alex,

I hugely value and respect your input here, but I think in this case you may be mistaken.

Postgres[1] makes explicit that subsequent SELECT statements may see different data, and SQL Server[2] does the same. I believe the Oracle documents you reference do the same, but are more obtuse. They say that read committed is a statement-level isolation level, i.e. “read committed isolation level, this point is the time at which the statement was opened” though for read only transactions they upgrade this to transaction level isolation. Indeed, the ANSI SQL document you reference also supports this meaning: “Non-repeatable read” is defined only to be used later in a table on page 68 that defines READ COMMITTED as permitting these to occur.

Accord offers READ COMMITTED out of the box, essentially (modulo read-your-writes).

[1] https://www.postgresql.org/docs/7.2/xact-read-committed.html
[2] https://docs.microsoft.com/en-us/sql/connect/jdbc/understanding-isolation-levels?view=sql-server-ver15


From: Alex Miller <mi...@gmail.com>
Date: Wednesday, 13 October 2021 at 08:07
To: dev@cassandra.apache.org <de...@cassandra.apache.org>
Subject: Re: Tradeoffs for Cassandra transaction management
On Tue, Oct 12, 2021 at 3:55 PM Henrik Ingo <he...@datastax.com> wrote:
> We define READ COMMITTED as "whatever is returned by Cassandra when
> executing the query (with QUORUM consistency)". In other words, this
> functionality doesn't require any changes to the storage engine or other
> fundamental changes to Cassandra. The Accord commit is guaranteed to
> succeed per design and the READ COMMITTED transaction doesn't add any
> additional checks for conflicts. As such, this functionality remains
> abort-free.
>     [snip]
> Future work: A motivation for the above proposal is that the same scheme
> could be extended to support SNAPSHOT ISOLATION transactions. This would
> require MVCC support from the storage engine.

These two pieces together seem to imply that your claim is that Read
Committed may read whatever data was most recently committed during the
execution of the statement, and does not require MVCC.  Though I
agree that the standard[1] is very unclear as to what a "read" means
when defining a non-repeatable read:

>  2) P2 ("Non-repeatable read"): SQL-transaction T1 reads a row. SQL-
>           transaction T2 then modifies or deletes that row and performs
>           a COMMIT. If T1 then attempts to reread the row, it may receive
>            the modified value or discover that the row has been deleted.

The common implementation is that Read Committed reads from a snapshot
of the database state.  The documentation of various database
implementations are much more clear about this.  See, for example,
Oracle[2] or MySQL[3] on the subject.

So I believe Read Committed would also require MVCC support in the
storage engine the same way that Snapshot Isolation would.

[1]: https://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt
[2]: https://docs.oracle.com/cd/E25054_01/server.1111/e25789/consist.htm
, see section "Read Consistency in the Read Committed Isolation Level"
[3]: https://dev.mysql.com/doc/refman/8.0/en/innodb-transaction-isolation-levels.html


On Tue, Oct 12, 2021 at 3:55 PM Henrik Ingo <he...@datastax.com> wrote:
> Approach: The conversational part of the transaction is a sequence of
> regular Cassandra reads and writes. Mutations are however executed as
> read-queries toward the database nodes. Database state isn't modified
> during the conversational phase, rather the primary keys of the
> to-be-mutated rows are stored for later use. Accord is essentially the
> commit phase of the transaction. All primary keys to be updated are the
> write set of the Accord transaction. There's no need to re-execute the
> reads, so the read set is empty.

As I've pondered this over time, I personally specifically fault
read-your-uncommitted-writes as the reason why NewSQL databases are
essentially a design monoculture.  Every database persists uncommitted
writes in the database itself during execution.  Doing so encourages
those writes to be re-used for concurrency control (ie. write
intents), and then that places you in the exact client-driven 3PC
protocol which Cockroach, TiDB, and YugaByte all implement.  Even if
you find some radically different database for SQL, like leanXcale, it
_still_ persists uncommitted writes into the database.

And every time I've thought through this, I tend to agree.  It's too
exceedingly easy to write a SQL query which will exceed any local
limit imposed by memory, and it's too easy to write a query which runs
fine in production for a while, until it hits a memory limit and
begins to immediately fail.  There's a tremendous implementation
difference between `DELETE FROM Table` and `TRUNCATE Table`, and it's
relatively hard to explain that to naive users.

Memory constraints aside, merging a local write cache into remote data
for execution seems like it'd be quite a task.   Any desire for
efficient distributed query execution would push for a design where
query fragments can be pushed down to the nodes holding the data.  I
imagine that one would then need to distribute all writes out to each
partition along with the query fragment for them to execute, so that
they can merge the pending writes in with the existing data, but such
a solution also places a significant overhead burden on the database.
Clients need to resend all potentially relevant writes to servers on
each statement, and servers need to hold all of a client’s writes in
memory as they execute.  The alternative of trying to carefully calculate the
subset of the result affected by the local writes and union it into
the query executed without the local writes feels prohibitively
complex.  Both directions seem fraught with peril.

But the write intent approach makes this conveniently easy, as any
server that has a row of data, also has all the uncommitted rows from
currently in progress transactions, and thus can easily filter to the
correct row as part of its MVCC implementation.  Faced with these two
options, I understand why the world has chosen that write intents are
a great solution, but the monoculture does make me a bit sad.


On Tue, Oct 12, 2021 at 5:20 PM Jonathan Ellis <jb...@gmail.com> wrote:
> without having the entire logic of the transaction available to it, the server
> cannot retry the transaction when concurrent changes are found to have
> been applied after the reconnaissance reads (what you call the
> conversational phase).

This is not true at Read Committed!  The major advantage of Read
Committed over Repeatable Read is that by giving up a consistent read
timestamp from statement to statement, you give the server the ability
to retry your statement's execution multiple times, and wait out
conflicting transactions between each attempt.  Entire transactions
don't need to be retried when a conflict is encountered, only the
statement itself which encountered the conflict.  This is largely
pedantic nitpicking, as your question applies at Repeatable Read and
above, but server-side retry is an important (and expected) Read Committed
optimization!
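The statement-level retry being described can be sketched in a few lines (ConflictError and the execute callable are illustrative stand-ins, not any real database's API): only the conflicting statement is re-run, and each attempt sees whatever has been committed in the meantime.

```python
# Sketch of Read Committed statement-level retry: a conflicting statement
# is retried on a fresh read view rather than aborting the whole transaction.

class ConflictError(Exception):
    """Raised when a statement encounters a conflicting transaction."""

def run_statement(execute, max_attempts=5):
    """Retry a single statement until it executes without conflict."""
    for _ in range(max_attempts):
        try:
            # Each attempt reads the latest committed state, so it can
            # simply wait out the transaction it conflicted with.
            return execute()
        except ConflictError:
            continue
    raise ConflictError("statement could not be applied")
```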

But more related to your point, suppose an (Accord) transaction
attempts to re-validate its reconnaissance reads, fails, and aborts.
The client that submitted the transaction then notices that its
transaction failed, and re-runs reconnaissance, and re-submits the
transaction.  It does so in a loop until one of its transactions
reports successfully executing/applying.  Do you consider that an
interactive transaction?  If so, then I would wish to clarify that
this isn't a question of _if_ a deterministic database can support
interactive transactions, it's one of how efficiently can they be
supported.
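That client-driven loop might look like the following sketch (all names invented):

```python
# Sketch of the client-driven loop described above: run reconnaissance
# reads, submit a single-shot transaction that revalidates them, and
# repeat until one submission applies.
def run_interactive(recon_read, submit, max_attempts=10):
    for _ in range(max_attempts):
        snapshot = recon_read()             # conversational/recon phase
        applied, result = submit(snapshot)  # single-shot txn revalidates reads
        if applied:
            return result
    raise RuntimeError("transaction kept aborting under contention")
```

Whether this counts as an "interactive transaction" is the question posed above; the loop itself is cheap, but every abort repeats the reconnaissance reads.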

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@cassandra.apache.org
For additional commands, e-mail: dev-help@cassandra.apache.org

Re: Tradeoffs for Cassandra transaction management

Posted by "benedict@apache.org" <be...@apache.org>.
Hi Alex,

I hugely value and respect your input here, but I think in this case you may be mistaken.

Postgres[1] makes explicit that subsequent SELECT statements may see different data, and SQL Server[2] does the same. I believe the Oracle documents you reference do the same, but are more obtuse: they say that read committed is statement-level isolation, i.e. “read committed isolation level, this point is the time at which the statement was opened”, though for read-only transactions they upgrade this to transaction-level isolation. Indeed, the ANSI SQL document you reference also supports this meaning: “non-repeatable read” is defined only so it can be used later, in the table on page 68 that defines READ COMMITTED as permitting such reads to occur.

Accord offers READ COMMITTED out of the box, essentially (modulo read-your-writes).

[1] https://www.postgresql.org/docs/7.2/xact-read-committed.html
[2] https://docs.microsoft.com/en-us/sql/connect/jdbc/understanding-isolation-levels?view=sql-server-ver15


From: Alex Miller <mi...@gmail.com>
Date: Wednesday, 13 October 2021 at 08:07
To: dev@cassandra.apache.org <de...@cassandra.apache.org>
Subject: Re: Tradeoffs for Cassandra transaction management
On Tue, Oct 12, 2021 at 3:55 PM Henrik Ingo <he...@datastax.com> wrote:
> We define READ COMMITTED as "whatever is returned by Cassandra when
> executing the query (with QUORUM consistency)". In other words, this
> functionality doesn't require any changes to the storage engine or other
> fundamental changes to Cassandra. The Accord commit is guaranteed to
> succeed per design and the READ COMMITTED transaction doesn't add any
> additional checks for conflicts. As such, this functionality remains
> abort-free.
>     [snip]
> Future work: A motivation for the above proposal is that the same scheme
> could be extended to support SNAPSHOT ISOLATION transactions. This would
> require MVCC support from the storage engine.

These two pieces together seem to imply that your claim is that Read
Committed may read whatever the most recently committed data during
the execution of the statement and does not require MVCC.  Though I
agree that the standard[1] is very unclear as to what a "read" means
when defining a non-repeatable read:

>  2) P2 ("Non-repeatable read"): SQL-transaction T1 reads a row. SQL-
>           transaction T2 then modifies or deletes that row and performs
>           a COMMIT. If T1 then attempts to reread the row, it may receive
>            the modified value or discover that the row has been deleted.

The common implementation is that Read Committed reads from a snapshot
of the database state.  The documentation of various database
implementations are much more clear about this.  See, for example,
Oracle[2] or MySQL[3] on the subject.

So I believe Read Committed would also require MVCC support in the
storage engine the same way that Snapshot Isolation would.

[1]: https://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt
[2]: https://docs.oracle.com/cd/E25054_01/server.1111/e25789/consist.htm
, see section "Read Consistency in the Read Committed Isolation Level"
[3]: https://dev.mysql.com/doc/refman/8.0/en/innodb-transaction-isolation-levels.html
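A minimal MVCC sketch of the snapshot-based reading described above, where each statement reads at a snapshot taken when it starts (invented API, for illustration only):

```python
# Minimal MVCC sketch: each statement takes a snapshot timestamp when it
# begins, so commits that land mid-statement are invisible to it, while
# the next statement observes them -- the usual Read Committed behaviour.
class MVCC:
    def __init__(self):
        self.clock = 0
        self.versions = {}            # key -> [(commit_ts, value)]

    def commit(self, writes):
        self.clock += 1
        for k, v in writes.items():
            self.versions.setdefault(k, []).append((self.clock, v))

    def begin_statement(self):
        return self.clock             # snapshot timestamp for one statement

    def read(self, key, snapshot_ts):
        vs = [v for ts, v in self.versions.get(key, []) if ts <= snapshot_ts]
        return vs[-1] if vs else None
```

Without the version lists there is nothing older than the latest write to read from, which is the sense in which this behaviour needs MVCC in the storage engine.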


On Tue, Oct 12, 2021 at 3:55 PM Henrik Ingo <he...@datastax.com> wrote:
> Approach: The conversational part of the transaction is a sequence of
> regular Cassandra reads and writes. Mutations are however executed as
> read-queries toward the database nodes. Database state isn't modified
> during the conversational phase, rather the primary keys of the
> to-be-mutated rows are stored for later use. Accord is essentially the
> commit phase of the transaction. All primary keys to be updated are the
> write set of  the Accord transaction. There's no need to re-execute the
> reads, so the read set is empty.
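A rough sketch of the scheme quoted above, with all APIs invented (this is not CEP-15 code): mutations run as SELECTs that collect primary keys, and commit submits only a write set to Accord with an empty read set.

```python
# Invented-API sketch of the quoted proposal: the conversational phase
# never mutates state, it only records which primary keys each mutation
# would touch; Accord is used purely as the commit phase.
class ReadCommittedTxn:
    def __init__(self, session, accord):
        self.session = session
        self.accord = accord
        self.write_set = []  # (primary_key, mutation) pairs

    def execute(self, mutation, predicate):
        # Run the mutation as a SELECT returning the matching primary keys.
        pks = self.session.select_pks(predicate)
        self.write_set.extend((pk, mutation) for pk in pks)
        return pks

    def commit(self):
        # Single-shot Accord transaction: write set only, empty read set.
        return self.accord.submit(reads=[], writes=self.write_set)
```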

As I've pondered this over time, I personally specifically fault
read-your-uncommitted-writes as the reason why NewSQL databases are
essentially a design monoculture.  Every database persists uncommitted
writes in the database itself during execution.  Doing so encourages
those writes to be re-used for concurrency control (ie. write
intents), and then that places you in the exact client-driven 3PC
protocol which Cockroach, TiDB, and YugaByte all implement.  Even if
you find some radically different database for SQL, like leanXcale, it
_still_ persists uncommitted writes into the database.

And every time I've thought through this, I tend to agree.  It's all
too easy to write a SQL query which will exceed any local memory
limit, and too easy to write a query which runs fine in production for
a while, until it hits that limit and immediately begins to fail.
There's a tremendous implementation difference between `DELETE FROM
Table` and `TRUNCATE Table`, and it's relatively hard to explain that
to naive users.
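A toy model of that failure mode, with made-up sizes and limits: a client-side write buffer works fine until the table grows past the cap, at which point the same DELETE abruptly starts failing.

```python
# Toy sketch (limits and sizes invented): DELETE FROM t buffers one
# uncommitted write per matching row, so its local memory cost grows
# with the table until it crosses a fixed cap and the query fails.
class WriteBufferFull(Exception):
    pass

class BufferedTxn:
    def __init__(self, limit_bytes):
        self.limit = limit_bytes
        self.used = 0

    def buffer_delete(self, row_keys, key_bytes=32):
        for _ in row_keys:
            self.used += key_bytes
            if self.used > self.limit:
                raise WriteBufferFull("transaction exceeds local write buffer")
```

TRUNCATE, by contrast, is a single metadata operation regardless of row count, which is the implementation gap users find surprising.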

Memory constraints aside, merging a local write cache into remote data
for execution seems like it'd be quite a task.   Any desire for
efficient distributed query execution would push for a design where
query fragments can be pushed down to the nodes holding the data.  I
imagine that one would then need to distribute all writes out to each
partition along with the query fragment for them to execute, so that
they can merge the pending writes in with the existing data, but such
a solution also places a significant overhead burden on the database.
Clients need to resend all potentially relevant writes to servers on
each statement, and servers need to hold all of a client’s writes in
memory as they execute.  The alternative of trying to precisely
calculate a subset of the result affected by the local writes and
union it into the result of the query executed without the local
writes feels prohibitively complex.  Both directions seem fraught with
peril.

But the write intent approach makes this conveniently easy: any
server that has a row of data also has all the uncommitted rows from
currently in-progress transactions, and thus can easily filter to the
correct row as part of its MVCC implementation.  Faced with these two
options, I understand why the world has concluded that write intents
are a great solution, but the monoculture does make me a bit sad.


On Tue, Oct 12, 2021 at 5:20 PM Jonathan Ellis <jb...@gmail.com> wrote:
> without having the entire logic of the transaction available to it, the server
> cannot retry the transaction when concurrent changes are found to have
> been applied after the reconnaissance reads (what you call the
> conversational phase).

This is not true at Read Committed!  The major advantage of Read
Committed over Repeatable Read is that by giving up a consistent read
timestamp from statement to statement, you give the server the ability
to retry your statement's execution multiple times, and wait out
conflicting transactions between each attempt.  Entire transactions
don't need to be retried when a conflict is encountered, only the
statement which encountered it.  This is largely pedantic nitpicking,
since your question applies at Repeatable Read and above, but
server-side retry is an important (and expected) Read Committed
optimization!

But more related to your point, suppose an (Accord) transaction
attempts to re-validate its reconnaissance reads, fails, and aborts.
The client that submitted the transaction then notices that its
transaction failed, and re-runs reconnaissance, and re-submits the
transaction.  It does so in a loop until one of its transactions
reports successfully executing/applying.  Do you consider that an
interactive transaction?  If so, then I would wish to clarify that
this isn't a question of _if_ a deterministic database can support
interactive transactions, it's one of how efficiently can they be
supported.


Re: Tradeoffs for Cassandra transaction management

Posted by Alex Miller <mi...@gmail.com>.
On Wed, Oct 13, 2021 at 3:52 AM Henrik Ingo <he...@datastax.com> wrote:
> Aren't you actually pointing out a limitation in any "single shot"
> transactional algorithm? Including Accord itself, without any interactive
> part?
>
> What you are saying is that an Accord transaction is limited by the need
> for both the client, and coordinator, to be able to keep the entire
> transaction in memory and process it?

I'm under the belief as well that any single-shot transaction protocol
would require some limits on transaction size and/or duration, and
those limits would then be imposed on SQL in a way users coming from a
standard RDBMS (e.g. Postgres) wouldn't expect.  The closest that I've
seen databases get away with is having a distributed layer in the
database that serves as an in-memory lock manager.  Both Spanner and
leanXcale maintain locks in memory in the database while clients
execute transactions, which provides a much higher limit of what one
can do in a transaction, but still presents a degree of complexity to
manage to make sure that clients can't drive servers out of memory.

One could just state that the particular SQL implementation *is*
limited to whatever the constraints of the single-shot transaction
protocol are, and deliver clear documentation of what those limits are
to users, along with being loud about the fact that there are limits.
This has gone okay in other non-SQL systems.  My personal experience
in this subject comes from FoundationDB, which offers a rather
conservative 5 second transaction duration limit and 10MB transaction
size limit.  When presenting a raw key-value API and a database
specifically geared towards supporting OLTP workloads, it works out in
most situations, as users need to write their transactions from
scratch utilizing the database's documentation already.  OLTP is
characterized by short and small transactions, and so things tend to
align anyway.  Some users still tried to implement workloads which
weren't strictly OLTP, and ran into problems.  Offering SQL carries
with it a set of expectations for supported workloads, and I don't
have a concrete example that I can think of for a SQL system with
strict and conservative limits on queries.  My only notes of wisdom
here come from an ex-AWS person I once spoke to, who maintained a
system with partial SQL support, and commented that it was a mistake
due to the support load and customer confusion (but that was more
about a restricted SQL feature set than transaction limitations).
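Enforcing such conservative limits is itself simple; the 5-second and 10 MB figures below come from the FoundationDB description above, while the API is invented:

```python
# Sketch of FoundationDB-style conservative transaction limits: a fixed
# duration cap and a fixed size cap, checked as the transaction grows.
import time

class TxnLimitExceeded(Exception):
    pass

class LimitedTxn:
    MAX_DURATION_S = 5                    # figures from the message above
    MAX_SIZE_BYTES = 10 * 1024 * 1024

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.start = clock()
        self.size = 0

    def check(self, added_bytes=0):
        self.size += added_bytes
        if self.clock() - self.start > self.MAX_DURATION_S:
            raise TxnLimitExceeded("transaction exceeded 5 second limit")
        if self.size > self.MAX_SIZE_BYTES:
            raise TxnLimitExceeded("transaction exceeded 10 MB limit")
```

The hard part, as the paragraph notes, is not the check but setting user expectations when SQL is layered on top of it.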

That's not to say that single-shot transaction algorithms aren't
useful, even in the context of SQL.  CockroachDB uses a 3 phase
transaction protocol, which is reduced to only 1 phase when it's a
single partition transaction and Raft may perform the atomic
commitment on its own.  A 1RTT transaction protocol would allow one to
extend that optimized 1 phase protocol to a handful of partitions.
Instead of only supporting 1 phase execution of a point insert, one
could support 1 phase execution of point-ish queries, such as an
insert into a table along with a handful of indexes on that table.  I
think there would still need to be a way to degrade into some other
transaction protocol to support extremely large or long-running
queries, but any single-shot multi-partition transaction protocol
(Accord or otherwise) would likely offer ways to optimize your slow
path transaction protocol.  Maybe it's not really surprising though
that protocols designed for "let me transact my entire database at
once" versus "let me transact a few related keys together" turn out to
be relatively different sorts of protocols...
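The routing decision being described might be sketched as follows (thresholds and protocol names are made up):

```python
# Invented routing sketch of the degradation described above: single-
# partition writes take a 1-phase consensus path, a handful of partitions
# can use a 1RTT protocol, and everything else falls back to a slower
# general-purpose transaction protocol.
def choose_protocol(partitions_touched, fast_path_limit=4):
    if partitions_touched <= 1:
        return "single-partition-1-phase"
    if partitions_touched <= fast_path_limit:
        return "multi-partition-1rtt"
    return "general-slow-path"
```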

On Wed, Oct 13, 2021 at 3:52 AM Henrik Ingo <he...@datastax.com> wrote:
> I responded to Blake's similar comment on this topic. Out of respect for
> his request to move the discussion to a newly created thread, I will not
> elaborate here rather just reference my reply to Blake.

Oh!  I missed the new thread.  Thanks!  More transaction processing~~!



Re: Tradeoffs for Cassandra transaction management

Posted by Henrik Ingo <he...@datastax.com>.
Thank you Alex for your feedback, I greatly value these thoughts and always
enjoy learning new details in this space.

On Wed, Oct 13, 2021 at 10:07 AM Alex Miller <mi...@gmail.com> wrote:

> These two pieces together seem to imply that your claim is that Read
> Committed may read whatever the most recently committed data during
> the execution of the statement and does not require MVCC.  Though I
> agree that the standard[1] is very unclear as to what a "read" means
> when defining a non-repeatable read:
>

I responded to Blake's similar comment on this topic. Out of respect for
his request to move the discussion to a newly created thread, I will not
elaborate here rather just reference my reply to Blake.

The following observation seems more relevant for Accord itself and the
discussion on trade-offs, so I'll allow myself to continue within this
thread:


>
> On Tue, Oct 12, 2021 at 3:55 PM Henrik Ingo <he...@datastax.com>
> wrote:
> > Approach: The conversational part of the transaction is a sequence of
> > regular Cassandra reads and writes. Mutations are however executed as
> > read-queries toward the database nodes. Database state isn't modified
> > during the conversational phase, rather the primary keys of the
> > to-be-mutated rows are stored for later use. Accord is essentially the
> > commit phase of the transaction. All primary keys to be updated are the
> > write set of  the Accord transaction. There's no need to re-execute the
> > reads, so the read set is empty.
>
> As I've pondered this over time, I personally specifically fault
> read-your-uncommitted-writes as the reason why NewSQL databases are
> essentially a design monoculture.  Every database persists uncommitted
> writes in the database itself during execution.  Doing so encourages
> those writes to be re-used for concurrency control (ie. write
> intents), and then that places you in the exact client-driven 3PC
> protocol which Cockroach, TiDB, and YugaByte all implement.  Even if
> you find some radically different database for SQL, like leanXcale, it
> _still_ persists uncommitted writes into the database.
>
>
This is an interesting point of view... I simply assumed this approach was
inherited from the fact that single-server RDBMS implementations are built
this way, and NewSQL solutions reuse well-known designs for the database
engine.


> And every time I've thought through this, I tend to agree.  It's too
> exceedingly easy to write a SQL query which will exceed any local
> limit imposed by memory, and it's too easy to write a query which runs
> fine in production for a while, until it hits a memory limit and
> begins to immediately fail.  There's a tremendous implementation
> difference between `DELETE FROM Table` and `TRUNCATE Table`, and it's
> relatively hard to explain that to naive users.
>
> Memory constraints aside, merging a local write cache into remote data
> for execution seems like it'd be quite a task.   Any desire for
> efficient distributed query execution would push for a design where
> query fragments can be pushed down to the nodes holding the data.


Reading this I realize...

Aren't you actually pointing out a limitation in any "single shot"
transactional algorithm? Including Accord itself, without any interactive
part?

What you are saying is that an Accord transaction is limited by the need
for both the client, and coordinator, to be able to keep the entire
transaction in memory and process it?

Where Cassandra is coming from, I'm not particularly alarmed by this
limitation, as I would expect operations on a Cassandra database to be fast
and small, but it's an important limitation to call out for sure. Indeed,
those who have been worried that Accord will not be able to serve all
possible future use cases well may have found their first meaningful
concrete example to add to the list?

henrik
-- 

Henrik Ingo

+358 40 569 7354


Re: Tradeoffs for Cassandra transaction management

Posted by Henrik Ingo <he...@datastax.com>.
Sorry Jonathan, didn't see this reply earlier today.

That would be common behaviour for many MVCC databases, including MongoDB,
MySQL Galera Cluster, PostgreSQL...

https://www.postgresql.org/docs/9.5/transaction-iso.html

*"Applications using this level must be prepared to retry transactions due
to serialization failures."*

On Wed, Oct 13, 2021 at 3:19 AM Jonathan Ellis <jb...@gmail.com> wrote:

> Hi Henrik,
>
> I don't see how this resolves the fundamental problem that I outlined to
> start with, namely, that without having the entire logic of the transaction
> available to it, the server cannot retry the transaction when concurrent
> changes are found to have been applied after the reconnaissance reads (what
> you call the conversational phase).
>
> On Tue, Oct 12, 2021 at 3:55 PM Henrik Ingo <he...@datastax.com>
> wrote:
>
> > Hi all
> >
> > I was expecting to stay out of the way while a vote on CEP-15 seemed
> > imminent. But discussing this tradeoffs thread with Jonathan, he encouraged
> > me to say these points in my own words, so here we are.
> >
> >
> > On Sun, Oct 10, 2021 at 7:17 AM Blake Eggleston
> > <be...@apple.com.invalid> wrote:
> >
> > > 1. Is it worth giving up local latencies to get full global consistency?
> > > Most LWT use cases use LOCAL_SERIAL.
> > >
> > > This isn’t a tradeoff that needs to be made. There’s nothing about Accord
> > > that prevents performing consensus in one DC and replicating the writes to
> > > others. That’s not in scope for the initial work, but there’s no reason it
> > > couldn’t be handled as a follow-on if needed. I agree with Jeff that
> > > LOCAL_SERIAL and LWTs are not usually done with a full understanding of
> > > the implications, but there are some valid use cases. For instance, you
> > > can enable an OLAP service to operate against another DC without impacting
> > > the primary, assuming the service can tolerate inconsistency for data
> > > written since the last repair, and there are some others.
> > >
> > >
> > Let's start with the stated goal that CEP-15 is intended to be a better
> > version of LWT.
> >
> > Reading all the discussion, I feel like addressing the LOCAL_SERIAL /
> > LOCAL_QUORUM use case is the one thing where Accord isn't strictly an
> > improvement over LWT. I don't agree that Accord will just be so much faster
> > anyway that it would compensate for a single network round trip around the
> > world. Four LWT round trips with LOCAL_SERIAL will still only be on the
> > order of 10 ms, but global latencies for just a single round trip are
> > hundreds of ms.
> >
> > So, my suggestion to resolve this discussion would be that "local quorum
> > latency experience" should be included in CEP-15 to meet its stated goal.
> > If I have understood the CEP process correctly, this merely means that we
> > agree this is a valid and significant use case in the Cassandra ecosystem.
> > It doesn't mean that everything in the CEP must be released in a single v1
> > release. At least personally I don't necessarily need to see a very
> > detailed design for the implementation. But I'm optimistic it would resolve
> > one open discussion if it was codified in the CEP that this is a use case
> > that needs to be addressed.
> >
> >
> > > 2. Is it worth giving up the possibility of SQL support, to get the
> > > benefits of deterministic transaction design?
> > >
> > > This is a false dilemma. Today, we’re proposing a deterministic
> > > transaction design that addresses some very common user pain points. SQL
> > > addresses a different user pain point. If someone wants to add a SQL
> > > implementation in the future they can a) build it on top of Accord, b)
> > > extend or improve Accord, or c) implement a separate system. The right
> > > choice will depend on their goals, but Accord won’t prevent work on it,
> > > the same way the original LWT design isn’t preventing work on
> > > multi-partition transactions. In the worst case, if the goals of a
> > > hypothetical SQL project are different enough to make them incompatible
> > > with Accord, I don’t see any reason why we couldn’t have 2 separate
> > > consensus systems, so long as people are willing to maintain them and the
> > > use cases and available technologies justify it.
> > >
> >
> >
> >
> > The part of the discussion that's hard to deal with is "SQL support",
> > "interactive transactions", or "complex transactions". Even if this is out
> > of scope for CEP-15, it's a valid question to ask whether Accord would
> > possibly help, or at least not prevent, such future work. (The context
> > being, Jonathan and I both think of this as an important long-term goal.
> > You may have figured this out already!)
> >
> > There are various ways we can get more insight into this question, but
> > realistically writing a complete CEP (or a dozen CEPs) on "full SQL
> > support" isn't one of them. On the other hand, it seems CEP-15 itself
> > proposes a conservative approach of developing first version(s) in a
> > separate repository, from where it could then prove its usefulness! I feel
> > like the authors have already proposed a conservative approach there that
> > we can probably work with, even without perfect knowledge of the future.
> >
> >
> >
> > An idea I've been thinking about for a few days is, what would it take to
> > implement interactive READ COMMITTED transactions on top of Accord? Now,
> > this may not be an isolation level we want to market as the cool flagship
> > feature. BUT this exercise does feel meaningful in a few ways:
> >
> > * First of all, READ COMMITTED *is* a real isolation level in the SQL
> > standard. So arguably this would be an existence proof of interactive SQL
> > transactions built on top of Accord.
> >
> > * It's even the default isolation level in PostgreSQL still today.
> >
> > * An implementation of such transactions could even be used to benchmark
> > the performance of such transactions, and would give an approximation of
> > how well Accord is suited for this task. This performance would be "best
> > case" in the sense that I would expect Snapshot and Serializable to have
> > worse performance, but that overhead can be considered inherent in the
> > isolation level rather than a fault of Accord.
> >
> > * Implementing READ COMMITTED transactions on top of Accord is rather
> > straightforward and can be described and discussed in this email thread,
> > which could hopefully contribute to our understanding of the problem space.
> > (Could also be a real CEP, if we think it's a useful first step for
> > interactive transactions, but for now I'm dumping it here just to try to
> > bring a concrete example into the discussion.)
> >
> >
> >
> > Goal: READ COMMITTED interactive transactions
> >
> > Dependency: Assume a Cassandra database with CEP-15 implemented.
> >
> >
> > Approach: The conversational part of the transaction is a sequence of
> > regular Cassandra reads and writes. Mutations are however executed as
> > read-queries toward the database nodes. Database state isn't modified
> > during the conversational phase, rather the primary keys of the
> > to-be-mutated rows are stored for later use. Accord is essentially the
> > commit phase of the transaction. All primary keys to be updated are the
> > write set of  the Accord transaction. There's no need to re-execute the
> > reads, so the read set is empty.
> >
> > We define READ COMMITTED as "whatever is returned by Cassandra when
> > executing the query (with QUORUM consistency)". In other words, this
> > functionality doesn't require any changes to the storage engine or other
> > fundamental changes to Cassandra. The Accord commit is guaranteed to
> > succeed by design, and the READ COMMITTED transaction doesn't add any
> > additional checks for conflicts. As such, this functionality remains
> > abort-free.
> >
> >
> > Proposed Changes: A transaction manager is added to the coordinator, with
> > the following functionality:
> >
> > BEGIN - initialize transaction state in the coordinator. After a BEGIN
> > statement, the following commands are modified as follows:
> >
> > INSERT, UPDATE, DELETE: Transform to an equivalent SELECT, returning the
> > primary key columns. Store the original command (INSERT, etc…) and the
> > returned primary keys into the write set.
> >
> > SELECT - no changes, except for read your own writes. The results of a
> > SELECT query are returned to the client, but there's no need to store the
> > results in the transaction state.
> >
> > Transaction reads its own writes - For each SELECT the coordinator will
> > overlay the current write set onto the query results. You can think of
> > the write set as another memtable at Level -1.
> >
> > Secondary indexes are supported without any additional work needed.
> >
> > COMMIT - Perform a regular Accord transaction, using the above write set
> > as the Accord write set. The read set is empty. The commit is guaranteed
> > to succeed. In the end, clear state on the coordinator.
> >
> > New interfaces: BEGIN, COMMIT, and ROLLBACK. Maybe some command to declare
> > the READ COMMITTED isolation level and to get the current isolation level.
> >
> >
> > Future work: A motivation for the above proposal is that the same scheme
> > could be extended to support SNAPSHOT ISOLATION transactions. This would
> > require MVCC support from the storage engine.
> >
> >
> >
> > ---
> >
> > It would be interesting to hear from list members whether the above
> > appears to understand Accord (and SQL) correctly or whether I'm missing
> > something?
> >
> > henrik
> >
> >
> > --
> >
> > Henrik Ingo
> >
> > +358 40 569 7354 <358405697354>
> >
> > [image: Visit us online.] <https://www.datastax.com/>  [image: Visit us
> on
> > Twitter.] <https://twitter.com/DataStaxEng>  [image: Visit us on
> YouTube.]
> > <
> >
> https://urldefense.proofpoint.com/v2/url?u=https-3A__www.youtube.com_channel_UCqA6zOSMpQ55vvguq4Y0jAg&d=DwMFaQ&c=adz96Xi0w1RHqtPMowiL2g&r=IFj3MdIKYLLXIUhYdUGB0cTzTlxyCb7_VUmICBaYilU&m=bmIfaie9O3fWJAu6lESvWj3HajV4VFwgwgVuKmxKZmE&s=16sY48_kvIb7sRQORknZrr3V8iLTfemFKbMVNZhdwgw&e=
> > >
> >   [image: Visit my LinkedIn profile.] <
> https://urldefense.com/v3/__https://www.linkedin.com/in/heingo/__;!!PbtH5S7Ebw!MMVW9XMvdNiSsGMzANXPW8LZVyKo5VqBSfUxNQ5jBwo1jm6KaZD9DYC-25BgNSlOHyo$
> > >
> >
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced
>


-- 

Henrik Ingo

+358 40 569 7354 <358405697354>


Re: Tradeoffs for Cassandra transaction management

Posted by Jonathan Ellis <jb...@gmail.com>.
Hi Henrik,

I don't see how this resolves the fundamental problem that I outlined to
start with, namely, that without having the entire logic of the transaction
available to it, the server cannot retry the transaction when concurrent
changes are found to have been applied after the reconnaissance reads (what
you call the conversational phase).

On Tue, Oct 12, 2021 at 3:55 PM Henrik Ingo <he...@datastax.com>
wrote:

> Hi all
>
> I was expecting to stay out of the way while a vote on CEP-15 seemed
> imminent. But discussing this tradeoffs thread with Jonathan, he encouraged
> me to say these points in my own words, so here we are.
>
>
> On Sun, Oct 10, 2021 at 7:17 AM Blake Eggleston
> <be...@apple.com.invalid> wrote:
>
> > 1. Is it worth giving up local latencies to get full global consistency?
> > Most LWT use cases use
> > LOCAL_SERIAL.
> >
> > This isn’t a tradeoff that needs to be made. There’s nothing about Accord
> > that prevents performing consensus in one DC and replicating the writes
> > to others. That’s not in scope for the initial work, but there’s no
> > reason it couldn’t be handled as a follow on if needed. I agree with Jeff
> > that LOCAL_SERIAL and LWTs are not usually done with a full understanding
> > of the implications, but there are some valid use cases. For instance,
> > you can enable an OLAP service to operate against another DC without
> > impacting the primary, assuming the service can tolerate inconsistency
> > for data written since the last repair, and there are some others.
> >
> >
> Let's start with the stated goal that CEP-15 is intended to be a better
> version of LWT.
>
> Reading all the discussion, I feel like addressing the LOCAL_SERIAL /
> LOCAL_QUORUM use case is the one thing where Accord isn't strictly an
> improvement over LWT. I don't agree that Accord will just be so much faster
> anyway, that it would compensate a single network roundtrip around the
> world. Four LWT round-trips with LOCAL_SERIAL will still only be on the
> order of 10 ms, but global latencies for just a single round trip are
> hundreds of ms.
>
> So, my suggestion to resolve this discussion would be that "local quorum
> latency experience" should be included in CEP-15 to meet its stated goal.
> If I have understood the CEP process correctly, this merely means that we
> agree this is a valid and significant use case in the Cassandra ecosystem.
> It doesn't mean that everything in the CEP must be released in a single v1
> release. At least personally I don't necessarily need to see a very
> detailed design for the implementation. But I'm optimistic it would resolve
> one open discussion if it was codified in the CEP that this is a use case
> that needs to be addressed.
>
>
> > 2. Is it worth giving up the possibility of SQL support, to get the
> > benefits of deterministic transaction design?
> >
> > This is a false dilemma. Today, we’re proposing a deterministic
> > transaction design that addresses some very common user pain points. SQL
> > addresses a different user pain point. If someone wants to add an sql
> > implementation in the future they can a) build it on top of accord b)
> > extend or improve accord or c) implement a separate system. The right
> > choice will depend on their goals, but accord won’t prevent work on it,
> > the same way the original lwt design isn’t preventing work on
> > multi-partition transactions. In the worst case, if the goals of a
> > hypothetical sql project are different enough to make them incompatible
> > with accord, I don’t see any reason why we couldn’t have 2 separate
> > consensus systems, so long as people are willing to maintain them and the
> > use cases and available technologies justify it.
> >
>
>
>
> The part of the discussion that's hard to deal with is "SQL support",
> "interactive transactions", or "complex transactions". Even if this is out
> of scope for CEP-15, it's a valid question to ask whether Accord would
> possibly help, but at least not prevent such future work. (The context
> being, Jonathan and myself both think of this as an important long term
> goal. You may have figured this out already!)
>
> There are various ways we can get more insight into this question, but
> realistically writing a complete CEP (or a dozen CEPs) on "full SQL
> support" isn't one of them. On the other hand it seems CEP-15 itself
> proposes a conservative approach of developing first version(s) in a
> separate repository, from where it could then prove its usefulness! I feel
> like the authors have already proposed a conservative approach there that
> we can probably work with even without perfect knowledge of the future.
>
>
>
> An idea I've been thinking about for a few days is, what would it take to
> implement interactive READ COMMITTED transactions on top of Accord? Now,
> this may not be an isolation level we want to market as the cool flagship
> feature. BUT this exercise does feel meaningful in a few ways:
>
> * First of all, READ COMMITTED *is* a real isolation level in the SQL
> standard. So arguably this would be an existence proof of interactive SQL
> transactions built on top of Accord.
>
> * It's even the default isolation level in PostgreSQL still today.
>
> * An implementation of such transactions could even be used to benchmark
> the performance of such transactions and would give an approximation of how
> well Accord is suited for this task. This performance would be "best case"
> in the sense that I would expect Snapshot and Serializable to have worse
> performance, but that overhead can be considered as inherent in the
> isolation level rather than a fault of Accord.
>
> * Implementing READ COMMITTED transactions on top of Accord is rather
> straightforward and can be described and discussed in this email thread,
> which could hopefully contribute to our understanding of the problem space.
> (Could also be a real CEP, if we think it's a useful first step for
> interactive transactions, but for now I'm dumping it here just to try to
> bring a concrete example into the discussion.)
>
>
>
> Goal: READ COMMITTED interactive transactions
>
> Dependency: Assume a Cassandra database with CEP-15 implemented.
>
>
> Approach: The conversational part of the transaction is a sequence of
> regular Cassandra reads and writes. Mutations, however, are executed as
> read queries against the database nodes. Database state isn't modified
> during the conversational phase; rather, the primary keys of the
> to-be-mutated rows are stored for later use. Accord is essentially the
> commit phase of the transaction. All primary keys to be updated form the
> write set of the Accord transaction. There's no need to re-execute the
> reads, so the read set is empty.
>
> We define READ COMMITTED as "whatever is returned by Cassandra when
> executing the query (with QUORUM consistency)". In other words, this
> functionality doesn't require any changes to the storage engine or other
> fundamental changes to Cassandra. The Accord commit is guaranteed to
> succeed by design, and the READ COMMITTED transaction doesn't add any
> additional checks for conflicts. As such, this functionality remains
> abort-free.
>
>
> Proposed Changes: A transaction manager is added to the coordinator, with
> the following functionality:
>
> BEGIN - initialize transaction state in the coordinator. After a BEGIN
> statement, the following commands are modified as follows:
>
> INSERT, UPDATE, DELETE: Transform to an equivalent SELECT, returning the
> primary key columns. Store the original command (INSERT, etc…) and the
> returned primary keys into the write set.
>
> SELECT - no changes, except for read your own writes. The results of a
> SELECT query are returned to the client, but there's no need to store the
> results in the transaction state.
>
> Transaction reads its own writes - For each SELECT the coordinator will
> overlay the current write set onto the query results. You can think of the
> write set as another memtable at Level -1.
>
> Secondary indexes are supported without any additional work needed.
>
> COMMIT - Perform a regular Accord transaction, using the above write set as
> the Accord write set. The read set is empty. The commit is guaranteed to
> succeed. In the end, clear state on the coordinator.
>
> New interfaces: BEGIN, COMMIT, and ROLLBACK. Maybe some command to declare
> the READ COMMITTED isolation level and to get the current isolation level.
>
>
> Future work: A motivation for the above proposal is that the same scheme
> could be extended to support SNAPSHOT ISOLATION transactions. This would
> require MVCC support from the storage engine.
>
>
>
> ---
>
> It would be interesting to hear from list members whether the above appears
> to understand Accord (and SQL) correctly or whether I'm missing something?
>
> henrik
>
>
> --
>
> Henrik Ingo
>
> +358 40 569 7354 <358405697354>
>
>
-- 
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced

Re: Tradeoffs for Cassandra transaction management

Posted by Henrik Ingo <he...@datastax.com>.
Hi all

I was expecting to stay out of the way while a vote on CEP-15 seemed
imminent. But discussing this tradeoffs thread with Jonathan, he encouraged
me to say these points in my own words, so here we are.


On Sun, Oct 10, 2021 at 7:17 AM Blake Eggleston
<be...@apple.com.invalid> wrote:

> 1. Is it worth giving up local latencies to get full global consistency?
> Most LWT use cases use
> LOCAL_SERIAL.
>
> This isn’t a tradeoff that needs to be made. There’s nothing about Accord
> that prevents performing consensus in one DC and replicating the writes to
> others. That’s not in scope for the initial work, but there’s no reason it
> couldn’t be handled as a follow on if needed. I agree with Jeff that
> LOCAL_SERIAL and LWTs are not usually done with a full understanding of the
> implications, but there are some valid use cases. For instance, you can
> enable an OLAP service to operate against another DC without impacting the
> primary, assuming the service can tolerate inconsistency for data written
> since the last repair, and there are some others.
>
>
Let's start with the stated goal that CEP-15 is intended to be a better
version of LWT.

Reading all the discussion, I feel like addressing the LOCAL_SERIAL /
LOCAL_QUORUM use case is the one thing where Accord isn't strictly an
improvement over LWT. I don't agree that Accord will just be so much faster
anyway that it would compensate for a single network roundtrip around the
world. Four LWT round-trips with LOCAL_SERIAL will still only be on the
order of 10 ms, but global latencies for just a single round trip are
hundreds of ms.
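To make the arithmetic behind that claim concrete, here is a toy calculation (the round-trip numbers are illustrative assumptions, not benchmarks):

```python
# Back-of-envelope latency comparison for the argument above. The
# round-trip times are illustrative assumptions, not measurements.

LOCAL_RTT_MS = 2.5    # assumed intra-region round trip
GLOBAL_RTT_MS = 150   # assumed cross-continent round trip

# LWT at LOCAL_SERIAL: roughly four local round trips
# (prepare/promise, read, propose/accept, commit).
lwt_local_serial_ms = 4 * LOCAL_RTT_MS

# Globally replicated consensus: a single wide-area round trip dominates.
global_consensus_ms = 1 * GLOBAL_RTT_MS

print(f"LWT (LOCAL_SERIAL): ~{lwt_local_serial_ms:.0f} ms")
print(f"Global consensus:   ~{global_consensus_ms:.0f} ms")
```

Even if Accord halves the number of local round trips, one wide-area round trip still dwarfs the local total by an order of magnitude.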

So, my suggestion to resolve this discussion would be that "local quorum
latency experience" should be included in CEP-15 to meet its stated goal.
If I have understood the CEP process correctly, this merely means that we
agree this is a valid and significant use case in the Cassandra ecosystem.
It doesn't mean that everything in the CEP must be released in a single v1
release. At least personally I don't necessarily need to see a very
detailed design for the implementation. But I'm optimistic it would resolve
one open discussion if it was codified in the CEP that this is a use case
that needs to be addressed.


> 2. Is it worth giving up the possibility of SQL support, to get the
> benefits of deterministic transaction design?
>
> This is a false dilemma. Today, we’re proposing a deterministic
> transaction design that addresses some very common user pain points. SQL
> addresses different user pain point. If someone wants to add an sql
> implementation in the future they can a) build it on top of accord b)
> extend or improve accord or c) implement a separate system. The right
> choice will depend on their goals, but accord won’t prevent work on it, the
> same way the original lwt design isn’t preventing work on multi-partition
> transactions. In the worst case, if the goals of a hypothetical sql project
> are different enough to make them incompatible with accord, I don’t see any
> reason why we couldn’t have 2 separate consensus systems, so long as people
> are willing to maintain them and the use cases and available technologies
> justify it.
>



The part of the discussion that's hard to deal with is "SQL support",
"interactive transactions", or "complex transactions". Even if this is out
of scope for CEP-15, it's a valid question to ask whether Accord would
possibly help, but at least not prevent such future work. (The context
being, Jonathan and myself both think of this as an important long term
goal. You may have figured this out already!)

There are various ways we can get more insight into this question, but
realistically writing a complete CEP (or a dozen CEPs) on "full SQL
support" isn't one of them. On the other hand it seems CEP-15 itself
proposes a conservative approach of developing first version(s) in a
separate repository, from where it could then prove its usefulness! I feel
like the authors have already proposed a conservative approach there that
we can probably work with even without perfect knowledge of the future.



An idea I've been thinking about for a few days is, what would it take to
implement interactive READ COMMITTED transactions on top of Accord? Now,
this may not be an isolation level we want to market as the cool flagship
feature. BUT this exercise does feel meaningful in a few ways:

* First of all, READ COMMITTED *is* a real isolation level in the SQL
standard. So arguably this would be an existence proof of interactive SQL
transactions built on top of Accord.

* It's even the default isolation level in PostgreSQL still today.

* An implementation of such transactions could even be used to benchmark
the performance of such transactions and would give an approximation of how
well Accord is suited for this task. This performance would be "best case"
in the sense that I would expect Snapshot and Serializable to have worse
performance, but that overhead can be considered as inherent in the
isolation level rather than a fault of Accord.

* Implementing READ COMMITTED transactions on top of Accord is rather
straightforward and can be described and discussed in this email thread,
which could hopefully contribute to our understanding of the problem space.
(Could also be a real CEP, if we think it's a useful first step for
interactive transactions, but for now I'm dumping it here just to try to
bring a concrete example into the discussion.)



Goal: READ COMMITTED interactive transactions

Dependency: Assume a Cassandra database with CEP-15 implemented.


Approach: The conversational part of the transaction is a sequence of
regular Cassandra reads and writes. Mutations, however, are executed as
read queries against the database nodes. Database state isn't modified
during the conversational phase; rather, the primary keys of the
to-be-mutated rows are stored for later use. Accord is essentially the
commit phase of the transaction. All primary keys to be updated form the
write set of the Accord transaction. There's no need to re-execute the
reads, so the read set is empty.

We define READ COMMITTED as "whatever is returned by Cassandra when
executing the query (with QUORUM consistency)". In other words, this
functionality doesn't require any changes to the storage engine or other
fundamental changes to Cassandra. The Accord commit is guaranteed to
succeed by design, and the READ COMMITTED transaction doesn't add any
additional checks for conflicts. As such, this functionality remains
abort-free.


Proposed Changes: A transaction manager is added to the coordinator, with
the following functionality:

BEGIN - initialize transaction state in the coordinator. After a BEGIN
statement, the following commands are modified as follows:

INSERT, UPDATE, DELETE: Transform to an equivalent SELECT, returning the
primary key columns. Store the original command (INSERT, etc…) and the
returned primary keys into the write set.
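To illustrate the transform, a toy Python sketch (the Transaction class, statement shapes, and run_select callback are all hypothetical; real CQL parsing is elided):

```python
# Toy sketch of the DML-to-SELECT transform described above. This is
# illustrative code, not a proposal for the actual implementation.

from dataclasses import dataclass, field

@dataclass
class Transaction:
    # Write set: (original DML statement, primary keys it touches) pairs,
    # accumulated during the conversational phase.
    write_set: list = field(default_factory=list)

    def execute_dml(self, table, where_clause, dml_text, run_select):
        # Transform the mutation into a SELECT returning primary key
        # columns, run it as an ordinary QUORUM read, and record the
        # statement plus the keys it resolved to for the commit phase.
        select = f"SELECT pk FROM {table} WHERE {where_clause}"
        keys = run_select(select)
        self.write_set.append((dml_text, keys))
        return keys

# Example with a stubbed cluster that returns two matching keys:
txn = Transaction()
keys = txn.execute_dml(
    table="accounts",
    where_clause="region = 'eu'",
    dml_text="UPDATE accounts SET balance = 0 WHERE region = 'eu'",
    run_select=lambda query: [("acct-1",), ("acct-2",)],
)
```

Note that no mutation reaches the storage engine here; only the key lookup runs during the conversation.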

SELECT - no changes, except for read your own writes. The results of a
SELECT query are returned to the client, but there's no need to store the
results in the transaction state.

Transaction reads its own writes - For each SELECT the coordinator will
overlay the current write set onto the query results. You can think of the
write set as another memtable at Level -1.
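A rough sketch of that overlay, with rows modeled as dicts keyed by primary key (hypothetical code, not Cassandra internals):

```python
# Toy sketch of read-your-own-writes: overlay the pending write set onto
# a SELECT result, like a "Level -1 memtable" consulted before the stored
# data. A pending value of None marks a DELETE. Hypothetical code.

def overlay_write_set(query_rows, write_set):
    merged = {row["pk"]: row for row in query_rows}
    for pk, pending in write_set.items():
        if pending is None:
            merged.pop(pk, None)   # a pending DELETE hides the stored row
        else:
            merged[pk] = pending   # a pending INSERT/UPDATE wins
    return list(merged.values())

stored = [{"pk": 1, "v": "old"}, {"pk": 2, "v": "keep"}]
pending = {1: {"pk": 1, "v": "new"}, 3: {"pk": 3, "v": "inserted"}}
rows = overlay_write_set(stored, pending)
```
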

Secondary indexes are supported without any additional work needed.

COMMIT - Perform a regular Accord transaction, using the above write set as
the Accord write set. The read set is empty. The commit is guaranteed to
succeed. In the end, clear state on the coordinator.
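Putting it together, the commit step might look roughly like this (`accord_commit` is a made-up stand-in, since CEP-15's eventual API isn't specified here; it is stubbed out in the example):

```python
# Toy sketch of the COMMIT step: hand the accumulated write set to Accord
# as a single transaction with an empty read set, then clear
# coordinator-side state. `accord_commit` is a hypothetical stand-in.

def commit(txn_state, accord_commit):
    result = accord_commit(
        read_set=[],                        # nothing to re-read or validate
        write_set=txn_state["write_set"],   # all buffered mutations
    )
    txn_state["write_set"] = []             # transaction done; reset state
    return result

state = {"write_set": [("UPDATE ...", ["k1", "k2"])]}
ok = commit(state, accord_commit=lambda read_set, write_set: True)
```

Because the read set is empty, there is no conflict check to fail, which is what makes the scheme abort-free.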

New interfaces: BEGIN, COMMIT, and ROLLBACK. Maybe some command to declare
the READ COMMITTED isolation level and to get the current isolation level.


Future work: A motivation for the above proposal is that the same scheme
could be extended to support SNAPSHOT ISOLATION transactions. This would
require MVCC support from the storage engine.



---

It would be interesting to hear from list members whether the above appears
to understand Accord (and SQL) correctly or whether I'm missing something?

henrik


-- 

Henrik Ingo

+358 40 569 7354 <358405697354>


Re: Tradeoffs for Cassandra transaction management

Posted by Jonathan Ellis <jb...@gmail.com>.
On Mon, Oct 11, 2021 at 12:31 PM Blake Eggleston
<be...@apple.com.invalid> wrote:

> > Come on Blake, you have all been developing software long enough to know
> > that "there's nothing about Accord that prevents this" is close to
> > meaningless.
> >
> > If it's so easy to address an overwhelmingly popular use case, then let's
> > add it to the initial work.
>
> This is moving the goal posts. The concern I was addressing implied this
> wasn’t possible with Accord and asked if we should prefer “a design that
> allows local serialization with EC between regions”. Accord is a design
> that allows this, and support for it is an implementation detail. Whether
> or not it’s in scope for the initial work is a project planning discussion,
> not a transaction management protocol tradeoff discussion.
>

I didn't think I had, but I went back to check and you're right, I did
imply that this wasn't possible with Accord.  I stand corrected, thank you.

-- 
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced

Re: Tradeoffs for Cassandra transaction management

Posted by Blake Eggleston <be...@apple.com.INVALID>.
> Come on Blake, you have all been developing software long enough to know
> that "there's nothing about Accord that prevents this" is close to
> meaningless.
> 
> If it's so easy to address an overwhelmingly popular use case, then let's
> add it to the initial work.


This is moving the goal posts. The concern I was addressing implied this wasn’t possible with Accord and asked if we should prefer “a design that allows local serialization with EC between regions”. Accord is a design that allows this, and support for it is an implementation detail. Whether or not it’s in scope for the initial work is a project planning discussion, not a transaction management protocol tradeoff discussion.

> I think this is the crux of our disagreement, I very much want to avoid a
> future where we have to maintain two separate consensus systems.


I want to avoid it also, but if we’re going to compare Accord against a hypothetical SQL feature that seems to lack design goals, or any clear ideas about how it might be implemented, I don’t think we can rule it out.


> On Oct 11, 2021, at 6:02 AM, Jonathan Ellis <jb...@gmail.com> wrote:
> 
> On Sat, Oct 9, 2021 at 11:23 PM Blake Eggleston
> <beggleston@apple.com.invalid <ma...@apple.com.invalid>> wrote:
> 
>> 1. Is it worth giving up local latencies to get full global consistency?
>> Most LWT use cases use
>> LOCAL_SERIAL.
>> 
>> This isn’t a tradeoff that needs to be made. There’s nothing about Accord
>> that prevents performing consensus in one DC and replicating the writes to
>> others. That’s not in scope for the initial work, but there’s no reason it
>> couldn’t be handled as a follow on if needed. I agree with Jeff that
>> LOCAL_SERIAL and LWTs are not usually done with a full understanding of the
>> implications, but there are some valid use cases. For instance, you can
>> enable an OLAP service to operate against another DC without impacting the
>> primary, assuming the service can tolerate inconsistency for data written
>> since the last repair, and there are some others.
>> 
> 
> Come on Blake, you have all been developing software long enough to know
> that "there's nothing about Accord that prevents this" is close to
> meaningless.
> 
> If it's so easy to address an overwhelmingly popular use case, then let's
> add it to the initial work.
> 
>> 2. Is it worth giving up the possibility of SQL support, to get the
>> benefits of deterministic transaction design?
>> 
>> This is a false dilemma. Today, we’re proposing a deterministic
>> transaction design that addresses some very common user pain points. SQL
>> addresses a different user pain point. If someone wants to add an sql
>> implementation in the future they can a) build it on top of accord b)
>> extend or improve accord or c) implement a separate system. The right
>> choice will depend on their goals, but accord won’t prevent work on it, the
>> same way the original lwt design isn’t preventing work on multi-partition
>> transactions. In the worst case, if the goals of a hypothetical sql project
>> are different enough to make them incompatible with accord, I don’t see any
>> reason why we couldn’t have 2 separate consensus systems, so long as people
>> are willing to maintain them and the use cases and available technologies
>> justify it.
>> 
> 
> I think this is the crux of our disagreement, I very much want to avoid a
> future where we have to maintain two separate consensus systems.


Re: Tradeoffs for Cassandra transaction management

Posted by Jonathan Ellis <jb...@gmail.com>.
On Sat, Oct 9, 2021 at 11:23 PM Blake Eggleston
<be...@apple.com.invalid> wrote:

> 1. Is it worth giving up local latencies to get full global consistency?
> Most LWT use cases use
> LOCAL_SERIAL.
>
> This isn’t a tradeoff that needs to be made. There’s nothing about Accord
> that prevents performing consensus in one DC and replicating the writes to
> others. That’s not in scope for the initial work, but there’s no reason it
> couldn’t be handled as a follow on if needed. I agree with Jeff that
> LOCAL_SERIAL and LWTs are not usually done with a full understanding of the
> implications, but there are some valid use cases. For instance, you can
> enable an OLAP service to operate against another DC without impacting the
> primary, assuming the service can tolerate inconsistency for data written
> since the last repair, and there are some others.
>

Come on Blake, you have all been developing software long enough to know
that "there's nothing about Accord that prevents this" is close to
meaningless.

If it's so easy to address an overwhelmingly popular use case, then let's
add it to the initial work.

> 2. Is it worth giving up the possibility of SQL support, to get the
> benefits of deterministic transaction design?
>
> This is a false dilemma. Today, we’re proposing a deterministic
> transaction design that addresses some very common user pain points. SQL
> addresses a different user pain point. If someone wants to add an sql
> implementation in the future they can a) build it on top of accord b)
> extend or improve accord or c) implement a separate system. The right
> choice will depend on their goals, but accord won’t prevent work on it, the
> same way the original lwt design isn’t preventing work on multi-partition
> transactions. In the worst case, if the goals of a hypothetical sql project
> are different enough to make them incompatible with accord, I don’t see any
> reason why we couldn’t have 2 separate consensus systems, so long as people
> are willing to maintain them and the use cases and available technologies
> justify it.
>

I think this is the crux of our disagreement: I very much want to avoid a
future where we have to maintain two separate consensus systems.

Re: Tradeoffs for Cassandra transaction management

Posted by "benedict@apache.org" <be...@apache.org>.
Hi Jonathan,

This conversation has been circular for some time. I think it is time to separate out your reasons for blocking progress on the CEP as part of a vote, so that the PMC may express its view on this justification for preventing the CEP’s adoption.


From: Jonathan Ellis <jb...@gmail.com>
Date: Thursday, 14 October 2021 at 14:55
To: dev <de...@cassandra.apache.org>
Subject: Re: Tradeoffs for Cassandra transaction management
Hi Benedict,

I'm not sure how to reconcile your statement that "your request to separate
consensus from execution is [nonsensical]" with your earlier claims that we
could build whatever additional transactional semantics we want on top of
Accord.  The Accord whitepaper specifically separates out the consensus and
the execution algorithms, but if we can't use the former to create
execution timestamps for a different transaction manager then it doesn't
sound as flexible as you're claiming.

To your other points, it looks like the core problem is that you believe
that "multi-shard CAS semantics" is the same as "Calvin semantics" which is
not the case.  Calvin supports arbitrarily complex transactions (including
dependent statements and indexed reads and writes), executed in parallel,
with locking as necessary to enable that parallelism.

I think I've also been clear that I want a path to supporting (1) local
latencies (SLOG is a more elegant solution but "let's just let people give
up global serializability like LWT" is also reasonable) and (2) SQL with
interactive transactions.

I'd prefer to keep the discussion on the mailing list, thanks.


On Wed, Oct 13, 2021 at 3:04 AM benedict@apache.org <be...@apache.org>
wrote:

> Jonathan,
>
> Your request to separate consensus from execution is about as sensical as
> asking for this separation in Paxos, or any other distributed consensus
> protocol. I have made these statements repeatedly, so let me break it down
> step by step.
>
> 1. Accord is an optimal leaderless distributed consensus protocol,
> offering multi-shard CAS semantics in one round-trip (or two under
> contention and clock skew).
> 2. By simple virtue of this property, it already achieves Calvin semantics
> with no other work. It remains a distributed consensus protocol, and the
> whitepaper compares to these as peers.
> 3. To build distributed transactions with more complex semantics, the
> remaining candidates are the CockroachDB or YugaByte approach. These must
> utilise a distributed consensus protocol. They do so using Raft today.
> Accord is as optimal as Raft; therefore, Accord may be used to implement
> this technique *without penalty*. Through its multi-shard consensus it has
> the added advantage of supporting stronger isolation (but not requiring it
> – a read/write intent design may choose weaker isolation).
>
> You continue to refuse to engage with these and other points. Please
> respond directly to ALL of the below, that I have been asking you to answer
> now for several weeks.
>
> 1. Since Accord supports all of your mooted transaction systems without
> penalty, the conversation about which semantics to pursue may be conducted
> in parallel with its development. What about this claim do you not yet
> understand? If you understand, why should a vote on CEP-15 be delayed?
> 2. Which SPECIFIC transaction semantics do you want to achieve? You are
> all over the shop today, demanding Cockroach/YugaByte interactive
> semantics, but also LOCAL_SERIAL operation and proposing SLOG. These are
> conflicting demands.
> 3. Why do you think Accord cannot support your preferred semantics?
> 4. Will you accept a video call so we may discuss this with you in detail,
> so we may understand your difficulty understanding these points I keep
> repeating?
>
> After several weeks of back and forth you should already be able to answer
> these questions. If you cannot invest the time to answer them now, I
> perceive this as obstructive and I will escalate this to a PMC vote to
> break the deadlock.
>
>
>
> From: Jonathan Ellis <jb...@gmail.com>
> Date: Wednesday, 13 October 2021 at 04:21
> To: dev <de...@cassandra.apache.org>
> Subject: Re: Tradeoffs for Cassandra transaction management
> Blake (and Benedict), I’ll ask for your patience here.  We don’t have a
> precedent of pushing through major initiatives in this project in a matter
> of weeks.  We [members of the PMC that weren’t involved in creating Accord]
> need time to do thorough research and make sure both that we understand
> what is being proposed and that we have evaluated reasonable alternatives.
>
> One of the difficulties in evaluating Accord is that it combines a
> state-of-the-art consensus/ordering protocol with a fairly limited
> transaction manager.  So it may be useful to decouple the consensus and
> transaction processing components, which would both allow non-Cassandra
> usage of the consensus piece and make the boundaries with transaction
> processing explicit, making it easier to evolve each independently.
>
> In the meantime, it’s very important to me to understand on which
> dimensions the transaction manager can be improved easily, and which
> dimensions resist such improvement.  I get that Accord is your [plural]
> baby and it’s awkward for me to come along and start pointing at its
> limitations, but that’s part of creating a complete understanding of any
> system.
>
> If I keep coming back to the subject of SQL support and interactive
> transactions, that’s because it’s becoming table stakes in the distributed
> database space. People are using Cockroach or Yugabyte or Cloud Spanner for
> use cases where a couple years ago they would have used Cassandra. We can
> expect this trend to continue and strengthen.
>
> On Mon, Oct 11, 2021 at 11:39 PM Blake Eggleston
> <be...@apple.com.invalid> wrote:
>
> > Let’s get back on topic.
> >
> > Jonathan, in your opening email you stated that, in your view, the 2 main
> > areas of tradeoff were:
> >
> > > 1. Is it worth giving up local latencies to get full global
> consistency?
> >
> > Now we’ve established that we don’t need to give up local latencies with
> > Accord, which leaves:
> >
> > > 2. Is it worth giving up the possibility of SQL support, to get the
> > benefits of deterministic transaction design?
> >
> > I pointed out that this was a false dilemma and that, in the worst case,
> a
> hypothetical SQL feature could have its own consensus system. I hope
> that
> > won’t be necessary, but as I later pointed out (and you did not address,
> > although maybe I should have phrased it as a question), if we’re going to
> > weigh accord against a hypothetical SQL feature that lacks design goals,
> or
> > any clear ideas about how it might be implemented, how can we rule that
> out?
> >
> > So Jonathan, how can we rule that out? How can we have a productive
> > discussion about a feature you yourself are unable to describe in any
> > meaningful detail?
> >
> > > On Oct 11, 2021, at 6:34 PM, Jonathan Ellis <jb...@gmail.com> wrote:
> > >
> > > On Mon, Oct 11, 2021 at 5:11 PM benedict@apache.org <
> benedict@apache.org
> > >
> > > wrote:
> > >
> > >> If we want to fully unpack this particular point, as far as I can tell
> > >> claiming ANSI SQL would indeed require interactive transactions in
> which
> > >> arbitrary conditional work may be performed by a client within a
> > >> transaction in response to other actions within that transaction.
> > >>
> > >> However:
> > >>
> > >>  1.  The ANSI SQL standard permits these transactions to fail and
> > >> rollback (e.g. in the event that your optimistic transaction fails).
> So
> > if
> > >> you want to be pedantic, you may modify my statement to “SQL does not
> > >> necessitate support for abort-free interactive transactions” and we
> can
> > >> leave it there.
> > >>
> > >>  2.  I would personally consider “SQL support” to include the
> capability
> > >> of defining arbitrary SQL stored procedures that may be executed by
> > clients
> > >> in an interactive session
> > >
> > >
> > > I note your personal preference and I further note that this is not the
> > > common understanding of "SQL support" in the industry.  If you tell 100
> > > developers that your database supports SQL, then at least 99 of them
> are
> > > going to assume that you can work with APIs like JDBC that expose
> > > interactive transactions as a central feature, and hence that you will
> be
> > > reasonably compatible with the vast array of SQL-based applications out
> > > there.
> > >
> > > Historical side note: VoltDB tried to convince people that stored
> > > procedures were good enough.  It didn't work, and VoltDB had to add
> > > interactive transactions as fast as they could.
> > >
> > >  3.  Most importantly, as I pointed out in the previous email, Accord
> is
> > >> compatible with a YugaByte/Cockroach-like approach, and indeed makes
> > this
> > >> approach both easier to accomplish and enables stronger isolation than
> > the
> > >> equivalent Raft-based approach. These approaches are able to reduce
> the
> > >> number of conflicts, at a cost of significantly higher transaction
> > >> management burden.
> > >>
> > >
> > > If you're saying that you could use Accord instead of Raft or Paxos,
> and
> > > layer 2PC on top of that as in Spanner, then I agree, but I don't think
> > > that is a very good design, as you would no longer get any of the
> > benefits
> > > of the deterministic approach you started with.  If you mean something
> > > else, then perhaps an example would help clarify.
> > >
> > > --
> > > Jonathan Ellis
> > > co-founder, http://www.datastax.com
> > > @spyced
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@cassandra.apache.org
> > For additional commands, e-mail: dev-help@cassandra.apache.org
> >
> >
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced
>


--
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced

Re: Tradeoffs for Cassandra transaction management

Posted by Jonathan Ellis <jb...@gmail.com>.
On Thu, Oct 14, 2021 at 4:01 PM benedict@apache.org <be...@apache.org>
wrote:

> The only TPC-C New Order transaction I recall you linking was interactive,
> which as far as I am aware is not supported by Calvin.
>

The SQLite version I linked was interactive, but it can be implemented
non-interactively, which is what the Calvin team did to benchmark it.
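As a rough illustration of what a one-shot (non-interactive) New Order looks like — a simplified sketch with illustrative table and field names, not the benchmark implementation — all keys are declared up front and the conditional logic runs inside a single server-side function, which is the style Calvin-like deterministic systems require:

```python
def new_order(db, w_id, d_id, item_qtys):
    # Allocate the next order id from the district record deterministically.
    district = db[("district", w_id, d_id)]
    o_id = district["next_o_id"]
    district["next_o_id"] = o_id + 1

    total = 0
    for i_id, qty in sorted(item_qtys.items()):
        stock = db[("stock", w_id, i_id)]
        # Roughly TPC-C's restock rule: if stock would drop below the
        # threshold, replenish by 91 before decrementing.
        if stock["quantity"] < qty + 10:
            stock["quantity"] += 91
        stock["quantity"] -= qty
        total += db[("item", i_id)]["price"] * qty

    db[("order", w_id, d_id, o_id)] = {"lines": dict(item_qtys), "total": total}
    return o_id

# Example: one order of 3 units of item 5 from warehouse 1, district 1.
db = {("district", 1, 1): {"next_o_id": 100},
      ("stock", 1, 5): {"quantity": 50},
      ("item", 5): {"price": 2}}
new_order(db, 1, 1, {5: 3})  # allocates order 100, decrements stock to 47
```

The point is that nothing here requires a client round-trip mid-transaction: every read, branch, and write happens in one deterministic function over pre-declared keys.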

-- 
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced

Re: Tradeoffs for Cassandra transaction management

Posted by "benedict@apache.org" <be...@apache.org>.
The only TPC-C New Order transaction I recall you linking was interactive, which as far as I am aware is not supported by Calvin.

Are we settling on Calvin for your preferred system semantics, then? It does not support your preferred interactive transactions. To continue this discussion I must insist that you specify your goal criteria, which I have requested six times already.

Please then specify a Calvin-compatible transaction and an explanation of why you believe Accord does not support it. To continue this discussion I must insist on concrete problems that you have invested the time to state clearly, with reasoned explanations using your present understanding of Accord to explain why you believe it does not work. This should ideally reference the actual protocol specified in the whitepaper. You should be able to demonstrate that you have invested the time to understand the proposal, and the problem case you perceive, in some reasonable level of detail.

Since you have asked no clarifying questions about the whitepaper in the past six weeks, I can only assume you believe yourself to understand it already, but in case any confusion has arisen your detailed explanation of the problem case will help me better understand what needs to be stated in response to your query.



From: Jonathan Ellis <jb...@gmail.com>
Date: Thursday, 14 October 2021 at 21:47
To: dev <de...@cassandra.apache.org>
Subject: Re: Tradeoffs for Cassandra transaction management
I already linked a description of the TPC-C New Order transaction, and an
implementation.  This is the most-benchmarked OLTP transaction in the
world.  I look forward to your explanation of how Accord can handle this.

Since your claim is that "[Accord] is equivalent to Calvin," please limit
the discussion to Accord as it is today instead of engaging in
hypotheticals around how "we could enhance Accord with X."

On Thu, Oct 14, 2021 at 12:46 PM benedict@apache.org <be...@apache.org>
wrote:

> > Calvin supports arbitrarily complex transactions (including dependent
> statements and indexed reads and writes), executed in parallel, with
> locking as necessary to enable that parallelism.
>
> By CAS I mean to include any arbitrary state mapping function for the
> involved keys. This is equivalent to Calvin. The locks for execution are
> isomorphic with any multi-shard distributed consensus protocol that applies
> its operations in the agreed partial order on each replica. If you want to
> continue this thread of discussion, please provide a counter example you
> believe disproves this statement.

Re: Tradeoffs for Cassandra transaction management

Posted by Jonathan Ellis <jb...@gmail.com>.
I already linked a description of the TPC-C New Order transaction, and an
implementation.  This is the most-benchmarked OLTP transaction in the
world.  I look forward to your explanation of how Accord can handle this.

Since your claim is that "[Accord] is equivalent to Calvin," please limit
the discussion to Accord as it is today instead of engaging in
hypotheticals around how "we could enhance Accord with X."

On Thu, Oct 14, 2021 at 12:46 PM benedict@apache.org <be...@apache.org>
wrote:

> > Calvin supports arbitrarily complex transactions (including dependent
> statements and indexed reads and writes), executed in parallel, with
> locking as necessary to enable that parallelism.
>
> By CAS I mean to include any arbitrary state mapping function for the
> involved keys. This is equivalent to Calvin. The locks for execution are
> isomorphic with any multi-shard distributed consensus protocol that applies
> its operations in the agreed partial order on each replica. If you want to
> continue this thread of discussion, please provide a counter example you
> believe disproves this statement.


-- 
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced

Re: Tradeoffs for Cassandra transaction management

Posted by "benedict@apache.org" <be...@apache.org>.
> Calvin supports arbitrarily complex transactions (including dependent statements and indexed reads and writes), executed in parallel, with locking as necessary to enable that parallelism.

By CAS I mean to include any arbitrary state mapping function for the involved keys. This is equivalent to Calvin. The locks for execution are isomorphic with any multi-shard distributed consensus protocol that applies its operations in the agreed partial order on each replica. If you want to continue this thread of discussion, please provide a counter example you believe disproves this statement.
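The equivalence being claimed here can be sketched in a toy model (illustration only, not Accord's actual API; `Shard`, `execute`, and `transfer` are invented names): a "CAS" generalized to an arbitrary state-mapping function over a fixed set of involved keys. A coordinator gathers reads from the owning shards, computes the mapping once, and applies the writes; replicas applying the same transactions in the same agreed order converge deterministically.

```python
# Toy sketch of a multi-shard CAS generalized to an arbitrary
# state-mapping function (hypothetical model, for illustration only).

class Shard:
    def __init__(self, state):
        self.state = dict(state)

def execute(shards, txn, keys):
    # 1. Gather the involved keys from whichever shards own them.
    view = {}
    for s in shards:
        view.update({k: s.state[k] for k in keys if k in s.state})
    # 2. Apply the arbitrary state-mapping function (may be conditional).
    writes = txn(view)
    # 3. Write the results back to the owning shards.
    for s in shards:
        for k, v in writes.items():
            if k in s.state:
                s.state[k] = v

def transfer(amount):
    # Conditional logic is fine, as long as the involved keys are known
    # up front -- the same constraint Calvin has (hence OLLP otherwise).
    def txn(view):
        if view['a'] >= amount:
            return {'a': view['a'] - amount, 'b': view['b'] + amount}
        return {}
    return txn

shard1, shard2 = Shard({'a': 15}), Shard({'b': 0})
execute([shard1, shard2], transfer(10), ['a', 'b'])  # succeeds: a=5, b=10
execute([shard1, shard2], transfer(10), ['a', 'b'])  # no-op: balance too low
```

In a real system step 1 and step 3 would be mediated by the consensus protocol's agreed partial order rather than a single coordinator loop; the point of the sketch is only that the mapping function itself can be arbitrary.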


From: Jonathan Ellis <jb...@gmail.com>
Date: Thursday, 14 October 2021 at 14:55
To: dev <de...@cassandra.apache.org>
Subject: Re: Tradeoffs for Cassandra transaction management
Hi Benedict,

I'm not sure how to reconcile your statement that "your request to separate
consensus from execution is [nonsensical]" with your earlier claims that we
could build whatever additional transactional semantics we want on top of
Accord.  The Accord whitepaper specifically separates out the consensus and
the execution algorithms, but if we can't use the former to create
execution timestamps for a different transaction manager then it doesn't
sound as flexible as you're claiming.

To your other points, it looks like the core problem is that you believe
that "multi-shard CAS semantics" is the same as "Calvin semantics" which is
not the case.  Calvin supports arbitrarily complex transactions (including
dependent statements and indexed reads and writes), executed in parallel,
with locking as necessary to enable that parallelism.

I think I've also been clear that I want a path to supporting (1) local
latencies (SLOG is a more elegant solution but "let's just let people give
up global serializability like LWT" is also reasonable) and (2) SQL with
interactive transactions.

I'd prefer to keep the discussion on the mailing list, thanks.


On Wed, Oct 13, 2021 at 3:04 AM benedict@apache.org <be...@apache.org>
wrote:

> Jonathan,
>
> Your request to separate consensus from execution is about as sensical as
> asking for this separation in Paxos, or any other distributed consensus
> protocol. I have made these statements repeatedly, so let me break it down
> step by step.
>
> 1. Accord is an optimal leaderless distributed consensus protocol,
> offering multi-shard CAS semantics in one round-trip (or two under
> contention and clock skew).
> 2. By simple virtue of this property, it already achieves Calvin semantics
> with no other work. It remains a distributed consensus protocol, and the
> whitepaper compares to these as peers.
> 3. To build distributed transactions with more complex semantics, the
> remaining candidates are the CockroachDB or YugaByte approach. These must
> utilise a distributed consensus protocol. They do so using Raft today.
> Accord is as optimal as Raft, therefore, Accord may be used to implement
> this technique *without penalty*. Through its multi-shard consensus it has
> the added advantage of supporting stronger isolation (but not requiring it
> – a read/write intent design may choose weaker isolation).
>
> You continue to refuse to engage with these and other points. Please
> respond directly to ALL of the below, that I have been asking you to answer
> now for several weeks.
>
> 1. Since Accord supports all of your mooted transaction systems without
> penalty the conversation about which semantics to pursue may be conducted
> in parallel with its development. What about this claim do you not yet
> understand? If you understand, why should a vote on CEP-15 be delayed?
> 2. Which SPECIFIC transaction semantics do you want to achieve? You are
> all over the shop today, demanding Cockroach/YugaByte interactive
> semantics, but also LOCAL_SERIAL operation and proposing SLOG. These are
> conflicting demands.
> 3. Why do you think Accord cannot support your preferred semantics?
> 4. Will you accept a video call so we may discuss this with you in detail,
> so we may understand your difficulty understanding these points I keep
> repeating?
>
> After several weeks of back and forth you should already be able to answer
> these questions. If you cannot invest the time to answer them now, I
> perceive this as obstructive and I will escalate this to a PMC vote to
> break the deadlock.
>
>
>
> From: Jonathan Ellis <jb...@gmail.com>
> Date: Wednesday, 13 October 2021 at 04:21
> To: dev <de...@cassandra.apache.org>
> Subject: Re: Tradeoffs for Cassandra transaction management
> Blake (and Benedict), I’ll ask for your patience here.  We don’t have a
> precedent of pushing through major initiatives in this project in a matter
> of weeks.  We [members of the PMC that weren’t involved in creating Accord]
> need time to do thorough research and make sure both that we understand
> what is being proposed and that we have evaluated reasonable alternatives.
>
> One of the difficulties in evaluating Accord is that it combines a
> state-of-the-art consensus/ordering protocol with a fairly limited
> transaction manager.  So it may be useful to decouple the consensus and
> transaction processing components, which would both allow non-Cassandra
> usage of the consensus piece, and also make explicit the boundaries with
> transaction processing with the consequence of making it easier to evolve
> independently.
>
> In the meantime, it’s very important to me to understand on which
> dimensions the transaction manager can be improved easily, and which
> dimensions resist such improvement.  I get that Accord is your [plural]
> baby and it’s awkward for me to come along and start pointing at its
> limitations, but that’s part of creating a complete understanding of any
> system.
>
> If I keep coming back to the subject of SQL support and interactive
> transactions, that’s because it’s becoming table stakes in the distributed
> database space. People are using Cockroach or Yugabyte or Cloud Spanner for
> use cases where a couple years ago they would have used Cassandra. We can
> expect this trend to continue and strengthen.
>
> On Mon, Oct 11, 2021 at 11:39 PM Blake Eggleston
> <be...@apple.com.invalid> wrote:
>
> > Let’s get back on topic.
> >
> > Jonathan, in your opening email you stated that, in your view, the 2 main
> > areas of tradeoff were:
> >
> > > 1. Is it worth giving up local latencies to get full global
> consistency?
> >
> > Now we’ve established that we don’t need to give up local latencies with
> > Accord, which leaves:
> >
> > > 2. Is it worth giving up the possibility of SQL support, to get the
> > benefits of deterministic transaction design?
> >
> > I pointed out that this was a false dilemma and that, in the worst case,
> a
> hypothetical SQL feature could have its own consensus system. I hope
> that
> > won’t be necessary, but as I later pointed out (and you did not address,
> > although maybe I should have phrased it as a question), if we’re going to
> > weigh accord against a hypothetical SQL feature that lacks design goals,
> or
> > any clear ideas about how it might be implemented, how can we rule that
> out?
> >
> > So Jonathan, how can we rule that out? How can we have a productive
> > discussion about a feature you yourself are unable to describe in any
> > meaningful detail?
> >
> > > On Oct 11, 2021, at 6:34 PM, Jonathan Ellis <jb...@gmail.com> wrote:
> > >
> > > On Mon, Oct 11, 2021 at 5:11 PM benedict@apache.org <
> benedict@apache.org
> > >
> > > wrote:
> > >
> > >> If we want to fully unpack this particular point, as far as I can tell
> > >> claiming ANSI SQL would indeed require interactive transactions in
> which
> > >> arbitrary conditional work may be performed by a client within a
> > >> transaction in response to other actions within that transaction.
> > >>
> > >> However:
> > >>
> > >>  1.  The ANSI SQL standard permits these transactions to fail and
> > >> rollback (e.g. in the event that your optimistic transaction fails).
> So
> > if
> > >> you want to be pedantic, you may modify my statement to “SQL does not
> > >> necessitate support for abort-free interactive transactions” and we
> can
> > >> leave it there.
> > >>
> > >>  2.  I would personally consider “SQL support” to include the
> capability
> > >> of defining arbitrary SQL stored procedures that may be executed by
> > clients
> > >> in an interactive session
> > >
> > >
> > > I note your personal preference and I further note that this is not the
> > > common understanding of "SQL support" in the industry.  If you tell 100
> > > developers that your database supports SQL, then at least 99 of them
> are
> > > going to assume that you can work with APIs like JDBC that expose
> > > interactive transactions as a central feature, and hence that you will
> be
> > > reasonably compatible with the vast array of SQL-based applications out
> > > there.
> > >
> > > Historical side note: VoltDB tried to convince people that stored
> > > procedures were good enough.  It didn't work, and VoltDB had to add
> > > interactive transactions as fast as they could.
> > >
> > >  3.  Most importantly, as I pointed out in the previous email, Accord
> is
> > >> compatible with a YugaByte/Cockroach-like approach, and indeed makes
> > this
> > >> approach both easier to accomplish and enables stronger isolation than
> > the
> > >> equivalent Raft-based approach. These approaches are able to reduce
> the
> > >> number of conflicts, at a cost of significantly higher transaction
> > >> management burden.
> > >>
> > >
> > > If you're saying that you could use Accord instead of Raft or Paxos,
> and
> > > layer 2PC on top of that as in Spanner, then I agree, but I don't think
> > > that is a very good design, as you would no longer get any of the
> > benefits
> > > of the deterministic approach you started with.  If you mean something
> > > else, then perhaps an example would help clarify.
> > >
> > > --
> > > Jonathan Ellis
> > > co-founder, http://www.datastax.com
> > > @spyced
> >
> >
> >
> >
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced
>


--
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced

Re: Tradeoffs for Cassandra transaction management

Posted by Henrik Ingo <he...@datastax.com>.
On Fri, Oct 15, 2021 at 5:54 PM Dinesh Joshi <dj...@icloud.com.invalid>
wrote:

> Thank you for clarifying the terminology. I haven’t honestly heard anybody
> call these interactive transactions. Therefore it is crucial that
> we lay out things systematically so everyone is on the same page. You’re
> talking about bundling several statements into a single SQL transaction
> block.
>
>
Well, it's more complicated than that. Systems like Calvin and VoltDB have
introduced concepts where you can bundle several statements into a single
transaction block, but that block is executed server side, and no
additional round trips to the client are possible. The term "interactive
transactions" is meant to distinguish ordinary client-driven transactions
from those. But you're right, I may have invented the term. Historically
such transactions were the norm, so no additional qualifier was needed.
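The server-side transaction blocks of the Calvin/VoltDB style can be sketched as a toy (hypothetical API, not VoltDB's or Calvin's actual interface): the client ships the entire transaction logic up front, the server runs it atomically, and no client round trip can occur mid-transaction.

```python
# Toy sketch of a deterministic, server-side transaction block
# (invented API, for illustration only).

class Server:
    def __init__(self, table):
        self.table = dict(table)

    def submit(self, txn_block):
        # The whole block executes server-side against current state;
        # the client only sees the final result.
        self.table = txn_block(dict(self.table))
        return dict(self.table)

# All conditional logic must be fixed before submission -- the client
# cannot inspect intermediate results and react to them.
def txn_block(t):
    if t.get('x') == 5:   # the read and the branch both happen server-side
        t['y'] = 6
    return t

server = Server({'x': 5, 'y': 0})
result = server.submit(txn_block)  # a single round trip
```

This is what makes such designs deterministic and cheap to order globally, and also what makes them a poor fit for clients that need to interleave application logic between statements.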

henrik


-- 

Henrik Ingo

+358 40 569 7354

Re: Tradeoffs for Cassandra transaction management

Posted by Dinesh Joshi <dj...@icloud.com.INVALID>.
Thank you for clarifying the terminology. I haven’t honestly heard anybody call these interactive transactions. Therefore it is crucial that we lay out things systematically so everyone is on the same page. You’re talking about bundling several statements into a single SQL transaction block.

Dinesh

> On Oct 15, 2021, at 2:01 AM, Henrik Ingo <he...@datastax.com> wrote:
> On Fri, Oct 15, 2021 at 3:37 AM Dinesh Joshi <dj...@icloud.com.invalid>
> wrote:
> 
>>> On 10/14/21 6:54 AM, Jonathan Ellis wrote:
>>> I think I've also been clear that I want a path to supporting (1) local
>>> latencies (SLOG is a more elegant solution but "let's just let people
>> give
>>> up global serializability like LWT" is also reasonable) and (2) SQL with
>>> interactive transactions.
>> 
>> 
>> 99% of the transactions in a system will not be performed as interactive
>> SQL transactions by a human. We should be optimizing for the 99%.
> "Interactive" here does not mean that it's a human typing the queries. It
> rather means that there are more than one round trips between the client
> and server.
> 
> Any application doing:
> 
>    BEGIN
>    x = SELECT x FROM ...
>    if x == 5:
>        UPDATE t SET y=6
>    COMMIT
> 
> ...would be an interactive transaction. And this is traditionally the
> common case, even if recent NewSQL and NoSQL databases have introduced some
> intriguing outside of the box thinking in this area.
> 
> henrik
> 
> -- 
> 
> Henrik Ingo
> 
> +358 40 569 7354

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@cassandra.apache.org
For additional commands, e-mail: dev-help@cassandra.apache.org


Re: Tradeoffs for Cassandra transaction management

Posted by Henrik Ingo <he...@datastax.com>.
On Fri, Oct 15, 2021 at 3:37 AM Dinesh Joshi <dj...@icloud.com.invalid>
wrote:

> On 10/14/21 6:54 AM, Jonathan Ellis wrote:
>
> > I think I've also been clear that I want a path to supporting (1) local
> > latencies (SLOG is a more elegant solution but "let's just let people
> give
> > up global serializability like LWT" is also reasonable) and (2) SQL with
> > interactive transactions.
>
>
> 99% of the transactions in a system will not be performed as interactive
> SQL transactions by a human. We should be optimizing for the 99%.
>
>
"Interactive" here does not mean that it's a human typing the queries. It
rather means that there are more than one round trips between the client
and server.

Any application doing:

    BEGIN
    x = SELECT x FROM ...
    if x == 5:
        UPDATE t SET y=6
    COMMIT

...would be an interactive transaction. And this is traditionally the
common case, even if recent NewSQL and NoSQL databases have introduced some
intriguing outside-the-box thinking in this area.
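The BEGIN / SELECT / if / UPDATE / COMMIT pattern above can be made concrete with a toy client session (hypothetical API, illustration only): every statement is a separate client-to-server round trip, and ordinary client-side control flow runs in between.

```python
# Toy sketch of an interactive transaction (invented client API):
# each statement is its own round trip, and the client decides what
# to send next based on earlier results.

class Session:
    def __init__(self, table):
        self.committed = dict(table)
        self.pending = None
        self.round_trips = 0     # counted to make the round trips visible

    def begin(self):
        self.round_trips += 1
        self.pending = dict(self.committed)

    def select(self, key):
        self.round_trips += 1
        return self.pending[key]

    def update(self, key, value):
        self.round_trips += 1
        self.pending[key] = value

    def commit(self):
        self.round_trips += 1
        self.committed, self.pending = self.pending, None

s = Session({'x': 5, 'y': 0})
s.begin()
x = s.select('x')       # the result returns to the client...
if x == 5:              # ...which runs its own logic...
    s.update('y', 6)    # ...and sends another statement
s.commit()
```

Each of those four calls is a round trip in the real JDBC-style pattern, which is why a transaction manager must hold the transaction's state (locks or intents) open across client think time.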

henrik

-- 

Henrik Ingo

+358 40 569 7354

Re: Tradeoffs for Cassandra transaction management

Posted by Dinesh Joshi <dj...@icloud.com.INVALID>.
On 10/14/21 6:54 AM, Jonathan Ellis wrote:

> I think I've also been clear that I want a path to supporting (1) local
> latencies (SLOG is a more elegant solution but "let's just let people give
> up global serializability like LWT" is also reasonable) and (2) SQL with
> interactive transactions.


99% of the transactions in a system will not be performed as interactive
SQL transactions by a human. We should be optimizing for the 99%.

Dinesh

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@cassandra.apache.org
For additional commands, e-mail: dev-help@cassandra.apache.org


Re: Tradeoffs for Cassandra transaction management

Posted by Jeff Jirsa <jj...@gmail.com>.
Do I read this email as "Jonathan will vote against any improvement to
transactions that doesn't guarantee local latencies and interactive SQL,
even though no such proposal exists, thereby blocking any improvement over
the current status quo?"



On Thu, Oct 14, 2021 at 6:55 AM Jonathan Ellis <jb...@gmail.com> wrote:

> Hi Benedict,
>
> I'm not sure how to reconcile your statement that "your request to separate
> consensus from execution is [nonsensical]" with your earlier claims that we
> could build whatever additional transactional semantics we want on top of
> Accord.  The Accord whitepaper specifically separates out the consensus and
> the execution algorithms, but if we can't use the former to create
> execution timestamps for a different transaction manager then it doesn't
> sound as flexible as you're claiming.
>
> To your other points, it looks like the core problem is that you believe
> that "multi-shard CAS semantics" is the same as "Calvin semantics" which is
> not the case.  Calvin supports arbitrarily complex transactions (including
> dependent statements and indexed reads and writes), executed in parallel,
> with locking as necessary to enable that parallelism.
>
> I think I've also been clear that I want a path to supporting (1) local
> latencies (SLOG is a more elegant solution but "let's just let people give
> up global serializability like LWT" is also reasonable) and (2) SQL with
> interactive transactions.
>
> I'd prefer to keep the discussion on the mailing list, thanks.
>
>
> On Wed, Oct 13, 2021 at 3:04 AM benedict@apache.org <be...@apache.org>
> wrote:
>
> > Jonathan,
> >
> > Your request to separate consensus from execution is about as sensical as
> > asking for this separation in Paxos, or any other distributed consensus
> > protocol. I have made these statements repeatedly, so let me break it
> down
> > step by step.
> >
> > 1. Accord is an optimal leaderless distributed consensus protocol,
> > offering multi-shard CAS semantics in one round-trip (or two under
> > contention and clock skew).
> > 2. By simple virtue of this property, it already achieves Calvin
> semantics
> > with no other work. It remains a distributed consensus protocol, and the
> > whitepaper compares to these as peers.
> > 3. To build distributed transactions with more complex semantics, the
> > remaining candidates are the CockroachDB or YugaByte approach. These must
> > utilise a distributed consensus protocol. They do so using Raft today.
> > Accord is as optimal as Raft, therefore, Accord may be used to implement
> > this technique *without penalty*. Through its multi-shard consensus it
> has
> > the added advantage of supporting stronger isolation (but not requiring
> it
> > – a read/write intent design may choose weaker isolation).
> >
> > You continue to refuse to engage with these and other points. Please
> > respond directly to ALL of the below, that I have been asking you to
> answer
> > now for several weeks.
> >
> > 1. Since Accord supports all of your mooted transaction systems without
> > penalty the conversation about which semantics to pursue may be conducted
> > in parallel with its development. What about this claim do you not yet
> > understand? If you understand, why should a vote on CEP-15 be delayed?
> > 2. Which SPECIFIC transaction semantics do you want to achieve? You are
> > all over the shop today, demanding Cockroach/YugaByte interactive
> > semantics, but also LOCAL_SERIAL operation and proposing SLOG. These are
> > conflicting demands.
> > 3. Why do you think Accord cannot support your preferred semantics?
> > 4. Will you accept a video call so we may discuss this with you in
> detail,
> > so we may understand your difficulty understanding these points I keep
> > repeating?
> >
> > After several weeks of back and forth you should already be able to
> answer
> > these questions. If you cannot invest the time to answer them now, I
> > perceive this as obstructive and I will escalate this to a PMC vote to
> > break the deadlock.
> >
> >
> >
> > From: Jonathan Ellis <jb...@gmail.com>
> > Date: Wednesday, 13 October 2021 at 04:21
> > To: dev <de...@cassandra.apache.org>
> > Subject: Re: Tradeoffs for Cassandra transaction management
> > Blake (and Benedict), I’ll ask for your patience here.  We don’t have a
> > precedent of pushing through major initiatives in this project in a
> matter
> > of weeks.  We [members of the PMC that weren’t involved in creating
> Accord]
> > need time to do thorough research and make sure both that we understand
> > what is being proposed and that we have evaluated reasonable
> alternatives.
> >
> > One of the difficulties in evaluating Accord is that it combines a
> > state-of-the-art consensus/ordering protocol with a fairly limited
> > transaction manager.  So it may be useful to decouple the consensus and
> > transaction processing components, which would both allow non-Cassandra
> > usage of the consensus piece, and also make explicit the boundaries with
> > transaction processing with the consequence of making it easier to evolve
> > independently.
> >
> > In the meantime, it’s very important to me to understand on which
> > dimensions the transaction manager can be improved easily, and which
> > dimensions resist such improvement.  I get that Accord is your [plural]
> > baby and it’s awkward for me to come along and start pointing at its
> > limitations, but that’s part of creating a complete understanding of any
> > system.
> >
> > If I keep coming back to the subject of SQL support and interactive
> > transactions, that’s because it’s becoming table stakes in the
> distributed
> > database space. People are using Cockroach or Yugabyte or Cloud Spanner
> for
> > use cases where a couple years ago they would have used Cassandra. We can
> > expect this trend to continue and strengthen.
> >
> > On Mon, Oct 11, 2021 at 11:39 PM Blake Eggleston
> > <be...@apple.com.invalid> wrote:
> >
> > > Let’s get back on topic.
> > >
> > > Jonathan, in your opening email you stated that, in your view, the 2
> main
> > > areas of tradeoff were:
> > >
> > > > 1. Is it worth giving up local latencies to get full global
> > consistency?
> > >
> > > Now we’ve established that we don’t need to give up local latencies
> with
> > > Accord, which leaves:
> > >
> > > > 2. Is it worth giving up the possibility of SQL support, to get the
> > > benefits of deterministic transaction design?
> > >
> > > I pointed out that this was a false dilemma and that, in the worst
> case,
> > a
> > > hypothetical SQL feature could have its own consensus system. I hope
> > that
> > > won’t be necessary, but as I later pointed out (and you did not
> address,
> > > although maybe I should have phrased it as a question), if we’re going
> to
> > > weigh accord against a hypothetical SQL feature that lacks design
> goals,
> > or
> > > any clear ideas about how it might be implemented, how can we rule that
> > out?
> > >
> > > So Jonathan, how can we rule that out? How can we have a productive
> > > discussion about a feature you yourself are unable to describe in any
> > > meaningful detail?
> > >
> > > > On Oct 11, 2021, at 6:34 PM, Jonathan Ellis <jb...@gmail.com>
> wrote:
> > > >
> > > > On Mon, Oct 11, 2021 at 5:11 PM benedict@apache.org <
> > benedict@apache.org
> > > >
> > > > wrote:
> > > >
> > > >> If we want to fully unpack this particular point, as far as I can
> tell
> > > >> claiming ANSI SQL would indeed require interactive transactions in
> > which
> > > >> arbitrary conditional work may be performed by a client within a
> > > >> transaction in response to other actions within that transaction.
> > > >>
> > > >> However:
> > > >>
> > > >>  1.  The ANSI SQL standard permits these transactions to fail and
> > > >> rollback (e.g. in the event that your optimistic transaction fails).
> > So
> > > if
> > > >> you want to be pedantic, you may modify my statement to “SQL does
> not
> > > >> necessitate support for abort-free interactive transactions” and we
> > can
> > > >> leave it there.
> > > >>
> > > >>  2.  I would personally consider “SQL support” to include the
> > capability
> > > >> of defining arbitrary SQL stored procedures that may be executed by
> > > clients
> > > >> in an interactive session
> > > >
> > > >
> > > > I note your personal preference and I further note that this is not
> the
> > > > common understanding of "SQL support" in the industry.  If you tell
> 100
> > > > developers that your database supports SQL, then at least 99 of them
> > are
> > > > going to assume that you can work with APIs like JDBC that expose
> > > > interactive transactions as a central feature, and hence that you
> will
> > be
> > > > reasonably compatible with the vast array of SQL-based applications
> out
> > > > there.
> > > >
> > > > Historical side note: VoltDB tried to convince people that stored
> > > > procedures were good enough.  It didn't work, and VoltDB had to add
> > > > interactive transactions as fast as they could.
> > > >
> > > >  3.  Most importantly, as I pointed out in the previous email, Accord
> > is
> > > >> compatible with a YugaByte/Cockroach-like approach, and indeed makes
> > > this
> > > >> approach both easier to accomplish and enables stronger isolation
> than
> > > the
> > > >> equivalent Raft-based approach. These approaches are able to reduce
> > the
> > > >> number of conflicts, at a cost of significantly higher transaction
> > > >> management burden.
> > > >>
> > > >
> > > > If you're saying that you could use Accord instead of Raft or Paxos,
> > and
> > > > layer 2PC on top of that as in Spanner, then I agree, but I don't
> think
> > > > that is a very good design, as you would no longer get any of the
> > > benefits
> > > > of the deterministic approach you started with.  If you mean
> something
> > > > else, then perhaps an example would help clarify.
> > > >
> > > > --
> > > > Jonathan Ellis
> > > > co-founder, http://www.datastax.com
> > > > @spyced
> > >
> > >
> > >
> > >
> >
> > --
> > Jonathan Ellis
> > co-founder, http://www.datastax.com
> > @spyced
> >
>
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced
>

Re: Tradeoffs for Cassandra transaction management

Posted by Jonathan Ellis <jb...@gmail.com>.
Hi Benedict,

I'm not sure how to reconcile your statement that "your request to separate
consensus from execution is [nonsensical]" with your earlier claims that we
could build whatever additional transactional semantics we want on top of
Accord.  The Accord whitepaper specifically separates out the consensus and
the execution algorithms, but if we can't use the former to create
execution timestamps for a different transaction manager then it doesn't
sound as flexible as you're claiming.

To your other points, it looks like the core problem is that you believe
that "multi-shard CAS semantics" is the same as "Calvin semantics" which is
not the case.  Calvin supports arbitrarily complex transactions (including
dependent statements and indexed reads and writes), executed in parallel,
with locking as necessary to enable that parallelism.

I think I've also been clear that I want a path to supporting (1) local
latencies (SLOG is a more elegant solution but "let's just let people give
up global serializability like LWT" is also reasonable) and (2) SQL with
interactive transactions.

I'd prefer to keep the discussion on the mailing list, thanks.


On Wed, Oct 13, 2021 at 3:04 AM benedict@apache.org <be...@apache.org>
wrote:

> Jonathan,
>
> Your request to separate consensus from execution is about as sensical as
> asking for this separation in Paxos, or any other distributed consensus
> protocol. I have made these statements repeatedly, so let me break it down
> step by step.
>
> 1. Accord is an optimal leaderless distributed consensus protocol,
> offering multi-shard CAS semantics in one round-trip (or two under
> contention and clock skew).
> 2. By simple virtue of this property, it already achieves Calvin semantics
> with no other work. It remains a distributed consensus protocol, and the
> whitepaper compares to these as peers.
> 3. To build distributed transactions with more complex semantics, the
> remaining candidates are the CockroachDB or YugaByte approach. These must
> utilise a distributed consensus protocol. They do so using Raft today.
> Accord is as optimal as Raft, therefore, Accord may be used to implement
> this technique *without penalty*. Through its multi-shard consensus it has
> the added advantage of supporting stronger isolation (but not requiring it
> – a read/write intent design may choose weaker isolation).
>
> You continue to refuse to engage with these and other points. Please
> respond directly to ALL of the below, that I have been asking you to answer
> now for several weeks.
>
> 1. Since Accord supports all of your mooted transaction systems without
> penalty the conversation about which semantics to pursue may be conducted
> in parallel with its development. What about this claim do you not yet
> understand? If you understand, why should a vote on CEP-15 be delayed?
> 2. Which SPECIFIC transaction semantics do you want to achieve? You are
> all over the shop today, demanding Cockroach/YugaByte interactive
> semantics, but also LOCAL_SERIAL operation and proposing SLOG. These are
> conflicting demands.
> 3. Why do you think Accord cannot support your preferred semantics?
> 4. Will you accept a video call so we may discuss this with you in detail,
> so we may understand your difficulty understanding these points I keep
> repeating?
>
> After several weeks of back and forth you should already be able to answer
> these questions. If you cannot invest the time to answer them now, I
> perceive this as obstructive and I will escalate this to a PMC vote to
> break the deadlock.
>


-- 
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced

Re: Tradeoffs for Cassandra transaction management

Posted by "benedict@apache.org" <be...@apache.org>.
Jonathan,

Your request to separate consensus from execution is about as sensical as asking for this separation in Paxos, or any other distributed consensus protocol. I have made these statements repeatedly, so let me break it down step by step.

1. Accord is an optimal leaderless distributed consensus protocol, offering multi-shard CAS semantics in one round-trip (or two under contention and clock skew).
2. By simple virtue of this property, it already achieves Calvin semantics with no other work. It remains a distributed consensus protocol, and the whitepaper compares to these as peers.
3. To build distributed transactions with more complex semantics, the remaining candidates are the CockroachDB or YugaByte approach. These must utilise a distributed consensus protocol. They do so using Raft today. Accord is as optimal as Raft, therefore, Accord may be used to implement this technique *without penalty*. Through its multi-shard consensus it has the added advantage of supporting stronger isolation (but not requiring it – a read/write intent design may choose weaker isolation).
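To make point 2 concrete, here is a toy Python sketch (hypothetical, nothing to do with Accord's actual interfaces) of the deterministic, Calvin-style property: once any consensus layer has agreed a total order of multi-shard CAS transactions, each shard can apply the log independently and every replica converges without further coordination.

```python
# Toy illustration (hypothetical; not Accord's real API): given an agreed
# total order of multi-shard CAS transactions, execution is deterministic,
# so replicas that apply the same log reach the same state.

def execute_agreed_order(log, shards):
    """Apply an agreed log of multi-shard CAS transactions in order.

    Each txn maps key -> (expected, new_value); a key's shard is chosen
    by hashing. The CAS succeeds only if every compare matches.
    """
    results = []
    for txn in log:
        # Read phase: check every expected value against the owning shard.
        ok = all(shards[hash(k) % len(shards)].get(k) == expected
                 for k, (expected, _new) in txn.items())
        if ok:
            # Write phase: apply all writes atomically from the log's view.
            for k, (_expected, new) in txn.items():
                shards[hash(k) % len(shards)][k] = new
        results.append(ok)
    return results

shards = [{}, {}]
log = [
    {"a": (None, 1), "b": (None, 2)},  # multi-shard CAS, succeeds
    {"a": (9, 10)},                    # expected value wrong -> no-op
    {"a": (1, 5), "b": (2, 6)},        # succeeds
]
print(execute_agreed_order(log, shards))  # [True, False, True]
```

The point of the sketch is that no locks or extra round trips are needed after ordering, which is why agreeing the order *is* the hard part that the consensus protocol solves.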

You continue to refuse to engage with these and other points. Please respond directly to ALL of the below, that I have been asking you to answer now for several weeks.

1. Since Accord supports all of your mooted transaction systems without penalty the conversation about which semantics to pursue may be conducted in parallel with its development. What about this claim do you not yet understand? If you understand, why should a vote on CEP-15 be delayed?
2. Which SPECIFIC transaction semantics do you want to achieve? You are all over the shop today, demanding Cockroach/YugaByte interactive semantics, but also LOCAL_SERIAL operation and proposing SLOG. These are conflicting demands.
3. Why do you think Accord cannot support your preferred semantics?
4. Will you accept a video call so we may discuss this with you in detail, so we may understand your difficulty understanding these points I keep repeating?

After several weeks of back and forth you should already be able to answer these questions. If you cannot invest the time to answer them now, I perceive this as obstructive and I will escalate this to a PMC vote to break the deadlock.




Re: Tradeoffs for Cassandra transaction management

Posted by Jordan West <jw...@apache.org>.
Hi All,

First off, thank you for the very interesting technical discussions on this
topic. It's been great to see some back and forth on it. I haven't been
involved mainly because my research on this topic is relatively stale. I
did however want to chime in to encourage us to step back and take a look
at the topic of whether SQL support is the direction we want to be going
with Cassandra. For some context, I now work on and operate both Cassandra
and CockroachDB at a relatively large scale. In this case, CockroachDB is
not positioned as a potential replacement for Cassandra but as an
additional choice to meet different needs. Meeting those needs necessitates
different tradeoffs. Tradeoffs that have concrete impacts on how the
database performs, how production support works, how the user can break the
database, and what can be accomplished successfully by the user. When I
look at what my users need from Cassandra, it's not to have a competing
solution to CockroachDB -- a solution that exists and is becoming more and
more production proven every day. They do however need things like
scalable, consistent secondary indexing -- a feature I envision Accord
could unlock with its multi-partition CAS/transactions -- or better
performing single-partition LWTs -- ones that take significantly less round
trips and work over the WAN. I would encourage those pushing for SQL
support to consider that and to start a discussion first with the community
on whether SQL support is the direction we should be heading in the best
interest of the project.

The technical understanding I do have of both Accord and CockroachDB leads
me to believe that holding up CEP-15 for that decision, regardless of
whether we decide SQL support is the direction to go or not, is not
necessary. I believe it was stated earlier in the thread but if Accord
provides similar or better guarantees than Raft, then a similar distributed
transaction protocol can be built on top of it to support interactive SQL.

Jordan


On Tue, Oct 12, 2021 at 8:21 PM Jonathan Ellis <jb...@gmail.com> wrote:

> Blake (and Benedict), I’ll ask for your patience here.  We don’t have a
> precedent of pushing through major initiatives in this project in a matter
> of weeks.  We [members of the PMC that weren’t involved in creating Accord]
> need time to do thorough research and make sure both that we understand
> what is being proposed and that we have evaluated reasonable alternatives.
>
> One of the difficulties in evaluating Accord is that it combines a
> state-of-the-art consensus/ordering protocol with a fairly limited
> transaction manager.  So it may be useful to decouple the consensus and
> transaction processing components, which would both allow non-Cassandra
> usage of the consensus piece, and also make explicit the boundaries with
> transaction processing with the consequence of making it easier to evolve
> independently.
>
> In the meantime, it’s very important to me to understand on which
> dimensions the transaction manager can be improved easily, and which
> dimensions resist such improvement.  I get that Accord is your [plural]
> baby and it’s awkward for me to come along and start pointing at its
> limitations, but that’s part of creating a complete understanding of any
> system.
>
> If I keep coming back to the subject of SQL support and interactive
> transactions, that’s because it’s becoming table stakes in the distributed
> database space. People are using Cockroach or Yugabyte or Cloud Spanner for
> use cases where a couple years ago they would have used Cassandra. We can
> expect this trend to continue and strengthen.
>

Re: Tradeoffs for Cassandra transaction management

Posted by Jonathan Ellis <jb...@gmail.com>.
Blake (and Benedict), I’ll ask for your patience here.  We don’t have a
precedent of pushing through major initiatives in this project in a matter
of weeks.  We [members of the PMC that weren’t involved in creating Accord]
need time to do thorough research and make sure both that we understand
what is being proposed and that we have evaluated reasonable alternatives.

One of the difficulties in evaluating Accord is that it combines a
state-of-the-art consensus/ordering protocol with a fairly limited
transaction manager.  So it may be useful to decouple the consensus and
transaction processing components, which would both allow non-Cassandra
usage of the consensus piece, and also make explicit the boundaries with
transaction processing with the consequence of making it easier to evolve
independently.

In the meantime, it’s very important to me to understand on which
dimensions the transaction manager can be improved easily, and which
dimensions resist such improvement.  I get that Accord is your [plural]
baby and it’s awkward for me to come along and start pointing at its
limitations, but that’s part of creating a complete understanding of any
system.

If I keep coming back to the subject of SQL support and interactive
transactions, that’s because it’s becoming table stakes in the distributed
database space. People are using Cockroach or Yugabyte or Cloud Spanner for
use cases where a couple years ago they would have used Cassandra. We can
expect this trend to continue and strengthen.

On Mon, Oct 11, 2021 at 11:39 PM Blake Eggleston
<be...@apple.com.invalid> wrote:

> Let’s get back on topic.
>
> Jonathan, in your opening email you stated that, in your view, the 2 main
> areas of tradeoff were:
>
> > 1. Is it worth giving up local latencies to get full global consistency?
>
> Now we’ve established that we don’t need to give up local latencies with
> Accord, which leaves:
>
> > 2. Is it worth giving up the possibility of SQL support, to get the
> benefits of deterministic transaction design?
>
> I pointed out that this was a false dilemma and that, in the worst case, a
> hypothetical SQL feature could have its own consensus system. I hope that
> won’t be necessary, but as I later pointed out (and you did not address,
> although maybe I should have phrased it as a question), if we’re going to
> weigh Accord against a hypothetical SQL feature that lacks design goals, or
> any clear ideas about how it might be implemented, how can we rule that out?
>
> So Jonathan, how can we rule that out? How can we have a productive
> discussion about a feature you yourself are unable to describe in any
> meaningful detail?
>

-- 
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced

Re: Tradeoffs for Cassandra transaction management

Posted by Blake Eggleston <be...@apple.com.INVALID>.
Let’s get back on topic.

Jonathan, in your opening email you stated that, in your view, the 2 main areas of tradeoff were:

> 1. Is it worth giving up local latencies to get full global consistency? 

Now we’ve established that we don’t need to give up local latencies with Accord, which leaves:

> 2. Is it worth giving up the possibility of SQL support, to get the benefits of deterministic transaction design?

I pointed out that this was a false dilemma and that, in the worst case, a hypothetical SQL feature could have its own consensus system. I hope that won’t be necessary, but as I later pointed out (and you did not address, although maybe I should have phrased it as a question), if we’re going to weigh Accord against a hypothetical SQL feature that lacks design goals, or any clear ideas about how it might be implemented, how can we rule that out?

So Jonathan, how can we rule that out? How can we have a productive discussion about a feature you yourself are unable to describe in any meaningful detail?

> On Oct 11, 2021, at 6:34 PM, Jonathan Ellis <jb...@gmail.com> wrote:
> 
> On Mon, Oct 11, 2021 at 5:11 PM benedict@apache.org <be...@apache.org>
> wrote:
> 
>> If we want to fully unpack this particular point, as far as I can tell
>> claiming ANSI SQL would indeed require interactive transactions in which
>> arbitrary conditional work may be performed by a client within a
>> transaction in response to other actions within that transaction.
>> 
>> However:
>> 
>>  1.  The ANSI SQL standard permits these transactions to fail and
>> rollback (e.g. in the event that your optimistic transaction fails). So if
>> you want to be pedantic, you may modify my statement to “SQL does not
>> necessitate support for abort-free interactive transactions” and we can
>> leave it there.
>> 
>>  2.  I would personally consider “SQL support” to include the capability
>> of defining arbitrary SQL stored procedures that may be executed by clients
>> in an interactive session
> 
> 
> I note your personal preference and I further note that this is not the
> common understanding of "SQL support" in the industry.  If you tell 100
> developers that your database supports SQL, then at least 99 of them are
> going to assume that you can work with APIs like JDBC that expose
> interactive transactions as a central feature, and hence that you will be
> reasonably compatible with the vast array of SQL-based applications out
> there.
> 
> Historical side note: VoltDB tried to convince people that stored
> procedures were good enough.  It didn't work, and VoltDB had to add
> interactive transactions as fast as they could.
> 
>  3.  Most importantly, as I pointed out in the previous email, Accord is
>> compatible with a YugaByte/Cockroach-like approach, and indeed makes this
>> approach both easier to accomplish and enables stronger isolation than the
>> equivalent Raft-based approach. These approaches are able to reduce the
>> number of conflicts, at a cost of significantly higher transaction
>> management burden.
>> 
> 
> If you're saying that you could use Accord instead of Raft or Paxos, and
> layer 2PC on top of that as in Spanner, then I agree, but I don't think
> that is a very good design, as you would no longer get any of the benefits
> of the deterministic approach you started with.  If you mean something
> else, then perhaps an example would help clarify.
> 
> -- 
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@cassandra.apache.org
For additional commands, e-mail: dev-help@cassandra.apache.org


Re: Tradeoffs for Cassandra transaction management

Posted by "benedict@apache.org" <be...@apache.org>.
Hi Jonathan,

You are missing the woods for the trees here. You outlined several transaction systems, and I have demonstrated that Accord brings them *all* closer.

The immediate context of this discussion is that you are unhappy with CEP-15 due to its impact on a future transaction system. Given the above, it remains unclear why this is still an issue.

I’m happy to continue a long-term roadmap discussion, but without specific further criticisms of CEP-15 we are long overdue a vote.



From: Jonathan Ellis <jb...@gmail.com>
Date: Tuesday, 12 October 2021 at 02:35
To: dev <de...@cassandra.apache.org>
Subject: Re: Tradeoffs for Cassandra transaction management
On Mon, Oct 11, 2021 at 5:11 PM benedict@apache.org <be...@apache.org>
wrote:

> If we want to fully unpack this particular point, as far as I can tell
> claiming ANSI SQL would indeed require interactive transactions in which
> arbitrary conditional work may be performed by a client within a
> transaction in response to other actions within that transaction.
>
> However:
>
>   1.  The ANSI SQL standard permits these transactions to fail and
> rollback (e.g. in the event that your optimistic transaction fails). So if
> you want to be pedantic, you may modify my statement to “SQL does not
> necessitate support for abort-free interactive transactions” and we can
> leave it there.
>
>   2.  I would personally consider “SQL support” to include the capability
> of defining arbitrary SQL stored procedures that may be executed by clients
> in an interactive session


I note your personal preference and I further note that this is not the
common understanding of "SQL support" in the industry.  If you tell 100
developers that your database supports SQL, then at least 99 of them are
going to assume that you can work with APIs like JDBC that expose
interactive transactions as a central feature, and hence that you will be
reasonably compatible with the vast array of SQL-based applications out
there.

Historical side note: VoltDB tried to convince people that stored
procedures were good enough.  It didn't work, and VoltDB had to add
interactive transactions as fast as they could.

>   3.  Most importantly, as I pointed out in the previous email, Accord is
> compatible with a YugaByte/Cockroach-like approach, and indeed makes this
> approach both easier to accomplish and enables stronger isolation than the
> equivalent Raft-based approach. These approaches are able to reduce the
> number of conflicts, at a cost of significantly higher transaction
> management burden.
>

If you're saying that you could use Accord instead of Raft or Paxos, and
layer 2PC on top of that as in Spanner, then I agree, but I don't think
that is a very good design, as you would no longer get any of the benefits
of the deterministic approach you started with.  If you mean something
else, then perhaps an example would help clarify.
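For concreteness, here is a toy sketch (in Python, names and structure purely illustrative, not any real system's API) of the Spanner-style layering being discussed: 2PC run across per-shard replicated logs, where each ShardLog stands in for a consensus group (Paxos, Raft, or Accord) and an append is assumed to be durable and totally ordered within that shard.

```python
# Toy sketch of 2PC layered over per-shard consensus groups.
# Each ShardLog stands in for one replicated log; append() is where a
# consensus round would actually happen. Lock acquisition, recovery, and
# coordinator failover are all elided.

class ShardLog:
    def __init__(self):
        self.entries = []

    def append(self, entry):
        # In a real system this is a consensus round on the shard's group.
        self.entries.append(entry)

    def prepare(self, txid, writes):
        self.append(("PREPARE", txid, writes))
        return True  # vote yes; a real shard could vote no on conflict

    def finish(self, txid, decision):
        self.append((decision, txid))


def two_phase_commit(txid, participants):
    """participants: {shard_log: writes}. Returns the final decision."""
    # Phase 1: replicate a PREPARE record on every participant shard.
    votes = [shard.prepare(txid, writes)
             for shard, writes in participants.items()]
    decision = "COMMIT" if all(votes) else "ABORT"
    # Phase 2: replicate the decision on every participant shard.
    for shard in participants:
        shard.finish(txid, decision)
    return decision
```

The point of the sketch is only the shape: the cross-shard atomicity comes from the 2PC records, while each consensus group provides durability and ordering within its shard.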

--
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced

Re: Tradeoffs for Cassandra transaction management

Posted by "benedict@apache.org" <be...@apache.org>.
Hi Jonathan,

I would appreciate it if you would respond to all of my email(s), as (at your insistence) I spend a great deal of time responding to you. Cherry-picking makes these conversations very difficult.

If we want to fully unpack this particular point, as far as I can tell claiming ANSI SQL would indeed require interactive transactions in which arbitrary conditional work may be performed by a client within a transaction in response to other actions within that transaction.

However:

  1.  The ANSI SQL standard permits these transactions to fail and rollback (e.g. in the event that your optimistic transaction fails). So if you want to be pedantic, you may modify my statement to “SQL does not necessitate support for abort-free interactive transactions” and we can leave it there.
  2.  I would personally consider “SQL support” to include the capability of defining arbitrary SQL stored procedures that may be executed by clients in an interactive session, or interactive sessions where the client must submit transactional scripts that may be arbitrarily complex and contingent on prior responses, but where each script must be executed within its own transaction. For many use cases this would constitute SQL support (and, indeed, I think cover every SQL use case in my career).
  3.  Most importantly, as I pointed out in the previous email, Accord is compatible with a YugaByte/Cockroach-like approach, and indeed makes this approach both easier to accomplish and enables stronger isolation than the equivalent Raft-based approach. These approaches are able to reduce the number of conflicts, at a cost of significantly higher transaction management burden.
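To make point 3 concrete, here is a minimal sketch (illustrative only; the names are mine, not Accord's API) of interactive transactions via read/write intents with optimistic concurrency: the client session buffers reads and writes with version stamps, then submits a single one-shot validate-then-apply transaction, which is the kind of atomic primitive a deterministic protocol provides.

```python
# Toy sketch of OCC-based interactive transactions over a one-shot
# atomic commit primitive. Store stands in for the replicated state
# machine; Store.commit() is the single deterministic transaction.

class Store:
    def __init__(self):
        self.data = {}  # key -> (value, version)

    def read(self, key):
        return self.data.get(key, (None, 0))

    def commit(self, read_set, writes):
        """One-shot transaction: validate read versions, then apply."""
        for key, seen_version in read_set.items():
            if self.read(key)[1] != seen_version:
                return False  # conflict: the client must retry
        for key, value in writes.items():
            _, version = self.read(key)
            self.data[key] = (value, version + 1)
        return True


class Session:
    """Interactive transaction: arbitrary client logic between statements."""
    def __init__(self, store):
        self.store, self.read_set, self.writes = store, {}, {}

    def get(self, key):
        if key in self.writes:
            return self.writes[key]
        value, version = self.store.read(key)
        self.read_set[key] = version  # remember the version we observed
        return value

    def put(self, key, value):
        self.writes[key] = value  # buffered until commit

    def commit(self):
        return self.store.commit(self.read_set, self.writes)
```

Two sessions that read the same key and both try to write will see exactly one commit succeed; the loser observes a stale read version, aborts, and retries, which is the optimistic-concurrency cost referred to above.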

In summary we have all options on the table. Not only does CEP-15 not close any doors, it brings them all a step closer. If you have a strong opinion about which (if any) of these approaches we pursue post CEP-15, I would love to have this conversation. However, this should not block the adoption of CEP-15, since they are not in conflict.


From: Jonathan Ellis <jb...@gmail.com>
Date: Monday, 11 October 2021 at 22:20
To: dev <de...@cassandra.apache.org>
Subject: Re: Tradeoffs for Cassandra transaction management
Hi Benedict,

Yes, interactive transactions are a necessary part of SQL support (as
opposed to a tiny subset of SQL that matches CQL semantics, I don't know
any other way to make sense of your claim that "SQL does not necessitate
support for interactive transactions").

I still don't understand how you're saying we could implement interactive
transactions on top of a deterministic transaction manager.  In the other
thread you said that "Interactive transactions are possible on top of
Accord, as are transactions with an unknown read/write set. In each case
the only cost is that they would use optimistic concurrency control, which
is no worse than spanner derivatives anyway" but this is not correct,
interactive transactions are substantially more difficult to support than
transactions with unknown read/write set, as I outlined in the email to
kick off this thread.

On Sun, Oct 10, 2021 at 4:05 AM benedict@apache.org <be...@apache.org>
wrote:

> Hi Jonathan,
>
> I will summarise my position below, that I have outlined at various points
> in the other thread, and then I would be interested to hear how you propose
> we move forwards. I will commit to responding the same day to any email I
> receive before 7pm GMT, and to engaging with each of your points. I would
> appreciate it if you could make similar commitments so that we may conclude
> this discussion in a reasonable time frame and conduct a vote on CEP-15.
>
> I also reiterate my standing invitation to an open video chat, to discuss
> anything you like, for as long as you like. Please nominate a suitable time
> and day.
>
> ==TL;DR==
> CEP-15 does not narrow our future options, it only broadens them. Accord
> is a distributed consensus protocol, so these techniques may build upon it
> without penalty. Alternatively, these approaches may simply live alongside
> Accord.
>
> Since these alternative approaches do not achieve the goals of the CEP,
> and this CEP only enhances your ability to pursue them, it seems hard to
> conclude it should not proceed.
>
> ==Goals==
> Our goals are first order principles: we want strict serializable
> cross-shard isolation that is highly available and can be scaled while
> maintaining optimal and predictable latency. Anything less, and the CEP is
> not achieved.
>
> As outlined already (except SLOG, which I address below), these
> alternative approaches do not achieve these goals.
>
> ==Compatibility with other approaches==
> 0. In general, research systems are not irreducible - they are an assembly
> of ideas that can be mixed together. Accord is a distributed consensus
> protocol. These other protocols may utilise it without penalty for
> consensus, in many cases obtaining improved characteristics. Conversely,
> Accord may itself directly integrate some of these ideas.
>
> 1. Cockroach, YugaByte, Dynamo et al utilize read and write intents, the
> same as outlined as a technique for interactive transactions with Accord.
> They manage these in a distributed state machine with per-shard consensus,
> permitting them to achieve serializable isolation. This same technique can
> be used with Accord, with the advantage that strict serializable isolation
> would be achievable. For simple transactions we would be able to execute
> with “pure” Accord and retain its execution advantage. Accord does not
> disadvantage this approach, it is only enhanced and made easier.
>
> 2. Calvin: Accord is broadly functionally equivalent, only leaderless,
> thereby achieving better global latency properties.
>
> 3. SLOG: This is essentially Calvin. The main modification is that we may
> assign data a home region, so that transactions may be faster if they
> participate in just one region, and slower if they involve multiple
> regions. Note that this protocol does not achieve global serializability
> without either losing consistency or availability under network partition
> or paying a WAN cost.
>
> In its consistent mode SLOG therefore remains slower than Accord for both
> single-home and multi-home transactions. Accord requires one WAN penalty
> for linearizing a transaction (competing transactions pay this cost
> simultaneously, as with SLOG), however this is achieved for global clients,
> whereas SLOG must cross the WAN multiple times for transactions initiated
> from outside their home, and for all multi-home transactions.
>
> As discussed elsewhere, a future optimisation with Accord is to
> temporarily “home” competing transactions for execution only, so that there
> is no additional WAN penalty when executing competing transactions. This
> would confer the same performance advantages as SLOG, without any of its
> penalties for multi-home transactions or heterogeneous latency
> characteristics, nor any of the complexities of re-homing data, thus
> avoiding these unpredictable performance characteristics.
>
> For those use cases that do not require high availability, it would be
> possible to implement a “home” region setup with Accord, as with SLOG. This
> is not an idea that is exclusive to this particular system. We even
> discussed this briefly in the call, as some use cases do indeed prefer this
> trade-off.
>
> SLOG additionally offers a kind of “home group” multi-home optimisation
> for clusters with many regions, that accept availability loss if fewer than
> half of their regions fail (e.g. in the paper 6 regions in pairs of 2 for
> availability). This is also exploitable by Accord, and something we can
> pursue as a future optimisation, as users explore such topologies in the
> real world.
>
> ==Responding to specific points==
>
> >because it was asserted in the CEP-15 thread that Accord could support
> SQL by applying known techniques on top. This is mistaken. Deterministic
> systems like Calvin or SLOG or Accord can support queries where the rows
> affected are not known in advance using a technique that Abadi calls OLLP
>
> Language is hard and it is easy to conflate things. Here you seem to be
> discussing abort-free interactive transactions, not SQL. SQL does not
> necessitate support for interactive transactions, let alone abort-free
> ones. The technique you mention can support SQL scripts, and also
> interactive client transactions that may be aborted by the server. However,
> see [1] which may support all of these properties.
>
>
>
> From: Blake Eggleston <be...@apple.com.INVALID>
> Date: Sunday, 10 October 2021 at 05:17
> To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> Subject: Re: Tradeoffs for Cassandra transaction management
> 1. Is it worth giving up local latencies to get full global consistency?
> Most LWT use cases use
> LOCAL_SERIAL.
>
> This isn’t a tradeoff that needs to be made. There’s nothing about Accord
> that prevents performing consensus in one DC and replicating the writes to
> others. That’s not in scope for the initial work, but there’s no reason it
> couldn’t be handled as a follow on if needed. I agree with Jeff that
> LOCAL_SERIAL and LWTs are not usually done with a full understanding of the
> implications, but there are some valid use cases. For instance, you can
> enable an OLAP service to operate against another DC without impacting the
> primary, assuming the service can tolerate inconsistency for data written
> since the last repair, and there are some others.
>
> 2. Is it worth giving up the possibility of SQL support, to get the
> benefits of deterministic transaction design?
>
> This is a false dilemma. Today, we’re proposing a deterministic
> transaction design that addresses some very common user pain points. SQL
> addresses a different user pain point. If someone wants to add an SQL
> implementation in the future they can a) build it on top of accord b)
> extend or improve accord or c) implement a separate system. The right
> choice will depend on their goals, but accord won’t prevent work on it, the
> same way the original lwt design isn’t preventing work on multi-partition
> transactions. In the worst case, if the goals of a hypothetical sql project
> are different enough to make them incompatible with accord, I don’t see any
> reason why we couldn’t have 2 separate consensus systems, so long as people
> are willing to maintain them and the use cases and available technologies
> justify it.
>
> -Blake
>
> > On Oct 9, 2021, at 9:54 AM, Jonathan Ellis <jb...@gmail.com> wrote:
> >
> > Hi all,
> >
> > After calling several times for a broader discussion of goals and
> > tradeoffs around transaction management in the CEP-15 thread, I’ve put
> > together a short analysis to kick that off.
> >
> > Here is a table that summarizes the state of the art for distributed
> > transactions that offer serializability, i.e., a superset of what you
> > can get with LWT.  (The most interesting option that this eliminates is
> > RAMP.)  Since I'm not sure how this will render outside gmail, I've also
> > uploaded it here: https://imgur.com/a/SCZ8jex
> >
> > Write latency:
> > - Spanner: Global Paxos, plus 2pc for multi-partition.  For
> >   intercontinental replication this is 100+ms.  Cloud Spanner does not
> >   allow truly global deployments for this reason.
> > - Cockroach: Single-region Paxos, plus 2pc.  I’m not very clear on how
> >   this works but it results in non-strict serializability.  I didn’t
> >   find actual numbers for CR other than “2ms in a single AZ” which is
> >   not a typical scenario.
> > - Calvin/Fauna: Global Raft.  Fauna posts actual numbers of ~70ms in
> >   production which I assume corresponds to a multi-region deployment
> >   with all regions in the USA.  SLOG paper says true global Calvin is
> >   200+ms.
> > - SLOG (see below): Single-region Paxos (common case) with fallback to
> >   multi-region Paxos.  Under 10ms.
> >
> > Scalability bottlenecks:
> > - Spanner: Locks held during cross-region replication.
> > - Cockroach: Same as Spanner.
> > - Calvin/Fauna: OLLP approach required when PKs are not known in advance
> >   (mostly for indexed queries) -- results in retries under contention.
> > - SLOG: Same as Calvin.
> >
> > Read latency at serial consistency:
> > - Spanner: Timestamp from Paxos leader (may be cross-region), then read
> >   from local replica.
> > - Cockroach: Same as Spanner, I think.
> > - Calvin/Fauna: Same as writes.
> > - SLOG: Same as writes.
> >
> > Maximum serializability flavor:
> > - Spanner: Strict.  Cockroach: Un-strict.  Calvin/Fauna: Strict.
> >   SLOG: Strict.
> >
> > Support for other isolation levels?
> > - Spanner: Snapshot.  Cockroach: No.  Calvin/Fauna: Snapshot (in Fauna).
> > - SLOG: Paper mentions dropping from strict-serializable to only
> >   serializable.  Probably could also support Snapshot like Fauna.
> >
> > Interactive transaction support (req’d for SQL):
> > - Spanner: Yes.  Cockroach: Yes.  Calvin/Fauna: No.  SLOG: No.
> >
> > Potential for grafting onto C*:
> > - Spanner: Nightmare.  Cockroach: Nightmare.
> > - Calvin/Fauna: Reasonable, Calvin is relatively simple and the storage
> >   assumptions it makes are minimal.
> > - SLOG: I haven’t thought about this enough.  SLOG may require versioned
> >   storage, e.g. see this comment:
> >   http://dbmsmusings.blogspot.com/2019/10/introducing-slog-cheating-low-latency.html?showComment=1570497003296#c5976719429355924873
> >
> > (I have not included Accord here because it’s not sufficiently clear to
> > me how to create a full transaction manager from the Accord protocol, so
> > I can’t analyze many of the properties such a system would have.  The
> > most obvious solution would be “Calvin but with Accord instead of Raft”,
> > but since Accord already does some Calvin-like things that seems like it
> > would result in some suboptimal redundancy.)
> >
> > After putting the above together it seems to me that the two main areas
> > of tradeoff are:
> >
> > 1. Is it worth giving up local latencies to get full global consistency?
> > Most LWT use cases use LOCAL_SERIAL.  While all of the above have more
> > efficient designs than LWT, it’s still true that global serialization
> > will require 100+ms in the general case due to physical transmission
> > latency.  So a design that allows local serialization with EC between
> > regions, or a design (like SLOG) that automatically infers a “home”
> > region that can do local consensus in the common case without giving up
> > global serializability, is desirable.
> >
> > 2. Is it worth giving up the possibility of SQL support, to get the
> > benefits of deterministic transaction design?  To be clear, these
> > benefits include very significant ones around simplicity of design,
> > higher write throughput, and (in SLOG) lower read and write latencies.
> >
> > I’ll doubleclick on #2 because it was asserted in the CEP-15 thread that
> > Accord could support SQL by applying known techniques on top.  This is
> > mistaken.  Deterministic systems like Calvin or SLOG or Accord can
> > support queries where the rows affected are not known in advance using a
> > technique that Abadi calls OLLP (Optimistic Lock Location Prediction),
> > but this does not help when the transaction logic is not known in
> > advance.
> >
> > Here is Daniel Abadi’s explanation of OLLP from “An Overview of
> > Deterministic Database Systems”
> > (https://cacm.acm.org/magazines/2018/9/230601-an-overview-of-deterministic-database-systems/fulltext?mobile=false):
> >
> > “In practice, deterministic database systems that use ordered locking do
> > not wait until runtime for transactions to determine their access-sets.
> > Instead, they use a technique called OLLP where if a transaction does
> > not know its access-sets in advance, it is not inserted into the input
> > log.  Instead, it is run in a trial mode that does not write to the
> > database state, but determines what it would have read or written to if
> > it was actually being processed.  It is then annotated with the
> > access-sets determined during the trial run, and submitted to the input
> > log for actual processing.  In the actual run, every replica processes
> > the transaction deterministically, acquiring locks for the transaction
> > based on the estimate from the trial run.  In some cases, database state
> > may have changed in a way that the access sets estimates are now
> > incorrect.  Since a transaction cannot read or write data for which it
> > does not have a lock, it must abort as soon as it realizes that it
> > acquired the wrong set of locks.  But since the transaction is being
> > processed deterministically at this point, every replica will
> > independently come to the same conclusion that the wrong set of locks
> > were acquired, and will all independently decide to abort the
> > transaction.  The transaction then gets resubmitted to the input log
> > with the new access-set estimates annotated.”
> >
> > Clearly this does not work if the server-visible logic changes between
> > runs.  For instance, consider this simple interactive transaction:
> >
> >     cursor.execute("BEGIN TRANSACTION")
> >     count = cursor.execute("SELECT count FROM inventory WHERE id = 1").result[0]
> >     if count > 0:
> >         cursor.execute("UPDATE inventory SET count = count - 1 WHERE id = 1")
> >     cursor.execute("COMMIT TRANSACTION")
> >
> > The first problem is that it’s far from clear how to do a “trial run” of
> > a transaction that the server only knows pieces of at a time.  But even
> > worse, the server only knows that it got either a SELECT, or a SELECT
> > followed by an UPDATE.  It doesn’t know anything about the logic that
> > would drive a change in those statements.  So if the value read changes
> > between trial run and execution, there is no possibility of
> > transparently retrying, you’re just screwed and have to report failure.
> >
> > So Abadi concludes, “[A]ll recent [deterministic database]
> > implementations have limited or no support for interactive transactions,
> > thereby preventing their use in many existing deployments.  If the
> > advantages of deterministic database systems will be realized in the
> > coming years, one of two things must occur: either database users must
> > accept a stored procedure interface to the system [instead of
> > client-side SQL], or additional research must be performed in order to
> > enable improved support for interactive transactions.”
> >
> > TLDR: We need to decide if we want to give users local transaction
> > latencies, either with an approach inspired by SLOG or with tuneable
> > serializability like LWT (trading away global consistency).  I think the
> > answer here is clearly Yes, we have abundant evidence from LWT that
> > people care a great deal about latency, and specifically that they are
> > willing to live with cross-datacenter eventual consistency to get low
> > local latencies.  We also need to decide if we eventually want to
> > support full SQL.  I think this one is less clear, there are strong
> > arguments both ways.
> >
> > P.S. SLOG deserves more attention.  Here are links to the paper
> > (http://www.vldb.org/pvldb/vol12/p1747-ren.pdf), Abadi’s writeup
> > (http://dbmsmusings.blogspot.com/2019/10/introducing-slog-cheating-low-latency.html),
> > and Murat Demirbas’s reading group compares SLOG to something called
> > Ocean Vista that I’ve never heard of but which reminds me of Accord
> > (http://muratbuffalo.blogspot.com/2020/11/ocean-vista-gossip-based-visibility.html).
> > --
> > Jonathan Ellis
> > co-founder, http://www.datastax.com
> > @spyced
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@cassandra.apache.org
> For additional commands, e-mail: dev-help@cassandra.apache.org
>


--
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced

Re: Tradeoffs for Cassandra transaction management

Posted by Jonathan Ellis <jb...@gmail.com>.
 Hi Benedict,

Yes, interactive transactions are a necessary part of SQL support (as
opposed to a tiny subset of SQL that matches CQL semantics, I don't know
any other way to make sense of your claim that "SQL does not necessitate
support for interactive transactions").

I still don't understand how you're saying we could implement interactive
transactions on top of a deterministic transaction manager.  In the other
thread you said that "Interactive transactions are possible on top of
Accord, as are transactions with an unknown read/write set. In each case
the only cost is that they would use optimistic concurrency control, which
is no worse than spanner derivatives anyway" but this is not correct,
interactive transactions are substantially more difficult to support than
transactions with unknown read/write set, as I outlined in the email to
kick off this thread.

On Sun, Oct 10, 2021 at 4:05 AM benedict@apache.org <be...@apache.org>
wrote:

> Hi Jonathan,
>
> I will summarise my position below, that I have outlined at various points
> in the other thread, and then I would be interested to hear how you propose
> we move forwards. I will commit to responding the same day to any email I
> receive before 7pm GMT, and to engaging with each of your points. I would
> appreciate it if you could make similar commitments so that we may conclude
> this discussion in a reasonable time frame and conduct a vote on CEP-15.
>
> I also reiterate my standing invitation to an open video chat, to discuss
> anything you like, for as long as you like. Please nominate a suitable time
> and day.
>
> ==TL;DR==
> CEP-15 does not narrow our future options, it only broadens them. Accord
> is a distributed consensus protocol, so these techniques may build upon it
> without penalty. Alternatively, these approaches may simply live alongside
> Accord.
>
> Since these alternative approaches do not achieve the goals of the CEP,
> and this CEP only enhances your ability to pursue them, it seems hard to
> conclude it should not proceed.
>
> ==Goals==
> Our goals are first order principles: we want strict serializable
> cross-shard isolation that is highly available and can be scaled while
> maintaining optimal and predictable latency. Anything less, and the CEP is
> not achieved.
>
> As outlined already (except SLOG, which I address below), these
> alternative approaches do not achieve these goals.
>
> ==Compatibility with other approaches==
> 0. In general, research systems are not irreducible - they are an assembly
> of ideas that can be mixed together. Accord is a distributed consensus
> protocol. These other protocols may utilise it without penalty for
> consensus, in many cases obtaining improved characteristics. Conversely,
> Accord may itself directly integrate some of these ideas.
>
> 1. Cockroach, YugaByte, Dynamo et al utilize read and write intents, the
> same as outlined as a technique for interactive transactions with Accord.
> They manage these in a distributed state machine with per-shard consensus,
> permitting them to achieve serializable isolation. This same technique can
> be used with Accord, with the advantage that strict serializable isolation
> would be achievable. For simple transactions we would be able to execute
> with “pure” Accord and retain its execution advantage. Accord does not
> disadvantage this approach, it is only enhanced and made easier.
>
> 2. Calvin: Accord is broadly functionally equivalent, only leaderless,
> thereby achieving better global latency properties.
>
> 3. SLOG: This is essentially Calvin. The main modification is that we may
> assign data a home region, so that transactions may be faster if they
> participate in just one region, and slower if they involve multiple
> regions. Note that this protocol does not achieve global serializability
> without either losing consistency or availability under network partition
> or paying a WAN cost.
>
> In its consistent mode SLOG therefore remains slower than Accord for both
> single-home and multi-home transactions. Accord requires one WAN penalty
> for linearizing a transaction (competing transactions pay this cost
> simultaneously, as with SLOG), however this is achieved for global clients,
> whereas SLOG must cross the WAN multiple times for transactions initiated
> from outside their home, and for all multi-home transactions.
>
> As discussed elsewhere, a future optimisation with Accord is to
> temporarily “home” competing transaction for execution only, so that there
> is no additional WAN penalty when executing competing transactions. This
> would confer the same performance advantages as SLOG, without any of its
> penalties for multi-home transactions or heterogenous latency
> characteristics, nor any of the complexities of re-homing data, thus
> avoiding these unpredictable performance characteristics.
>
> For those use cases that do not require high availability, it would be
> possible to implement a “home” region setup with Accord, as with SLOG. This
> is not an idea that is exclusive to this particular system. We even
> discussed this briefly in the call, as some use cases do indeed prefer this
> trade-off.
>
> SLOG additionally offers a kind of “home group” multi-home optimisation
> for clusters with many regions, that accept availability loss if fewer than
> half of their regions fail (e.g. in the paper 6 regions in pairs of 2 for
> availability). This is also exploitable by Accord, and something we can
> pursue as a future optimisation, as users explore such topologies in the
> real world.
>
> ==Responding to specific points==
>
> >because it was asserted in the CEP-15 thread that Accord could support
> SQL by applying known techniques on top. This is mistaken. Deterministic
> systems like Calvin or SLOG or Accord can support queries where the rows
> affected are not known in advance using a technique that Abadi calls OLLP
>
> Language is hard and it is easy to conflate things. Here you seem to be
> discussing abort-free interactive transactions, not SQL. SQL does not
> necessitate support for interactive transactions, let alone abort-free
> ones. The technique you mention can support SQL scripts, and also
> interactive client transactions that may be aborted by the server. However,
> see [1] which may support all of these properties.
>
>
>
> From: Blake Eggleston <be...@apple.com.INVALID>
> Date: Sunday, 10 October 2021 at 05:17
> To: dev@cassandra.apache.org <de...@cassandra.apache.org>
> Subject: Re: Tradeoffs for Cassandra transaction management
> 1. Is it worth giving up local latencies to get full global consistency?
> Most LWT use cases use
> LOCAL_SERIAL.
>
> This isn’t a tradeoff that needs to be made. There’s nothing about Accord
> that prevents performing consensus in one DC and replicating the writes to
> others. That’s not in scope for the initial work, but there’s no reason it
> couldn’t be handled as a follow on if needed. I agree with Jeff that
> LOCAL_SERIAL and LWTs are not usually done with a full understanding of the
> implications, but there are some valid use cases. For instance, you can
> enable an OLAP service to operate against another DC without impacting the
> primary, assuming the service can tolerate inconsistency for data written
> since the last repair, and there are some others.
>
> 2. Is it worth giving up the possibility of SQL support, to get the
> benefits of deterministic transaction design?
>
> This is a false dilemma. Today, we’re proposing a deterministic
> transaction design that addresses some very common user pain points. SQL
> addresses different user pain point. If someone wants to add an sql
> implementation in the future they can a) build it on top of accord b)
> extend or improve accord or c) implement a separate system. The right
> choice will depend on their goals, but accord won’t prevent work on it, the
> same way the original lwt design isn’t preventing work on multi-partition
> transactions. In the worst case, if the goals of a hypothetical sql project
> are different enough to make them incompatible with accord, I don’t see any
> reason why we couldn’t have 2 separate consensus systems, so long as people
> are willing to maintain them and the use cases and available technologies
> justify it.
>
> -Blake
>
> > On Oct 9, 2021, at 9:54 AM, Jonathan Ellis <jb...@gmail.com> wrote:
> >
> > Hi all,
> >
> > After calling several times for a broader discussion of goals and
> > tradeoffs around transaction management in the CEP-15 thread, I’ve put
> > together a short analysis to kick that off.
> >
> > Here is a table that summarizes the state of the art for distributed
> > transactions that offer serializability, i.e., a superset of what you
> > can get with LWT.  (The most interesting option that this eliminates is
> > RAMP.)  Since I'm not sure how this will render outside gmail, I've
> > also uploaded it here: https://imgur.com/a/SCZ8jex
> >
> > Write latency:
> > - Spanner: Global Paxos, plus 2pc for multi-partition.  For
> >   intercontinental replication this is 100+ms.  Cloud Spanner does not
> >   allow truly global deployments for this reason.
> > - Cockroach: Single-region Paxos, plus 2pc.  I’m not very clear on how
> >   this works but it results in non-strict serializability.  I didn’t
> >   find actual numbers for CR other than “2ms in a single AZ” which is
> >   not a typical scenario.
> > - Calvin/Fauna: Global Raft.  Fauna posts actual numbers of ~70ms in
> >   production which I assume corresponds to a multi-region deployment
> >   with all regions in the USA.  SLOG paper says true global Calvin is
> >   200+ms.
> > - SLOG (see below): Single-region Paxos (common case) with fallback to
> >   multi-region Paxos.  Under 10ms.
> >
> > Scalability bottlenecks:
> > - Spanner: Locks held during cross-region replication.
> > - Cockroach: Same as Spanner.
> > - Calvin/Fauna: OLLP approach required when PKs are not known in
> >   advance (mostly for indexed queries) -- results in retries under
> >   contention.
> > - SLOG: Same as Calvin.
> >
> > Read latency at serial consistency:
> > - Spanner: Timestamp from Paxos leader (may be cross-region), then read
> >   from local replica.
> > - Cockroach: Same as Spanner, I think.
> > - Calvin/Fauna: Same as writes.
> > - SLOG: Same as writes.
> >
> > Maximum serializability flavor:
> > - Spanner: Strict.  Cockroach: Un-strict.  Calvin/Fauna: Strict.
> >   SLOG: Strict.
> >
> > Support for other isolation levels?
> > - Spanner: Snapshot.
> > - Cockroach: No.
> > - Calvin/Fauna: Snapshot (in Fauna).
> > - SLOG: Paper mentions dropping from strict-serializable to only
> >   serializable.  Probably could also support Snapshot like Fauna.
> >
> > Interactive transaction support (req’d for SQL):
> > - Spanner: Yes.  Cockroach: Yes.  Calvin/Fauna: No.  SLOG: No.
> >
> > Potential for grafting onto C*:
> > - Spanner: Nightmare.
> > - Cockroach: Nightmare.
> > - Calvin/Fauna: Reasonable, Calvin is relatively simple and the storage
> >   assumptions it makes are minimal.
> > - SLOG: I haven’t thought about this enough.  SLOG may require
> >   versioned storage, e.g. see this comment
> >   <http://dbmsmusings.blogspot.com/2019/10/introducing-slog-cheating-low-latency.html?showComment=1570497003296#c5976719429355924873>.
> >
> > (I have not included Accord here because it’s not sufficiently clear to
> > me how to create a full transaction manager from the Accord protocol,
> > so I can’t analyze many of the properties such a system would have.
> > The most obvious solution would be “Calvin but with Accord instead of
> > Raft”, but since Accord already does some Calvin-like things that seems
> > like it would result in some suboptimal redundancy.)
> >
> > After putting the above together it seems to me that the two main areas
> > of tradeoff are,
> >
> > 1. Is it worth giving up local latencies to get full global
> > consistency?  Most LWT use cases use LOCAL_SERIAL.  While all of the
> > above have more efficient designs than LWT, it’s still true that global
> > serialization will require 100+ms in the general case due to physical
> > transmission latency.  So a design that allows local serialization with
> > EC between regions, or a design (like SLOG) that automatically infers a
> > “home” region that can do local consensus in the common case without
> > giving up global serializability, is desirable.
> >
> > 2. Is it worth giving up the possibility of SQL support, to get the
> > benefits of deterministic transaction design?  To be clear, these
> > benefits include very significant ones around simplicity of design,
> > higher write throughput, and (in SLOG) lower read and write latencies.
> >
> > I’ll doubleclick on #2 because it was asserted in the CEP-15 thread
> > that Accord could support SQL by applying known techniques on top.
> > This is mistaken.  Deterministic systems like Calvin or SLOG or Accord
> > can support queries where the rows affected are not known in advance
> > using a technique that Abadi calls OLLP (Optimistic Lock Location
> > Prediction), but this does not help when the transaction logic is not
> > known in advance.
> >
> > Here is Daniel Abadi’s explanation of OLLP from “An Overview of
> > Deterministic Database Systems”
> > <https://cacm.acm.org/magazines/2018/9/230601-an-overview-of-deterministic-database-systems/fulltext?mobile=false>:
> >
> > “In practice, deterministic database systems that use ordered locking
> > do not wait until runtime for transactions to determine their
> > access-sets.  Instead, they use a technique called OLLP where if a
> > transaction does not know its access-sets in advance, it is not
> > inserted into the input log.  Instead, it is run in a trial mode that
> > does not write to the database state, but determines what it would have
> > read or written to if it was actually being processed.  It is then
> > annotated with the access-sets determined during the trial run, and
> > submitted to the input log for actual processing.  In the actual run,
> > every replica processes the transaction deterministically, acquiring
> > locks for the transaction based on the estimate from the trial run.  In
> > some cases, database state may have changed in a way that the access
> > sets estimates are now incorrect.  Since a transaction cannot read or
> > write data for which it does not have a lock, it must abort as soon as
> > it realizes that it acquired the wrong set of locks.  But since the
> > transaction is being processed deterministically at this point, every
> > replica will independently come to the same conclusion that the wrong
> > set of locks were acquired, and will all independently decide to abort
> > the transaction.  The transaction then gets resubmitted to the input
> > log with the new access-set estimates annotated.”
> >
> > Clearly this does not work if the server-visible logic changes between
> > runs.  For instance, consider this simple interactive transaction:
> >
> >     cursor.execute("BEGIN TRANSACTION")
> >     count = cursor.execute("SELECT count FROM inventory WHERE id = 1").result[0]
> >     if count > 0:
> >         cursor.execute("UPDATE inventory SET count = count - 1 WHERE id = 1")
> >     cursor.execute("COMMIT TRANSACTION")
> >
> > The first problem is that it’s far from clear how to do a “trial run”
> > of a transaction that the server only knows pieces of at a time.  But
> > even worse, the server only knows that it got either a SELECT, or a
> > SELECT followed by an UPDATE.  It doesn’t know anything about the logic
> > that would drive a change in those statements.  So if the value read
> > changes between trial run and execution, there is no possibility of
> > transparently retrying, you’re just screwed and have to report failure.
> >
> > So Abadi concludes, “[A]ll recent [deterministic database]
> > implementations have limited or no support for interactive
> > transactions, thereby preventing their use in many existing
> > deployments.  If the advantages of deterministic database systems will
> > be realized in the coming years, one of two things must occur: either
> > database users must accept a stored procedure interface to the system
> > [instead of client-side SQL], or additional research must be performed
> > in order to enable improved support for interactive transactions.”
> >
> > TLDR:
> >
> > We need to decide if we want to give users local transaction latencies,
> > either with an approach inspired by SLOG or with tuneable
> > serializability like LWT (trading away global consistency).  I think
> > the answer here is clearly Yes, we have abundant evidence from LWT that
> > people care a great deal about latency, and specifically that they are
> > willing to live with cross-datacenter eventual consistency to get low
> > local latencies.
> >
> > We also need to decide if we eventually want to support full SQL.  I
> > think this one is less clear, there are strong arguments both ways.
> >
> > P.S. SLOG deserves more attention.  Here are links to the paper
> > <http://www.vldb.org/pvldb/vol12/p1747-ren.pdf>, Abadi’s writeup
> > <http://dbmsmusings.blogspot.com/2019/10/introducing-slog-cheating-low-latency.html>,
> > and Murat Demirbas’s reading group compares SLOG to something called
> > Ocean Vista that I’ve never heard of but which reminds me of Accord
> > <http://muratbuffalo.blogspot.com/2020/11/ocean-vista-gossip-based-visibility.html>.
> >
> > --
> > Jonathan Ellis
> > co-founder, http://www.datastax.com
> > @spyced
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@cassandra.apache.org
> For additional commands, e-mail: dev-help@cassandra.apache.org
>


-- 
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced
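The OLLP flow Abadi describes in the message quoted above (trial run to record the access-set, then a deterministic run that may only touch predicted, locked keys) can be sketched as a toy simulation. All names here are invented for illustration; this is not any real system's API.

```python
# Toy simulation of OLLP: a trial run records the access-set without
# writing; the actual run only touches keys in the predicted (locked) set,
# and aborts deterministically on a mismatch.

db = {"inventory:1": 5}

def decrement_txn(read):
    """Deterministic transaction logic: read one counter, decrement if > 0.
    Returns (read_set, writes) so the runtime can track its access-set."""
    key = "inventory:1"
    count = read(key)
    writes = {key: count - 1} if count > 0 else {}
    return {key}, writes

def trial_run(txn):
    """Run against current state without writing, recording keys touched."""
    reads = set()
    def read(key):
        reads.add(key)
        return db[key]
    read_set, writes = txn(read)
    return reads | read_set | set(writes)

def deterministic_run(txn, predicted):
    """Actual run: only keys in the predicted set may be read or written.
    On a mismatch, every replica would independently abort and resubmit."""
    def read(key):
        if key not in predicted:
            raise RuntimeError("access-set mismatch: abort and resubmit")
        return db[key]
    _, writes = txn(read)
    for key, value in writes.items():
        if key not in predicted:
            raise RuntimeError("access-set mismatch: abort and resubmit")
        db[key] = value
    return writes

access_set = trial_run(decrement_txn)                  # annotate access-set
writes = deterministic_run(decrement_txn, access_set)  # deterministic replay
```

If the database state changes between the trial and actual runs in a way that alters the access-set, the deterministic run raises, standing in for the abort-and-resubmit path described in the quote.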

Re: Tradeoffs for Cassandra transaction management

Posted by "benedict@apache.org" <be...@apache.org>.
Hi Jonathan,

I will summarise my position below, that I have outlined at various points in the other thread, and then I would be interested to hear how you propose we move forwards. I will commit to responding the same day to any email I receive before 7pm GMT, and to engaging with each of your points. I would appreciate it if you could make similar commitments so that we may conclude this discussion in a reasonable time frame and conduct a vote on CEP-15.

I also reiterate my standing invitation to an open video chat, to discuss anything you like, for as long as you like. Please nominate a suitable time and day.

==TL;DR==
CEP-15 does not narrow our future options, it only broadens them. Accord is a distributed consensus protocol, so these techniques may build upon it without penalty. Alternatively, these approaches may simply live alongside Accord.

Since these alternative approaches do not achieve the goals of the CEP, and this CEP only enhances your ability to pursue them, it seems hard to conclude it should not proceed.

==Goals==
Our goals are first order principles: we want strict serializable cross-shard isolation that is highly available and can be scaled while maintaining optimal and predictable latency. Anything less, and the CEP is not achieved.

As outlined already (except SLOG, which I address below), these alternative approaches do not achieve these goals.

==Compatibility with other approaches==
0. In general, research systems are not irreducible - they are an assembly of ideas that can be mixed together. Accord is a distributed consensus protocol. These other protocols may utilise it without penalty for consensus, in many cases obtaining improved characteristics. Conversely, Accord may itself directly integrate some of these ideas.

1. Cockroach, YugaByte, Dynamo et al. utilize read and write intents, the same technique outlined for interactive transactions with Accord. They manage these in a distributed state machine with per-shard consensus, permitting them to achieve serializable isolation. This same technique can be used with Accord, with the advantage that strict serializable isolation would be achievable. For simple transactions we would be able to execute with “pure” Accord and retain its execution advantage. Accord does not disadvantage this approach, it is only enhanced and made easier.

2. Calvin: Accord is broadly functionally equivalent, only leaderless, thereby achieving better global latency properties.

3. SLOG: This is essentially Calvin. The main modification is that we may assign data a home region, so that transactions may be faster if they participate in just one region, and slower if they involve multiple regions. Note that this protocol does not achieve global serializability without either losing consistency or availability under network partition or paying a WAN cost.

In its consistent mode SLOG therefore remains slower than Accord for both single-home and multi-home transactions. Accord requires one WAN penalty for linearizing a transaction (competing transactions pay this cost simultaneously, as with SLOG), however this is achieved for global clients, whereas SLOG must cross the WAN multiple times for transactions initiated from outside their home, and for all multi-home transactions.

As discussed elsewhere, a future optimisation with Accord is to temporarily “home” competing transactions for execution only, so that there is no additional WAN penalty when executing competing transactions. This would confer the same performance advantages as SLOG, without any of its penalties for multi-home transactions or heterogeneous latency characteristics, nor any of the complexities of re-homing data, thus avoiding these unpredictable performance characteristics.

For those use cases that do not require high availability, it would be possible to implement a “home” region setup with Accord, as with SLOG. This is not an idea that is exclusive to this particular system. We even discussed this briefly in the call, as some use cases do indeed prefer this trade-off.

SLOG additionally offers a kind of “home group” multi-home optimisation for clusters with many regions, that accept availability loss if fewer than half of their regions fail (e.g. in the paper 6 regions in pairs of 2 for availability). This is also exploitable by Accord, and something we can pursue as a future optimisation, as users explore such topologies in the real world.
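The read/write intent technique mentioned in point 1 can be sketched minimally as follows. This is a toy single-node illustration with invented names; it is not Cockroach's, YugaByte's, or Accord's actual implementation.

```python
# Per-key write intents: a transaction's writes are provisional until
# commit; other transactions that hit an intent must wait or abort.

class IntentStore:
    def __init__(self):
        self.committed = {}  # key -> last committed value
        self.intents = {}    # key -> (txn_id, provisional value)

    def write(self, txn_id, key, value):
        holder = self.intents.get(key)
        if holder is not None and holder[0] != txn_id:
            raise RuntimeError("conflicting intent: wait or abort")
        self.intents[key] = (txn_id, value)

    def read(self, txn_id, key):
        holder = self.intents.get(key)
        if holder is not None:
            if holder[0] == txn_id:
                return holder[1]  # a txn sees its own provisional writes
            raise RuntimeError("conflicting intent: wait or abort")
        return self.committed.get(key)

    def commit(self, txn_id):
        """Promote this transaction's intents to committed state."""
        for key in [k for k, (t, _) in self.intents.items() if t == txn_id]:
            self.committed[key] = self.intents.pop(key)[1]

store = IntentStore()
store.write("t1", "k", 1)
try:
    store.read("t2", "k")  # another txn collides with t1's intent
    saw_conflict = False
except RuntimeError:
    saw_conflict = True
store.commit("t1")
```

In a distributed setting the intent resolution itself would go through consensus (per-shard in the systems named above; potentially through Accord as point 1 suggests), which this sketch elides.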

==Responding to specific points==

>because it was asserted in the CEP-15 thread that Accord could support SQL by applying known techniques on top. This is mistaken. Deterministic systems like Calvin or SLOG or Accord can support queries where the rows affected are not known in advance using a technique that Abadi calls OLLP

Language is hard and it is easy to conflate things. Here you seem to be discussing abort-free interactive transactions, not SQL. SQL does not necessitate support for interactive transactions, let alone abort-free ones. The technique you mention can support SQL scripts, and also interactive client transactions that may be aborted by the server. However, see [1] which may support all of these properties.
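For the server-abortable interactive transactions mentioned above, the client-side contract can be sketched as a retry loop: the client re-runs its logic from the top whenever the server aborts an attempt. The session object here is a hypothetical stand-in, not a real driver API.

```python
# Client-side retry wrapper for interactive transactions the server may
# abort. `logic` must be safe to re-execute from the beginning.

class TxnAborted(Exception):
    """Raised when the server invalidates this attempt's reads."""

class ToySession:
    """Fake session that aborts the first attempt, then succeeds."""
    def __init__(self):
        self.aborts_left = 1
    def begin(self):
        return self
    def execute(self, statement):
        if self.aborts_left > 0:
            self.aborts_left -= 1
            raise TxnAborted()
        return "ok"
    def commit(self):
        pass
    def rollback(self):
        pass

def run_transaction(session, logic, max_attempts=5):
    """Retry `logic` until it commits or attempts are exhausted."""
    for _ in range(max_attempts):
        txn = session.begin()
        try:
            result = logic(txn)
            txn.commit()
            return result
        except TxnAborted:
            txn.rollback()  # reads invalidated: start over from the top
    raise RuntimeError("transaction did not commit after retries")

result = run_transaction(ToySession(), lambda txn: txn.execute("SELECT 1"))
```

Because the client's branching logic re-runs on every attempt, this avoids the problem described for OLLP, at the cost of exposing aborts to the client.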



From: Blake Eggleston <be...@apple.com.INVALID>
Date: Sunday, 10 October 2021 at 05:17
To: dev@cassandra.apache.org <de...@cassandra.apache.org>
Subject: Re: Tradeoffs for Cassandra transaction management
1. Is it worth giving up local latencies to get full global consistency? Most LWT use cases use
LOCAL_SERIAL.

This isn’t a tradeoff that needs to be made. There’s nothing about Accord that prevents performing consensus in one DC and replicating the writes to others. That’s not in scope for the initial work, but there’s no reason it couldn’t be handled as a follow on if needed. I agree with Jeff that LOCAL_SERIAL and LWTs are not usually used with a full understanding of the implications, but there are some valid use cases. For instance, you can enable an OLAP service to operate against another DC without impacting the primary, assuming the service can tolerate inconsistency for data written since the last repair, and there are some others.

2. Is it worth giving up the possibility of SQL support, to get the benefits of deterministic transaction design?

This is a false dilemma. Today, we’re proposing a deterministic transaction design that addresses some very common user pain points. SQL addresses a different user pain point. If someone wants to add an SQL implementation in the future they can a) build it on top of Accord, b) extend or improve Accord, or c) implement a separate system. The right choice will depend on their goals, but Accord won’t prevent work on it, the same way the original LWT design isn’t preventing work on multi-partition transactions. In the worst case, if the goals of a hypothetical SQL project are different enough to make them incompatible with Accord, I don’t see any reason why we couldn’t have two separate consensus systems, so long as people are willing to maintain them and the use cases and available technologies justify it.

-Blake




Re: Tradeoffs for Cassandra transaction management

Posted by Blake Eggleston <be...@apple.com.INVALID>.
1. Is it worth giving up local latencies to get full global consistency? Most LWT use cases use
LOCAL_SERIAL.

This isn’t a tradeoff that needs to be made. There’s nothing about Accord that prevents performing consensus in one DC and replicating the writes to others. That’s not in scope for the initial work, but there’s no reason it couldn’t be handled as a follow on if needed. I agree with Jeff that LOCAL_SERIAL and LWTs are not usually used with a full understanding of the implications, but there are some valid use cases. For instance, you can enable an OLAP service to operate against another DC without impacting the primary, assuming the service can tolerate inconsistency for data written since the last repair, and there are some others.

2. Is it worth giving up the possibility of SQL support, to get the benefits of deterministic transaction design? 

This is a false dilemma. Today, we’re proposing a deterministic transaction design that addresses some very common user pain points. SQL addresses a different user pain point. If someone wants to add an SQL implementation in the future they can a) build it on top of Accord, b) extend or improve Accord, or c) implement a separate system. The right choice will depend on their goals, but Accord won’t prevent work on it, the same way the original LWT design isn’t preventing work on multi-partition transactions. In the worst case, if the goals of a hypothetical SQL project are different enough to make them incompatible with Accord, I don’t see any reason why we couldn’t have two separate consensus systems, so long as people are willing to maintain them and the use cases and available technologies justify it.

-Blake

> On Oct 9, 2021, at 9:54 AM, Jonathan Ellis <jb...@gmail.com> wrote:
> 
> * Hi all,After calling several times for a broader discussion of goals and
> tradeoffs around transaction management in the CEP-15 thread, I’ve put
> together a short analysis to kick that off.Here is a table that summarizes
> the state of the art for distributed transactions that offer
> serializability, i.e., a superset of what you can get with LWT.  (The most
> interesting option that this eliminates is RAMP.)Since I'm not sure how
> this will render outside gmail, I've also uploaded it here:
> https://imgur.com/a/SCZ8jex
> <https://imgur.com/a/SCZ8jex>SpannerCockroachCalvin/FaunaSLOG (see
> below)Write latencyGlobal Paxos, plus 2pc for multi-partition.For
> intercontinental replication this is 100+ms.  Cloud Spanner does not allow
> truly global deployments for this reason.Single-region Paxos, plus 2pc.
> I’m not very clear on how this works but it results in non-strict
> serializability.I didn’t find actual numbers for CR other than “2ms in a
> single AZ” which is not a typical scenario.Global Raft.  Fauna posts actual
> numbers of ~70ms in production which I assume corresponds to a multi-region
> deployment with all regions in the USA.  SLOG paper says true global Calvin
> is 200+ms.Single-region Paxos (common case) with fallback to multi-region
> Paxos.Under 10ms.Scalability bottlenecksLocks held during cross-region
> replicationSame as SpannerOLLP approach required when PKs are not known in
> advance (mostly for indexed queries) -- results in retries under
> contentionSame as CalvinRead latency at serial consistencyTimestamp from
> Paxos leader (may be cross-region), then read from local replica.Same as
> Spanner, I thinkSame as writesSame as writesMaximum serializability
> flavorStrictUn-strictStrictStrictSupport for other isolation
> levels?SnapshotNoSnapshot (in Fauna)Paper mentions dropping from
> strict-serializable to only serializable.  Probably could also support
> Snapshot like Fauna.Interactive transaction support (req’d for
> SQL)YesYesNoNoPotential for grafting onto C*NightmareNightmareReasonable,
> Calvin is relatively simple and the storage assumptions it makes are
> minimalI haven’t thought about this enough. SLOG may require versioned
> storage, e.g. see this comment
> <http://dbmsmusings.blogspot.com/2019/10/introducing-slog-cheating-low-latency.html?showComment=1570497003296#c5976719429355924873>.(I
> have not included Accord here because it’s not sufficiently clear to me how
> to create a full transaction manager from the Accord protocol, so I can’t
> analyze many of the properties such a system would have.  The most obvious
> solution would be “Calvin but with Accord instead of Raft”, but since
> Accord already does some Calvin-like things that seems like it would result
> in some suboptimal redundancy.)After putting the above together it seems to
> me that the two main areas of tradeoff are, 1. Is it worth giving up local
> latencies to get full global consistency?  Most LWT use cases use
> LOCAL_SERIAL.  While all of the above have more efficient designs than LWT,
> it’s still true that global serialization will require 100+ms in the
> general case due to physical transmission latency.  So a design that allows
> local serialization with EC between regions, or a design (like SLOG) that
> automatically infers a “home” region that can do local consensus in the
> common case without giving up global serializability, is desirable.2. Is it
> worth giving up the possibility of SQL support, to get the benefits of
> deterministic transaction design?  To be clear, these benefits include very
> significant ones around simplicity of design, higher write throughput, and
> (in SLOG) lower read and write latencies.I’ll doubleclick on #2 because it
> was asserted in the CEP-15 thread that Accord could support SQL by applying
> known techniques on top.  This is mistaken.  Deterministic systems like
> Calvin or SLOG or Accord can support queries where the rows affected are
> not known in advance using a technique that Abadi calls OLLP (Optimistic
> Lock Location Prediction), but this does not help when the transaction
> logic is not known in advance.Here is Daniel Abadi’s explanation of OLLP
> from “An Overview of Deterministic Database Systems
> <https://cacm.acm.org/magazines/2018/9/230601-an-overview-of-deterministic-database-systems/fulltext?mobile=false>:”In
> practice, deterministic database systems that use ordered locking do not
> wait until runtime for transactions to determine their access-sets.
> Instead, they use a technique called OLLP where if a transaction does not
> know its access-sets in advance, it is not inserted into the input log.
> Instead, it is run in a trial mode that does not write to the database
> state, but determines what it would have read or written to if it was
> actually being processed. It is then annotated with the access-sets
> determined during the trial run, and submitted to the input log for actual
> processing. In the actual run, every replica processes the transaction
> deterministically, acquiring locks for the transaction based on the
> estimate from the trial run. In some cases, database state may have changed
> in a way that the access sets estimates are now incorrect. Since a
> transaction cannot read or write data for which it does not have a lock, it
> must abort as soon as it realizes that it acquired the wrong set of locks.
> But since the transaction is being processed deterministically at this
> point, every replica will independently come to the same conclusion that
> the wrong set of locks were acquired, and will all independently decide to
> abort the transaction. The transaction then gets resubmitted to the input
> log with the new access-set estimates annotated.Clearly this does not work
> if the server-visible logic changes between runs.  For instance, consider
> this simple interactive transaction:cursor.execute("BEGIN
> TRANSACTION")count = cursor.execute("SELECT count FROM inventory WHERE id =
> 1").result[0]if count > 0:    cursor.execute("UPDATE inventory SET count =
> count - 1 WHERE id = 1")cursor.execute("COMMIT TRANSACTION")The first
> problem is that it’s far from clear how to do a “trial run” of a
> transaction that the server only knows pieces of at a time.  But even
> worse, the server only knows that it got either a SELECT, or a SELECT
> followed by an UPDATE.  It doesn’t know anything about the logic that would
> drive a change in those statements.  So if the value read changes between
> trial run and execution, there is no possibility of transparently retrying,
> you’re just screwed and have to report failure.
>
> So Abadi concludes:
>
> [A]ll
> recent [deterministic database] implementations have limited or no support
> for interactive transactions, thereby preventing their use in many existing
> deployments. If the advantages of deterministic database systems will be
> realized in the coming years, one of two things must occur: either database
> users must accept a stored procedure interface to the system [instead of
> client-side SQL], or additional research must be performed in order to
> enable improved support for interactive transactions.
>
> TL;DR: We need to decide
> if we want to give users local transaction latencies, either with an
> approach inspired by SLOG or with tuneable serializability like LWT
> (trading away global consistency).  I think the answer here is clearly
> yes: we have abundant evidence from LWT that people care a great deal
> about latency, and specifically that they are willing to live with
> cross-datacenter eventual consistency to get low local latencies.
>
> We also need to decide if we eventually want to support full SQL.  I
> think this one is less clear; there are strong arguments both ways.
>
> P.S. SLOG deserves more
> attention. Here are links to the paper
> <http://www.vldb.org/pvldb/vol12/p1747-ren.pdf>, Abadi’s writeup
> <http://dbmsmusings.blogspot.com/2019/10/introducing-slog-cheating-low-latency.html>,
> and Murat Demirbas’s reading group writeup
> <http://muratbuffalo.blogspot.com/2020/11/ocean-vista-gossip-based-visibility.html>,
> which compares SLOG to something called Ocean Vista that I’ve never heard
> of but which reminds me of Accord.
> -- 
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced
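The OLLP cycle quoted above (non-mutating trial run to estimate the access
set, then a deterministic run with locks from that estimate, aborting and
resubmitting if the estimate was wrong) can be sketched as a toy
single-process simulation. All names here, including `txn`, `trial_run`, and
the `db` dict, are illustrative only; Calvin's real implementation runs this
deterministically across replicas against an ordered input log.

```python
# A minimal, hypothetical sketch of the OLLP flow described by Abadi.
db = {"inventory:1": 2, "restock:1": 0}

def txn(read):
    """Decrement a count; the restock branch touches a second key,
    so the access set depends on data read at runtime."""
    count = read("inventory:1")
    writes = {"inventory:1": count - 1}
    if count <= 1:
        writes["restock:1"] = read("restock:1") + 1
    return writes

def trial_run(txn):
    """Run against current state without writing; record keys touched."""
    touched = set()
    def read(key):
        touched.add(key)
        return db[key]
    txn(read)                  # trial mode: writes are discarded
    return touched

def deterministic_run(txn, locked):
    """Run with locks from the trial-run estimate; abort if the
    transaction strays outside them (every replica would decide the
    same way, since execution is deterministic)."""
    def read(key):
        if key not in locked:
            raise LookupError(key)   # wrong lock set acquired
        return db[key]
    try:
        db.update(txn(read))         # writes applied only on success
        return True
    except LookupError:
        return False                 # caller resubmits with new estimate

estimate = trial_run(txn)                       # count == 2: one-key estimate
db["inventory:1"] = 1                           # a concurrently ordered txn intervenes
first_try = deterministic_run(txn, estimate)    # restock branch now taken: abort
estimate = trial_run(txn)                       # re-estimate: two keys this time
second_try = deterministic_run(txn, estimate)   # resubmitted run commits
```

The abort-and-resubmit loop is cheap when estimates are stable, but as the
thread notes, it degrades to repeated retries under contention, and it
assumes the whole transaction body is available up front, which interactive
client-side SQL does not provide.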


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@cassandra.apache.org
For additional commands, e-mail: dev-help@cassandra.apache.org


Re: Tradeoffs for Cassandra transaction management

Posted by Jonathan Ellis <jb...@gmail.com>.
On Sat, Oct 9, 2021 at 7:20 PM Jeff Jirsa <jj...@gmail.com> wrote:

> Most LWT use cases use LOCAL_SERIAL because the difference in latency is
> huge today (given the 4x RTTs) AND almost none of the users actually
> understand how cassandra replication or consistency works, so they
> misunderstand the guarantees provided by the choice they make. When
> informed of the actual tradeoffs, a LOT of those users switch to SERIAL.
>

This doesn't match my experience.  I know of exactly two DataStax customers
using SERIAL; I remember them because they're so unusual.  On the other
hand, I've talked to a dozen plus using LOCAL_SERIAL.

I could try to get more exact numbers if it would help but back of the
envelope, 5:1 in favor of LOCAL is about right.

-- 
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced

Re: Tradeoffs for Cassandra transaction management

Posted by Jeff Jirsa <jj...@gmail.com>.
I'll read more of this in a bit, I want to make sure I fully digest it
before commenting on the rest, but this block here deserves a few words:


On Sat, Oct 9, 2021 at 9:54 AM Jonathan Ellis <jb...@gmail.com> wrote:

> After putting the above together it seems to
> me that the two main areas of tradeoff are, 1. Is it worth giving up local
> latencies to get full global consistency?  Most LWT use cases use
> LOCAL_SERIAL.


Most LWT use cases use LOCAL_SERIAL because the difference in latency is
huge today (given the 4x RTTs) AND almost none of the users actually
understand how cassandra replication or consistency works, so they
misunderstand the guarantees provided by the choice they make. When
informed of the actual tradeoffs, a LOT of those users switch to SERIAL.