You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@distributedlog.apache.org by Xi Liu <xi...@gmail.com> on 2016/09/12 09:20:36 UTC

[Discuss] Transaction Support

Hello,

I asked the transaction support in distributedlog user group two months
ago. I want to raise this up again, as we are looking for using
distributedlog for building a transactional data service. It is a major
feature that is missing in distributedlog. We have some ideas to add this
to distributedlog and want to know if they make sense or not. If they are
good, we'd like to contribute and develop with the community.

Here are the thoughts:

-------------------------------------------------

From our understanding, DL can provide "at-least-once" delivery semantic
(if not, please correct me) but not "exactly-once" delivery semantic. That
means that a message can be delivered one or more times if the reader
doesn't handle duplicates.

The duplicates come from two places, one is at writer side (this assumes
using write proxy not the core library), while the other one is at reader
side.

- writer side: if the client attempts to write a record to the write
proxies and gets a network error (e.g timeouts) then retries, the retrying
will potentially result in duplicates.
- reader side:if the reader reads a message from a stream and then crashes,
when the reader restarts it would restart from last known position (DLSN).
If the reader fails after processing a record and before recording the
position, the processed record will be delivered again.

The reader problem can be properly addressed by making use of the sequence
numbers of records and doing proper checkpointing. For example, in
database, it can checkpoint the indexed data with the sequence number of
records; in flink, it can checkpoint the state with the sequence numbers.

The writer problem can be addressed by implementing an idempotent writer.
However, an alternative and more powerful approach is to support
transactions.

*What does transaction mean?*

A transaction means a collection of records can be written transactionally
within a stream or across multiple streams. They will be consumed by the
reader together when a transaction is committed, or will never be consumed
by the reader when the transaction is aborted.

The transaction will expose following guarantees:

- The reader should not be exposed to records written from uncommitted
transactions (mandatory)
- The reader should consume the records in the transaction commit order
rather than the record written order (mandatory)
- No duplicated records within a transaction (mandatory)
- Allow interleaving transactional writes and non-transactional writes
(optional)

*Stream Transaction & Namespace Transaction*

There will be two types of transaction, one is Stream level transaction
(local transaction), while the other one is Namespace level transaction
(global transaction).

The stream level transaction is a transactional operation on writing
records to one stream; the namespace level transaction is a transactional
operation on writing records to multiple streams.

*Implementation Thoughts*

- A transaction is consist of begin control record, a series of data
records and commit/abort control record.
- The begin/commit/abort control record is written to a `commit` log
stream, while the data records will be written to normal data log streams.
- The `commit` log stream will be the same log stream for stream-level
transaction,  while it will be a *system* stream (or multiple system
streams) for namespace-level transactions.
- The transaction code looks like as below:

<code>

Transaction txn = client.transaction();
Future<DLSN> result1 = txn.write(stream-0, record);
Future<DLSN> result2 = txn.write(stream-1, record);
Future<DLSN> result3 = txn.write(stream-2, record);
Future<Pair<DLSN, DLSN>> result = txn.commit();

</code>

if the txn is committed, all the write futures will be satisfied with their
written DLSNs. if the txn is aborted, all the write futures will be failed
together. there is no partial failure state.

- The actually data flow will be:

1. writer get a transaction id from the owner of the `commit' log stream
1. write the begin control record (synchronously) with the transaction id
2. for each write within the same txn, it will be assigned a local sequence
number starting from 0. the combination of transaction id and local
sequence number will be used later on by the readers to de-duplicate
records.
3. the commit/abort control record will be written based on the results
from 2.

- Application can supply a timeout for the transaction when #begin() a
transaction. The owner of the `commit` log stream can abort transactions
that never be committed/aborted within their timeout.

- Failures:

* all the log records can be simply retried as they will be de-duplicated
probably at the reader side.

- Reader:

* Reader can be configured to read uncommitted records or committed records
only (by default read uncommitted records)
* If reader is configured to read committed records only, the read ahead
cache will be changed to maintain one additional pending committed records.
the pending committed records map is bounded and records will be dropped
when read ahead is moving.
* when the reader hits a commit record, it will rewind to the begin record
and start reading from there. leveraging the proper read ahead cache and
pending commit records cache, it would be good for both short transactions
and long transactions.

- DLSN, SequenceId:

* We will add a fourth field to DLSN. It is `local sequence number` within
a transaction session. So the new DLSN of records in a transaction will be
the DLSN of commit control record plus its local sequence number.
* The sequence id will be still the position of the commit record plus its
local sequence number. The position will be advanced with total number of
written records on writing the commit control record.

- Transaction Group & Namespace Transaction

using one single log stream for namespace transaction can cause the
bottleneck problem since all the begin/commit/end control records will have
to go through one log stream.

the idea of 'transaction group' is to allow partitioning the writers into
different transaction groups.

clients can specify the `group-name` when starting the transaction. if
there is no `group-name` specified, it will use the default `commit` log in
the namespace for creating transactions.

-------------------------------------------------

I'd like to collect feedbacks on this idea. Appreciate any comments and if
anyone is also interested in this idea, we'd like to collaborate with the
community.


- Xi

Re: [Discuss] Transaction Support

Posted by Sijie Guo <si...@apache.org>.

Xi, I added more comments. We are looking forward to your reply and seeing
this happen.

- Sijie

On Tue, Jan 3, 2017 at 11:18 PM, Sijie Guo <si...@apache.org> wrote:

> Sorry for late response. I think Leigh and you already had some very
> valuable discussions in the doc. I will try to add some of my questions to
> the discussion.
>
> Beside that, I had a discussion with Leigh today about this. first of all,
> I think it is very good to add transaction support in distributedlog. It is
> one of the primitives that would help building distributed service. But we
> have a concern about making this system become complicated and introduce
> operational overhead when it runs in the large scale system on production.
> There are two major suggestions that I have for this feature -
>
> Build the 'minimum' logic in core - I think the minimum logic that need to
> be added to the core is -  the special control records (begin, commit and
> abort) and make the reader be able to detect those special control records
> and know what do they mean and how to interrupt with them. Since they are
> special control records, there is not overhead to other readers that
> doesn't require this feature.
>
> Build the transaction coordinator as a separated proxy service  - I think
> the major concern that we have is putting more complexities into the 'write
> proxy' service. We architected distributedlog in a more microservice-like
> way - we have the core as the stream store, the proxy for serving write and
> read traffic. It would be good that the transaction feature can be done in
> a similar way. So the architecture would be like this -
>
> *[ write service ] [ read service ] [ transaction coordinator ]*
> *[ stream store
>               ]*
>
> if people doesn't need the transaction feature, they can turn if off
> completely without any operational overhead.
>
> Beside that, I have one general question - What is the major goal for this
> feature? Are you targeting on building a general XA transaction coordinator
> or just for supporting things like `copy-modify-write' style workflow?
>
>
> Thanks,
> Sijie
>
>
>
>
>
> On Wed, Dec 28, 2016 at 1:12 PM, Xi Liu <xi...@gmail.com> wrote:
>
>> Ping?
>>
>> On Mon, Dec 19, 2016 at 8:28 AM, Xi Liu <xi...@gmail.com> wrote:
>>
>> > Sijie,
>> >
>> > No. I thought it might be easier for people to comment on a google doc
>> to
>> > gather the initial feedback. I will put the content back to wiki page
>> once
>> > addressing the comments. Does that sound good to you?
>> >
>> > And thank you in advance.
>> >
>> > - Xi
>> >
>> >
>> >
>> > On Sun, Dec 18, 2016 at 8:48 AM, Sijie Guo <si...@apache.org> wrote:
>> >
>> >> Hi Xi,
>> >>
>> >> sorry for late response. I will review it soon.
>> >>
>> >> regarding this, a separate question "are we going to use google doc
>> >> instead
>> >> of email thread for any discussion"? I am a bit worried that the
>> >> discussion
>> >> will become lost after moving to google doc. No idea on how other
>> apache
>> >> projects are doing.
>> >>
>> >> - Sijie
>> >>
>> >> On Wed, Dec 14, 2016 at 11:41 PM, Xi Liu <xi...@gmail.com> wrote:
>> >>
>> >> > Hi all,
>> >> >
>> >> > I finalized the first version of the design. This time I used a
>> google
>> >> doc
>> >> > so that it is easier for commenting and add a link the wiki page. I
>> will
>> >> > update this to the wiki page once we come to the finalized design.
>> >> >
>> >> > https://docs.google.com/document/d/14Ns05M8Z5a6DF6fHmWQwISyD5jjeK
>> >> > bSIGgSzXuTI5BA/edit
>> >> >
>> >> > Let me know if you have any questions. Appreciate your reviews!
>> >> >
>> >> > - Xi
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > On Fri, Oct 28, 2016 at 7:58 AM, Leigh Stewart
>> >> > <lstewart@twitter.com.invalid
>> >> > > wrote:
>> >> >
>> >> > > Interesting proposal. A couple quick notes while you continue to
>> flesh
>> >> > this
>> >> > > out.
>> >> > >
>> >> > > a. just to be sure - does this eliminate the need to save seqno
>> with
>> >> > > checkpoint?
>> >> > >
>> >> > > b. i.e. another way to describe this kind of improvement is
>> "support
>> >> > > records (atomic writes) larger than 1MB", iiuc. the advantage
>> being it
>> >> > > avoids the baggage of transactions. disadvantages include inability
>> >> to do
>> >> > > cross stream transactions, and flexibility (interleaving, etc) (are
>> >> there
>> >> > > others?).
>> >> > >
>> >> > > c. proxy use case is for supporting multiple writers - have you
>> >> thought
>> >> > > about how this would work with multiple writers?
>> >> > >
>> >> > > Thanks!
>> >> > >
>> >> > >
>> >> > > On Tue, Oct 18, 2016 at 6:45 PM, Sijie Guo
>> <sijieg@twitter.com.invalid
>> >> >
>> >> > > wrote:
>> >> > >
>> >> > > > Sound good to me. look forward to the detailed proposal.
>> >> > > >
>> >> > > > (I don't mind the format if it makes things easier to you)
>> >> > > >
>> >> > > > Sijie
>> >> > > >
>> >> > > > On Friday, October 14, 2016, Xi Liu <xi...@gmail.com>
>> wrote:
>> >> > > >
>> >> > > > > Thank you, Sijie
>> >> > > > >
>> >> > > > > We have some internal discussions to sort out some details. We
>> are
>> >> > > ready
>> >> > > > to
>> >> > > > > collaborate with the community for adding the transaction
>> support
>> >> in
>> >> > > DL.
>> >> > > > > We'd like to share more.
>> >> > > > >
>> >> > > > > I created a proposal wiki here -
>> >> > > > > https://cwiki.apache.org/confluence/display/DL/DP-1+-+
>> >> > > > > DistributedLog+Transaction+Support
>> >> > > > >
>> >> > > > > (I followed KIP format and named it as DP (DistributedLog
>> >> Proposal -
>> >> > DP
>> >> > > > is
>> >> > > > > also short for Dynamic Programming). I don't know if you guys
>> like
>> >> > this
>> >> > > > > name or not. Feel free to change it :D)
>> >> > > > >
>> >> > > > > I basically put my initial email as the content there so far.
>> >> Once we
>> >> > > > > finished our final discussion, I will update with more
>> details. At
>> >> > the
>> >> > > > same
>> >> > > > > time, any comments are welcome.
>> >> > > > >
>> >> > > > > - Xi
>> >> > > > >
>> >> > > > >
>> >> > > > >
>> >> > > > > On Sat, Oct 8, 2016 at 6:58 AM, Sijie Guo <sijie@apache.org
>> >> > > > <javascript:;>>
>> >> > > > > wrote:
>> >> > > > >
>> >> > > > > > Xi,
>> >> > > > > >
>> >> > > > > > I just granted you the edit permission.
>> >> > > > > >
>> >> > > > > > - Sijie
>> >> > > > > >
>> >> > > > > > On Fri, Oct 7, 2016 at 10:34 AM, Xi Liu <
>> xi.liu.ant@gmail.com
>> >> > > > > <javascript:;>> wrote:
>> >> > > > > >
>> >> > > > > > > I still can not edit the wiki. Can any of the pmc members
>> >> grant
>> >> > me
>> >> > > > the
>> >> > > > > > > permissions?
>> >> > > > > > >
>> >> > > > > > > - Xi
>> >> > > > > > >
>> >> > > > > > > On Sat, Sep 17, 2016 at 10:35 PM, Xi Liu <
>> >> xi.liu.ant@gmail.com
>> >> > > > > <javascript:;>> wrote:
>> >> > > > > > >
>> >> > > > > > > > Sijie,
>> >> > > > > > > >
>> >> > > > > > > > I attempted to create a wiki page under that space. I
>> found
>> >> > that
>> >> > > I
>> >> > > > am
>> >> > > > > > not
>> >> > > > > > > > authorized with edit permission.
>> >> > > > > > > >
>> >> > > > > > > > Can any of the committers grant me the wiki edit
>> >> permission? My
>> >> > > > > account
>> >> > > > > > > is
>> >> > > > > > > > "xi.liu.ant".
>> >> > > > > > > >
>> >> > > > > > > > - Xi
>> >> > > > > > > >
>> >> > > > > > > >
>> >> > > > > > > > On Tue, Sep 13, 2016 at 9:26 AM, Sijie Guo <
>> >> sijie@apache.org
>> >> > > > > <javascript:;>> wrote:
>> >> > > > > > > >
>> >> > > > > > > >> This sounds interesting ... I will take a closer look
>> and
>> >> give
>> >> > > my
>> >> > > > > > > comments
>> >> > > > > > > >> later.
>> >> > > > > > > >>
>> >> > > > > > > >> At the same time, do you mind creating a wiki page to
>> put
>> >> your
>> >> > > > idea
>> >> > > > > > > there?
>> >> > > > > > > >> You can add your wiki page under
>> >> > > > > > > >> https://cwiki.apache.org/confluence/display/DL/Project+
>> >> > > Proposals
>> >> > > > > > > >>
>> >> > > > > > > >> You might need to ask in the dev list to grant the wiki
>> >> edit
>> >> > > > > > permissions
>> >> > > > > > > >> to
>> >> > > > > > > >> you once you have a wiki account.
>> >> > > > > > > >>
>> >> > > > > > > >> - Sijie
>> >> > > > > > > >>
>> >> > > > > > > >>
>> >> > > > > > > >> On Mon, Sep 12, 2016 at 2:20 AM, Xi Liu <
>> >> xi.liu.ant@gmail.com
>> >> > > > > <javascript:;>> wrote:
>> >> > > > > > > >>
>> >> > > > > > > >> > Hello,
>> >> > > > > > > >> >
>> >> > > > > > > >> > I asked the transaction support in distributedlog user
>> >> group
>> >> > > two
>> >> > > > > > > months
>> >> > > > > > > >> > ago. I want to raise this up again, as we are looking
>> for
>> >> > > using
>> >> > > > > > > >> > distributedlog for building a transactional data
>> >> service. It
>> >> > > is
>> >> > > > a
>> >> > > > > > > major
>> >> > > > > > > >> > feature that is missing in distributedlog. We have
>> some
>> >> > ideas
>> >> > > to
>> >> > > > > add
>> >> > > > > > > >> this
>> >> > > > > > > >> > to distributedlog and want to know if they make sense
>> or
>> >> > not.
>> >> > > If
>> >> > > > > > they
>> >> > > > > > > >> are
>> >> > > > > > > >> > good, we'd like to contribute and develop with the
>> >> > community.
>> >> > > > > > > >> >
>> >> > > > > > > >> > Here are the thoughts:
>> >> > > > > > > >> >
>> >> > > > > > > >> > -------------------------------------------------
>> >> > > > > > > >> >
>> >> > > > > > > >> > From our understanding, DL can provide "at-least-once"
>> >> > > delivery
>> >> > > > > > > semantic
>> >> > > > > > > >> > (if not, please correct me) but not "exactly-once"
>> >> delivery
>> >> > > > > > semantic.
>> >> > > > > > > >> That
>> >> > > > > > > >> > means that a message can be delivered one or more
>> times
>> >> if
>> >> > the
>> >> > > > > > reader
>> >> > > > > > > >> > doesn't handle duplicates.
>> >> > > > > > > >> >
>> >> > > > > > > >> > The duplicates come from two places, one is at writer
>> >> side
>> >> > > (this
>> >> > > > > > > assumes
>> >> > > > > > > >> > using write proxy not the core library), while the
>> other
>> >> one
>> >> > > is
>> >> > > > at
>> >> > > > > > > >> reader
>> >> > > > > > > >> > side.
>> >> > > > > > > >> >
>> >> > > > > > > >> > - writer side: if the client attempts to write a
>> record
>> >> to
>> >> > the
>> >> > > > > write
>> >> > > > > > > >> > proxies and gets a network error (e.g timeouts) then
>> >> > retries,
>> >> > > > the
>> >> > > > > > > >> retrying
>> >> > > > > > > >> > will potentially result in duplicates.
>> >> > > > > > > >> > - reader side:if the reader reads a message from a
>> stream
>> >> > and
>> >> > > > then
>> >> > > > > > > >> crashes,
>> >> > > > > > > >> > when the reader restarts it would restart from last
>> known
>> >> > > > position
>> >> > > > > > > >> (DLSN).
>> >> > > > > > > >> > If the reader fails after processing a record and
>> before
>> >> > > > recording
>> >> > > > > > the
>> >> > > > > > > >> > position, the processed record will be delivered
>> again.
>> >> > > > > > > >> >
>> >> > > > > > > >> > The reader problem can be properly addressed by making
>> >> use
>> >> > of
>> >> > > > the
>> >> > > > > > > >> sequence
>> >> > > > > > > >> > numbers of records and doing proper checkpointing. For
>> >> > > example,
>> >> > > > in
>> >> > > > > > > >> > database, it can checkpoint the indexed data with the
>> >> > sequence
>> >> > > > > > number
>> >> > > > > > > of
>> >> > > > > > > >> > records; in flink, it can checkpoint the state with
>> the
>> >> > > sequence
>> >> > > > > > > >> numbers.
>> >> > > > > > > >> >
>> >> > > > > > > >> > The writer problem can be addressed by implementing an
>> >> > > > idempotent
>> >> > > > > > > >> writer.
>> >> > > > > > > >> > However, an alternative and more powerful approach is
>> to
>> >> > > support
>> >> > > > > > > >> > transactions.
>> >> > > > > > > >> >
>> >> > > > > > > >> > *What does transaction mean?*
>> >> > > > > > > >> >
>> >> > > > > > > >> > A transaction means a collection of records can be
>> >> written
>> >> > > > > > > >> transactionally
>> >> > > > > > > >> > within a stream or across multiple streams. They will
>> be
>> >> > > > consumed
>> >> > > > > by
>> >> > > > > > > the
>> >> > > > > > > >> > reader together when a transaction is committed, or
>> will
>> >> > never
>> >> > > > be
>> >> > > > > > > >> consumed
>> >> > > > > > > >> > by the reader when the transaction is aborted.
>> >> > > > > > > >> >
>> >> > > > > > > >> > The transaction will expose following guarantees:
>> >> > > > > > > >> >
>> >> > > > > > > >> > - The reader should not be exposed to records written
>> >> from
>> >> > > > > > uncommitted
>> >> > > > > > > >> > transactions (mandatory)
>> >> > > > > > > >> > - The reader should consume the records in the
>> >> transaction
>> >> > > > commit
>> >> > > > > > > order
>> >> > > > > > > >> > rather than the record written order (mandatory)
>> >> > > > > > > >> > - No duplicated records within a transaction
>> (mandatory)
>> >> > > > > > > >> > - Allow interleaving transactional writes and
>> >> > > non-transactional
>> >> > > > > > writes
>> >> > > > > > > >> > (optional)
>> >> > > > > > > >> >
>> >> > > > > > > >> > *Stream Transaction & Namespace Transaction*
>> >> > > > > > > >> >
>> >> > > > > > > >> > There will be two types of transaction, one is Stream
>> >> level
>> >> > > > > > > transaction
>> >> > > > > > > >> > (local transaction), while the other one is Namespace
>> >> level
>> >> > > > > > > transaction
>> >> > > > > > > >> > (global transaction).
>> >> > > > > > > >> >
>> >> > > > > > > >> > The stream level transaction is a transactional
>> >> operation on
>> >> > > > > writing
>> >> > > > > > > >> > records to one stream; the namespace level transaction
>> >> is a
>> >> > > > > > > >> transactional
>> >> > > > > > > >> > operation on writing records to multiple streams.
>> >> > > > > > > >> >
>> >> > > > > > > >> > *Implementation Thoughts*
>> >> > > > > > > >> >
>> >> > > > > > > >> > - A transaction is consist of begin control record, a
>> >> series
>> >> > > of
>> >> > > > > data
>> >> > > > > > > >> > records and commit/abort control record.
>> >> > > > > > > >> > - The begin/commit/abort control record is written to
>> a
>> >> > > `commit`
>> >> > > > > log
>> >> > > > > > > >> > stream, while the data records will be written to
>> normal
>> >> > data
>> >> > > > log
>> >> > > > > > > >> streams.
>> >> > > > > > > >> > - The `commit` log stream will be the same log stream
>> for
>> >> > > > > > stream-level
>> >> > > > > > > >> > transaction,  while it will be a *system* stream (or
>> >> > multiple
>> >> > > > > system
>> >> > > > > > > >> > streams) for namespace-level transactions.
>> >> > > > > > > >> > - The transaction code looks like as below:
>> >> > > > > > > >> >
>> >> > > > > > > >> > <code>
>> >> > > > > > > >> >
>> >> > > > > > > >> > Transaction txn = client.transaction();
>> >> > > > > > > >> > Future<DLSN> result1 = txn.write(stream-0, record);
>> >> > > > > > > >> > Future<DLSN> result2 = txn.write(stream-1, record);
>> >> > > > > > > >> > Future<DLSN> result3 = txn.write(stream-2, record);
>> >> > > > > > > >> > Future<Pair<DLSN, DLSN>> result = txn.commit();
>> >> > > > > > > >> >
>> >> > > > > > > >> > </code>
>> >> > > > > > > >> >
>> >> > > > > > > >> > if the txn is committed, all the write futures will be
>> >> > > satisfied
>> >> > > > > > with
>> >> > > > > > > >> their
>> >> > > > > > > >> > written DLSNs. if the txn is aborted, all the write
>> >> futures
>> >> > > will
>> >> > > > > be
>> >> > > > > > > >> failed
>> >> > > > > > > >> > together. there is no partial failure state.
>> >> > > > > > > >> >
>> >> > > > > > > >> > - The actually data flow will be:
>> >> > > > > > > >> >
>> >> > > > > > > >> > 1. writer get a transaction id from the owner of the
>> >> > `commit'
>> >> > > > log
>> >> > > > > > > stream
>> >> > > > > > > >> > 1. write the begin control record (synchronously) with
>> >> the
>> >> > > > > > transaction
>> >> > > > > > > >> id
>> >> > > > > > > >> > 2. for each write within the same txn, it will be
>> >> assigned a
>> >> > > > local
>> >> > > > > > > >> sequence
>> >> > > > > > > >> > number starting from 0. the combination of
>> transaction id
>> >> > and
>> >> > > > > local
>> >> > > > > > > >> > sequence number will be used later on by the readers
>> to
>> >> > > > > de-duplicate
>> >> > > > > > > >> > records.
>> >> > > > > > > >> > 3. the commit/abort control record will be written
>> based
>> >> on
>> >> > > the
>> >> > > > > > > results
>> >> > > > > > > >> > from 2.
>> >> > > > > > > >> >
>> >> > > > > > > >> > - Application can supply a timeout for the transaction
>> >> when
>> >> > > > > > #begin() a
>> >> > > > > > > >> > transaction. The owner of the `commit` log stream can
>> >> abort
>> >> > > > > > > transactions
>> >> > > > > > > >> > that never be committed/aborted within their timeout.
>> >> > > > > > > >> >
>> >> > > > > > > >> > - Failures:
>> >> > > > > > > >> >
>> >> > > > > > > >> > * all the log records can be simply retried as they
>> will
>> >> be
>> >> > > > > > > >> de-duplicated
>> >> > > > > > > >> > probably at the reader side.
>> >> > > > > > > >> >
>> >> > > > > > > >> > - Reader:
>> >> > > > > > > >> >
>> >> > > > > > > >> > * Reader can be configured to read uncommitted
>> records or
>> >> > > > > committed
>> >> > > > > > > >> records
>> >> > > > > > > >> > only (by default read uncommitted records)
>> >> > > > > > > >> > * If reader is configured to read committed records
>> only,
>> >> > the
>> >> > > > read
>> >> > > > > > > ahead
>> >> > > > > > > >> > cache will be changed to maintain one additional
>> pending
>> >> > > > committed
>> >> > > > > > > >> records.
>> >> > > > > > > >> > the pending committed records map is bounded and
>> records
>> >> > will
>> >> > > be
>> >> > > > > > > dropped
>> >> > > > > > > >> > when read ahead is moving.
>> >> > > > > > > >> > * when the reader hits a commit record, it will
>> rewind to
>> >> > the
>> >> > > > > begin
>> >> > > > > > > >> record
>> >> > > > > > > >> > and start reading from there. leveraging the proper
>> read
>> >> > ahead
>> >> > > > > cache
>> >> > > > > > > and
>> >> > > > > > > >> > pending commit records cache, it would be good for
>> both
>> >> > short
>> >> > > > > > > >> transactions
>> >> > > > > > > >> > and long transactions.
>> >> > > > > > > >> >
>> >> > > > > > > >> > - DLSN, SequenceId:
>> >> > > > > > > >> >
>> >> > > > > > > >> > * We will add a fourth field to DLSN. It is `local
>> >> sequence
>> >> > > > > number`
>> >> > > > > > > >> within
>> >> > > > > > > >> > a transaction session. So the new DLSN of records in a
>> >> > > > transaction
>> >> > > > > > > will
>> >> > > > > > > >> be
>> >> > > > > > > >> > the DLSN of commit control record plus its local
>> sequence
>> >> > > > number.
>> >> > > > > > > >> > * The sequence id will be still the position of the
>> >> commit
>> >> > > > record
>> >> > > > > > plus
>> >> > > > > > > >> its
>> >> > > > > > > >> > local sequence number. The position will be advanced
>> with
>> >> > > total
>> >> > > > > > number
>> >> > > > > > > >> of
>> >> > > > > > > >> > written records on writing the commit control record.
>> >> > > > > > > >> >
>> >> > > > > > > >> > - Transaction Group & Namespace Transaction
>> >> > > > > > > >> >
>> >> > > > > > > >> > using one single log stream for namespace transaction
>> can
>> >> > > cause
>> >> > > > > the
>> >> > > > > > > >> > bottleneck problem since all the begin/commit/end
>> control
>> >> > > > records
>> >> > > > > > will
>> >> > > > > > > >> have
>> >> > > > > > > >> > to go through one log stream.
>> >> > > > > > > >> >
>> >> > > > > > > >> > the idea of 'transaction group' is to allow
>> partitioning
>> >> the
>> >> > > > > writers
>> >> > > > > > > >> into
>> >> > > > > > > >> > different transaction groups.
>> >> > > > > > > >> >
>> >> > > > > > > >> > clients can specify the `group-name` when starting the
>> >> > > > > transaction.
>> >> > > > > > if
>> >> > > > > > > >> > there is no `group-name` specified, it will use the
>> >> default
>> >> > > > > `commit`
>> >> > > > > > > >> log in
>> >> > > > > > > >> > the namespace for creating transactions.
>> >> > > > > > > >> >
>> >> > > > > > > >> > -------------------------------------------------
>> >> > > > > > > >> >
>> >> > > > > > > >> > I'd like to collect feedbacks on this idea. Appreciate
>> >> any
>> >> > > > > comments
>> >> > > > > > > and
>> >> > > > > > > >> if
>> >> > > > > > > >> > anyone is also interested in this idea, we'd like to
>> >> > > collaborate
>> >> > > > > > with
>> >> > > > > > > >> the
>> >> > > > > > > >> > community.
>> >> > > > > > > >> >
>> >> > > > > > > >> >
>> >> > > > > > > >> > - Xi
>> >> > > > > > > >> >
>> >> > > > > > > >>
>> >> > > > > > > >
>> >> > > > > > > >
>> >> > > > > > >
>> >> > > > > >
>> >> > > > >
>> >> > > >
>> >> > >
>> >> >
>> >>
>> >
>> >
>>
>
>

Re: [Discuss] Transaction Support

Posted by Sijie Guo <si...@apache.org>.

On Thu, Jan 5, 2017 at 10:56 PM, Xi Liu <xi...@gmail.com> wrote:

> Asko and Sijie,
>
> Thank you so much for your feedbacks.
>
> We are not targeting at building a general XA transaction coordinator. The
> feature we want is be able to write data to multiple log streams in an
> atomic way.
>

So in other words, that means this feature is 'atomic writes across log
streams', no?


>
> I totally agreed with you about building minimal logic. We also don't want
> to enforce this feature to all the users of DL. Building the TC as a
> separated service sounds clear to me. We will do it follow your suggestion.
>
> I am also replying the comments to you and Leigh on the doc. Hopefully we
> can come to an agreement so that our changes can be accepted.
>
> - Xi
>
> On Wed, Jan 4, 2017 at 1:14 AM, Asko Kauppi <as...@zalando.fi>
> wrote:
>
> > > Beside that, I have one general question - What is the major goal for
> > this
> > > feature? Are you targeting on building a general XA transaction
> > coordinator
> > > or just for supporting things like `copy-modify-write' style workflow?
> >
> > The use case I would have for transactions - at some level of the stack -
> > is supporting dynamic configurations.
> >
> > If a config changes in e.g. three lines, some of the changes may
> logically
> > belong together. E.g. changing both “host” and “port” (if separate
> > entries). One shouldn’t be able to read a state, even temporarily, that
> has
> > new host but old port.
> >
> > I can do this in the application level - it does not need to be part of
> > the DL protocol.
> >
> >
> > Asko Kauppi
> > Zalando Tech Helsinki
> >
> > > On 4 Jan 2017, at 9.18, Sijie Guo <si...@apache.org> wrote:
> > >
> > > Sorry for late response. I think Leigh and you already had some very
> > > valuable discussions in the doc. I will try to add some of my questions
> > to
> > > the discussion.
> > >
> > > Beside that, I had a discussion with Leigh today about this. first of
> > all,
> > > I think it is very good to add transaction support in distributedlog.
> It
> > is
> > > one of the primitives that would help building distributed service. But
> > we
> > > have a concern about making this system become complicated and
> introduce
> > > operational overhead when it runs in the large scale system on
> > production.
> > > There are two major suggestions that I have for this feature -
> > >
> > > Build the 'minimum' logic in core - I think the minimum logic that need
> > to
> > > be added to the core is -  the special control records (begin, commit
> and
> > > abort) and make the reader be able to detect those special control
> > records
> > > and know what do they mean and how to interrupt with them. Since they
> are
> > > special control records, there is not overhead to other readers that
> > > doesn't require this feature.
> > >
> > > Build the transaction coordinator as a separated proxy service  - I
> think
> > > the major concern that we have is putting more complexities into the
> > 'write
> > > proxy' service. We architected distributedlog in a more
> microservice-like
> > > way - we have the core as the stream store, the proxy for serving write
> > and
> > > read traffic. It would be good that the transaction feature can be done
> > in
> > > a similar way. So the architecture would be like this -
> > >
> > > *[ write service ] [ read service ] [ transaction coordinator ]*
> > > *[ stream store
> > >            ]*
> > >
> > > if people doesn't need the transaction feature, they can turn if off
> > > completely without any operational overhead.
> > >
> > > Beside that, I have one general question - What is the major goal for
> > this
> > > feature? Are you targeting on building a general XA transaction
> > coordinator
> > > or just for supporting things like `copy-modify-write' style workflow?
> > >
> > >
> > > Thanks,
> > > Sijie
> > >
> > >
> > >
> > >
> > >
> > > On Wed, Dec 28, 2016 at 1:12 PM, Xi Liu <xi...@gmail.com> wrote:
> > >
> > >> Ping?
> > >>
> > >> On Mon, Dec 19, 2016 at 8:28 AM, Xi Liu <xi...@gmail.com> wrote:
> > >>
> > >>> Sijie,
> > >>>
> > >>> No. I thought it might be easier for people to comment on a google
> doc
> > to
> > >>> gather the initial feedback. I will put the content back to wiki page
> > >> once
> > >>> addressing the comments. Does that sound good to you?
> > >>>
> > >>> And thank you in advance.
> > >>>
> > >>> - Xi
> > >>>
> > >>>
> > >>>
> > >>> On Sun, Dec 18, 2016 at 8:48 AM, Sijie Guo <si...@apache.org> wrote:
> > >>>
> > >>>> Hi Xi,
> > >>>>
> > >>>> sorry for late response. I will review it soon.
> > >>>>
> > >>>> regarding this, a separate question "are we going to use google doc
> > >>>> instead
> > >>>> of email thread for any discussion"? I am a bit worried that the
> > >>>> discussion
> > >>>> will become lost after moving to google doc. No idea on how other
> > apache
> > >>>> projects are doing.
> > >>>>
> > >>>> - Sijie
> > >>>>
> > >>>> On Wed, Dec 14, 2016 at 11:41 PM, Xi Liu <xi...@gmail.com>
> > wrote:
> > >>>>
> > >>>>> Hi all,
> > >>>>>
> > >>>>> I finalized the first version of the design. This time I used a
> > google
> > >>>> doc
> > >>>>> so that it is easier for commenting and add a link the wiki page. I
> > >> will
> > >>>>> update this to the wiki page once we come to the finalized design.
> > >>>>>
> > >>>>> https://docs.google.com/document/d/14Ns05M8Z5a6DF6fHmWQwISyD5jjeK
> > >>>>> bSIGgSzXuTI5BA/edit
> > >>>>>
> > >>>>> Let me know if you have any questions. Appreciate your reviews!
> > >>>>>
> > >>>>> - Xi
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> On Fri, Oct 28, 2016 at 7:58 AM, Leigh Stewart
> > >>>>> <lstewart@twitter.com.invalid
> > >>>>>> wrote:
> > >>>>>
> > >>>>>> Interesting proposal. A couple quick notes while you continue to
> > >> flesh
> > >>>>> this
> > >>>>>> out.
> > >>>>>>
> > >>>>>> a. just to be sure - does this eliminate the need to save seqno
> with
> > >>>>>> checkpoint?
> > >>>>>>
> > >>>>>> b. i.e. another way to describe this kind of improvement is
> "support
> > >>>>>> records (atomic writes) larger than 1MB", iiuc. the advantage
> being
> > >> it
> > >>>>>> avoids the baggage of transactions. disadvantages include
> inability
> > >>>> to do
> > >>>>>> cross stream transactions, and flexibility (interleaving, etc)
> (are
> > >>>> there
> > >>>>>> others?).
> > >>>>>>
> > >>>>>> c. proxy use case is for supporting multiple writers - have you
> > >>>> thought
> > >>>>>> about how this would work with multiple writers?
> > >>>>>>
> > >>>>>> Thanks!
> > >>>>>>
> > >>>>>>
> > >>>>>> On Tue, Oct 18, 2016 at 6:45 PM, Sijie Guo
> > >> <sijieg@twitter.com.invalid
> > >>>>>
> > >>>>>> wrote:
> > >>>>>>
> > >>>>>>> Sound good to me. look forward to the detailed proposal.
> > >>>>>>>
> > >>>>>>> (I don't mind the format if it makes things easier to you)
> > >>>>>>>
> > >>>>>>> Sijie
> > >>>>>>>
> > >>>>>>> On Friday, October 14, 2016, Xi Liu <xi...@gmail.com>
> wrote:
> > >>>>>>>
> > >>>>>>>> Thank you, Sijie
> > >>>>>>>>
> > >>>>>>>> We have some internal discussions to sort out some details. We
> > >> are
> > >>>>>> ready
> > >>>>>>> to
> > >>>>>>>> collaborate with the community for adding the transaction
> > >> support
> > >>>> in
> > >>>>>> DL.
> > >>>>>>>> We'd like to share more.
> > >>>>>>>>
> > >>>>>>>> I created a proposal wiki here -
> > >>>>>>>> https://cwiki.apache.org/confluence/display/DL/DP-1+-+
> > >>>>>>>> DistributedLog+Transaction+Support
> > >>>>>>>>
> > >>>>>>>> (I followed KIP format and named it as DP (DistributedLog
> > >>>> Proposal -
> > >>>>> DP
> > >>>>>>> is
> > >>>>>>>> also short for Dynamic Programming). I don't know if you guys
> > >> like
> > >>>>> this
> > >>>>>>>> name or not. Feel free to change it :D)
> > >>>>>>>>
> > >>>>>>>> I basically put my initial email as the content there so far.
> > >>>> Once we
> > >>>>>>>> finished our final discussion, I will update with more details.
> > >> At
> > >>>>> the
> > >>>>>>> same
> > >>>>>>>> time, any comments are welcome.
> > >>>>>>>>
> > >>>>>>>> - Xi
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> On Sat, Oct 8, 2016 at 6:58 AM, Sijie Guo <sijie@apache.org
> > >>>>>>> <javascript:;>>
> > >>>>>>>> wrote:
> > >>>>>>>>
> > >>>>>>>>> Xi,
> > >>>>>>>>>
> > >>>>>>>>> I just granted you the edit permission.
> > >>>>>>>>>
> > >>>>>>>>> - Sijie
> > >>>>>>>>>
> > >>>>>>>>> On Fri, Oct 7, 2016 at 10:34 AM, Xi Liu <xi.liu.ant@gmail.com
> > >>>>>>>> <javascript:;>> wrote:
> > >>>>>>>>>
> > >>>>>>>>>> I still can not edit the wiki. Can any of the pmc members
> > >>>> grant
> > >>>>> me
> > >>>>>>> the
> > >>>>>>>>>> permissions?
> > >>>>>>>>>>
> > >>>>>>>>>> - Xi
> > >>>>>>>>>>
> > >>>>>>>>>> On Sat, Sep 17, 2016 at 10:35 PM, Xi Liu <
> > >>>> xi.liu.ant@gmail.com
> > >>>>>>>> <javascript:;>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> Sijie,
> > >>>>>>>>>>>
> > >>>>>>>>>>> I attempted to create a wiki page under that space. I
> > >> found
> > >>>>> that
> > >>>>>> I
> > >>>>>>> am
> > >>>>>>>>> not
> > >>>>>>>>>>> authorized with edit permission.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Can any of the committers grant me the wiki edit
> > >>>> permission? My
> > >>>>>>>> account
> > >>>>>>>>>> is
> > >>>>>>>>>>> "xi.liu.ant".
> > >>>>>>>>>>>
> > >>>>>>>>>>> - Xi
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Tue, Sep 13, 2016 at 9:26 AM, Sijie Guo <
> > >>>> sijie@apache.org
> > >>>>>>>> <javascript:;>> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>>> This sounds interesting ... I will take a closer look and
> > >>>> give
> > >>>>>> my
> > >>>>>>>>>> comments
> > >>>>>>>>>>>> later.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> At the same time, do you mind creating a wiki page to put
> > >>>> your
> > >>>>>>> idea
> > >>>>>>>>>> there?
> > >>>>>>>>>>>> You can add your wiki page under
> > >>>>>>>>>>>> https://cwiki.apache.org/confluence/display/DL/Project+
> > >>>>>> Proposals
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> You might need to ask in the dev list to grant the wiki
> > >>>> edit
> > >>>>>>>>> permissions
> > >>>>>>>>>>>> to
> > >>>>>>>>>>>> you once you have a wiki account.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> - Sijie
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Mon, Sep 12, 2016 at 2:20 AM, Xi Liu <
> > >>>> xi.liu.ant@gmail.com
> > >>>>>>>> <javascript:;>> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> Hello,
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> I asked the transaction support in distributedlog user
> > >>>> group
> > >>>>>> two
> > >>>>>>>>>> months
> > >>>>>>>>>>>>> ago. I want to raise this up again, as we are looking
> > >> for
> > >>>>>> using
> > >>>>>>>>>>>>> distributedlog for building a transactional data
> > >>>> service. It
> > >>>>>> is
> > >>>>>>> a
> > >>>>>>>>>> major
> > >>>>>>>>>>>>> feature that is missing in distributedlog. We have some
> > >>>>> ideas
> > >>>>>> to
> > >>>>>>>> add
> > >>>>>>>>>>>> this
> > >>>>>>>>>>>>> to distributedlog and want to know if they make sense
> > >> or
> > >>>>> not.
> > >>>>>> If
> > >>>>>>>>> they
> > >>>>>>>>>>>> are
> > >>>>>>>>>>>>> good, we'd like to contribute and develop with the
> > >>>>> community.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Here are the thoughts:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> -------------------------------------------------
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> From our understanding, DL can provide "at-least-once"
> > >>>>>> delivery
> > >>>>>>>>>> semantic
> > >>>>>>>>>>>>> (if not, please correct me) but not "exactly-once"
> > >>>> delivery
> > >>>>>>>>> semantic.
> > >>>>>>>>>>>> That
> > >>>>>>>>>>>>> means that a message can be delivered one or more times
> > >>>> if
> > >>>>> the
> > >>>>>>>>> reader
> > >>>>>>>>>>>>> doesn't handle duplicates.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> The duplicates come from two places, one is at writer
> > >>>> side
> > >>>>>> (this
> > >>>>>>>>>> assumes
> > >>>>>>>>>>>>> using write proxy not the core library), while the
> > >> other
> > >>>> one
> > >>>>>> is
> > >>>>>>> at
> > >>>>>>>>>>>> reader
> > >>>>>>>>>>>>> side.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> - writer side: if the client attempts to write a record
> > >>>> to
> > >>>>> the
> > >>>>>>>> write
> > >>>>>>>>>>>>> proxies and gets a network error (e.g timeouts) then
> > >>>>> retries,
> > >>>>>>> the
> > >>>>>>>>>>>> retrying
> > >>>>>>>>>>>>> will potentially result in duplicates.
> > >>>>>>>>>>>>> - reader side:if the reader reads a message from a
> > >> stream
> > >>>>> and
> > >>>>>>> then
> > >>>>>>>>>>>> crashes,
> > >>>>>>>>>>>>> when the reader restarts it would restart from last
> > >> known
> > >>>>>>> position
> > >>>>>>>>>>>> (DLSN).
> > >>>>>>>>>>>>> If the reader fails after processing a record and
> > >> before
> > >>>>>>> recording
> > >>>>>>>>> the
> > >>>>>>>>>>>>> position, the processed record will be delivered again.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> The reader problem can be properly addressed by making
> > >>>> use
> > >>>>> of
> > >>>>>>> the
> > >>>>>>>>>>>> sequence
> > >>>>>>>>>>>>> numbers of records and doing proper checkpointing. For
> > >>>>>> example,
> > >>>>>>> in
> > >>>>>>>>>>>>> database, it can checkpoint the indexed data with the
> > >>>>> sequence
> > >>>>>>>>> number
> > >>>>>>>>>> of
> > >>>>>>>>>>>>> records; in flink, it can checkpoint the state with the
> > >>>>>> sequence
> > >>>>>>>>>>>> numbers.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> The writer problem can be addressed by implementing an
> > >>>>>>> idempotent
> > >>>>>>>>>>>> writer.
> > >>>>>>>>>>>>> However, an alternative and more powerful approach is
> > >> to
> > >>>>>> support
> > >>>>>>>>>>>>> transactions.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> *What does transaction mean?*
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> A transaction means a collection of records can be
> > >>>> written
> > >>>>>>>>>>>> transactionally
> > >>>>>>>>>>>>> within a stream or across multiple streams. They will
> > >> be
> > >>>>>>> consumed
> > >>>>>>>> by
> > >>>>>>>>>> the
> > >>>>>>>>>>>>> reader together when a transaction is committed, or
> > >> will
> > >>>>> never
> > >>>>>>> be
> > >>>>>>>>>>>> consumed
> > >>>>>>>>>>>>> by the reader when the transaction is aborted.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> The transaction will expose following guarantees:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> - The reader should not be exposed to records written
> > >>>> from
> > >>>>>>>>> uncommitted
> > >>>>>>>>>>>>> transactions (mandatory)
> > >>>>>>>>>>>>> - The reader should consume the records in the
> > >>>> transaction
> > >>>>>>> commit
> > >>>>>>>>>> order
> > >>>>>>>>>>>>> rather than the record written order (mandatory)
> > >>>>>>>>>>>>> - No duplicated records within a transaction
> > >> (mandatory)
> > >>>>>>>>>>>>> - Allow interleaving transactional writes and
> > >>>>>> non-transactional
> > >>>>>>>>> writes
> > >>>>>>>>>>>>> (optional)
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> *Stream Transaction & Namespace Transaction*
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> There will be two types of transaction, one is Stream
> > >>>> level
> > >>>>>>>>>> transaction
> > >>>>>>>>>>>>> (local transaction), while the other one is Namespace
> > >>>> level
> > >>>>>>>>>> transaction
> > >>>>>>>>>>>>> (global transaction).
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> The stream level transaction is a transactional
> > >>>> operation on
> > >>>>>>>> writing
> > >>>>>>>>>>>>> records to one stream; the namespace level transaction
> > >>>> is a
> > >>>>>>>>>>>> transactional
> > >>>>>>>>>>>>> operation on writing records to multiple streams.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> *Implementation Thoughts*
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> - A transaction is consist of begin control record, a
> > >>>> series
> > >>>>>> of
> > >>>>>>>> data
> > >>>>>>>>>>>>> records and commit/abort control record.
> > >>>>>>>>>>>>> - The begin/commit/abort control record is written to a
> > >>>>>> `commit`
> > >>>>>>>> log
> > >>>>>>>>>>>>> stream, while the data records will be written to
> > >> normal
> > >>>>> data
> > >>>>>>> log
> > >>>>>>>>>>>> streams.
> > >>>>>>>>>>>>> - The `commit` log stream will be the same log stream
> > >> for
> > >>>>>>>>> stream-level
> > >>>>>>>>>>>>> transaction,  while it will be a *system* stream (or
> > >>>>> multiple
> > >>>>>>>> system
> > >>>>>>>>>>>>> streams) for namespace-level transactions.
> > >>>>>>>>>>>>> - The transaction code looks like as below:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> <code>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Transaction txn = client.transaction();
> > >>>>>>>>>>>>> Future<DLSN> result1 = txn.write(stream-0, record);
> > >>>>>>>>>>>>> Future<DLSN> result2 = txn.write(stream-1, record);
> > >>>>>>>>>>>>> Future<DLSN> result3 = txn.write(stream-2, record);
> > >>>>>>>>>>>>> Future<Pair<DLSN, DLSN>> result = txn.commit();
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> </code>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> if the txn is committed, all the write futures will be
> > >>>>>> satisfied
> > >>>>>>>>> with
> > >>>>>>>>>>>> their
> > >>>>>>>>>>>>> written DLSNs. if the txn is aborted, all the write
> > >>>> futures
> > >>>>>> will
> > >>>>>>>> be
> > >>>>>>>>>>>> failed
> > >>>>>>>>>>>>> together. there is no partial failure state.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> - The actually data flow will be:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> 1. writer get a transaction id from the owner of the
> > >>>>> `commit'
> > >>>>>>> log
> > >>>>>>>>>> stream
> > >>>>>>>>>>>>> 1. write the begin control record (synchronously) with
> > >>>> the
> > >>>>>>>>> transaction
> > >>>>>>>>>>>> id
> > >>>>>>>>>>>>> 2. for each write within the same txn, it will be
> > >>>> assigned a
> > >>>>>>> local
> > >>>>>>>>>>>> sequence
> > >>>>>>>>>>>>> number starting from 0. the combination of transaction
> > >> id
> > >>>>> and
> > >>>>>>>> local
> > >>>>>>>>>>>>> sequence number will be used later on by the readers to
> > >>>>>>>> de-duplicate
> > >>>>>>>>>>>>> records.
> > >>>>>>>>>>>>> 3. the commit/abort control record will be written
> > >> based
> > >>>> on
> > >>>>>> the
> > >>>>>>>>>> results
> > >>>>>>>>>>>>> from 2.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> - Application can supply a timeout for the transaction
> > >>>> when
> > >>>>>>>>> #begin() a
> > >>>>>>>>>>>>> transaction. The owner of the `commit` log stream can
> > >>>> abort
> > >>>>>>>>>> transactions
> > >>>>>>>>>>>>> that never be committed/aborted within their timeout.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> - Failures:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> * all the log records can be simply retried as they
> > >> will
> > >>>> be
> > >>>>>>>>>>>> de-duplicated
> > >>>>>>>>>>>>> probably at the reader side.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> - Reader:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> * Reader can be configured to read uncommitted records
> > >> or
> > >>>>>>>> committed
> > >>>>>>>>>>>> records
> > >>>>>>>>>>>>> only (by default read uncommitted records)
> > >>>>>>>>>>>>> * If reader is configured to read committed records
> > >> only,
> > >>>>> the
> > >>>>>>> read
> > >>>>>>>>>> ahead
> > >>>>>>>>>>>>> cache will be changed to maintain one additional
> > >> pending
> > >>>>>>> committed
> > >>>>>>>>>>>> records.
> > >>>>>>>>>>>>> the pending committed records map is bounded and
> > >> records
> > >>>>> will
> > >>>>>> be
> > >>>>>>>>>> dropped
> > >>>>>>>>>>>>> when read ahead is moving.
> > >>>>>>>>>>>>> * when the reader hits a commit record, it will rewind
> > >> to
> > >>>>> the
> > >>>>>>>> begin
> > >>>>>>>>>>>> record
> > >>>>>>>>>>>>> and start reading from there. leveraging the proper
> > >> read
> > >>>>> ahead
> > >>>>>>>> cache
> > >>>>>>>>>> and
> > >>>>>>>>>>>>> pending commit records cache, it would be good for both
> > >>>>> short
> > >>>>>>>>>>>> transactions
> > >>>>>>>>>>>>> and long transactions.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> - DLSN, SequenceId:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> * We will add a fourth field to DLSN. It is `local
> > >>>> sequence
> > >>>>>>>> number`
> > >>>>>>>>>>>> within
> > >>>>>>>>>>>>> a transaction session. So the new DLSN of records in a
> > >>>>>>> transaction
> > >>>>>>>>>> will
> > >>>>>>>>>>>> be
> > >>>>>>>>>>>>> the DLSN of commit control record plus its local
> > >> sequence
> > >>>>>>> number.
> > >>>>>>>>>>>>> * The sequence id will be still the position of the
> > >>>> commit
> > >>>>>>> record
> > >>>>>>>>> plus
> > >>>>>>>>>>>> its
> > >>>>>>>>>>>>> local sequence number. The position will be advanced
> > >> with
> > >>>>>> total
> > >>>>>>>>> number
> > >>>>>>>>>>>> of
> > >>>>>>>>>>>>> written records on writing the commit control record.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> - Transaction Group & Namespace Transaction
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> using one single log stream for namespace transaction
> > >> can
> > >>>>>> cause
> > >>>>>>>> the
> > >>>>>>>>>>>>> bottleneck problem since all the begin/commit/end
> > >> control
> > >>>>>>> records
> > >>>>>>>>> will
> > >>>>>>>>>>>> have
> > >>>>>>>>>>>>> to go through one log stream.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> the idea of 'transaction group' is to allow
> > >> partitioning
> > >>>> the
> > >>>>>>>> writers
> > >>>>>>>>>>>> into
> > >>>>>>>>>>>>> different transaction groups.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> clients can specify the `group-name` when starting the
> > >>>>>>>> transaction.
> > >>>>>>>>> if
> > >>>>>>>>>>>>> there is no `group-name` specified, it will use the
> > >>>> default
> > >>>>>>>> `commit`
> > >>>>>>>>>>>> log in
> > >>>>>>>>>>>>> the namespace for creating transactions.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> -------------------------------------------------
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> I'd like to collect feedbacks on this idea. Appreciate
> > >>>> any
> > >>>>>>>> comments
> > >>>>>>>>>> and
> > >>>>>>>>>>>> if
> > >>>>>>>>>>>>> anyone is also interested in this idea, we'd like to
> > >>>>>> collaborate
> > >>>>>>>>> with
> > >>>>>>>>>>>> the
> > >>>>>>>>>>>>> community.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> - Xi
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>
> > >>>>
> > >>>
> > >>>
> > >>
> >
> >
>

Re: [Discuss] Transaction Support

Posted by Asko Kauppi <as...@zalando.fi>.

Xi, I'm curious about:

>be able to write data to multiple log streams in an atomic way.

Do you intend to group such logs together, so that one producer would be
able to "own" them and write to any of them, with other writers waiting
until (s)he's done. I'd like to know more of the use case.


On 6 January 2017 at 08:56, Xi Liu <xi...@gmail.com> wrote:

> Asko and Sijie,
>
> Thank you so much for your feedbacks.
>
> We are not targeting at building a general XA transaction coordinator. The
> feature we want is be able to write data to multiple log streams in an
> atomic way.
>
> I totally agreed with you about building minimal logic. We also don't want
> to enforce this feature to all the users of DL. Building the TC as a
> separated service sounds clear to me. We will do it follow your suggestion.
>
> I am also replying the comments to you and Leigh on the doc. Hopefully we
> can come to an agreement so that our changes can be accepted.
>
> - Xi
>
> On Wed, Jan 4, 2017 at 1:14 AM, Asko Kauppi <as...@zalando.fi>
> wrote:
>
> > > Beside that, I have one general question - What is the major goal for
> > this
> > > feature? Are you targeting on building a general XA transaction
> > coordinator
> > > or just for supporting things like `copy-modify-write' style workflow?
> >
> > The use case I would have for transactions - at some level of the stack -
> > is supporting dynamic configurations.
> >
> > If a config changes in e.g. three lines, some of the changes may
> logically
> > belong together. E.g. changing both “host” and “port” (if separate
> > entries). One shouldn’t be able to read a state, even temporarily, that
> has
> > new host but old port.
> >
> > I can do this in the application level - it does not need to be part of
> > the DL protocol.
> >
> >
> > Asko Kauppi
> > Zalando Tech Helsinki
> >
> > > On 4 Jan 2017, at 9.18, Sijie Guo <si...@apache.org> wrote:
> > >
> > > Sorry for late response. I think Leigh and you already had some very
> > > valuable discussions in the doc. I will try to add some of my questions
> > to
> > > the discussion.
> > >
> > > Beside that, I had a discussion with Leigh today about this. first of
> > all,
> > > I think it is very good to add transaction support in distributedlog.
> It
> > is
> > > one of the primitives that would help building distributed service. But
> > we
> > > have a concern about making this system become complicated and
> introduce
> > > operational overhead when it runs in the large scale system on
> > production.
> > > There are two major suggestions that I have for this feature -
> > >
> > > Build the 'minimum' logic in core - I think the minimum logic that need
> > to
> > > be added to the core is -  the special control records (begin, commit
> and
> > > abort) and make the reader be able to detect those special control
> > records
> > > and know what do they mean and how to interrupt with them. Since they
> are
> > > special control records, there is not overhead to other readers that
> > > doesn't require this feature.
> > >
> > > Build the transaction coordinator as a separated proxy service  - I
> think
> > > the major concern that we have is putting more complexities into the
> > 'write
> > > proxy' service. We architected distributedlog in a more
> microservice-like
> > > way - we have the core as the stream store, the proxy for serving write
> > and
> > > read traffic. It would be good that the transaction feature can be done
> > in
> > > a similar way. So the architecture would be like this -
> > >
> > > *[ write service ] [ read service ] [ transaction coordinator ]*
> > > *[ stream store
> > >            ]*
> > >
> > > if people doesn't need the transaction feature, they can turn if off
> > > completely without any operational overhead.
> > >
> > > Beside that, I have one general question - What is the major goal for
> > this
> > > feature? Are you targeting on building a general XA transaction
> > coordinator
> > > or just for supporting things like `copy-modify-write' style workflow?
> > >
> > >
> > > Thanks,
> > > Sijie
> > >
> > >
> > >
> > >
> > >
> > > On Wed, Dec 28, 2016 at 1:12 PM, Xi Liu <xi...@gmail.com> wrote:
> > >
> > >> Ping?
> > >>
> > >> On Mon, Dec 19, 2016 at 8:28 AM, Xi Liu <xi...@gmail.com> wrote:
> > >>
> > >>> Sijie,
> > >>>
> > >>> No. I thought it might be easier for people to comment on a google
> doc
> > to
> > >>> gather the initial feedback. I will put the content back to wiki page
> > >> once
> > >>> addressing the comments. Does that sound good to you?
> > >>>
> > >>> And thank you in advance.
> > >>>
> > >>> - Xi
> > >>>
> > >>>
> > >>>
> > >>> On Sun, Dec 18, 2016 at 8:48 AM, Sijie Guo <si...@apache.org> wrote:
> > >>>
> > >>>> Hi Xi,
> > >>>>
> > >>>> sorry for late response. I will review it soon.
> > >>>>
> > >>>> regarding this, a separate question "are we going to use google doc
> > >>>> instead
> > >>>> of email thread for any discussion"? I am a bit worried that the
> > >>>> discussion
> > >>>> will become lost after moving to google doc. No idea on how other
> > apache
> > >>>> projects are doing.
> > >>>>
> > >>>> - Sijie
> > >>>>
> > >>>> On Wed, Dec 14, 2016 at 11:41 PM, Xi Liu <xi...@gmail.com>
> > wrote:
> > >>>>
> > >>>>> Hi all,
> > >>>>>
> > >>>>> I finalized the first version of the design. This time I used a
> > google
> > >>>> doc
> > >>>>> so that it is easier for commenting and add a link the wiki page. I
> > >> will
> > >>>>> update this to the wiki page once we come to the finalized design.
> > >>>>>
> > >>>>> https://docs.google.com/document/d/14Ns05M8Z5a6DF6fHmWQwISyD5jjeK
> > >>>>> bSIGgSzXuTI5BA/edit
> > >>>>>
> > >>>>> Let me know if you have any questions. Appreciate your reviews!
> > >>>>>
> > >>>>> - Xi
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> On Fri, Oct 28, 2016 at 7:58 AM, Leigh Stewart
> > >>>>> <lstewart@twitter.com.invalid
> > >>>>>> wrote:
> > >>>>>
> > >>>>>> Interesting proposal. A couple quick notes while you continue to
> > >> flesh
> > >>>>> this
> > >>>>>> out.
> > >>>>>>
> > >>>>>> a. just to be sure - does this eliminate the need to save seqno
> with
> > >>>>>> checkpoint?
> > >>>>>>
> > >>>>>> b. i.e. another way to describe this kind of improvement is
> "support
> > >>>>>> records (atomic writes) larger than 1MB", iiuc. the advantage
> being
> > >> it
> > >>>>>> avoids the baggage of transactions. disadvantages include
> inability
> > >>>> to do
> > >>>>>> cross stream transactions, and flexibility (interleaving, etc)
> (are
> > >>>> there
> > >>>>>> others?).
> > >>>>>>
> > >>>>>> c. proxy use case is for supporting multiple writers - have you
> > >>>> thought
> > >>>>>> about how this would work with multiple writers?
> > >>>>>>
> > >>>>>> Thanks!
> > >>>>>>
> > >>>>>>
> > >>>>>> On Tue, Oct 18, 2016 at 6:45 PM, Sijie Guo
> > >> <sijieg@twitter.com.invalid
> > >>>>>
> > >>>>>> wrote:
> > >>>>>>
> > >>>>>>> Sound good to me. look forward to the detailed proposal.
> > >>>>>>>
> > >>>>>>> (I don't mind the format if it makes things easier to you)
> > >>>>>>>
> > >>>>>>> Sijie
> > >>>>>>>
> > >>>>>>> On Friday, October 14, 2016, Xi Liu <xi...@gmail.com>
> wrote:
> > >>>>>>>
> > >>>>>>>> Thank you, Sijie
> > >>>>>>>>
> > >>>>>>>> We have some internal discussions to sort out some details. We
> > >> are
> > >>>>>> ready
> > >>>>>>> to
> > >>>>>>>> collaborate with the community for adding the transaction
> > >> support
> > >>>> in
> > >>>>>> DL.
> > >>>>>>>> We'd like to share more.
> > >>>>>>>>
> > >>>>>>>> I created a proposal wiki here -
> > >>>>>>>> https://cwiki.apache.org/confluence/display/DL/DP-1+-+
> > >>>>>>>> DistributedLog+Transaction+Support
> > >>>>>>>>
> > >>>>>>>> (I followed KIP format and named it as DP (DistributedLog
> > >>>> Proposal -
> > >>>>> DP
> > >>>>>>> is
> > >>>>>>>> also short for Dynamic Programming). I don't know if you guys
> > >> like
> > >>>>> this
> > >>>>>>>> name or not. Feel free to change it :D)
> > >>>>>>>>
> > >>>>>>>> I basically put my initial email as the content there so far.
> > >>>> Once we
> > >>>>>>>> finished our final discussion, I will update with more details.
> > >> At
> > >>>>> the
> > >>>>>>> same
> > >>>>>>>> time, any comments are welcome.
> > >>>>>>>>
> > >>>>>>>> - Xi
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> On Sat, Oct 8, 2016 at 6:58 AM, Sijie Guo <sijie@apache.org
> > >>>>>>> <javascript:;>>
> > >>>>>>>> wrote:
> > >>>>>>>>
> > >>>>>>>>> Xi,
> > >>>>>>>>>
> > >>>>>>>>> I just granted you the edit permission.
> > >>>>>>>>>
> > >>>>>>>>> - Sijie
> > >>>>>>>>>
> > >>>>>>>>> On Fri, Oct 7, 2016 at 10:34 AM, Xi Liu <xi.liu.ant@gmail.com
> > >>>>>>>> <javascript:;>> wrote:
> > >>>>>>>>>
> > >>>>>>>>>> I still can not edit the wiki. Can any of the pmc members
> > >>>> grant
> > >>>>> me
> > >>>>>>> the
> > >>>>>>>>>> permissions?
> > >>>>>>>>>>
> > >>>>>>>>>> - Xi
> > >>>>>>>>>>
> > >>>>>>>>>> On Sat, Sep 17, 2016 at 10:35 PM, Xi Liu <
> > >>>> xi.liu.ant@gmail.com
> > >>>>>>>> <javascript:;>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> Sijie,
> > >>>>>>>>>>>
> > >>>>>>>>>>> I attempted to create a wiki page under that space. I
> > >> found
> > >>>>> that
> > >>>>>> I
> > >>>>>>> am
> > >>>>>>>>> not
> > >>>>>>>>>>> authorized with edit permission.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Can any of the committers grant me the wiki edit
> > >>>> permission? My
> > >>>>>>>> account
> > >>>>>>>>>> is
> > >>>>>>>>>>> "xi.liu.ant".
> > >>>>>>>>>>>
> > >>>>>>>>>>> - Xi
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Tue, Sep 13, 2016 at 9:26 AM, Sijie Guo <
> > >>>> sijie@apache.org
> > >>>>>>>> <javascript:;>> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>>> This sounds interesting ... I will take a closer look and
> > >>>> give
> > >>>>>> my
> > >>>>>>>>>> comments
> > >>>>>>>>>>>> later.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> At the same time, do you mind creating a wiki page to put
> > >>>> your
> > >>>>>>> idea
> > >>>>>>>>>> there?
> > >>>>>>>>>>>> You can add your wiki page under
> > >>>>>>>>>>>> https://cwiki.apache.org/confluence/display/DL/Project+
> > >>>>>> Proposals
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> You might need to ask in the dev list to grant the wiki
> > >>>> edit
> > >>>>>>>>> permissions
> > >>>>>>>>>>>> to
> > >>>>>>>>>>>> you once you have a wiki account.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> - Sijie
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Mon, Sep 12, 2016 at 2:20 AM, Xi Liu <
> > >>>> xi.liu.ant@gmail.com
> > >>>>>>>> <javascript:;>> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> Hello,
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> I asked the transaction support in distributedlog user
> > >>>> group
> > >>>>>> two
> > >>>>>>>>>> months
> > >>>>>>>>>>>>> ago. I want to raise this up again, as we are looking
> > >> for
> > >>>>>> using
> > >>>>>>>>>>>>> distributedlog for building a transactional data
> > >>>> service. It
> > >>>>>> is
> > >>>>>>> a
> > >>>>>>>>>> major
> > >>>>>>>>>>>>> feature that is missing in distributedlog. We have some
> > >>>>> ideas
> > >>>>>> to
> > >>>>>>>> add
> > >>>>>>>>>>>> this
> > >>>>>>>>>>>>> to distributedlog and want to know if they make sense
> > >> or
> > >>>>> not.
> > >>>>>> If
> > >>>>>>>>> they
> > >>>>>>>>>>>> are
> > >>>>>>>>>>>>> good, we'd like to contribute and develop with the
> > >>>>> community.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Here are the thoughts:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> -------------------------------------------------
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> From our understanding, DL can provide "at-least-once"
> > >>>>>> delivery
> > >>>>>>>>>> semantic
> > >>>>>>>>>>>>> (if not, please correct me) but not "exactly-once"
> > >>>> delivery
> > >>>>>>>>> semantic.
> > >>>>>>>>>>>> That
> > >>>>>>>>>>>>> means that a message can be delivered one or more times
> > >>>> if
> > >>>>> the
> > >>>>>>>>> reader
> > >>>>>>>>>>>>> doesn't handle duplicates.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> The duplicates come from two places, one is at writer
> > >>>> side
> > >>>>>> (this
> > >>>>>>>>>> assumes
> > >>>>>>>>>>>>> using write proxy not the core library), while the
> > >> other
> > >>>> one
> > >>>>>> is
> > >>>>>>> at
> > >>>>>>>>>>>> reader
> > >>>>>>>>>>>>> side.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> - writer side: if the client attempts to write a record
> > >>>> to
> > >>>>> the
> > >>>>>>>> write
> > >>>>>>>>>>>>> proxies and gets a network error (e.g timeouts) then
> > >>>>> retries,
> > >>>>>>> the
> > >>>>>>>>>>>> retrying
> > >>>>>>>>>>>>> will potentially result in duplicates.
> > >>>>>>>>>>>>> - reader side:if the reader reads a message from a
> > >> stream
> > >>>>> and
> > >>>>>>> then
> > >>>>>>>>>>>> crashes,
> > >>>>>>>>>>>>> when the reader restarts it would restart from last
> > >> known
> > >>>>>>> position
> > >>>>>>>>>>>> (DLSN).
> > >>>>>>>>>>>>> If the reader fails after processing a record and
> > >> before
> > >>>>>>> recording
> > >>>>>>>>> the
> > >>>>>>>>>>>>> position, the processed record will be delivered again.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> The reader problem can be properly addressed by making
> > >>>> use
> > >>>>> of
> > >>>>>>> the
> > >>>>>>>>>>>> sequence
> > >>>>>>>>>>>>> numbers of records and doing proper checkpointing. For
> > >>>>>> example,
> > >>>>>>> in
> > >>>>>>>>>>>>> database, it can checkpoint the indexed data with the
> > >>>>> sequence
> > >>>>>>>>> number
> > >>>>>>>>>> of
> > >>>>>>>>>>>>> records; in flink, it can checkpoint the state with the
> > >>>>>> sequence
> > >>>>>>>>>>>> numbers.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> The writer problem can be addressed by implementing an
> > >>>>>>> idempotent
> > >>>>>>>>>>>> writer.
> > >>>>>>>>>>>>> However, an alternative and more powerful approach is
> > >> to
> > >>>>>> support
> > >>>>>>>>>>>>> transactions.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> *What does transaction mean?*
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> A transaction means a collection of records can be
> > >>>> written
> > >>>>>>>>>>>> transactionally
> > >>>>>>>>>>>>> within a stream or across multiple streams. They will
> > >> be
> > >>>>>>> consumed
> > >>>>>>>> by
> > >>>>>>>>>> the
> > >>>>>>>>>>>>> reader together when a transaction is committed, or
> > >> will
> > >>>>> never
> > >>>>>>> be
> > >>>>>>>>>>>> consumed
> > >>>>>>>>>>>>> by the reader when the transaction is aborted.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> The transaction will expose following guarantees:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> - The reader should not be exposed to records written
> > >>>> from
> > >>>>>>>>> uncommitted
> > >>>>>>>>>>>>> transactions (mandatory)
> > >>>>>>>>>>>>> - The reader should consume the records in the
> > >>>> transaction
> > >>>>>>> commit
> > >>>>>>>>>> order
> > >>>>>>>>>>>>> rather than the record written order (mandatory)
> > >>>>>>>>>>>>> - No duplicated records within a transaction
> > >> (mandatory)
> > >>>>>>>>>>>>> - Allow interleaving transactional writes and
> > >>>>>> non-transactional
> > >>>>>>>>> writes
> > >>>>>>>>>>>>> (optional)
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> *Stream Transaction & Namespace Transaction*
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> There will be two types of transaction, one is Stream
> > >>>> level
> > >>>>>>>>>> transaction
> > >>>>>>>>>>>>> (local transaction), while the other one is Namespace
> > >>>> level
> > >>>>>>>>>> transaction
> > >>>>>>>>>>>>> (global transaction).
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> The stream level transaction is a transactional
> > >>>> operation on
> > >>>>>>>> writing
> > >>>>>>>>>>>>> records to one stream; the namespace level transaction
> > >>>> is a
> > >>>>>>>>>>>> transactional
> > >>>>>>>>>>>>> operation on writing records to multiple streams.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> *Implementation Thoughts*
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> - A transaction is consist of begin control record, a
> > >>>> series
> > >>>>>> of
> > >>>>>>>> data
> > >>>>>>>>>>>>> records and commit/abort control record.
> > >>>>>>>>>>>>> - The begin/commit/abort control record is written to a
> > >>>>>> `commit`
> > >>>>>>>> log
> > >>>>>>>>>>>>> stream, while the data records will be written to
> > >> normal
> > >>>>> data
> > >>>>>>> log
> > >>>>>>>>>>>> streams.
> > >>>>>>>>>>>>> - The `commit` log stream will be the same log stream
> > >> for
> > >>>>>>>>> stream-level
> > >>>>>>>>>>>>> transaction,  while it will be a *system* stream (or
> > >>>>> multiple
> > >>>>>>>> system
> > >>>>>>>>>>>>> streams) for namespace-level transactions.
> > >>>>>>>>>>>>> - The transaction code looks like as below:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> <code>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Transaction txn = client.transaction();
> > >>>>>>>>>>>>> Future<DLSN> result1 = txn.write(stream-0, record);
> > >>>>>>>>>>>>> Future<DLSN> result2 = txn.write(stream-1, record);
> > >>>>>>>>>>>>> Future<DLSN> result3 = txn.write(stream-2, record);
> > >>>>>>>>>>>>> Future<Pair<DLSN, DLSN>> result = txn.commit();
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> </code>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> if the txn is committed, all the write futures will be
> > >>>>>> satisfied
> > >>>>>>>>> with
> > >>>>>>>>>>>> their
> > >>>>>>>>>>>>> written DLSNs. if the txn is aborted, all the write
> > >>>> futures
> > >>>>>> will
> > >>>>>>>> be
> > >>>>>>>>>>>> failed
> > >>>>>>>>>>>>> together. there is no partial failure state.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> - The actually data flow will be:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> 1. writer get a transaction id from the owner of the
> > >>>>> `commit'
> > >>>>>>> log
> > >>>>>>>>>> stream
> > >>>>>>>>>>>>> 1. write the begin control record (synchronously) with
> > >>>> the
> > >>>>>>>>> transaction
> > >>>>>>>>>>>> id
> > >>>>>>>>>>>>> 2. for each write within the same txn, it will be
> > >>>> assigned a
> > >>>>>>> local
> > >>>>>>>>>>>> sequence
> > >>>>>>>>>>>>> number starting from 0. the combination of transaction
> > >> id
> > >>>>> and
> > >>>>>>>> local
> > >>>>>>>>>>>>> sequence number will be used later on by the readers to
> > >>>>>>>> de-duplicate
> > >>>>>>>>>>>>> records.
> > >>>>>>>>>>>>> 3. the commit/abort control record will be written
> > >> based
> > >>>> on
> > >>>>>> the
> > >>>>>>>>>> results
> > >>>>>>>>>>>>> from 2.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> - Application can supply a timeout for the transaction
> > >>>> when
> > >>>>>>>>> #begin() a
> > >>>>>>>>>>>>> transaction. The owner of the `commit` log stream can
> > >>>> abort
> > >>>>>>>>>> transactions
> > >>>>>>>>>>>>> that never be committed/aborted within their timeout.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> - Failures:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> * all the log records can be simply retried as they
> > >> will
> > >>>> be
> > >>>>>>>>>>>> de-duplicated
> > >>>>>>>>>>>>> probably at the reader side.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> - Reader:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> * Reader can be configured to read uncommitted records
> > >> or
> > >>>>>>>> committed
> > >>>>>>>>>>>> records
> > >>>>>>>>>>>>> only (by default read uncommitted records)
> > >>>>>>>>>>>>> * If reader is configured to read committed records
> > >> only,
> > >>>>> the
> > >>>>>>> read
> > >>>>>>>>>> ahead
> > >>>>>>>>>>>>> cache will be changed to maintain one additional
> > >> pending
> > >>>>>>> committed
> > >>>>>>>>>>>> records.
> > >>>>>>>>>>>>> the pending committed records map is bounded and
> > >> records
> > >>>>> will
> > >>>>>> be
> > >>>>>>>>>> dropped
> > >>>>>>>>>>>>> when read ahead is moving.
> > >>>>>>>>>>>>> * when the reader hits a commit record, it will rewind
> > >> to
> > >>>>> the
> > >>>>>>>> begin
> > >>>>>>>>>>>> record
> > >>>>>>>>>>>>> and start reading from there. leveraging the proper
> > >> read
> > >>>>> ahead
> > >>>>>>>> cache
> > >>>>>>>>>> and
> > >>>>>>>>>>>>> pending commit records cache, it would be good for both
> > >>>>> short
> > >>>>>>>>>>>> transactions
> > >>>>>>>>>>>>> and long transactions.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> - DLSN, SequenceId:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> * We will add a fourth field to DLSN. It is `local
> > >>>> sequence
> > >>>>>>>> number`
> > >>>>>>>>>>>> within
> > >>>>>>>>>>>>> a transaction session. So the new DLSN of records in a
> > >>>>>>> transaction
> > >>>>>>>>>> will
> > >>>>>>>>>>>> be
> > >>>>>>>>>>>>> the DLSN of commit control record plus its local
> > >> sequence
> > >>>>>>> number.
> > >>>>>>>>>>>>> * The sequence id will be still the position of the
> > >>>> commit
> > >>>>>>> record
> > >>>>>>>>> plus
> > >>>>>>>>>>>> its
> > >>>>>>>>>>>>> local sequence number. The position will be advanced
> > >> with
> > >>>>>> total
> > >>>>>>>>> number
> > >>>>>>>>>>>> of
> > >>>>>>>>>>>>> written records on writing the commit control record.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> - Transaction Group & Namespace Transaction
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> using one single log stream for namespace transaction
> > >> can
> > >>>>>> cause
> > >>>>>>>> the
> > >>>>>>>>>>>>> bottleneck problem since all the begin/commit/end
> > >> control
> > >>>>>>> records
> > >>>>>>>>> will
> > >>>>>>>>>>>> have
> > >>>>>>>>>>>>> to go through one log stream.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> the idea of 'transaction group' is to allow
> > >> partitioning
> > >>>> the
> > >>>>>>>> writers
> > >>>>>>>>>>>> into
> > >>>>>>>>>>>>> different transaction groups.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> clients can specify the `group-name` when starting the
> > >>>>>>>> transaction.
> > >>>>>>>>> if
> > >>>>>>>>>>>>> there is no `group-name` specified, it will use the
> > >>>> default
> > >>>>>>>> `commit`
> > >>>>>>>>>>>> log in
> > >>>>>>>>>>>>> the namespace for creating transactions.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> -------------------------------------------------
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> I'd like to collect feedbacks on this idea. Appreciate
> > >>>> any
> > >>>>>>>> comments
> > >>>>>>>>>> and
> > >>>>>>>>>>>> if
> > >>>>>>>>>>>>> anyone is also interested in this idea, we'd like to
> > >>>>>> collaborate
> > >>>>>>>>> with
> > >>>>>>>>>>>> the
> > >>>>>>>>>>>>> community.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> - Xi
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>
> > >>>>
> > >>>
> > >>>
> > >>
> >
> >
>

Re: [Discuss] Transaction Support

Posted by Xi Liu <xi...@gmail.com>.

Asko and Sijie,

Thank you so much for your feedbacks.

We are not targeting at building a general XA transaction coordinator. The
feature we want is be able to write data to multiple log streams in an
atomic way.

I totally agreed with you about building minimal logic. We also don't want
to enforce this feature to all the users of DL. Building the TC as a
separated service sounds clear to me. We will do it follow your suggestion.

I am also replying the comments to you and Leigh on the doc. Hopefully we
can come to an agreement so that our changes can be accepted.

- Xi

On Wed, Jan 4, 2017 at 1:14 AM, Asko Kauppi <as...@zalando.fi> wrote:

> > Beside that, I have one general question - What is the major goal for
> this
> > feature? Are you targeting on building a general XA transaction
> coordinator
> > or just for supporting things like `copy-modify-write' style workflow?
>
> The use case I would have for transactions - at some level of the stack -
> is supporting dynamic configurations.
>
> If a config changes in e.g. three lines, some of the changes may logically
> belong together. E.g. changing both “host” and “port” (if separate
> entries). One shouldn’t be able to read a state, even temporarily, that has
> new host but old port.
>
> I can do this in the application level - it does not need to be part of
> the DL protocol.
>
>
> Asko Kauppi
> Zalando Tech Helsinki
>
> > On 4 Jan 2017, at 9.18, Sijie Guo <si...@apache.org> wrote:
> >
> > Sorry for late response. I think Leigh and you already had some very
> > valuable discussions in the doc. I will try to add some of my questions
> to
> > the discussion.
> >
> > Beside that, I had a discussion with Leigh today about this. first of
> all,
> > I think it is very good to add transaction support in distributedlog. It
> is
> > one of the primitives that would help building distributed service. But
> we
> > have a concern about making this system become complicated and introduce
> > operational overhead when it runs in the large scale system on
> production.
> > There are two major suggestions that I have for this feature -
> >
> > Build the 'minimum' logic in core - I think the minimum logic that need
> to
> > be added to the core is -  the special control records (begin, commit and
> > abort) and make the reader be able to detect those special control
> records
> > and know what do they mean and how to interrupt with them. Since they are
> > special control records, there is not overhead to other readers that
> > doesn't require this feature.
> >
> > Build the transaction coordinator as a separated proxy service  - I think
> > the major concern that we have is putting more complexities into the
> 'write
> > proxy' service. We architected distributedlog in a more microservice-like
> > way - we have the core as the stream store, the proxy for serving write
> and
> > read traffic. It would be good that the transaction feature can be done
> in
> > a similar way. So the architecture would be like this -
> >
> > *[ write service ] [ read service ] [ transaction coordinator ]*
> > *[ stream store
> >            ]*
> >
> > if people doesn't need the transaction feature, they can turn if off
> > completely without any operational overhead.
> >
> > Beside that, I have one general question - What is the major goal for
> this
> > feature? Are you targeting on building a general XA transaction
> coordinator
> > or just for supporting things like `copy-modify-write' style workflow?
> >
> >
> > Thanks,
> > Sijie
> >
> >
> >
> >
> >
> > On Wed, Dec 28, 2016 at 1:12 PM, Xi Liu <xi...@gmail.com> wrote:
> >
> >> Ping?
> >>
> >> On Mon, Dec 19, 2016 at 8:28 AM, Xi Liu <xi...@gmail.com> wrote:
> >>
> >>> Sijie,
> >>>
> >>> No. I thought it might be easier for people to comment on a google doc
> to
> >>> gather the initial feedback. I will put the content back to wiki page
> >> once
> >>> addressing the comments. Does that sound good to you?
> >>>
> >>> And thank you in advance.
> >>>
> >>> - Xi
> >>>
> >>>
> >>>
> >>> On Sun, Dec 18, 2016 at 8:48 AM, Sijie Guo <si...@apache.org> wrote:
> >>>
> >>>> Hi Xi,
> >>>>
> >>>> sorry for late response. I will review it soon.
> >>>>
> >>>> regarding this, a separate question "are we going to use google doc
> >>>> instead
> >>>> of email thread for any discussion"? I am a bit worried that the
> >>>> discussion
> >>>> will become lost after moving to google doc. No idea on how other
> apache
> >>>> projects are doing.
> >>>>
> >>>> - Sijie
> >>>>
> >>>> On Wed, Dec 14, 2016 at 11:41 PM, Xi Liu <xi...@gmail.com>
> wrote:
> >>>>
> >>>>> Hi all,
> >>>>>
> >>>>> I finalized the first version of the design. This time I used a
> google
> >>>> doc
> >>>>> so that it is easier for commenting and add a link the wiki page. I
> >> will
> >>>>> update this to the wiki page once we come to the finalized design.
> >>>>>
> >>>>> https://docs.google.com/document/d/14Ns05M8Z5a6DF6fHmWQwISyD5jjeK
> >>>>> bSIGgSzXuTI5BA/edit
> >>>>>
> >>>>> Let me know if you have any questions. Appreciate your reviews!
> >>>>>
> >>>>> - Xi
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Fri, Oct 28, 2016 at 7:58 AM, Leigh Stewart
> >>>>> <lstewart@twitter.com.invalid
> >>>>>> wrote:
> >>>>>
> >>>>>> Interesting proposal. A couple quick notes while you continue to
> >> flesh
> >>>>> this
> >>>>>> out.
> >>>>>>
> >>>>>> a. just to be sure - does this eliminate the need to save seqno with
> >>>>>> checkpoint?
> >>>>>>
> >>>>>> b. i.e. another way to describe this kind of improvement is "support
> >>>>>> records (atomic writes) larger than 1MB", iiuc. the advantage being
> >> it
> >>>>>> avoids the baggage of transactions. disadvantages include inability
> >>>> to do
> >>>>>> cross stream transactions, and flexibility (interleaving, etc) (are
> >>>> there
> >>>>>> others?).
> >>>>>>
> >>>>>> c. proxy use case is for supporting multiple writers - have you
> >>>> thought
> >>>>>> about how this would work with multiple writers?
> >>>>>>
> >>>>>> Thanks!
> >>>>>>
> >>>>>>
> >>>>>> On Tue, Oct 18, 2016 at 6:45 PM, Sijie Guo
> >> <sijieg@twitter.com.invalid
> >>>>>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Sound good to me. look forward to the detailed proposal.
> >>>>>>>
> >>>>>>> (I don't mind the format if it makes things easier to you)
> >>>>>>>
> >>>>>>> Sijie
> >>>>>>>
> >>>>>>> On Friday, October 14, 2016, Xi Liu <xi...@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> Thank you, Sijie
> >>>>>>>>
> >>>>>>>> We have some internal discussions to sort out some details. We
> >> are
> >>>>>> ready
> >>>>>>> to
> >>>>>>>> collaborate with the community for adding the transaction
> >> support
> >>>> in
> >>>>>> DL.
> >>>>>>>> We'd like to share more.
> >>>>>>>>
> >>>>>>>> I created a proposal wiki here -
> >>>>>>>> https://cwiki.apache.org/confluence/display/DL/DP-1+-+
> >>>>>>>> DistributedLog+Transaction+Support
> >>>>>>>>
> >>>>>>>> (I followed KIP format and named it as DP (DistributedLog
> >>>> Proposal -
> >>>>> DP
> >>>>>>> is
> >>>>>>>> also short for Dynamic Programming). I don't know if you guys
> >> like
> >>>>> this
> >>>>>>>> name or not. Feel free to change it :D)
> >>>>>>>>
> >>>>>>>> I basically put my initial email as the content there so far.
> >>>> Once we
> >>>>>>>> finished our final discussion, I will update with more details.
> >> At
> >>>>> the
> >>>>>>> same
> >>>>>>>> time, any comments are welcome.
> >>>>>>>>
> >>>>>>>> - Xi
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Sat, Oct 8, 2016 at 6:58 AM, Sijie Guo <sijie@apache.org
> >>>>>>> <javascript:;>>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Xi,
> >>>>>>>>>
> >>>>>>>>> I just granted you the edit permission.
> >>>>>>>>>
> >>>>>>>>> - Sijie
> >>>>>>>>>
> >>>>>>>>> On Fri, Oct 7, 2016 at 10:34 AM, Xi Liu <xi.liu.ant@gmail.com
> >>>>>>>> <javascript:;>> wrote:
> >>>>>>>>>
> >>>>>>>>>> I still can not edit the wiki. Can any of the pmc members
> >>>> grant
> >>>>> me
> >>>>>>> the
> >>>>>>>>>> permissions?
> >>>>>>>>>>
> >>>>>>>>>> - Xi
> >>>>>>>>>>
> >>>>>>>>>> On Sat, Sep 17, 2016 at 10:35 PM, Xi Liu <
> >>>> xi.liu.ant@gmail.com
> >>>>>>>> <javascript:;>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Sijie,
> >>>>>>>>>>>
> >>>>>>>>>>> I attempted to create a wiki page under that space. I
> >> found
> >>>>> that
> >>>>>> I
> >>>>>>> am
> >>>>>>>>> not
> >>>>>>>>>>> authorized with edit permission.
> >>>>>>>>>>>
> >>>>>>>>>>> Can any of the committers grant me the wiki edit
> >>>> permission? My
> >>>>>>>> account
> >>>>>>>>>> is
> >>>>>>>>>>> "xi.liu.ant".
> >>>>>>>>>>>
> >>>>>>>>>>> - Xi
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Tue, Sep 13, 2016 at 9:26 AM, Sijie Guo <
> >>>> sijie@apache.org
> >>>>>>>> <javascript:;>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> This sounds interesting ... I will take a closer look and
> >>>> give
> >>>>>> my
> >>>>>>>>>> comments
> >>>>>>>>>>>> later.
> >>>>>>>>>>>>
> >>>>>>>>>>>> At the same time, do you mind creating a wiki page to put
> >>>> your
> >>>>>>> idea
> >>>>>>>>>> there?
> >>>>>>>>>>>> You can add your wiki page under
> >>>>>>>>>>>> https://cwiki.apache.org/confluence/display/DL/Project+
> >>>>>> Proposals
> >>>>>>>>>>>>
> >>>>>>>>>>>> You might need to ask in the dev list to grant the wiki
> >>>> edit
> >>>>>>>>> permissions
> >>>>>>>>>>>> to
> >>>>>>>>>>>> you once you have a wiki account.
> >>>>>>>>>>>>
> >>>>>>>>>>>> - Sijie
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Mon, Sep 12, 2016 at 2:20 AM, Xi Liu <
> >>>> xi.liu.ant@gmail.com
> >>>>>>>> <javascript:;>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Hello,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I asked the transaction support in distributedlog user
> >>>> group
> >>>>>> two
> >>>>>>>>>> months
> >>>>>>>>>>>>> ago. I want to raise this up again, as we are looking
> >> for
> >>>>>> using
> >>>>>>>>>>>>> distributedlog for building a transactional data
> >>>> service. It
> >>>>>> is
> >>>>>>> a
> >>>>>>>>>> major
> >>>>>>>>>>>>> feature that is missing in distributedlog. We have some
> >>>>> ideas
> >>>>>> to
> >>>>>>>> add
> >>>>>>>>>>>> this
> >>>>>>>>>>>>> to distributedlog and want to know if they make sense
> >> or
> >>>>> not.
> >>>>>> If
> >>>>>>>>> they
> >>>>>>>>>>>> are
> >>>>>>>>>>>>> good, we'd like to contribute and develop with the
> >>>>> community.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Here are the thoughts:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> -------------------------------------------------
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> From our understanding, DL can provide "at-least-once"
> >>>>>> delivery
> >>>>>>>>>> semantic
> >>>>>>>>>>>>> (if not, please correct me) but not "exactly-once"
> >>>> delivery
> >>>>>>>>> semantic.
> >>>>>>>>>>>> That
> >>>>>>>>>>>>> means that a message can be delivered one or more times
> >>>> if
> >>>>> the
> >>>>>>>>> reader
> >>>>>>>>>>>>> doesn't handle duplicates.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The duplicates come from two places, one is at writer
> >>>> side
> >>>>>> (this
> >>>>>>>>>> assumes
> >>>>>>>>>>>>> using write proxy not the core library), while the
> >> other
> >>>> one
> >>>>>> is
> >>>>>>> at
> >>>>>>>>>>>> reader
> >>>>>>>>>>>>> side.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> - writer side: if the client attempts to write a record
> >>>> to
> >>>>> the
> >>>>>>>> write
> >>>>>>>>>>>>> proxies and gets a network error (e.g timeouts) then
> >>>>> retries,
> >>>>>>> the
> >>>>>>>>>>>> retrying
> >>>>>>>>>>>>> will potentially result in duplicates.
> >>>>>>>>>>>>> - reader side:if the reader reads a message from a
> >> stream
> >>>>> and
> >>>>>>> then
> >>>>>>>>>>>> crashes,
> >>>>>>>>>>>>> when the reader restarts it would restart from last
> >> known
> >>>>>>> position
> >>>>>>>>>>>> (DLSN).
> >>>>>>>>>>>>> If the reader fails after processing a record and
> >> before
> >>>>>>> recording
> >>>>>>>>> the
> >>>>>>>>>>>>> position, the processed record will be delivered again.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The reader problem can be properly addressed by making
> >>>> use
> >>>>> of
> >>>>>>> the
> >>>>>>>>>>>> sequence
> >>>>>>>>>>>>> numbers of records and doing proper checkpointing. For
> >>>>>> example,
> >>>>>>> in
> >>>>>>>>>>>>> database, it can checkpoint the indexed data with the
> >>>>> sequence
> >>>>>>>>> number
> >>>>>>>>>> of
> >>>>>>>>>>>>> records; in flink, it can checkpoint the state with the
> >>>>>> sequence
> >>>>>>>>>>>> numbers.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The writer problem can be addressed by implementing an
> >>>>>>> idempotent
> >>>>>>>>>>>> writer.
> >>>>>>>>>>>>> However, an alternative and more powerful approach is
> >> to
> >>>>>> support
> >>>>>>>>>>>>> transactions.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> *What does transaction mean?*
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> A transaction means a collection of records can be
> >>>> written
> >>>>>>>>>>>> transactionally
> >>>>>>>>>>>>> within a stream or across multiple streams. They will
> >> be
> >>>>>>> consumed
> >>>>>>>> by
> >>>>>>>>>> the
> >>>>>>>>>>>>> reader together when a transaction is committed, or
> >> will
> >>>>> never
> >>>>>>> be
> >>>>>>>>>>>> consumed
> >>>>>>>>>>>>> by the reader when the transaction is aborted.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The transaction will expose following guarantees:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> - The reader should not be exposed to records written
> >>>> from
> >>>>>>>>> uncommitted
> >>>>>>>>>>>>> transactions (mandatory)
> >>>>>>>>>>>>> - The reader should consume the records in the
> >>>> transaction
> >>>>>>> commit
> >>>>>>>>>> order
> >>>>>>>>>>>>> rather than the record written order (mandatory)
> >>>>>>>>>>>>> - No duplicated records within a transaction
> >> (mandatory)
> >>>>>>>>>>>>> - Allow interleaving transactional writes and
> >>>>>> non-transactional
> >>>>>>>>> writes
> >>>>>>>>>>>>> (optional)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> *Stream Transaction & Namespace Transaction*
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> There will be two types of transaction, one is Stream
> >>>> level
> >>>>>>>>>> transaction
> >>>>>>>>>>>>> (local transaction), while the other one is Namespace
> >>>> level
> >>>>>>>>>> transaction
> >>>>>>>>>>>>> (global transaction).
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The stream level transaction is a transactional
> >>>> operation on
> >>>>>>>> writing
> >>>>>>>>>>>>> records to one stream; the namespace level transaction
> >>>> is a
> >>>>>>>>>>>> transactional
> >>>>>>>>>>>>> operation on writing records to multiple streams.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> *Implementation Thoughts*
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> - A transaction is consist of begin control record, a
> >>>> series
> >>>>>> of
> >>>>>>>> data
> >>>>>>>>>>>>> records and commit/abort control record.
> >>>>>>>>>>>>> - The begin/commit/abort control record is written to a
> >>>>>> `commit`
> >>>>>>>> log
> >>>>>>>>>>>>> stream, while the data records will be written to
> >> normal
> >>>>> data
> >>>>>>> log
> >>>>>>>>>>>> streams.
> >>>>>>>>>>>>> - The `commit` log stream will be the same log stream
> >> for
> >>>>>>>>> stream-level
> >>>>>>>>>>>>> transaction,  while it will be a *system* stream (or
> >>>>> multiple
> >>>>>>>> system
> >>>>>>>>>>>>> streams) for namespace-level transactions.
> >>>>>>>>>>>>> - The transaction code looks like as below:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> <code>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Transaction txn = client.transaction();
> >>>>>>>>>>>>> Future<DLSN> result1 = txn.write(stream-0, record);
> >>>>>>>>>>>>> Future<DLSN> result2 = txn.write(stream-1, record);
> >>>>>>>>>>>>> Future<DLSN> result3 = txn.write(stream-2, record);
> >>>>>>>>>>>>> Future<Pair<DLSN, DLSN>> result = txn.commit();
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> </code>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> if the txn is committed, all the write futures will be
> >>>>>> satisfied
> >>>>>>>>> with
> >>>>>>>>>>>> their
> >>>>>>>>>>>>> written DLSNs. if the txn is aborted, all the write
> >>>> futures
> >>>>>> will
> >>>>>>>> be
> >>>>>>>>>>>> failed
> >>>>>>>>>>>>> together. there is no partial failure state.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> - The actually data flow will be:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 1. writer get a transaction id from the owner of the
> >>>>> `commit'
> >>>>>>> log
> >>>>>>>>>> stream
> >>>>>>>>>>>>> 1. write the begin control record (synchronously) with
> >>>> the
> >>>>>>>>> transaction
> >>>>>>>>>>>> id
> >>>>>>>>>>>>> 2. for each write within the same txn, it will be
> >>>> assigned a
> >>>>>>> local
> >>>>>>>>>>>> sequence
> >>>>>>>>>>>>> number starting from 0. the combination of transaction
> >> id
> >>>>> and
> >>>>>>>> local
> >>>>>>>>>>>>> sequence number will be used later on by the readers to
> >>>>>>>> de-duplicate
> >>>>>>>>>>>>> records.
> >>>>>>>>>>>>> 3. the commit/abort control record will be written
> >> based
> >>>> on
> >>>>>> the
> >>>>>>>>>> results
> >>>>>>>>>>>>> from 2.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> - Application can supply a timeout for the transaction
> >>>> when
> >>>>>>>>> #begin() a
> >>>>>>>>>>>>> transaction. The owner of the `commit` log stream can
> >>>> abort
> >>>>>>>>>> transactions
> >>>>>>>>>>>>> that never be committed/aborted within their timeout.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> - Failures:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> * all the log records can be simply retried as they
> >> will
> >>>> be
> >>>>>>>>>>>> de-duplicated
> >>>>>>>>>>>>> probably at the reader side.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> - Reader:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> * Reader can be configured to read uncommitted records
> >> or
> >>>>>>>> committed
> >>>>>>>>>>>> records
> >>>>>>>>>>>>> only (by default read uncommitted records)
> >>>>>>>>>>>>> * If reader is configured to read committed records
> >> only,
> >>>>> the
> >>>>>>> read
> >>>>>>>>>> ahead
> >>>>>>>>>>>>> cache will be changed to maintain one additional
> >> pending
> >>>>>>> committed
> >>>>>>>>>>>> records.
> >>>>>>>>>>>>> the pending committed records map is bounded and
> >> records
> >>>>> will
> >>>>>> be
> >>>>>>>>>> dropped
> >>>>>>>>>>>>> when read ahead is moving.
> >>>>>>>>>>>>> * when the reader hits a commit record, it will rewind
> >> to
> >>>>> the
> >>>>>>>> begin
> >>>>>>>>>>>> record
> >>>>>>>>>>>>> and start reading from there. leveraging the proper
> >> read
> >>>>> ahead
> >>>>>>>> cache
> >>>>>>>>>> and
> >>>>>>>>>>>>> pending commit records cache, it would be good for both
> >>>>> short
> >>>>>>>>>>>> transactions
> >>>>>>>>>>>>> and long transactions.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> - DLSN, SequenceId:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> * We will add a fourth field to DLSN. It is `local
> >>>> sequence
> >>>>>>>> number`
> >>>>>>>>>>>> within
> >>>>>>>>>>>>> a transaction session. So the new DLSN of records in a
> >>>>>>> transaction
> >>>>>>>>>> will
> >>>>>>>>>>>> be
> >>>>>>>>>>>>> the DLSN of commit control record plus its local
> >> sequence
> >>>>>>> number.
> >>>>>>>>>>>>> * The sequence id will be still the position of the
> >>>> commit
> >>>>>>> record
> >>>>>>>>> plus
> >>>>>>>>>>>> its
> >>>>>>>>>>>>> local sequence number. The position will be advanced
> >> with
> >>>>>> total
> >>>>>>>>> number
> >>>>>>>>>>>> of
> >>>>>>>>>>>>> written records on writing the commit control record.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> - Transaction Group & Namespace Transaction
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> using one single log stream for namespace transaction
> >> can
> >>>>>> cause
> >>>>>>>> the
> >>>>>>>>>>>>> bottleneck problem since all the begin/commit/end
> >> control
> >>>>>>> records
> >>>>>>>>> will
> >>>>>>>>>>>> have
> >>>>>>>>>>>>> to go through one log stream.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> the idea of 'transaction group' is to allow
> >> partitioning
> >>>> the
> >>>>>>>> writers
> >>>>>>>>>>>> into
> >>>>>>>>>>>>> different transaction groups.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> clients can specify the `group-name` when starting the
> >>>>>>>> transaction.
> >>>>>>>>> if
> >>>>>>>>>>>>> there is no `group-name` specified, it will use the
> >>>> default
> >>>>>>>> `commit`
> >>>>>>>>>>>> log in
> >>>>>>>>>>>>> the namespace for creating transactions.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> -------------------------------------------------
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I'd like to collect feedbacks on this idea. Appreciate
> >>>> any
> >>>>>>>> comments
> >>>>>>>>>> and
> >>>>>>>>>>>> if
> >>>>>>>>>>>>> anyone is also interested in this idea, we'd like to
> >>>>>> collaborate
> >>>>>>>>> with
> >>>>>>>>>>>> the
> >>>>>>>>>>>>> community.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> - Xi
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>>
> >>
>
>

Re: [Discuss] Transaction Support

Posted by Sijie Guo <si...@apache.org>.

On Wed, Jan 4, 2017 at 1:14 AM, Asko Kauppi <as...@zalando.fi> wrote:

> > Beside that, I have one general question - What is the major goal for
> this
> > feature? Are you targeting on building a general XA transaction
> coordinator
> > or just for supporting things like `copy-modify-write' style workflow?
>
> The use case I would have for transactions - at some level of the stack -
> is supporting dynamic configurations.
>
> If a config changes in e.g. three lines, some of the changes may logically
> belong together. E.g. changing both “host” and “port” (if separate
> entries). One shouldn’t be able to read a state, even temporarily, that has
> new host but old port.
>
> I can do this in the application level - it does not need to be part of
> the DL protocol.
>

Yeah, I can see 'transaction' as a large atomic write in DL is very useful.
Currently DL limits the record size to 1MB. If people wants to write a
record larger than 1MB, it will potentially produce a `partial` write if
application breaks their record into multiple records.

Your use case falls into this category.

I think the minimal support is large atomic write. That is good enough for
most of the log use cases.

Having a separated TC (transaction coordinator) is cool. but it can be an
opt-in solution.



>
>
> Asko Kauppi
> Zalando Tech Helsinki
>
> > On 4 Jan 2017, at 9.18, Sijie Guo <si...@apache.org> wrote:
> >
> > Sorry for late response. I think Leigh and you already had some very
> > valuable discussions in the doc. I will try to add some of my questions
> to
> > the discussion.
> >
> > Beside that, I had a discussion with Leigh today about this. first of
> all,
> > I think it is very good to add transaction support in distributedlog. It
> is
> > one of the primitives that would help building distributed service. But
> we
> > have a concern about making this system become complicated and introduce
> > operational overhead when it runs in the large scale system on
> production.
> > There are two major suggestions that I have for this feature -
> >
> > Build the 'minimum' logic in core - I think the minimum logic that need
> to
> > be added to the core is -  the special control records (begin, commit and
> > abort) and make the reader be able to detect those special control
> records
> > and know what do they mean and how to interrupt with them. Since they are
> > special control records, there is not overhead to other readers that
> > doesn't require this feature.
> >
> > Build the transaction coordinator as a separated proxy service  - I think
> > the major concern that we have is putting more complexities into the
> 'write
> > proxy' service. We architected distributedlog in a more microservice-like
> > way - we have the core as the stream store, the proxy for serving write
> and
> > read traffic. It would be good that the transaction feature can be done
> in
> > a similar way. So the architecture would be like this -
> >
> > *[ write service ] [ read service ] [ transaction coordinator ]*
> > *[ stream store
> >            ]*
> >
> > if people doesn't need the transaction feature, they can turn if off
> > completely without any operational overhead.
> >
> > Beside that, I have one general question - What is the major goal for
> this
> > feature? Are you targeting on building a general XA transaction
> coordinator
> > or just for supporting things like `copy-modify-write' style workflow?
> >
> >
> > Thanks,
> > Sijie
> >
> >
> >
> >
> >
> > On Wed, Dec 28, 2016 at 1:12 PM, Xi Liu <xi...@gmail.com> wrote:
> >
> >> Ping?
> >>
> >> On Mon, Dec 19, 2016 at 8:28 AM, Xi Liu <xi...@gmail.com> wrote:
> >>
> >>> Sijie,
> >>>
> >>> No. I thought it might be easier for people to comment on a google doc
> to
> >>> gather the initial feedback. I will put the content back to wiki page
> >> once
> >>> addressing the comments. Does that sound good to you?
> >>>
> >>> And thank you in advance.
> >>>
> >>> - Xi
> >>>
> >>>
> >>>
> >>> On Sun, Dec 18, 2016 at 8:48 AM, Sijie Guo <si...@apache.org> wrote:
> >>>
> >>>> Hi Xi,
> >>>>
> >>>> sorry for late response. I will review it soon.
> >>>>
> >>>> regarding this, a separate question "are we going to use google doc
> >>>> instead
> >>>> of email thread for any discussion"? I am a bit worried that the
> >>>> discussion
> >>>> will become lost after moving to google doc. No idea on how other
> apache
> >>>> projects are doing.
> >>>>
> >>>> - Sijie
> >>>>
> >>>> On Wed, Dec 14, 2016 at 11:41 PM, Xi Liu <xi...@gmail.com>
> wrote:
> >>>>
> >>>>> Hi all,
> >>>>>
> >>>>> I finalized the first version of the design. This time I used a
> google
> >>>> doc
> >>>>> so that it is easier for commenting and add a link the wiki page. I
> >> will
> >>>>> update this to the wiki page once we come to the finalized design.
> >>>>>
> >>>>> https://docs.google.com/document/d/14Ns05M8Z5a6DF6fHmWQwISyD5jjeK
> >>>>> bSIGgSzXuTI5BA/edit
> >>>>>
> >>>>> Let me know if you have any questions. Appreciate your reviews!
> >>>>>
> >>>>> - Xi
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Fri, Oct 28, 2016 at 7:58 AM, Leigh Stewart
> >>>>> <lstewart@twitter.com.invalid
> >>>>>> wrote:
> >>>>>
> >>>>>> Interesting proposal. A couple quick notes while you continue to
> >> flesh
> >>>>> this
> >>>>>> out.
> >>>>>>
> >>>>>> a. just to be sure - does this eliminate the need to save seqno with
> >>>>>> checkpoint?
> >>>>>>
> >>>>>> b. i.e. another way to describe this kind of improvement is "support
> >>>>>> records (atomic writes) larger than 1MB", iiuc. the advantage being
> >> it
> >>>>>> avoids the baggage of transactions. disadvantages include inability
> >>>> to do
> >>>>>> cross stream transactions, and flexibility (interleaving, etc) (are
> >>>> there
> >>>>>> others?).
> >>>>>>
> >>>>>> c. proxy use case is for supporting multiple writers - have you
> >>>> thought
> >>>>>> about how this would work with multiple writers?
> >>>>>>
> >>>>>> Thanks!
> >>>>>>
> >>>>>>
> >>>>>> On Tue, Oct 18, 2016 at 6:45 PM, Sijie Guo
> >> <sijieg@twitter.com.invalid
> >>>>>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Sound good to me. look forward to the detailed proposal.
> >>>>>>>
> >>>>>>> (I don't mind the format if it makes things easier to you)
> >>>>>>>
> >>>>>>> Sijie
> >>>>>>>
> >>>>>>> On Friday, October 14, 2016, Xi Liu <xi...@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> Thank you, Sijie
> >>>>>>>>
> >>>>>>>> We have some internal discussions to sort out some details. We
> >> are
> >>>>>> ready
> >>>>>>> to
> >>>>>>>> collaborate with the community for adding the transaction
> >> support
> >>>> in
> >>>>>> DL.
> >>>>>>>> We'd like to share more.
> >>>>>>>>
> >>>>>>>> I created a proposal wiki here -
> >>>>>>>> https://cwiki.apache.org/confluence/display/DL/DP-1+-+
> >>>>>>>> DistributedLog+Transaction+Support
> >>>>>>>>
> >>>>>>>> (I followed KIP format and named it as DP (DistributedLog
> >>>> Proposal -
> >>>>> DP
> >>>>>>> is
> >>>>>>>> also short for Dynamic Programming). I don't know if you guys
> >> like
> >>>>> this
> >>>>>>>> name or not. Feel free to change it :D)
> >>>>>>>>
> >>>>>>>> I basically put my initial email as the content there so far.
> >>>> Once we
> >>>>>>>> finished our final discussion, I will update with more details.
> >> At
> >>>>> the
> >>>>>>> same
> >>>>>>>> time, any comments are welcome.
> >>>>>>>>
> >>>>>>>> - Xi
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Sat, Oct 8, 2016 at 6:58 AM, Sijie Guo <sijie@apache.org
> >>>>>>> <javascript:;>>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Xi,
> >>>>>>>>>
> >>>>>>>>> I just granted you the edit permission.
> >>>>>>>>>
> >>>>>>>>> - Sijie
> >>>>>>>>>
> >>>>>>>>> On Fri, Oct 7, 2016 at 10:34 AM, Xi Liu <xi.liu.ant@gmail.com
> >>>>>>>> <javascript:;>> wrote:
> >>>>>>>>>
> >>>>>>>>>> I still can not edit the wiki. Can any of the pmc members
> >>>> grant
> >>>>> me
> >>>>>>> the
> >>>>>>>>>> permissions?
> >>>>>>>>>>
> >>>>>>>>>> - Xi
> >>>>>>>>>>
> >>>>>>>>>> On Sat, Sep 17, 2016 at 10:35 PM, Xi Liu <
> >>>> xi.liu.ant@gmail.com
> >>>>>>>> <javascript:;>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Sijie,
> >>>>>>>>>>>
> >>>>>>>>>>> I attempted to create a wiki page under that space. I
> >> found
> >>>>> that
> >>>>>> I
> >>>>>>> am
> >>>>>>>>> not
> >>>>>>>>>>> authorized with edit permission.
> >>>>>>>>>>>
> >>>>>>>>>>> Can any of the committers grant me the wiki edit
> >>>> permission? My
> >>>>>>>> account
> >>>>>>>>>> is
> >>>>>>>>>>> "xi.liu.ant".
> >>>>>>>>>>>
> >>>>>>>>>>> - Xi
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Tue, Sep 13, 2016 at 9:26 AM, Sijie Guo <
> >>>> sijie@apache.org
> >>>>>>>> <javascript:;>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> This sounds interesting ... I will take a closer look and
> >>>> give
> >>>>>> my
> >>>>>>>>>> comments
> >>>>>>>>>>>> later.
> >>>>>>>>>>>>
> >>>>>>>>>>>> At the same time, do you mind creating a wiki page to put
> >>>> your
> >>>>>>> idea
> >>>>>>>>>> there?
> >>>>>>>>>>>> You can add your wiki page under
> >>>>>>>>>>>> https://cwiki.apache.org/confluence/display/DL/Project+
> >>>>>> Proposals
> >>>>>>>>>>>>
> >>>>>>>>>>>> You might need to ask in the dev list to grant the wiki
> >>>> edit
> >>>>>>>>> permissions
> >>>>>>>>>>>> to
> >>>>>>>>>>>> you once you have a wiki account.
> >>>>>>>>>>>>
> >>>>>>>>>>>> - Sijie
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Mon, Sep 12, 2016 at 2:20 AM, Xi Liu <
> >>>> xi.liu.ant@gmail.com
> >>>>>>>> <javascript:;>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Hello,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I asked the transaction support in distributedlog user
> >>>> group
> >>>>>> two
> >>>>>>>>>> months
> >>>>>>>>>>>>> ago. I want to raise this up again, as we are looking
> >> for
> >>>>>> using
> >>>>>>>>>>>>> distributedlog for building a transactional data
> >>>> service. It
> >>>>>> is
> >>>>>>> a
> >>>>>>>>>> major
> >>>>>>>>>>>>> feature that is missing in distributedlog. We have some
> >>>>> ideas
> >>>>>> to
> >>>>>>>> add
> >>>>>>>>>>>> this
> >>>>>>>>>>>>> to distributedlog and want to know if they make sense
> >> or
> >>>>> not.
> >>>>>> If
> >>>>>>>>> they
> >>>>>>>>>>>> are
> >>>>>>>>>>>>> good, we'd like to contribute and develop with the
> >>>>> community.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Here are the thoughts:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> -------------------------------------------------
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> From our understanding, DL can provide "at-least-once"
> >>>>>> delivery
> >>>>>>>>>> semantic
> >>>>>>>>>>>>> (if not, please correct me) but not "exactly-once"
> >>>> delivery
> >>>>>>>>> semantic.
> >>>>>>>>>>>> That
> >>>>>>>>>>>>> means that a message can be delivered one or more times
> >>>> if
> >>>>> the
> >>>>>>>>> reader
> >>>>>>>>>>>>> doesn't handle duplicates.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The duplicates come from two places, one is at writer
> >>>> side
> >>>>>> (this
> >>>>>>>>>> assumes
> >>>>>>>>>>>>> using write proxy not the core library), while the
> >> other
> >>>> one
> >>>>>> is
> >>>>>>> at
> >>>>>>>>>>>> reader
> >>>>>>>>>>>>> side.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> - writer side: if the client attempts to write a record
> >>>> to
> >>>>> the
> >>>>>>>> write
> >>>>>>>>>>>>> proxies and gets a network error (e.g timeouts) then
> >>>>> retries,
> >>>>>>> the
> >>>>>>>>>>>> retrying
> >>>>>>>>>>>>> will potentially result in duplicates.
> >>>>>>>>>>>>> - reader side:if the reader reads a message from a
> >> stream
> >>>>> and
> >>>>>>> then
> >>>>>>>>>>>> crashes,
> >>>>>>>>>>>>> when the reader restarts it would restart from last
> >> known
> >>>>>>> position
> >>>>>>>>>>>> (DLSN).
> >>>>>>>>>>>>> If the reader fails after processing a record and
> >> before
> >>>>>>> recording
> >>>>>>>>> the
> >>>>>>>>>>>>> position, the processed record will be delivered again.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The reader problem can be properly addressed by making
> >>>> use
> >>>>> of
> >>>>>>> the
> >>>>>>>>>>>> sequence
> >>>>>>>>>>>>> numbers of records and doing proper checkpointing. For
> >>>>>> example,
> >>>>>>> in
> >>>>>>>>>>>>> database, it can checkpoint the indexed data with the
> >>>>> sequence
> >>>>>>>>> number
> >>>>>>>>>> of
> >>>>>>>>>>>>> records; in flink, it can checkpoint the state with the
> >>>>>> sequence
> >>>>>>>>>>>> numbers.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The writer problem can be addressed by implementing an
> >>>>>>> idempotent
> >>>>>>>>>>>> writer.
> >>>>>>>>>>>>> However, an alternative and more powerful approach is
> >> to
> >>>>>> support
> >>>>>>>>>>>>> transactions.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> *What does transaction mean?*
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> A transaction means a collection of records can be
> >>>> written
> >>>>>>>>>>>> transactionally
> >>>>>>>>>>>>> within a stream or across multiple streams. They will
> >> be
> >>>>>>> consumed
> >>>>>>>> by
> >>>>>>>>>> the
> >>>>>>>>>>>>> reader together when a transaction is committed, or
> >> will
> >>>>> never
> >>>>>>> be
> >>>>>>>>>>>> consumed
> >>>>>>>>>>>>> by the reader when the transaction is aborted.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The transaction will expose following guarantees:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> - The reader should not be exposed to records written
> >>>> from
> >>>>>>>>> uncommitted
> >>>>>>>>>>>>> transactions (mandatory)
> >>>>>>>>>>>>> - The reader should consume the records in the
> >>>> transaction
> >>>>>>> commit
> >>>>>>>>>> order
> >>>>>>>>>>>>> rather than the record written order (mandatory)
> >>>>>>>>>>>>> - No duplicated records within a transaction
> >> (mandatory)
> >>>>>>>>>>>>> - Allow interleaving transactional writes and
> >>>>>> non-transactional
> >>>>>>>>> writes
> >>>>>>>>>>>>> (optional)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> *Stream Transaction & Namespace Transaction*
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> There will be two types of transaction, one is Stream
> >>>> level
> >>>>>>>>>> transaction
> >>>>>>>>>>>>> (local transaction), while the other one is Namespace
> >>>> level
> >>>>>>>>>> transaction
> >>>>>>>>>>>>> (global transaction).
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The stream level transaction is a transactional
> >>>> operation on
> >>>>>>>> writing
> >>>>>>>>>>>>> records to one stream; the namespace level transaction
> >>>> is a
> >>>>>>>>>>>> transactional
> >>>>>>>>>>>>> operation on writing records to multiple streams.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> *Implementation Thoughts*
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> - A transaction is consist of begin control record, a
> >>>> series
> >>>>>> of
> >>>>>>>> data
> >>>>>>>>>>>>> records and commit/abort control record.
> >>>>>>>>>>>>> - The begin/commit/abort control record is written to a
> >>>>>> `commit`
> >>>>>>>> log
> >>>>>>>>>>>>> stream, while the data records will be written to
> >> normal
> >>>>> data
> >>>>>>> log
> >>>>>>>>>>>> streams.
> >>>>>>>>>>>>> - The `commit` log stream will be the same log stream
> >> for
> >>>>>>>>> stream-level
> >>>>>>>>>>>>> transaction,  while it will be a *system* stream (or
> >>>>> multiple
> >>>>>>>> system
> >>>>>>>>>>>>> streams) for namespace-level transactions.
> >>>>>>>>>>>>> - The transaction code looks like as below:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> <code>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Transaction txn = client.transaction();
> >>>>>>>>>>>>> Future<DLSN> result1 = txn.write(stream-0, record);
> >>>>>>>>>>>>> Future<DLSN> result2 = txn.write(stream-1, record);
> >>>>>>>>>>>>> Future<DLSN> result3 = txn.write(stream-2, record);
> >>>>>>>>>>>>> Future<Pair<DLSN, DLSN>> result = txn.commit();
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> </code>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> if the txn is committed, all the write futures will be
> >>>>>> satisfied
> >>>>>>>>> with
> >>>>>>>>>>>> their
> >>>>>>>>>>>>> written DLSNs. if the txn is aborted, all the write
> >>>> futures
> >>>>>> will
> >>>>>>>> be
> >>>>>>>>>>>> failed
> >>>>>>>>>>>>> together. there is no partial failure state.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> - The actually data flow will be:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 1. writer get a transaction id from the owner of the
> >>>>> `commit'
> >>>>>>> log
> >>>>>>>>>> stream
> >>>>>>>>>>>>> 1. write the begin control record (synchronously) with
> >>>> the
> >>>>>>>>> transaction
> >>>>>>>>>>>> id
> >>>>>>>>>>>>> 2. for each write within the same txn, it will be
> >>>> assigned a
> >>>>>>> local
> >>>>>>>>>>>> sequence
> >>>>>>>>>>>>> number starting from 0. the combination of transaction
> >> id
> >>>>> and
> >>>>>>>> local
> >>>>>>>>>>>>> sequence number will be used later on by the readers to
> >>>>>>>> de-duplicate
> >>>>>>>>>>>>> records.
> >>>>>>>>>>>>> 3. the commit/abort control record will be written
> >> based
> >>>> on
> >>>>>> the
> >>>>>>>>>> results
> >>>>>>>>>>>>> from 2.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> - Application can supply a timeout for the transaction
> >>>> when
> >>>>>>>>> #begin() a
> >>>>>>>>>>>>> transaction. The owner of the `commit` log stream can
> >>>> abort
> >>>>>>>>>> transactions
> >>>>>>>>>>>>> that never be committed/aborted within their timeout.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> - Failures:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> * all the log records can be simply retried as they
> >> will
> >>>> be
> >>>>>>>>>>>> de-duplicated
> >>>>>>>>>>>>> probably at the reader side.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> - Reader:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> * Reader can be configured to read uncommitted records
> >> or
> >>>>>>>> committed
> >>>>>>>>>>>> records
> >>>>>>>>>>>>> only (by default read uncommitted records)
> >>>>>>>>>>>>> * If reader is configured to read committed records
> >> only,
> >>>>> the
> >>>>>>> read
> >>>>>>>>>> ahead
> >>>>>>>>>>>>> cache will be changed to maintain one additional
> >> pending
> >>>>>>> committed
> >>>>>>>>>>>> records.
> >>>>>>>>>>>>> the pending committed records map is bounded and
> >> records
> >>>>> will
> >>>>>> be
> >>>>>>>>>> dropped
> >>>>>>>>>>>>> when read ahead is moving.
> >>>>>>>>>>>>> * when the reader hits a commit record, it will rewind
> >> to
> >>>>> the
> >>>>>>>> begin
> >>>>>>>>>>>> record
> >>>>>>>>>>>>> and start reading from there. leveraging the proper
> >> read
> >>>>> ahead
> >>>>>>>> cache
> >>>>>>>>>> and
> >>>>>>>>>>>>> pending commit records cache, it would be good for both
> >>>>> short
> >>>>>>>>>>>> transactions
> >>>>>>>>>>>>> and long transactions.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> - DLSN, SequenceId:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> * We will add a fourth field to DLSN. It is `local
> >>>> sequence
> >>>>>>>> number`
> >>>>>>>>>>>> within
> >>>>>>>>>>>>> a transaction session. So the new DLSN of records in a
> >>>>>>> transaction
> >>>>>>>>>> will
> >>>>>>>>>>>> be
> >>>>>>>>>>>>> the DLSN of commit control record plus its local
> >> sequence
> >>>>>>> number.
> >>>>>>>>>>>>> * The sequence id will be still the position of the
> >>>> commit
> >>>>>>> record
> >>>>>>>>> plus
> >>>>>>>>>>>> its
> >>>>>>>>>>>>> local sequence number. The position will be advanced
> >> with
> >>>>>> total
> >>>>>>>>> number
> >>>>>>>>>>>> of
> >>>>>>>>>>>>> written records on writing the commit control record.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> - Transaction Group & Namespace Transaction
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> using one single log stream for namespace transaction
> >> can
> >>>>>> cause
> >>>>>>>> the
> >>>>>>>>>>>>> bottleneck problem since all the begin/commit/end
> >> control
> >>>>>>> records
> >>>>>>>>> will
> >>>>>>>>>>>> have
> >>>>>>>>>>>>> to go through one log stream.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> the idea of 'transaction group' is to allow
> >> partitioning
> >>>> the
> >>>>>>>> writers
> >>>>>>>>>>>> into
> >>>>>>>>>>>>> different transaction groups.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> clients can specify the `group-name` when starting the
> >>>>>>>> transaction.
> >>>>>>>>> if
> >>>>>>>>>>>>> there is no `group-name` specified, it will use the
> >>>> default
> >>>>>>>> `commit`
> >>>>>>>>>>>> log in
> >>>>>>>>>>>>> the namespace for creating transactions.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> -------------------------------------------------
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I'd like to collect feedbacks on this idea. Appreciate
> >>>> any
> >>>>>>>> comments
> >>>>>>>>>> and
> >>>>>>>>>>>> if
> >>>>>>>>>>>>> anyone is also interested in this idea, we'd like to
> >>>>>> collaborate
> >>>>>>>>> with
> >>>>>>>>>>>> the
> >>>>>>>>>>>>> community.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> - Xi
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>>
> >>
>
>

Re: [Discuss] Transaction Support

Posted by Asko Kauppi <as...@zalando.fi>.

> Beside that, I have one general question - What is the major goal for this
> feature? Are you targeting on building a general XA transaction coordinator
> or just for supporting things like `copy-modify-write' style workflow?

The use case I would have for transactions - at some level of the stack - is supporting dynamic configurations.

If a config changes in e.g. three lines, some of the changes may logically belong together. E.g. changing both “host” and “port” (if separate entries). One shouldn’t be able to read a state, even temporarily, that has new host but old port.

I can do this in the application level - it does not need to be part of the DL protocol.


Asko Kauppi
Zalando Tech Helsinki

> On 4 Jan 2017, at 9.18, Sijie Guo <si...@apache.org> wrote:
> 
> Sorry for late response. I think Leigh and you already had some very
> valuable discussions in the doc. I will try to add some of my questions to
> the discussion.
> 
> Beside that, I had a discussion with Leigh today about this. first of all,
> I think it is very good to add transaction support in distributedlog. It is
> one of the primitives that would help building distributed service. But we
> have a concern about making this system become complicated and introduce
> operational overhead when it runs in the large scale system on production.
> There are two major suggestions that I have for this feature -
> 
> Build the 'minimum' logic in core - I think the minimum logic that need to
> be added to the core is -  the special control records (begin, commit and
> abort) and make the reader be able to detect those special control records
> and know what do they mean and how to interrupt with them. Since they are
> special control records, there is not overhead to other readers that
> doesn't require this feature.
> 
> Build the transaction coordinator as a separated proxy service  - I think
> the major concern that we have is putting more complexities into the 'write
> proxy' service. We architected distributedlog in a more microservice-like
> way - we have the core as the stream store, the proxy for serving write and
> read traffic. It would be good that the transaction feature can be done in
> a similar way. So the architecture would be like this -
> 
> *[ write service ] [ read service ] [ transaction coordinator ]*
> *[ stream store
>            ]*
> 
> if people doesn't need the transaction feature, they can turn if off
> completely without any operational overhead.
> 
> Beside that, I have one general question - What is the major goal for this
> feature? Are you targeting on building a general XA transaction coordinator
> or just for supporting things like `copy-modify-write' style workflow?
> 
> 
> Thanks,
> Sijie
> 
> 
> 
> 
> 
> On Wed, Dec 28, 2016 at 1:12 PM, Xi Liu <xi...@gmail.com> wrote:
> 
>> Ping?
>> 
>> On Mon, Dec 19, 2016 at 8:28 AM, Xi Liu <xi...@gmail.com> wrote:
>> 
>>> Sijie,
>>> 
>>> No. I thought it might be easier for people to comment on a google doc to
>>> gather the initial feedback. I will put the content back to wiki page
>> once
>>> addressing the comments. Does that sound good to you?
>>> 
>>> And thank you in advance.
>>> 
>>> - Xi
>>> 
>>> 
>>> 
>>> On Sun, Dec 18, 2016 at 8:48 AM, Sijie Guo <si...@apache.org> wrote:
>>> 
>>>> Hi Xi,
>>>> 
>>>> sorry for late response. I will review it soon.
>>>> 
>>>> regarding this, a separate question "are we going to use google doc
>>>> instead
>>>> of email thread for any discussion"? I am a bit worried that the
>>>> discussion
>>>> will become lost after moving to google doc. No idea on how other apache
>>>> projects are doing.
>>>> 
>>>> - Sijie
>>>> 
>>>> On Wed, Dec 14, 2016 at 11:41 PM, Xi Liu <xi...@gmail.com> wrote:
>>>> 
>>>>> Hi all,
>>>>> 
>>>>> I finalized the first version of the design. This time I used a google
>>>> doc
>>>>> so that it is easier for commenting and add a link the wiki page. I
>> will
>>>>> update this to the wiki page once we come to the finalized design.
>>>>> 
>>>>> https://docs.google.com/document/d/14Ns05M8Z5a6DF6fHmWQwISyD5jjeK
>>>>> bSIGgSzXuTI5BA/edit
>>>>> 
>>>>> Let me know if you have any questions. Appreciate your reviews!
>>>>> 
>>>>> - Xi
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On Fri, Oct 28, 2016 at 7:58 AM, Leigh Stewart
>>>>> <lstewart@twitter.com.invalid
>>>>>> wrote:
>>>>> 
>>>>>> Interesting proposal. A couple quick notes while you continue to
>> flesh
>>>>> this
>>>>>> out.
>>>>>> 
>>>>>> a. just to be sure - does this eliminate the need to save seqno with
>>>>>> checkpoint?
>>>>>> 
>>>>>> b. i.e. another way to describe this kind of improvement is "support
>>>>>> records (atomic writes) larger than 1MB", iiuc. the advantage being
>> it
>>>>>> avoids the baggage of transactions. disadvantages include inability
>>>> to do
>>>>>> cross stream transactions, and flexibility (interleaving, etc) (are
>>>> there
>>>>>> others?).
>>>>>> 
>>>>>> c. proxy use case is for supporting multiple writers - have you
>>>> thought
>>>>>> about how this would work with multiple writers?
>>>>>> 
>>>>>> Thanks!
>>>>>> 
>>>>>> 
>>>>>> On Tue, Oct 18, 2016 at 6:45 PM, Sijie Guo
>> <sijieg@twitter.com.invalid
>>>>> 
>>>>>> wrote:
>>>>>> 
>>>>>>> Sound good to me. look forward to the detailed proposal.
>>>>>>> 
>>>>>>> (I don't mind the format if it makes things easier to you)
>>>>>>> 
>>>>>>> Sijie
>>>>>>> 
>>>>>>> On Friday, October 14, 2016, Xi Liu <xi...@gmail.com> wrote:
>>>>>>> 
>>>>>>>> Thank you, Sijie
>>>>>>>> 
>>>>>>>> We have some internal discussions to sort out some details. We
>> are
>>>>>> ready
>>>>>>> to
>>>>>>>> collaborate with the community for adding the transaction
>> support
>>>> in
>>>>>> DL.
>>>>>>>> We'd like to share more.
>>>>>>>> 
>>>>>>>> I created a proposal wiki here -
>>>>>>>> https://cwiki.apache.org/confluence/display/DL/DP-1+-+
>>>>>>>> DistributedLog+Transaction+Support
>>>>>>>> 
>>>>>>>> (I followed KIP format and named it as DP (DistributedLog
>>>> Proposal -
>>>>> DP
>>>>>>> is
>>>>>>>> also short for Dynamic Programming). I don't know if you guys
>> like
>>>>> this
>>>>>>>> name or not. Feel free to change it :D)
>>>>>>>> 
>>>>>>>> I basically put my initial email as the content there so far.
>>>> Once we
>>>>>>>> finished our final discussion, I will update with more details.
>> At
>>>>> the
>>>>>>> same
>>>>>>>> time, any comments are welcome.
>>>>>>>> 
>>>>>>>> - Xi
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Sat, Oct 8, 2016 at 6:58 AM, Sijie Guo <sijie@apache.org
>>>>>>> <javascript:;>>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Xi,
>>>>>>>>> 
>>>>>>>>> I just granted you the edit permission.
>>>>>>>>> 
>>>>>>>>> - Sijie
>>>>>>>>> 
>>>>>>>>> On Fri, Oct 7, 2016 at 10:34 AM, Xi Liu <xi.liu.ant@gmail.com
>>>>>>>> <javascript:;>> wrote:
>>>>>>>>> 
>>>>>>>>>> I still can not edit the wiki. Can any of the pmc members
>>>> grant
>>>>> me
>>>>>>> the
>>>>>>>>>> permissions?
>>>>>>>>>> 
>>>>>>>>>> - Xi
>>>>>>>>>> 
>>>>>>>>>> On Sat, Sep 17, 2016 at 10:35 PM, Xi Liu <
>>>> xi.liu.ant@gmail.com
>>>>>>>> <javascript:;>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Sijie,
>>>>>>>>>>> 
>>>>>>>>>>> I attempted to create a wiki page under that space. I
>> found
>>>>> that
>>>>>> I
>>>>>>> am
>>>>>>>>> not
>>>>>>>>>>> authorized with edit permission.
>>>>>>>>>>> 
>>>>>>>>>>> Can any of the committers grant me the wiki edit
>>>> permission? My
>>>>>>>> account
>>>>>>>>>> is
>>>>>>>>>>> "xi.liu.ant".
>>>>>>>>>>> 
>>>>>>>>>>> - Xi
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Tue, Sep 13, 2016 at 9:26 AM, Sijie Guo <
>>>> sijie@apache.org
>>>>>>>> <javascript:;>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> This sounds interesting ... I will take a closer look and
>>>> give
>>>>>> my
>>>>>>>>>> comments
>>>>>>>>>>>> later.
>>>>>>>>>>>> 
>>>>>>>>>>>> At the same time, do you mind creating a wiki page to put
>>>> your
>>>>>>> idea
>>>>>>>>>> there?
>>>>>>>>>>>> You can add your wiki page under
>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/DL/Project+
>>>>>> Proposals
>>>>>>>>>>>> 
>>>>>>>>>>>> You might need to ask in the dev list to grant the wiki
>>>> edit
>>>>>>>>> permissions
>>>>>>>>>>>> to
>>>>>>>>>>>> you once you have a wiki account.
>>>>>>>>>>>> 
>>>>>>>>>>>> - Sijie
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Mon, Sep 12, 2016 at 2:20 AM, Xi Liu <
>>>> xi.liu.ant@gmail.com
>>>>>>>> <javascript:;>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I asked the transaction support in distributedlog user
>>>> group
>>>>>> two
>>>>>>>>>> months
>>>>>>>>>>>>> ago. I want to raise this up again, as we are looking
>> for
>>>>>> using
>>>>>>>>>>>>> distributedlog for building a transactional data
>>>> service. It
>>>>>> is
>>>>>>> a
>>>>>>>>>> major
>>>>>>>>>>>>> feature that is missing in distributedlog. We have some
>>>>> ideas
>>>>>> to
>>>>>>>> add
>>>>>>>>>>>> this
>>>>>>>>>>>>> to distributedlog and want to know if they make sense
>> or
>>>>> not.
>>>>>> If
>>>>>>>>> they
>>>>>>>>>>>> are
>>>>>>>>>>>>> good, we'd like to contribute and develop with the
>>>>> community.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Here are the thoughts:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> -------------------------------------------------
>>>>>>>>>>>>> 
>>>>>>>>>>>>> From our understanding, DL can provide "at-least-once"
>>>>>> delivery
>>>>>>>>>> semantic
>>>>>>>>>>>>> (if not, please correct me) but not "exactly-once"
>>>> delivery
>>>>>>>>> semantic.
>>>>>>>>>>>> That
>>>>>>>>>>>>> means that a message can be delivered one or more times
>>>> if
>>>>> the
>>>>>>>>> reader
>>>>>>>>>>>>> doesn't handle duplicates.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The duplicates come from two places, one is at writer
>>>> side
>>>>>> (this
>>>>>>>>>> assumes
>>>>>>>>>>>>> using write proxy not the core library), while the
>> other
>>>> one
>>>>>> is
>>>>>>> at
>>>>>>>>>>>> reader
>>>>>>>>>>>>> side.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> - writer side: if the client attempts to write a record
>>>> to
>>>>> the
>>>>>>>> write
>>>>>>>>>>>>> proxies and gets a network error (e.g timeouts) then
>>>>> retries,
>>>>>>> the
>>>>>>>>>>>> retrying
>>>>>>>>>>>>> will potentially result in duplicates.
>>>>>>>>>>>>> - reader side:if the reader reads a message from a
>> stream
>>>>> and
>>>>>>> then
>>>>>>>>>>>> crashes,
>>>>>>>>>>>>> when the reader restarts it would restart from last
>> known
>>>>>>> position
>>>>>>>>>>>> (DLSN).
>>>>>>>>>>>>> If the reader fails after processing a record and
>> before
>>>>>>> recording
>>>>>>>>> the
>>>>>>>>>>>>> position, the processed record will be delivered again.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The reader problem can be properly addressed by making
>>>> use
>>>>> of
>>>>>>> the
>>>>>>>>>>>> sequence
>>>>>>>>>>>>> numbers of records and doing proper checkpointing. For
>>>>>> example,
>>>>>>> in
>>>>>>>>>>>>> database, it can checkpoint the indexed data with the
>>>>> sequence
>>>>>>>>> number
>>>>>>>>>> of
>>>>>>>>>>>>> records; in flink, it can checkpoint the state with the
>>>>>> sequence
>>>>>>>>>>>> numbers.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The writer problem can be addressed by implementing an
>>>>>>> idempotent
>>>>>>>>>>>> writer.
>>>>>>>>>>>>> However, an alternative and more powerful approach is
>> to
>>>>>> support
>>>>>>>>>>>>> transactions.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> *What does transaction mean?*
>>>>>>>>>>>>> 
>>>>>>>>>>>>> A transaction means a collection of records can be
>>>> written
>>>>>>>>>>>> transactionally
>>>>>>>>>>>>> within a stream or across multiple streams. They will
>> be
>>>>>>> consumed
>>>>>>>> by
>>>>>>>>>> the
>>>>>>>>>>>>> reader together when a transaction is committed, or
>> will
>>>>> never
>>>>>>> be
>>>>>>>>>>>> consumed
>>>>>>>>>>>>> by the reader when the transaction is aborted.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The transaction will expose following guarantees:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> - The reader should not be exposed to records written
>>>> from
>>>>>>>>> uncommitted
>>>>>>>>>>>>> transactions (mandatory)
>>>>>>>>>>>>> - The reader should consume the records in the
>>>> transaction
>>>>>>> commit
>>>>>>>>>> order
>>>>>>>>>>>>> rather than the record written order (mandatory)
>>>>>>>>>>>>> - No duplicated records within a transaction
>> (mandatory)
>>>>>>>>>>>>> - Allow interleaving transactional writes and
>>>>>> non-transactional
>>>>>>>>> writes
>>>>>>>>>>>>> (optional)
>>>>>>>>>>>>> 
>>>>>>>>>>>>> *Stream Transaction & Namespace Transaction*
>>>>>>>>>>>>> 
>>>>>>>>>>>>> There will be two types of transaction, one is Stream
>>>> level
>>>>>>>>>> transaction
>>>>>>>>>>>>> (local transaction), while the other one is Namespace
>>>> level
>>>>>>>>>> transaction
>>>>>>>>>>>>> (global transaction).
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The stream level transaction is a transactional
>>>> operation on
>>>>>>>> writing
>>>>>>>>>>>>> records to one stream; the namespace level transaction
>>>> is a
>>>>>>>>>>>> transactional
>>>>>>>>>>>>> operation on writing records to multiple streams.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> *Implementation Thoughts*
>>>>>>>>>>>>> 
>>>>>>>>>>>>> - A transaction is consist of begin control record, a
>>>> series
>>>>>> of
>>>>>>>> data
>>>>>>>>>>>>> records and commit/abort control record.
>>>>>>>>>>>>> - The begin/commit/abort control record is written to a
>>>>>> `commit`
>>>>>>>> log
>>>>>>>>>>>>> stream, while the data records will be written to
>> normal
>>>>> data
>>>>>>> log
>>>>>>>>>>>> streams.
>>>>>>>>>>>>> - The `commit` log stream will be the same log stream
>> for
>>>>>>>>> stream-level
>>>>>>>>>>>>> transaction,  while it will be a *system* stream (or
>>>>> multiple
>>>>>>>> system
>>>>>>>>>>>>> streams) for namespace-level transactions.
>>>>>>>>>>>>> - The transaction code looks like as below:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> <code>
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Transaction txn = client.transaction();
>>>>>>>>>>>>> Future<DLSN> result1 = txn.write(stream-0, record);
>>>>>>>>>>>>> Future<DLSN> result2 = txn.write(stream-1, record);
>>>>>>>>>>>>> Future<DLSN> result3 = txn.write(stream-2, record);
>>>>>>>>>>>>> Future<Pair<DLSN, DLSN>> result = txn.commit();
>>>>>>>>>>>>> 
>>>>>>>>>>>>> </code>
>>>>>>>>>>>>> 
>>>>>>>>>>>>> if the txn is committed, all the write futures will be
>>>>>> satisfied
>>>>>>>>> with
>>>>>>>>>>>> their
>>>>>>>>>>>>> written DLSNs. if the txn is aborted, all the write
>>>> futures
>>>>>> will
>>>>>>>> be
>>>>>>>>>>>> failed
>>>>>>>>>>>>> together. there is no partial failure state.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> - The actually data flow will be:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 1. writer get a transaction id from the owner of the
>>>>> `commit'
>>>>>>> log
>>>>>>>>>> stream
>>>>>>>>>>>>> 1. write the begin control record (synchronously) with
>>>> the
>>>>>>>>> transaction
>>>>>>>>>>>> id
>>>>>>>>>>>>> 2. for each write within the same txn, it will be
>>>> assigned a
>>>>>>> local
>>>>>>>>>>>> sequence
>>>>>>>>>>>>> number starting from 0. the combination of transaction
>> id
>>>>> and
>>>>>>>> local
>>>>>>>>>>>>> sequence number will be used later on by the readers to
>>>>>>>> de-duplicate
>>>>>>>>>>>>> records.
>>>>>>>>>>>>> 3. the commit/abort control record will be written
>> based
>>>> on
>>>>>> the
>>>>>>>>>> results
>>>>>>>>>>>>> from 2.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> - Application can supply a timeout for the transaction
>>>> when
>>>>>>>>> #begin() a
>>>>>>>>>>>>> transaction. The owner of the `commit` log stream can
>>>> abort
>>>>>>>>>> transactions
>>>>>>>>>>>>> that never be committed/aborted within their timeout.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> - Failures:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> * all the log records can be simply retried as they
>> will
>>>> be
>>>>>>>>>>>> de-duplicated
>>>>>>>>>>>>> probably at the reader side.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> - Reader:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> * Reader can be configured to read uncommitted records
>> or
>>>>>>>> committed
>>>>>>>>>>>> records
>>>>>>>>>>>>> only (by default read uncommitted records)
>>>>>>>>>>>>> * If reader is configured to read committed records
>> only,
>>>>> the
>>>>>>> read
>>>>>>>>>> ahead
>>>>>>>>>>>>> cache will be changed to maintain one additional
>> pending
>>>>>>> committed
>>>>>>>>>>>> records.
>>>>>>>>>>>>> the pending committed records map is bounded and
>> records
>>>>> will
>>>>>> be
>>>>>>>>>> dropped
>>>>>>>>>>>>> when read ahead is moving.
>>>>>>>>>>>>> * when the reader hits a commit record, it will rewind
>> to
>>>>> the
>>>>>>>> begin
>>>>>>>>>>>> record
>>>>>>>>>>>>> and start reading from there. leveraging the proper
>> read
>>>>> ahead
>>>>>>>> cache
>>>>>>>>>> and
>>>>>>>>>>>>> pending commit records cache, it would be good for both
>>>>> short
>>>>>>>>>>>> transactions
>>>>>>>>>>>>> and long transactions.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> - DLSN, SequenceId:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> * We will add a fourth field to DLSN. It is `local
>>>> sequence
>>>>>>>> number`
>>>>>>>>>>>> within
>>>>>>>>>>>>> a transaction session. So the new DLSN of records in a
>>>>>>> transaction
>>>>>>>>>> will
>>>>>>>>>>>> be
>>>>>>>>>>>>> the DLSN of commit control record plus its local
>> sequence
>>>>>>> number.
>>>>>>>>>>>>> * The sequence id will be still the position of the
>>>> commit
>>>>>>> record
>>>>>>>>> plus
>>>>>>>>>>>> its
>>>>>>>>>>>>> local sequence number. The position will be advanced
>> with
>>>>>> total
>>>>>>>>> number
>>>>>>>>>>>> of
>>>>>>>>>>>>> written records on writing the commit control record.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> - Transaction Group & Namespace Transaction
>>>>>>>>>>>>> 
>>>>>>>>>>>>> using one single log stream for namespace transaction
>> can
>>>>>> cause
>>>>>>>> the
>>>>>>>>>>>>> bottleneck problem since all the begin/commit/end
>> control
>>>>>>> records
>>>>>>>>> will
>>>>>>>>>>>> have
>>>>>>>>>>>>> to go through one log stream.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> the idea of 'transaction group' is to allow
>> partitioning
>>>> the
>>>>>>>> writers
>>>>>>>>>>>> into
>>>>>>>>>>>>> different transaction groups.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> clients can specify the `group-name` when starting the
>>>>>>>> transaction.
>>>>>>>>> if
>>>>>>>>>>>>> there is no `group-name` specified, it will use the
>>>> default
>>>>>>>> `commit`
>>>>>>>>>>>> log in
>>>>>>>>>>>>> the namespace for creating transactions.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> -------------------------------------------------
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I'd like to collect feedbacks on this idea. Appreciate
>>>> any
>>>>>>>> comments
>>>>>>>>>> and
>>>>>>>>>>>> if
>>>>>>>>>>>>> anyone is also interested in this idea, we'd like to
>>>>>> collaborate
>>>>>>>>> with
>>>>>>>>>>>> the
>>>>>>>>>>>>> community.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> - Xi
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>>> 
>>

Re: [Discuss] Transaction Support

Posted by Sijie Guo <si...@apache.org>.

Sorry for late response. I think Leigh and you already had some very
valuable discussions in the doc. I will try to add some of my questions to
the discussion.

Beside that, I had a discussion with Leigh today about this. first of all,
I think it is very good to add transaction support in distributedlog. It is
one of the primitives that would help building distributed service. But we
have a concern about making this system become complicated and introduce
operational overhead when it runs in the large scale system on production.
There are two major suggestions that I have for this feature -

Build the 'minimum' logic in core - I think the minimum logic that need to
be added to the core is -  the special control records (begin, commit and
abort) and make the reader be able to detect those special control records
and know what do they mean and how to interrupt with them. Since they are
special control records, there is not overhead to other readers that
doesn't require this feature.

Build the transaction coordinator as a separated proxy service  - I think
the major concern that we have is putting more complexities into the 'write
proxy' service. We architected distributedlog in a more microservice-like
way - we have the core as the stream store, the proxy for serving write and
read traffic. It would be good that the transaction feature can be done in
a similar way. So the architecture would be like this -

*[ write service ] [ read service ] [ transaction coordinator ]*
*[ stream store
            ]*

if people doesn't need the transaction feature, they can turn if off
completely without any operational overhead.

Beside that, I have one general question - What is the major goal for this
feature? Are you targeting on building a general XA transaction coordinator
or just for supporting things like `copy-modify-write' style workflow?


Thanks,
Sijie





On Wed, Dec 28, 2016 at 1:12 PM, Xi Liu <xi...@gmail.com> wrote:

> Ping?
>
> On Mon, Dec 19, 2016 at 8:28 AM, Xi Liu <xi...@gmail.com> wrote:
>
> > Sijie,
> >
> > No. I thought it might be easier for people to comment on a google doc to
> > gather the initial feedback. I will put the content back to wiki page
> once
> > addressing the comments. Does that sound good to you?
> >
> > And thank you in advance.
> >
> > - Xi
> >
> >
> >
> > On Sun, Dec 18, 2016 at 8:48 AM, Sijie Guo <si...@apache.org> wrote:
> >
> >> Hi Xi,
> >>
> >> sorry for late response. I will review it soon.
> >>
> >> regarding this, a separate question "are we going to use google doc
> >> instead
> >> of email thread for any discussion"? I am a bit worried that the
> >> discussion
> >> will become lost after moving to google doc. No idea on how other apache
> >> projects are doing.
> >>
> >> - Sijie
> >>
> >> On Wed, Dec 14, 2016 at 11:41 PM, Xi Liu <xi...@gmail.com> wrote:
> >>
> >> > Hi all,
> >> >
> >> > I finalized the first version of the design. This time I used a google
> >> doc
> >> > so that it is easier for commenting and add a link the wiki page. I
> will
> >> > update this to the wiki page once we come to the finalized design.
> >> >
> >> > https://docs.google.com/document/d/14Ns05M8Z5a6DF6fHmWQwISyD5jjeK
> >> > bSIGgSzXuTI5BA/edit
> >> >
> >> > Let me know if you have any questions. Appreciate your reviews!
> >> >
> >> > - Xi
> >> >
> >> >
> >> >
> >> >
> >> >
> >> > On Fri, Oct 28, 2016 at 7:58 AM, Leigh Stewart
> >> > <lstewart@twitter.com.invalid
> >> > > wrote:
> >> >
> >> > > Interesting proposal. A couple quick notes while you continue to
> flesh
> >> > this
> >> > > out.
> >> > >
> >> > > a. just to be sure - does this eliminate the need to save seqno with
> >> > > checkpoint?
> >> > >
> >> > > b. i.e. another way to describe this kind of improvement is "support
> >> > > records (atomic writes) larger than 1MB", iiuc. the advantage being
> it
> >> > > avoids the baggage of transactions. disadvantages include inability
> >> to do
> >> > > cross stream transactions, and flexibility (interleaving, etc) (are
> >> there
> >> > > others?).
> >> > >
> >> > > c. proxy use case is for supporting multiple writers - have you
> >> thought
> >> > > about how this would work with multiple writers?
> >> > >
> >> > > Thanks!
> >> > >
> >> > >
> >> > > On Tue, Oct 18, 2016 at 6:45 PM, Sijie Guo
> <sijieg@twitter.com.invalid
> >> >
> >> > > wrote:
> >> > >
> >> > > > Sound good to me. look forward to the detailed proposal.
> >> > > >
> >> > > > (I don't mind the format if it makes things easier to you)
> >> > > >
> >> > > > Sijie
> >> > > >
> >> > > > On Friday, October 14, 2016, Xi Liu <xi...@gmail.com> wrote:
> >> > > >
> >> > > > > Thank you, Sijie
> >> > > > >
> >> > > > > We have some internal discussions to sort out some details. We
> are
> >> > > ready
> >> > > > to
> >> > > > > collaborate with the community for adding the transaction
> support
> >> in
> >> > > DL.
> >> > > > > We'd like to share more.
> >> > > > >
> >> > > > > I created a proposal wiki here -
> >> > > > > https://cwiki.apache.org/confluence/display/DL/DP-1+-+
> >> > > > > DistributedLog+Transaction+Support
> >> > > > >
> >> > > > > (I followed KIP format and named it as DP (DistributedLog
> >> Proposal -
> >> > DP
> >> > > > is
> >> > > > > also short for Dynamic Programming). I don't know if you guys
> like
> >> > this
> >> > > > > name or not. Feel free to change it :D)
> >> > > > >
> >> > > > > I basically put my initial email as the content there so far.
> >> Once we
> >> > > > > finished our final discussion, I will update with more details.
> At
> >> > the
> >> > > > same
> >> > > > > time, any comments are welcome.
> >> > > > >
> >> > > > > - Xi
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > > On Sat, Oct 8, 2016 at 6:58 AM, Sijie Guo <sijie@apache.org
> >> > > > <javascript:;>>
> >> > > > > wrote:
> >> > > > >
> >> > > > > > Xi,
> >> > > > > >
> >> > > > > > I just granted you the edit permission.
> >> > > > > >
> >> > > > > > - Sijie
> >> > > > > >
> >> > > > > > On Fri, Oct 7, 2016 at 10:34 AM, Xi Liu <xi.liu.ant@gmail.com
> >> > > > > <javascript:;>> wrote:
> >> > > > > >
> >> > > > > > > I still can not edit the wiki. Can any of the pmc members
> >> grant
> >> > me
> >> > > > the
> >> > > > > > > permissions?
> >> > > > > > >
> >> > > > > > > - Xi
> >> > > > > > >
> >> > > > > > > On Sat, Sep 17, 2016 at 10:35 PM, Xi Liu <
> >> xi.liu.ant@gmail.com
> >> > > > > <javascript:;>> wrote:
> >> > > > > > >
> >> > > > > > > > Sijie,
> >> > > > > > > >
> >> > > > > > > > I attempted to create a wiki page under that space. I
> found
> >> > that
> >> > > I
> >> > > > am
> >> > > > > > not
> >> > > > > > > > authorized with edit permission.
> >> > > > > > > >
> >> > > > > > > > Can any of the committers grant me the wiki edit
> >> permission? My
> >> > > > > account
> >> > > > > > > is
> >> > > > > > > > "xi.liu.ant".
> >> > > > > > > >
> >> > > > > > > > - Xi
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > On Tue, Sep 13, 2016 at 9:26 AM, Sijie Guo <
> >> sijie@apache.org
> >> > > > > <javascript:;>> wrote:
> >> > > > > > > >
> >> > > > > > > >> This sounds interesting ... I will take a closer look and
> >> give
> >> > > my
> >> > > > > > > comments
> >> > > > > > > >> later.
> >> > > > > > > >>
> >> > > > > > > >> At the same time, do you mind creating a wiki page to put
> >> your
> >> > > > idea
> >> > > > > > > there?
> >> > > > > > > >> You can add your wiki page under
> >> > > > > > > >> https://cwiki.apache.org/confluence/display/DL/Project+
> >> > > Proposals
> >> > > > > > > >>
> >> > > > > > > >> You might need to ask in the dev list to grant the wiki
> >> edit
> >> > > > > > permissions
> >> > > > > > > >> to
> >> > > > > > > >> you once you have a wiki account.
> >> > > > > > > >>
> >> > > > > > > >> - Sijie
> >> > > > > > > >>
> >> > > > > > > >>
> >> > > > > > > >> On Mon, Sep 12, 2016 at 2:20 AM, Xi Liu <
> >> xi.liu.ant@gmail.com
> >> > > > > <javascript:;>> wrote:
> >> > > > > > > >>
> >> > > > > > > >> > Hello,
> >> > > > > > > >> >
> >> > > > > > > >> > I asked the transaction support in distributedlog user
> >> group
> >> > > two
> >> > > > > > > months
> >> > > > > > > >> > ago. I want to raise this up again, as we are looking
> for
> >> > > using
> >> > > > > > > >> > distributedlog for building a transactional data
> >> service. It
> >> > > is
> >> > > > a
> >> > > > > > > major
> >> > > > > > > >> > feature that is missing in distributedlog. We have some
> >> > ideas
> >> > > to
> >> > > > > add
> >> > > > > > > >> this
> >> > > > > > > >> > to distributedlog and want to know if they make sense
> or
> >> > not.
> >> > > If
> >> > > > > > they
> >> > > > > > > >> are
> >> > > > > > > >> > good, we'd like to contribute and develop with the
> >> > community.
> >> > > > > > > >> >
> >> > > > > > > >> > Here are the thoughts:
> >> > > > > > > >> >
> >> > > > > > > >> > -------------------------------------------------
> >> > > > > > > >> >
> >> > > > > > > >> > From our understanding, DL can provide "at-least-once"
> >> > > delivery
> >> > > > > > > semantic
> >> > > > > > > >> > (if not, please correct me) but not "exactly-once"
> >> delivery
> >> > > > > > semantic.
> >> > > > > > > >> That
> >> > > > > > > >> > means that a message can be delivered one or more times
> >> if
> >> > the
> >> > > > > > reader
> >> > > > > > > >> > doesn't handle duplicates.
> >> > > > > > > >> >
> >> > > > > > > >> > The duplicates come from two places, one is at writer
> >> side
> >> > > (this
> >> > > > > > > assumes
> >> > > > > > > >> > using write proxy not the core library), while the
> other
> >> one
> >> > > is
> >> > > > at
> >> > > > > > > >> reader
> >> > > > > > > >> > side.
> >> > > > > > > >> >
> >> > > > > > > >> > - writer side: if the client attempts to write a record
> >> to
> >> > the
> >> > > > > write
> >> > > > > > > >> > proxies and gets a network error (e.g timeouts) then
> >> > retries,
> >> > > > the
> >> > > > > > > >> retrying
> >> > > > > > > >> > will potentially result in duplicates.
> >> > > > > > > >> > - reader side:if the reader reads a message from a
> stream
> >> > and
> >> > > > then
> >> > > > > > > >> crashes,
> >> > > > > > > >> > when the reader restarts it would restart from last
> known
> >> > > > position
> >> > > > > > > >> (DLSN).
> >> > > > > > > >> > If the reader fails after processing a record and
> before
> >> > > > recording
> >> > > > > > the
> >> > > > > > > >> > position, the processed record will be delivered again.
> >> > > > > > > >> >
> >> > > > > > > >> > The reader problem can be properly addressed by making
> >> use
> >> > of
> >> > > > the
> >> > > > > > > >> sequence
> >> > > > > > > >> > numbers of records and doing proper checkpointing. For
> >> > > example,
> >> > > > in
> >> > > > > > > >> > database, it can checkpoint the indexed data with the
> >> > sequence
> >> > > > > > number
> >> > > > > > > of
> >> > > > > > > >> > records; in flink, it can checkpoint the state with the
> >> > > sequence
> >> > > > > > > >> numbers.
> >> > > > > > > >> >
> >> > > > > > > >> > The writer problem can be addressed by implementing an
> >> > > > idempotent
> >> > > > > > > >> writer.
> >> > > > > > > >> > However, an alternative and more powerful approach is
> to
> >> > > support
> >> > > > > > > >> > transactions.
> >> > > > > > > >> >
> >> > > > > > > >> > *What does transaction mean?*
> >> > > > > > > >> >
> >> > > > > > > >> > A transaction means a collection of records can be
> >> written
> >> > > > > > > >> transactionally
> >> > > > > > > >> > within a stream or across multiple streams. They will
> be
> >> > > > consumed
> >> > > > > by
> >> > > > > > > the
> >> > > > > > > >> > reader together when a transaction is committed, or
> will
> >> > never
> >> > > > be
> >> > > > > > > >> consumed
> >> > > > > > > >> > by the reader when the transaction is aborted.
> >> > > > > > > >> >
> >> > > > > > > >> > The transaction will expose following guarantees:
> >> > > > > > > >> >
> >> > > > > > > >> > - The reader should not be exposed to records written
> >> from
> >> > > > > > uncommitted
> >> > > > > > > >> > transactions (mandatory)
> >> > > > > > > >> > - The reader should consume the records in the
> >> transaction
> >> > > > commit
> >> > > > > > > order
> >> > > > > > > >> > rather than the record written order (mandatory)
> >> > > > > > > >> > - No duplicated records within a transaction
> (mandatory)
> >> > > > > > > >> > - Allow interleaving transactional writes and
> >> > > non-transactional
> >> > > > > > writes
> >> > > > > > > >> > (optional)
> >> > > > > > > >> >
> >> > > > > > > >> > *Stream Transaction & Namespace Transaction*
> >> > > > > > > >> >
> >> > > > > > > >> > There will be two types of transaction, one is Stream
> >> level
> >> > > > > > > transaction
> >> > > > > > > >> > (local transaction), while the other one is Namespace
> >> level
> >> > > > > > > transaction
> >> > > > > > > >> > (global transaction).
> >> > > > > > > >> >
> >> > > > > > > >> > The stream level transaction is a transactional
> >> operation on
> >> > > > > writing
> >> > > > > > > >> > records to one stream; the namespace level transaction
> >> is a
> >> > > > > > > >> transactional
> >> > > > > > > >> > operation on writing records to multiple streams.
> >> > > > > > > >> >
> >> > > > > > > >> > *Implementation Thoughts*
> >> > > > > > > >> >
> >> > > > > > > >> > - A transaction is consist of begin control record, a
> >> series
> >> > > of
> >> > > > > data
> >> > > > > > > >> > records and commit/abort control record.
> >> > > > > > > >> > - The begin/commit/abort control record is written to a
> >> > > `commit`
> >> > > > > log
> >> > > > > > > >> > stream, while the data records will be written to
> normal
> >> > data
> >> > > > log
> >> > > > > > > >> streams.
> >> > > > > > > >> > - The `commit` log stream will be the same log stream
> for
> >> > > > > > stream-level
> >> > > > > > > >> > transaction,  while it will be a *system* stream (or
> >> > multiple
> >> > > > > system
> >> > > > > > > >> > streams) for namespace-level transactions.
> >> > > > > > > >> > - The transaction code looks like as below:
> >> > > > > > > >> >
> >> > > > > > > >> > <code>
> >> > > > > > > >> >
> >> > > > > > > >> > Transaction txn = client.transaction();
> >> > > > > > > >> > Future<DLSN> result1 = txn.write(stream-0, record);
> >> > > > > > > >> > Future<DLSN> result2 = txn.write(stream-1, record);
> >> > > > > > > >> > Future<DLSN> result3 = txn.write(stream-2, record);
> >> > > > > > > >> > Future<Pair<DLSN, DLSN>> result = txn.commit();
> >> > > > > > > >> >
> >> > > > > > > >> > </code>
> >> > > > > > > >> >
> >> > > > > > > >> > if the txn is committed, all the write futures will be
> >> > > satisfied
> >> > > > > > with
> >> > > > > > > >> their
> >> > > > > > > >> > written DLSNs. if the txn is aborted, all the write
> >> futures
> >> > > will
> >> > > > > be
> >> > > > > > > >> failed
> >> > > > > > > >> > together. there is no partial failure state.
> >> > > > > > > >> >
> >> > > > > > > >> > - The actually data flow will be:
> >> > > > > > > >> >
> >> > > > > > > >> > 1. writer get a transaction id from the owner of the
> >> > `commit'
> >> > > > log
> >> > > > > > > stream
> >> > > > > > > >> > 1. write the begin control record (synchronously) with
> >> the
> >> > > > > > transaction
> >> > > > > > > >> id
> >> > > > > > > >> > 2. for each write within the same txn, it will be
> >> assigned a
> >> > > > local
> >> > > > > > > >> sequence
> >> > > > > > > >> > number starting from 0. the combination of transaction
> id
> >> > and
> >> > > > > local
> >> > > > > > > >> > sequence number will be used later on by the readers to
> >> > > > > de-duplicate
> >> > > > > > > >> > records.
> >> > > > > > > >> > 3. the commit/abort control record will be written
> based
> >> on
> >> > > the
> >> > > > > > > results
> >> > > > > > > >> > from 2.
> >> > > > > > > >> >
> >> > > > > > > >> > - Application can supply a timeout for the transaction
> >> when
> >> > > > > > #begin() a
> >> > > > > > > >> > transaction. The owner of the `commit` log stream can
> >> abort
> >> > > > > > > transactions
> >> > > > > > > >> > that never be committed/aborted within their timeout.
> >> > > > > > > >> >
> >> > > > > > > >> > - Failures:
> >> > > > > > > >> >
> >> > > > > > > >> > * all the log records can be simply retried as they
> will
> >> be
> >> > > > > > > >> de-duplicated
> >> > > > > > > >> > probably at the reader side.
> >> > > > > > > >> >
> >> > > > > > > >> > - Reader:
> >> > > > > > > >> >
> >> > > > > > > >> > * Reader can be configured to read uncommitted records
> or
> >> > > > > committed
> >> > > > > > > >> records
> >> > > > > > > >> > only (by default read uncommitted records)
> >> > > > > > > >> > * If reader is configured to read committed records
> only,
> >> > the
> >> > > > read
> >> > > > > > > ahead
> >> > > > > > > >> > cache will be changed to maintain one additional
> pending
> >> > > > committed
> >> > > > > > > >> records.
> >> > > > > > > >> > the pending committed records map is bounded and
> records
> >> > will
> >> > > be
> >> > > > > > > dropped
> >> > > > > > > >> > when read ahead is moving.
> >> > > > > > > >> > * when the reader hits a commit record, it will rewind
> to
> >> > the
> >> > > > > begin
> >> > > > > > > >> record
> >> > > > > > > >> > and start reading from there. leveraging the proper
> read
> >> > ahead
> >> > > > > cache
> >> > > > > > > and
> >> > > > > > > >> > pending commit records cache, it would be good for both
> >> > short
> >> > > > > > > >> transactions
> >> > > > > > > >> > and long transactions.
> >> > > > > > > >> >
> >> > > > > > > >> > - DLSN, SequenceId:
> >> > > > > > > >> >
> >> > > > > > > >> > * We will add a fourth field to DLSN. It is `local
> >> sequence
> >> > > > > number`
> >> > > > > > > >> within
> >> > > > > > > >> > a transaction session. So the new DLSN of records in a
> >> > > > transaction
> >> > > > > > > will
> >> > > > > > > >> be
> >> > > > > > > >> > the DLSN of commit control record plus its local
> sequence
> >> > > > number.
> >> > > > > > > >> > * The sequence id will be still the position of the
> >> commit
> >> > > > record
> >> > > > > > plus
> >> > > > > > > >> its
> >> > > > > > > >> > local sequence number. The position will be advanced
> with
> >> > > total
> >> > > > > > number
> >> > > > > > > >> of
> >> > > > > > > >> > written records on writing the commit control record.
> >> > > > > > > >> >
> >> > > > > > > >> > - Transaction Group & Namespace Transaction
> >> > > > > > > >> >
> >> > > > > > > >> > using one single log stream for namespace transaction
> can
> >> > > cause
> >> > > > > the
> >> > > > > > > >> > bottleneck problem since all the begin/commit/end
> control
> >> > > > records
> >> > > > > > will
> >> > > > > > > >> have
> >> > > > > > > >> > to go through one log stream.
> >> > > > > > > >> >
> >> > > > > > > >> > the idea of 'transaction group' is to allow
> partitioning
> >> the
> >> > > > > writers
> >> > > > > > > >> into
> >> > > > > > > >> > different transaction groups.
> >> > > > > > > >> >
> >> > > > > > > >> > clients can specify the `group-name` when starting the
> >> > > > > transaction.
> >> > > > > > if
> >> > > > > > > >> > there is no `group-name` specified, it will use the
> >> default
> >> > > > > `commit`
> >> > > > > > > >> log in
> >> > > > > > > >> > the namespace for creating transactions.
> >> > > > > > > >> >
> >> > > > > > > >> > -------------------------------------------------
> >> > > > > > > >> >
> >> > > > > > > >> > I'd like to collect feedbacks on this idea. Appreciate
> >> any
> >> > > > > comments
> >> > > > > > > and
> >> > > > > > > >> if
> >> > > > > > > >> > anyone is also interested in this idea, we'd like to
> >> > > collaborate
> >> > > > > > with
> >> > > > > > > >> the
> >> > > > > > > >> > community.
> >> > > > > > > >> >
> >> > > > > > > >> >
> >> > > > > > > >> > - Xi
> >> > > > > > > >> >
> >> > > > > > > >>
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
> >
>

Re: [Discuss] Transaction Support

Posted by Xi Liu <xi...@gmail.com>.

Ping?

On Mon, Dec 19, 2016 at 8:28 AM, Xi Liu <xi...@gmail.com> wrote:

> Sijie,
>
> No. I thought it might be easier for people to comment on a google doc to
> gather the initial feedback. I will put the content back to wiki page once
> addressing the comments. Does that sound good to you?
>
> And thank you in advance.
>
> - Xi
>
>
>
> On Sun, Dec 18, 2016 at 8:48 AM, Sijie Guo <si...@apache.org> wrote:
>
>> Hi Xi,
>>
>> sorry for late response. I will review it soon.
>>
>> regarding this, a separate question "are we going to use google doc
>> instead
>> of email thread for any discussion"? I am a bit worried that the
>> discussion
>> will become lost after moving to google doc. No idea on how other apache
>> projects are doing.
>>
>> - Sijie
>>
>> On Wed, Dec 14, 2016 at 11:41 PM, Xi Liu <xi...@gmail.com> wrote:
>>
>> > Hi all,
>> >
>> > I finalized the first version of the design. This time I used a google
>> doc
>> > so that it is easier for commenting and add a link the wiki page. I will
>> > update this to the wiki page once we come to the finalized design.
>> >
>> > https://docs.google.com/document/d/14Ns05M8Z5a6DF6fHmWQwISyD5jjeK
>> > bSIGgSzXuTI5BA/edit
>> >
>> > Let me know if you have any questions. Appreciate your reviews!
>> >
>> > - Xi
>> >
>> >
>> >
>> >
>> >
>> > On Fri, Oct 28, 2016 at 7:58 AM, Leigh Stewart
>> > <lstewart@twitter.com.invalid
>> > > wrote:
>> >
>> > > Interesting proposal. A couple quick notes while you continue to flesh
>> > this
>> > > out.
>> > >
>> > > a. just to be sure - does this eliminate the need to save seqno with
>> > > checkpoint?
>> > >
>> > > b. i.e. another way to describe this kind of improvement is "support
>> > > records (atomic writes) larger than 1MB", iiuc. the advantage being it
>> > > avoids the baggage of transactions. disadvantages include inability
>> to do
>> > > cross stream transactions, and flexibility (interleaving, etc) (are
>> there
>> > > others?).
>> > >
>> > > c. proxy use case is for supporting multiple writers - have you
>> thought
>> > > about how this would work with multiple writers?
>> > >
>> > > Thanks!
>> > >
>> > >
>> > > On Tue, Oct 18, 2016 at 6:45 PM, Sijie Guo <sijieg@twitter.com.invalid
>> >
>> > > wrote:
>> > >
>> > > > Sound good to me. look forward to the detailed proposal.
>> > > >
>> > > > (I don't mind the format if it makes things easier to you)
>> > > >
>> > > > Sijie
>> > > >
>> > > > On Friday, October 14, 2016, Xi Liu <xi...@gmail.com> wrote:
>> > > >
>> > > > > Thank you, Sijie
>> > > > >
>> > > > > We have some internal discussions to sort out some details. We are
>> > > ready
>> > > > to
>> > > > > collaborate with the community for adding the transaction support
>> in
>> > > DL.
>> > > > > We'd like to share more.
>> > > > >
>> > > > > I created a proposal wiki here -
>> > > > > https://cwiki.apache.org/confluence/display/DL/DP-1+-+
>> > > > > DistributedLog+Transaction+Support
>> > > > >
>> > > > > (I followed KIP format and named it as DP (DistributedLog
>> Proposal -
>> > DP
>> > > > is
>> > > > > also short for Dynamic Programming). I don't know if you guys like
>> > this
>> > > > > name or not. Feel free to change it :D)
>> > > > >
>> > > > > I basically put my initial email as the content there so far.
>> Once we
>> > > > > finished our final discussion, I will update with more details. At
>> > the
>> > > > same
>> > > > > time, any comments are welcome.
>> > > > >
>> > > > > - Xi
>> > > > >
>> > > > >
>> > > > >
>> > > > > On Sat, Oct 8, 2016 at 6:58 AM, Sijie Guo <sijie@apache.org
>> > > > <javascript:;>>
>> > > > > wrote:
>> > > > >
>> > > > > > Xi,
>> > > > > >
>> > > > > > I just granted you the edit permission.
>> > > > > >
>> > > > > > - Sijie
>> > > > > >
>> > > > > > On Fri, Oct 7, 2016 at 10:34 AM, Xi Liu <xi.liu.ant@gmail.com
>> > > > > <javascript:;>> wrote:
>> > > > > >
>> > > > > > > I still can not edit the wiki. Can any of the pmc members
>> grant
>> > me
>> > > > the
>> > > > > > > permissions?
>> > > > > > >
>> > > > > > > - Xi
>> > > > > > >
>> > > > > > > On Sat, Sep 17, 2016 at 10:35 PM, Xi Liu <
>> xi.liu.ant@gmail.com
>> > > > > <javascript:;>> wrote:
>> > > > > > >
>> > > > > > > > Sijie,
>> > > > > > > >
>> > > > > > > > I attempted to create a wiki page under that space. I found
>> > that
>> > > I
>> > > > am
>> > > > > > not
>> > > > > > > > authorized with edit permission.
>> > > > > > > >
>> > > > > > > > Can any of the committers grant me the wiki edit
>> permission? My
>> > > > > account
>> > > > > > > is
>> > > > > > > > "xi.liu.ant".
>> > > > > > > >
>> > > > > > > > - Xi
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > On Tue, Sep 13, 2016 at 9:26 AM, Sijie Guo <
>> sijie@apache.org
>> > > > > <javascript:;>> wrote:
>> > > > > > > >
>> > > > > > > >> This sounds interesting ... I will take a closer look and
>> give
>> > > my
>> > > > > > > comments
>> > > > > > > >> later.
>> > > > > > > >>
>> > > > > > > >> At the same time, do you mind creating a wiki page to put
>> your
>> > > > idea
>> > > > > > > there?
>> > > > > > > >> You can add your wiki page under
>> > > > > > > >> https://cwiki.apache.org/confluence/display/DL/Project+
>> > > Proposals
>> > > > > > > >>
>> > > > > > > >> You might need to ask in the dev list to grant the wiki
>> edit
>> > > > > > permissions
>> > > > > > > >> to
>> > > > > > > >> you once you have a wiki account.
>> > > > > > > >>
>> > > > > > > >> - Sijie
>> > > > > > > >>
>> > > > > > > >>
>> > > > > > > >> On Mon, Sep 12, 2016 at 2:20 AM, Xi Liu <
>> xi.liu.ant@gmail.com
>> > > > > <javascript:;>> wrote:
>> > > > > > > >>
>> > > > > > > >> > Hello,
>> > > > > > > >> >
>> > > > > > > >> > I asked the transaction support in distributedlog user
>> group
>> > > two
>> > > > > > > months
>> > > > > > > >> > ago. I want to raise this up again, as we are looking for
>> > > using
>> > > > > > > >> > distributedlog for building a transactional data
>> service. It
>> > > is
>> > > > a
>> > > > > > > major
>> > > > > > > >> > feature that is missing in distributedlog. We have some
>> > ideas
>> > > to
>> > > > > add
>> > > > > > > >> this
>> > > > > > > >> > to distributedlog and want to know if they make sense or
>> > not.
>> > > If
>> > > > > > they
>> > > > > > > >> are
>> > > > > > > >> > good, we'd like to contribute and develop with the
>> > community.
>> > > > > > > >> >
>> > > > > > > >> > Here are the thoughts:
>> > > > > > > >> >
>> > > > > > > >> > -------------------------------------------------
>> > > > > > > >> >
>> > > > > > > >> > From our understanding, DL can provide "at-least-once"
>> > > delivery
>> > > > > > > semantic
>> > > > > > > >> > (if not, please correct me) but not "exactly-once"
>> delivery
>> > > > > > semantic.
>> > > > > > > >> That
>> > > > > > > >> > means that a message can be delivered one or more times
>> if
>> > the
>> > > > > > reader
>> > > > > > > >> > doesn't handle duplicates.
>> > > > > > > >> >
>> > > > > > > >> > The duplicates come from two places, one is at writer
>> side
>> > > (this
>> > > > > > > assumes
>> > > > > > > >> > using write proxy not the core library), while the other
>> one
>> > > is
>> > > > at
>> > > > > > > >> reader
>> > > > > > > >> > side.
>> > > > > > > >> >
>> > > > > > > >> > - writer side: if the client attempts to write a record
>> to
>> > the
>> > > > > write
>> > > > > > > >> > proxies and gets a network error (e.g timeouts) then
>> > retries,
>> > > > the
>> > > > > > > >> retrying
>> > > > > > > >> > will potentially result in duplicates.
>> > > > > > > >> > - reader side:if the reader reads a message from a stream
>> > and
>> > > > then
>> > > > > > > >> crashes,
>> > > > > > > >> > when the reader restarts it would restart from last known
>> > > > position
>> > > > > > > >> (DLSN).
>> > > > > > > >> > If the reader fails after processing a record and before
>> > > > recording
>> > > > > > the
>> > > > > > > >> > position, the processed record will be delivered again.
>> > > > > > > >> >
>> > > > > > > >> > The reader problem can be properly addressed by making
>> use
>> > of
>> > > > the
>> > > > > > > >> sequence
>> > > > > > > >> > numbers of records and doing proper checkpointing. For
>> > > example,
>> > > > in
>> > > > > > > >> > database, it can checkpoint the indexed data with the
>> > sequence
>> > > > > > number
>> > > > > > > of
>> > > > > > > >> > records; in flink, it can checkpoint the state with the
>> > > sequence
>> > > > > > > >> numbers.
>> > > > > > > >> >
>> > > > > > > >> > The writer problem can be addressed by implementing an
>> > > > idempotent
>> > > > > > > >> writer.
>> > > > > > > >> > However, an alternative and more powerful approach is to
>> > > support
>> > > > > > > >> > transactions.
>> > > > > > > >> >
>> > > > > > > >> > *What does transaction mean?*
>> > > > > > > >> >
>> > > > > > > >> > A transaction means a collection of records can be
>> written
>> > > > > > > >> transactionally
>> > > > > > > >> > within a stream or across multiple streams. They will be
>> > > > consumed
>> > > > > by
>> > > > > > > the
>> > > > > > > >> > reader together when a transaction is committed, or will
>> > never
>> > > > be
>> > > > > > > >> consumed
>> > > > > > > >> > by the reader when the transaction is aborted.
>> > > > > > > >> >
>> > > > > > > >> > The transaction will expose following guarantees:
>> > > > > > > >> >
>> > > > > > > >> > - The reader should not be exposed to records written
>> from
>> > > > > > uncommitted
>> > > > > > > >> > transactions (mandatory)
>> > > > > > > >> > - The reader should consume the records in the
>> transaction
>> > > > commit
>> > > > > > > order
>> > > > > > > >> > rather than the record written order (mandatory)
>> > > > > > > >> > - No duplicated records within a transaction (mandatory)
>> > > > > > > >> > - Allow interleaving transactional writes and
>> > > non-transactional
>> > > > > > writes
>> > > > > > > >> > (optional)
>> > > > > > > >> >
>> > > > > > > >> > *Stream Transaction & Namespace Transaction*
>> > > > > > > >> >
>> > > > > > > >> > There will be two types of transaction, one is Stream
>> level
>> > > > > > > transaction
>> > > > > > > >> > (local transaction), while the other one is Namespace
>> level
>> > > > > > > transaction
>> > > > > > > >> > (global transaction).
>> > > > > > > >> >
>> > > > > > > >> > The stream level transaction is a transactional
>> operation on
>> > > > > writing
>> > > > > > > >> > records to one stream; the namespace level transaction
>> is a
>> > > > > > > >> transactional
>> > > > > > > >> > operation on writing records to multiple streams.
>> > > > > > > >> >
>> > > > > > > >> > *Implementation Thoughts*
>> > > > > > > >> >
>> > > > > > > >> > - A transaction is consist of begin control record, a
>> series
>> > > of
>> > > > > data
>> > > > > > > >> > records and commit/abort control record.
>> > > > > > > >> > - The begin/commit/abort control record is written to a
>> > > `commit`
>> > > > > log
>> > > > > > > >> > stream, while the data records will be written to normal
>> > data
>> > > > log
>> > > > > > > >> streams.
>> > > > > > > >> > - The `commit` log stream will be the same log stream for
>> > > > > > stream-level
>> > > > > > > >> > transaction,  while it will be a *system* stream (or
>> > multiple
>> > > > > system
>> > > > > > > >> > streams) for namespace-level transactions.
>> > > > > > > >> > - The transaction code looks like as below:
>> > > > > > > >> >
>> > > > > > > >> > <code>
>> > > > > > > >> >
>> > > > > > > >> > Transaction txn = client.transaction();
>> > > > > > > >> > Future<DLSN> result1 = txn.write(stream-0, record);
>> > > > > > > >> > Future<DLSN> result2 = txn.write(stream-1, record);
>> > > > > > > >> > Future<DLSN> result3 = txn.write(stream-2, record);
>> > > > > > > >> > Future<Pair<DLSN, DLSN>> result = txn.commit();
>> > > > > > > >> >
>> > > > > > > >> > </code>
>> > > > > > > >> >
>> > > > > > > >> > if the txn is committed, all the write futures will be
>> > > satisfied
>> > > > > > with
>> > > > > > > >> their
>> > > > > > > >> > written DLSNs. if the txn is aborted, all the write
>> futures
>> > > will
>> > > > > be
>> > > > > > > >> failed
>> > > > > > > >> > together. there is no partial failure state.
>> > > > > > > >> >
>> > > > > > > >> > - The actually data flow will be:
>> > > > > > > >> >
>> > > > > > > >> > 1. writer get a transaction id from the owner of the
>> > `commit'
>> > > > log
>> > > > > > > stream
>> > > > > > > >> > 1. write the begin control record (synchronously) with
>> the
>> > > > > > transaction
>> > > > > > > >> id
>> > > > > > > >> > 2. for each write within the same txn, it will be
>> assigned a
>> > > > local
>> > > > > > > >> sequence
>> > > > > > > >> > number starting from 0. the combination of transaction id
>> > and
>> > > > > local
>> > > > > > > >> > sequence number will be used later on by the readers to
>> > > > > de-duplicate
>> > > > > > > >> > records.
>> > > > > > > >> > 3. the commit/abort control record will be written based
>> on
>> > > the
>> > > > > > > results
>> > > > > > > >> > from 2.
>> > > > > > > >> >
>> > > > > > > >> > - Application can supply a timeout for the transaction
>> when
>> > > > > > #begin() a
>> > > > > > > >> > transaction. The owner of the `commit` log stream can
>> abort
>> > > > > > > transactions
>> > > > > > > >> > that never be committed/aborted within their timeout.
>> > > > > > > >> >
>> > > > > > > >> > - Failures:
>> > > > > > > >> >
>> > > > > > > >> > * all the log records can be simply retried as they will
>> be
>> > > > > > > >> de-duplicated
>> > > > > > > >> > probably at the reader side.
>> > > > > > > >> >
>> > > > > > > >> > - Reader:
>> > > > > > > >> >
>> > > > > > > >> > * Reader can be configured to read uncommitted records or
>> > > > > committed
>> > > > > > > >> records
>> > > > > > > >> > only (by default read uncommitted records)
>> > > > > > > >> > * If reader is configured to read committed records only,
>> > the
>> > > > read
>> > > > > > > ahead
>> > > > > > > >> > cache will be changed to maintain one additional pending
>> > > > committed
>> > > > > > > >> records.
>> > > > > > > >> > the pending committed records map is bounded and records
>> > will
>> > > be
>> > > > > > > dropped
>> > > > > > > >> > when read ahead is moving.
>> > > > > > > >> > * when the reader hits a commit record, it will rewind to
>> > the
>> > > > > begin
>> > > > > > > >> record
>> > > > > > > >> > and start reading from there. leveraging the proper read
>> > ahead
>> > > > > cache
>> > > > > > > and
>> > > > > > > >> > pending commit records cache, it would be good for both
>> > short
>> > > > > > > >> transactions
>> > > > > > > >> > and long transactions.
>> > > > > > > >> >
>> > > > > > > >> > - DLSN, SequenceId:
>> > > > > > > >> >
>> > > > > > > >> > * We will add a fourth field to DLSN. It is `local
>> sequence
>> > > > > number`
>> > > > > > > >> within
>> > > > > > > >> > a transaction session. So the new DLSN of records in a
>> > > > transaction
>> > > > > > > will
>> > > > > > > >> be
>> > > > > > > >> > the DLSN of commit control record plus its local sequence
>> > > > number.
>> > > > > > > >> > * The sequence id will be still the position of the
>> commit
>> > > > record
>> > > > > > plus
>> > > > > > > >> its
>> > > > > > > >> > local sequence number. The position will be advanced with
>> > > total
>> > > > > > number
>> > > > > > > >> of
>> > > > > > > >> > written records on writing the commit control record.
>> > > > > > > >> >
>> > > > > > > >> > - Transaction Group & Namespace Transaction
>> > > > > > > >> >
>> > > > > > > >> > using one single log stream for namespace transaction can
>> > > cause
>> > > > > the
>> > > > > > > >> > bottleneck problem since all the begin/commit/end control
>> > > > records
>> > > > > > will
>> > > > > > > >> have
>> > > > > > > >> > to go through one log stream.
>> > > > > > > >> >
>> > > > > > > >> > the idea of 'transaction group' is to allow partitioning
>> the
>> > > > > writers
>> > > > > > > >> into
>> > > > > > > >> > different transaction groups.
>> > > > > > > >> >
>> > > > > > > >> > clients can specify the `group-name` when starting the
>> > > > > transaction.
>> > > > > > if
>> > > > > > > >> > there is no `group-name` specified, it will use the
>> default
>> > > > > `commit`
>> > > > > > > >> log in
>> > > > > > > >> > the namespace for creating transactions.
>> > > > > > > >> >
>> > > > > > > >> > -------------------------------------------------
>> > > > > > > >> >
>> > > > > > > >> > I'd like to collect feedbacks on this idea. Appreciate
>> any
>> > > > > comments
>> > > > > > > and
>> > > > > > > >> if
>> > > > > > > >> > anyone is also interested in this idea, we'd like to
>> > > collaborate
>> > > > > > with
>> > > > > > > >> the
>> > > > > > > >> > community.
>> > > > > > > >> >
>> > > > > > > >> >
>> > > > > > > >> > - Xi
>> > > > > > > >> >
>> > > > > > > >>
>> > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
>

Re: [Discuss] Transaction Support

Posted by Xi Liu <xi...@gmail.com>.

Sijie,

No. I thought it might be easier for people to comment on a google doc to
gather the initial feedback. I will put the content back to wiki page once
addressing the comments. Does that sound good to you?

And thank you in advance.

- Xi



On Sun, Dec 18, 2016 at 8:48 AM, Sijie Guo <si...@apache.org> wrote:

> Hi Xi,
>
> sorry for late response. I will review it soon.
>
> regarding this, a separate question "are we going to use google doc instead
> of email thread for any discussion"? I am a bit worried that the discussion
> will become lost after moving to google doc. No idea on how other apache
> projects are doing.
>
> - Sijie
>
> On Wed, Dec 14, 2016 at 11:41 PM, Xi Liu <xi...@gmail.com> wrote:
>
> > Hi all,
> >
> > I finalized the first version of the design. This time I used a google
> doc
> > so that it is easier for commenting and add a link the wiki page. I will
> > update this to the wiki page once we come to the finalized design.
> >
> > https://docs.google.com/document/d/14Ns05M8Z5a6DF6fHmWQwISyD5jjeK
> > bSIGgSzXuTI5BA/edit
> >
> > Let me know if you have any questions. Appreciate your reviews!
> >
> > - Xi
> >
> >
> >
> >
> >
> > On Fri, Oct 28, 2016 at 7:58 AM, Leigh Stewart
> > <lstewart@twitter.com.invalid
> > > wrote:
> >
> > > Interesting proposal. A couple quick notes while you continue to flesh
> > this
> > > out.
> > >
> > > a. just to be sure - does this eliminate the need to save seqno with
> > > checkpoint?
> > >
> > > b. i.e. another way to describe this kind of improvement is "support
> > > records (atomic writes) larger than 1MB", iiuc. the advantage being it
> > > avoids the baggage of transactions. disadvantages include inability to
> do
> > > cross stream transactions, and flexibility (interleaving, etc) (are
> there
> > > others?).
> > >
> > > c. proxy use case is for supporting multiple writers - have you thought
> > > about how this would work with multiple writers?
> > >
> > > Thanks!
> > >
> > >
> > > On Tue, Oct 18, 2016 at 6:45 PM, Sijie Guo <sijieg@twitter.com.invalid
> >
> > > wrote:
> > >
> > > > Sound good to me. look forward to the detailed proposal.
> > > >
> > > > (I don't mind the format if it makes things easier to you)
> > > >
> > > > Sijie
> > > >
> > > > On Friday, October 14, 2016, Xi Liu <xi...@gmail.com> wrote:
> > > >
> > > > > Thank you, Sijie
> > > > >
> > > > > We have some internal discussions to sort out some details. We are
> > > ready
> > > > to
> > > > > collaborate with the community for adding the transaction support
> in
> > > DL.
> > > > > We'd like to share more.
> > > > >
> > > > > I created a proposal wiki here -
> > > > > https://cwiki.apache.org/confluence/display/DL/DP-1+-+
> > > > > DistributedLog+Transaction+Support
> > > > >
> > > > > (I followed KIP format and named it as DP (DistributedLog Proposal
> -
> > DP
> > > > is
> > > > > also short for Dynamic Programming). I don't know if you guys like
> > this
> > > > > name or not. Feel free to change it :D)
> > > > >
> > > > > I basically put my initial email as the content there so far. Once
> we
> > > > > finished our final discussion, I will update with more details. At
> > the
> > > > same
> > > > > time, any comments are welcome.
> > > > >
> > > > > - Xi
> > > > >
> > > > >
> > > > >
> > > > > On Sat, Oct 8, 2016 at 6:58 AM, Sijie Guo <sijie@apache.org
> > > > <javascript:;>>
> > > > > wrote:
> > > > >
> > > > > > Xi,
> > > > > >
> > > > > > I just granted you the edit permission.
> > > > > >
> > > > > > - Sijie
> > > > > >
> > > > > > On Fri, Oct 7, 2016 at 10:34 AM, Xi Liu <xi.liu.ant@gmail.com
> > > > > <javascript:;>> wrote:
> > > > > >
> > > > > > > I still can not edit the wiki. Can any of the pmc members grant
> > me
> > > > the
> > > > > > > permissions?
> > > > > > >
> > > > > > > - Xi
> > > > > > >
> > > > > > > On Sat, Sep 17, 2016 at 10:35 PM, Xi Liu <xi.liu.ant@gmail.com
> > > > > <javascript:;>> wrote:
> > > > > > >
> > > > > > > > Sijie,
> > > > > > > >
> > > > > > > > I attempted to create a wiki page under that space. I found
> > that
> > > I
> > > > am
> > > > > > not
> > > > > > > > authorized with edit permission.
> > > > > > > >
> > > > > > > > Can any of the committers grant me the wiki edit permission?
> My
> > > > > account
> > > > > > > is
> > > > > > > > "xi.liu.ant".
> > > > > > > >
> > > > > > > > - Xi
> > > > > > > >
> > > > > > > >
> > > > > > > > On Tue, Sep 13, 2016 at 9:26 AM, Sijie Guo <sijie@apache.org
> > > > > <javascript:;>> wrote:
> > > > > > > >
> > > > > > > >> This sounds interesting ... I will take a closer look and
> give
> > > my
> > > > > > > comments
> > > > > > > >> later.
> > > > > > > >>
> > > > > > > >> At the same time, do you mind creating a wiki page to put
> your
> > > > idea
> > > > > > > there?
> > > > > > > >> You can add your wiki page under
> > > > > > > >> https://cwiki.apache.org/confluence/display/DL/Project+
> > > Proposals
> > > > > > > >>
> > > > > > > >> You might need to ask in the dev list to grant the wiki edit
> > > > > > permissions
> > > > > > > >> to
> > > > > > > >> you once you have a wiki account.
> > > > > > > >>
> > > > > > > >> - Sijie
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> On Mon, Sep 12, 2016 at 2:20 AM, Xi Liu <
> xi.liu.ant@gmail.com
> > > > > <javascript:;>> wrote:
> > > > > > > >>
> > > > > > > >> > Hello,
> > > > > > > >> >
> > > > > > > >> > I asked the transaction support in distributedlog user
> group
> > > two
> > > > > > > months
> > > > > > > >> > ago. I want to raise this up again, as we are looking for
> > > using
> > > > > > > >> > distributedlog for building a transactional data service.
> It
> > > is
> > > > a
> > > > > > > major
> > > > > > > >> > feature that is missing in distributedlog. We have some
> > ideas
> > > to
> > > > > add
> > > > > > > >> this
> > > > > > > >> > to distributedlog and want to know if they make sense or
> > not.
> > > If
> > > > > > they
> > > > > > > >> are
> > > > > > > >> > good, we'd like to contribute and develop with the
> > community.
> > > > > > > >> >
> > > > > > > >> > Here are the thoughts:
> > > > > > > >> >
> > > > > > > >> > -------------------------------------------------
> > > > > > > >> >
> > > > > > > >> > From our understanding, DL can provide "at-least-once"
> > > delivery
> > > > > > > semantic
> > > > > > > >> > (if not, please correct me) but not "exactly-once"
> delivery
> > > > > > semantic.
> > > > > > > >> That
> > > > > > > >> > means that a message can be delivered one or more times if
> > the
> > > > > > reader
> > > > > > > >> > doesn't handle duplicates.
> > > > > > > >> >
> > > > > > > >> > The duplicates come from two places, one is at writer side
> > > (this
> > > > > > > assumes
> > > > > > > >> > using write proxy not the core library), while the other
> one
> > > is
> > > > at
> > > > > > > >> reader
> > > > > > > >> > side.
> > > > > > > >> >
> > > > > > > >> > - writer side: if the client attempts to write a record to
> > the
> > > > > write
> > > > > > > >> > proxies and gets a network error (e.g timeouts) then
> > retries,
> > > > the
> > > > > > > >> retrying
> > > > > > > >> > will potentially result in duplicates.
> > > > > > > >> > - reader side:if the reader reads a message from a stream
> > and
> > > > then
> > > > > > > >> crashes,
> > > > > > > >> > when the reader restarts it would restart from last known
> > > > position
> > > > > > > >> (DLSN).
> > > > > > > >> > If the reader fails after processing a record and before
> > > > recording
> > > > > > the
> > > > > > > >> > position, the processed record will be delivered again.
> > > > > > > >> >
> > > > > > > >> > The reader problem can be properly addressed by making use
> > of
> > > > the
> > > > > > > >> sequence
> > > > > > > >> > numbers of records and doing proper checkpointing. For
> > > example,
> > > > in
> > > > > > > >> > database, it can checkpoint the indexed data with the
> > sequence
> > > > > > number
> > > > > > > of
> > > > > > > >> > records; in flink, it can checkpoint the state with the
> > > sequence
> > > > > > > >> numbers.
> > > > > > > >> >
> > > > > > > >> > The writer problem can be addressed by implementing an
> > > > idempotent
> > > > > > > >> writer.
> > > > > > > >> > However, an alternative and more powerful approach is to
> > > support
> > > > > > > >> > transactions.
> > > > > > > >> >
> > > > > > > >> > *What does transaction mean?*
> > > > > > > >> >
> > > > > > > >> > A transaction means a collection of records can be written
> > > > > > > >> transactionally
> > > > > > > >> > within a stream or across multiple streams. They will be
> > > > consumed
> > > > > by
> > > > > > > the
> > > > > > > >> > reader together when a transaction is committed, or will
> > never
> > > > be
> > > > > > > >> consumed
> > > > > > > >> > by the reader when the transaction is aborted.
> > > > > > > >> >
> > > > > > > >> > The transaction will expose following guarantees:
> > > > > > > >> >
> > > > > > > >> > - The reader should not be exposed to records written from
> > > > > > uncommitted
> > > > > > > >> > transactions (mandatory)
> > > > > > > >> > - The reader should consume the records in the transaction
> > > > commit
> > > > > > > order
> > > > > > > >> > rather than the record written order (mandatory)
> > > > > > > >> > - No duplicated records within a transaction (mandatory)
> > > > > > > >> > - Allow interleaving transactional writes and
> > > non-transactional
> > > > > > writes
> > > > > > > >> > (optional)
> > > > > > > >> >
> > > > > > > >> > *Stream Transaction & Namespace Transaction*
> > > > > > > >> >
> > > > > > > >> > There will be two types of transaction, one is Stream
> level
> > > > > > > transaction
> > > > > > > >> > (local transaction), while the other one is Namespace
> level
> > > > > > > transaction
> > > > > > > >> > (global transaction).
> > > > > > > >> >
> > > > > > > >> > The stream level transaction is a transactional operation
> on
> > > > > writing
> > > > > > > >> > records to one stream; the namespace level transaction is
> a
> > > > > > > >> transactional
> > > > > > > >> > operation on writing records to multiple streams.
> > > > > > > >> >
> > > > > > > >> > *Implementation Thoughts*
> > > > > > > >> >
> > > > > > > >> > - A transaction is consist of begin control record, a
> series
> > > of
> > > > > data
> > > > > > > >> > records and commit/abort control record.
> > > > > > > >> > - The begin/commit/abort control record is written to a
> > > `commit`
> > > > > log
> > > > > > > >> > stream, while the data records will be written to normal
> > data
> > > > log
> > > > > > > >> streams.
> > > > > > > >> > - The `commit` log stream will be the same log stream for
> > > > > > stream-level
> > > > > > > >> > transaction,  while it will be a *system* stream (or
> > multiple
> > > > > system
> > > > > > > >> > streams) for namespace-level transactions.
> > > > > > > >> > - The transaction code looks like as below:
> > > > > > > >> >
> > > > > > > >> > <code>
> > > > > > > >> >
> > > > > > > >> > Transaction txn = client.transaction();
> > > > > > > >> > Future<DLSN> result1 = txn.write(stream-0, record);
> > > > > > > >> > Future<DLSN> result2 = txn.write(stream-1, record);
> > > > > > > >> > Future<DLSN> result3 = txn.write(stream-2, record);
> > > > > > > >> > Future<Pair<DLSN, DLSN>> result = txn.commit();
> > > > > > > >> >
> > > > > > > >> > </code>
> > > > > > > >> >
> > > > > > > >> > if the txn is committed, all the write futures will be
> > > satisfied
> > > > > > with
> > > > > > > >> their
> > > > > > > >> > written DLSNs. if the txn is aborted, all the write
> futures
> > > will
> > > > > be
> > > > > > > >> failed
> > > > > > > >> > together. there is no partial failure state.
> > > > > > > >> >
> > > > > > > >> > - The actually data flow will be:
> > > > > > > >> >
> > > > > > > >> > 1. writer get a transaction id from the owner of the
> > `commit'
> > > > log
> > > > > > > stream
> > > > > > > >> > 1. write the begin control record (synchronously) with the
> > > > > > transaction
> > > > > > > >> id
> > > > > > > >> > 2. for each write within the same txn, it will be
> assigned a
> > > > local
> > > > > > > >> sequence
> > > > > > > >> > number starting from 0. the combination of transaction id
> > and
> > > > > local
> > > > > > > >> > sequence number will be used later on by the readers to
> > > > > de-duplicate
> > > > > > > >> > records.
> > > > > > > >> > 3. the commit/abort control record will be written based
> on
> > > the
> > > > > > > results
> > > > > > > >> > from 2.
> > > > > > > >> >
> > > > > > > >> > - Application can supply a timeout for the transaction
> when
> > > > > > #begin() a
> > > > > > > >> > transaction. The owner of the `commit` log stream can
> abort
> > > > > > > transactions
> > > > > > > >> > that never be committed/aborted within their timeout.
> > > > > > > >> >
> > > > > > > >> > - Failures:
> > > > > > > >> >
> > > > > > > >> > * all the log records can be simply retried as they will
> be
> > > > > > > >> de-duplicated
> > > > > > > >> > probably at the reader side.
> > > > > > > >> >
> > > > > > > >> > - Reader:
> > > > > > > >> >
> > > > > > > >> > * Reader can be configured to read uncommitted records or
> > > > > committed
> > > > > > > >> records
> > > > > > > >> > only (by default read uncommitted records)
> > > > > > > >> > * If reader is configured to read committed records only,
> > the
> > > > read
> > > > > > > ahead
> > > > > > > >> > cache will be changed to maintain one additional pending
> > > > committed
> > > > > > > >> records.
> > > > > > > >> > the pending committed records map is bounded and records
> > will
> > > be
> > > > > > > dropped
> > > > > > > >> > when read ahead is moving.
> > > > > > > >> > * when the reader hits a commit record, it will rewind to
> > the
> > > > > begin
> > > > > > > >> record
> > > > > > > >> > and start reading from there. leveraging the proper read
> > ahead
> > > > > cache
> > > > > > > and
> > > > > > > >> > pending commit records cache, it would be good for both
> > short
> > > > > > > >> transactions
> > > > > > > >> > and long transactions.
> > > > > > > >> >
> > > > > > > >> > - DLSN, SequenceId:
> > > > > > > >> >
> > > > > > > >> > * We will add a fourth field to DLSN. It is `local
> sequence
> > > > > number`
> > > > > > > >> within
> > > > > > > >> > a transaction session. So the new DLSN of records in a
> > > > transaction
> > > > > > > will
> > > > > > > >> be
> > > > > > > >> > the DLSN of commit control record plus its local sequence
> > > > number.
> > > > > > > >> > * The sequence id will be still the position of the commit
> > > > record
> > > > > > plus
> > > > > > > >> its
> > > > > > > >> > local sequence number. The position will be advanced with
> > > total
> > > > > > number
> > > > > > > >> of
> > > > > > > >> > written records on writing the commit control record.
> > > > > > > >> >
> > > > > > > >> > - Transaction Group & Namespace Transaction
> > > > > > > >> >
> > > > > > > >> > using one single log stream for namespace transaction can
> > > cause
> > > > > the
> > > > > > > >> > bottleneck problem since all the begin/commit/end control
> > > > records
> > > > > > will
> > > > > > > >> have
> > > > > > > >> > to go through one log stream.
> > > > > > > >> >
> > > > > > > >> > the idea of 'transaction group' is to allow partitioning
> the
> > > > > writers
> > > > > > > >> into
> > > > > > > >> > different transaction groups.
> > > > > > > >> >
> > > > > > > >> > clients can specify the `group-name` when starting the
> > > > > transaction.
> > > > > > if
> > > > > > > >> > there is no `group-name` specified, it will use the
> default
> > > > > `commit`
> > > > > > > >> log in
> > > > > > > >> > the namespace for creating transactions.
> > > > > > > >> >
> > > > > > > >> > -------------------------------------------------
> > > > > > > >> >
> > > > > > > >> > I'd like to collect feedbacks on this idea. Appreciate any
> > > > > comments
> > > > > > > and
> > > > > > > >> if
> > > > > > > >> > anyone is also interested in this idea, we'd like to
> > > collaborate
> > > > > > with
> > > > > > > >> the
> > > > > > > >> > community.
> > > > > > > >> >
> > > > > > > >> >
> > > > > > > >> > - Xi
> > > > > > > >> >
> > > > > > > >>
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [Discuss] Transaction Support

Posted by Xi Liu <xi...@gmail.com>.

Thanks liang.

- Xi

On Sun, Dec 18, 2016 at 5:52 PM, liang xie <xi...@gmail.com> wrote:

> The developers upload the design doc onto JIRA at least for
> HADOOP/HBase/Cassandra/... projects
>
> On Mon, Dec 19, 2016 at 12:48 AM, Sijie Guo <si...@apache.org> wrote:
> > Hi Xi,
> >
> > sorry for late response. I will review it soon.
> >
> > regarding this, a separate question "are we going to use google doc
> instead
> > of email thread for any discussion"? I am a bit worried that the
> discussion
> > will become lost after moving to google doc. No idea on how other apache
> > projects are doing.
> >
> > - Sijie
> >
> > On Wed, Dec 14, 2016 at 11:41 PM, Xi Liu <xi...@gmail.com> wrote:
> >
> >> Hi all,
> >>
> >> I finalized the first version of the design. This time I used a google
> doc
> >> so that it is easier for commenting and add a link the wiki page. I will
> >> update this to the wiki page once we come to the finalized design.
> >>
> >> https://docs.google.com/document/d/14Ns05M8Z5a6DF6fHmWQwISyD5jjeK
> >> bSIGgSzXuTI5BA/edit
> >>
> >> Let me know if you have any questions. Appreciate your reviews!
> >>
> >> - Xi
> >>
> >>
> >>
> >>
> >>
> >> On Fri, Oct 28, 2016 at 7:58 AM, Leigh Stewart
> >> <lstewart@twitter.com.invalid
> >> > wrote:
> >>
> >> > Interesting proposal. A couple quick notes while you continue to flesh
> >> this
> >> > out.
> >> >
> >> > a. just to be sure - does this eliminate the need to save seqno with
> >> > checkpoint?
> >> >
> >> > b. i.e. another way to describe this kind of improvement is "support
> >> > records (atomic writes) larger than 1MB", iiuc. the advantage being it
> >> > avoids the baggage of transactions. disadvantages include inability
> to do
> >> > cross stream transactions, and flexibility (interleaving, etc) (are
> there
> >> > others?).
> >> >
> >> > c. proxy use case is for supporting multiple writers - have you
> thought
> >> > about how this would work with multiple writers?
> >> >
> >> > Thanks!
> >> >
> >> >
> >> > On Tue, Oct 18, 2016 at 6:45 PM, Sijie Guo <sijieg@twitter.com.invalid
> >
> >> > wrote:
> >> >
> >> > > Sound good to me. look forward to the detailed proposal.
> >> > >
> >> > > (I don't mind the format if it makes things easier to you)
> >> > >
> >> > > Sijie
> >> > >
> >> > > On Friday, October 14, 2016, Xi Liu <xi...@gmail.com> wrote:
> >> > >
> >> > > > Thank you, Sijie
> >> > > >
> >> > > > We have some internal discussions to sort out some details. We are
> >> > ready
> >> > > to
> >> > > > collaborate with the community for adding the transaction support
> in
> >> > DL.
> >> > > > We'd like to share more.
> >> > > >
> >> > > > I created a proposal wiki here -
> >> > > > https://cwiki.apache.org/confluence/display/DL/DP-1+-+
> >> > > > DistributedLog+Transaction+Support
> >> > > >
> >> > > > (I followed KIP format and named it as DP (DistributedLog
> Proposal -
> >> DP
> >> > > is
> >> > > > also short for Dynamic Programming). I don't know if you guys like
> >> this
> >> > > > name or not. Feel free to change it :D)
> >> > > >
> >> > > > I basically put my initial email as the content there so far.
> Once we
> >> > > > finished our final discussion, I will update with more details. At
> >> the
> >> > > same
> >> > > > time, any comments are welcome.
> >> > > >
> >> > > > - Xi
> >> > > >
> >> > > >
> >> > > >
> >> > > > On Sat, Oct 8, 2016 at 6:58 AM, Sijie Guo <sijie@apache.org
> >> > > <javascript:;>>
> >> > > > wrote:
> >> > > >
> >> > > > > Xi,
> >> > > > >
> >> > > > > I just granted you the edit permission.
> >> > > > >
> >> > > > > - Sijie
> >> > > > >
> >> > > > > On Fri, Oct 7, 2016 at 10:34 AM, Xi Liu <xi.liu.ant@gmail.com
> >> > > > <javascript:;>> wrote:
> >> > > > >
> >> > > > > > I still can not edit the wiki. Can any of the pmc members
> grant
> >> me
> >> > > the
> >> > > > > > permissions?
> >> > > > > >
> >> > > > > > - Xi
> >> > > > > >
> >> > > > > > On Sat, Sep 17, 2016 at 10:35 PM, Xi Liu <
> xi.liu.ant@gmail.com
> >> > > > <javascript:;>> wrote:
> >> > > > > >
> >> > > > > > > Sijie,
> >> > > > > > >
> >> > > > > > > I attempted to create a wiki page under that space. I found
> >> that
> >> > I
> >> > > am
> >> > > > > not
> >> > > > > > > authorized with edit permission.
> >> > > > > > >
> >> > > > > > > Can any of the committers grant me the wiki edit
> permission? My
> >> > > > account
> >> > > > > > is
> >> > > > > > > "xi.liu.ant".
> >> > > > > > >
> >> > > > > > > - Xi
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > On Tue, Sep 13, 2016 at 9:26 AM, Sijie Guo <
> sijie@apache.org
> >> > > > <javascript:;>> wrote:
> >> > > > > > >
> >> > > > > > >> This sounds interesting ... I will take a closer look and
> give
> >> > my
> >> > > > > > comments
> >> > > > > > >> later.
> >> > > > > > >>
> >> > > > > > >> At the same time, do you mind creating a wiki page to put
> your
> >> > > idea
> >> > > > > > there?
> >> > > > > > >> You can add your wiki page under
> >> > > > > > >> https://cwiki.apache.org/confluence/display/DL/Project+
> >> > Proposals
> >> > > > > > >>
> >> > > > > > >> You might need to ask in the dev list to grant the wiki
> edit
> >> > > > > permissions
> >> > > > > > >> to
> >> > > > > > >> you once you have a wiki account.
> >> > > > > > >>
> >> > > > > > >> - Sijie
> >> > > > > > >>
> >> > > > > > >>
> >> > > > > > >> On Mon, Sep 12, 2016 at 2:20 AM, Xi Liu <
> xi.liu.ant@gmail.com
> >> > > > <javascript:;>> wrote:
> >> > > > > > >>
> >> > > > > > >> > Hello,
> >> > > > > > >> >
> >> > > > > > >> > I asked the transaction support in distributedlog user
> group
> >> > two
> >> > > > > > months
> >> > > > > > >> > ago. I want to raise this up again, as we are looking for
> >> > using
> >> > > > > > >> > distributedlog for building a transactional data
> service. It
> >> > is
> >> > > a
> >> > > > > > major
> >> > > > > > >> > feature that is missing in distributedlog. We have some
> >> ideas
> >> > to
> >> > > > add
> >> > > > > > >> this
> >> > > > > > >> > to distributedlog and want to know if they make sense or
> >> not.
> >> > If
> >> > > > > they
> >> > > > > > >> are
> >> > > > > > >> > good, we'd like to contribute and develop with the
> >> community.
> >> > > > > > >> >
> >> > > > > > >> > Here are the thoughts:
> >> > > > > > >> >
> >> > > > > > >> > -------------------------------------------------
> >> > > > > > >> >
> >> > > > > > >> > From our understanding, DL can provide "at-least-once"
> >> > delivery
> >> > > > > > semantic
> >> > > > > > >> > (if not, please correct me) but not "exactly-once"
> delivery
> >> > > > > semantic.
> >> > > > > > >> That
> >> > > > > > >> > means that a message can be delivered one or more times
> if
> >> the
> >> > > > > reader
> >> > > > > > >> > doesn't handle duplicates.
> >> > > > > > >> >
> >> > > > > > >> > The duplicates come from two places, one is at writer
> side
> >> > (this
> >> > > > > > assumes
> >> > > > > > >> > using write proxy not the core library), while the other
> one
> >> > is
> >> > > at
> >> > > > > > >> reader
> >> > > > > > >> > side.
> >> > > > > > >> >
> >> > > > > > >> > - writer side: if the client attempts to write a record
> to
> >> the
> >> > > > write
> >> > > > > > >> > proxies and gets a network error (e.g timeouts) then
> >> retries,
> >> > > the
> >> > > > > > >> retrying
> >> > > > > > >> > will potentially result in duplicates.
> >> > > > > > >> > - reader side:if the reader reads a message from a stream
> >> and
> >> > > then
> >> > > > > > >> crashes,
> >> > > > > > >> > when the reader restarts it would restart from last known
> >> > > position
> >> > > > > > >> (DLSN).
> >> > > > > > >> > If the reader fails after processing a record and before
> >> > > recording
> >> > > > > the
> >> > > > > > >> > position, the processed record will be delivered again.
> >> > > > > > >> >
> >> > > > > > >> > The reader problem can be properly addressed by making
> use
> >> of
> >> > > the
> >> > > > > > >> sequence
> >> > > > > > >> > numbers of records and doing proper checkpointing. For
> >> > example,
> >> > > in
> >> > > > > > >> > database, it can checkpoint the indexed data with the
> >> sequence
> >> > > > > number
> >> > > > > > of
> >> > > > > > >> > records; in flink, it can checkpoint the state with the
> >> > sequence
> >> > > > > > >> numbers.
> >> > > > > > >> >
> >> > > > > > >> > The writer problem can be addressed by implementing an
> >> > > idempotent
> >> > > > > > >> writer.
> >> > > > > > >> > However, an alternative and more powerful approach is to
> >> > support
> >> > > > > > >> > transactions.
> >> > > > > > >> >
> >> > > > > > >> > *What does transaction mean?*
> >> > > > > > >> >
> >> > > > > > >> > A transaction means a collection of records can be
> written
> >> > > > > > >> transactionally
> >> > > > > > >> > within a stream or across multiple streams. They will be
> >> > > consumed
> >> > > > by
> >> > > > > > the
> >> > > > > > >> > reader together when a transaction is committed, or will
> >> never
> >> > > be
> >> > > > > > >> consumed
> >> > > > > > >> > by the reader when the transaction is aborted.
> >> > > > > > >> >
> >> > > > > > >> > The transaction will expose following guarantees:
> >> > > > > > >> >
> >> > > > > > >> > - The reader should not be exposed to records written
> from
> >> > > > > uncommitted
> >> > > > > > >> > transactions (mandatory)
> >> > > > > > >> > - The reader should consume the records in the
> transaction
> >> > > commit
> >> > > > > > order
> >> > > > > > >> > rather than the record written order (mandatory)
> >> > > > > > >> > - No duplicated records within a transaction (mandatory)
> >> > > > > > >> > - Allow interleaving transactional writes and
> >> > non-transactional
> >> > > > > writes
> >> > > > > > >> > (optional)
> >> > > > > > >> >
> >> > > > > > >> > *Stream Transaction & Namespace Transaction*
> >> > > > > > >> >
> >> > > > > > >> > There will be two types of transaction, one is Stream
> level
> >> > > > > > transaction
> >> > > > > > >> > (local transaction), while the other one is Namespace
> level
> >> > > > > > transaction
> >> > > > > > >> > (global transaction).
> >> > > > > > >> >
> >> > > > > > >> > The stream level transaction is a transactional
> operation on
> >> > > > writing
> >> > > > > > >> > records to one stream; the namespace level transaction
> is a
> >> > > > > > >> transactional
> >> > > > > > >> > operation on writing records to multiple streams.
> >> > > > > > >> >
> >> > > > > > >> > *Implementation Thoughts*
> >> > > > > > >> >
> >> > > > > > >> > - A transaction is consist of begin control record, a
> series
> >> > of
> >> > > > data
> >> > > > > > >> > records and commit/abort control record.
> >> > > > > > >> > - The begin/commit/abort control record is written to a
> >> > `commit`
> >> > > > log
> >> > > > > > >> > stream, while the data records will be written to normal
> >> data
> >> > > log
> >> > > > > > >> streams.
> >> > > > > > >> > - The `commit` log stream will be the same log stream for
> >> > > > > stream-level
> >> > > > > > >> > transaction,  while it will be a *system* stream (or
> >> multiple
> >> > > > system
> >> > > > > > >> > streams) for namespace-level transactions.
> >> > > > > > >> > - The transaction code looks like as below:
> >> > > > > > >> >
> >> > > > > > >> > <code>
> >> > > > > > >> >
> >> > > > > > >> > Transaction txn = client.transaction();
> >> > > > > > >> > Future<DLSN> result1 = txn.write(stream-0, record);
> >> > > > > > >> > Future<DLSN> result2 = txn.write(stream-1, record);
> >> > > > > > >> > Future<DLSN> result3 = txn.write(stream-2, record);
> >> > > > > > >> > Future<Pair<DLSN, DLSN>> result = txn.commit();
> >> > > > > > >> >
> >> > > > > > >> > </code>
> >> > > > > > >> >
> >> > > > > > >> > if the txn is committed, all the write futures will be
> >> > satisfied
> >> > > > > with
> >> > > > > > >> their
> >> > > > > > >> > written DLSNs. if the txn is aborted, all the write
> futures
> >> > will
> >> > > > be
> >> > > > > > >> failed
> >> > > > > > >> > together. there is no partial failure state.
> >> > > > > > >> >
> >> > > > > > >> > - The actually data flow will be:
> >> > > > > > >> >
> >> > > > > > >> > 1. writer get a transaction id from the owner of the
> >> `commit'
> >> > > log
> >> > > > > > stream
> >> > > > > > >> > 1. write the begin control record (synchronously) with
> the
> >> > > > > transaction
> >> > > > > > >> id
> >> > > > > > >> > 2. for each write within the same txn, it will be
> assigned a
> >> > > local
> >> > > > > > >> sequence
> >> > > > > > >> > number starting from 0. the combination of transaction id
> >> and
> >> > > > local
> >> > > > > > >> > sequence number will be used later on by the readers to
> >> > > > de-duplicate
> >> > > > > > >> > records.
> >> > > > > > >> > 3. the commit/abort control record will be written based
> on
> >> > the
> >> > > > > > results
> >> > > > > > >> > from 2.
> >> > > > > > >> >
> >> > > > > > >> > - Application can supply a timeout for the transaction
> when
> >> > > > > #begin() a
> >> > > > > > >> > transaction. The owner of the `commit` log stream can
> abort
> >> > > > > > transactions
> >> > > > > > >> > that never be committed/aborted within their timeout.
> >> > > > > > >> >
> >> > > > > > >> > - Failures:
> >> > > > > > >> >
> >> > > > > > >> > * all the log records can be simply retried as they will
> be
> >> > > > > > >> de-duplicated
> >> > > > > > >> > probably at the reader side.
> >> > > > > > >> >
> >> > > > > > >> > - Reader:
> >> > > > > > >> >
> >> > > > > > >> > * Reader can be configured to read uncommitted records or
> >> > > > committed
> >> > > > > > >> records
> >> > > > > > >> > only (by default read uncommitted records)
> >> > > > > > >> > * If reader is configured to read committed records only,
> >> the
> >> > > read
> >> > > > > > ahead
> >> > > > > > >> > cache will be changed to maintain one additional pending
> >> > > committed
> >> > > > > > >> records.
> >> > > > > > >> > the pending committed records map is bounded and records
> >> will
> >> > be
> >> > > > > > dropped
> >> > > > > > >> > when read ahead is moving.
> >> > > > > > >> > * when the reader hits a commit record, it will rewind to
> >> the
> >> > > > begin
> >> > > > > > >> record
> >> > > > > > >> > and start reading from there. leveraging the proper read
> >> ahead
> >> > > > cache
> >> > > > > > and
> >> > > > > > >> > pending commit records cache, it would be good for both
> >> short
> >> > > > > > >> transactions
> >> > > > > > >> > and long transactions.
> >> > > > > > >> >
> >> > > > > > >> > - DLSN, SequenceId:
> >> > > > > > >> >
> >> > > > > > >> > * We will add a fourth field to DLSN. It is `local
> sequence
> >> > > > number`
> >> > > > > > >> within
> >> > > > > > >> > a transaction session. So the new DLSN of records in a
> >> > > transaction
> >> > > > > > will
> >> > > > > > >> be
> >> > > > > > >> > the DLSN of commit control record plus its local sequence
> >> > > number.
> >> > > > > > >> > * The sequence id will be still the position of the
> commit
> >> > > record
> >> > > > > plus
> >> > > > > > >> its
> >> > > > > > >> > local sequence number. The position will be advanced with
> >> > total
> >> > > > > number
> >> > > > > > >> of
> >> > > > > > >> > written records on writing the commit control record.
> >> > > > > > >> >
> >> > > > > > >> > - Transaction Group & Namespace Transaction
> >> > > > > > >> >
> >> > > > > > >> > using one single log stream for namespace transaction can
> >> > cause
> >> > > > the
> >> > > > > > >> > bottleneck problem since all the begin/commit/end control
> >> > > records
> >> > > > > will
> >> > > > > > >> have
> >> > > > > > >> > to go through one log stream.
> >> > > > > > >> >
> >> > > > > > >> > the idea of 'transaction group' is to allow partitioning
> the
> >> > > > writers
> >> > > > > > >> into
> >> > > > > > >> > different transaction groups.
> >> > > > > > >> >
> >> > > > > > >> > clients can specify the `group-name` when starting the
> >> > > > transaction.
> >> > > > > if
> >> > > > > > >> > there is no `group-name` specified, it will use the
> default
> >> > > > `commit`
> >> > > > > > >> log in
> >> > > > > > >> > the namespace for creating transactions.
> >> > > > > > >> >
> >> > > > > > >> > -------------------------------------------------
> >> > > > > > >> >
> >> > > > > > >> > I'd like to collect feedbacks on this idea. Appreciate
> any
> >> > > > comments
> >> > > > > > and
> >> > > > > > >> if
> >> > > > > > >> > anyone is also interested in this idea, we'd like to
> >> > collaborate
> >> > > > > with
> >> > > > > > >> the
> >> > > > > > >> > community.
> >> > > > > > >> >
> >> > > > > > >> >
> >> > > > > > >> > - Xi
> >> > > > > > >> >
> >> > > > > > >>
> >> > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
>

Re: [Discuss] Transaction Support

Posted by liang xie <xi...@gmail.com>.

The developers upload the design doc onto JIRA at least for
HADOOP/HBase/Cassandra/... projects

On Mon, Dec 19, 2016 at 12:48 AM, Sijie Guo <si...@apache.org> wrote:
> Hi Xi,
>
> sorry for late response. I will review it soon.
>
> regarding this, a separate question "are we going to use google doc instead
> of email thread for any discussion"? I am a bit worried that the discussion
> will become lost after moving to google doc. No idea on how other apache
> projects are doing.
>
> - Sijie
>
> On Wed, Dec 14, 2016 at 11:41 PM, Xi Liu <xi...@gmail.com> wrote:
>
>> Hi all,
>>
>> I finalized the first version of the design. This time I used a google doc
>> so that it is easier for commenting and add a link the wiki page. I will
>> update this to the wiki page once we come to the finalized design.
>>
>> https://docs.google.com/document/d/14Ns05M8Z5a6DF6fHmWQwISyD5jjeK
>> bSIGgSzXuTI5BA/edit
>>
>> Let me know if you have any questions. Appreciate your reviews!
>>
>> - Xi
>>
>>
>>
>>
>>
>> On Fri, Oct 28, 2016 at 7:58 AM, Leigh Stewart
>> <lstewart@twitter.com.invalid
>> > wrote:
>>
>> > Interesting proposal. A couple quick notes while you continue to flesh
>> this
>> > out.
>> >
>> > a. just to be sure - does this eliminate the need to save seqno with
>> > checkpoint?
>> >
>> > b. i.e. another way to describe this kind of improvement is "support
>> > records (atomic writes) larger than 1MB", iiuc. the advantage being it
>> > avoids the baggage of transactions. disadvantages include inability to do
>> > cross stream transactions, and flexibility (interleaving, etc) (are there
>> > others?).
>> >
>> > c. proxy use case is for supporting multiple writers - have you thought
>> > about how this would work with multiple writers?
>> >
>> > Thanks!
>> >
>> >
>> > On Tue, Oct 18, 2016 at 6:45 PM, Sijie Guo <si...@twitter.com.invalid>
>> > wrote:
>> >
>> > > Sound good to me. look forward to the detailed proposal.
>> > >
>> > > (I don't mind the format if it makes things easier to you)
>> > >
>> > > Sijie
>> > >
>> > > On Friday, October 14, 2016, Xi Liu <xi...@gmail.com> wrote:
>> > >
>> > > > Thank you, Sijie
>> > > >
>> > > > We have some internal discussions to sort out some details. We are
>> > ready
>> > > to
>> > > > collaborate with the community for adding the transaction support in
>> > DL.
>> > > > We'd like to share more.
>> > > >
>> > > > I created a proposal wiki here -
>> > > > https://cwiki.apache.org/confluence/display/DL/DP-1+-+
>> > > > DistributedLog+Transaction+Support
>> > > >
>> > > > (I followed KIP format and named it as DP (DistributedLog Proposal -
>> DP
>> > > is
>> > > > also short for Dynamic Programming). I don't know if you guys like
>> this
>> > > > name or not. Feel free to change it :D)
>> > > >
>> > > > I basically put my initial email as the content there so far. Once we
>> > > > finished our final discussion, I will update with more details. At
>> the
>> > > same
>> > > > time, any comments are welcome.
>> > > >
>> > > > - Xi
>> > > >
>> > > >
>> > > >
>> > > > On Sat, Oct 8, 2016 at 6:58 AM, Sijie Guo <sijie@apache.org
>> > > <javascript:;>>
>> > > > wrote:
>> > > >
>> > > > > Xi,
>> > > > >
>> > > > > I just granted you the edit permission.
>> > > > >
>> > > > > - Sijie
>> > > > >
>> > > > > On Fri, Oct 7, 2016 at 10:34 AM, Xi Liu <xi.liu.ant@gmail.com
>> > > > <javascript:;>> wrote:
>> > > > >
>> > > > > > I still can not edit the wiki. Can any of the pmc members grant
>> me
>> > > the
>> > > > > > permissions?
>> > > > > >
>> > > > > > - Xi
>> > > > > >
>> > > > > > On Sat, Sep 17, 2016 at 10:35 PM, Xi Liu <xi.liu.ant@gmail.com
>> > > > <javascript:;>> wrote:
>> > > > > >
>> > > > > > > Sijie,
>> > > > > > >
>> > > > > > > I attempted to create a wiki page under that space. I found
>> that
>> > I
>> > > am
>> > > > > not
>> > > > > > > authorized with edit permission.
>> > > > > > >
>> > > > > > > Can any of the committers grant me the wiki edit permission? My
>> > > > account
>> > > > > > is
>> > > > > > > "xi.liu.ant".
>> > > > > > >
>> > > > > > > - Xi
>> > > > > > >
>> > > > > > >
>> > > > > > > On Tue, Sep 13, 2016 at 9:26 AM, Sijie Guo <sijie@apache.org
>> > > > <javascript:;>> wrote:
>> > > > > > >
>> > > > > > >> This sounds interesting ... I will take a closer look and give
>> > my
>> > > > > > comments
>> > > > > > >> later.
>> > > > > > >>
>> > > > > > >> At the same time, do you mind creating a wiki page to put your
>> > > idea
>> > > > > > there?
>> > > > > > >> You can add your wiki page under
>> > > > > > >> https://cwiki.apache.org/confluence/display/DL/Project+
>> > Proposals
>> > > > > > >>
>> > > > > > >> You might need to ask in the dev list to grant the wiki edit
>> > > > > permissions
>> > > > > > >> to
>> > > > > > >> you once you have a wiki account.
>> > > > > > >>
>> > > > > > >> - Sijie
>> > > > > > >>
>> > > > > > >>
>> > > > > > >> On Mon, Sep 12, 2016 at 2:20 AM, Xi Liu <xi.liu.ant@gmail.com
>> > > > <javascript:;>> wrote:
>> > > > > > >>
>> > > > > > >> > Hello,
>> > > > > > >> >
>> > > > > > >> > I asked the transaction support in distributedlog user group
>> > two
>> > > > > > months
>> > > > > > >> > ago. I want to raise this up again, as we are looking for
>> > using
>> > > > > > >> > distributedlog for building a transactional data service. It
>> > is
>> > > a
>> > > > > > major
>> > > > > > >> > feature that is missing in distributedlog. We have some
>> ideas
>> > to
>> > > > add
>> > > > > > >> this
>> > > > > > >> > to distributedlog and want to know if they make sense or
>> not.
>> > If
>> > > > > they
>> > > > > > >> are
>> > > > > > >> > good, we'd like to contribute and develop with the
>> community.
>> > > > > > >> >
>> > > > > > >> > Here are the thoughts:
>> > > > > > >> >
>> > > > > > >> > -------------------------------------------------
>> > > > > > >> >
>> > > > > > >> > From our understanding, DL can provide "at-least-once"
>> > delivery
>> > > > > > semantic
>> > > > > > >> > (if not, please correct me) but not "exactly-once" delivery
>> > > > > semantic.
>> > > > > > >> That
>> > > > > > >> > means that a message can be delivered one or more times if
>> the
>> > > > > reader
>> > > > > > >> > doesn't handle duplicates.
>> > > > > > >> >
>> > > > > > >> > The duplicates come from two places, one is at writer side
>> > (this
>> > > > > > assumes
>> > > > > > >> > using write proxy not the core library), while the other one
>> > is
>> > > at
>> > > > > > >> reader
>> > > > > > >> > side.
>> > > > > > >> >
>> > > > > > >> > - writer side: if the client attempts to write a record to
>> the
>> > > > write
>> > > > > > >> > proxies and gets a network error (e.g timeouts) then
>> retries,
>> > > the
>> > > > > > >> retrying
>> > > > > > >> > will potentially result in duplicates.
>> > > > > > >> > - reader side:if the reader reads a message from a stream
>> and
>> > > then
>> > > > > > >> crashes,
>> > > > > > >> > when the reader restarts it would restart from last known
>> > > position
>> > > > > > >> (DLSN).
>> > > > > > >> > If the reader fails after processing a record and before
>> > > recording
>> > > > > the
>> > > > > > >> > position, the processed record will be delivered again.
>> > > > > > >> >
>> > > > > > >> > The reader problem can be properly addressed by making use
>> of
>> > > the
>> > > > > > >> sequence
>> > > > > > >> > numbers of records and doing proper checkpointing. For
>> > example,
>> > > in
>> > > > > > >> > database, it can checkpoint the indexed data with the
>> sequence
>> > > > > number
>> > > > > > of
>> > > > > > >> > records; in flink, it can checkpoint the state with the
>> > sequence
>> > > > > > >> numbers.
>> > > > > > >> >
>> > > > > > >> > The writer problem can be addressed by implementing an
>> > > idempotent
>> > > > > > >> writer.
>> > > > > > >> > However, an alternative and more powerful approach is to
>> > support
>> > > > > > >> > transactions.
>> > > > > > >> >
>> > > > > > >> > *What does transaction mean?*
>> > > > > > >> >
>> > > > > > >> > A transaction means a collection of records can be written
>> > > > > > >> transactionally
>> > > > > > >> > within a stream or across multiple streams. They will be
>> > > consumed
>> > > > by
>> > > > > > the
>> > > > > > >> > reader together when a transaction is committed, or will
>> never
>> > > be
>> > > > > > >> consumed
>> > > > > > >> > by the reader when the transaction is aborted.
>> > > > > > >> >
>> > > > > > >> > The transaction will expose following guarantees:
>> > > > > > >> >
>> > > > > > >> > - The reader should not be exposed to records written from
>> > > > > uncommitted
>> > > > > > >> > transactions (mandatory)
>> > > > > > >> > - The reader should consume the records in the transaction
>> > > commit
>> > > > > > order
>> > > > > > >> > rather than the record written order (mandatory)
>> > > > > > >> > - No duplicated records within a transaction (mandatory)
>> > > > > > >> > - Allow interleaving transactional writes and
>> > non-transactional
>> > > > > writes
>> > > > > > >> > (optional)
>> > > > > > >> >
>> > > > > > >> > *Stream Transaction & Namespace Transaction*
>> > > > > > >> >
>> > > > > > >> > There will be two types of transaction, one is Stream level
>> > > > > > transaction
>> > > > > > >> > (local transaction), while the other one is Namespace level
>> > > > > > transaction
>> > > > > > >> > (global transaction).
>> > > > > > >> >
>> > > > > > >> > The stream level transaction is a transactional operation on
>> > > > writing
>> > > > > > >> > records to one stream; the namespace level transaction is a
>> > > > > > >> transactional
>> > > > > > >> > operation on writing records to multiple streams.
>> > > > > > >> >
>> > > > > > >> > *Implementation Thoughts*
>> > > > > > >> >
>> > > > > > >> > - A transaction is consist of begin control record, a series
>> > of
>> > > > data
>> > > > > > >> > records and commit/abort control record.
>> > > > > > >> > - The begin/commit/abort control record is written to a
>> > `commit`
>> > > > log
>> > > > > > >> > stream, while the data records will be written to normal
>> data
>> > > log
>> > > > > > >> streams.
>> > > > > > >> > - The `commit` log stream will be the same log stream for
>> > > > > stream-level
>> > > > > > >> > transaction,  while it will be a *system* stream (or
>> multiple
>> > > > system
>> > > > > > >> > streams) for namespace-level transactions.
>> > > > > > >> > - The transaction code looks like as below:
>> > > > > > >> >
>> > > > > > >> > <code>
>> > > > > > >> >
>> > > > > > >> > Transaction txn = client.transaction();
>> > > > > > >> > Future<DLSN> result1 = txn.write(stream-0, record);
>> > > > > > >> > Future<DLSN> result2 = txn.write(stream-1, record);
>> > > > > > >> > Future<DLSN> result3 = txn.write(stream-2, record);
>> > > > > > >> > Future<Pair<DLSN, DLSN>> result = txn.commit();
>> > > > > > >> >
>> > > > > > >> > </code>
>> > > > > > >> >
>> > > > > > >> > if the txn is committed, all the write futures will be
>> > satisfied
>> > > > > with
>> > > > > > >> their
>> > > > > > >> > written DLSNs. if the txn is aborted, all the write futures
>> > will
>> > > > be
>> > > > > > >> failed
>> > > > > > >> > together. there is no partial failure state.
>> > > > > > >> >
>> > > > > > >> > - The actually data flow will be:
>> > > > > > >> >
>> > > > > > >> > 1. writer get a transaction id from the owner of the
>> `commit'
>> > > log
>> > > > > > stream
>> > > > > > >> > 1. write the begin control record (synchronously) with the
>> > > > > transaction
>> > > > > > >> id
>> > > > > > >> > 2. for each write within the same txn, it will be assigned a
>> > > local
>> > > > > > >> sequence
>> > > > > > >> > number starting from 0. the combination of transaction id
>> and
>> > > > local
>> > > > > > >> > sequence number will be used later on by the readers to
>> > > > de-duplicate
>> > > > > > >> > records.
>> > > > > > >> > 3. the commit/abort control record will be written based on
>> > the
>> > > > > > results
>> > > > > > >> > from 2.
>> > > > > > >> >
>> > > > > > >> > - Application can supply a timeout for the transaction when
>> > > > > #begin() a
>> > > > > > >> > transaction. The owner of the `commit` log stream can abort
>> > > > > > transactions
>> > > > > > >> > that never be committed/aborted within their timeout.
>> > > > > > >> >
>> > > > > > >> > - Failures:
>> > > > > > >> >
>> > > > > > >> > * all the log records can be simply retried as they will be
>> > > > > > >> de-duplicated
>> > > > > > >> > probably at the reader side.
>> > > > > > >> >
>> > > > > > >> > - Reader:
>> > > > > > >> >
>> > > > > > >> > * Reader can be configured to read uncommitted records or
>> > > > committed
>> > > > > > >> records
>> > > > > > >> > only (by default read uncommitted records)
>> > > > > > >> > * If reader is configured to read committed records only,
>> the
>> > > read
>> > > > > > ahead
>> > > > > > >> > cache will be changed to maintain one additional pending
>> > > committed
>> > > > > > >> records.
>> > > > > > >> > the pending committed records map is bounded and records
>> will
>> > be
>> > > > > > dropped
>> > > > > > >> > when read ahead is moving.
>> > > > > > >> > * when the reader hits a commit record, it will rewind to
>> the
>> > > > begin
>> > > > > > >> record
>> > > > > > >> > and start reading from there. leveraging the proper read
>> ahead
>> > > > cache
>> > > > > > and
>> > > > > > >> > pending commit records cache, it would be good for both
>> short
>> > > > > > >> transactions
>> > > > > > >> > and long transactions.
>> > > > > > >> >
>> > > > > > >> > - DLSN, SequenceId:
>> > > > > > >> >
>> > > > > > >> > * We will add a fourth field to DLSN. It is `local sequence
>> > > > number`
>> > > > > > >> within
>> > > > > > >> > a transaction session. So the new DLSN of records in a
>> > > transaction
>> > > > > > will
>> > > > > > >> be
>> > > > > > >> > the DLSN of commit control record plus its local sequence
>> > > number.
>> > > > > > >> > * The sequence id will be still the position of the commit
>> > > record
>> > > > > plus
>> > > > > > >> its
>> > > > > > >> > local sequence number. The position will be advanced with
>> > total
>> > > > > number
>> > > > > > >> of
>> > > > > > >> > written records on writing the commit control record.
>> > > > > > >> >
>> > > > > > >> > - Transaction Group & Namespace Transaction
>> > > > > > >> >
>> > > > > > >> > using one single log stream for namespace transaction can
>> > cause
>> > > > the
>> > > > > > >> > bottleneck problem since all the begin/commit/end control
>> > > records
>> > > > > will
>> > > > > > >> have
>> > > > > > >> > to go through one log stream.
>> > > > > > >> >
>> > > > > > >> > the idea of 'transaction group' is to allow partitioning the
>> > > > writers
>> > > > > > >> into
>> > > > > > >> > different transaction groups.
>> > > > > > >> >
>> > > > > > >> > clients can specify the `group-name` when starting the
>> > > > transaction.
>> > > > > if
>> > > > > > >> > there is no `group-name` specified, it will use the default
>> > > > `commit`
>> > > > > > >> log in
>> > > > > > >> > the namespace for creating transactions.
>> > > > > > >> >
>> > > > > > >> > -------------------------------------------------
>> > > > > > >> >
>> > > > > > >> > I'd like to collect feedbacks on this idea. Appreciate any
>> > > > comments
>> > > > > > and
>> > > > > > >> if
>> > > > > > >> > anyone is also interested in this idea, we'd like to
>> > collaborate
>> > > > > with
>> > > > > > >> the
>> > > > > > >> > community.
>> > > > > > >> >
>> > > > > > >> >
>> > > > > > >> > - Xi
>> > > > > > >> >
>> > > > > > >>
>> > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>

Re: [Discuss] Transaction Support

Posted by Sijie Guo <si...@apache.org>.

Hi Xi,

sorry for late response. I will review it soon.

regarding this, a separate question "are we going to use google doc instead
of email thread for any discussion"? I am a bit worried that the discussion
will become lost after moving to google doc. No idea on how other apache
projects are doing.

- Sijie

On Wed, Dec 14, 2016 at 11:41 PM, Xi Liu <xi...@gmail.com> wrote:

> Hi all,
>
> I finalized the first version of the design. This time I used a google doc
> so that it is easier for commenting and add a link the wiki page. I will
> update this to the wiki page once we come to the finalized design.
>
> https://docs.google.com/document/d/14Ns05M8Z5a6DF6fHmWQwISyD5jjeK
> bSIGgSzXuTI5BA/edit
>
> Let me know if you have any questions. Appreciate your reviews!
>
> - Xi
>
>
>
>
>
> On Fri, Oct 28, 2016 at 7:58 AM, Leigh Stewart
> <lstewart@twitter.com.invalid
> > wrote:
>
> > Interesting proposal. A couple quick notes while you continue to flesh
> this
> > out.
> >
> > a. just to be sure - does this eliminate the need to save seqno with
> > checkpoint?
> >
> > b. i.e. another way to describe this kind of improvement is "support
> > records (atomic writes) larger than 1MB", iiuc. the advantage being it
> > avoids the baggage of transactions. disadvantages include inability to do
> > cross stream transactions, and flexibility (interleaving, etc) (are there
> > others?).
> >
> > c. proxy use case is for supporting multiple writers - have you thought
> > about how this would work with multiple writers?
> >
> > Thanks!
> >
> >
> > On Tue, Oct 18, 2016 at 6:45 PM, Sijie Guo <si...@twitter.com.invalid>
> > wrote:
> >
> > > Sound good to me. look forward to the detailed proposal.
> > >
> > > (I don't mind the format if it makes things easier to you)
> > >
> > > Sijie
> > >
> > > On Friday, October 14, 2016, Xi Liu <xi...@gmail.com> wrote:
> > >
> > > > Thank you, Sijie
> > > >
> > > > We have some internal discussions to sort out some details. We are
> > ready
> > > to
> > > > collaborate with the community for adding the transaction support in
> > DL.
> > > > We'd like to share more.
> > > >
> > > > I created a proposal wiki here -
> > > > https://cwiki.apache.org/confluence/display/DL/DP-1+-+
> > > > DistributedLog+Transaction+Support
> > > >
> > > > (I followed KIP format and named it as DP (DistributedLog Proposal -
> DP
> > > is
> > > > also short for Dynamic Programming). I don't know if you guys like
> this
> > > > name or not. Feel free to change it :D)
> > > >
> > > > I basically put my initial email as the content there so far. Once we
> > > > finished our final discussion, I will update with more details. At
> the
> > > same
> > > > time, any comments are welcome.
> > > >
> > > > - Xi
> > > >
> > > >
> > > >
> > > > On Sat, Oct 8, 2016 at 6:58 AM, Sijie Guo <sijie@apache.org
> > > <javascript:;>>
> > > > wrote:
> > > >
> > > > > Xi,
> > > > >
> > > > > I just granted you the edit permission.
> > > > >
> > > > > - Sijie
> > > > >
> > > > > On Fri, Oct 7, 2016 at 10:34 AM, Xi Liu <xi.liu.ant@gmail.com
> > > > <javascript:;>> wrote:
> > > > >
> > > > > > I still can not edit the wiki. Can any of the pmc members grant
> me
> > > the
> > > > > > permissions?
> > > > > >
> > > > > > - Xi
> > > > > >
> > > > > > On Sat, Sep 17, 2016 at 10:35 PM, Xi Liu <xi.liu.ant@gmail.com
> > > > <javascript:;>> wrote:
> > > > > >
> > > > > > > Sijie,
> > > > > > >
> > > > > > > I attempted to create a wiki page under that space. I found
> that
> > I
> > > am
> > > > > not
> > > > > > > authorized with edit permission.
> > > > > > >
> > > > > > > Can any of the committers grant me the wiki edit permission? My
> > > > account
> > > > > > is
> > > > > > > "xi.liu.ant".
> > > > > > >
> > > > > > > - Xi
> > > > > > >
> > > > > > >
> > > > > > > On Tue, Sep 13, 2016 at 9:26 AM, Sijie Guo <sijie@apache.org
> > > > <javascript:;>> wrote:
> > > > > > >
> > > > > > >> This sounds interesting ... I will take a closer look and give
> > my
> > > > > > comments
> > > > > > >> later.
> > > > > > >>
> > > > > > >> At the same time, do you mind creating a wiki page to put your
> > > idea
> > > > > > there?
> > > > > > >> You can add your wiki page under
> > > > > > >> https://cwiki.apache.org/confluence/display/DL/Project+
> > Proposals
> > > > > > >>
> > > > > > >> You might need to ask in the dev list to grant the wiki edit
> > > > > permissions
> > > > > > >> to
> > > > > > >> you once you have a wiki account.
> > > > > > >>
> > > > > > >> - Sijie
> > > > > > >>
> > > > > > >>
> > > > > > >> On Mon, Sep 12, 2016 at 2:20 AM, Xi Liu <xi.liu.ant@gmail.com
> > > > <javascript:;>> wrote:
> > > > > > >>
> > > > > > >> > Hello,
> > > > > > >> >
> > > > > > >> > I asked the transaction support in distributedlog user group
> > two
> > > > > > months
> > > > > > >> > ago. I want to raise this up again, as we are looking for
> > using
> > > > > > >> > distributedlog for building a transactional data service. It
> > is
> > > a
> > > > > > major
> > > > > > >> > feature that is missing in distributedlog. We have some
> ideas
> > to
> > > > add
> > > > > > >> this
> > > > > > >> > to distributedlog and want to know if they make sense or
> not.
> > If
> > > > > they
> > > > > > >> are
> > > > > > >> > good, we'd like to contribute and develop with the
> community.
> > > > > > >> >
> > > > > > >> > Here are the thoughts:
> > > > > > >> >
> > > > > > >> > -------------------------------------------------
> > > > > > >> >
> > > > > > >> > From our understanding, DL can provide "at-least-once"
> > delivery
> > > > > > semantic
> > > > > > >> > (if not, please correct me) but not "exactly-once" delivery
> > > > > semantic.
> > > > > > >> That
> > > > > > >> > means that a message can be delivered one or more times if
> the
> > > > > reader
> > > > > > >> > doesn't handle duplicates.
> > > > > > >> >
> > > > > > >> > The duplicates come from two places, one is at writer side
> > (this
> > > > > > assumes
> > > > > > >> > using write proxy not the core library), while the other one
> > is
> > > at
> > > > > > >> reader
> > > > > > >> > side.
> > > > > > >> >
> > > > > > >> > - writer side: if the client attempts to write a record to
> the
> > > > write
> > > > > > >> > proxies and gets a network error (e.g timeouts) then
> retries,
> > > the
> > > > > > >> retrying
> > > > > > >> > will potentially result in duplicates.
> > > > > > >> > - reader side:if the reader reads a message from a stream
> and
> > > then
> > > > > > >> crashes,
> > > > > > >> > when the reader restarts it would restart from last known
> > > position
> > > > > > >> (DLSN).
> > > > > > >> > If the reader fails after processing a record and before
> > > recording
> > > > > the
> > > > > > >> > position, the processed record will be delivered again.
> > > > > > >> >
> > > > > > >> > The reader problem can be properly addressed by making use
> of
> > > the
> > > > > > >> sequence
> > > > > > >> > numbers of records and doing proper checkpointing. For
> > example,
> > > in
> > > > > > >> > database, it can checkpoint the indexed data with the
> sequence
> > > > > number
> > > > > > of
> > > > > > >> > records; in flink, it can checkpoint the state with the
> > sequence
> > > > > > >> numbers.
> > > > > > >> >
> > > > > > >> > The writer problem can be addressed by implementing an
> > > idempotent
> > > > > > >> writer.
> > > > > > >> > However, an alternative and more powerful approach is to
> > support
> > > > > > >> > transactions.
> > > > > > >> >
> > > > > > >> > *What does transaction mean?*
> > > > > > >> >
> > > > > > >> > A transaction means a collection of records can be written
> > > > > > >> transactionally
> > > > > > >> > within a stream or across multiple streams. They will be
> > > consumed
> > > > by
> > > > > > the
> > > > > > >> > reader together when a transaction is committed, or will
> never
> > > be
> > > > > > >> consumed
> > > > > > >> > by the reader when the transaction is aborted.
> > > > > > >> >
> > > > > > >> > The transaction will expose following guarantees:
> > > > > > >> >
> > > > > > >> > - The reader should not be exposed to records written from
> > > > > uncommitted
> > > > > > >> > transactions (mandatory)
> > > > > > >> > - The reader should consume the records in the transaction
> > > commit
> > > > > > order
> > > > > > >> > rather than the record written order (mandatory)
> > > > > > >> > - No duplicated records within a transaction (mandatory)
> > > > > > >> > - Allow interleaving transactional writes and
> > non-transactional
> > > > > writes
> > > > > > >> > (optional)
> > > > > > >> >
> > > > > > >> > *Stream Transaction & Namespace Transaction*
> > > > > > >> >
> > > > > > >> > There will be two types of transaction, one is Stream level
> > > > > > transaction
> > > > > > >> > (local transaction), while the other one is Namespace level
> > > > > > transaction
> > > > > > >> > (global transaction).
> > > > > > >> >
> > > > > > >> > The stream level transaction is a transactional operation on
> > > > writing
> > > > > > >> > records to one stream; the namespace level transaction is a
> > > > > > >> transactional
> > > > > > >> > operation on writing records to multiple streams.
> > > > > > >> >
> > > > > > >> > *Implementation Thoughts*
> > > > > > >> >
> > > > > > >> > - A transaction is consist of begin control record, a series
> > of
> > > > data
> > > > > > >> > records and commit/abort control record.
> > > > > > >> > - The begin/commit/abort control record is written to a
> > `commit`
> > > > log
> > > > > > >> > stream, while the data records will be written to normal
> data
> > > log
> > > > > > >> streams.
> > > > > > >> > - The `commit` log stream will be the same log stream for
> > > > > stream-level
> > > > > > >> > transaction,  while it will be a *system* stream (or
> multiple
> > > > system
> > > > > > >> > streams) for namespace-level transactions.
> > > > > > >> > - The transaction code looks like as below:
> > > > > > >> >
> > > > > > >> > <code>
> > > > > > >> >
> > > > > > >> > Transaction txn = client.transaction();
> > > > > > >> > Future<DLSN> result1 = txn.write(stream-0, record);
> > > > > > >> > Future<DLSN> result2 = txn.write(stream-1, record);
> > > > > > >> > Future<DLSN> result3 = txn.write(stream-2, record);
> > > > > > >> > Future<Pair<DLSN, DLSN>> result = txn.commit();
> > > > > > >> >
> > > > > > >> > </code>
> > > > > > >> >
> > > > > > >> > if the txn is committed, all the write futures will be
> > satisfied
> > > > > with
> > > > > > >> their
> > > > > > >> > written DLSNs. if the txn is aborted, all the write futures
> > will
> > > > be
> > > > > > >> failed
> > > > > > >> > together. there is no partial failure state.
> > > > > > >> >
> > > > > > >> > - The actually data flow will be:
> > > > > > >> >
> > > > > > >> > 1. writer get a transaction id from the owner of the
> `commit'
> > > log
> > > > > > stream
> > > > > > >> > 1. write the begin control record (synchronously) with the
> > > > > transaction
> > > > > > >> id
> > > > > > >> > 2. for each write within the same txn, it will be assigned a
> > > local
> > > > > > >> sequence
> > > > > > >> > number starting from 0. the combination of transaction id
> and
> > > > local
> > > > > > >> > sequence number will be used later on by the readers to
> > > > de-duplicate
> > > > > > >> > records.
> > > > > > >> > 3. the commit/abort control record will be written based on
> > the
> > > > > > results
> > > > > > >> > from 2.
> > > > > > >> >
> > > > > > >> > - Application can supply a timeout for the transaction when
> > > > > #begin() a
> > > > > > >> > transaction. The owner of the `commit` log stream can abort
> > > > > > transactions
> > > > > > >> > that never be committed/aborted within their timeout.
> > > > > > >> >
> > > > > > >> > - Failures:
> > > > > > >> >
> > > > > > >> > * all the log records can be simply retried as they will be
> > > > > > >> de-duplicated
> > > > > > >> > probably at the reader side.
> > > > > > >> >
> > > > > > >> > - Reader:
> > > > > > >> >
> > > > > > >> > * Reader can be configured to read uncommitted records or
> > > > committed
> > > > > > >> records
> > > > > > >> > only (by default read uncommitted records)
> > > > > > >> > * If reader is configured to read committed records only,
> the
> > > read
> > > > > > ahead
> > > > > > >> > cache will be changed to maintain one additional pending
> > > committed
> > > > > > >> records.
> > > > > > >> > the pending committed records map is bounded and records
> will
> > be
> > > > > > dropped
> > > > > > >> > when read ahead is moving.
> > > > > > >> > * when the reader hits a commit record, it will rewind to
> the
> > > > begin
> > > > > > >> record
> > > > > > >> > and start reading from there. leveraging the proper read
> ahead
> > > > cache
> > > > > > and
> > > > > > >> > pending commit records cache, it would be good for both
> short
> > > > > > >> transactions
> > > > > > >> > and long transactions.
> > > > > > >> >
> > > > > > >> > - DLSN, SequenceId:
> > > > > > >> >
> > > > > > >> > * We will add a fourth field to DLSN. It is `local sequence
> > > > number`
> > > > > > >> within
> > > > > > >> > a transaction session. So the new DLSN of records in a
> > > transaction
> > > > > > will
> > > > > > >> be
> > > > > > >> > the DLSN of commit control record plus its local sequence
> > > number.
> > > > > > >> > * The sequence id will be still the position of the commit
> > > record
> > > > > plus
> > > > > > >> its
> > > > > > >> > local sequence number. The position will be advanced with
> > total
> > > > > number
> > > > > > >> of
> > > > > > >> > written records on writing the commit control record.
> > > > > > >> >
> > > > > > >> > - Transaction Group & Namespace Transaction
> > > > > > >> >
> > > > > > >> > using one single log stream for namespace transaction can
> > cause
> > > > the
> > > > > > >> > bottleneck problem since all the begin/commit/end control
> > > records
> > > > > will
> > > > > > >> have
> > > > > > >> > to go through one log stream.
> > > > > > >> >
> > > > > > >> > the idea of 'transaction group' is to allow partitioning the
> > > > writers
> > > > > > >> into
> > > > > > >> > different transaction groups.
> > > > > > >> >
> > > > > > >> > clients can specify the `group-name` when starting the
> > > > transaction.
> > > > > if
> > > > > > >> > there is no `group-name` specified, it will use the default
> > > > `commit`
> > > > > > >> log in
> > > > > > >> > the namespace for creating transactions.
> > > > > > >> >
> > > > > > >> > -------------------------------------------------
> > > > > > >> >
> > > > > > >> > I'd like to collect feedbacks on this idea. Appreciate any
> > > > comments
> > > > > > and
> > > > > > >> if
> > > > > > >> > anyone is also interested in this idea, we'd like to
> > collaborate
> > > > > with
> > > > > > >> the
> > > > > > >> > community.
> > > > > > >> >
> > > > > > >> >
> > > > > > >> > - Xi
> > > > > > >> >
> > > > > > >>
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [Discuss] Transaction Support

Posted by Xi Liu <xi...@gmail.com>.

Hi all,

I finalized the first version of the design. This time I used a google doc
so that it is easier for commenting and add a link the wiki page. I will
update this to the wiki page once we come to the finalized design.

https://docs.google.com/document/d/14Ns05M8Z5a6DF6fHmWQwISyD5jjeKbSIGgSzXuTI5BA/edit

Let me know if you have any questions. Appreciate your reviews!

- Xi





On Fri, Oct 28, 2016 at 7:58 AM, Leigh Stewart <lstewart@twitter.com.invalid
> wrote:

> Interesting proposal. A couple quick notes while you continue to flesh this
> out.
>
> a. just to be sure - does this eliminate the need to save seqno with
> checkpoint?
>
> b. i.e. another way to describe this kind of improvement is "support
> records (atomic writes) larger than 1MB", iiuc. the advantage being it
> avoids the baggage of transactions. disadvantages include inability to do
> cross stream transactions, and flexibility (interleaving, etc) (are there
> others?).
>
> c. proxy use case is for supporting multiple writers - have you thought
> about how this would work with multiple writers?
>
> Thanks!
>
>
> On Tue, Oct 18, 2016 at 6:45 PM, Sijie Guo <si...@twitter.com.invalid>
> wrote:
>
> > Sound good to me. look forward to the detailed proposal.
> >
> > (I don't mind the format if it makes things easier to you)
> >
> > Sijie
> >
> > On Friday, October 14, 2016, Xi Liu <xi...@gmail.com> wrote:
> >
> > > Thank you, Sijie
> > >
> > > We have some internal discussions to sort out some details. We are
> ready
> > to
> > > collaborate with the community for adding the transaction support in
> DL.
> > > We'd like to share more.
> > >
> > > I created a proposal wiki here -
> > > https://cwiki.apache.org/confluence/display/DL/DP-1+-+
> > > DistributedLog+Transaction+Support
> > >
> > > (I followed KIP format and named it as DP (DistributedLog Proposal - DP
> > is
> > > also short for Dynamic Programming). I don't know if you guys like this
> > > name or not. Feel free to change it :D)
> > >
> > > I basically put my initial email as the content there so far. Once we
> > > finished our final discussion, I will update with more details. At the
> > same
> > > time, any comments are welcome.
> > >
> > > - Xi
> > >
> > >
> > >
> > > On Sat, Oct 8, 2016 at 6:58 AM, Sijie Guo <sijie@apache.org
> > <javascript:;>>
> > > wrote:
> > >
> > > > Xi,
> > > >
> > > > I just granted you the edit permission.
> > > >
> > > > - Sijie
> > > >
> > > > On Fri, Oct 7, 2016 at 10:34 AM, Xi Liu <xi.liu.ant@gmail.com
> > > <javascript:;>> wrote:
> > > >
> > > > > I still can not edit the wiki. Can any of the pmc members grant me
> > the
> > > > > permissions?
> > > > >
> > > > > - Xi
> > > > >
> > > > > On Sat, Sep 17, 2016 at 10:35 PM, Xi Liu <xi.liu.ant@gmail.com
> > > <javascript:;>> wrote:
> > > > >
> > > > > > Sijie,
> > > > > >
> > > > > > I attempted to create a wiki page under that space. I found that
> I
> > am
> > > > not
> > > > > > authorized with edit permission.
> > > > > >
> > > > > > Can any of the committers grant me the wiki edit permission? My
> > > account
> > > > > is
> > > > > > "xi.liu.ant".
> > > > > >
> > > > > > - Xi
> > > > > >
> > > > > >
> > > > > > On Tue, Sep 13, 2016 at 9:26 AM, Sijie Guo <sijie@apache.org
> > > <javascript:;>> wrote:
> > > > > >
> > > > > >> This sounds interesting ... I will take a closer look and give
> my
> > > > > comments
> > > > > >> later.
> > > > > >>
> > > > > >> At the same time, do you mind creating a wiki page to put your
> > idea
> > > > > there?
> > > > > >> You can add your wiki page under
> > > > > >> https://cwiki.apache.org/confluence/display/DL/Project+
> Proposals
> > > > > >>
> > > > > >> You might need to ask in the dev list to grant the wiki edit
> > > > permissions
> > > > > >> to
> > > > > >> you once you have a wiki account.
> > > > > >>
> > > > > >> - Sijie
> > > > > >>
> > > > > >>
> > > > > >> On Mon, Sep 12, 2016 at 2:20 AM, Xi Liu <xi.liu.ant@gmail.com
> > > <javascript:;>> wrote:
> > > > > >>
> > > > > >> > Hello,
> > > > > >> >
> > > > > >> > I asked the transaction support in distributedlog user group
> two
> > > > > months
> > > > > >> > ago. I want to raise this up again, as we are looking for
> using
> > > > > >> > distributedlog for building a transactional data service. It
> is
> > a
> > > > > major
> > > > > >> > feature that is missing in distributedlog. We have some ideas
> to
> > > add
> > > > > >> this
> > > > > >> > to distributedlog and want to know if they make sense or not.
> If
> > > > they
> > > > > >> are
> > > > > >> > good, we'd like to contribute and develop with the community.
> > > > > >> >
> > > > > >> > Here are the thoughts:
> > > > > >> >
> > > > > >> > -------------------------------------------------
> > > > > >> >
> > > > > >> > From our understanding, DL can provide "at-least-once"
> delivery
> > > > > semantic
> > > > > >> > (if not, please correct me) but not "exactly-once" delivery
> > > > semantic.
> > > > > >> That
> > > > > >> > means that a message can be delivered one or more times if the
> > > > reader
> > > > > >> > doesn't handle duplicates.
> > > > > >> >
> > > > > >> > The duplicates come from two places, one is at writer side
> (this
> > > > > assumes
> > > > > >> > using write proxy not the core library), while the other one
> is
> > at
> > > > > >> reader
> > > > > >> > side.
> > > > > >> >
> > > > > >> > - writer side: if the client attempts to write a record to the
> > > write
> > > > > >> > proxies and gets a network error (e.g timeouts) then retries,
> > the
> > > > > >> retrying
> > > > > >> > will potentially result in duplicates.
> > > > > >> > - reader side:if the reader reads a message from a stream and
> > then
> > > > > >> crashes,
> > > > > >> > when the reader restarts it would restart from last known
> > position
> > > > > >> (DLSN).
> > > > > >> > If the reader fails after processing a record and before
> > recording
> > > > the
> > > > > >> > position, the processed record will be delivered again.
> > > > > >> >
> > > > > >> > The reader problem can be properly addressed by making use of
> > the
> > > > > >> sequence
> > > > > >> > numbers of records and doing proper checkpointing. For
> example,
> > in
> > > > > >> > database, it can checkpoint the indexed data with the sequence
> > > > number
> > > > > of
> > > > > >> > records; in flink, it can checkpoint the state with the
> sequence
> > > > > >> numbers.
> > > > > >> >
> > > > > >> > The writer problem can be addressed by implementing an
> > idempotent
> > > > > >> writer.
> > > > > >> > However, an alternative and more powerful approach is to
> support
> > > > > >> > transactions.
> > > > > >> >
> > > > > >> > *What does transaction mean?*
> > > > > >> >
> > > > > >> > A transaction means a collection of records can be written
> > > > > >> transactionally
> > > > > >> > within a stream or across multiple streams. They will be
> > consumed
> > > by
> > > > > the
> > > > > >> > reader together when a transaction is committed, or will never
> > be
> > > > > >> consumed
> > > > > >> > by the reader when the transaction is aborted.
> > > > > >> >
> > > > > >> > The transaction will expose following guarantees:
> > > > > >> >
> > > > > >> > - The reader should not be exposed to records written from
> > > > uncommitted
> > > > > >> > transactions (mandatory)
> > > > > >> > - The reader should consume the records in the transaction
> > commit
> > > > > order
> > > > > >> > rather than the record written order (mandatory)
> > > > > >> > - No duplicated records within a transaction (mandatory)
> > > > > >> > - Allow interleaving transactional writes and
> non-transactional
> > > > writes
> > > > > >> > (optional)
> > > > > >> >
> > > > > >> > *Stream Transaction & Namespace Transaction*
> > > > > >> >
> > > > > >> > There will be two types of transaction, one is Stream level
> > > > > transaction
> > > > > >> > (local transaction), while the other one is Namespace level
> > > > > transaction
> > > > > >> > (global transaction).
> > > > > >> >
> > > > > >> > The stream level transaction is a transactional operation on
> > > writing
> > > > > >> > records to one stream; the namespace level transaction is a
> > > > > >> transactional
> > > > > >> > operation on writing records to multiple streams.
> > > > > >> >
> > > > > >> > *Implementation Thoughts*
> > > > > >> >
> > > > > >> > - A transaction is consist of begin control record, a series
> of
> > > data
> > > > > >> > records and commit/abort control record.
> > > > > >> > - The begin/commit/abort control record is written to a
> `commit`
> > > log
> > > > > >> > stream, while the data records will be written to normal data
> > log
> > > > > >> streams.
> > > > > >> > - The `commit` log stream will be the same log stream for
> > > > stream-level
> > > > > >> > transaction,  while it will be a *system* stream (or multiple
> > > system
> > > > > >> > streams) for namespace-level transactions.
> > > > > >> > - The transaction code looks like as below:
> > > > > >> >
> > > > > >> > <code>
> > > > > >> >
> > > > > >> > Transaction txn = client.transaction();
> > > > > >> > Future<DLSN> result1 = txn.write(stream-0, record);
> > > > > >> > Future<DLSN> result2 = txn.write(stream-1, record);
> > > > > >> > Future<DLSN> result3 = txn.write(stream-2, record);
> > > > > >> > Future<Pair<DLSN, DLSN>> result = txn.commit();
> > > > > >> >
> > > > > >> > </code>
> > > > > >> >
> > > > > >> > if the txn is committed, all the write futures will be
> satisfied
> > > > with
> > > > > >> their
> > > > > >> > written DLSNs. if the txn is aborted, all the write futures
> will
> > > be
> > > > > >> failed
> > > > > >> > together. there is no partial failure state.
> > > > > >> >
> > > > > >> > - The actually data flow will be:
> > > > > >> >
> > > > > >> > 1. writer get a transaction id from the owner of the `commit'
> > log
> > > > > stream
> > > > > >> > 1. write the begin control record (synchronously) with the
> > > > transaction
> > > > > >> id
> > > > > >> > 2. for each write within the same txn, it will be assigned a
> > local
> > > > > >> sequence
> > > > > >> > number starting from 0. the combination of transaction id and
> > > local
> > > > > >> > sequence number will be used later on by the readers to
> > > de-duplicate
> > > > > >> > records.
> > > > > >> > 3. the commit/abort control record will be written based on
> the
> > > > > results
> > > > > >> > from 2.
> > > > > >> >
> > > > > >> > - Application can supply a timeout for the transaction when
> > > > #begin() a
> > > > > >> > transaction. The owner of the `commit` log stream can abort
> > > > > transactions
> > > > > >> > that never be committed/aborted within their timeout.
> > > > > >> >
> > > > > >> > - Failures:
> > > > > >> >
> > > > > >> > * all the log records can be simply retried as they will be
> > > > > >> de-duplicated
> > > > > >> > probably at the reader side.
> > > > > >> >
> > > > > >> > - Reader:
> > > > > >> >
> > > > > >> > * Reader can be configured to read uncommitted records or
> > > committed
> > > > > >> records
> > > > > >> > only (by default read uncommitted records)
> > > > > >> > * If reader is configured to read committed records only, the
> > read
> > > > > ahead
> > > > > >> > cache will be changed to maintain one additional pending
> > committed
> > > > > >> records.
> > > > > >> > the pending committed records map is bounded and records will
> be
> > > > > dropped
> > > > > >> > when read ahead is moving.
> > > > > >> > * when the reader hits a commit record, it will rewind to the
> > > begin
> > > > > >> record
> > > > > >> > and start reading from there. leveraging the proper read ahead
> > > cache
> > > > > and
> > > > > >> > pending commit records cache, it would be good for both short
> > > > > >> transactions
> > > > > >> > and long transactions.
> > > > > >> >
> > > > > >> > - DLSN, SequenceId:
> > > > > >> >
> > > > > >> > * We will add a fourth field to DLSN. It is `local sequence
> > > number`
> > > > > >> within
> > > > > >> > a transaction session. So the new DLSN of records in a
> > transaction
> > > > > will
> > > > > >> be
> > > > > >> > the DLSN of commit control record plus its local sequence
> > number.
> > > > > >> > * The sequence id will be still the position of the commit
> > record
> > > > plus
> > > > > >> its
> > > > > >> > local sequence number. The position will be advanced with
> total
> > > > number
> > > > > >> of
> > > > > >> > written records on writing the commit control record.
> > > > > >> >
> > > > > >> > - Transaction Group & Namespace Transaction
> > > > > >> >
> > > > > >> > using one single log stream for namespace transaction can
> cause
> > > the
> > > > > >> > bottleneck problem since all the begin/commit/end control
> > records
> > > > will
> > > > > >> have
> > > > > >> > to go through one log stream.
> > > > > >> >
> > > > > >> > the idea of 'transaction group' is to allow partitioning the
> > > writers
> > > > > >> into
> > > > > >> > different transaction groups.
> > > > > >> >
> > > > > >> > clients can specify the `group-name` when starting the
> > > transaction.
> > > > if
> > > > > >> > there is no `group-name` specified, it will use the default
> > > `commit`
> > > > > >> log in
> > > > > >> > the namespace for creating transactions.
> > > > > >> >
> > > > > >> > -------------------------------------------------
> > > > > >> >
> > > > > >> > I'd like to collect feedbacks on this idea. Appreciate any
> > > comments
> > > > > and
> > > > > >> if
> > > > > >> > anyone is also interested in this idea, we'd like to
> collaborate
> > > > with
> > > > > >> the
> > > > > >> > community.
> > > > > >> >
> > > > > >> >
> > > > > >> > - Xi
> > > > > >> >
> > > > > >>
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [Discuss] Transaction Support

Posted by Leigh Stewart <ls...@twitter.com.INVALID>.

Interesting proposal. A couple quick notes while you continue to flesh this
out.

a. just to be sure - does this eliminate the need to save seqno with
checkpoint?

b. i.e. another way to describe this kind of improvement is "support
records (atomic writes) larger than 1MB", iiuc. the advantage being it
avoids the baggage of transactions. disadvantages include inability to do
cross stream transactions, and flexibility (interleaving, etc) (are there
others?).

c. proxy use case is for supporting multiple writers - have you thought
about how this would work with multiple writers?

Thanks!


On Tue, Oct 18, 2016 at 6:45 PM, Sijie Guo <si...@twitter.com.invalid>
wrote:

> Sound good to me. look forward to the detailed proposal.
>
> (I don't mind the format if it makes things easier to you)
>
> Sijie
>
> On Friday, October 14, 2016, Xi Liu <xi...@gmail.com> wrote:
>
> > Thank you, Sijie
> >
> > We have some internal discussions to sort out some details. We are ready
> to
> > collaborate with the community for adding the transaction support in DL.
> > We'd like to share more.
> >
> > I created a proposal wiki here -
> > https://cwiki.apache.org/confluence/display/DL/DP-1+-+
> > DistributedLog+Transaction+Support
> >
> > (I followed KIP format and named it as DP (DistributedLog Proposal - DP
> is
> > also short for Dynamic Programming). I don't know if you guys like this
> > name or not. Feel free to change it :D)
> >
> > I basically put my initial email as the content there so far. Once we
> > finished our final discussion, I will update with more details. At the
> same
> > time, any comments are welcome.
> >
> > - Xi
> >
> >
> >
> > On Sat, Oct 8, 2016 at 6:58 AM, Sijie Guo <sijie@apache.org
> <javascript:;>>
> > wrote:
> >
> > > Xi,
> > >
> > > I just granted you the edit permission.
> > >
> > > - Sijie
> > >
> > > On Fri, Oct 7, 2016 at 10:34 AM, Xi Liu <xi.liu.ant@gmail.com
> > <javascript:;>> wrote:
> > >
> > > > I still can not edit the wiki. Can any of the pmc members grant me
> the
> > > > permissions?
> > > >
> > > > - Xi
> > > >
> > > > On Sat, Sep 17, 2016 at 10:35 PM, Xi Liu <xi.liu.ant@gmail.com
> > <javascript:;>> wrote:
> > > >
> > > > > Sijie,
> > > > >
> > > > > I attempted to create a wiki page under that space. I found that I
> am
> > > not
> > > > > authorized with edit permission.
> > > > >
> > > > > Can any of the committers grant me the wiki edit permission? My
> > account
> > > > is
> > > > > "xi.liu.ant".
> > > > >
> > > > > - Xi
> > > > >
> > > > >
> > > > > On Tue, Sep 13, 2016 at 9:26 AM, Sijie Guo <sijie@apache.org
> > <javascript:;>> wrote:
> > > > >
> > > > >> This sounds interesting ... I will take a closer look and give my
> > > > comments
> > > > >> later.
> > > > >>
> > > > >> At the same time, do you mind creating a wiki page to put your
> idea
> > > > there?
> > > > >> You can add your wiki page under
> > > > >> https://cwiki.apache.org/confluence/display/DL/Project+Proposals
> > > > >>
> > > > >> You might need to ask in the dev list to grant the wiki edit
> > > permissions
> > > > >> to
> > > > >> you once you have a wiki account.
> > > > >>
> > > > >> - Sijie
> > > > >>
> > > > >>
> > > > >> On Mon, Sep 12, 2016 at 2:20 AM, Xi Liu <xi.liu.ant@gmail.com
> > <javascript:;>> wrote:
> > > > >>
> > > > >> > Hello,
> > > > >> >
> > > > >> > I asked the transaction support in distributedlog user group two
> > > > months
> > > > >> > ago. I want to raise this up again, as we are looking for using
> > > > >> > distributedlog for building a transactional data service. It is
> a
> > > > major
> > > > >> > feature that is missing in distributedlog. We have some ideas to
> > add
> > > > >> this
> > > > >> > to distributedlog and want to know if they make sense or not. If
> > > they
> > > > >> are
> > > > >> > good, we'd like to contribute and develop with the community.
> > > > >> >
> > > > >> > Here are the thoughts:
> > > > >> >
> > > > >> > -------------------------------------------------
> > > > >> >
> > > > >> > From our understanding, DL can provide "at-least-once" delivery
> > > > semantic
> > > > >> > (if not, please correct me) but not "exactly-once" delivery
> > > semantic.
> > > > >> That
> > > > >> > means that a message can be delivered one or more times if the
> > > reader
> > > > >> > doesn't handle duplicates.
> > > > >> >
> > > > >> > The duplicates come from two places, one is at writer side (this
> > > > assumes
> > > > >> > using write proxy not the core library), while the other one is
> at
> > > > >> reader
> > > > >> > side.
> > > > >> >
> > > > >> > - writer side: if the client attempts to write a record to the
> > write
> > > > >> > proxies and gets a network error (e.g timeouts) then retries,
> the
> > > > >> retrying
> > > > >> > will potentially result in duplicates.
> > > > >> > - reader side:if the reader reads a message from a stream and
> then
> > > > >> crashes,
> > > > >> > when the reader restarts it would restart from last known
> position
> > > > >> (DLSN).
> > > > >> > If the reader fails after processing a record and before
> recording
> > > the
> > > > >> > position, the processed record will be delivered again.
> > > > >> >
> > > > >> > The reader problem can be properly addressed by making use of
> the
> > > > >> sequence
> > > > >> > numbers of records and doing proper checkpointing. For example,
> in
> > > > >> > database, it can checkpoint the indexed data with the sequence
> > > number
> > > > of
> > > > >> > records; in flink, it can checkpoint the state with the sequence
> > > > >> numbers.
> > > > >> >
> > > > >> > The writer problem can be addressed by implementing an
> idempotent
> > > > >> writer.
> > > > >> > However, an alternative and more powerful approach is to support
> > > > >> > transactions.
> > > > >> >
> > > > >> > *What does transaction mean?*
> > > > >> >
> > > > >> > A transaction means a collection of records can be written
> > > > >> transactionally
> > > > >> > within a stream or across multiple streams. They will be
> consumed
> > by
> > > > the
> > > > >> > reader together when a transaction is committed, or will never
> be
> > > > >> consumed
> > > > >> > by the reader when the transaction is aborted.
> > > > >> >
> > > > >> > The transaction will expose following guarantees:
> > > > >> >
> > > > >> > - The reader should not be exposed to records written from
> > > uncommitted
> > > > >> > transactions (mandatory)
> > > > >> > - The reader should consume the records in the transaction
> commit
> > > > order
> > > > >> > rather than the record written order (mandatory)
> > > > >> > - No duplicated records within a transaction (mandatory)
> > > > >> > - Allow interleaving transactional writes and non-transactional
> > > writes
> > > > >> > (optional)
> > > > >> >
> > > > >> > *Stream Transaction & Namespace Transaction*
> > > > >> >
> > > > >> > There will be two types of transaction, one is Stream level
> > > > transaction
> > > > >> > (local transaction), while the other one is Namespace level
> > > > transaction
> > > > >> > (global transaction).
> > > > >> >
> > > > >> > The stream level transaction is a transactional operation on
> > writing
> > > > >> > records to one stream; the namespace level transaction is a
> > > > >> transactional
> > > > >> > operation on writing records to multiple streams.
> > > > >> >
> > > > >> > *Implementation Thoughts*
> > > > >> >
> > > > >> > - A transaction is consist of begin control record, a series of
> > data
> > > > >> > records and commit/abort control record.
> > > > >> > - The begin/commit/abort control record is written to a `commit`
> > log
> > > > >> > stream, while the data records will be written to normal data
> log
> > > > >> streams.
> > > > >> > - The `commit` log stream will be the same log stream for
> > > stream-level
> > > > >> > transaction,  while it will be a *system* stream (or multiple
> > system
> > > > >> > streams) for namespace-level transactions.
> > > > >> > - The transaction code looks like as below:
> > > > >> >
> > > > >> > <code>
> > > > >> >
> > > > >> > Transaction txn = client.transaction();
> > > > >> > Future<DLSN> result1 = txn.write(stream-0, record);
> > > > >> > Future<DLSN> result2 = txn.write(stream-1, record);
> > > > >> > Future<DLSN> result3 = txn.write(stream-2, record);
> > > > >> > Future<Pair<DLSN, DLSN>> result = txn.commit();
> > > > >> >
> > > > >> > </code>
> > > > >> >
> > > > >> > if the txn is committed, all the write futures will be satisfied
> > > with
> > > > >> their
> > > > >> > written DLSNs. if the txn is aborted, all the write futures will
> > be
> > > > >> failed
> > > > >> > together. there is no partial failure state.
> > > > >> >
> > > > >> > - The actually data flow will be:
> > > > >> >
> > > > >> > 1. writer get a transaction id from the owner of the `commit'
> log
> > > > stream
> > > > >> > 1. write the begin control record (synchronously) with the
> > > transaction
> > > > >> id
> > > > >> > 2. for each write within the same txn, it will be assigned a
> local
> > > > >> sequence
> > > > >> > number starting from 0. the combination of transaction id and
> > local
> > > > >> > sequence number will be used later on by the readers to
> > de-duplicate
> > > > >> > records.
> > > > >> > 3. the commit/abort control record will be written based on the
> > > > results
> > > > >> > from 2.
> > > > >> >
> > > > >> > - Application can supply a timeout for the transaction when
> > > #begin() a
> > > > >> > transaction. The owner of the `commit` log stream can abort
> > > > transactions
> > > > >> > that never be committed/aborted within their timeout.
> > > > >> >
> > > > >> > - Failures:
> > > > >> >
> > > > >> > * all the log records can be simply retried as they will be
> > > > >> de-duplicated
> > > > >> > probably at the reader side.
> > > > >> >
> > > > >> > - Reader:
> > > > >> >
> > > > >> > * Reader can be configured to read uncommitted records or
> > committed
> > > > >> records
> > > > >> > only (by default read uncommitted records)
> > > > >> > * If reader is configured to read committed records only, the
> read
> > > > ahead
> > > > >> > cache will be changed to maintain one additional pending
> committed
> > > > >> records.
> > > > >> > the pending committed records map is bounded and records will be
> > > > dropped
> > > > >> > when read ahead is moving.
> > > > >> > * when the reader hits a commit record, it will rewind to the
> > begin
> > > > >> record
> > > > >> > and start reading from there. leveraging the proper read ahead
> > cache
> > > > and
> > > > >> > pending commit records cache, it would be good for both short
> > > > >> transactions
> > > > >> > and long transactions.
> > > > >> >
> > > > >> > - DLSN, SequenceId:
> > > > >> >
> > > > >> > * We will add a fourth field to DLSN. It is `local sequence
> > number`
> > > > >> within
> > > > >> > a transaction session. So the new DLSN of records in a
> transaction
> > > > will
> > > > >> be
> > > > >> > the DLSN of commit control record plus its local sequence
> number.
> > > > >> > * The sequence id will be still the position of the commit
> record
> > > plus
> > > > >> its
> > > > >> > local sequence number. The position will be advanced with total
> > > number
> > > > >> of
> > > > >> > written records on writing the commit control record.
> > > > >> >
> > > > >> > - Transaction Group & Namespace Transaction
> > > > >> >
> > > > >> > using one single log stream for namespace transaction can cause
> > the
> > > > >> > bottleneck problem since all the begin/commit/end control
> records
> > > will
> > > > >> have
> > > > >> > to go through one log stream.
> > > > >> >
> > > > >> > the idea of 'transaction group' is to allow partitioning the
> > writers
> > > > >> into
> > > > >> > different transaction groups.
> > > > >> >
> > > > >> > clients can specify the `group-name` when starting the
> > transaction.
> > > if
> > > > >> > there is no `group-name` specified, it will use the default
> > `commit`
> > > > >> log in
> > > > >> > the namespace for creating transactions.
> > > > >> >
> > > > >> > -------------------------------------------------
> > > > >> >
> > > > >> > I'd like to collect feedbacks on this idea. Appreciate any
> > comments
> > > > and
> > > > >> if
> > > > >> > anyone is also interested in this idea, we'd like to collaborate
> > > with
> > > > >> the
> > > > >> > community.
> > > > >> >
> > > > >> >
> > > > >> > - Xi
> > > > >> >
> > > > >>
> > > > >
> > > > >
> > > >
> > >
> >
>

Re: [Discuss] Transaction Support

Posted by Sijie Guo <si...@twitter.com.INVALID>.

Sound good to me. look forward to the detailed proposal.

(I don't mind the format if it makes things easier to you)

Sijie

On Friday, October 14, 2016, Xi Liu <xi...@gmail.com> wrote:

> Thank you, Sijie
>
> We have some internal discussions to sort out some details. We are ready to
> collaborate with the community for adding the transaction support in DL.
> We'd like to share more.
>
> I created a proposal wiki here -
> https://cwiki.apache.org/confluence/display/DL/DP-1+-+
> DistributedLog+Transaction+Support
>
> (I followed KIP format and named it as DP (DistributedLog Proposal - DP is
> also short for Dynamic Programming). I don't know if you guys like this
> name or not. Feel free to change it :D)
>
> I basically put my initial email as the content there so far. Once we
> finished our final discussion, I will update with more details. At the same
> time, any comments are welcome.
>
> - Xi
>
>
>
> On Sat, Oct 8, 2016 at 6:58 AM, Sijie Guo <sijie@apache.org <javascript:;>>
> wrote:
>
> > Xi,
> >
> > I just granted you the edit permission.
> >
> > - Sijie
> >
> > On Fri, Oct 7, 2016 at 10:34 AM, Xi Liu <xi.liu.ant@gmail.com
> <javascript:;>> wrote:
> >
> > > I still can not edit the wiki. Can any of the pmc members grant me the
> > > permissions?
> > >
> > > - Xi
> > >
> > > On Sat, Sep 17, 2016 at 10:35 PM, Xi Liu <xi.liu.ant@gmail.com
> <javascript:;>> wrote:
> > >
> > > > Sijie,
> > > >
> > > > I attempted to create a wiki page under that space. I found that I am
> > not
> > > > authorized with edit permission.
> > > >
> > > > Can any of the committers grant me the wiki edit permission? My
> account
> > > is
> > > > "xi.liu.ant".
> > > >
> > > > - Xi
> > > >
> > > >
> > > > On Tue, Sep 13, 2016 at 9:26 AM, Sijie Guo <sijie@apache.org
> <javascript:;>> wrote:
> > > >
> > > >> This sounds interesting ... I will take a closer look and give my
> > > comments
> > > >> later.
> > > >>
> > > >> At the same time, do you mind creating a wiki page to put your idea
> > > there?
> > > >> You can add your wiki page under
> > > >> https://cwiki.apache.org/confluence/display/DL/Project+Proposals
> > > >>
> > > >> You might need to ask in the dev list to grant the wiki edit
> > permissions
> > > >> to
> > > >> you once you have a wiki account.
> > > >>
> > > >> - Sijie
> > > >>
> > > >>
> > > >> On Mon, Sep 12, 2016 at 2:20 AM, Xi Liu <xi.liu.ant@gmail.com
> <javascript:;>> wrote:
> > > >>
> > > >> > Hello,
> > > >> >
> > > >> > I asked the transaction support in distributedlog user group two
> > > months
> > > >> > ago. I want to raise this up again, as we are looking for using
> > > >> > distributedlog for building a transactional data service. It is a
> > > major
> > > >> > feature that is missing in distributedlog. We have some ideas to
> add
> > > >> this
> > > >> > to distributedlog and want to know if they make sense or not. If
> > they
> > > >> are
> > > >> > good, we'd like to contribute and develop with the community.
> > > >> >
> > > >> > Here are the thoughts:
> > > >> >
> > > >> > -------------------------------------------------
> > > >> >
> > > >> > From our understanding, DL can provide "at-least-once" delivery
> > > semantic
> > > >> > (if not, please correct me) but not "exactly-once" delivery
> > semantic.
> > > >> That
> > > >> > means that a message can be delivered one or more times if the
> > reader
> > > >> > doesn't handle duplicates.
> > > >> >
> > > >> > The duplicates come from two places, one is at writer side (this
> > > assumes
> > > >> > using write proxy not the core library), while the other one is at
> > > >> reader
> > > >> > side.
> > > >> >
> > > >> > - writer side: if the client attempts to write a record to the
> write
> > > >> > proxies and gets a network error (e.g timeouts) then retries, the
> > > >> retrying
> > > >> > will potentially result in duplicates.
> > > >> > - reader side:if the reader reads a message from a stream and then
> > > >> crashes,
> > > >> > when the reader restarts it would restart from last known position
> > > >> (DLSN).
> > > >> > If the reader fails after processing a record and before recording
> > the
> > > >> > position, the processed record will be delivered again.
> > > >> >
> > > >> > The reader problem can be properly addressed by making use of the
> > > >> sequence
> > > >> > numbers of records and doing proper checkpointing. For example, in
> > > >> > database, it can checkpoint the indexed data with the sequence
> > number
> > > of
> > > >> > records; in flink, it can checkpoint the state with the sequence
> > > >> numbers.
> > > >> >
> > > >> > The writer problem can be addressed by implementing an idempotent
> > > >> writer.
> > > >> > However, an alternative and more powerful approach is to support
> > > >> > transactions.
> > > >> >
> > > >> > *What does transaction mean?*
> > > >> >
> > > >> > A transaction means a collection of records can be written
> > > >> transactionally
> > > >> > within a stream or across multiple streams. They will be consumed
> by
> > > the
> > > >> > reader together when a transaction is committed, or will never be
> > > >> consumed
> > > >> > by the reader when the transaction is aborted.
> > > >> >
> > > >> > The transaction will expose following guarantees:
> > > >> >
> > > >> > - The reader should not be exposed to records written from
> > uncommitted
> > > >> > transactions (mandatory)
> > > >> > - The reader should consume the records in the transaction commit
> > > order
> > > >> > rather than the record written order (mandatory)
> > > >> > - No duplicated records within a transaction (mandatory)
> > > >> > - Allow interleaving transactional writes and non-transactional
> > writes
> > > >> > (optional)
> > > >> >
> > > >> > *Stream Transaction & Namespace Transaction*
> > > >> >
> > > >> > There will be two types of transaction, one is Stream level
> > > transaction
> > > >> > (local transaction), while the other one is Namespace level
> > > transaction
> > > >> > (global transaction).
> > > >> >
> > > >> > The stream level transaction is a transactional operation on
> writing
> > > >> > records to one stream; the namespace level transaction is a
> > > >> transactional
> > > >> > operation on writing records to multiple streams.
> > > >> >
> > > >> > *Implementation Thoughts*
> > > >> >
> > > >> > - A transaction is consist of begin control record, a series of
> data
> > > >> > records and commit/abort control record.
> > > >> > - The begin/commit/abort control record is written to a `commit`
> log
> > > >> > stream, while the data records will be written to normal data log
> > > >> streams.
> > > >> > - The `commit` log stream will be the same log stream for
> > stream-level
> > > >> > transaction,  while it will be a *system* stream (or multiple
> system
> > > >> > streams) for namespace-level transactions.
> > > >> > - The transaction code looks like as below:
> > > >> >
> > > >> > <code>
> > > >> >
> > > >> > Transaction txn = client.transaction();
> > > >> > Future<DLSN> result1 = txn.write(stream-0, record);
> > > >> > Future<DLSN> result2 = txn.write(stream-1, record);
> > > >> > Future<DLSN> result3 = txn.write(stream-2, record);
> > > >> > Future<Pair<DLSN, DLSN>> result = txn.commit();
> > > >> >
> > > >> > </code>
> > > >> >
> > > >> > if the txn is committed, all the write futures will be satisfied
> > with
> > > >> their
> > > >> > written DLSNs. if the txn is aborted, all the write futures will
> be
> > > >> failed
> > > >> > together. there is no partial failure state.
> > > >> >
> > > >> > - The actually data flow will be:
> > > >> >
> > > >> > 1. writer get a transaction id from the owner of the `commit' log
> > > stream
> > > >> > 1. write the begin control record (synchronously) with the
> > transaction
> > > >> id
> > > >> > 2. for each write within the same txn, it will be assigned a local
> > > >> sequence
> > > >> > number starting from 0. the combination of transaction id and
> local
> > > >> > sequence number will be used later on by the readers to
> de-duplicate
> > > >> > records.
> > > >> > 3. the commit/abort control record will be written based on the
> > > results
> > > >> > from 2.
> > > >> >
> > > >> > - Application can supply a timeout for the transaction when
> > #begin() a
> > > >> > transaction. The owner of the `commit` log stream can abort
> > > transactions
> > > >> > that never be committed/aborted within their timeout.
> > > >> >
> > > >> > - Failures:
> > > >> >
> > > >> > * all the log records can be simply retried as they will be
> > > >> de-duplicated
> > > >> > probably at the reader side.
> > > >> >
> > > >> > - Reader:
> > > >> >
> > > >> > * Reader can be configured to read uncommitted records or
> committed
> > > >> records
> > > >> > only (by default read uncommitted records)
> > > >> > * If reader is configured to read committed records only, the read
> > > ahead
> > > >> > cache will be changed to maintain one additional pending committed
> > > >> records.
> > > >> > the pending committed records map is bounded and records will be
> > > dropped
> > > >> > when read ahead is moving.
> > > >> > * when the reader hits a commit record, it will rewind to the
> begin
> > > >> record
> > > >> > and start reading from there. leveraging the proper read ahead
> cache
> > > and
> > > >> > pending commit records cache, it would be good for both short
> > > >> transactions
> > > >> > and long transactions.
> > > >> >
> > > >> > - DLSN, SequenceId:
> > > >> >
> > > >> > * We will add a fourth field to DLSN. It is `local sequence
> number`
> > > >> within
> > > >> > a transaction session. So the new DLSN of records in a transaction
> > > will
> > > >> be
> > > >> > the DLSN of commit control record plus its local sequence number.
> > > >> > * The sequence id will be still the position of the commit record
> > plus
> > > >> its
> > > >> > local sequence number. The position will be advanced with total
> > number
> > > >> of
> > > >> > written records on writing the commit control record.
> > > >> >
> > > >> > - Transaction Group & Namespace Transaction
> > > >> >
> > > >> > using one single log stream for namespace transaction can cause
> the
> > > >> > bottleneck problem since all the begin/commit/end control records
> > will
> > > >> have
> > > >> > to go through one log stream.
> > > >> >
> > > >> > the idea of 'transaction group' is to allow partitioning the
> writers
> > > >> into
> > > >> > different transaction groups.
> > > >> >
> > > >> > clients can specify the `group-name` when starting the
> transaction.
> > if
> > > >> > there is no `group-name` specified, it will use the default
> `commit`
> > > >> log in
> > > >> > the namespace for creating transactions.
> > > >> >
> > > >> > -------------------------------------------------
> > > >> >
> > > >> > I'd like to collect feedbacks on this idea. Appreciate any
> comments
> > > and
> > > >> if
> > > >> > anyone is also interested in this idea, we'd like to collaborate
> > with
> > > >> the
> > > >> > community.
> > > >> >
> > > >> >
> > > >> > - Xi
> > > >> >
> > > >>
> > > >
> > > >
> > >
> >
>

Re: [Discuss] Transaction Support

Posted by Xi Liu <xi...@gmail.com>.

Thank you, Sijie

We have some internal discussions to sort out some details. We are ready to
collaborate with the community for adding the transaction support in DL.
We'd like to share more.

I created a proposal wiki here -
https://cwiki.apache.org/confluence/display/DL/DP-1+-+DistributedLog+Transaction+Support

(I followed KIP format and named it as DP (DistributedLog Proposal - DP is
also short for Dynamic Programming). I don't know if you guys like this
name or not. Feel free to change it :D)

I basically put my initial email as the content there so far. Once we
finished our final discussion, I will update with more details. At the same
time, any comments are welcome.

- Xi



On Sat, Oct 8, 2016 at 6:58 AM, Sijie Guo <si...@apache.org> wrote:

> Xi,
>
> I just granted you the edit permission.
>
> - Sijie
>
> On Fri, Oct 7, 2016 at 10:34 AM, Xi Liu <xi...@gmail.com> wrote:
>
> > I still can not edit the wiki. Can any of the pmc members grant me the
> > permissions?
> >
> > - Xi
> >
> > On Sat, Sep 17, 2016 at 10:35 PM, Xi Liu <xi...@gmail.com> wrote:
> >
> > > Sijie,
> > >
> > > I attempted to create a wiki page under that space. I found that I am
> not
> > > authorized with edit permission.
> > >
> > > Can any of the committers grant me the wiki edit permission? My account
> > is
> > > "xi.liu.ant".
> > >
> > > - Xi
> > >
> > >
> > > On Tue, Sep 13, 2016 at 9:26 AM, Sijie Guo <si...@apache.org> wrote:
> > >
> > >> This sounds interesting ... I will take a closer look and give my
> > comments
> > >> later.
> > >>
> > >> At the same time, do you mind creating a wiki page to put your idea
> > there?
> > >> You can add your wiki page under
> > >> https://cwiki.apache.org/confluence/display/DL/Project+Proposals
> > >>
> > >> You might need to ask in the dev list to grant the wiki edit
> permissions
> > >> to
> > >> you once you have a wiki account.
> > >>
> > >> - Sijie
> > >>
> > >>
> > >> On Mon, Sep 12, 2016 at 2:20 AM, Xi Liu <xi...@gmail.com> wrote:
> > >>
> > >> > Hello,
> > >> >
> > >> > I asked the transaction support in distributedlog user group two
> > months
> > >> > ago. I want to raise this up again, as we are looking for using
> > >> > distributedlog for building a transactional data service. It is a
> > major
> > >> > feature that is missing in distributedlog. We have some ideas to add
> > >> this
> > >> > to distributedlog and want to know if they make sense or not. If
> they
> > >> are
> > >> > good, we'd like to contribute and develop with the community.
> > >> >
> > >> > Here are the thoughts:
> > >> >
> > >> > -------------------------------------------------
> > >> >
> > >> > From our understanding, DL can provide "at-least-once" delivery
> > semantic
> > >> > (if not, please correct me) but not "exactly-once" delivery
> semantic.
> > >> That
> > >> > means that a message can be delivered one or more times if the
> reader
> > >> > doesn't handle duplicates.
> > >> >
> > >> > The duplicates come from two places, one is at writer side (this
> > assumes
> > >> > using write proxy not the core library), while the other one is at
> > >> reader
> > >> > side.
> > >> >
> > >> > - writer side: if the client attempts to write a record to the write
> > >> > proxies and gets a network error (e.g timeouts) then retries, the
> > >> retrying
> > >> > will potentially result in duplicates.
> > >> > - reader side:if the reader reads a message from a stream and then
> > >> crashes,
> > >> > when the reader restarts it would restart from last known position
> > >> (DLSN).
> > >> > If the reader fails after processing a record and before recording
> the
> > >> > position, the processed record will be delivered again.
> > >> >
> > >> > The reader problem can be properly addressed by making use of the
> > >> sequence
> > >> > numbers of records and doing proper checkpointing. For example, in
> > >> > database, it can checkpoint the indexed data with the sequence
> number
> > of
> > >> > records; in flink, it can checkpoint the state with the sequence
> > >> numbers.
> > >> >
> > >> > The writer problem can be addressed by implementing an idempotent
> > >> writer.
> > >> > However, an alternative and more powerful approach is to support
> > >> > transactions.
> > >> >
> > >> > *What does transaction mean?*
> > >> >
> > >> > A transaction means a collection of records can be written
> > >> transactionally
> > >> > within a stream or across multiple streams. They will be consumed by
> > the
> > >> > reader together when a transaction is committed, or will never be
> > >> consumed
> > >> > by the reader when the transaction is aborted.
> > >> >
> > >> > The transaction will expose following guarantees:
> > >> >
> > >> > - The reader should not be exposed to records written from
> uncommitted
> > >> > transactions (mandatory)
> > >> > - The reader should consume the records in the transaction commit
> > order
> > >> > rather than the record written order (mandatory)
> > >> > - No duplicated records within a transaction (mandatory)
> > >> > - Allow interleaving transactional writes and non-transactional
> writes
> > >> > (optional)
> > >> >
> > >> > *Stream Transaction & Namespace Transaction*
> > >> >
> > >> > There will be two types of transaction, one is Stream level
> > transaction
> > >> > (local transaction), while the other one is Namespace level
> > transaction
> > >> > (global transaction).
> > >> >
> > >> > The stream level transaction is a transactional operation on writing
> > >> > records to one stream; the namespace level transaction is a
> > >> transactional
> > >> > operation on writing records to multiple streams.
> > >> >
> > >> > *Implementation Thoughts*
> > >> >
> > >> > - A transaction is consist of begin control record, a series of data
> > >> > records and commit/abort control record.
> > >> > - The begin/commit/abort control record is written to a `commit` log
> > >> > stream, while the data records will be written to normal data log
> > >> streams.
> > >> > - The `commit` log stream will be the same log stream for
> stream-level
> > >> > transaction,  while it will be a *system* stream (or multiple system
> > >> > streams) for namespace-level transactions.
> > >> > - The transaction code looks like as below:
> > >> >
> > >> > <code>
> > >> >
> > >> > Transaction txn = client.transaction();
> > >> > Future<DLSN> result1 = txn.write(stream-0, record);
> > >> > Future<DLSN> result2 = txn.write(stream-1, record);
> > >> > Future<DLSN> result3 = txn.write(stream-2, record);
> > >> > Future<Pair<DLSN, DLSN>> result = txn.commit();
> > >> >
> > >> > </code>
> > >> >
> > >> > if the txn is committed, all the write futures will be satisfied
> with
> > >> their
> > >> > written DLSNs. if the txn is aborted, all the write futures will be
> > >> failed
> > >> > together. there is no partial failure state.
> > >> >
> > >> > - The actually data flow will be:
> > >> >
> > >> > 1. writer get a transaction id from the owner of the `commit' log
> > stream
> > >> > 1. write the begin control record (synchronously) with the
> transaction
> > >> id
> > >> > 2. for each write within the same txn, it will be assigned a local
> > >> sequence
> > >> > number starting from 0. the combination of transaction id and local
> > >> > sequence number will be used later on by the readers to de-duplicate
> > >> > records.
> > >> > 3. the commit/abort control record will be written based on the
> > results
> > >> > from 2.
> > >> >
> > >> > - Application can supply a timeout for the transaction when
> #begin() a
> > >> > transaction. The owner of the `commit` log stream can abort
> > transactions
> > >> > that never be committed/aborted within their timeout.
> > >> >
> > >> > - Failures:
> > >> >
> > >> > * all the log records can be simply retried as they will be
> > >> de-duplicated
> > >> > probably at the reader side.
> > >> >
> > >> > - Reader:
> > >> >
> > >> > * Reader can be configured to read uncommitted records or committed
> > >> records
> > >> > only (by default read uncommitted records)
> > >> > * If reader is configured to read committed records only, the read
> > ahead
> > >> > cache will be changed to maintain one additional pending committed
> > >> records.
> > >> > the pending committed records map is bounded and records will be
> > dropped
> > >> > when read ahead is moving.
> > >> > * when the reader hits a commit record, it will rewind to the begin
> > >> record
> > >> > and start reading from there. leveraging the proper read ahead cache
> > and
> > >> > pending commit records cache, it would be good for both short
> > >> transactions
> > >> > and long transactions.
> > >> >
> > >> > - DLSN, SequenceId:
> > >> >
> > >> > * We will add a fourth field to DLSN. It is `local sequence number`
> > >> within
> > >> > a transaction session. So the new DLSN of records in a transaction
> > will
> > >> be
> > >> > the DLSN of commit control record plus its local sequence number.
> > >> > * The sequence id will be still the position of the commit record
> plus
> > >> its
> > >> > local sequence number. The position will be advanced with total
> number
> > >> of
> > >> > written records on writing the commit control record.
> > >> >
> > >> > - Transaction Group & Namespace Transaction
> > >> >
> > >> > using one single log stream for namespace transaction can cause the
> > >> > bottleneck problem since all the begin/commit/end control records
> will
> > >> have
> > >> > to go through one log stream.
> > >> >
> > >> > the idea of 'transaction group' is to allow partitioning the writers
> > >> into
> > >> > different transaction groups.
> > >> >
> > >> > clients can specify the `group-name` when starting the transaction.
> if
> > >> > there is no `group-name` specified, it will use the default `commit`
> > >> log in
> > >> > the namespace for creating transactions.
> > >> >
> > >> > -------------------------------------------------
> > >> >
> > >> > I'd like to collect feedbacks on this idea. Appreciate any comments
> > and
> > >> if
> > >> > anyone is also interested in this idea, we'd like to collaborate
> with
> > >> the
> > >> > community.
> > >> >
> > >> >
> > >> > - Xi
> > >> >
> > >>
> > >
> > >
> >
>

Re: [Discuss] Transaction Support

Posted by Sijie Guo <si...@apache.org>.

Xi,

I just granted you the edit permission.

- Sijie

On Fri, Oct 7, 2016 at 10:34 AM, Xi Liu <xi...@gmail.com> wrote:

> I still can not edit the wiki. Can any of the pmc members grant me the
> permissions?
>
> - Xi
>
> On Sat, Sep 17, 2016 at 10:35 PM, Xi Liu <xi...@gmail.com> wrote:
>
> > Sijie,
> >
> > I attempted to create a wiki page under that space. I found that I am not
> > authorized with edit permission.
> >
> > Can any of the committers grant me the wiki edit permission? My account
> is
> > "xi.liu.ant".
> >
> > - Xi
> >
> >
> > On Tue, Sep 13, 2016 at 9:26 AM, Sijie Guo <si...@apache.org> wrote:
> >
> >> This sounds interesting ... I will take a closer look and give my
> comments
> >> later.
> >>
> >> At the same time, do you mind creating a wiki page to put your idea
> there?
> >> You can add your wiki page under
> >> https://cwiki.apache.org/confluence/display/DL/Project+Proposals
> >>
> >> You might need to ask in the dev list to grant the wiki edit permissions
> >> to
> >> you once you have a wiki account.
> >>
> >> - Sijie
> >>
> >>
> >> On Mon, Sep 12, 2016 at 2:20 AM, Xi Liu <xi...@gmail.com> wrote:
> >>
> >> > Hello,
> >> >
> >> > I asked the transaction support in distributedlog user group two
> months
> >> > ago. I want to raise this up again, as we are looking for using
> >> > distributedlog for building a transactional data service. It is a
> major
> >> > feature that is missing in distributedlog. We have some ideas to add
> >> this
> >> > to distributedlog and want to know if they make sense or not. If they
> >> are
> >> > good, we'd like to contribute and develop with the community.
> >> >
> >> > Here are the thoughts:
> >> >
> >> > -------------------------------------------------
> >> >
> >> > From our understanding, DL can provide "at-least-once" delivery
> semantic
> >> > (if not, please correct me) but not "exactly-once" delivery semantic.
> >> That
> >> > means that a message can be delivered one or more times if the reader
> >> > doesn't handle duplicates.
> >> >
> >> > The duplicates come from two places, one is at writer side (this
> assumes
> >> > using write proxy not the core library), while the other one is at
> >> reader
> >> > side.
> >> >
> >> > - writer side: if the client attempts to write a record to the write
> >> > proxies and gets a network error (e.g timeouts) then retries, the
> >> retrying
> >> > will potentially result in duplicates.
> >> > - reader side:if the reader reads a message from a stream and then
> >> crashes,
> >> > when the reader restarts it would restart from last known position
> >> (DLSN).
> >> > If the reader fails after processing a record and before recording the
> >> > position, the processed record will be delivered again.
> >> >
> >> > The reader problem can be properly addressed by making use of the
> >> sequence
> >> > numbers of records and doing proper checkpointing. For example, in
> >> > database, it can checkpoint the indexed data with the sequence number
> of
> >> > records; in flink, it can checkpoint the state with the sequence
> >> numbers.
> >> >
> >> > The writer problem can be addressed by implementing an idempotent
> >> writer.
> >> > However, an alternative and more powerful approach is to support
> >> > transactions.
> >> >
> >> > *What does transaction mean?*
> >> >
> >> > A transaction means a collection of records can be written
> >> transactionally
> >> > within a stream or across multiple streams. They will be consumed by
> the
> >> > reader together when a transaction is committed, or will never be
> >> consumed
> >> > by the reader when the transaction is aborted.
> >> >
> >> > The transaction will expose following guarantees:
> >> >
> >> > - The reader should not be exposed to records written from uncommitted
> >> > transactions (mandatory)
> >> > - The reader should consume the records in the transaction commit
> order
> >> > rather than the record written order (mandatory)
> >> > - No duplicated records within a transaction (mandatory)
> >> > - Allow interleaving transactional writes and non-transactional writes
> >> > (optional)
> >> >
> >> > *Stream Transaction & Namespace Transaction*
> >> >
> >> > There will be two types of transaction, one is Stream level
> transaction
> >> > (local transaction), while the other one is Namespace level
> transaction
> >> > (global transaction).
> >> >
> >> > The stream level transaction is a transactional operation on writing
> >> > records to one stream; the namespace level transaction is a
> >> transactional
> >> > operation on writing records to multiple streams.
> >> >
> >> > *Implementation Thoughts*
> >> >
> >> > - A transaction is consist of begin control record, a series of data
> >> > records and commit/abort control record.
> >> > - The begin/commit/abort control record is written to a `commit` log
> >> > stream, while the data records will be written to normal data log
> >> streams.
> >> > - The `commit` log stream will be the same log stream for stream-level
> >> > transaction,  while it will be a *system* stream (or multiple system
> >> > streams) for namespace-level transactions.
> >> > - The transaction code looks like as below:
> >> >
> >> > <code>
> >> >
> >> > Transaction txn = client.transaction();
> >> > Future<DLSN> result1 = txn.write(stream-0, record);
> >> > Future<DLSN> result2 = txn.write(stream-1, record);
> >> > Future<DLSN> result3 = txn.write(stream-2, record);
> >> > Future<Pair<DLSN, DLSN>> result = txn.commit();
> >> >
> >> > </code>
> >> >
> >> > if the txn is committed, all the write futures will be satisfied with
> >> their
> >> > written DLSNs. if the txn is aborted, all the write futures will be
> >> failed
> >> > together. there is no partial failure state.
> >> >
> >> > - The actually data flow will be:
> >> >
> >> > 1. writer get a transaction id from the owner of the `commit' log
> stream
> >> > 1. write the begin control record (synchronously) with the transaction
> >> id
> >> > 2. for each write within the same txn, it will be assigned a local
> >> sequence
> >> > number starting from 0. the combination of transaction id and local
> >> > sequence number will be used later on by the readers to de-duplicate
> >> > records.
> >> > 3. the commit/abort control record will be written based on the
> results
> >> > from 2.
> >> >
> >> > - Application can supply a timeout for the transaction when #begin() a
> >> > transaction. The owner of the `commit` log stream can abort
> transactions
> >> > that never be committed/aborted within their timeout.
> >> >
> >> > - Failures:
> >> >
> >> > * all the log records can be simply retried as they will be
> >> de-duplicated
> >> > probably at the reader side.
> >> >
> >> > - Reader:
> >> >
> >> > * Reader can be configured to read uncommitted records or committed
> >> records
> >> > only (by default read uncommitted records)
> >> > * If reader is configured to read committed records only, the read
> ahead
> >> > cache will be changed to maintain one additional pending committed
> >> records.
> >> > the pending committed records map is bounded and records will be
> dropped
> >> > when read ahead is moving.
> >> > * when the reader hits a commit record, it will rewind to the begin
> >> record
> >> > and start reading from there. leveraging the proper read ahead cache
> and
> >> > pending commit records cache, it would be good for both short
> >> transactions
> >> > and long transactions.
> >> >
> >> > - DLSN, SequenceId:
> >> >
> >> > * We will add a fourth field to DLSN. It is `local sequence number`
> >> within
> >> > a transaction session. So the new DLSN of records in a transaction
> will
> >> be
> >> > the DLSN of commit control record plus its local sequence number.
> >> > * The sequence id will be still the position of the commit record plus
> >> its
> >> > local sequence number. The position will be advanced with total number
> >> of
> >> > written records on writing the commit control record.
> >> >
> >> > - Transaction Group & Namespace Transaction
> >> >
> >> > using one single log stream for namespace transaction can cause the
> >> > bottleneck problem since all the begin/commit/end control records will
> >> have
> >> > to go through one log stream.
> >> >
> >> > the idea of 'transaction group' is to allow partitioning the writers
> >> into
> >> > different transaction groups.
> >> >
> >> > clients can specify the `group-name` when starting the transaction. if
> >> > there is no `group-name` specified, it will use the default `commit`
> >> log in
> >> > the namespace for creating transactions.
> >> >
> >> > -------------------------------------------------
> >> >
> >> > I'd like to collect feedbacks on this idea. Appreciate any comments
> and
> >> if
> >> > anyone is also interested in this idea, we'd like to collaborate with
> >> the
> >> > community.
> >> >
> >> >
> >> > - Xi
> >> >
> >>
> >
> >
>

Re: [Discuss] Transaction Support

Posted by Xi Liu <xi...@gmail.com>.

I still can not edit the wiki. Can any of the pmc members grant me the
permissions?

- Xi

On Sat, Sep 17, 2016 at 10:35 PM, Xi Liu <xi...@gmail.com> wrote:

> Sijie,
>
> I attempted to create a wiki page under that space. I found that I am not
> authorized with edit permission.
>
> Can any of the committers grant me the wiki edit permission? My account is
> "xi.liu.ant".
>
> - Xi
>
>
> On Tue, Sep 13, 2016 at 9:26 AM, Sijie Guo <si...@apache.org> wrote:
>
>> This sounds interesting ... I will take a closer look and give my comments
>> later.
>>
>> At the same time, do you mind creating a wiki page to put your idea there?
>> You can add your wiki page under
>> https://cwiki.apache.org/confluence/display/DL/Project+Proposals
>>
>> You might need to ask in the dev list to grant the wiki edit permissions
>> to
>> you once you have a wiki account.
>>
>> - Sijie
>>
>>
>> On Mon, Sep 12, 2016 at 2:20 AM, Xi Liu <xi...@gmail.com> wrote:
>>
>> > Hello,
>> >
>> > I asked the transaction support in distributedlog user group two months
>> > ago. I want to raise this up again, as we are looking for using
>> > distributedlog for building a transactional data service. It is a major
>> > feature that is missing in distributedlog. We have some ideas to add
>> this
>> > to distributedlog and want to know if they make sense or not. If they
>> are
>> > good, we'd like to contribute and develop with the community.
>> >
>> > Here are the thoughts:
>> >
>> > -------------------------------------------------
>> >
>> > From our understanding, DL can provide "at-least-once" delivery semantic
>> > (if not, please correct me) but not "exactly-once" delivery semantic.
>> That
>> > means that a message can be delivered one or more times if the reader
>> > doesn't handle duplicates.
>> >
>> > The duplicates come from two places, one is at writer side (this assumes
>> > using write proxy not the core library), while the other one is at
>> reader
>> > side.
>> >
>> > - writer side: if the client attempts to write a record to the write
>> > proxies and gets a network error (e.g timeouts) then retries, the
>> retrying
>> > will potentially result in duplicates.
>> > - reader side:if the reader reads a message from a stream and then
>> crashes,
>> > when the reader restarts it would restart from last known position
>> (DLSN).
>> > If the reader fails after processing a record and before recording the
>> > position, the processed record will be delivered again.
>> >
>> > The reader problem can be properly addressed by making use of the
>> sequence
>> > numbers of records and doing proper checkpointing. For example, in
>> > database, it can checkpoint the indexed data with the sequence number of
>> > records; in flink, it can checkpoint the state with the sequence
>> numbers.
>> >
>> > The writer problem can be addressed by implementing an idempotent
>> writer.
>> > However, an alternative and more powerful approach is to support
>> > transactions.
>> >
>> > *What does transaction mean?*
>> >
>> > A transaction means a collection of records can be written
>> transactionally
>> > within a stream or across multiple streams. They will be consumed by the
>> > reader together when a transaction is committed, or will never be
>> consumed
>> > by the reader when the transaction is aborted.
>> >
>> > The transaction will expose following guarantees:
>> >
>> > - The reader should not be exposed to records written from uncommitted
>> > transactions (mandatory)
>> > - The reader should consume the records in the transaction commit order
>> > rather than the record written order (mandatory)
>> > - No duplicated records within a transaction (mandatory)
>> > - Allow interleaving transactional writes and non-transactional writes
>> > (optional)
>> >
>> > *Stream Transaction & Namespace Transaction*
>> >
>> > There will be two types of transaction, one is Stream level transaction
>> > (local transaction), while the other one is Namespace level transaction
>> > (global transaction).
>> >
>> > The stream level transaction is a transactional operation on writing
>> > records to one stream; the namespace level transaction is a
>> transactional
>> > operation on writing records to multiple streams.
>> >
>> > *Implementation Thoughts*
>> >
>> > - A transaction is consist of begin control record, a series of data
>> > records and commit/abort control record.
>> > - The begin/commit/abort control record is written to a `commit` log
>> > stream, while the data records will be written to normal data log
>> streams.
>> > - The `commit` log stream will be the same log stream for stream-level
>> > transaction,  while it will be a *system* stream (or multiple system
>> > streams) for namespace-level transactions.
>> > - The transaction code looks like as below:
>> >
>> > <code>
>> >
>> > Transaction txn = client.transaction();
>> > Future<DLSN> result1 = txn.write(stream-0, record);
>> > Future<DLSN> result2 = txn.write(stream-1, record);
>> > Future<DLSN> result3 = txn.write(stream-2, record);
>> > Future<Pair<DLSN, DLSN>> result = txn.commit();
>> >
>> > </code>
>> >
>> > if the txn is committed, all the write futures will be satisfied with
>> their
>> > written DLSNs. if the txn is aborted, all the write futures will be
>> failed
>> > together. there is no partial failure state.
>> >
>> > - The actually data flow will be:
>> >
>> > 1. writer get a transaction id from the owner of the `commit' log stream
>> > 1. write the begin control record (synchronously) with the transaction
>> id
>> > 2. for each write within the same txn, it will be assigned a local
>> sequence
>> > number starting from 0. the combination of transaction id and local
>> > sequence number will be used later on by the readers to de-duplicate
>> > records.
>> > 3. the commit/abort control record will be written based on the results
>> > from 2.
>> >
>> > - Application can supply a timeout for the transaction when #begin() a
>> > transaction. The owner of the `commit` log stream can abort transactions
>> > that never be committed/aborted within their timeout.
>> >
>> > - Failures:
>> >
>> > * all the log records can be simply retried as they will be
>> de-duplicated
>> > probably at the reader side.
>> >
>> > - Reader:
>> >
>> > * Reader can be configured to read uncommitted records or committed
>> records
>> > only (by default read uncommitted records)
>> > * If reader is configured to read committed records only, the read ahead
>> > cache will be changed to maintain one additional pending committed
>> records.
>> > the pending committed records map is bounded and records will be dropped
>> > when read ahead is moving.
>> > * when the reader hits a commit record, it will rewind to the begin
>> record
>> > and start reading from there. leveraging the proper read ahead cache and
>> > pending commit records cache, it would be good for both short
>> transactions
>> > and long transactions.
>> >
>> > - DLSN, SequenceId:
>> >
>> > * We will add a fourth field to DLSN. It is `local sequence number`
>> within
>> > a transaction session. So the new DLSN of records in a transaction will
>> be
>> > the DLSN of commit control record plus its local sequence number.
>> > * The sequence id will be still the position of the commit record plus
>> its
>> > local sequence number. The position will be advanced with total number
>> of
>> > written records on writing the commit control record.
>> >
>> > - Transaction Group & Namespace Transaction
>> >
>> > using one single log stream for namespace transaction can cause the
>> > bottleneck problem since all the begin/commit/end control records will
>> have
>> > to go through one log stream.
>> >
>> > the idea of 'transaction group' is to allow partitioning the writers
>> into
>> > different transaction groups.
>> >
>> > clients can specify the `group-name` when starting the transaction. if
>> > there is no `group-name` specified, it will use the default `commit`
>> log in
>> > the namespace for creating transactions.
>> >
>> > -------------------------------------------------
>> >
>> > I'd like to collect feedbacks on this idea. Appreciate any comments and
>> if
>> > anyone is also interested in this idea, we'd like to collaborate with
>> the
>> > community.
>> >
>> >
>> > - Xi
>> >
>>
>
>

Re: [Discuss] Transaction Support

Posted by Xi Liu <xi...@gmail.com>.

Sijie,

I attempted to create a wiki page under that space. I found that I am not
authorized with edit permission.

Can any of the committers grant me the wiki edit permission? My account is "
xi.liu.ant".

- Xi

On Tue, Sep 13, 2016 at 9:26 AM, Sijie Guo <si...@apache.org> wrote:

> This sounds interesting ... I will take a closer look and give my comments
> later.
>
> At the same time, do you mind creating a wiki page to put your idea there?
> You can add your wiki page under
> https://cwiki.apache.org/confluence/display/DL/Project+Proposals
>
> You might need to ask in the dev list to grant the wiki edit permissions to
> you once you have a wiki account.
>
> - Sijie
>
>
> On Mon, Sep 12, 2016 at 2:20 AM, Xi Liu <xi...@gmail.com> wrote:
>
> > Hello,
> >
> > I asked the transaction support in distributedlog user group two months
> > ago. I want to raise this up again, as we are looking for using
> > distributedlog for building a transactional data service. It is a major
> > feature that is missing in distributedlog. We have some ideas to add this
> > to distributedlog and want to know if they make sense or not. If they are
> > good, we'd like to contribute and develop with the community.
> >
> > Here are the thoughts:
> >
> > -------------------------------------------------
> >
> > From our understanding, DL can provide "at-least-once" delivery semantic
> > (if not, please correct me) but not "exactly-once" delivery semantic.
> That
> > means that a message can be delivered one or more times if the reader
> > doesn't handle duplicates.
> >
> > The duplicates come from two places, one is at writer side (this assumes
> > using write proxy not the core library), while the other one is at reader
> > side.
> >
> > - writer side: if the client attempts to write a record to the write
> > proxies and gets a network error (e.g timeouts) then retries, the
> retrying
> > will potentially result in duplicates.
> > - reader side:if the reader reads a message from a stream and then
> crashes,
> > when the reader restarts it would restart from last known position
> (DLSN).
> > If the reader fails after processing a record and before recording the
> > position, the processed record will be delivered again.
> >
> > The reader problem can be properly addressed by making use of the
> sequence
> > numbers of records and doing proper checkpointing. For example, in
> > database, it can checkpoint the indexed data with the sequence number of
> > records; in flink, it can checkpoint the state with the sequence numbers.
> >
> > The writer problem can be addressed by implementing an idempotent writer.
> > However, an alternative and more powerful approach is to support
> > transactions.
> >
> > *What does transaction mean?*
> >
> > A transaction means a collection of records can be written
> transactionally
> > within a stream or across multiple streams. They will be consumed by the
> > reader together when a transaction is committed, or will never be
> consumed
> > by the reader when the transaction is aborted.
> >
> > The transaction will expose following guarantees:
> >
> > - The reader should not be exposed to records written from uncommitted
> > transactions (mandatory)
> > - The reader should consume the records in the transaction commit order
> > rather than the record written order (mandatory)
> > - No duplicated records within a transaction (mandatory)
> > - Allow interleaving transactional writes and non-transactional writes
> > (optional)
> >
> > *Stream Transaction & Namespace Transaction*
> >
> > There will be two types of transaction, one is Stream level transaction
> > (local transaction), while the other one is Namespace level transaction
> > (global transaction).
> >
> > The stream level transaction is a transactional operation on writing
> > records to one stream; the namespace level transaction is a transactional
> > operation on writing records to multiple streams.
> >
> > *Implementation Thoughts*
> >
> > - A transaction is consist of begin control record, a series of data
> > records and commit/abort control record.
> > - The begin/commit/abort control record is written to a `commit` log
> > stream, while the data records will be written to normal data log
> streams.
> > - The `commit` log stream will be the same log stream for stream-level
> > transaction,  while it will be a *system* stream (or multiple system
> > streams) for namespace-level transactions.
> > - The transaction code looks like as below:
> >
> > <code>
> >
> > Transaction txn = client.transaction();
> > Future<DLSN> result1 = txn.write(stream-0, record);
> > Future<DLSN> result2 = txn.write(stream-1, record);
> > Future<DLSN> result3 = txn.write(stream-2, record);
> > Future<Pair<DLSN, DLSN>> result = txn.commit();
> >
> > </code>
> >
> > if the txn is committed, all the write futures will be satisfied with
> their
> > written DLSNs. if the txn is aborted, all the write futures will be
> failed
> > together. there is no partial failure state.
> >
> > - The actually data flow will be:
> >
> > 1. writer get a transaction id from the owner of the `commit' log stream
> > 1. write the begin control record (synchronously) with the transaction id
> > 2. for each write within the same txn, it will be assigned a local
> sequence
> > number starting from 0. the combination of transaction id and local
> > sequence number will be used later on by the readers to de-duplicate
> > records.
> > 3. the commit/abort control record will be written based on the results
> > from 2.
> >
> > - Application can supply a timeout for the transaction when #begin() a
> > transaction. The owner of the `commit` log stream can abort transactions
> > that never be committed/aborted within their timeout.
> >
> > - Failures:
> >
> > * all the log records can be simply retried as they will be de-duplicated
> > probably at the reader side.
> >
> > - Reader:
> >
> > * Reader can be configured to read uncommitted records or committed
> records
> > only (by default read uncommitted records)
> > * If reader is configured to read committed records only, the read ahead
> > cache will be changed to maintain one additional pending committed
> records.
> > the pending committed records map is bounded and records will be dropped
> > when read ahead is moving.
> > * when the reader hits a commit record, it will rewind to the begin
> record
> > and start reading from there. leveraging the proper read ahead cache and
> > pending commit records cache, it would be good for both short
> transactions
> > and long transactions.
> >
> > - DLSN, SequenceId:
> >
> > * We will add a fourth field to DLSN. It is `local sequence number`
> within
> > a transaction session. So the new DLSN of records in a transaction will
> be
> > the DLSN of commit control record plus its local sequence number.
> > * The sequence id will be still the position of the commit record plus
> its
> > local sequence number. The position will be advanced with total number of
> > written records on writing the commit control record.
> >
> > - Transaction Group & Namespace Transaction
> >
> > using one single log stream for namespace transaction can cause the
> > bottleneck problem since all the begin/commit/end control records will
> have
> > to go through one log stream.
> >
> > the idea of 'transaction group' is to allow partitioning the writers into
> > different transaction groups.
> >
> > clients can specify the `group-name` when starting the transaction. if
> > there is no `group-name` specified, it will use the default `commit` log
> in
> > the namespace for creating transactions.
> >
> > -------------------------------------------------
> >
> > I'd like to collect feedbacks on this idea. Appreciate any comments and
> if
> > anyone is also interested in this idea, we'd like to collaborate with the
> > community.
> >
> >
> > - Xi
> >
>

Re: [Discuss] Transaction Support

Posted by Sijie Guo <si...@apache.org>.

This sounds interesting ... I will take a closer look and give my comments
later.

At the same time, do you mind creating a wiki page to put your idea there?
You can add your wiki page under
https://cwiki.apache.org/confluence/display/DL/Project+Proposals

You might need to ask in the dev list to grant the wiki edit permissions to
you once you have a wiki account.

- Sijie


On Mon, Sep 12, 2016 at 2:20 AM, Xi Liu <xi...@gmail.com> wrote:

> Hello,
>
> I asked the transaction support in distributedlog user group two months
> ago. I want to raise this up again, as we are looking for using
> distributedlog for building a transactional data service. It is a major
> feature that is missing in distributedlog. We have some ideas to add this
> to distributedlog and want to know if they make sense or not. If they are
> good, we'd like to contribute and develop with the community.
>
> Here are the thoughts:
>
> -------------------------------------------------
>
> From our understanding, DL can provide "at-least-once" delivery semantic
> (if not, please correct me) but not "exactly-once" delivery semantic. That
> means that a message can be delivered one or more times if the reader
> doesn't handle duplicates.
>
> The duplicates come from two places, one is at writer side (this assumes
> using write proxy not the core library), while the other one is at reader
> side.
>
> - writer side: if the client attempts to write a record to the write
> proxies and gets a network error (e.g timeouts) then retries, the retrying
> will potentially result in duplicates.
> - reader side:if the reader reads a message from a stream and then crashes,
> when the reader restarts it would restart from last known position (DLSN).
> If the reader fails after processing a record and before recording the
> position, the processed record will be delivered again.
>
> The reader problem can be properly addressed by making use of the sequence
> numbers of records and doing proper checkpointing. For example, in
> database, it can checkpoint the indexed data with the sequence number of
> records; in flink, it can checkpoint the state with the sequence numbers.
>
> The writer problem can be addressed by implementing an idempotent writer.
> However, an alternative and more powerful approach is to support
> transactions.
>
> *What does transaction mean?*
>
> A transaction means a collection of records can be written transactionally
> within a stream or across multiple streams. They will be consumed by the
> reader together when a transaction is committed, or will never be consumed
> by the reader when the transaction is aborted.
>
> The transaction will expose following guarantees:
>
> - The reader should not be exposed to records written from uncommitted
> transactions (mandatory)
> - The reader should consume the records in the transaction commit order
> rather than the record written order (mandatory)
> - No duplicated records within a transaction (mandatory)
> - Allow interleaving transactional writes and non-transactional writes
> (optional)
>
> *Stream Transaction & Namespace Transaction*
>
> There will be two types of transaction, one is Stream level transaction
> (local transaction), while the other one is Namespace level transaction
> (global transaction).
>
> The stream level transaction is a transactional operation on writing
> records to one stream; the namespace level transaction is a transactional
> operation on writing records to multiple streams.
>
> *Implementation Thoughts*
>
> - A transaction is consist of begin control record, a series of data
> records and commit/abort control record.
> - The begin/commit/abort control record is written to a `commit` log
> stream, while the data records will be written to normal data log streams.
> - The `commit` log stream will be the same log stream for stream-level
> transaction,  while it will be a *system* stream (or multiple system
> streams) for namespace-level transactions.
> - The transaction code looks like as below:
>
> <code>
>
> Transaction txn = client.transaction();
> Future<DLSN> result1 = txn.write(stream-0, record);
> Future<DLSN> result2 = txn.write(stream-1, record);
> Future<DLSN> result3 = txn.write(stream-2, record);
> Future<Pair<DLSN, DLSN>> result = txn.commit();
>
> </code>
>
> if the txn is committed, all the write futures will be satisfied with their
> written DLSNs. if the txn is aborted, all the write futures will be failed
> together. there is no partial failure state.
>
> - The actually data flow will be:
>
> 1. writer get a transaction id from the owner of the `commit' log stream
> 1. write the begin control record (synchronously) with the transaction id
> 2. for each write within the same txn, it will be assigned a local sequence
> number starting from 0. the combination of transaction id and local
> sequence number will be used later on by the readers to de-duplicate
> records.
> 3. the commit/abort control record will be written based on the results
> from 2.
>
> - Application can supply a timeout for the transaction when #begin() a
> transaction. The owner of the `commit` log stream can abort transactions
> that never be committed/aborted within their timeout.
>
> - Failures:
>
> * all the log records can be simply retried as they will be de-duplicated
> probably at the reader side.
>
> - Reader:
>
> * Reader can be configured to read uncommitted records or committed records
> only (by default read uncommitted records)
> * If reader is configured to read committed records only, the read ahead
> cache will be changed to maintain one additional pending committed records.
> the pending committed records map is bounded and records will be dropped
> when read ahead is moving.
> * when the reader hits a commit record, it will rewind to the begin record
> and start reading from there. leveraging the proper read ahead cache and
> pending commit records cache, it would be good for both short transactions
> and long transactions.
>
> - DLSN, SequenceId:
>
> * We will add a fourth field to DLSN. It is `local sequence number` within
> a transaction session. So the new DLSN of records in a transaction will be
> the DLSN of commit control record plus its local sequence number.
> * The sequence id will be still the position of the commit record plus its
> local sequence number. The position will be advanced with total number of
> written records on writing the commit control record.
>
> - Transaction Group & Namespace Transaction
>
> using one single log stream for namespace transaction can cause the
> bottleneck problem since all the begin/commit/end control records will have
> to go through one log stream.
>
> the idea of 'transaction group' is to allow partitioning the writers into
> different transaction groups.
>
> clients can specify the `group-name` when starting the transaction. if
> there is no `group-name` specified, it will use the default `commit` log in
> the namespace for creating transactions.
>
> -------------------------------------------------
>
> I'd like to collect feedbacks on this idea. Appreciate any comments and if
> anyone is also interested in this idea, we'd like to collaborate with the
> community.
>
>
> - Xi
>

Re: [Discuss] Transaction Support

Posted by Xi Liu <xi...@gmail.com>.

Khurrum, thank your for your comment.

A few use cases for your reference.

- we partitioned the data by keys (for example, user id). An operation may
update keys at the same time. We'd like the updates should only be seen by
the readers by only once. There is no partial failure state in between. A
single stream level transaction can not help in this case.
- imaging two stream computing job, the first one is reading from a set of
distributedlog streams and writing the computation results to the other set
of distributedlog streams. The other set of distributedlog streams are the
input for the second job. We use another distributedlog stream for tracking
the read dlsns for the first set of distributedlog stream and we want to
make sure updating the offset stream and propagating the computation
results into the streams of second job can be updated atomically.

Let me know your opinions. Appreciate your help.

- Xi

On Sun, Sep 18, 2016 at 10:25 AM, Khurrum Nasim <kh...@gmail.com>
wrote:

> Xi,
>
> The "stream level" transaction makes more sense to me. It is an extension
> of 'atomic write' records over the size limitation. I can see the value of
> having such ability.
>
> But I am not convinced by the namespace level transaction. Do you have any
> concrente use cases that you can talk about more?
>
> - KN
>
> On Mon, Sep 12, 2016 at 5:20 PM, Xi Liu <xi...@gmail.com> wrote:
>
> > Hello,
> >
> > I asked the transaction support in distributedlog user group two months
> > ago. I want to raise this up again, as we are looking for using
> > distributedlog for building a transactional data service. It is a major
> > feature that is missing in distributedlog. We have some ideas to add this
> > to distributedlog and want to know if they make sense or not. If they are
> > good, we'd like to contribute and develop with the community.
> >
> > Here are the thoughts:
> >
> > -------------------------------------------------
> >
> > From our understanding, DL can provide "at-least-once" delivery semantic
> > (if not, please correct me) but not "exactly-once" delivery semantic.
> That
> > means that a message can be delivered one or more times if the reader
> > doesn't handle duplicates.
> >
> > The duplicates come from two places, one is at writer side (this assumes
> > using write proxy not the core library), while the other one is at reader
> > side.
> >
> > - writer side: if the client attempts to write a record to the write
> > proxies and gets a network error (e.g timeouts) then retries, the
> retrying
> > will potentially result in duplicates.
> > - reader side:if the reader reads a message from a stream and then
> crashes,
> > when the reader restarts it would restart from last known position
> (DLSN).
> > If the reader fails after processing a record and before recording the
> > position, the processed record will be delivered again.
> >
> > The reader problem can be properly addressed by making use of the
> sequence
> > numbers of records and doing proper checkpointing. For example, in
> > database, it can checkpoint the indexed data with the sequence number of
> > records; in flink, it can checkpoint the state with the sequence numbers.
> >
> > The writer problem can be addressed by implementing an idempotent writer.
> > However, an alternative and more powerful approach is to support
> > transactions.
> >
> > *What does transaction mean?*
> >
> > A transaction means a collection of records can be written
> transactionally
> > within a stream or across multiple streams. They will be consumed by the
> > reader together when a transaction is committed, or will never be
> consumed
> > by the reader when the transaction is aborted.
> >
> > The transaction will expose following guarantees:
> >
> > - The reader should not be exposed to records written from uncommitted
> > transactions (mandatory)
> > - The reader should consume the records in the transaction commit order
> > rather than the record written order (mandatory)
> > - No duplicated records within a transaction (mandatory)
> > - Allow interleaving transactional writes and non-transactional writes
> > (optional)
> >
> > *Stream Transaction & Namespace Transaction*
> >
> > There will be two types of transaction, one is Stream level transaction
> > (local transaction), while the other one is Namespace level transaction
> > (global transaction).
> >
> > The stream level transaction is a transactional operation on writing
> > records to one stream; the namespace level transaction is a transactional
> > operation on writing records to multiple streams.
> >
> > *Implementation Thoughts*
> >
> > - A transaction is consist of begin control record, a series of data
> > records and commit/abort control record.
> > - The begin/commit/abort control record is written to a `commit` log
> > stream, while the data records will be written to normal data log
> streams.
> > - The `commit` log stream will be the same log stream for stream-level
> > transaction,  while it will be a *system* stream (or multiple system
> > streams) for namespace-level transactions.
> > - The transaction code looks like as below:
> >
> > <code>
> >
> > Transaction txn = client.transaction();
> > Future<DLSN> result1 = txn.write(stream-0, record);
> > Future<DLSN> result2 = txn.write(stream-1, record);
> > Future<DLSN> result3 = txn.write(stream-2, record);
> > Future<Pair<DLSN, DLSN>> result = txn.commit();
> >
> > </code>
> >
> > if the txn is committed, all the write futures will be satisfied with
> their
> > written DLSNs. if the txn is aborted, all the write futures will be
> failed
> > together. there is no partial failure state.
> >
> > - The actually data flow will be:
> >
> > 1. writer get a transaction id from the owner of the `commit' log stream
> > 1. write the begin control record (synchronously) with the transaction id
> > 2. for each write within the same txn, it will be assigned a local
> sequence
> > number starting from 0. the combination of transaction id and local
> > sequence number will be used later on by the readers to de-duplicate
> > records.
> > 3. the commit/abort control record will be written based on the results
> > from 2.
> >
> > - Application can supply a timeout for the transaction when #begin() a
> > transaction. The owner of the `commit` log stream can abort transactions
> > that never be committed/aborted within their timeout.
> >
> > - Failures:
> >
> > * all the log records can be simply retried as they will be de-duplicated
> > probably at the reader side.
> >
> > - Reader:
> >
> > * Reader can be configured to read uncommitted records or committed
> records
> > only (by default read uncommitted records)
> > * If reader is configured to read committed records only, the read ahead
> > cache will be changed to maintain one additional pending committed
> records.
> > the pending committed records map is bounded and records will be dropped
> > when read ahead is moving.
> > * when the reader hits a commit record, it will rewind to the begin
> record
> > and start reading from there. leveraging the proper read ahead cache and
> > pending commit records cache, it would be good for both short
> transactions
> > and long transactions.
> >
> > - DLSN, SequenceId:
> >
> > * We will add a fourth field to DLSN. It is `local sequence number`
> within
> > a transaction session. So the new DLSN of records in a transaction will
> be
> > the DLSN of commit control record plus its local sequence number.
> > * The sequence id will be still the position of the commit record plus
> its
> > local sequence number. The position will be advanced with total number of
> > written records on writing the commit control record.
> >
> > - Transaction Group & Namespace Transaction
> >
> > using one single log stream for namespace transaction can cause the
> > bottleneck problem since all the begin/commit/end control records will
> have
> > to go through one log stream.
> >
> > the idea of 'transaction group' is to allow partitioning the writers into
> > different transaction groups.
> >
> > clients can specify the `group-name` when starting the transaction. if
> > there is no `group-name` specified, it will use the default `commit` log
> in
> > the namespace for creating transactions.
> >
> > -------------------------------------------------
> >
> > I'd like to collect feedbacks on this idea. Appreciate any comments and
> if
> > anyone is also interested in this idea, we'd like to collaborate with the
> > community.
> >
> >
> > - Xi
> >
>

Re: [Discuss] Transaction Support

Posted by Khurrum Nasim <kh...@gmail.com>.

Xi,

The "stream level" transaction makes more sense to me. It is an extension
of 'atomic write' records over the size limitation. I can see the value of
having such ability.

But I am not convinced by the namespace level transaction. Do you have any
concrente use cases that you can talk about more?

- KN

On Mon, Sep 12, 2016 at 5:20 PM, Xi Liu <xi...@gmail.com> wrote:

> Hello,
>
> I asked the transaction support in distributedlog user group two months
> ago. I want to raise this up again, as we are looking for using
> distributedlog for building a transactional data service. It is a major
> feature that is missing in distributedlog. We have some ideas to add this
> to distributedlog and want to know if they make sense or not. If they are
> good, we'd like to contribute and develop with the community.
>
> Here are the thoughts:
>
> -------------------------------------------------
>
> From our understanding, DL can provide "at-least-once" delivery semantic
> (if not, please correct me) but not "exactly-once" delivery semantic. That
> means that a message can be delivered one or more times if the reader
> doesn't handle duplicates.
>
> The duplicates come from two places, one is at writer side (this assumes
> using write proxy not the core library), while the other one is at reader
> side.
>
> - writer side: if the client attempts to write a record to the write
> proxies and gets a network error (e.g timeouts) then retries, the retrying
> will potentially result in duplicates.
> - reader side:if the reader reads a message from a stream and then crashes,
> when the reader restarts it would restart from last known position (DLSN).
> If the reader fails after processing a record and before recording the
> position, the processed record will be delivered again.
>
> The reader problem can be properly addressed by making use of the sequence
> numbers of records and doing proper checkpointing. For example, in
> database, it can checkpoint the indexed data with the sequence number of
> records; in flink, it can checkpoint the state with the sequence numbers.
>
> The writer problem can be addressed by implementing an idempotent writer.
> However, an alternative and more powerful approach is to support
> transactions.
>
> *What does transaction mean?*
>
> A transaction means a collection of records can be written transactionally
> within a stream or across multiple streams. They will be consumed by the
> reader together when a transaction is committed, or will never be consumed
> by the reader when the transaction is aborted.
>
> The transaction will expose following guarantees:
>
> - The reader should not be exposed to records written from uncommitted
> transactions (mandatory)
> - The reader should consume the records in the transaction commit order
> rather than the record written order (mandatory)
> - No duplicated records within a transaction (mandatory)
> - Allow interleaving transactional writes and non-transactional writes
> (optional)
>
> *Stream Transaction & Namespace Transaction*
>
> There will be two types of transaction, one is Stream level transaction
> (local transaction), while the other one is Namespace level transaction
> (global transaction).
>
> The stream level transaction is a transactional operation on writing
> records to one stream; the namespace level transaction is a transactional
> operation on writing records to multiple streams.
>
> *Implementation Thoughts*
>
> - A transaction is consist of begin control record, a series of data
> records and commit/abort control record.
> - The begin/commit/abort control record is written to a `commit` log
> stream, while the data records will be written to normal data log streams.
> - The `commit` log stream will be the same log stream for stream-level
> transaction,  while it will be a *system* stream (or multiple system
> streams) for namespace-level transactions.
> - The transaction code looks like as below:
>
> <code>
>
> Transaction txn = client.transaction();
> Future<DLSN> result1 = txn.write(stream-0, record);
> Future<DLSN> result2 = txn.write(stream-1, record);
> Future<DLSN> result3 = txn.write(stream-2, record);
> Future<Pair<DLSN, DLSN>> result = txn.commit();
>
> </code>
>
> if the txn is committed, all the write futures will be satisfied with their
> written DLSNs. if the txn is aborted, all the write futures will be failed
> together. there is no partial failure state.
>
> - The actually data flow will be:
>
> 1. writer get a transaction id from the owner of the `commit' log stream
> 1. write the begin control record (synchronously) with the transaction id
> 2. for each write within the same txn, it will be assigned a local sequence
> number starting from 0. the combination of transaction id and local
> sequence number will be used later on by the readers to de-duplicate
> records.
> 3. the commit/abort control record will be written based on the results
> from 2.
>
> - Application can supply a timeout for the transaction when #begin() a
> transaction. The owner of the `commit` log stream can abort transactions
> that never be committed/aborted within their timeout.
>
> - Failures:
>
> * all the log records can be simply retried as they will be de-duplicated
> probably at the reader side.
>
> - Reader:
>
> * Reader can be configured to read uncommitted records or committed records
> only (by default read uncommitted records)
> * If reader is configured to read committed records only, the read ahead
> cache will be changed to maintain one additional pending committed records.
> the pending committed records map is bounded and records will be dropped
> when read ahead is moving.
> * when the reader hits a commit record, it will rewind to the begin record
> and start reading from there. leveraging the proper read ahead cache and
> pending commit records cache, it would be good for both short transactions
> and long transactions.
>
> - DLSN, SequenceId:
>
> * We will add a fourth field to DLSN. It is `local sequence number` within
> a transaction session. So the new DLSN of records in a transaction will be
> the DLSN of commit control record plus its local sequence number.
> * The sequence id will be still the position of the commit record plus its
> local sequence number. The position will be advanced with total number of
> written records on writing the commit control record.
>
> - Transaction Group & Namespace Transaction
>
> using one single log stream for namespace transaction can cause the
> bottleneck problem since all the begin/commit/end control records will have
> to go through one log stream.
>
> the idea of 'transaction group' is to allow partitioning the writers into
> different transaction groups.
>
> clients can specify the `group-name` when starting the transaction. if
> there is no `group-name` specified, it will use the default `commit` log in
> the namespace for creating transactions.
>
> -------------------------------------------------
>
> I'd like to collect feedbacks on this idea. Appreciate any comments and if
> anyone is also interested in this idea, we'd like to collaborate with the
> community.
>
>
> - Xi
>