You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@distributedlog.apache.org by Sijie Guo <si...@apache.org> on 2017/01/04 07:18:19 UTC

Re: [Discuss] Transaction Support

Sorry for late response. I think Leigh and you already had some very
valuable discussions in the doc. I will try to add some of my questions to
the discussion.

Beside that, I had a discussion with Leigh today about this. first of all,
I think it is very good to add transaction support in distributedlog. It is
one of the primitives that would help building distributed service. But we
have a concern about making this system become complicated and introduce
operational overhead when it runs in the large scale system on production.
There are two major suggestions that I have for this feature -

Build the 'minimum' logic in core - I think the minimum logic that need to
be added to the core is -  the special control records (begin, commit and
abort) and make the reader be able to detect those special control records
and know what do they mean and how to interrupt with them. Since they are
special control records, there is not overhead to other readers that
doesn't require this feature.

Build the transaction coordinator as a separated proxy service  - I think
the major concern that we have is putting more complexities into the 'write
proxy' service. We architected distributedlog in a more microservice-like
way - we have the core as the stream store, the proxy for serving write and
read traffic. It would be good that the transaction feature can be done in
a similar way. So the architecture would be like this -

*[ write service ] [ read service ] [ transaction coordinator ]*
*[ stream store
            ]*

if people doesn't need the transaction feature, they can turn if off
completely without any operational overhead.

Beside that, I have one general question - What is the major goal for this
feature? Are you targeting on building a general XA transaction coordinator
or just for supporting things like `copy-modify-write' style workflow?


Thanks,
Sijie





On Wed, Dec 28, 2016 at 1:12 PM, Xi Liu <xi...@gmail.com> wrote:

> Ping?
>
> On Mon, Dec 19, 2016 at 8:28 AM, Xi Liu <xi...@gmail.com> wrote:
>
> > Sijie,
> >
> > No. I thought it might be easier for people to comment on a google doc to
> > gather the initial feedback. I will put the content back to wiki page
> once
> > addressing the comments. Does that sound good to you?
> >
> > And thank you in advance.
> >
> > - Xi
> >
> >
> >
> > On Sun, Dec 18, 2016 at 8:48 AM, Sijie Guo <si...@apache.org> wrote:
> >
> >> Hi Xi,
> >>
> >> sorry for late response. I will review it soon.
> >>
> >> regarding this, a separate question "are we going to use google doc
> >> instead
> >> of email thread for any discussion"? I am a bit worried that the
> >> discussion
> >> will become lost after moving to google doc. No idea on how other apache
> >> projects are doing.
> >>
> >> - Sijie
> >>
> >> On Wed, Dec 14, 2016 at 11:41 PM, Xi Liu <xi...@gmail.com> wrote:
> >>
> >> > Hi all,
> >> >
> >> > I finalized the first version of the design. This time I used a google
> >> doc
> >> > so that it is easier for commenting and add a link the wiki page. I
> will
> >> > update this to the wiki page once we come to the finalized design.
> >> >
> >> > https://docs.google.com/document/d/14Ns05M8Z5a6DF6fHmWQwISyD5jjeK
> >> > bSIGgSzXuTI5BA/edit
> >> >
> >> > Let me know if you have any questions. Appreciate your reviews!
> >> >
> >> > - Xi
> >> >
> >> >
> >> >
> >> >
> >> >
> >> > On Fri, Oct 28, 2016 at 7:58 AM, Leigh Stewart
> >> > <lstewart@twitter.com.invalid
> >> > > wrote:
> >> >
> >> > > Interesting proposal. A couple quick notes while you continue to
> flesh
> >> > this
> >> > > out.
> >> > >
> >> > > a. just to be sure - does this eliminate the need to save seqno with
> >> > > checkpoint?
> >> > >
> >> > > b. i.e. another way to describe this kind of improvement is "support
> >> > > records (atomic writes) larger than 1MB", iiuc. the advantage being
> it
> >> > > avoids the baggage of transactions. disadvantages include inability
> >> to do
> >> > > cross stream transactions, and flexibility (interleaving, etc) (are
> >> there
> >> > > others?).
> >> > >
> >> > > c. proxy use case is for supporting multiple writers - have you
> >> thought
> >> > > about how this would work with multiple writers?
> >> > >
> >> > > Thanks!
> >> > >
> >> > >
> >> > > On Tue, Oct 18, 2016 at 6:45 PM, Sijie Guo
> <sijieg@twitter.com.invalid
> >> >
> >> > > wrote:
> >> > >
> >> > > > Sound good to me. look forward to the detailed proposal.
> >> > > >
> >> > > > (I don't mind the format if it makes things easier to you)
> >> > > >
> >> > > > Sijie
> >> > > >
> >> > > > On Friday, October 14, 2016, Xi Liu <xi...@gmail.com> wrote:
> >> > > >
> >> > > > > Thank you, Sijie
> >> > > > >
> >> > > > > We have some internal discussions to sort out some details. We
> are
> >> > > ready
> >> > > > to
> >> > > > > collaborate with the community for adding the transaction
> support
> >> in
> >> > > DL.
> >> > > > > We'd like to share more.
> >> > > > >
> >> > > > > I created a proposal wiki here -
> >> > > > > https://cwiki.apache.org/confluence/display/DL/DP-1+-+
> >> > > > > DistributedLog+Transaction+Support
> >> > > > >
> >> > > > > (I followed KIP format and named it as DP (DistributedLog
> >> Proposal -
> >> > DP
> >> > > > is
> >> > > > > also short for Dynamic Programming). I don't know if you guys
> like
> >> > this
> >> > > > > name or not. Feel free to change it :D)
> >> > > > >
> >> > > > > I basically put my initial email as the content there so far.
> >> Once we
> >> > > > > finished our final discussion, I will update with more details.
> At
> >> > the
> >> > > > same
> >> > > > > time, any comments are welcome.
> >> > > > >
> >> > > > > - Xi
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > > On Sat, Oct 8, 2016 at 6:58 AM, Sijie Guo <sijie@apache.org
> >> > > > <javascript:;>>
> >> > > > > wrote:
> >> > > > >
> >> > > > > > Xi,
> >> > > > > >
> >> > > > > > I just granted you the edit permission.
> >> > > > > >
> >> > > > > > - Sijie
> >> > > > > >
> >> > > > > > On Fri, Oct 7, 2016 at 10:34 AM, Xi Liu <xi.liu.ant@gmail.com
> >> > > > > <javascript:;>> wrote:
> >> > > > > >
> >> > > > > > > I still can not edit the wiki. Can any of the pmc members
> >> grant
> >> > me
> >> > > > the
> >> > > > > > > permissions?
> >> > > > > > >
> >> > > > > > > - Xi
> >> > > > > > >
> >> > > > > > > On Sat, Sep 17, 2016 at 10:35 PM, Xi Liu <
> >> xi.liu.ant@gmail.com
> >> > > > > <javascript:;>> wrote:
> >> > > > > > >
> >> > > > > > > > Sijie,
> >> > > > > > > >
> >> > > > > > > > I attempted to create a wiki page under that space. I
> found
> >> > that
> >> > > I
> >> > > > am
> >> > > > > > not
> >> > > > > > > > authorized with edit permission.
> >> > > > > > > >
> >> > > > > > > > Can any of the committers grant me the wiki edit
> >> permission? My
> >> > > > > account
> >> > > > > > > is
> >> > > > > > > > "xi.liu.ant".
> >> > > > > > > >
> >> > > > > > > > - Xi
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > On Tue, Sep 13, 2016 at 9:26 AM, Sijie Guo <
> >> sijie@apache.org
> >> > > > > <javascript:;>> wrote:
> >> > > > > > > >
> >> > > > > > > >> This sounds interesting ... I will take a closer look and
> >> give
> >> > > my
> >> > > > > > > comments
> >> > > > > > > >> later.
> >> > > > > > > >>
> >> > > > > > > >> At the same time, do you mind creating a wiki page to put
> >> your
> >> > > > idea
> >> > > > > > > there?
> >> > > > > > > >> You can add your wiki page under
> >> > > > > > > >> https://cwiki.apache.org/confluence/display/DL/Project+
> >> > > Proposals
> >> > > > > > > >>
> >> > > > > > > >> You might need to ask in the dev list to grant the wiki
> >> edit
> >> > > > > > permissions
> >> > > > > > > >> to
> >> > > > > > > >> you once you have a wiki account.
> >> > > > > > > >>
> >> > > > > > > >> - Sijie
> >> > > > > > > >>
> >> > > > > > > >>
> >> > > > > > > >> On Mon, Sep 12, 2016 at 2:20 AM, Xi Liu <
> >> xi.liu.ant@gmail.com
> >> > > > > <javascript:;>> wrote:
> >> > > > > > > >>
> >> > > > > > > >> > Hello,
> >> > > > > > > >> >
> >> > > > > > > >> > I asked the transaction support in distributedlog user
> >> group
> >> > > two
> >> > > > > > > months
> >> > > > > > > >> > ago. I want to raise this up again, as we are looking
> for
> >> > > using
> >> > > > > > > >> > distributedlog for building a transactional data
> >> service. It
> >> > > is
> >> > > > a
> >> > > > > > > major
> >> > > > > > > >> > feature that is missing in distributedlog. We have some
> >> > ideas
> >> > > to
> >> > > > > add
> >> > > > > > > >> this
> >> > > > > > > >> > to distributedlog and want to know if they make sense
> or
> >> > not.
> >> > > If
> >> > > > > > they
> >> > > > > > > >> are
> >> > > > > > > >> > good, we'd like to contribute and develop with the
> >> > community.
> >> > > > > > > >> >
> >> > > > > > > >> > Here are the thoughts:
> >> > > > > > > >> >
> >> > > > > > > >> > -------------------------------------------------
> >> > > > > > > >> >
> >> > > > > > > >> > From our understanding, DL can provide "at-least-once"
> >> > > delivery
> >> > > > > > > semantic
> >> > > > > > > >> > (if not, please correct me) but not "exactly-once"
> >> delivery
> >> > > > > > semantic.
> >> > > > > > > >> That
> >> > > > > > > >> > means that a message can be delivered one or more times
> >> if
> >> > the
> >> > > > > > reader
> >> > > > > > > >> > doesn't handle duplicates.
> >> > > > > > > >> >
> >> > > > > > > >> > The duplicates come from two places, one is at writer
> >> side
> >> > > (this
> >> > > > > > > assumes
> >> > > > > > > >> > using write proxy not the core library), while the
> other
> >> one
> >> > > is
> >> > > > at
> >> > > > > > > >> reader
> >> > > > > > > >> > side.
> >> > > > > > > >> >
> >> > > > > > > >> > - writer side: if the client attempts to write a record
> >> to
> >> > the
> >> > > > > write
> >> > > > > > > >> > proxies and gets a network error (e.g timeouts) then
> >> > retries,
> >> > > > the
> >> > > > > > > >> retrying
> >> > > > > > > >> > will potentially result in duplicates.
> >> > > > > > > >> > - reader side:if the reader reads a message from a
> stream
> >> > and
> >> > > > then
> >> > > > > > > >> crashes,
> >> > > > > > > >> > when the reader restarts it would restart from last
> known
> >> > > > position
> >> > > > > > > >> (DLSN).
> >> > > > > > > >> > If the reader fails after processing a record and
> before
> >> > > > recording
> >> > > > > > the
> >> > > > > > > >> > position, the processed record will be delivered again.
> >> > > > > > > >> >
> >> > > > > > > >> > The reader problem can be properly addressed by making
> >> use
> >> > of
> >> > > > the
> >> > > > > > > >> sequence
> >> > > > > > > >> > numbers of records and doing proper checkpointing. For
> >> > > example,
> >> > > > in
> >> > > > > > > >> > database, it can checkpoint the indexed data with the
> >> > sequence
> >> > > > > > number
> >> > > > > > > of
> >> > > > > > > >> > records; in flink, it can checkpoint the state with the
> >> > > sequence
> >> > > > > > > >> numbers.
> >> > > > > > > >> >
> >> > > > > > > >> > The writer problem can be addressed by implementing an
> >> > > > idempotent
> >> > > > > > > >> writer.
> >> > > > > > > >> > However, an alternative and more powerful approach is
> to
> >> > > support
> >> > > > > > > >> > transactions.
> >> > > > > > > >> >
> >> > > > > > > >> > *What does transaction mean?*
> >> > > > > > > >> >
> >> > > > > > > >> > A transaction means a collection of records can be
> >> written
> >> > > > > > > >> transactionally
> >> > > > > > > >> > within a stream or across multiple streams. They will
> be
> >> > > > consumed
> >> > > > > by
> >> > > > > > > the
> >> > > > > > > >> > reader together when a transaction is committed, or
> will
> >> > never
> >> > > > be
> >> > > > > > > >> consumed
> >> > > > > > > >> > by the reader when the transaction is aborted.
> >> > > > > > > >> >
> >> > > > > > > >> > The transaction will expose following guarantees:
> >> > > > > > > >> >
> >> > > > > > > >> > - The reader should not be exposed to records written
> >> from
> >> > > > > > uncommitted
> >> > > > > > > >> > transactions (mandatory)
> >> > > > > > > >> > - The reader should consume the records in the
> >> transaction
> >> > > > commit
> >> > > > > > > order
> >> > > > > > > >> > rather than the record written order (mandatory)
> >> > > > > > > >> > - No duplicated records within a transaction
> (mandatory)
> >> > > > > > > >> > - Allow interleaving transactional writes and
> >> > > non-transactional
> >> > > > > > writes
> >> > > > > > > >> > (optional)
> >> > > > > > > >> >
> >> > > > > > > >> > *Stream Transaction & Namespace Transaction*
> >> > > > > > > >> >
> >> > > > > > > >> > There will be two types of transaction, one is Stream
> >> level
> >> > > > > > > transaction
> >> > > > > > > >> > (local transaction), while the other one is Namespace
> >> level
> >> > > > > > > transaction
> >> > > > > > > >> > (global transaction).
> >> > > > > > > >> >
> >> > > > > > > >> > The stream level transaction is a transactional
> >> operation on
> >> > > > > writing
> >> > > > > > > >> > records to one stream; the namespace level transaction
> >> is a
> >> > > > > > > >> transactional
> >> > > > > > > >> > operation on writing records to multiple streams.
> >> > > > > > > >> >
> >> > > > > > > >> > *Implementation Thoughts*
> >> > > > > > > >> >
> >> > > > > > > >> > - A transaction is consist of begin control record, a
> >> series
> >> > > of
> >> > > > > data
> >> > > > > > > >> > records and commit/abort control record.
> >> > > > > > > >> > - The begin/commit/abort control record is written to a
> >> > > `commit`
> >> > > > > log
> >> > > > > > > >> > stream, while the data records will be written to
> normal
> >> > data
> >> > > > log
> >> > > > > > > >> streams.
> >> > > > > > > >> > - The `commit` log stream will be the same log stream
> for
> >> > > > > > stream-level
> >> > > > > > > >> > transaction,  while it will be a *system* stream (or
> >> > multiple
> >> > > > > system
> >> > > > > > > >> > streams) for namespace-level transactions.
> >> > > > > > > >> > - The transaction code looks like as below:
> >> > > > > > > >> >
> >> > > > > > > >> > <code>
> >> > > > > > > >> >
> >> > > > > > > >> > Transaction txn = client.transaction();
> >> > > > > > > >> > Future<DLSN> result1 = txn.write(stream-0, record);
> >> > > > > > > >> > Future<DLSN> result2 = txn.write(stream-1, record);
> >> > > > > > > >> > Future<DLSN> result3 = txn.write(stream-2, record);
> >> > > > > > > >> > Future<Pair<DLSN, DLSN>> result = txn.commit();
> >> > > > > > > >> >
> >> > > > > > > >> > </code>
> >> > > > > > > >> >
> >> > > > > > > >> > if the txn is committed, all the write futures will be
> >> > > satisfied
> >> > > > > > with
> >> > > > > > > >> their
> >> > > > > > > >> > written DLSNs. if the txn is aborted, all the write
> >> futures
> >> > > will
> >> > > > > be
> >> > > > > > > >> failed
> >> > > > > > > >> > together. there is no partial failure state.
> >> > > > > > > >> >
> >> > > > > > > >> > - The actually data flow will be:
> >> > > > > > > >> >
> >> > > > > > > >> > 1. writer get a transaction id from the owner of the
> >> > `commit'
> >> > > > log
> >> > > > > > > stream
> >> > > > > > > >> > 1. write the begin control record (synchronously) with
> >> the
> >> > > > > > transaction
> >> > > > > > > >> id
> >> > > > > > > >> > 2. for each write within the same txn, it will be
> >> assigned a
> >> > > > local
> >> > > > > > > >> sequence
> >> > > > > > > >> > number starting from 0. the combination of transaction
> id
> >> > and
> >> > > > > local
> >> > > > > > > >> > sequence number will be used later on by the readers to
> >> > > > > de-duplicate
> >> > > > > > > >> > records.
> >> > > > > > > >> > 3. the commit/abort control record will be written
> based
> >> on
> >> > > the
> >> > > > > > > results
> >> > > > > > > >> > from 2.
> >> > > > > > > >> >
> >> > > > > > > >> > - Application can supply a timeout for the transaction
> >> when
> >> > > > > > #begin() a
> >> > > > > > > >> > transaction. The owner of the `commit` log stream can
> >> abort
> >> > > > > > > transactions
> >> > > > > > > >> > that never be committed/aborted within their timeout.
> >> > > > > > > >> >
> >> > > > > > > >> > - Failures:
> >> > > > > > > >> >
> >> > > > > > > >> > * all the log records can be simply retried as they
> will
> >> be
> >> > > > > > > >> de-duplicated
> >> > > > > > > >> > probably at the reader side.
> >> > > > > > > >> >
> >> > > > > > > >> > - Reader:
> >> > > > > > > >> >
> >> > > > > > > >> > * Reader can be configured to read uncommitted records
> or
> >> > > > > committed
> >> > > > > > > >> records
> >> > > > > > > >> > only (by default read uncommitted records)
> >> > > > > > > >> > * If reader is configured to read committed records
> only,
> >> > the
> >> > > > read
> >> > > > > > > ahead
> >> > > > > > > >> > cache will be changed to maintain one additional
> pending
> >> > > > committed
> >> > > > > > > >> records.
> >> > > > > > > >> > the pending committed records map is bounded and
> records
> >> > will
> >> > > be
> >> > > > > > > dropped
> >> > > > > > > >> > when read ahead is moving.
> >> > > > > > > >> > * when the reader hits a commit record, it will rewind
> to
> >> > the
> >> > > > > begin
> >> > > > > > > >> record
> >> > > > > > > >> > and start reading from there. leveraging the proper
> read
> >> > ahead
> >> > > > > cache
> >> > > > > > > and
> >> > > > > > > >> > pending commit records cache, it would be good for both
> >> > short
> >> > > > > > > >> transactions
> >> > > > > > > >> > and long transactions.
> >> > > > > > > >> >
> >> > > > > > > >> > - DLSN, SequenceId:
> >> > > > > > > >> >
> >> > > > > > > >> > * We will add a fourth field to DLSN. It is `local
> >> sequence
> >> > > > > number`
> >> > > > > > > >> within
> >> > > > > > > >> > a transaction session. So the new DLSN of records in a
> >> > > > transaction
> >> > > > > > > will
> >> > > > > > > >> be
> >> > > > > > > >> > the DLSN of commit control record plus its local
> sequence
> >> > > > number.
> >> > > > > > > >> > * The sequence id will be still the position of the
> >> commit
> >> > > > record
> >> > > > > > plus
> >> > > > > > > >> its
> >> > > > > > > >> > local sequence number. The position will be advanced
> with
> >> > > total
> >> > > > > > number
> >> > > > > > > >> of
> >> > > > > > > >> > written records on writing the commit control record.
> >> > > > > > > >> >
> >> > > > > > > >> > - Transaction Group & Namespace Transaction
> >> > > > > > > >> >
> >> > > > > > > >> > using one single log stream for namespace transaction
> can
> >> > > cause
> >> > > > > the
> >> > > > > > > >> > bottleneck problem since all the begin/commit/end
> control
> >> > > > records
> >> > > > > > will
> >> > > > > > > >> have
> >> > > > > > > >> > to go through one log stream.
> >> > > > > > > >> >
> >> > > > > > > >> > the idea of 'transaction group' is to allow
> partitioning
> >> the
> >> > > > > writers
> >> > > > > > > >> into
> >> > > > > > > >> > different transaction groups.
> >> > > > > > > >> >
> >> > > > > > > >> > clients can specify the `group-name` when starting the
> >> > > > > transaction.
> >> > > > > > if
> >> > > > > > > >> > there is no `group-name` specified, it will use the
> >> default
> >> > > > > `commit`
> >> > > > > > > >> log in
> >> > > > > > > >> > the namespace for creating transactions.
> >> > > > > > > >> >
> >> > > > > > > >> > -------------------------------------------------
> >> > > > > > > >> >
> >> > > > > > > >> > I'd like to collect feedbacks on this idea. Appreciate
> >> any
> >> > > > > comments
> >> > > > > > > and
> >> > > > > > > >> if
> >> > > > > > > >> > anyone is also interested in this idea, we'd like to
> >> > > collaborate
> >> > > > > > with
> >> > > > > > > >> the
> >> > > > > > > >> > community.
> >> > > > > > > >> >
> >> > > > > > > >> >
> >> > > > > > > >> > - Xi
> >> > > > > > > >> >
> >> > > > > > > >>
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
> >
>

Re: [Discuss] Transaction Support

Posted by Sijie Guo <si...@apache.org>.
Xi, I added more comments. We are looking forward to your reply and seeing
this happen.

- Sijie

On Tue, Jan 3, 2017 at 11:18 PM, Sijie Guo <si...@apache.org> wrote:

> Sorry for late response. I think Leigh and you already had some very
> valuable discussions in the doc. I will try to add some of my questions to
> the discussion.
>
> Beside that, I had a discussion with Leigh today about this. first of all,
> I think it is very good to add transaction support in distributedlog. It is
> one of the primitives that would help building distributed service. But we
> have a concern about making this system become complicated and introduce
> operational overhead when it runs in the large scale system on production.
> There are two major suggestions that I have for this feature -
>
> Build the 'minimum' logic in core - I think the minimum logic that need to
> be added to the core is -  the special control records (begin, commit and
> abort) and make the reader be able to detect those special control records
> and know what do they mean and how to interrupt with them. Since they are
> special control records, there is not overhead to other readers that
> doesn't require this feature.
>
> Build the transaction coordinator as a separated proxy service  - I think
> the major concern that we have is putting more complexities into the 'write
> proxy' service. We architected distributedlog in a more microservice-like
> way - we have the core as the stream store, the proxy for serving write and
> read traffic. It would be good that the transaction feature can be done in
> a similar way. So the architecture would be like this -
>
> *[ write service ] [ read service ] [ transaction coordinator ]*
> *[ stream store
>               ]*
>
> if people doesn't need the transaction feature, they can turn if off
> completely without any operational overhead.
>
> Beside that, I have one general question - What is the major goal for this
> feature? Are you targeting on building a general XA transaction coordinator
> or just for supporting things like `copy-modify-write' style workflow?
>
>
> Thanks,
> Sijie
>
>
>
>
>
> On Wed, Dec 28, 2016 at 1:12 PM, Xi Liu <xi...@gmail.com> wrote:
>
>> Ping?
>>
>> On Mon, Dec 19, 2016 at 8:28 AM, Xi Liu <xi...@gmail.com> wrote:
>>
>> > Sijie,
>> >
>> > No. I thought it might be easier for people to comment on a google doc
>> to
>> > gather the initial feedback. I will put the content back to wiki page
>> once
>> > addressing the comments. Does that sound good to you?
>> >
>> > And thank you in advance.
>> >
>> > - Xi
>> >
>> >
>> >
>> > On Sun, Dec 18, 2016 at 8:48 AM, Sijie Guo <si...@apache.org> wrote:
>> >
>> >> Hi Xi,
>> >>
>> >> sorry for late response. I will review it soon.
>> >>
>> >> regarding this, a separate question "are we going to use google doc
>> >> instead
>> >> of email thread for any discussion"? I am a bit worried that the
>> >> discussion
>> >> will become lost after moving to google doc. No idea on how other
>> apache
>> >> projects are doing.
>> >>
>> >> - Sijie
>> >>
>> >> On Wed, Dec 14, 2016 at 11:41 PM, Xi Liu <xi...@gmail.com> wrote:
>> >>
>> >> > Hi all,
>> >> >
>> >> > I finalized the first version of the design. This time I used a
>> google
>> >> doc
>> >> > so that it is easier for commenting and add a link the wiki page. I
>> will
>> >> > update this to the wiki page once we come to the finalized design.
>> >> >
>> >> > https://docs.google.com/document/d/14Ns05M8Z5a6DF6fHmWQwISyD5jjeK
>> >> > bSIGgSzXuTI5BA/edit
>> >> >
>> >> > Let me know if you have any questions. Appreciate your reviews!
>> >> >
>> >> > - Xi
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > On Fri, Oct 28, 2016 at 7:58 AM, Leigh Stewart
>> >> > <lstewart@twitter.com.invalid
>> >> > > wrote:
>> >> >
>> >> > > Interesting proposal. A couple quick notes while you continue to
>> flesh
>> >> > this
>> >> > > out.
>> >> > >
>> >> > > a. just to be sure - does this eliminate the need to save seqno
>> with
>> >> > > checkpoint?
>> >> > >
>> >> > > b. i.e. another way to describe this kind of improvement is
>> "support
>> >> > > records (atomic writes) larger than 1MB", iiuc. the advantage
>> being it
>> >> > > avoids the baggage of transactions. disadvantages include inability
>> >> to do
>> >> > > cross stream transactions, and flexibility (interleaving, etc) (are
>> >> there
>> >> > > others?).
>> >> > >
>> >> > > c. proxy use case is for supporting multiple writers - have you
>> >> thought
>> >> > > about how this would work with multiple writers?
>> >> > >
>> >> > > Thanks!
>> >> > >
>> >> > >
>> >> > > On Tue, Oct 18, 2016 at 6:45 PM, Sijie Guo
>> <sijieg@twitter.com.invalid
>> >> >
>> >> > > wrote:
>> >> > >
>> >> > > > Sound good to me. look forward to the detailed proposal.
>> >> > > >
>> >> > > > (I don't mind the format if it makes things easier to you)
>> >> > > >
>> >> > > > Sijie
>> >> > > >
>> >> > > > On Friday, October 14, 2016, Xi Liu <xi...@gmail.com>
>> wrote:
>> >> > > >
>> >> > > > > Thank you, Sijie
>> >> > > > >
>> >> > > > > We have some internal discussions to sort out some details. We
>> are
>> >> > > ready
>> >> > > > to
>> >> > > > > collaborate with the community for adding the transaction
>> support
>> >> in
>> >> > > DL.
>> >> > > > > We'd like to share more.
>> >> > > > >
>> >> > > > > I created a proposal wiki here -
>> >> > > > > https://cwiki.apache.org/confluence/display/DL/DP-1+-+
>> >> > > > > DistributedLog+Transaction+Support
>> >> > > > >
>> >> > > > > (I followed KIP format and named it as DP (DistributedLog
>> >> Proposal -
>> >> > DP
>> >> > > > is
>> >> > > > > also short for Dynamic Programming). I don't know if you guys
>> like
>> >> > this
>> >> > > > > name or not. Feel free to change it :D)
>> >> > > > >
>> >> > > > > I basically put my initial email as the content there so far.
>> >> Once we
>> >> > > > > finished our final discussion, I will update with more
>> details. At
>> >> > the
>> >> > > > same
>> >> > > > > time, any comments are welcome.
>> >> > > > >
>> >> > > > > - Xi
>> >> > > > >
>> >> > > > >
>> >> > > > >
>> >> > > > > On Sat, Oct 8, 2016 at 6:58 AM, Sijie Guo <sijie@apache.org
>> >> > > > <javascript:;>>
>> >> > > > > wrote:
>> >> > > > >
>> >> > > > > > Xi,
>> >> > > > > >
>> >> > > > > > I just granted you the edit permission.
>> >> > > > > >
>> >> > > > > > - Sijie
>> >> > > > > >
>> >> > > > > > On Fri, Oct 7, 2016 at 10:34 AM, Xi Liu <
>> xi.liu.ant@gmail.com
>> >> > > > > <javascript:;>> wrote:
>> >> > > > > >
>> >> > > > > > > I still can not edit the wiki. Can any of the pmc members
>> >> grant
>> >> > me
>> >> > > > the
>> >> > > > > > > permissions?
>> >> > > > > > >
>> >> > > > > > > - Xi
>> >> > > > > > >
>> >> > > > > > > On Sat, Sep 17, 2016 at 10:35 PM, Xi Liu <
>> >> xi.liu.ant@gmail.com
>> >> > > > > <javascript:;>> wrote:
>> >> > > > > > >
>> >> > > > > > > > Sijie,
>> >> > > > > > > >
>> >> > > > > > > > I attempted to create a wiki page under that space. I
>> found
>> >> > that
>> >> > > I
>> >> > > > am
>> >> > > > > > not
>> >> > > > > > > > authorized with edit permission.
>> >> > > > > > > >
>> >> > > > > > > > Can any of the committers grant me the wiki edit
>> >> permission? My
>> >> > > > > account
>> >> > > > > > > is
>> >> > > > > > > > "xi.liu.ant".
>> >> > > > > > > >
>> >> > > > > > > > - Xi
>> >> > > > > > > >
>> >> > > > > > > >
>> >> > > > > > > > On Tue, Sep 13, 2016 at 9:26 AM, Sijie Guo <
>> >> sijie@apache.org
>> >> > > > > <javascript:;>> wrote:
>> >> > > > > > > >
>> >> > > > > > > >> This sounds interesting ... I will take a closer look
>> and
>> >> give
>> >> > > my
>> >> > > > > > > comments
>> >> > > > > > > >> later.
>> >> > > > > > > >>
>> >> > > > > > > >> At the same time, do you mind creating a wiki page to
>> put
>> >> your
>> >> > > > idea
>> >> > > > > > > there?
>> >> > > > > > > >> You can add your wiki page under
>> >> > > > > > > >> https://cwiki.apache.org/confluence/display/DL/Project+
>> >> > > Proposals
>> >> > > > > > > >>
>> >> > > > > > > >> You might need to ask in the dev list to grant the wiki
>> >> edit
>> >> > > > > > permissions
>> >> > > > > > > >> to
>> >> > > > > > > >> you once you have a wiki account.
>> >> > > > > > > >>
>> >> > > > > > > >> - Sijie
>> >> > > > > > > >>
>> >> > > > > > > >>
>> >> > > > > > > >> On Mon, Sep 12, 2016 at 2:20 AM, Xi Liu <
>> >> xi.liu.ant@gmail.com
>> >> > > > > <javascript:;>> wrote:
>> >> > > > > > > >>
>> >> > > > > > > >> > Hello,
>> >> > > > > > > >> >
>> >> > > > > > > >> > I asked the transaction support in distributedlog user
>> >> group
>> >> > > two
>> >> > > > > > > months
>> >> > > > > > > >> > ago. I want to raise this up again, as we are looking
>> for
>> >> > > using
>> >> > > > > > > >> > distributedlog for building a transactional data
>> >> service. It
>> >> > > is
>> >> > > > a
>> >> > > > > > > major
>> >> > > > > > > >> > feature that is missing in distributedlog. We have
>> some
>> >> > ideas
>> >> > > to
>> >> > > > > add
>> >> > > > > > > >> this
>> >> > > > > > > >> > to distributedlog and want to know if they make sense
>> or
>> >> > not.
>> >> > > If
>> >> > > > > > they
>> >> > > > > > > >> are
>> >> > > > > > > >> > good, we'd like to contribute and develop with the
>> >> > community.
>> >> > > > > > > >> >
>> >> > > > > > > >> > Here are the thoughts:
>> >> > > > > > > >> >
>> >> > > > > > > >> > -------------------------------------------------
>> >> > > > > > > >> >
>> >> > > > > > > >> > From our understanding, DL can provide "at-least-once"
>> >> > > delivery
>> >> > > > > > > semantic
>> >> > > > > > > >> > (if not, please correct me) but not "exactly-once"
>> >> delivery
>> >> > > > > > semantic.
>> >> > > > > > > >> That
>> >> > > > > > > >> > means that a message can be delivered one or more
>> times
>> >> if
>> >> > the
>> >> > > > > > reader
>> >> > > > > > > >> > doesn't handle duplicates.
>> >> > > > > > > >> >
>> >> > > > > > > >> > The duplicates come from two places, one is at writer
>> >> side
>> >> > > (this
>> >> > > > > > > assumes
>> >> > > > > > > >> > using write proxy not the core library), while the
>> other
>> >> one
>> >> > > is
>> >> > > > at
>> >> > > > > > > >> reader
>> >> > > > > > > >> > side.
>> >> > > > > > > >> >
>> >> > > > > > > >> > - writer side: if the client attempts to write a
>> record
>> >> to
>> >> > the
>> >> > > > > write
>> >> > > > > > > >> > proxies and gets a network error (e.g timeouts) then
>> >> > retries,
>> >> > > > the
>> >> > > > > > > >> retrying
>> >> > > > > > > >> > will potentially result in duplicates.
>> >> > > > > > > >> > - reader side:if the reader reads a message from a
>> stream
>> >> > and
>> >> > > > then
>> >> > > > > > > >> crashes,
>> >> > > > > > > >> > when the reader restarts it would restart from last
>> known
>> >> > > > position
>> >> > > > > > > >> (DLSN).
>> >> > > > > > > >> > If the reader fails after processing a record and
>> before
>> >> > > > recording
>> >> > > > > > the
>> >> > > > > > > >> > position, the processed record will be delivered
>> again.
>> >> > > > > > > >> >
>> >> > > > > > > >> > The reader problem can be properly addressed by making
>> >> use
>> >> > of
>> >> > > > the
>> >> > > > > > > >> sequence
>> >> > > > > > > >> > numbers of records and doing proper checkpointing. For
>> >> > > example,
>> >> > > > in
>> >> > > > > > > >> > database, it can checkpoint the indexed data with the
>> >> > sequence
>> >> > > > > > number
>> >> > > > > > > of
>> >> > > > > > > >> > records; in flink, it can checkpoint the state with
>> the
>> >> > > sequence
>> >> > > > > > > >> numbers.
>> >> > > > > > > >> >
>> >> > > > > > > >> > The writer problem can be addressed by implementing an
>> >> > > > idempotent
>> >> > > > > > > >> writer.
>> >> > > > > > > >> > However, an alternative and more powerful approach is
>> to
>> >> > > support
>> >> > > > > > > >> > transactions.
>> >> > > > > > > >> >
>> >> > > > > > > >> > *What does transaction mean?*
>> >> > > > > > > >> >
>> >> > > > > > > >> > A transaction means a collection of records can be
>> >> written
>> >> > > > > > > >> transactionally
>> >> > > > > > > >> > within a stream or across multiple streams. They will
>> be
>> >> > > > consumed
>> >> > > > > by
>> >> > > > > > > the
>> >> > > > > > > >> > reader together when a transaction is committed, or
>> will
>> >> > never
>> >> > > > be
>> >> > > > > > > >> consumed
>> >> > > > > > > >> > by the reader when the transaction is aborted.
>> >> > > > > > > >> >
>> >> > > > > > > >> > The transaction will expose following guarantees:
>> >> > > > > > > >> >
>> >> > > > > > > >> > - The reader should not be exposed to records written
>> >> from
>> >> > > > > > uncommitted
>> >> > > > > > > >> > transactions (mandatory)
>> >> > > > > > > >> > - The reader should consume the records in the
>> >> transaction
>> >> > > > commit
>> >> > > > > > > order
>> >> > > > > > > >> > rather than the record written order (mandatory)
>> >> > > > > > > >> > - No duplicated records within a transaction
>> (mandatory)
>> >> > > > > > > >> > - Allow interleaving transactional writes and
>> >> > > non-transactional
>> >> > > > > > writes
>> >> > > > > > > >> > (optional)
>> >> > > > > > > >> >
>> >> > > > > > > >> > *Stream Transaction & Namespace Transaction*
>> >> > > > > > > >> >
>> >> > > > > > > >> > There will be two types of transaction, one is Stream
>> >> level
>> >> > > > > > > transaction
>> >> > > > > > > >> > (local transaction), while the other one is Namespace
>> >> level
>> >> > > > > > > transaction
>> >> > > > > > > >> > (global transaction).
>> >> > > > > > > >> >
>> >> > > > > > > >> > The stream level transaction is a transactional
>> >> operation on
>> >> > > > > writing
>> >> > > > > > > >> > records to one stream; the namespace level transaction
>> >> is a
>> >> > > > > > > >> transactional
>> >> > > > > > > >> > operation on writing records to multiple streams.
>> >> > > > > > > >> >
>> >> > > > > > > >> > *Implementation Thoughts*
>> >> > > > > > > >> >
>> >> > > > > > > >> > - A transaction is consist of begin control record, a
>> >> series
>> >> > > of
>> >> > > > > data
>> >> > > > > > > >> > records and commit/abort control record.
>> >> > > > > > > >> > - The begin/commit/abort control record is written to
>> a
>> >> > > `commit`
>> >> > > > > log
>> >> > > > > > > >> > stream, while the data records will be written to
>> normal
>> >> > data
>> >> > > > log
>> >> > > > > > > >> streams.
>> >> > > > > > > >> > - The `commit` log stream will be the same log stream
>> for
>> >> > > > > > stream-level
>> >> > > > > > > >> > transaction,  while it will be a *system* stream (or
>> >> > multiple
>> >> > > > > system
>> >> > > > > > > >> > streams) for namespace-level transactions.
>> >> > > > > > > >> > - The transaction code looks like as below:
>> >> > > > > > > >> >
>> >> > > > > > > >> > <code>
>> >> > > > > > > >> >
>> >> > > > > > > >> > Transaction txn = client.transaction();
>> >> > > > > > > >> > Future<DLSN> result1 = txn.write(stream-0, record);
>> >> > > > > > > >> > Future<DLSN> result2 = txn.write(stream-1, record);
>> >> > > > > > > >> > Future<DLSN> result3 = txn.write(stream-2, record);
>> >> > > > > > > >> > Future<Pair<DLSN, DLSN>> result = txn.commit();
>> >> > > > > > > >> >
>> >> > > > > > > >> > </code>
>> >> > > > > > > >> >
>> >> > > > > > > >> > if the txn is committed, all the write futures will be
>> >> > > satisfied
>> >> > > > > > with
>> >> > > > > > > >> their
>> >> > > > > > > >> > written DLSNs. if the txn is aborted, all the write
>> >> futures
>> >> > > will
>> >> > > > > be
>> >> > > > > > > >> failed
>> >> > > > > > > >> > together. there is no partial failure state.
>> >> > > > > > > >> >
>> >> > > > > > > >> > - The actually data flow will be:
>> >> > > > > > > >> >
>> >> > > > > > > >> > 1. writer get a transaction id from the owner of the
>> >> > `commit'
>> >> > > > log
>> >> > > > > > > stream
>> >> > > > > > > >> > 1. write the begin control record (synchronously) with
>> >> the
>> >> > > > > > transaction
>> >> > > > > > > >> id
>> >> > > > > > > >> > 2. for each write within the same txn, it will be
>> >> assigned a
>> >> > > > local
>> >> > > > > > > >> sequence
>> >> > > > > > > >> > number starting from 0. the combination of
>> transaction id
>> >> > and
>> >> > > > > local
>> >> > > > > > > >> > sequence number will be used later on by the readers
>> to
>> >> > > > > de-duplicate
>> >> > > > > > > >> > records.
>> >> > > > > > > >> > 3. the commit/abort control record will be written
>> based
>> >> on
>> >> > > the
>> >> > > > > > > results
>> >> > > > > > > >> > from 2.
>> >> > > > > > > >> >
>> >> > > > > > > >> > - Application can supply a timeout for the transaction
>> >> when
>> >> > > > > > #begin() a
>> >> > > > > > > >> > transaction. The owner of the `commit` log stream can
>> >> abort
>> >> > > > > > > transactions
>> >> > > > > > > >> > that never be committed/aborted within their timeout.
>> >> > > > > > > >> >
>> >> > > > > > > >> > - Failures:
>> >> > > > > > > >> >
>> >> > > > > > > >> > * all the log records can be simply retried as they
>> will
>> >> be
>> >> > > > > > > >> de-duplicated
>> >> > > > > > > >> > probably at the reader side.
>> >> > > > > > > >> >
>> >> > > > > > > >> > - Reader:
>> >> > > > > > > >> >
>> >> > > > > > > >> > * Reader can be configured to read uncommitted
>> records or
>> >> > > > > committed
>> >> > > > > > > >> records
>> >> > > > > > > >> > only (by default read uncommitted records)
>> >> > > > > > > >> > * If reader is configured to read committed records
>> only,
>> >> > the
>> >> > > > read
>> >> > > > > > > ahead
>> >> > > > > > > >> > cache will be changed to maintain one additional
>> pending
>> >> > > > committed
>> >> > > > > > > >> records.
>> >> > > > > > > >> > the pending committed records map is bounded and
>> records
>> >> > will
>> >> > > be
>> >> > > > > > > dropped
>> >> > > > > > > >> > when read ahead is moving.
>> >> > > > > > > >> > * when the reader hits a commit record, it will
>> rewind to
>> >> > the
>> >> > > > > begin
>> >> > > > > > > >> record
>> >> > > > > > > >> > and start reading from there. leveraging the proper
>> read
>> >> > ahead
>> >> > > > > cache
>> >> > > > > > > and
>> >> > > > > > > >> > pending commit records cache, it would be good for
>> both
>> >> > short
>> >> > > > > > > >> transactions
>> >> > > > > > > >> > and long transactions.
>> >> > > > > > > >> >
>> >> > > > > > > >> > - DLSN, SequenceId:
>> >> > > > > > > >> >
>> >> > > > > > > >> > * We will add a fourth field to DLSN. It is `local
>> >> sequence
>> >> > > > > number`
>> >> > > > > > > >> within
>> >> > > > > > > >> > a transaction session. So the new DLSN of records in a
>> >> > > > transaction
>> >> > > > > > > will
>> >> > > > > > > >> be
>> >> > > > > > > >> > the DLSN of commit control record plus its local
>> sequence
>> >> > > > number.
>> >> > > > > > > >> > * The sequence id will be still the position of the
>> >> commit
>> >> > > > record
>> >> > > > > > plus
>> >> > > > > > > >> its
>> >> > > > > > > >> > local sequence number. The position will be advanced
>> with
>> >> > > total
>> >> > > > > > number
>> >> > > > > > > >> of
>> >> > > > > > > >> > written records on writing the commit control record.
>> >> > > > > > > >> >
>> >> > > > > > > >> > - Transaction Group & Namespace Transaction
>> >> > > > > > > >> >
>> >> > > > > > > >> > using one single log stream for namespace transaction
>> can
>> >> > > cause
>> >> > > > > the
>> >> > > > > > > >> > bottleneck problem since all the begin/commit/end
>> control
>> >> > > > records
>> >> > > > > > will
>> >> > > > > > > >> have
>> >> > > > > > > >> > to go through one log stream.
>> >> > > > > > > >> >
>> >> > > > > > > >> > the idea of 'transaction group' is to allow
>> partitioning
>> >> the
>> >> > > > > writers
>> >> > > > > > > >> into
>> >> > > > > > > >> > different transaction groups.
>> >> > > > > > > >> >
>> >> > > > > > > >> > clients can specify the `group-name` when starting the
>> >> > > > > transaction.
>> >> > > > > > if
>> >> > > > > > > >> > there is no `group-name` specified, it will use the
>> >> default
>> >> > > > > `commit`
>> >> > > > > > > >> log in
>> >> > > > > > > >> > the namespace for creating transactions.
>> >> > > > > > > >> >
>> >> > > > > > > >> > -------------------------------------------------
>> >> > > > > > > >> >
>> >> > > > > > > >> > I'd like to collect feedbacks on this idea. Appreciate
>> >> any
>> >> > > > > comments
>> >> > > > > > > and
>> >> > > > > > > >> if
>> >> > > > > > > >> > anyone is also interested in this idea, we'd like to
>> >> > > collaborate
>> >> > > > > > with
>> >> > > > > > > >> the
>> >> > > > > > > >> > community.
>> >> > > > > > > >> >
>> >> > > > > > > >> >
>> >> > > > > > > >> > - Xi
>> >> > > > > > > >> >
>> >> > > > > > > >>
>> >> > > > > > > >
>> >> > > > > > > >
>> >> > > > > > >
>> >> > > > > >
>> >> > > > >
>> >> > > >
>> >> > >
>> >> >
>> >>
>> >
>> >
>>
>
>

Re: [Discuss] Transaction Support

Posted by Sijie Guo <si...@apache.org>.
On Thu, Jan 5, 2017 at 10:56 PM, Xi Liu <xi...@gmail.com> wrote:

> Asko and Sijie,
>
> Thank you so much for your feedbacks.
>
> We are not targeting at building a general XA transaction coordinator. The
> feature we want is be able to write data to multiple log streams in an
> atomic way.
>

So in other words, that means this feature is 'atomic writes across log
streams', no?


>
> I totally agreed with you about building minimal logic. We also don't want
> to enforce this feature to all the users of DL. Building the TC as a
> separated service sounds clear to me. We will do it follow your suggestion.
>
> I am also replying the comments to you and Leigh on the doc. Hopefully we
> can come to an agreement so that our changes can be accepted.
>
> - Xi
>
> On Wed, Jan 4, 2017 at 1:14 AM, Asko Kauppi <as...@zalando.fi>
> wrote:
>
> > > Beside that, I have one general question - What is the major goal for
> > this
> > > feature? Are you targeting on building a general XA transaction
> > coordinator
> > > or just for supporting things like `copy-modify-write' style workflow?
> >
> > The use case I would have for transactions - at some level of the stack -
> > is supporting dynamic configurations.
> >
> > If a config changes in e.g. three lines, some of the changes may
> logically
> > belong together. E.g. changing both “host” and “port” (if separate
> > entries). One shouldn’t be able to read a state, even temporarily, that
> has
> > new host but old port.
> >
> > I can do this in the application level - it does not need to be part of
> > the DL protocol.
> >
> >
> > Asko Kauppi
> > Zalando Tech Helsinki
> >
> > > On 4 Jan 2017, at 9.18, Sijie Guo <si...@apache.org> wrote:
> > >
> > > Sorry for late response. I think Leigh and you already had some very
> > > valuable discussions in the doc. I will try to add some of my questions
> > to
> > > the discussion.
> > >
> > > Beside that, I had a discussion with Leigh today about this. first of
> > all,
> > > I think it is very good to add transaction support in distributedlog.
> It
> > is
> > > one of the primitives that would help building distributed service. But
> > we
> > > have a concern about making this system become complicated and
> introduce
> > > operational overhead when it runs in the large scale system on
> > production.
> > > There are two major suggestions that I have for this feature -
> > >
> > > Build the 'minimum' logic in core - I think the minimum logic that need
> > to
> > > be added to the core is -  the special control records (begin, commit
> and
> > > abort) and make the reader be able to detect those special control
> > records
> > > and know what do they mean and how to interrupt with them. Since they
> are
> > > special control records, there is not overhead to other readers that
> > > doesn't require this feature.
> > >
> > > Build the transaction coordinator as a separated proxy service  - I
> think
> > > the major concern that we have is putting more complexities into the
> > 'write
> > > proxy' service. We architected distributedlog in a more
> microservice-like
> > > way - we have the core as the stream store, the proxy for serving write
> > and
> > > read traffic. It would be good that the transaction feature can be done
> > in
> > > a similar way. So the architecture would be like this -
> > >
> > > *[ write service ] [ read service ] [ transaction coordinator ]*
> > > *[ stream store
> > >            ]*
> > >
> > > if people doesn't need the transaction feature, they can turn if off
> > > completely without any operational overhead.
> > >
> > > Beside that, I have one general question - What is the major goal for
> > this
> > > feature? Are you targeting on building a general XA transaction
> > coordinator
> > > or just for supporting things like `copy-modify-write' style workflow?
> > >
> > >
> > > Thanks,
> > > Sijie
> > >
> > >
> > >
> > >
> > >
> > > On Wed, Dec 28, 2016 at 1:12 PM, Xi Liu <xi...@gmail.com> wrote:
> > >
> > >> Ping?
> > >>
> > >> On Mon, Dec 19, 2016 at 8:28 AM, Xi Liu <xi...@gmail.com> wrote:
> > >>
> > >>> Sijie,
> > >>>
> > >>> No. I thought it might be easier for people to comment on a google
> doc
> > to
> > >>> gather the initial feedback. I will put the content back to wiki page
> > >> once
> > >>> addressing the comments. Does that sound good to you?
> > >>>
> > >>> And thank you in advance.
> > >>>
> > >>> - Xi
> > >>>
> > >>>
> > >>>
> > >>> On Sun, Dec 18, 2016 at 8:48 AM, Sijie Guo <si...@apache.org> wrote:
> > >>>
> > >>>> Hi Xi,
> > >>>>
> > >>>> sorry for late response. I will review it soon.
> > >>>>
> > >>>> regarding this, a separate question "are we going to use google doc
> > >>>> instead
> > >>>> of email thread for any discussion"? I am a bit worried that the
> > >>>> discussion
> > >>>> will become lost after moving to google doc. No idea on how other
> > apache
> > >>>> projects are doing.
> > >>>>
> > >>>> - Sijie
> > >>>>
> > >>>> On Wed, Dec 14, 2016 at 11:41 PM, Xi Liu <xi...@gmail.com>
> > wrote:
> > >>>>
> > >>>>> Hi all,
> > >>>>>
> > >>>>> I finalized the first version of the design. This time I used a
> > google
> > >>>> doc
> > >>>>> so that it is easier for commenting and add a link the wiki page. I
> > >> will
> > >>>>> update this to the wiki page once we come to the finalized design.
> > >>>>>
> > >>>>> https://docs.google.com/document/d/14Ns05M8Z5a6DF6fHmWQwISyD5jjeK
> > >>>>> bSIGgSzXuTI5BA/edit
> > >>>>>
> > >>>>> Let me know if you have any questions. Appreciate your reviews!
> > >>>>>
> > >>>>> - Xi
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> On Fri, Oct 28, 2016 at 7:58 AM, Leigh Stewart
> > >>>>> <lstewart@twitter.com.invalid
> > >>>>>> wrote:
> > >>>>>
> > >>>>>> Interesting proposal. A couple quick notes while you continue to
> > >> flesh
> > >>>>> this
> > >>>>>> out.
> > >>>>>>
> > >>>>>> a. just to be sure - does this eliminate the need to save seqno
> with
> > >>>>>> checkpoint?
> > >>>>>>
> > >>>>>> b. i.e. another way to describe this kind of improvement is
> "support
> > >>>>>> records (atomic writes) larger than 1MB", iiuc. the advantage
> being
> > >> it
> > >>>>>> avoids the baggage of transactions. disadvantages include
> inability
> > >>>> to do
> > >>>>>> cross stream transactions, and flexibility (interleaving, etc)
> (are
> > >>>> there
> > >>>>>> others?).
> > >>>>>>
> > >>>>>> c. proxy use case is for supporting multiple writers - have you
> > >>>> thought
> > >>>>>> about how this would work with multiple writers?
> > >>>>>>
> > >>>>>> Thanks!
> > >>>>>>
> > >>>>>>
> > >>>>>> On Tue, Oct 18, 2016 at 6:45 PM, Sijie Guo
> > >> <sijieg@twitter.com.invalid
> > >>>>>
> > >>>>>> wrote:
> > >>>>>>
> > >>>>>>> Sound good to me. look forward to the detailed proposal.
> > >>>>>>>
> > >>>>>>> (I don't mind the format if it makes things easier to you)
> > >>>>>>>
> > >>>>>>> Sijie
> > >>>>>>>
> > >>>>>>> On Friday, October 14, 2016, Xi Liu <xi...@gmail.com>
> wrote:
> > >>>>>>>
> > >>>>>>>> Thank you, Sijie
> > >>>>>>>>
> > >>>>>>>> We have some internal discussions to sort out some details. We
> > >> are
> > >>>>>> ready
> > >>>>>>> to
> > >>>>>>>> collaborate with the community for adding the transaction
> > >> support
> > >>>> in
> > >>>>>> DL.
> > >>>>>>>> We'd like to share more.
> > >>>>>>>>
> > >>>>>>>> I created a proposal wiki here -
> > >>>>>>>> https://cwiki.apache.org/confluence/display/DL/DP-1+-+
> > >>>>>>>> DistributedLog+Transaction+Support
> > >>>>>>>>
> > >>>>>>>> (I followed KIP format and named it as DP (DistributedLog
> > >>>> Proposal -
> > >>>>> DP
> > >>>>>>> is
> > >>>>>>>> also short for Dynamic Programming). I don't know if you guys
> > >> like
> > >>>>> this
> > >>>>>>>> name or not. Feel free to change it :D)
> > >>>>>>>>
> > >>>>>>>> I basically put my initial email as the content there so far.
> > >>>> Once we
> > >>>>>>>> finished our final discussion, I will update with more details.
> > >> At
> > >>>>> the
> > >>>>>>> same
> > >>>>>>>> time, any comments are welcome.
> > >>>>>>>>
> > >>>>>>>> - Xi
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> On Sat, Oct 8, 2016 at 6:58 AM, Sijie Guo <sijie@apache.org
> > >>>>>>> <javascript:;>>
> > >>>>>>>> wrote:
> > >>>>>>>>
> > >>>>>>>>> Xi,
> > >>>>>>>>>
> > >>>>>>>>> I just granted you the edit permission.
> > >>>>>>>>>
> > >>>>>>>>> - Sijie
> > >>>>>>>>>
> > >>>>>>>>> On Fri, Oct 7, 2016 at 10:34 AM, Xi Liu <xi.liu.ant@gmail.com
> > >>>>>>>> <javascript:;>> wrote:
> > >>>>>>>>>
> > >>>>>>>>>> I still can not edit the wiki. Can any of the pmc members
> > >>>> grant
> > >>>>> me
> > >>>>>>> the
> > >>>>>>>>>> permissions?
> > >>>>>>>>>>
> > >>>>>>>>>> - Xi
> > >>>>>>>>>>
> > >>>>>>>>>> On Sat, Sep 17, 2016 at 10:35 PM, Xi Liu <
> > >>>> xi.liu.ant@gmail.com
> > >>>>>>>> <javascript:;>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> Sijie,
> > >>>>>>>>>>>
> > >>>>>>>>>>> I attempted to create a wiki page under that space. I
> > >> found
> > >>>>> that
> > >>>>>> I
> > >>>>>>> am
> > >>>>>>>>> not
> > >>>>>>>>>>> authorized with edit permission.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Can any of the committers grant me the wiki edit
> > >>>> permission? My
> > >>>>>>>> account
> > >>>>>>>>>> is
> > >>>>>>>>>>> "xi.liu.ant".
> > >>>>>>>>>>>
> > >>>>>>>>>>> - Xi
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Tue, Sep 13, 2016 at 9:26 AM, Sijie Guo <
> > >>>> sijie@apache.org
> > >>>>>>>> <javascript:;>> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>>> This sounds interesting ... I will take a closer look and
> > >>>> give
> > >>>>>> my
> > >>>>>>>>>> comments
> > >>>>>>>>>>>> later.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> At the same time, do you mind creating a wiki page to put
> > >>>> your
> > >>>>>>> idea
> > >>>>>>>>>> there?
> > >>>>>>>>>>>> You can add your wiki page under
> > >>>>>>>>>>>> https://cwiki.apache.org/confluence/display/DL/Project+
> > >>>>>> Proposals
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> You might need to ask in the dev list to grant the wiki
> > >>>> edit
> > >>>>>>>>> permissions
> > >>>>>>>>>>>> to
> > >>>>>>>>>>>> you once you have a wiki account.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> - Sijie
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Mon, Sep 12, 2016 at 2:20 AM, Xi Liu <
> > >>>> xi.liu.ant@gmail.com
> > >>>>>>>> <javascript:;>> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> Hello,
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> I asked the transaction support in distributedlog user
> > >>>> group
> > >>>>>> two
> > >>>>>>>>>> months
> > >>>>>>>>>>>>> ago. I want to raise this up again, as we are looking
> > >> for
> > >>>>>> using
> > >>>>>>>>>>>>> distributedlog for building a transactional data
> > >>>> service. It
> > >>>>>> is
> > >>>>>>> a
> > >>>>>>>>>> major
> > >>>>>>>>>>>>> feature that is missing in distributedlog. We have some
> > >>>>> ideas
> > >>>>>> to
> > >>>>>>>> add
> > >>>>>>>>>>>> this
> > >>>>>>>>>>>>> to distributedlog and want to know if they make sense
> > >> or
> > >>>>> not.
> > >>>>>> If
> > >>>>>>>>> they
> > >>>>>>>>>>>> are
> > >>>>>>>>>>>>> good, we'd like to contribute and develop with the
> > >>>>> community.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Here are the thoughts:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> -------------------------------------------------
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> From our understanding, DL can provide "at-least-once"
> > >>>>>> delivery
> > >>>>>>>>>> semantic
> > >>>>>>>>>>>>> (if not, please correct me) but not "exactly-once"
> > >>>> delivery
> > >>>>>>>>> semantic.
> > >>>>>>>>>>>> That
> > >>>>>>>>>>>>> means that a message can be delivered one or more times
> > >>>> if
> > >>>>> the
> > >>>>>>>>> reader
> > >>>>>>>>>>>>> doesn't handle duplicates.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> The duplicates come from two places, one is at writer
> > >>>> side
> > >>>>>> (this
> > >>>>>>>>>> assumes
> > >>>>>>>>>>>>> using write proxy not the core library), while the
> > >> other
> > >>>> one
> > >>>>>> is
> > >>>>>>> at
> > >>>>>>>>>>>> reader
> > >>>>>>>>>>>>> side.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> - writer side: if the client attempts to write a record
> > >>>> to
> > >>>>> the
> > >>>>>>>> write
> > >>>>>>>>>>>>> proxies and gets a network error (e.g timeouts) then
> > >>>>> retries,
> > >>>>>>> the
> > >>>>>>>>>>>> retrying
> > >>>>>>>>>>>>> will potentially result in duplicates.
> > >>>>>>>>>>>>> - reader side:if the reader reads a message from a
> > >> stream
> > >>>>> and
> > >>>>>>> then
> > >>>>>>>>>>>> crashes,
> > >>>>>>>>>>>>> when the reader restarts it would restart from last
> > >> known
> > >>>>>>> position
> > >>>>>>>>>>>> (DLSN).
> > >>>>>>>>>>>>> If the reader fails after processing a record and
> > >> before
> > >>>>>>> recording
> > >>>>>>>>> the
> > >>>>>>>>>>>>> position, the processed record will be delivered again.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> The reader problem can be properly addressed by making
> > >>>> use
> > >>>>> of
> > >>>>>>> the
> > >>>>>>>>>>>> sequence
> > >>>>>>>>>>>>> numbers of records and doing proper checkpointing. For
> > >>>>>> example,
> > >>>>>>> in
> > >>>>>>>>>>>>> database, it can checkpoint the indexed data with the
> > >>>>> sequence
> > >>>>>>>>> number
> > >>>>>>>>>> of
> > >>>>>>>>>>>>> records; in flink, it can checkpoint the state with the
> > >>>>>> sequence
> > >>>>>>>>>>>> numbers.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> The writer problem can be addressed by implementing an
> > >>>>>>> idempotent
> > >>>>>>>>>>>> writer.
> > >>>>>>>>>>>>> However, an alternative and more powerful approach is
> > >> to
> > >>>>>> support
> > >>>>>>>>>>>>> transactions.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> *What does transaction mean?*
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> A transaction means a collection of records can be
> > >>>> written
> > >>>>>>>>>>>> transactionally
> > >>>>>>>>>>>>> within a stream or across multiple streams. They will
> > >> be
> > >>>>>>> consumed
> > >>>>>>>> by
> > >>>>>>>>>> the
> > >>>>>>>>>>>>> reader together when a transaction is committed, or
> > >> will
> > >>>>> never
> > >>>>>>> be
> > >>>>>>>>>>>> consumed
> > >>>>>>>>>>>>> by the reader when the transaction is aborted.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> The transaction will expose following guarantees:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> - The reader should not be exposed to records written
> > >>>> from
> > >>>>>>>>> uncommitted
> > >>>>>>>>>>>>> transactions (mandatory)
> > >>>>>>>>>>>>> - The reader should consume the records in the
> > >>>> transaction
> > >>>>>>> commit
> > >>>>>>>>>> order
> > >>>>>>>>>>>>> rather than the record written order (mandatory)
> > >>>>>>>>>>>>> - No duplicated records within a transaction
> > >> (mandatory)
> > >>>>>>>>>>>>> - Allow interleaving transactional writes and
> > >>>>>> non-transactional
> > >>>>>>>>> writes
> > >>>>>>>>>>>>> (optional)
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> *Stream Transaction & Namespace Transaction*
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> There will be two types of transaction, one is Stream
> > >>>> level
> > >>>>>>>>>> transaction
> > >>>>>>>>>>>>> (local transaction), while the other one is Namespace
> > >>>> level
> > >>>>>>>>>> transaction
> > >>>>>>>>>>>>> (global transaction).
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> The stream level transaction is a transactional
> > >>>> operation on
> > >>>>>>>> writing
> > >>>>>>>>>>>>> records to one stream; the namespace level transaction
> > >>>> is a
> > >>>>>>>>>>>> transactional
> > >>>>>>>>>>>>> operation on writing records to multiple streams.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> *Implementation Thoughts*
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> - A transaction is consist of begin control record, a
> > >>>> series
> > >>>>>> of
> > >>>>>>>> data
> > >>>>>>>>>>>>> records and commit/abort control record.
> > >>>>>>>>>>>>> - The begin/commit/abort control record is written to a
> > >>>>>> `commit`
> > >>>>>>>> log
> > >>>>>>>>>>>>> stream, while the data records will be written to
> > >> normal
> > >>>>> data
> > >>>>>>> log
> > >>>>>>>>>>>> streams.
> > >>>>>>>>>>>>> - The `commit` log stream will be the same log stream
> > >> for
> > >>>>>>>>> stream-level
> > >>>>>>>>>>>>> transaction,  while it will be a *system* stream (or
> > >>>>> multiple
> > >>>>>>>> system
> > >>>>>>>>>>>>> streams) for namespace-level transactions.
> > >>>>>>>>>>>>> - The transaction code looks like as below:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> <code>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Transaction txn = client.transaction();
> > >>>>>>>>>>>>> Future<DLSN> result1 = txn.write(stream-0, record);
> > >>>>>>>>>>>>> Future<DLSN> result2 = txn.write(stream-1, record);
> > >>>>>>>>>>>>> Future<DLSN> result3 = txn.write(stream-2, record);
> > >>>>>>>>>>>>> Future<Pair<DLSN, DLSN>> result = txn.commit();
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> </code>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> if the txn is committed, all the write futures will be
> > >>>>>> satisfied
> > >>>>>>>>> with
> > >>>>>>>>>>>> their
> > >>>>>>>>>>>>> written DLSNs. if the txn is aborted, all the write
> > >>>> futures
> > >>>>>> will
> > >>>>>>>> be
> > >>>>>>>>>>>> failed
> > >>>>>>>>>>>>> together. there is no partial failure state.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> - The actually data flow will be:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> 1. writer get a transaction id from the owner of the
> > >>>>> `commit'
> > >>>>>>> log
> > >>>>>>>>>> stream
> > >>>>>>>>>>>>> 1. write the begin control record (synchronously) with
> > >>>> the
> > >>>>>>>>> transaction
> > >>>>>>>>>>>> id
> > >>>>>>>>>>>>> 2. for each write within the same txn, it will be
> > >>>> assigned a
> > >>>>>>> local
> > >>>>>>>>>>>> sequence
> > >>>>>>>>>>>>> number starting from 0. the combination of transaction
> > >> id
> > >>>>> and
> > >>>>>>>> local
> > >>>>>>>>>>>>> sequence number will be used later on by the readers to
> > >>>>>>>> de-duplicate
> > >>>>>>>>>>>>> records.
> > >>>>>>>>>>>>> 3. the commit/abort control record will be written
> > >> based
> > >>>> on
> > >>>>>> the
> > >>>>>>>>>> results
> > >>>>>>>>>>>>> from 2.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> - Application can supply a timeout for the transaction
> > >>>> when
> > >>>>>>>>> #begin() a
> > >>>>>>>>>>>>> transaction. The owner of the `commit` log stream can
> > >>>> abort
> > >>>>>>>>>> transactions
> > >>>>>>>>>>>>> that never be committed/aborted within their timeout.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> - Failures:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> * all the log records can be simply retried as they
> > >> will
> > >>>> be
> > >>>>>>>>>>>> de-duplicated
> > >>>>>>>>>>>>> probably at the reader side.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> - Reader:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> * Reader can be configured to read uncommitted records
> > >> or
> > >>>>>>>> committed
> > >>>>>>>>>>>> records
> > >>>>>>>>>>>>> only (by default read uncommitted records)
> > >>>>>>>>>>>>> * If reader is configured to read committed records
> > >> only,
> > >>>>> the
> > >>>>>>> read
> > >>>>>>>>>> ahead
> > >>>>>>>>>>>>> cache will be changed to maintain one additional
> > >> pending
> > >>>>>>> committed
> > >>>>>>>>>>>> records.
> > >>>>>>>>>>>>> the pending committed records map is bounded and
> > >> records
> > >>>>> will
> > >>>>>> be
> > >>>>>>>>>> dropped
> > >>>>>>>>>>>>> when read ahead is moving.
> > >>>>>>>>>>>>> * when the reader hits a commit record, it will rewind
> > >> to
> > >>>>> the
> > >>>>>>>> begin
> > >>>>>>>>>>>> record
> > >>>>>>>>>>>>> and start reading from there. leveraging the proper
> > >> read
> > >>>>> ahead
> > >>>>>>>> cache
> > >>>>>>>>>> and
> > >>>>>>>>>>>>> pending commit records cache, it would be good for both
> > >>>>> short
> > >>>>>>>>>>>> transactions
> > >>>>>>>>>>>>> and long transactions.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> - DLSN, SequenceId:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> * We will add a fourth field to DLSN. It is `local
> > >>>> sequence
> > >>>>>>>> number`
> > >>>>>>>>>>>> within
> > >>>>>>>>>>>>> a transaction session. So the new DLSN of records in a
> > >>>>>>> transaction
> > >>>>>>>>>> will
> > >>>>>>>>>>>> be
> > >>>>>>>>>>>>> the DLSN of commit control record plus its local
> > >> sequence
> > >>>>>>> number.
> > >>>>>>>>>>>>> * The sequence id will be still the position of the
> > >>>> commit
> > >>>>>>> record
> > >>>>>>>>> plus
> > >>>>>>>>>>>> its
> > >>>>>>>>>>>>> local sequence number. The position will be advanced
> > >> with
> > >>>>>> total
> > >>>>>>>>> number
> > >>>>>>>>>>>> of
> > >>>>>>>>>>>>> written records on writing the commit control record.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> - Transaction Group & Namespace Transaction
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> using one single log stream for namespace transaction
> > >> can
> > >>>>>> cause
> > >>>>>>>> the
> > >>>>>>>>>>>>> bottleneck problem since all the begin/commit/end
> > >> control
> > >>>>>>> records
> > >>>>>>>>> will
> > >>>>>>>>>>>> have
> > >>>>>>>>>>>>> to go through one log stream.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> the idea of 'transaction group' is to allow
> > >> partitioning
> > >>>> the
> > >>>>>>>> writers
> > >>>>>>>>>>>> into
> > >>>>>>>>>>>>> different transaction groups.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> clients can specify the `group-name` when starting the
> > >>>>>>>> transaction.
> > >>>>>>>>> if
> > >>>>>>>>>>>>> there is no `group-name` specified, it will use the
> > >>>> default
> > >>>>>>>> `commit`
> > >>>>>>>>>>>> log in
> > >>>>>>>>>>>>> the namespace for creating transactions.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> -------------------------------------------------
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> I'd like to collect feedbacks on this idea. Appreciate
> > >>>> any
> > >>>>>>>> comments
> > >>>>>>>>>> and
> > >>>>>>>>>>>> if
> > >>>>>>>>>>>>> anyone is also interested in this idea, we'd like to
> > >>>>>> collaborate
> > >>>>>>>>> with
> > >>>>>>>>>>>> the
> > >>>>>>>>>>>>> community.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> - Xi
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>
> > >>>>
> > >>>
> > >>>
> > >>
> >
> >
>

Re: [Discuss] Transaction Support

Posted by Asko Kauppi <as...@zalando.fi>.
Xi, I'm curious about:

>be able to write data to multiple log streams in an atomic way.

Do you intend to group such logs together, so that one producer would be
able to "own" them and write to any of them, with other writers waiting
until (s)he's done. I'd like to know more of the use case.


On 6 January 2017 at 08:56, Xi Liu <xi...@gmail.com> wrote:

> Asko and Sijie,
>
> Thank you so much for your feedbacks.
>
> We are not targeting at building a general XA transaction coordinator. The
> feature we want is be able to write data to multiple log streams in an
> atomic way.
>
> I totally agreed with you about building minimal logic. We also don't want
> to enforce this feature to all the users of DL. Building the TC as a
> separated service sounds clear to me. We will do it follow your suggestion.
>
> I am also replying the comments to you and Leigh on the doc. Hopefully we
> can come to an agreement so that our changes can be accepted.
>
> - Xi
>
> On Wed, Jan 4, 2017 at 1:14 AM, Asko Kauppi <as...@zalando.fi>
> wrote:
>
> > > Beside that, I have one general question - What is the major goal for
> > this
> > > feature? Are you targeting on building a general XA transaction
> > coordinator
> > > or just for supporting things like `copy-modify-write' style workflow?
> >
> > The use case I would have for transactions - at some level of the stack -
> > is supporting dynamic configurations.
> >
> > If a config changes in e.g. three lines, some of the changes may
> logically
> > belong together. E.g. changing both “host” and “port” (if separate
> > entries). One shouldn’t be able to read a state, even temporarily, that
> has
> > new host but old port.
> >
> > I can do this in the application level - it does not need to be part of
> > the DL protocol.
> >
> >
> > Asko Kauppi
> > Zalando Tech Helsinki
> >
> > > On 4 Jan 2017, at 9.18, Sijie Guo <si...@apache.org> wrote:
> > >
> > > Sorry for late response. I think Leigh and you already had some very
> > > valuable discussions in the doc. I will try to add some of my questions
> > to
> > > the discussion.
> > >
> > > Beside that, I had a discussion with Leigh today about this. first of
> > all,
> > > I think it is very good to add transaction support in distributedlog.
> It
> > is
> > > one of the primitives that would help building distributed service. But
> > we
> > > have a concern about making this system become complicated and
> introduce
> > > operational overhead when it runs in the large scale system on
> > production.
> > > There are two major suggestions that I have for this feature -
> > >
> > > Build the 'minimum' logic in core - I think the minimum logic that need
> > to
> > > be added to the core is -  the special control records (begin, commit
> and
> > > abort) and make the reader be able to detect those special control
> > records
> > > and know what do they mean and how to interrupt with them. Since they
> are
> > > special control records, there is not overhead to other readers that
> > > doesn't require this feature.
> > >
> > > Build the transaction coordinator as a separated proxy service  - I
> think
> > > the major concern that we have is putting more complexities into the
> > 'write
> > > proxy' service. We architected distributedlog in a more
> microservice-like
> > > way - we have the core as the stream store, the proxy for serving write
> > and
> > > read traffic. It would be good that the transaction feature can be done
> > in
> > > a similar way. So the architecture would be like this -
> > >
> > > *[ write service ] [ read service ] [ transaction coordinator ]*
> > > *[ stream store
> > >            ]*
> > >
> > > if people doesn't need the transaction feature, they can turn if off
> > > completely without any operational overhead.
> > >
> > > Beside that, I have one general question - What is the major goal for
> > this
> > > feature? Are you targeting on building a general XA transaction
> > coordinator
> > > or just for supporting things like `copy-modify-write' style workflow?
> > >
> > >
> > > Thanks,
> > > Sijie
> > >
> > >
> > >
> > >
> > >
> > > On Wed, Dec 28, 2016 at 1:12 PM, Xi Liu <xi...@gmail.com> wrote:
> > >
> > >> Ping?
> > >>
> > >> On Mon, Dec 19, 2016 at 8:28 AM, Xi Liu <xi...@gmail.com> wrote:
> > >>
> > >>> Sijie,
> > >>>
> > >>> No. I thought it might be easier for people to comment on a google
> doc
> > to
> > >>> gather the initial feedback. I will put the content back to wiki page
> > >> once
> > >>> addressing the comments. Does that sound good to you?
> > >>>
> > >>> And thank you in advance.
> > >>>
> > >>> - Xi
> > >>>
> > >>>
> > >>>
> > >>> On Sun, Dec 18, 2016 at 8:48 AM, Sijie Guo <si...@apache.org> wrote:
> > >>>
> > >>>> Hi Xi,
> > >>>>
> > >>>> sorry for late response. I will review it soon.
> > >>>>
> > >>>> regarding this, a separate question "are we going to use google doc
> > >>>> instead
> > >>>> of email thread for any discussion"? I am a bit worried that the
> > >>>> discussion
> > >>>> will become lost after moving to google doc. No idea on how other
> > apache
> > >>>> projects are doing.
> > >>>>
> > >>>> - Sijie
> > >>>>
> > >>>> On Wed, Dec 14, 2016 at 11:41 PM, Xi Liu <xi...@gmail.com>
> > wrote:
> > >>>>
> > >>>>> Hi all,
> > >>>>>
> > >>>>> I finalized the first version of the design. This time I used a
> > google
> > >>>> doc
> > >>>>> so that it is easier for commenting and add a link the wiki page. I
> > >> will
> > >>>>> update this to the wiki page once we come to the finalized design.
> > >>>>>
> > >>>>> https://docs.google.com/document/d/14Ns05M8Z5a6DF6fHmWQwISyD5jjeK
> > >>>>> bSIGgSzXuTI5BA/edit
> > >>>>>
> > >>>>> Let me know if you have any questions. Appreciate your reviews!
> > >>>>>
> > >>>>> - Xi
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> On Fri, Oct 28, 2016 at 7:58 AM, Leigh Stewart
> > >>>>> <lstewart@twitter.com.invalid
> > >>>>>> wrote:
> > >>>>>
> > >>>>>> Interesting proposal. A couple quick notes while you continue to
> > >> flesh
> > >>>>> this
> > >>>>>> out.
> > >>>>>>
> > >>>>>> a. just to be sure - does this eliminate the need to save seqno
> with
> > >>>>>> checkpoint?
> > >>>>>>
> > >>>>>> b. i.e. another way to describe this kind of improvement is
> "support
> > >>>>>> records (atomic writes) larger than 1MB", iiuc. the advantage
> being
> > >> it
> > >>>>>> avoids the baggage of transactions. disadvantages include
> inability
> > >>>> to do
> > >>>>>> cross stream transactions, and flexibility (interleaving, etc)
> (are
> > >>>> there
> > >>>>>> others?).
> > >>>>>>
> > >>>>>> c. proxy use case is for supporting multiple writers - have you
> > >>>> thought
> > >>>>>> about how this would work with multiple writers?
> > >>>>>>
> > >>>>>> Thanks!
> > >>>>>>
> > >>>>>>
> > >>>>>> On Tue, Oct 18, 2016 at 6:45 PM, Sijie Guo
> > >> <sijieg@twitter.com.invalid
> > >>>>>
> > >>>>>> wrote:
> > >>>>>>
> > >>>>>>> Sound good to me. look forward to the detailed proposal.
> > >>>>>>>
> > >>>>>>> (I don't mind the format if it makes things easier to you)
> > >>>>>>>
> > >>>>>>> Sijie
> > >>>>>>>
> > >>>>>>> On Friday, October 14, 2016, Xi Liu <xi...@gmail.com>
> wrote:
> > >>>>>>>
> > >>>>>>>> Thank you, Sijie
> > >>>>>>>>
> > >>>>>>>> We have some internal discussions to sort out some details. We
> > >> are
> > >>>>>> ready
> > >>>>>>> to
> > >>>>>>>> collaborate with the community for adding the transaction
> > >> support
> > >>>> in
> > >>>>>> DL.
> > >>>>>>>> We'd like to share more.
> > >>>>>>>>
> > >>>>>>>> I created a proposal wiki here -
> > >>>>>>>> https://cwiki.apache.org/confluence/display/DL/DP-1+-+
> > >>>>>>>> DistributedLog+Transaction+Support
> > >>>>>>>>
> > >>>>>>>> (I followed KIP format and named it as DP (DistributedLog
> > >>>> Proposal -
> > >>>>> DP
> > >>>>>>> is
> > >>>>>>>> also short for Dynamic Programming). I don't know if you guys
> > >> like
> > >>>>> this
> > >>>>>>>> name or not. Feel free to change it :D)
> > >>>>>>>>
> > >>>>>>>> I basically put my initial email as the content there so far.
> > >>>> Once we
> > >>>>>>>> finished our final discussion, I will update with more details.
> > >> At
> > >>>>> the
> > >>>>>>> same
> > >>>>>>>> time, any comments are welcome.
> > >>>>>>>>
> > >>>>>>>> - Xi
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> On Sat, Oct 8, 2016 at 6:58 AM, Sijie Guo <sijie@apache.org
> > >>>>>>> <javascript:;>>
> > >>>>>>>> wrote:
> > >>>>>>>>
> > >>>>>>>>> Xi,
> > >>>>>>>>>
> > >>>>>>>>> I just granted you the edit permission.
> > >>>>>>>>>
> > >>>>>>>>> - Sijie
> > >>>>>>>>>
> > >>>>>>>>> On Fri, Oct 7, 2016 at 10:34 AM, Xi Liu <xi.liu.ant@gmail.com
> > >>>>>>>> <javascript:;>> wrote:
> > >>>>>>>>>
> > >>>>>>>>>> I still can not edit the wiki. Can any of the pmc members
> > >>>> grant
> > >>>>> me
> > >>>>>>> the
> > >>>>>>>>>> permissions?
> > >>>>>>>>>>
> > >>>>>>>>>> - Xi
> > >>>>>>>>>>
> > >>>>>>>>>> On Sat, Sep 17, 2016 at 10:35 PM, Xi Liu <
> > >>>> xi.liu.ant@gmail.com
> > >>>>>>>> <javascript:;>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> Sijie,
> > >>>>>>>>>>>
> > >>>>>>>>>>> I attempted to create a wiki page under that space. I
> > >> found
> > >>>>> that
> > >>>>>> I
> > >>>>>>> am
> > >>>>>>>>> not
> > >>>>>>>>>>> authorized with edit permission.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Can any of the committers grant me the wiki edit
> > >>>> permission? My
> > >>>>>>>> account
> > >>>>>>>>>> is
> > >>>>>>>>>>> "xi.liu.ant".
> > >>>>>>>>>>>
> > >>>>>>>>>>> - Xi
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Tue, Sep 13, 2016 at 9:26 AM, Sijie Guo <
> > >>>> sijie@apache.org
> > >>>>>>>> <javascript:;>> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>>> This sounds interesting ... I will take a closer look and
> > >>>> give
> > >>>>>> my
> > >>>>>>>>>> comments
> > >>>>>>>>>>>> later.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> At the same time, do you mind creating a wiki page to put
> > >>>> your
> > >>>>>>> idea
> > >>>>>>>>>> there?
> > >>>>>>>>>>>> You can add your wiki page under
> > >>>>>>>>>>>> https://cwiki.apache.org/confluence/display/DL/Project+
> > >>>>>> Proposals
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> You might need to ask in the dev list to grant the wiki
> > >>>> edit
> > >>>>>>>>> permissions
> > >>>>>>>>>>>> to
> > >>>>>>>>>>>> you once you have a wiki account.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> - Sijie
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Mon, Sep 12, 2016 at 2:20 AM, Xi Liu <
> > >>>> xi.liu.ant@gmail.com
> > >>>>>>>> <javascript:;>> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> Hello,
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> I asked the transaction support in distributedlog user
> > >>>> group
> > >>>>>> two
> > >>>>>>>>>> months
> > >>>>>>>>>>>>> ago. I want to raise this up again, as we are looking
> > >> for
> > >>>>>> using
> > >>>>>>>>>>>>> distributedlog for building a transactional data
> > >>>> service. It
> > >>>>>> is
> > >>>>>>> a
> > >>>>>>>>>> major
> > >>>>>>>>>>>>> feature that is missing in distributedlog. We have some
> > >>>>> ideas
> > >>>>>> to
> > >>>>>>>> add
> > >>>>>>>>>>>> this
> > >>>>>>>>>>>>> to distributedlog and want to know if they make sense
> > >> or
> > >>>>> not.
> > >>>>>> If
> > >>>>>>>>> they
> > >>>>>>>>>>>> are
> > >>>>>>>>>>>>> good, we'd like to contribute and develop with the
> > >>>>> community.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Here are the thoughts:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> -------------------------------------------------
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> From our understanding, DL can provide "at-least-once"
> > >>>>>> delivery
> > >>>>>>>>>> semantic
> > >>>>>>>>>>>>> (if not, please correct me) but not "exactly-once"
> > >>>> delivery
> > >>>>>>>>> semantic.
> > >>>>>>>>>>>> That
> > >>>>>>>>>>>>> means that a message can be delivered one or more times
> > >>>> if
> > >>>>> the
> > >>>>>>>>> reader
> > >>>>>>>>>>>>> doesn't handle duplicates.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> The duplicates come from two places, one is at writer
> > >>>> side
> > >>>>>> (this
> > >>>>>>>>>> assumes
> > >>>>>>>>>>>>> using write proxy not the core library), while the
> > >> other
> > >>>> one
> > >>>>>> is
> > >>>>>>> at
> > >>>>>>>>>>>> reader
> > >>>>>>>>>>>>> side.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> - writer side: if the client attempts to write a record
> > >>>> to
> > >>>>> the
> > >>>>>>>> write
> > >>>>>>>>>>>>> proxies and gets a network error (e.g timeouts) then
> > >>>>> retries,
> > >>>>>>> the
> > >>>>>>>>>>>> retrying
> > >>>>>>>>>>>>> will potentially result in duplicates.
> > >>>>>>>>>>>>> - reader side:if the reader reads a message from a
> > >> stream
> > >>>>> and
> > >>>>>>> then
> > >>>>>>>>>>>> crashes,
> > >>>>>>>>>>>>> when the reader restarts it would restart from last
> > >> known
> > >>>>>>> position
> > >>>>>>>>>>>> (DLSN).
> > >>>>>>>>>>>>> If the reader fails after processing a record and
> > >> before
> > >>>>>>> recording
> > >>>>>>>>> the
> > >>>>>>>>>>>>> position, the processed record will be delivered again.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> The reader problem can be properly addressed by making
> > >>>> use
> > >>>>> of
> > >>>>>>> the
> > >>>>>>>>>>>> sequence
> > >>>>>>>>>>>>> numbers of records and doing proper checkpointing. For
> > >>>>>> example,
> > >>>>>>> in
> > >>>>>>>>>>>>> database, it can checkpoint the indexed data with the
> > >>>>> sequence
> > >>>>>>>>> number
> > >>>>>>>>>> of
> > >>>>>>>>>>>>> records; in flink, it can checkpoint the state with the
> > >>>>>> sequence
> > >>>>>>>>>>>> numbers.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> The writer problem can be addressed by implementing an
> > >>>>>>> idempotent
> > >>>>>>>>>>>> writer.
> > >>>>>>>>>>>>> However, an alternative and more powerful approach is
> > >> to
> > >>>>>> support
> > >>>>>>>>>>>>> transactions.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> *What does transaction mean?*
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> A transaction means a collection of records can be
> > >>>> written
> > >>>>>>>>>>>> transactionally
> > >>>>>>>>>>>>> within a stream or across multiple streams. They will
> > >> be
> > >>>>>>> consumed
> > >>>>>>>> by
> > >>>>>>>>>> the
> > >>>>>>>>>>>>> reader together when a transaction is committed, or
> > >> will
> > >>>>> never
> > >>>>>>> be
> > >>>>>>>>>>>> consumed
> > >>>>>>>>>>>>> by the reader when the transaction is aborted.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> The transaction will expose following guarantees:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> - The reader should not be exposed to records written
> > >>>> from
> > >>>>>>>>> uncommitted
> > >>>>>>>>>>>>> transactions (mandatory)
> > >>>>>>>>>>>>> - The reader should consume the records in the
> > >>>> transaction
> > >>>>>>> commit
> > >>>>>>>>>> order
> > >>>>>>>>>>>>> rather than the record written order (mandatory)
> > >>>>>>>>>>>>> - No duplicated records within a transaction
> > >> (mandatory)
> > >>>>>>>>>>>>> - Allow interleaving transactional writes and
> > >>>>>> non-transactional
> > >>>>>>>>> writes
> > >>>>>>>>>>>>> (optional)
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> *Stream Transaction & Namespace Transaction*
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> There will be two types of transaction, one is Stream
> > >>>> level
> > >>>>>>>>>> transaction
> > >>>>>>>>>>>>> (local transaction), while the other one is Namespace
> > >>>> level
> > >>>>>>>>>> transaction
> > >>>>>>>>>>>>> (global transaction).
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> The stream level transaction is a transactional
> > >>>> operation on
> > >>>>>>>> writing
> > >>>>>>>>>>>>> records to one stream; the namespace level transaction
> > >>>> is a
> > >>>>>>>>>>>> transactional
> > >>>>>>>>>>>>> operation on writing records to multiple streams.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> *Implementation Thoughts*
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> - A transaction is consist of begin control record, a
> > >>>> series
> > >>>>>> of
> > >>>>>>>> data
> > >>>>>>>>>>>>> records and commit/abort control record.
> > >>>>>>>>>>>>> - The begin/commit/abort control record is written to a
> > >>>>>> `commit`
> > >>>>>>>> log
> > >>>>>>>>>>>>> stream, while the data records will be written to
> > >> normal
> > >>>>> data
> > >>>>>>> log
> > >>>>>>>>>>>> streams.
> > >>>>>>>>>>>>> - The `commit` log stream will be the same log stream
> > >> for
> > >>>>>>>>> stream-level
> > >>>>>>>>>>>>> transaction,  while it will be a *system* stream (or
> > >>>>> multiple
> > >>>>>>>> system
> > >>>>>>>>>>>>> streams) for namespace-level transactions.
> > >>>>>>>>>>>>> - The transaction code looks like as below:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> <code>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Transaction txn = client.transaction();
> > >>>>>>>>>>>>> Future<DLSN> result1 = txn.write(stream-0, record);
> > >>>>>>>>>>>>> Future<DLSN> result2 = txn.write(stream-1, record);
> > >>>>>>>>>>>>> Future<DLSN> result3 = txn.write(stream-2, record);
> > >>>>>>>>>>>>> Future<Pair<DLSN, DLSN>> result = txn.commit();
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> </code>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> if the txn is committed, all the write futures will be
> > >>>>>> satisfied
> > >>>>>>>>> with
> > >>>>>>>>>>>> their
> > >>>>>>>>>>>>> written DLSNs. if the txn is aborted, all the write
> > >>>> futures
> > >>>>>> will
> > >>>>>>>> be
> > >>>>>>>>>>>> failed
> > >>>>>>>>>>>>> together. there is no partial failure state.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> - The actually data flow will be:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> 1. writer get a transaction id from the owner of the
> > >>>>> `commit'
> > >>>>>>> log
> > >>>>>>>>>> stream
> > >>>>>>>>>>>>> 1. write the begin control record (synchronously) with
> > >>>> the
> > >>>>>>>>> transaction
> > >>>>>>>>>>>> id
> > >>>>>>>>>>>>> 2. for each write within the same txn, it will be
> > >>>> assigned a
> > >>>>>>> local
> > >>>>>>>>>>>> sequence
> > >>>>>>>>>>>>> number starting from 0. the combination of transaction
> > >> id
> > >>>>> and
> > >>>>>>>> local
> > >>>>>>>>>>>>> sequence number will be used later on by the readers to
> > >>>>>>>> de-duplicate
> > >>>>>>>>>>>>> records.
> > >>>>>>>>>>>>> 3. the commit/abort control record will be written
> > >> based
> > >>>> on
> > >>>>>> the
> > >>>>>>>>>> results
> > >>>>>>>>>>>>> from 2.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> - Application can supply a timeout for the transaction
> > >>>> when
> > >>>>>>>>> #begin() a
> > >>>>>>>>>>>>> transaction. The owner of the `commit` log stream can
> > >>>> abort
> > >>>>>>>>>> transactions
> > >>>>>>>>>>>>> that never be committed/aborted within their timeout.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> - Failures:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> * all the log records can be simply retried as they
> > >> will
> > >>>> be
> > >>>>>>>>>>>> de-duplicated
> > >>>>>>>>>>>>> probably at the reader side.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> - Reader:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> * Reader can be configured to read uncommitted records
> > >> or
> > >>>>>>>> committed
> > >>>>>>>>>>>> records
> > >>>>>>>>>>>>> only (by default read uncommitted records)
> > >>>>>>>>>>>>> * If reader is configured to read committed records
> > >> only,
> > >>>>> the
> > >>>>>>> read
> > >>>>>>>>>> ahead
> > >>>>>>>>>>>>> cache will be changed to maintain one additional
> > >> pending
> > >>>>>>> committed
> > >>>>>>>>>>>> records.
> > >>>>>>>>>>>>> the pending committed records map is bounded and
> > >> records
> > >>>>> will
> > >>>>>> be
> > >>>>>>>>>> dropped
> > >>>>>>>>>>>>> when read ahead is moving.
> > >>>>>>>>>>>>> * when the reader hits a commit record, it will rewind
> > >> to
> > >>>>> the
> > >>>>>>>> begin
> > >>>>>>>>>>>> record
> > >>>>>>>>>>>>> and start reading from there. leveraging the proper
> > >> read
> > >>>>> ahead
> > >>>>>>>> cache
> > >>>>>>>>>> and
> > >>>>>>>>>>>>> pending commit records cache, it would be good for both
> > >>>>> short
> > >>>>>>>>>>>> transactions
> > >>>>>>>>>>>>> and long transactions.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> - DLSN, SequenceId:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> * We will add a fourth field to DLSN. It is `local
> > >>>> sequence
> > >>>>>>>> number`
> > >>>>>>>>>>>> within
> > >>>>>>>>>>>>> a transaction session. So the new DLSN of records in a
> > >>>>>>> transaction
> > >>>>>>>>>> will
> > >>>>>>>>>>>> be
> > >>>>>>>>>>>>> the DLSN of commit control record plus its local
> > >> sequence
> > >>>>>>> number.
> > >>>>>>>>>>>>> * The sequence id will be still the position of the
> > >>>> commit
> > >>>>>>> record
> > >>>>>>>>> plus
> > >>>>>>>>>>>> its
> > >>>>>>>>>>>>> local sequence number. The position will be advanced
> > >> with
> > >>>>>> total
> > >>>>>>>>> number
> > >>>>>>>>>>>> of
> > >>>>>>>>>>>>> written records on writing the commit control record.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> - Transaction Group & Namespace Transaction
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> using one single log stream for namespace transaction
> > >> can
> > >>>>>> cause
> > >>>>>>>> the
> > >>>>>>>>>>>>> bottleneck problem since all the begin/commit/end
> > >> control
> > >>>>>>> records
> > >>>>>>>>> will
> > >>>>>>>>>>>> have
> > >>>>>>>>>>>>> to go through one log stream.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> the idea of 'transaction group' is to allow
> > >> partitioning
> > >>>> the
> > >>>>>>>> writers
> > >>>>>>>>>>>> into
> > >>>>>>>>>>>>> different transaction groups.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> clients can specify the `group-name` when starting the
> > >>>>>>>> transaction.
> > >>>>>>>>> if
> > >>>>>>>>>>>>> there is no `group-name` specified, it will use the
> > >>>> default
> > >>>>>>>> `commit`
> > >>>>>>>>>>>> log in
> > >>>>>>>>>>>>> the namespace for creating transactions.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> -------------------------------------------------
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> I'd like to collect feedbacks on this idea. Appreciate
> > >>>> any
> > >>>>>>>> comments
> > >>>>>>>>>> and
> > >>>>>>>>>>>> if
> > >>>>>>>>>>>>> anyone is also interested in this idea, we'd like to
> > >>>>>> collaborate
> > >>>>>>>>> with
> > >>>>>>>>>>>> the
> > >>>>>>>>>>>>> community.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> - Xi
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>
> > >>>>
> > >>>
> > >>>
> > >>
> >
> >
>

Re: [Discuss] Transaction Support

Posted by Xi Liu <xi...@gmail.com>.
Asko and Sijie,

Thank you so much for your feedbacks.

We are not targeting at building a general XA transaction coordinator. The
feature we want is be able to write data to multiple log streams in an
atomic way.

I totally agreed with you about building minimal logic. We also don't want
to enforce this feature to all the users of DL. Building the TC as a
separated service sounds clear to me. We will do it follow your suggestion.

I am also replying the comments to you and Leigh on the doc. Hopefully we
can come to an agreement so that our changes can be accepted.

- Xi

On Wed, Jan 4, 2017 at 1:14 AM, Asko Kauppi <as...@zalando.fi> wrote:

> > Beside that, I have one general question - What is the major goal for
> this
> > feature? Are you targeting on building a general XA transaction
> coordinator
> > or just for supporting things like `copy-modify-write' style workflow?
>
> The use case I would have for transactions - at some level of the stack -
> is supporting dynamic configurations.
>
> If a config changes in e.g. three lines, some of the changes may logically
> belong together. E.g. changing both “host” and “port” (if separate
> entries). One shouldn’t be able to read a state, even temporarily, that has
> new host but old port.
>
> I can do this in the application level - it does not need to be part of
> the DL protocol.
>
>
> Asko Kauppi
> Zalando Tech Helsinki
>
> > On 4 Jan 2017, at 9.18, Sijie Guo <si...@apache.org> wrote:
> >
> > Sorry for late response. I think Leigh and you already had some very
> > valuable discussions in the doc. I will try to add some of my questions
> to
> > the discussion.
> >
> > Beside that, I had a discussion with Leigh today about this. first of
> all,
> > I think it is very good to add transaction support in distributedlog. It
> is
> > one of the primitives that would help building distributed service. But
> we
> > have a concern about making this system become complicated and introduce
> > operational overhead when it runs in the large scale system on
> production.
> > There are two major suggestions that I have for this feature -
> >
> > Build the 'minimum' logic in core - I think the minimum logic that need
> to
> > be added to the core is -  the special control records (begin, commit and
> > abort) and make the reader be able to detect those special control
> records
> > and know what do they mean and how to interrupt with them. Since they are
> > special control records, there is not overhead to other readers that
> > doesn't require this feature.
> >
> > Build the transaction coordinator as a separated proxy service  - I think
> > the major concern that we have is putting more complexities into the
> 'write
> > proxy' service. We architected distributedlog in a more microservice-like
> > way - we have the core as the stream store, the proxy for serving write
> and
> > read traffic. It would be good that the transaction feature can be done
> in
> > a similar way. So the architecture would be like this -
> >
> > *[ write service ] [ read service ] [ transaction coordinator ]*
> > *[ stream store
> >            ]*
> >
> > if people doesn't need the transaction feature, they can turn if off
> > completely without any operational overhead.
> >
> > Beside that, I have one general question - What is the major goal for
> this
> > feature? Are you targeting on building a general XA transaction
> coordinator
> > or just for supporting things like `copy-modify-write' style workflow?
> >
> >
> > Thanks,
> > Sijie
> >
> >
> >
> >
> >
> > On Wed, Dec 28, 2016 at 1:12 PM, Xi Liu <xi...@gmail.com> wrote:
> >
> >> Ping?
> >>
> >> On Mon, Dec 19, 2016 at 8:28 AM, Xi Liu <xi...@gmail.com> wrote:
> >>
> >>> Sijie,
> >>>
> >>> No. I thought it might be easier for people to comment on a google doc
> to
> >>> gather the initial feedback. I will put the content back to wiki page
> >> once
> >>> addressing the comments. Does that sound good to you?
> >>>
> >>> And thank you in advance.
> >>>
> >>> - Xi
> >>>
> >>>
> >>>
> >>> On Sun, Dec 18, 2016 at 8:48 AM, Sijie Guo <si...@apache.org> wrote:
> >>>
> >>>> Hi Xi,
> >>>>
> >>>> sorry for late response. I will review it soon.
> >>>>
> >>>> regarding this, a separate question "are we going to use google doc
> >>>> instead
> >>>> of email thread for any discussion"? I am a bit worried that the
> >>>> discussion
> >>>> will become lost after moving to google doc. No idea on how other
> apache
> >>>> projects are doing.
> >>>>
> >>>> - Sijie
> >>>>
> >>>> On Wed, Dec 14, 2016 at 11:41 PM, Xi Liu <xi...@gmail.com>
> wrote:
> >>>>
> >>>>> Hi all,
> >>>>>
> >>>>> I finalized the first version of the design. This time I used a
> google
> >>>> doc
> >>>>> so that it is easier for commenting and add a link the wiki page. I
> >> will
> >>>>> update this to the wiki page once we come to the finalized design.
> >>>>>
> >>>>> https://docs.google.com/document/d/14Ns05M8Z5a6DF6fHmWQwISyD5jjeK
> >>>>> bSIGgSzXuTI5BA/edit
> >>>>>
> >>>>> Let me know if you have any questions. Appreciate your reviews!
> >>>>>
> >>>>> - Xi
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Fri, Oct 28, 2016 at 7:58 AM, Leigh Stewart
> >>>>> <lstewart@twitter.com.invalid
> >>>>>> wrote:
> >>>>>
> >>>>>> Interesting proposal. A couple quick notes while you continue to
> >> flesh
> >>>>> this
> >>>>>> out.
> >>>>>>
> >>>>>> a. just to be sure - does this eliminate the need to save seqno with
> >>>>>> checkpoint?
> >>>>>>
> >>>>>> b. i.e. another way to describe this kind of improvement is "support
> >>>>>> records (atomic writes) larger than 1MB", iiuc. the advantage being
> >> it
> >>>>>> avoids the baggage of transactions. disadvantages include inability
> >>>> to do
> >>>>>> cross stream transactions, and flexibility (interleaving, etc) (are
> >>>> there
> >>>>>> others?).
> >>>>>>
> >>>>>> c. proxy use case is for supporting multiple writers - have you
> >>>> thought
> >>>>>> about how this would work with multiple writers?
> >>>>>>
> >>>>>> Thanks!
> >>>>>>
> >>>>>>
> >>>>>> On Tue, Oct 18, 2016 at 6:45 PM, Sijie Guo
> >> <sijieg@twitter.com.invalid
> >>>>>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Sound good to me. look forward to the detailed proposal.
> >>>>>>>
> >>>>>>> (I don't mind the format if it makes things easier to you)
> >>>>>>>
> >>>>>>> Sijie
> >>>>>>>
> >>>>>>> On Friday, October 14, 2016, Xi Liu <xi...@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> Thank you, Sijie
> >>>>>>>>
> >>>>>>>> We have some internal discussions to sort out some details. We
> >> are
> >>>>>> ready
> >>>>>>> to
> >>>>>>>> collaborate with the community for adding the transaction
> >> support
> >>>> in
> >>>>>> DL.
> >>>>>>>> We'd like to share more.
> >>>>>>>>
> >>>>>>>> I created a proposal wiki here -
> >>>>>>>> https://cwiki.apache.org/confluence/display/DL/DP-1+-+
> >>>>>>>> DistributedLog+Transaction+Support
> >>>>>>>>
> >>>>>>>> (I followed KIP format and named it as DP (DistributedLog
> >>>> Proposal -
> >>>>> DP
> >>>>>>> is
> >>>>>>>> also short for Dynamic Programming). I don't know if you guys
> >> like
> >>>>> this
> >>>>>>>> name or not. Feel free to change it :D)
> >>>>>>>>
> >>>>>>>> I basically put my initial email as the content there so far.
> >>>> Once we
> >>>>>>>> finished our final discussion, I will update with more details.
> >> At
> >>>>> the
> >>>>>>> same
> >>>>>>>> time, any comments are welcome.
> >>>>>>>>
> >>>>>>>> - Xi
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Sat, Oct 8, 2016 at 6:58 AM, Sijie Guo <sijie@apache.org
> >>>>>>> <javascript:;>>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Xi,
> >>>>>>>>>
> >>>>>>>>> I just granted you the edit permission.
> >>>>>>>>>
> >>>>>>>>> - Sijie
> >>>>>>>>>
> >>>>>>>>> On Fri, Oct 7, 2016 at 10:34 AM, Xi Liu <xi.liu.ant@gmail.com
> >>>>>>>> <javascript:;>> wrote:
> >>>>>>>>>
> >>>>>>>>>> I still can not edit the wiki. Can any of the pmc members
> >>>> grant
> >>>>> me
> >>>>>>> the
> >>>>>>>>>> permissions?
> >>>>>>>>>>
> >>>>>>>>>> - Xi
> >>>>>>>>>>
> >>>>>>>>>> On Sat, Sep 17, 2016 at 10:35 PM, Xi Liu <
> >>>> xi.liu.ant@gmail.com
> >>>>>>>> <javascript:;>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Sijie,
> >>>>>>>>>>>
> >>>>>>>>>>> I attempted to create a wiki page under that space. I
> >> found
> >>>>> that
> >>>>>> I
> >>>>>>> am
> >>>>>>>>> not
> >>>>>>>>>>> authorized with edit permission.
> >>>>>>>>>>>
> >>>>>>>>>>> Can any of the committers grant me the wiki edit
> >>>> permission? My
> >>>>>>>> account
> >>>>>>>>>> is
> >>>>>>>>>>> "xi.liu.ant".
> >>>>>>>>>>>
> >>>>>>>>>>> - Xi
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Tue, Sep 13, 2016 at 9:26 AM, Sijie Guo <
> >>>> sijie@apache.org
> >>>>>>>> <javascript:;>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> This sounds interesting ... I will take a closer look and
> >>>> give
> >>>>>> my
> >>>>>>>>>> comments
> >>>>>>>>>>>> later.
> >>>>>>>>>>>>
> >>>>>>>>>>>> At the same time, do you mind creating a wiki page to put
> >>>> your
> >>>>>>> idea
> >>>>>>>>>> there?
> >>>>>>>>>>>> You can add your wiki page under
> >>>>>>>>>>>> https://cwiki.apache.org/confluence/display/DL/Project+
> >>>>>> Proposals
> >>>>>>>>>>>>
> >>>>>>>>>>>> You might need to ask in the dev list to grant the wiki
> >>>> edit
> >>>>>>>>> permissions
> >>>>>>>>>>>> to
> >>>>>>>>>>>> you once you have a wiki account.
> >>>>>>>>>>>>
> >>>>>>>>>>>> - Sijie
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Mon, Sep 12, 2016 at 2:20 AM, Xi Liu <
> >>>> xi.liu.ant@gmail.com
> >>>>>>>> <javascript:;>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Hello,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I asked the transaction support in distributedlog user
> >>>> group
> >>>>>> two
> >>>>>>>>>> months
> >>>>>>>>>>>>> ago. I want to raise this up again, as we are looking
> >> for
> >>>>>> using
> >>>>>>>>>>>>> distributedlog for building a transactional data
> >>>> service. It
> >>>>>> is
> >>>>>>> a
> >>>>>>>>>> major
> >>>>>>>>>>>>> feature that is missing in distributedlog. We have some
> >>>>> ideas
> >>>>>> to
> >>>>>>>> add
> >>>>>>>>>>>> this
> >>>>>>>>>>>>> to distributedlog and want to know if they make sense
> >> or
> >>>>> not.
> >>>>>> If
> >>>>>>>>> they
> >>>>>>>>>>>> are
> >>>>>>>>>>>>> good, we'd like to contribute and develop with the
> >>>>> community.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Here are the thoughts:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> -------------------------------------------------
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> From our understanding, DL can provide "at-least-once"
> >>>>>> delivery
> >>>>>>>>>> semantic
> >>>>>>>>>>>>> (if not, please correct me) but not "exactly-once"
> >>>> delivery
> >>>>>>>>> semantic.
> >>>>>>>>>>>> That
> >>>>>>>>>>>>> means that a message can be delivered one or more times
> >>>> if
> >>>>> the
> >>>>>>>>> reader
> >>>>>>>>>>>>> doesn't handle duplicates.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The duplicates come from two places, one is at writer
> >>>> side
> >>>>>> (this
> >>>>>>>>>> assumes
> >>>>>>>>>>>>> using write proxy not the core library), while the
> >> other
> >>>> one
> >>>>>> is
> >>>>>>> at
> >>>>>>>>>>>> reader
> >>>>>>>>>>>>> side.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> - writer side: if the client attempts to write a record
> >>>> to
> >>>>> the
> >>>>>>>> write
> >>>>>>>>>>>>> proxies and gets a network error (e.g timeouts) then
> >>>>> retries,
> >>>>>>> the
> >>>>>>>>>>>> retrying
> >>>>>>>>>>>>> will potentially result in duplicates.
> >>>>>>>>>>>>> - reader side:if the reader reads a message from a
> >> stream
> >>>>> and
> >>>>>>> then
> >>>>>>>>>>>> crashes,
> >>>>>>>>>>>>> when the reader restarts it would restart from last
> >> known
> >>>>>>> position
> >>>>>>>>>>>> (DLSN).
> >>>>>>>>>>>>> If the reader fails after processing a record and
> >> before
> >>>>>>> recording
> >>>>>>>>> the
> >>>>>>>>>>>>> position, the processed record will be delivered again.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The reader problem can be properly addressed by making
> >>>> use
> >>>>> of
> >>>>>>> the
> >>>>>>>>>>>> sequence
> >>>>>>>>>>>>> numbers of records and doing proper checkpointing. For
> >>>>>> example,
> >>>>>>> in
> >>>>>>>>>>>>> database, it can checkpoint the indexed data with the
> >>>>> sequence
> >>>>>>>>> number
> >>>>>>>>>> of
> >>>>>>>>>>>>> records; in flink, it can checkpoint the state with the
> >>>>>> sequence
> >>>>>>>>>>>> numbers.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The writer problem can be addressed by implementing an
> >>>>>>> idempotent
> >>>>>>>>>>>> writer.
> >>>>>>>>>>>>> However, an alternative and more powerful approach is
> >> to
> >>>>>> support
> >>>>>>>>>>>>> transactions.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> *What does transaction mean?*
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> A transaction means a collection of records can be
> >>>> written
> >>>>>>>>>>>> transactionally
> >>>>>>>>>>>>> within a stream or across multiple streams. They will
> >> be
> >>>>>>> consumed
> >>>>>>>> by
> >>>>>>>>>> the
> >>>>>>>>>>>>> reader together when a transaction is committed, or
> >> will
> >>>>> never
> >>>>>>> be
> >>>>>>>>>>>> consumed
> >>>>>>>>>>>>> by the reader when the transaction is aborted.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The transaction will expose following guarantees:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> - The reader should not be exposed to records written
> >>>> from
> >>>>>>>>> uncommitted
> >>>>>>>>>>>>> transactions (mandatory)
> >>>>>>>>>>>>> - The reader should consume the records in the
> >>>> transaction
> >>>>>>> commit
> >>>>>>>>>> order
> >>>>>>>>>>>>> rather than the record written order (mandatory)
> >>>>>>>>>>>>> - No duplicated records within a transaction
> >> (mandatory)
> >>>>>>>>>>>>> - Allow interleaving transactional writes and
> >>>>>> non-transactional
> >>>>>>>>> writes
> >>>>>>>>>>>>> (optional)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> *Stream Transaction & Namespace Transaction*
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> There will be two types of transaction, one is Stream
> >>>> level
> >>>>>>>>>> transaction
> >>>>>>>>>>>>> (local transaction), while the other one is Namespace
> >>>> level
> >>>>>>>>>> transaction
> >>>>>>>>>>>>> (global transaction).
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The stream level transaction is a transactional
> >>>> operation on
> >>>>>>>> writing
> >>>>>>>>>>>>> records to one stream; the namespace level transaction
> >>>> is a
> >>>>>>>>>>>> transactional
> >>>>>>>>>>>>> operation on writing records to multiple streams.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> *Implementation Thoughts*
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> - A transaction is consist of begin control record, a
> >>>> series
> >>>>>> of
> >>>>>>>> data
> >>>>>>>>>>>>> records and commit/abort control record.
> >>>>>>>>>>>>> - The begin/commit/abort control record is written to a
> >>>>>> `commit`
> >>>>>>>> log
> >>>>>>>>>>>>> stream, while the data records will be written to
> >> normal
> >>>>> data
> >>>>>>> log
> >>>>>>>>>>>> streams.
> >>>>>>>>>>>>> - The `commit` log stream will be the same log stream
> >> for
> >>>>>>>>> stream-level
> >>>>>>>>>>>>> transaction,  while it will be a *system* stream (or
> >>>>> multiple
> >>>>>>>> system
> >>>>>>>>>>>>> streams) for namespace-level transactions.
> >>>>>>>>>>>>> - The transaction code looks like as below:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> <code>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Transaction txn = client.transaction();
> >>>>>>>>>>>>> Future<DLSN> result1 = txn.write(stream-0, record);
> >>>>>>>>>>>>> Future<DLSN> result2 = txn.write(stream-1, record);
> >>>>>>>>>>>>> Future<DLSN> result3 = txn.write(stream-2, record);
> >>>>>>>>>>>>> Future<Pair<DLSN, DLSN>> result = txn.commit();
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> </code>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> if the txn is committed, all the write futures will be
> >>>>>> satisfied
> >>>>>>>>> with
> >>>>>>>>>>>> their
> >>>>>>>>>>>>> written DLSNs. if the txn is aborted, all the write
> >>>> futures
> >>>>>> will
> >>>>>>>> be
> >>>>>>>>>>>> failed
> >>>>>>>>>>>>> together. there is no partial failure state.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> - The actually data flow will be:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 1. writer get a transaction id from the owner of the
> >>>>> `commit'
> >>>>>>> log
> >>>>>>>>>> stream
> >>>>>>>>>>>>> 1. write the begin control record (synchronously) with
> >>>> the
> >>>>>>>>> transaction
> >>>>>>>>>>>> id
> >>>>>>>>>>>>> 2. for each write within the same txn, it will be
> >>>> assigned a
> >>>>>>> local
> >>>>>>>>>>>> sequence
> >>>>>>>>>>>>> number starting from 0. the combination of transaction
> >> id
> >>>>> and
> >>>>>>>> local
> >>>>>>>>>>>>> sequence number will be used later on by the readers to
> >>>>>>>> de-duplicate
> >>>>>>>>>>>>> records.
> >>>>>>>>>>>>> 3. the commit/abort control record will be written
> >> based
> >>>> on
> >>>>>> the
> >>>>>>>>>> results
> >>>>>>>>>>>>> from 2.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> - Application can supply a timeout for the transaction
> >>>> when
> >>>>>>>>> #begin() a
> >>>>>>>>>>>>> transaction. The owner of the `commit` log stream can
> >>>> abort
> >>>>>>>>>> transactions
> >>>>>>>>>>>>> that never be committed/aborted within their timeout.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> - Failures:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> * all the log records can be simply retried as they
> >> will
> >>>> be
> >>>>>>>>>>>> de-duplicated
> >>>>>>>>>>>>> probably at the reader side.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> - Reader:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> * Reader can be configured to read uncommitted records
> >> or
> >>>>>>>> committed
> >>>>>>>>>>>> records
> >>>>>>>>>>>>> only (by default read uncommitted records)
> >>>>>>>>>>>>> * If reader is configured to read committed records
> >> only,
> >>>>> the
> >>>>>>> read
> >>>>>>>>>> ahead
> >>>>>>>>>>>>> cache will be changed to maintain one additional
> >> pending
> >>>>>>> committed
> >>>>>>>>>>>> records.
> >>>>>>>>>>>>> the pending committed records map is bounded and
> >> records
> >>>>> will
> >>>>>> be
> >>>>>>>>>> dropped
> >>>>>>>>>>>>> when read ahead is moving.
> >>>>>>>>>>>>> * when the reader hits a commit record, it will rewind
> >> to
> >>>>> the
> >>>>>>>> begin
> >>>>>>>>>>>> record
> >>>>>>>>>>>>> and start reading from there. leveraging the proper
> >> read
> >>>>> ahead
> >>>>>>>> cache
> >>>>>>>>>> and
> >>>>>>>>>>>>> pending commit records cache, it would be good for both
> >>>>> short
> >>>>>>>>>>>> transactions
> >>>>>>>>>>>>> and long transactions.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> - DLSN, SequenceId:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> * We will add a fourth field to DLSN. It is `local
> >>>> sequence
> >>>>>>>> number`
> >>>>>>>>>>>> within
> >>>>>>>>>>>>> a transaction session. So the new DLSN of records in a
> >>>>>>> transaction
> >>>>>>>>>> will
> >>>>>>>>>>>> be
> >>>>>>>>>>>>> the DLSN of commit control record plus its local
> >> sequence
> >>>>>>> number.
> >>>>>>>>>>>>> * The sequence id will be still the position of the
> >>>> commit
> >>>>>>> record
> >>>>>>>>> plus
> >>>>>>>>>>>> its
> >>>>>>>>>>>>> local sequence number. The position will be advanced
> >> with
> >>>>>> total
> >>>>>>>>> number
> >>>>>>>>>>>> of
> >>>>>>>>>>>>> written records on writing the commit control record.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> - Transaction Group & Namespace Transaction
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> using one single log stream for namespace transaction
> >> can
> >>>>>> cause
> >>>>>>>> the
> >>>>>>>>>>>>> bottleneck problem since all the begin/commit/end
> >> control
> >>>>>>> records
> >>>>>>>>> will
> >>>>>>>>>>>> have
> >>>>>>>>>>>>> to go through one log stream.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> the idea of 'transaction group' is to allow
> >> partitioning
> >>>> the
> >>>>>>>> writers
> >>>>>>>>>>>> into
> >>>>>>>>>>>>> different transaction groups.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> clients can specify the `group-name` when starting the
> >>>>>>>> transaction.
> >>>>>>>>> if
> >>>>>>>>>>>>> there is no `group-name` specified, it will use the
> >>>> default
> >>>>>>>> `commit`
> >>>>>>>>>>>> log in
> >>>>>>>>>>>>> the namespace for creating transactions.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> -------------------------------------------------
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I'd like to collect feedbacks on this idea. Appreciate
> >>>> any
> >>>>>>>> comments
> >>>>>>>>>> and
> >>>>>>>>>>>> if
> >>>>>>>>>>>>> anyone is also interested in this idea, we'd like to
> >>>>>> collaborate
> >>>>>>>>> with
> >>>>>>>>>>>> the
> >>>>>>>>>>>>> community.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> - Xi
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>>
> >>
>
>

Re: [Discuss] Transaction Support

Posted by Sijie Guo <si...@apache.org>.
On Wed, Jan 4, 2017 at 1:14 AM, Asko Kauppi <as...@zalando.fi> wrote:

> > Beside that, I have one general question - What is the major goal for
> this
> > feature? Are you targeting on building a general XA transaction
> coordinator
> > or just for supporting things like `copy-modify-write' style workflow?
>
> The use case I would have for transactions - at some level of the stack -
> is supporting dynamic configurations.
>
> If a config changes in e.g. three lines, some of the changes may logically
> belong together. E.g. changing both “host” and “port” (if separate
> entries). One shouldn’t be able to read a state, even temporarily, that has
> new host but old port.
>
> I can do this in the application level - it does not need to be part of
> the DL protocol.
>

Yeah, I can see 'transaction' as a large atomic write in DL is very useful.
Currently DL limits the record size to 1MB. If people wants to write a
record larger than 1MB, it will potentially produce a `partial` write if
application breaks their record into multiple records.

Your use case falls into this category.

I think the minimal support is large atomic write. That is good enough for
most of the log use cases.

Having a separated TC (transaction coordinator) is cool. but it can be an
opt-in solution.



>
>
> Asko Kauppi
> Zalando Tech Helsinki
>
> > On 4 Jan 2017, at 9.18, Sijie Guo <si...@apache.org> wrote:
> >
> > Sorry for late response. I think Leigh and you already had some very
> > valuable discussions in the doc. I will try to add some of my questions
> to
> > the discussion.
> >
> > Beside that, I had a discussion with Leigh today about this. first of
> all,
> > I think it is very good to add transaction support in distributedlog. It
> is
> > one of the primitives that would help building distributed service. But
> we
> > have a concern about making this system become complicated and introduce
> > operational overhead when it runs in the large scale system on
> production.
> > There are two major suggestions that I have for this feature -
> >
> > Build the 'minimum' logic in core - I think the minimum logic that need
> to
> > be added to the core is -  the special control records (begin, commit and
> > abort) and make the reader be able to detect those special control
> records
> > and know what do they mean and how to interrupt with them. Since they are
> > special control records, there is not overhead to other readers that
> > doesn't require this feature.
> >
> > Build the transaction coordinator as a separated proxy service  - I think
> > the major concern that we have is putting more complexities into the
> 'write
> > proxy' service. We architected distributedlog in a more microservice-like
> > way - we have the core as the stream store, the proxy for serving write
> and
> > read traffic. It would be good that the transaction feature can be done
> in
> > a similar way. So the architecture would be like this -
> >
> > *[ write service ] [ read service ] [ transaction coordinator ]*
> > *[ stream store
> >            ]*
> >
> > if people doesn't need the transaction feature, they can turn if off
> > completely without any operational overhead.
> >
> > Beside that, I have one general question - What is the major goal for
> this
> > feature? Are you targeting on building a general XA transaction
> coordinator
> > or just for supporting things like `copy-modify-write' style workflow?
> >
> >
> > Thanks,
> > Sijie
> >
> >
> >
> >
> >
> > On Wed, Dec 28, 2016 at 1:12 PM, Xi Liu <xi...@gmail.com> wrote:
> >
> >> Ping?
> >>
> >> On Mon, Dec 19, 2016 at 8:28 AM, Xi Liu <xi...@gmail.com> wrote:
> >>
> >>> Sijie,
> >>>
> >>> No. I thought it might be easier for people to comment on a google doc
> to
> >>> gather the initial feedback. I will put the content back to wiki page
> >> once
> >>> addressing the comments. Does that sound good to you?
> >>>
> >>> And thank you in advance.
> >>>
> >>> - Xi
> >>>
> >>>
> >>>
> >>> On Sun, Dec 18, 2016 at 8:48 AM, Sijie Guo <si...@apache.org> wrote:
> >>>
> >>>> Hi Xi,
> >>>>
> >>>> sorry for late response. I will review it soon.
> >>>>
> >>>> regarding this, a separate question "are we going to use google doc
> >>>> instead
> >>>> of email thread for any discussion"? I am a bit worried that the
> >>>> discussion
> >>>> will become lost after moving to google doc. No idea on how other
> apache
> >>>> projects are doing.
> >>>>
> >>>> - Sijie
> >>>>
> >>>> On Wed, Dec 14, 2016 at 11:41 PM, Xi Liu <xi...@gmail.com>
> wrote:
> >>>>
> >>>>> Hi all,
> >>>>>
> >>>>> I finalized the first version of the design. This time I used a
> google
> >>>> doc
> >>>>> so that it is easier for commenting and add a link the wiki page. I
> >> will
> >>>>> update this to the wiki page once we come to the finalized design.
> >>>>>
> >>>>> https://docs.google.com/document/d/14Ns05M8Z5a6DF6fHmWQwISyD5jjeK
> >>>>> bSIGgSzXuTI5BA/edit
> >>>>>
> >>>>> Let me know if you have any questions. Appreciate your reviews!
> >>>>>
> >>>>> - Xi
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Fri, Oct 28, 2016 at 7:58 AM, Leigh Stewart
> >>>>> <lstewart@twitter.com.invalid
> >>>>>> wrote:
> >>>>>
> >>>>>> Interesting proposal. A couple quick notes while you continue to
> >> flesh
> >>>>> this
> >>>>>> out.
> >>>>>>
> >>>>>> a. just to be sure - does this eliminate the need to save seqno with
> >>>>>> checkpoint?
> >>>>>>
> >>>>>> b. i.e. another way to describe this kind of improvement is "support
> >>>>>> records (atomic writes) larger than 1MB", iiuc. the advantage being
> >> it
> >>>>>> avoids the baggage of transactions. disadvantages include inability
> >>>> to do
> >>>>>> cross stream transactions, and flexibility (interleaving, etc) (are
> >>>> there
> >>>>>> others?).
> >>>>>>
> >>>>>> c. proxy use case is for supporting multiple writers - have you
> >>>> thought
> >>>>>> about how this would work with multiple writers?
> >>>>>>
> >>>>>> Thanks!
> >>>>>>
> >>>>>>
> >>>>>> On Tue, Oct 18, 2016 at 6:45 PM, Sijie Guo
> >> <sijieg@twitter.com.invalid
> >>>>>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Sound good to me. look forward to the detailed proposal.
> >>>>>>>
> >>>>>>> (I don't mind the format if it makes things easier to you)
> >>>>>>>
> >>>>>>> Sijie
> >>>>>>>
> >>>>>>> On Friday, October 14, 2016, Xi Liu <xi...@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> Thank you, Sijie
> >>>>>>>>
> >>>>>>>> We have some internal discussions to sort out some details. We
> >> are
> >>>>>> ready
> >>>>>>> to
> >>>>>>>> collaborate with the community for adding the transaction
> >> support
> >>>> in
> >>>>>> DL.
> >>>>>>>> We'd like to share more.
> >>>>>>>>
> >>>>>>>> I created a proposal wiki here -
> >>>>>>>> https://cwiki.apache.org/confluence/display/DL/DP-1+-+
> >>>>>>>> DistributedLog+Transaction+Support
> >>>>>>>>
> >>>>>>>> (I followed KIP format and named it as DP (DistributedLog
> >>>> Proposal -
> >>>>> DP
> >>>>>>> is
> >>>>>>>> also short for Dynamic Programming). I don't know if you guys
> >> like
> >>>>> this
> >>>>>>>> name or not. Feel free to change it :D)
> >>>>>>>>
> >>>>>>>> I basically put my initial email as the content there so far.
> >>>> Once we
> >>>>>>>> finished our final discussion, I will update with more details.
> >> At
> >>>>> the
> >>>>>>> same
> >>>>>>>> time, any comments are welcome.
> >>>>>>>>
> >>>>>>>> - Xi
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Sat, Oct 8, 2016 at 6:58 AM, Sijie Guo <sijie@apache.org
> >>>>>>> <javascript:;>>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Xi,
> >>>>>>>>>
> >>>>>>>>> I just granted you the edit permission.
> >>>>>>>>>
> >>>>>>>>> - Sijie
> >>>>>>>>>
> >>>>>>>>> On Fri, Oct 7, 2016 at 10:34 AM, Xi Liu <xi.liu.ant@gmail.com
> >>>>>>>> <javascript:;>> wrote:
> >>>>>>>>>
> >>>>>>>>>> I still can not edit the wiki. Can any of the pmc members
> >>>> grant
> >>>>> me
> >>>>>>> the
> >>>>>>>>>> permissions?
> >>>>>>>>>>
> >>>>>>>>>> - Xi
> >>>>>>>>>>
> >>>>>>>>>> On Sat, Sep 17, 2016 at 10:35 PM, Xi Liu <
> >>>> xi.liu.ant@gmail.com
> >>>>>>>> <javascript:;>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Sijie,
> >>>>>>>>>>>
> >>>>>>>>>>> I attempted to create a wiki page under that space. I
> >> found
> >>>>> that
> >>>>>> I
> >>>>>>> am
> >>>>>>>>> not
> >>>>>>>>>>> authorized with edit permission.
> >>>>>>>>>>>
> >>>>>>>>>>> Can any of the committers grant me the wiki edit
> >>>> permission? My
> >>>>>>>> account
> >>>>>>>>>> is
> >>>>>>>>>>> "xi.liu.ant".
> >>>>>>>>>>>
> >>>>>>>>>>> - Xi
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Tue, Sep 13, 2016 at 9:26 AM, Sijie Guo <
> >>>> sijie@apache.org
> >>>>>>>> <javascript:;>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> This sounds interesting ... I will take a closer look and
> >>>> give
> >>>>>> my
> >>>>>>>>>> comments
> >>>>>>>>>>>> later.
> >>>>>>>>>>>>
> >>>>>>>>>>>> At the same time, do you mind creating a wiki page to put
> >>>> your
> >>>>>>> idea
> >>>>>>>>>> there?
> >>>>>>>>>>>> You can add your wiki page under
> >>>>>>>>>>>> https://cwiki.apache.org/confluence/display/DL/Project+
> >>>>>> Proposals
> >>>>>>>>>>>>
> >>>>>>>>>>>> You might need to ask in the dev list to grant the wiki
> >>>> edit
> >>>>>>>>> permissions
> >>>>>>>>>>>> to
> >>>>>>>>>>>> you once you have a wiki account.
> >>>>>>>>>>>>
> >>>>>>>>>>>> - Sijie
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Mon, Sep 12, 2016 at 2:20 AM, Xi Liu <
> >>>> xi.liu.ant@gmail.com
> >>>>>>>> <javascript:;>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Hello,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I asked the transaction support in distributedlog user
> >>>> group
> >>>>>> two
> >>>>>>>>>> months
> >>>>>>>>>>>>> ago. I want to raise this up again, as we are looking
> >> for
> >>>>>> using
> >>>>>>>>>>>>> distributedlog for building a transactional data
> >>>> service. It
> >>>>>> is
> >>>>>>> a
> >>>>>>>>>> major
> >>>>>>>>>>>>> feature that is missing in distributedlog. We have some
> >>>>> ideas
> >>>>>> to
> >>>>>>>> add
> >>>>>>>>>>>> this
> >>>>>>>>>>>>> to distributedlog and want to know if they make sense
> >> or
> >>>>> not.
> >>>>>> If
> >>>>>>>>> they
> >>>>>>>>>>>> are
> >>>>>>>>>>>>> good, we'd like to contribute and develop with the
> >>>>> community.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Here are the thoughts:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> -------------------------------------------------
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> From our understanding, DL can provide "at-least-once"
> >>>>>> delivery
> >>>>>>>>>> semantic
> >>>>>>>>>>>>> (if not, please correct me) but not "exactly-once"
> >>>> delivery
> >>>>>>>>> semantic.
> >>>>>>>>>>>> That
> >>>>>>>>>>>>> means that a message can be delivered one or more times
> >>>> if
> >>>>> the
> >>>>>>>>> reader
> >>>>>>>>>>>>> doesn't handle duplicates.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The duplicates come from two places, one is at writer
> >>>> side
> >>>>>> (this
> >>>>>>>>>> assumes
> >>>>>>>>>>>>> using write proxy not the core library), while the
> >> other
> >>>> one
> >>>>>> is
> >>>>>>> at
> >>>>>>>>>>>> reader
> >>>>>>>>>>>>> side.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> - writer side: if the client attempts to write a record
> >>>> to
> >>>>> the
> >>>>>>>> write
> >>>>>>>>>>>>> proxies and gets a network error (e.g timeouts) then
> >>>>> retries,
> >>>>>>> the
> >>>>>>>>>>>> retrying
> >>>>>>>>>>>>> will potentially result in duplicates.
> >>>>>>>>>>>>> - reader side:if the reader reads a message from a
> >> stream
> >>>>> and
> >>>>>>> then
> >>>>>>>>>>>> crashes,
> >>>>>>>>>>>>> when the reader restarts it would restart from last
> >> known
> >>>>>>> position
> >>>>>>>>>>>> (DLSN).
> >>>>>>>>>>>>> If the reader fails after processing a record and
> >> before
> >>>>>>> recording
> >>>>>>>>> the
> >>>>>>>>>>>>> position, the processed record will be delivered again.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The reader problem can be properly addressed by making
> >>>> use
> >>>>> of
> >>>>>>> the
> >>>>>>>>>>>> sequence
> >>>>>>>>>>>>> numbers of records and doing proper checkpointing. For
> >>>>>> example,
> >>>>>>> in
> >>>>>>>>>>>>> database, it can checkpoint the indexed data with the
> >>>>> sequence
> >>>>>>>>> number
> >>>>>>>>>> of
> >>>>>>>>>>>>> records; in flink, it can checkpoint the state with the
> >>>>>> sequence
> >>>>>>>>>>>> numbers.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The writer problem can be addressed by implementing an
> >>>>>>> idempotent
> >>>>>>>>>>>> writer.
> >>>>>>>>>>>>> However, an alternative and more powerful approach is
> >> to
> >>>>>> support
> >>>>>>>>>>>>> transactions.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> *What does transaction mean?*
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> A transaction means a collection of records can be
> >>>> written
> >>>>>>>>>>>> transactionally
> >>>>>>>>>>>>> within a stream or across multiple streams. They will
> >> be
> >>>>>>> consumed
> >>>>>>>> by
> >>>>>>>>>> the
> >>>>>>>>>>>>> reader together when a transaction is committed, or
> >> will
> >>>>> never
> >>>>>>> be
> >>>>>>>>>>>> consumed
> >>>>>>>>>>>>> by the reader when the transaction is aborted.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The transaction will expose following guarantees:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> - The reader should not be exposed to records written
> >>>> from
> >>>>>>>>> uncommitted
> >>>>>>>>>>>>> transactions (mandatory)
> >>>>>>>>>>>>> - The reader should consume the records in the
> >>>> transaction
> >>>>>>> commit
> >>>>>>>>>> order
> >>>>>>>>>>>>> rather than the record written order (mandatory)
> >>>>>>>>>>>>> - No duplicated records within a transaction
> >> (mandatory)
> >>>>>>>>>>>>> - Allow interleaving transactional writes and
> >>>>>> non-transactional
> >>>>>>>>> writes
> >>>>>>>>>>>>> (optional)
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> *Stream Transaction & Namespace Transaction*
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> There will be two types of transaction, one is Stream
> >>>> level
> >>>>>>>>>> transaction
> >>>>>>>>>>>>> (local transaction), while the other one is Namespace
> >>>> level
> >>>>>>>>>> transaction
> >>>>>>>>>>>>> (global transaction).
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The stream level transaction is a transactional
> >>>> operation on
> >>>>>>>> writing
> >>>>>>>>>>>>> records to one stream; the namespace level transaction
> >>>> is a
> >>>>>>>>>>>> transactional
> >>>>>>>>>>>>> operation on writing records to multiple streams.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> *Implementation Thoughts*
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> - A transaction is consist of begin control record, a
> >>>> series
> >>>>>> of
> >>>>>>>> data
> >>>>>>>>>>>>> records and commit/abort control record.
> >>>>>>>>>>>>> - The begin/commit/abort control record is written to a
> >>>>>> `commit`
> >>>>>>>> log
> >>>>>>>>>>>>> stream, while the data records will be written to
> >> normal
> >>>>> data
> >>>>>>> log
> >>>>>>>>>>>> streams.
> >>>>>>>>>>>>> - The `commit` log stream will be the same log stream
> >> for
> >>>>>>>>> stream-level
> >>>>>>>>>>>>> transaction,  while it will be a *system* stream (or
> >>>>> multiple
> >>>>>>>> system
> >>>>>>>>>>>>> streams) for namespace-level transactions.
> >>>>>>>>>>>>> - The transaction code looks like as below:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> <code>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Transaction txn = client.transaction();
> >>>>>>>>>>>>> Future<DLSN> result1 = txn.write(stream-0, record);
> >>>>>>>>>>>>> Future<DLSN> result2 = txn.write(stream-1, record);
> >>>>>>>>>>>>> Future<DLSN> result3 = txn.write(stream-2, record);
> >>>>>>>>>>>>> Future<Pair<DLSN, DLSN>> result = txn.commit();
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> </code>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> if the txn is committed, all the write futures will be
> >>>>>> satisfied
> >>>>>>>>> with
> >>>>>>>>>>>> their
> >>>>>>>>>>>>> written DLSNs. if the txn is aborted, all the write
> >>>> futures
> >>>>>> will
> >>>>>>>> be
> >>>>>>>>>>>> failed
> >>>>>>>>>>>>> together. there is no partial failure state.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> - The actually data flow will be:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 1. writer get a transaction id from the owner of the
> >>>>> `commit'
> >>>>>>> log
> >>>>>>>>>> stream
> >>>>>>>>>>>>> 1. write the begin control record (synchronously) with
> >>>> the
> >>>>>>>>> transaction
> >>>>>>>>>>>> id
> >>>>>>>>>>>>> 2. for each write within the same txn, it will be
> >>>> assigned a
> >>>>>>> local
> >>>>>>>>>>>> sequence
> >>>>>>>>>>>>> number starting from 0. the combination of transaction
> >> id
> >>>>> and
> >>>>>>>> local
> >>>>>>>>>>>>> sequence number will be used later on by the readers to
> >>>>>>>> de-duplicate
> >>>>>>>>>>>>> records.
> >>>>>>>>>>>>> 3. the commit/abort control record will be written
> >> based
> >>>> on
> >>>>>> the
> >>>>>>>>>> results
> >>>>>>>>>>>>> from 2.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> - Application can supply a timeout for the transaction
> >>>> when
> >>>>>>>>> #begin() a
> >>>>>>>>>>>>> transaction. The owner of the `commit` log stream can
> >>>> abort
> >>>>>>>>>> transactions
> >>>>>>>>>>>>> that never be committed/aborted within their timeout.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> - Failures:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> * all the log records can be simply retried as they
> >> will
> >>>> be
> >>>>>>>>>>>> de-duplicated
> >>>>>>>>>>>>> probably at the reader side.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> - Reader:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> * Reader can be configured to read uncommitted records
> >> or
> >>>>>>>> committed
> >>>>>>>>>>>> records
> >>>>>>>>>>>>> only (by default read uncommitted records)
> >>>>>>>>>>>>> * If reader is configured to read committed records
> >> only,
> >>>>> the
> >>>>>>> read
> >>>>>>>>>> ahead
> >>>>>>>>>>>>> cache will be changed to maintain one additional
> >> pending
> >>>>>>> committed
> >>>>>>>>>>>> records.
> >>>>>>>>>>>>> the pending committed records map is bounded and
> >> records
> >>>>> will
> >>>>>> be
> >>>>>>>>>> dropped
> >>>>>>>>>>>>> when read ahead is moving.
> >>>>>>>>>>>>> * when the reader hits a commit record, it will rewind
> >> to
> >>>>> the
> >>>>>>>> begin
> >>>>>>>>>>>> record
> >>>>>>>>>>>>> and start reading from there. leveraging the proper
> >> read
> >>>>> ahead
> >>>>>>>> cache
> >>>>>>>>>> and
> >>>>>>>>>>>>> pending commit records cache, it would be good for both
> >>>>> short
> >>>>>>>>>>>> transactions
> >>>>>>>>>>>>> and long transactions.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> - DLSN, SequenceId:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> * We will add a fourth field to DLSN. It is `local
> >>>> sequence
> >>>>>>>> number`
> >>>>>>>>>>>> within
> >>>>>>>>>>>>> a transaction session. So the new DLSN of records in a
> >>>>>>> transaction
> >>>>>>>>>> will
> >>>>>>>>>>>> be
> >>>>>>>>>>>>> the DLSN of commit control record plus its local
> >> sequence
> >>>>>>> number.
> >>>>>>>>>>>>> * The sequence id will be still the position of the
> >>>> commit
> >>>>>>> record
> >>>>>>>>> plus
> >>>>>>>>>>>> its
> >>>>>>>>>>>>> local sequence number. The position will be advanced
> >> with
> >>>>>> total
> >>>>>>>>> number
> >>>>>>>>>>>> of
> >>>>>>>>>>>>> written records on writing the commit control record.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> - Transaction Group & Namespace Transaction
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> using one single log stream for namespace transaction
> >> can
> >>>>>> cause
> >>>>>>>> the
> >>>>>>>>>>>>> bottleneck problem since all the begin/commit/end
> >> control
> >>>>>>> records
> >>>>>>>>> will
> >>>>>>>>>>>> have
> >>>>>>>>>>>>> to go through one log stream.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> the idea of 'transaction group' is to allow
> >> partitioning
> >>>> the
> >>>>>>>> writers
> >>>>>>>>>>>> into
> >>>>>>>>>>>>> different transaction groups.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> clients can specify the `group-name` when starting the
> >>>>>>>> transaction.
> >>>>>>>>> if
> >>>>>>>>>>>>> there is no `group-name` specified, it will use the
> >>>> default
> >>>>>>>> `commit`
> >>>>>>>>>>>> log in
> >>>>>>>>>>>>> the namespace for creating transactions.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> -------------------------------------------------
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I'd like to collect feedbacks on this idea. Appreciate
> >>>> any
> >>>>>>>> comments
> >>>>>>>>>> and
> >>>>>>>>>>>> if
> >>>>>>>>>>>>> anyone is also interested in this idea, we'd like to
> >>>>>> collaborate
> >>>>>>>>> with
> >>>>>>>>>>>> the
> >>>>>>>>>>>>> community.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> - Xi
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>>
> >>
>
>

Re: [Discuss] Transaction Support

Posted by Asko Kauppi <as...@zalando.fi>.
> Beside that, I have one general question - What is the major goal for this
> feature? Are you targeting on building a general XA transaction coordinator
> or just for supporting things like `copy-modify-write' style workflow?

The use case I would have for transactions - at some level of the stack - is supporting dynamic configurations.

If a config changes in e.g. three lines, some of the changes may logically belong together. E.g. changing both “host” and “port” (if separate entries). One shouldn’t be able to read a state, even temporarily, that has new host but old port.

I can do this in the application level - it does not need to be part of the DL protocol.


Asko Kauppi
Zalando Tech Helsinki

> On 4 Jan 2017, at 9.18, Sijie Guo <si...@apache.org> wrote:
> 
> Sorry for late response. I think Leigh and you already had some very
> valuable discussions in the doc. I will try to add some of my questions to
> the discussion.
> 
> Beside that, I had a discussion with Leigh today about this. first of all,
> I think it is very good to add transaction support in distributedlog. It is
> one of the primitives that would help building distributed service. But we
> have a concern about making this system become complicated and introduce
> operational overhead when it runs in the large scale system on production.
> There are two major suggestions that I have for this feature -
> 
> Build the 'minimum' logic in core - I think the minimum logic that need to
> be added to the core is -  the special control records (begin, commit and
> abort) and make the reader be able to detect those special control records
> and know what do they mean and how to interrupt with them. Since they are
> special control records, there is not overhead to other readers that
> doesn't require this feature.
> 
> Build the transaction coordinator as a separated proxy service  - I think
> the major concern that we have is putting more complexities into the 'write
> proxy' service. We architected distributedlog in a more microservice-like
> way - we have the core as the stream store, the proxy for serving write and
> read traffic. It would be good that the transaction feature can be done in
> a similar way. So the architecture would be like this -
> 
> *[ write service ] [ read service ] [ transaction coordinator ]*
> *[ stream store
>            ]*
> 
> if people doesn't need the transaction feature, they can turn if off
> completely without any operational overhead.
> 
> Beside that, I have one general question - What is the major goal for this
> feature? Are you targeting on building a general XA transaction coordinator
> or just for supporting things like `copy-modify-write' style workflow?
> 
> 
> Thanks,
> Sijie
> 
> 
> 
> 
> 
> On Wed, Dec 28, 2016 at 1:12 PM, Xi Liu <xi...@gmail.com> wrote:
> 
>> Ping?
>> 
>> On Mon, Dec 19, 2016 at 8:28 AM, Xi Liu <xi...@gmail.com> wrote:
>> 
>>> Sijie,
>>> 
>>> No. I thought it might be easier for people to comment on a google doc to
>>> gather the initial feedback. I will put the content back to wiki page
>> once
>>> addressing the comments. Does that sound good to you?
>>> 
>>> And thank you in advance.
>>> 
>>> - Xi
>>> 
>>> 
>>> 
>>> On Sun, Dec 18, 2016 at 8:48 AM, Sijie Guo <si...@apache.org> wrote:
>>> 
>>>> Hi Xi,
>>>> 
>>>> sorry for late response. I will review it soon.
>>>> 
>>>> regarding this, a separate question "are we going to use google doc
>>>> instead
>>>> of email thread for any discussion"? I am a bit worried that the
>>>> discussion
>>>> will become lost after moving to google doc. No idea on how other apache
>>>> projects are doing.
>>>> 
>>>> - Sijie
>>>> 
>>>> On Wed, Dec 14, 2016 at 11:41 PM, Xi Liu <xi...@gmail.com> wrote:
>>>> 
>>>>> Hi all,
>>>>> 
>>>>> I finalized the first version of the design. This time I used a google
>>>> doc
>>>>> so that it is easier for commenting and add a link the wiki page. I
>> will
>>>>> update this to the wiki page once we come to the finalized design.
>>>>> 
>>>>> https://docs.google.com/document/d/14Ns05M8Z5a6DF6fHmWQwISyD5jjeK
>>>>> bSIGgSzXuTI5BA/edit
>>>>> 
>>>>> Let me know if you have any questions. Appreciate your reviews!
>>>>> 
>>>>> - Xi
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On Fri, Oct 28, 2016 at 7:58 AM, Leigh Stewart
>>>>> <lstewart@twitter.com.invalid
>>>>>> wrote:
>>>>> 
>>>>>> Interesting proposal. A couple quick notes while you continue to
>> flesh
>>>>> this
>>>>>> out.
>>>>>> 
>>>>>> a. just to be sure - does this eliminate the need to save seqno with
>>>>>> checkpoint?
>>>>>> 
>>>>>> b. i.e. another way to describe this kind of improvement is "support
>>>>>> records (atomic writes) larger than 1MB", iiuc. the advantage being
>> it
>>>>>> avoids the baggage of transactions. disadvantages include inability
>>>> to do
>>>>>> cross stream transactions, and flexibility (interleaving, etc) (are
>>>> there
>>>>>> others?).
>>>>>> 
>>>>>> c. proxy use case is for supporting multiple writers - have you
>>>> thought
>>>>>> about how this would work with multiple writers?
>>>>>> 
>>>>>> Thanks!
>>>>>> 
>>>>>> 
>>>>>> On Tue, Oct 18, 2016 at 6:45 PM, Sijie Guo
>> <sijieg@twitter.com.invalid
>>>>> 
>>>>>> wrote:
>>>>>> 
>>>>>>> Sound good to me. look forward to the detailed proposal.
>>>>>>> 
>>>>>>> (I don't mind the format if it makes things easier to you)
>>>>>>> 
>>>>>>> Sijie
>>>>>>> 
>>>>>>> On Friday, October 14, 2016, Xi Liu <xi...@gmail.com> wrote:
>>>>>>> 
>>>>>>>> Thank you, Sijie
>>>>>>>> 
>>>>>>>> We have some internal discussions to sort out some details. We
>> are
>>>>>> ready
>>>>>>> to
>>>>>>>> collaborate with the community for adding the transaction
>> support
>>>> in
>>>>>> DL.
>>>>>>>> We'd like to share more.
>>>>>>>> 
>>>>>>>> I created a proposal wiki here -
>>>>>>>> https://cwiki.apache.org/confluence/display/DL/DP-1+-+
>>>>>>>> DistributedLog+Transaction+Support
>>>>>>>> 
>>>>>>>> (I followed KIP format and named it as DP (DistributedLog
>>>> Proposal -
>>>>> DP
>>>>>>> is
>>>>>>>> also short for Dynamic Programming). I don't know if you guys
>> like
>>>>> this
>>>>>>>> name or not. Feel free to change it :D)
>>>>>>>> 
>>>>>>>> I basically put my initial email as the content there so far.
>>>> Once we
>>>>>>>> finished our final discussion, I will update with more details.
>> At
>>>>> the
>>>>>>> same
>>>>>>>> time, any comments are welcome.
>>>>>>>> 
>>>>>>>> - Xi
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Sat, Oct 8, 2016 at 6:58 AM, Sijie Guo <sijie@apache.org
>>>>>>> <javascript:;>>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Xi,
>>>>>>>>> 
>>>>>>>>> I just granted you the edit permission.
>>>>>>>>> 
>>>>>>>>> - Sijie
>>>>>>>>> 
>>>>>>>>> On Fri, Oct 7, 2016 at 10:34 AM, Xi Liu <xi.liu.ant@gmail.com
>>>>>>>> <javascript:;>> wrote:
>>>>>>>>> 
>>>>>>>>>> I still can not edit the wiki. Can any of the pmc members
>>>> grant
>>>>> me
>>>>>>> the
>>>>>>>>>> permissions?
>>>>>>>>>> 
>>>>>>>>>> - Xi
>>>>>>>>>> 
>>>>>>>>>> On Sat, Sep 17, 2016 at 10:35 PM, Xi Liu <
>>>> xi.liu.ant@gmail.com
>>>>>>>> <javascript:;>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Sijie,
>>>>>>>>>>> 
>>>>>>>>>>> I attempted to create a wiki page under that space. I
>> found
>>>>> that
>>>>>> I
>>>>>>> am
>>>>>>>>> not
>>>>>>>>>>> authorized with edit permission.
>>>>>>>>>>> 
>>>>>>>>>>> Can any of the committers grant me the wiki edit
>>>> permission? My
>>>>>>>> account
>>>>>>>>>> is
>>>>>>>>>>> "xi.liu.ant".
>>>>>>>>>>> 
>>>>>>>>>>> - Xi
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Tue, Sep 13, 2016 at 9:26 AM, Sijie Guo <
>>>> sijie@apache.org
>>>>>>>> <javascript:;>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> This sounds interesting ... I will take a closer look and
>>>> give
>>>>>> my
>>>>>>>>>> comments
>>>>>>>>>>>> later.
>>>>>>>>>>>> 
>>>>>>>>>>>> At the same time, do you mind creating a wiki page to put
>>>> your
>>>>>>> idea
>>>>>>>>>> there?
>>>>>>>>>>>> You can add your wiki page under
>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/DL/Project+
>>>>>> Proposals
>>>>>>>>>>>> 
>>>>>>>>>>>> You might need to ask in the dev list to grant the wiki
>>>> edit
>>>>>>>>> permissions
>>>>>>>>>>>> to
>>>>>>>>>>>> you once you have a wiki account.
>>>>>>>>>>>> 
>>>>>>>>>>>> - Sijie
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Mon, Sep 12, 2016 at 2:20 AM, Xi Liu <
>>>> xi.liu.ant@gmail.com
>>>>>>>> <javascript:;>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I asked the transaction support in distributedlog user
>>>> group
>>>>>> two
>>>>>>>>>> months
>>>>>>>>>>>>> ago. I want to raise this up again, as we are looking
>> for
>>>>>> using
>>>>>>>>>>>>> distributedlog for building a transactional data
>>>> service. It
>>>>>> is
>>>>>>> a
>>>>>>>>>> major
>>>>>>>>>>>>> feature that is missing in distributedlog. We have some
>>>>> ideas
>>>>>> to
>>>>>>>> add
>>>>>>>>>>>> this
>>>>>>>>>>>>> to distributedlog and want to know if they make sense
>> or
>>>>> not.
>>>>>> If
>>>>>>>>> they
>>>>>>>>>>>> are
>>>>>>>>>>>>> good, we'd like to contribute and develop with the
>>>>> community.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Here are the thoughts:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> -------------------------------------------------
>>>>>>>>>>>>> 
>>>>>>>>>>>>> From our understanding, DL can provide "at-least-once"
>>>>>> delivery
>>>>>>>>>> semantic
>>>>>>>>>>>>> (if not, please correct me) but not "exactly-once"
>>>> delivery
>>>>>>>>> semantic.
>>>>>>>>>>>> That
>>>>>>>>>>>>> means that a message can be delivered one or more times
>>>> if
>>>>> the
>>>>>>>>> reader
>>>>>>>>>>>>> doesn't handle duplicates.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The duplicates come from two places, one is at writer
>>>> side
>>>>>> (this
>>>>>>>>>> assumes
>>>>>>>>>>>>> using write proxy not the core library), while the
>> other
>>>> one
>>>>>> is
>>>>>>> at
>>>>>>>>>>>> reader
>>>>>>>>>>>>> side.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> - writer side: if the client attempts to write a record
>>>> to
>>>>> the
>>>>>>>> write
>>>>>>>>>>>>> proxies and gets a network error (e.g timeouts) then
>>>>> retries,
>>>>>>> the
>>>>>>>>>>>> retrying
>>>>>>>>>>>>> will potentially result in duplicates.
>>>>>>>>>>>>> - reader side:if the reader reads a message from a
>> stream
>>>>> and
>>>>>>> then
>>>>>>>>>>>> crashes,
>>>>>>>>>>>>> when the reader restarts it would restart from last
>> known
>>>>>>> position
>>>>>>>>>>>> (DLSN).
>>>>>>>>>>>>> If the reader fails after processing a record and
>> before
>>>>>>> recording
>>>>>>>>> the
>>>>>>>>>>>>> position, the processed record will be delivered again.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The reader problem can be properly addressed by making
>>>> use
>>>>> of
>>>>>>> the
>>>>>>>>>>>> sequence
>>>>>>>>>>>>> numbers of records and doing proper checkpointing. For
>>>>>> example,
>>>>>>> in
>>>>>>>>>>>>> database, it can checkpoint the indexed data with the
>>>>> sequence
>>>>>>>>> number
>>>>>>>>>> of
>>>>>>>>>>>>> records; in flink, it can checkpoint the state with the
>>>>>> sequence
>>>>>>>>>>>> numbers.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The writer problem can be addressed by implementing an
>>>>>>> idempotent
>>>>>>>>>>>> writer.
>>>>>>>>>>>>> However, an alternative and more powerful approach is
>> to
>>>>>> support
>>>>>>>>>>>>> transactions.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> *What does transaction mean?*
>>>>>>>>>>>>> 
>>>>>>>>>>>>> A transaction means a collection of records can be
>>>> written
>>>>>>>>>>>> transactionally
>>>>>>>>>>>>> within a stream or across multiple streams. They will
>> be
>>>>>>> consumed
>>>>>>>> by
>>>>>>>>>> the
>>>>>>>>>>>>> reader together when a transaction is committed, or
>> will
>>>>> never
>>>>>>> be
>>>>>>>>>>>> consumed
>>>>>>>>>>>>> by the reader when the transaction is aborted.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The transaction will expose following guarantees:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> - The reader should not be exposed to records written
>>>> from
>>>>>>>>> uncommitted
>>>>>>>>>>>>> transactions (mandatory)
>>>>>>>>>>>>> - The reader should consume the records in the
>>>> transaction
>>>>>>> commit
>>>>>>>>>> order
>>>>>>>>>>>>> rather than the record written order (mandatory)
>>>>>>>>>>>>> - No duplicated records within a transaction
>> (mandatory)
>>>>>>>>>>>>> - Allow interleaving transactional writes and
>>>>>> non-transactional
>>>>>>>>> writes
>>>>>>>>>>>>> (optional)
>>>>>>>>>>>>> 
>>>>>>>>>>>>> *Stream Transaction & Namespace Transaction*
>>>>>>>>>>>>> 
>>>>>>>>>>>>> There will be two types of transaction, one is Stream
>>>> level
>>>>>>>>>> transaction
>>>>>>>>>>>>> (local transaction), while the other one is Namespace
>>>> level
>>>>>>>>>> transaction
>>>>>>>>>>>>> (global transaction).
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The stream level transaction is a transactional
>>>> operation on
>>>>>>>> writing
>>>>>>>>>>>>> records to one stream; the namespace level transaction
>>>> is a
>>>>>>>>>>>> transactional
>>>>>>>>>>>>> operation on writing records to multiple streams.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> *Implementation Thoughts*
>>>>>>>>>>>>> 
>>>>>>>>>>>>> - A transaction is consist of begin control record, a
>>>> series
>>>>>> of
>>>>>>>> data
>>>>>>>>>>>>> records and commit/abort control record.
>>>>>>>>>>>>> - The begin/commit/abort control record is written to a
>>>>>> `commit`
>>>>>>>> log
>>>>>>>>>>>>> stream, while the data records will be written to
>> normal
>>>>> data
>>>>>>> log
>>>>>>>>>>>> streams.
>>>>>>>>>>>>> - The `commit` log stream will be the same log stream
>> for
>>>>>>>>> stream-level
>>>>>>>>>>>>> transaction,  while it will be a *system* stream (or
>>>>> multiple
>>>>>>>> system
>>>>>>>>>>>>> streams) for namespace-level transactions.
>>>>>>>>>>>>> - The transaction code looks like as below:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> <code>
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Transaction txn = client.transaction();
>>>>>>>>>>>>> Future<DLSN> result1 = txn.write(stream-0, record);
>>>>>>>>>>>>> Future<DLSN> result2 = txn.write(stream-1, record);
>>>>>>>>>>>>> Future<DLSN> result3 = txn.write(stream-2, record);
>>>>>>>>>>>>> Future<Pair<DLSN, DLSN>> result = txn.commit();
>>>>>>>>>>>>> 
>>>>>>>>>>>>> </code>
>>>>>>>>>>>>> 
>>>>>>>>>>>>> if the txn is committed, all the write futures will be
>>>>>> satisfied
>>>>>>>>> with
>>>>>>>>>>>> their
>>>>>>>>>>>>> written DLSNs. if the txn is aborted, all the write
>>>> futures
>>>>>> will
>>>>>>>> be
>>>>>>>>>>>> failed
>>>>>>>>>>>>> together. there is no partial failure state.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> - The actually data flow will be:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 1. writer get a transaction id from the owner of the
>>>>> `commit'
>>>>>>> log
>>>>>>>>>> stream
>>>>>>>>>>>>> 1. write the begin control record (synchronously) with
>>>> the
>>>>>>>>> transaction
>>>>>>>>>>>> id
>>>>>>>>>>>>> 2. for each write within the same txn, it will be
>>>> assigned a
>>>>>>> local
>>>>>>>>>>>> sequence
>>>>>>>>>>>>> number starting from 0. the combination of transaction
>> id
>>>>> and
>>>>>>>> local
>>>>>>>>>>>>> sequence number will be used later on by the readers to
>>>>>>>> de-duplicate
>>>>>>>>>>>>> records.
>>>>>>>>>>>>> 3. the commit/abort control record will be written
>> based
>>>> on
>>>>>> the
>>>>>>>>>> results
>>>>>>>>>>>>> from 2.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> - Application can supply a timeout for the transaction
>>>> when
>>>>>>>>> #begin() a
>>>>>>>>>>>>> transaction. The owner of the `commit` log stream can
>>>> abort
>>>>>>>>>> transactions
>>>>>>>>>>>>> that never be committed/aborted within their timeout.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> - Failures:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> * all the log records can be simply retried as they
>> will
>>>> be
>>>>>>>>>>>> de-duplicated
>>>>>>>>>>>>> probably at the reader side.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> - Reader:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> * Reader can be configured to read uncommitted records
>> or
>>>>>>>> committed
>>>>>>>>>>>> records
>>>>>>>>>>>>> only (by default read uncommitted records)
>>>>>>>>>>>>> * If reader is configured to read committed records
>> only,
>>>>> the
>>>>>>> read
>>>>>>>>>> ahead
>>>>>>>>>>>>> cache will be changed to maintain one additional
>> pending
>>>>>>> committed
>>>>>>>>>>>> records.
>>>>>>>>>>>>> the pending committed records map is bounded and
>> records
>>>>> will
>>>>>> be
>>>>>>>>>> dropped
>>>>>>>>>>>>> when read ahead is moving.
>>>>>>>>>>>>> * when the reader hits a commit record, it will rewind
>> to
>>>>> the
>>>>>>>> begin
>>>>>>>>>>>> record
>>>>>>>>>>>>> and start reading from there. leveraging the proper
>> read
>>>>> ahead
>>>>>>>> cache
>>>>>>>>>> and
>>>>>>>>>>>>> pending commit records cache, it would be good for both
>>>>> short
>>>>>>>>>>>> transactions
>>>>>>>>>>>>> and long transactions.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> - DLSN, SequenceId:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> * We will add a fourth field to DLSN. It is `local
>>>> sequence
>>>>>>>> number`
>>>>>>>>>>>> within
>>>>>>>>>>>>> a transaction session. So the new DLSN of records in a
>>>>>>> transaction
>>>>>>>>>> will
>>>>>>>>>>>> be
>>>>>>>>>>>>> the DLSN of commit control record plus its local
>> sequence
>>>>>>> number.
>>>>>>>>>>>>> * The sequence id will be still the position of the
>>>> commit
>>>>>>> record
>>>>>>>>> plus
>>>>>>>>>>>> its
>>>>>>>>>>>>> local sequence number. The position will be advanced
>> with
>>>>>> total
>>>>>>>>> number
>>>>>>>>>>>> of
>>>>>>>>>>>>> written records on writing the commit control record.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> - Transaction Group & Namespace Transaction
>>>>>>>>>>>>> 
>>>>>>>>>>>>> using one single log stream for namespace transaction
>> can
>>>>>> cause
>>>>>>>> the
>>>>>>>>>>>>> bottleneck problem since all the begin/commit/end
>> control
>>>>>>> records
>>>>>>>>> will
>>>>>>>>>>>> have
>>>>>>>>>>>>> to go through one log stream.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> the idea of 'transaction group' is to allow
>> partitioning
>>>> the
>>>>>>>> writers
>>>>>>>>>>>> into
>>>>>>>>>>>>> different transaction groups.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> clients can specify the `group-name` when starting the
>>>>>>>> transaction.
>>>>>>>>> if
>>>>>>>>>>>>> there is no `group-name` specified, it will use the
>>>> default
>>>>>>>> `commit`
>>>>>>>>>>>> log in
>>>>>>>>>>>>> the namespace for creating transactions.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> -------------------------------------------------
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I'd like to collect feedbacks on this idea. Appreciate
>>>> any
>>>>>>>> comments
>>>>>>>>>> and
>>>>>>>>>>>> if
>>>>>>>>>>>>> anyone is also interested in this idea, we'd like to
>>>>>> collaborate
>>>>>>>>> with
>>>>>>>>>>>> the
>>>>>>>>>>>>> community.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> - Xi
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>>> 
>>