Posted to distributedlog-dev@bookkeeper.apache.org by Jia Zhai <zh...@gmail.com> on 2017/09/01 02:22:31 UTC

Re: Relax durability

I second that!  Thanks, Enrico, for starting and driving this productive
discussion, and thanks to Enrico, JV, Sijie and everyone else for
reaching this consensus.
Looking forward to the design.

On Fri, Sep 1, 2017 at 12:53 AM, Venkateswara Rao Jujjuri <jujjuri@gmail.com
> wrote:

> Hi all,
>
> It has been a great and lively discussion. I can say this is one of the
> most actively discussed topics in the BK community recently.
> Kudos to Enrico for starting this.
>
> Enrico, Sijie and I met and discussed this further and came up with the
> following consensus on how to move forward.
>
> * Introduce LedgerType/LedgerProperties, which goes into the ZK metadata.
> * No changes to the AddEntry API (application view); but the AddEntry RPC
> will add a flag to inform bookies about the type/durability.
> * Introduce a sync() RPC which needs to be called explicitly on RD ledgers.
> * No changes to the LAC and how we update it.
> * No changes to the behavior of the readEntries() API, which reads only
> up to the LAC.
> * Applications can use the readUnconfirmed API to read up to the last
> add pushed.
> * Segregate stats based on the ledger type.
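
The consensus above can be sketched as a toy in-memory model. All names
here (sync(), readUnconfirmed(), the LAC handling) are illustrative
assumptions derived from this thread, not the real BookKeeper API:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a relaxed-durability (RD) ledger as described in the
// consensus: adds complete without advancing the LAC, an explicit
// sync() advances it, readers stop at the LAC unless they use
// readUnconfirmed(). Hypothetical names, for illustration only.
public class RelaxedLedgerModel {
    private final List<String> entries = new ArrayList<>();
    private long lac = -1; // LastAddConfirmed: highest durably synced id

    // An add completes once replicated; it does NOT advance the LAC.
    public long addEntry(String data) {
        entries.add(data);
        return entries.size() - 1;
    }

    // The explicit sync() RPC durably persists everything added so far
    // and lets the writer advance the LAC.
    public void sync() {
        lac = entries.size() - 1;
    }

    // Regular reads stop at the LAC, exactly as today.
    public List<String> readEntries() {
        return new ArrayList<>(entries.subList(0, (int) (lac + 1)));
    }

    // readUnconfirmed may go past the LAC, up to the last add pushed.
    public List<String> readUnconfirmed() {
        return new ArrayList<>(entries);
    }

    public long getLastAddConfirmed() {
        return lac;
    }
}
```

The key property the model shows: between adds and the sync() call,
entries are replicated and readable via readUnconfirmed(), but invisible
to ordinary readers because the LAC has not moved.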
>
>
> Enrico is going to merge the two docs and publish a detailed design.
> Thanks a lot, Enrico.
>
>
> On Mon, Aug 21, 2017 at 10:01 PM, Sijie Guo <gu...@gmail.com> wrote:
>
> > On Aug 21, 2017 5:44 AM, "Enrico Olivelli" <eo...@gmail.com> wrote:
> >
> > As the issue is really huge, I need to narrow the design and
> > implementation efforts to a specific case at the moment: I am
> > interested in having a per-ledger flag to not require fsync on
> > journal entries.
> >
> >
> > It is good to narrow down the implementation. However, because there
> > are different requirements from different people, it would be good to
> > discuss and cover all of them.
> >
> >
> > If the "no-sync" flag is applied per ledger, then we have to decide
> > what to do about the LAC protocol. I see two opposite approaches:
> > 1) the LAC is never advanced (no fsync is guaranteed on the journal)
> > 2) the LAC is advanced as usual, but it will be possible to have
> > missing entries
> >
> >
> > Personally I am -1 on approach 2), for the reasons I stated in
> > previous emails.
> >
> >
> > There is a "gray" option:
> > 3) since entries will be interleaved on the journal with entries from
> > other "sync" ledgers, it would be possible to detect that some
> > entries have effectively been synced and return that info to the
> > writing client, which in turn would be able to advance the LAC.
> > This option is not useful, as the behavior is unpredictable.
> >
> > For my "urgent" use case I would prefer 2), but 1) is possible too,
> > because I am using LedgerHandleAdv (I manually allocate entry ids) +
> > readUnconfirmedEntries (which allows reading entries even if the LAC
> > has not advanced).
> >
> >
> > As JV suggested, please start the design doc and let's iterate over it
> > before the implementation.
> >
> >
> > -- Enrico
> >
> >
> > 2017-08-19 14:09 GMT+02:00 Enrico Olivelli <eo...@gmail.com>:
> >
> > >
> > >
> > > On ven 18 ago 2017, 20:12 Sijie Guo <gu...@gmail.com> wrote:
> > >
> > >> /cc (distributedlog-dev@)
> > >>
> > >> I know JV has similar use cases. This might require a broad
> > >> discussion. The trickiest part would be the LAC protocol - when can
> > >> the client advance the LAC? I think a BP, initially with a Google
> > >> doc shared to the community, would be a good way to start the
> > >> discussion, because I expect a lot of points to discuss on this
> > >> topic. Once we finalize the details, we can copy the Google doc
> > >> content back to the wiki page.
> > >>
> > >
> > > Thank you Sijie and JV for pointing me in the right direction.
> > > I had underestimated the problems related to ensemble changes;
> > > also, in my projects it can happen that a single 'transaction'
> > > spans more than one ledger, so the ordering issues are more complex
> > > than I expected. Even if it were possible to keep ordering within
> > > the scope of a single ledger, it is very hard to achieve across
> > > multiple ledgers.
> > >
> > > Next week I will write the doc, but I think I am going to split the
> > > problem into multiple parts.
> > > I see that the LAC must be advanced only when an fsync has been
> > > done. This will preserve correctness, as Sijie said.
> > >
> > > I think that the problems related to the ordering of events must be
> > > addressed at the application level, and it would be best to have
> > > such support in DL.
> > >
> > > For instance, at first glance I imagine that we should add some
> > > support in BK to let the writer application receive notifications
> > > of LAC changes more easily.
> > >
> > > The first step would be to add a new flag to addEntry to receive
> > > the acknowledgement on write and flush (with the needed changes to
> > > the journal), add a flag to the AddResponse which tells whether the
> > > entry has been synced or only flushed, and handle the LAC according
> > > to this information.
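
The writer-side bookkeeping this proposal implies could look roughly like
the toy tracker below. WriterLacTracker, the synced flag, and the sync
frontier rule are illustrative assumptions, not the actual BookKeeper
wire protocol; the frontier rule encodes the journal's causality (a sync
covering entry j also covers everything written before j):

```java
import java.util.BitSet;

// Toy writer-side LAC tracker: AddResponses carry a hypothetical
// 'synced' flag; the LAC advances over the contiguous acknowledged
// prefix, but never past the highest synced entry (the sync frontier).
public class WriterLacTracker {
    private final BitSet acked = new BitSet();
    private long syncFrontier = -1; // everything <= this is on disk
    private long lac = -1;

    // Called when an AddResponse arrives for entryId.
    public void onAddResponse(long entryId, boolean synced) {
        acked.set((int) entryId);
        if (synced && entryId > syncFrontier) {
            // a synced entry implies all earlier journal writes synced too
            syncFrontier = entryId;
        }
        // advance the LAC over the contiguous acked prefix,
        // capped by the sync frontier
        while (acked.get((int) (lac + 1)) && lac + 1 <= syncFrontier) {
            lac++;
        }
    }

    public long getLac() {
        return lac;
    }
}
```

With this rule, a burst of flushed-only responses leaves the LAC parked,
and a single synced response then advances it over the whole burst.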
> > >
> > > Other comments inline
> > > Enrico
> > >
> > >
> > >
> > >
> > >
> > >> Other comments inline:
> > >>
> > >>
> > >> On Thu, Aug 17, 2017 at 4:42 AM, Enrico Olivelli <eolivelli@gmail.com
> >
> > >> wrote:
> > >>
> > >> > Hi,
> > >> > I am working with my colleagues on an implementation to relax
> > >> > the constraint that every acknowledged entry must have been
> > >> > successfully written and fsynced to disk at the journal level.
> > >> >
> > >> > The idea is to have a flag in addEntry to ask for the
> > >> > acknowledgement not after the fsync in the journal but as soon
> > >> > as the data has been successfully written and flushed to the OS.
> > >> >
> > >> > I have the requirement that if an entry requires sync, all the
> > >> > entries successfully sent 'before' that entry (causality) are
> > >> > synced too, even if they have been added with the new relaxed
> > >> > durability flag.
> > >>
> > >>
> > >> > Imagine a database transaction log: during a transaction I
> > >> > write every data change to the WAL with the new flag, and only
> > >> > the commit transaction command is added with the sync
> > >> > requirement. The idea is that all the changes inside the scope
> > >> > of the transaction have meaning only if the transaction is
> > >> > committed, so it is important that the commit entry is not lost;
> > >> > and if that entry is not lost, none of the other entries of the
> > >> > same transaction are lost either.
> > >> >
> > >>
> > >> can you do:
> > >>
> > >> - lh.asyncAddEntry('entry-1')
> > >> - lh.asyncAddEntry('entry-2')
> > >> - lh.addEntry('commit')
> > >>
> > >> ?
> > >>
> > >
> > > Yes, currently it is the best we can do, and that is what I am doing.
> > >
> > >
> > >> Does this work for you? If it doesn't, what is the problem? Do you
> > >> have any performance numbers to support why this doesn't work?
> > >>
> > >
> > > I do not have numbers for this case; in general, limiting the
> > > number of fsyncs could bring better performance.
> > > It is hard to tune the grouping settings in the journal.
> > >
> > >
> > >>
> > >> >
> > >> > I have another use case. In another project I am storing binary
> > >> > objects in BK, and I have to obtain good performance even on
> > >> > single-disk bookie layouts (journal + data + index on the same
> > >> > partition). In this project it is acceptable to compensate for
> > >> > the risk of not doing fsync by requesting enough replication.
> > >> > IMHO it would be somewhat like the Kafka idea of durability: as
> > >> > far as I know, Kafka by default does not impose fsync; it leaves
> > >> > that to the OS and relies on a configurable minimum number of
> > >> > replicas being in sync.
> > >>
> > >>
> > >>
> > >> When you are talking about Kafka durability, what durability
> > >> level are you looking for? Are you looking for replication
> > >> durability without fsync?
> > >>
> > >
> > > Yes, the client waits for acks from a number of brokers, which
> > > have not necessarily performed an fsync. Data loss risk is
> > > mitigated by replication.
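
For reference, the Kafka knobs being described here can be sketched as a
producer configuration fragment. `acks` is a real Kafka producer setting
and `min.insync.replicas` a real broker/topic setting, but this fragment
is only an illustration of the durability model under discussion, not
part of any BookKeeper proposal:

```java
import java.util.Properties;

// Sketch of Kafka's replication-without-fsync durability: with
// acks=all the producer waits for every in-sync replica to
// acknowledge, and the broker-side min.insync.replicas setting bounds
// how small that set may shrink. None of this forces an fsync; the OS
// page cache plus replication provides the durability guarantee.
public class KafkaDurabilityConfig {
    public static Properties producerProps() {
        Properties props = new Properties();
        props.put("acks", "all"); // wait for all in-sync replicas
        return props;
    }
}
```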
> > >
> > >
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> >
> > >> > There are many open points, already raised by Matteo, JV and
> > >> > Sijie:
> > >> > - LAC protocol?
> > >> > - replication in case of lost entries?
> > >> > - under production load, mixing non-synced entries with synced
> > >> > entries will not give much benefit
> > >> >
> > >>
> > >> A couple of thoughts on this feature:
> > >>
> > >> 1) We should always stick to one rule: the LAC should only be
> > >> advanced on receiving acknowledgement of entries persisted on disk
> > >> after an fsync (which can bypass the journal if necessary). That
> > >> way all the assumptions for LAC and replication remain the same
> > >> and no change is needed.
> > >>
> > >> 2) Separating the acknowledgement of replication from the
> > >> acknowledgement of fsync (LAC) can achieve 'replicated durability
> > >> without fsync' while still maintaining the correctness of the LAC.
> > >> That means:
> > >>
> > >> a (no-sync) add request can be completed after receiving enough
> > >> responses from bookies; however, the response to a (no-sync) add
> > >> can't advance the LAC. The LAC can only be advanced on
> > >> acknowledgement of sync adds.
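
Rule 2 can be sketched as a toy quorum model. MixedAddTracker, ackQuorum
and the method names are illustrative assumptions, not the real client
implementation; the point is only to separate "add completed" (enough
bookie responses) from "LAC advanced" (a completed sync add whose whole
prefix has also completed):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model: any add (sync or no-sync) completes once ackQuorum
// bookies respond, but only the completion of a sync add may advance
// the LAC, and only when every entry before it has also completed.
public class MixedAddTracker {
    private final int ackQuorum;
    private final Map<Long, Integer> responses = new HashMap<>();
    private final Set<Long> completed = new HashSet<>();
    private long lac = -1;

    public MixedAddTracker(int ackQuorum) {
        this.ackQuorum = ackQuorum;
    }

    // One bookie acknowledged entryId; 'syncAdd' marks a durable add.
    // Returns true when this response completes the add.
    public boolean onBookieResponse(long entryId, boolean syncAdd) {
        int n = responses.merge(entryId, 1, Integer::sum);
        if (n < ackQuorum || completed.contains(entryId)) {
            return false;
        }
        completed.add(entryId);
        if (syncAdd) {
            // advance the LAC only if the whole prefix has completed
            boolean prefixDone = true;
            for (long id = 0; id < entryId; id++) {
                if (!completed.contains(id)) {
                    prefixDone = false;
                    break;
                }
            }
            if (prefixDone && entryId > lac) {
                lac = entryId;
            }
        }
        return true;
    }

    public long getLac() {
        return lac;
    }
}
```

So a no-sync add's completion callback can fire while the LAC stays
where it was, which is exactly the separation rule 2 asks for.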
> > >>
> > >>
> > >> 3) Request ordering and ensemble changes will make it complicated
> > >> to ensure correctness. The elegance of the current replication
> > >> durability with fsync is that you don't rely on request ordering
> > >> or physical layout to ensure ordering and correctness. However, if
> > >> you relax durability and mix no-sync adds and sync adds, you have
> > >> to pay attention to request ordering and flush ordering to ensure
> > >> correctness, and that is going to make things tricky and
> > >> complicated.
> > >>
> > >>
> > >>
> > >> >
> > >> >
> > >> > For the LAC protocol I think that there is no impact; the point
> > >> > is that the LastAddConfirmed is the max entry id which is known
> > >> > to have been acknowledged to the writer, so durability is not a
> > >> > concern. You can lose entries even with fsync, just by losing
> > >> > all the disks which contain the data. Without fsync it is just
> > >> > more probable.
> > >> >
> > >>
> > >> I am against relaxing durability for the LAC protocol, because
> > >> that is the foundation of correctness.
> > >>
> > >> I would prefer advancing the LAC only when entries are replicated
> > >> and durably synced to disk.
> > >>
> > >
> > > Yes. Now I am convinced
> > >
> > >>
> > >>
> > >>
> > >> >
> > >> > Replication: maybe we should record in the ledger metadata that
> > >> > the ledger allows this feature and deal with it accordingly. But
> > >> > I am not sure; I have to understand better how LedgerHandleAdv
> > >> > deals with sparse entry ids in the re-replication process.
> > >> >
> > >>
> > >> Replication should not need to change if we stick to the same LAC
> > >> behavior.
> > >>
> > >>
> > >> >
> > >> > Mixed workload: honestly, I would like to add this feature to
> > >> > limit the number of fsyncs, and I expect lots of bursts of
> > >> > unsynced entries interleaved with a few synced entries. I know
> > >> > that this feature is not to be encouraged in general, only in
> > >> > specific cases, as with LedgerHandleAdv or
> > >> > readUnconfirmedEntries.
> > >> >
> > >> > If this makes sense to you I will create a BP and attach a first
> patch
> > >> >
> > >>
> > >> sure
> > >>
> > >>
> > >> >
> > >> > Enrico
> > >> >
> > >> >
> > >> >
> > >> >
> > >> >
> > >> > --
> > >> >
> > >> >
> > >> > -- Enrico Olivelli
> > >> >
> > >>
> > > --
> > >
> > >
> > > -- Enrico Olivelli
> > >
> >
>
>
>
> --
> Jvrao
> ---
> First they ignore you, then they laugh at you, then they fight you, then
> you win. - Mahatma Gandhi
>