Posted to distributedlog-dev@bookkeeper.apache.org by Enrico Olivelli <eo...@gmail.com> on 2017/09/07 07:32:35 UTC

Re: [DISCUSS] BP-14 Relax Durability

Hi all,


You can find the revised proposal here
https://cwiki.apache.org/confluence/display/BOOKKEEPER/BP-14+Relax+durability

The link to the document open for comments is this:
https://docs.google.com/document/d/1yNi9t2_deOOMXDaGzrnmaHTQeB3B3Fnym82DUERH7LM/edit?usp=sharing

Please check it out.
We are going to review this proposal at the meeting.

-- Enrico


2017-08-30 8:56 GMT+02:00 Enrico Olivelli <eo...@gmail.com>:

> Thank you Sijie for summarizing, and thanks to the community for helping
> with this important enhancement to BookKeeper.
>
> I am convinced that as JV pointed out we need to declare at ledger
> creation time that the ledger is going to perform no-sync writes.
>
> I think we currently need an explicit declaration, to make things "clear"
> to the developer who is using the LedgerHandle API, already at ledger
> creation time.
>
> The case is that we are going to forbid "striping" ledgers
> (ensemble size > quorum size) for no-sync writes in the first
> implementation:
> - one option is to fail at the first no-sync addEntry, but this will be
> really uncomfortable because usually the ack/write/ensemble sizes are
> configured by the admin, and there will be configurations in which errors
> will come out only after starting the system.
> - the second option is to make the developer explicitly enable no-sync
> writes at creation time and fail the creation of the ledger if the
> requested combination of options is not possible
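
The second option can be sketched as a check that runs once, when the ledger is created. This is only a minimal model in Python, not the BookKeeper client API: `LedgerOptions`, `validate_at_creation` and `InvalidLedgerOptions` are hypothetical names, and "quorum size" in the text is assumed to mean the write quorum size.

```python
from dataclasses import dataclass


class InvalidLedgerOptions(ValueError):
    """Raised when the requested combination of ledger options is not allowed."""


@dataclass(frozen=True)
class LedgerOptions:
    # Hypothetical names mirroring the ensemble / write quorum / ack quorum sizes.
    ensemble_size: int
    write_quorum_size: int
    ack_quorum_size: int
    allow_no_sync_writes: bool = False


def validate_at_creation(opts: LedgerOptions) -> LedgerOptions:
    """Reject invalid combinations before any entry is written."""
    if opts.ack_quorum_size > opts.write_quorum_size:
        raise InvalidLedgerOptions("ack quorum cannot exceed write quorum")
    if opts.write_quorum_size > opts.ensemble_size:
        raise InvalidLedgerOptions("write quorum cannot exceed ensemble size")
    # Striping means ensemble size > write quorum size (assumption stated above).
    # The first implementation discussed here forbids it for no-sync writes.
    if opts.allow_no_sync_writes and opts.ensemble_size > opts.write_quorum_size:
        raise InvalidLedgerOptions("no-sync writes are not allowed on striped ledgers")
    return opts
```

With a check like this, an admin-provided configuration such as ensemble 5 / write quorum 3 fails immediately when no-sync writes are requested, instead of at the first no-sync addEntry.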
>
> I am not sure that the changes to the bookie internals are a client API
> matter. Maybe we can leverage custom metadata (as JV said) to make the
> bookie handle ledgers in a different manner; this path will always remain
> open, as custom metadata is already available.
>
> JV preferred the ledger-type approach; the dual solution is to introduce a
> list of "capabilities" or "ledger options".
> I think that this ability to perform no-sync writes is so important that
> "custom metadata" is not a good place to declare it; the same goes for
> "ledger type".
>
> So I am proposing to add a boolean 'allowNoSyncWrites' at ledger creation
> time, without writing it to the ledger metadata on ZK.
> I think that if further improvements need ledger metadata changes, we
> will make them at that point.
>
> I have updated the BP-14 document and added an "Open issues" footer
> with the open points.
> Please add comments and I will correct the document as soon as possible.
>
>
> Enrico
>
>
>
>
> 2017-08-30 1:24 GMT+02:00 Sijie Guo <gu...@gmail.com>:
>
>> Thank you, Enrico, JV.
>>
>> These are great discussions.
>>
>> After reading these two proposals, I have a few very high-level comments,
>> divided into three categories.
>>
>>
>> *API*
>>
>> - I think there are no fundamental differences between these two
>> proposals.
>> They are trying to achieve similar goals by exposing durability levels in
>> different ways.
>> So this will be a discussion on what the API/interface should look like
>> from the user / admin perspective.
>> I would suggest focusing on what the API itself would be, putting the
>> implementation design aside when talking about this.
>>
>> *Core*
>>
>> - Both proposals need to deal with a core question: what happens to LAC
>> and what semantics BookKeeper provides.
>> JV did a good summary in his proposal. However I am not a fan of
>> maintaining two different semantics. So I am looking for
>> a solution in which BookKeeper maintains only one. The semantics are
>> basically:
>>
>> 1) LAC is only advanced when the entries before LAC are committed to
>> persistent storage
>> 2) All the entries up to LAC are successfully committed to persistent
>> storage
>> 3) All the entries up to LAC must be readable at all times.
>>
>> If we maintain these semantics, there is no need to change the auto
>> recovery protocol in BookKeeper. All we guarantee is that those entries
>> are durably persisted.
>>
>> In order to maintain these semantics, I think both JV and I proposed a
>> similar solution in either proposal. I am trying to finalize one here:
>>
>> * bookie maintains a LAS (Last Add Synced) point for each entry.
>> * LAS can be piggybacked on AddResponses
>> * Client uses the LAS to advance LAC.
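
The three points above can be sketched on the client side as follows. This is a simplified model in Python and not the actual BookKeeper client code. It assumes no striping (every bookie stores every entry), that a bookie's LAS means "all entries up to this id are fsynced on that bookie", and that the LAC may advance to the highest entry id that at least an ack quorum of bookies has synced. `LacTracker` and its method names are hypothetical.

```python
class LacTracker:
    """Tracks the LAS reported by each bookie and derives the LAC from it."""

    def __init__(self, bookies, ack_quorum_size):
        bookies = list(bookies)
        if not 1 <= ack_quorum_size <= len(bookies):
            raise ValueError("ack quorum must be between 1 and the number of bookies")
        # -1 means "nothing synced yet", as with an empty ledger.
        self._las = {bookie: -1 for bookie in bookies}
        self._ack_quorum_size = ack_quorum_size
        self.lac = -1

    def on_add_response(self, bookie, las):
        """Record the LAS piggybacked on an add response; return the current LAC."""
        # Responses may arrive out of order, so a bookie's LAS never moves back.
        self._las[bookie] = max(self._las[bookie], las)
        # The k-th highest LAS is synced on at least k bookies.
        synced = sorted(self._las.values(), reverse=True)
        candidate = synced[self._ack_quorum_size - 1]
        # The LAC itself is monotonic as well.
        self.lac = max(self.lac, candidate)
        return self.lac
```

In this model the LAC only moves once enough bookies report the entries as synced, which is what keeps the existing guarantee on entries up to LAC.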
>>
>> If we can agree on the core semantic we are going to provide, the other
>> things are just logistics.
>>
>> *Others*
>>
>> - Regarding separating the journal or bypassing the journal, there is no
>> difference when we talk in terms of the core semantics. They are all
>> non-durable writes (acknowledging before fsyncing).
>> We can start with the same journal approach (but just acknowledge before
>> fsyncing), implement the core and add other options later on.
>>
>>
>> From my point of view, I'd be more interested in providing a single
>> consistent durability semantic that applications can rely on for both
>> durable writes and non-durable writes. The other items seem to be more
>> a matter of logistics.
>>
>>
>> - Sijie
>>
>>
>> On Mon, Aug 28, 2017 at 11:27 PM, Enrico Olivelli <eo...@gmail.com>
>> wrote:
>>
>> > 2017-08-29 8:01 GMT+02:00 Venkateswara Rao Jujjuri <ju...@gmail.com>:
>> >
>> > > I don't believe I fully followed your second case. But even in this
>> > > case, your major concern is about the additional 'sync' RPC?
>> > >
>> >
>> > Yes. Apart from that I am fine with your proposal too, that is, to have
>> > a LedgerType which drives durability,
>> > and I think we need to add per-entry durability options.
>> >
>> > I think that, at least for the 'simple' no-sync addEntry, we do not need
>> > to change many things. I am drafting a prototype and I will share it as
>> > soon as we all agree on the roadmap.
>> >
>> > The first implementation can cover the first cases (no-sync addEntry)
>> > and change the way the writer advances the LAC in order to support
>> > 'relaxed durability writes'.
>> > This change will be compatible with future improvements and it will open
>> > the door for big changes on the bookie side, like bypassing the journal
>> > or leveraging multiple journals.
>> >
>> > -- Enrico
>> >
>> > > or something else for which the LedgerType proposal won't work?
>> > >
>> >
>> > >
>> > >
>> > > On Mon, Aug 28, 2017 at 7:35 AM, Enrico Olivelli <eolivelli@gmail.com>
>> > > wrote:
>> > >
>> > > > I think that having a set of options on the ledger metadata will be
>> > > > a good enhancement and I am sure we will do it as soon as it is
>> > > > needed; maybe we do not need it now.
>> > > >
>> > > > Actually I think we will need to declare this durability level at
>> > > > entry level to support some use cases in the BP-14 document. Let me
>> > > > explain two of my use cases for which I need it:
>> > > >
>> > > > At a higher level we have two choices:
>> > > >
>> > > > A) per-ledger durability options (JV proposal)
>> > > > all addEntry operations are durable or non-durable and there is an
>> > > > explicit 'sync' API (+ forced sync at close)
>> > > >
>> > > > B) per-entry durability options (original BP-14 proposal)
>> > > > every addEntry has its own durable/non-durable option (sync/no-sync),
>> > > > with the ability to call 'sync' without addEntry (+ forced sync at
>> > > > close)
>> > > >
>> > > > I am speaking about the database WAL case. I am using the ledger as
>> > > > a segment for the WAL of a database and I am writing all data changes
>> > > > in the scope of a 'transaction' with the relaxed-durability flag.
>> > > > Then I am writing the 'transaction committed' entry with a "strict
>> > > > durability" requirement; this will in fact require that all previous
>> > > > entries are persisted durably, so that the transaction will never be
>> > > > lost.
>> > > >
>> > > > In this scenario we would need an addEntry + sync API in fact:
>> > > >
>> > > > using option  A) the WAL will look like:
>> > > > - open ledger no-sync = true
>> > > > - addEntry (set foo=bar)  (this will be no-sync)
>> > > > - addEntry (set foo=bar2) (this will be no-sync)
>> > > > - addEntry (commit)
>> > > > - sync
>> > > >
>> > > > using option B) the WAL will look like
>> > > > - open ledger
>> > > > - addEntry (set foo=bar), no-sync
>> > > > - addEntry (set foo=bar2), no-sync
>> > > > - addEntry (commit), sync
>> > > >
>> > > > In case B) we are "saving" one RPC call to every bookie (the 'sync'
>> > > > one). The same holds for single data change entries, like updating a
>> > > > single record on the database: with BK 4.5 this "costs" only a single
>> > > > RPC to every bookie.
>> > > > Second case:
>> > > > I am using BookKeeper to store binary objects, so I am packing
>> > > > several 'objects' (named sequences of bytes) into a single ledger,
>> > > > like you do when you write many records to a file in a streaming
>> > > > fashion and keep track of the offsets of the beginning of every
>> > > > record (LedgerHandleAdv is perfect for this case).
>> > > > I am not using a single ledger per 'file' because it kills ZooKeeper
>> > > > to create many ledgers very fast. In my systems I have big bursts of
>> > > > writes, which need to be really "fast", so I am writing multiple
>> > > > 'files' to every single ledger. So the close-to-open consistency at
>> > > > ledger level is not suitable for this case.
>> > > > I have to write as fast as possible to this 'ledger-backed' stream,
>> > > > and as with a 'traditional' filesystem I am writing parts of each
>> > > > file and then requiring 'sync' at the end of each file.
>> > > > Using BookKeeper you need to split big 'files' into "little" parts;
>> > > > you cannot transmit the contents as a "real" stream on the network.
>> > > >
>> > > > I am not talking about bookie-level implementation details. I would
>> > > > like to define the high-level API in order to support all the
>> > > > relevant known use cases and keep space for the future.
>> > > > At this moment adding a per-entry 'durability option' seems to be
>> > > > very flexible and simple to implement, and it does not prevent us
>> > > > from doing further improvements, such as skipping the journal.
>> > > >
>> > > > Enrico
>> > > >
>> > > >
>> > > >
>> > > > 2017-08-26 19:55 GMT+02:00 Enrico Olivelli <eo...@gmail.com>:
>> > > >
>> > > > >
>> > > > >
>> > > > > On Sat, 26 Aug 2017, 19:19 Venkateswara Rao Jujjuri
>> > > > > <jujjuri@gmail.com> wrote:
>> > > > >
>> > > > >> Hi all,
>> > > > >>
>> > > > >> As promised during Thursday's call, here is my proposal.
>> > > > >>
>> > > > >> *NOTE*: Major difference in this proposal compared to Enrico’s
>> > > > >> <https://docs.google.com/document/d/1JLYO3K3tZ5PJGmyS0YK_-NW8VOUUgUWVBmswCUOG158/edit#heading=h.q2rewiqndr5v>
>> > > > >> is making the durability a property of the ledger (type) as opposed
>> > > > >> to addEntry(). The rest of the technical details have a lot of
>> > > > >> similarities.
>> > > > >>
>> > > > >
>> > > > > Thank you JV. I have just read the doc quickly and your view is
>> > > > > certainly broader.
>> > > > > I will dig into the doc as soon as possible on Monday.
>> > > > > For me it is ok to have a ledger-wide configuration. I think that
>> > > > > the most important decision is about the API we will provide, as in
>> > > > > the future it will be difficult to change it.
>> > > > >
>> > > > >
>> > > > > Cheers
>> > > > > Enrico
>> > > > >
>> > > > >
>> > > > >
>> > > > >> https://docs.google.com/document/d/1g1eBcVVCZrTG8YZliZP0LVqvWpq432ODEghrGVQ4d4Q/edit?usp=sharing
>> > > > >>
>> > > > >> On Thu, Aug 24, 2017 at 1:14 AM, Enrico Olivelli
>> > > > >> <eolivelli@gmail.com> wrote:
>> > > > >>
>> > > > >> > Thank you all for the comments and for taking a look at the
>> > > > >> > document so soon.
>> > > > >> > I have updated the doc, we will discuss the document at the
>> > > > >> > meeting,
>> > > > >> >
>> > > > >> >
>> > > > >> > Enrico
>> > > > >> >
>> > > > >> > 2017-08-24 2:27 GMT+02:00 Sijie Guo <gu...@gmail.com>:
>> > > > >> >
>> > > > >> > > Enrico,
>> > > > >> > >
>> > > > >> > > Thank you so much! It is a great effort putting this up.
>> > > > >> > > Overall it looks good. I made some comments, we can discuss
>> > > > >> > > them at tomorrow's community meeting.
>> > > > >> > >
>> > > > >> > > - Sijie
>> > > > >> > >
>> > > > >> > > On Wed, Aug 23, 2017 at 8:25 AM, Enrico Olivelli
>> > > > >> > > <eolivelli@gmail.com> wrote:
>> > > > >> > >
>> > > > >> > > > Hi all,
>> > > > >> > > > I have drafted a first proposal for BP-14 - Relax Durability
>> > > > >> > > >
>> > > > >> > > > We are talking about limiting the number of fsyncs to the
>> > > > >> > > > journal while preserving the correctness of the LAC protocol.
>> > > > >> > > >
>> > > > >> > > > This is the link to the wiki page, but as the issue is huge we
>> > > > >> > > > prefer to use Google Documents for sharing comments
>> > > > >> > > > https://cwiki.apache.org/confluence/display/BOOKKEEPER/BP+-+14+Relax+durability
>> > > > >> > > >
>> > > > >> > > > This is the document
>> > > > >> > > > https://docs.google.com/document/d/1JLYO3K3tZ5PJGmyS0YK_-NW8VOUUgUWVBmswCUOG158/edit?usp=sharing
>> > > > >> > > >
>> > > > >> > > > All comments are welcome
>> > > > >> > > >
>> > > > >> > > > I have added the DL dev list in cc as the discussion is
>> > > > >> > > > interesting for both groups
>> > > > >> > > >
>> > > > >> > > > Enrico Olivelli
>> > > > >> > > >
>> > > > >> > >
>> > > > >> >
>> > > > >>
>> > > > >>
>> > > > >>
>> > > > >> --
>> > > > >> Jvrao
>> > > > >> ---
>> > > > >> First they ignore you, then they laugh at you, then they fight
>> > > > >> you, then you win. - Mahatma Gandhi
>> > > > >>
>> > > > > --
>> > > > >
>> > > > >
>> > > > > -- Enrico Olivelli
>> > > > >
>> > > >
>> > >
>> > >
>> > >
>> > > --
>> > > Jvrao
>> > > ---
>> > > First they ignore you, then they laugh at you, then they fight you,
>> > > then you win. - Mahatma Gandhi
>> > >
>> >
>>
>
>

Re: [DISCUSS] BP-14 Relax Durability

Posted by Enrico Olivelli <eo...@gmail.com>.
Thanks Sijie
I will do my best.

I can try to separate:
1) protocol changes (protobuf)
2) new client side API
3) LAC protocol changes and bookie-side changes
4) additional tests

Actually I already have a private work-in-progress branch with the full
stack. I will finish implementing what the document describes and then
split it into pieces.

b.q.
I left one comment on the doc about the retention of the SyncCounter on the
bookie side

-- Enrico


2017-09-12 10:08 GMT+02:00 Sijie Guo <gu...@gmail.com>:

> Cool.
>
> I would expect this to be a big change. It would be good if you can divide
> it into smaller tasks, so people can review them more easily.
>
> - Sijie
>
> On Tue, Sep 12, 2017 at 1:05 AM, Enrico Olivelli <eo...@gmail.com>
> wrote:
>
> > Thank you all !
> >
> > I will copy the content of the Final draft to the Wiki and mark the
> > document as "Accepted"
> >
> > I will send a PR soon but it will depend on BP-15 New CreateLedger API
> >
> > I hope we could make it for 4.6
> >
> >
> > Enrico
> >
> >
> > 2017-09-11 18:58 GMT+02:00 Sijie Guo <gu...@gmail.com>:
> >
> > > Enrico,
> > >
> > > Feel free to close the thread and mark this BP as accepted, if there is
> > > no -1.
> > >
> > > - Sijie
> > >
> > > On Mon, Sep 11, 2017 at 2:26 AM, Enrico Olivelli <eo...@gmail.com>
> > > wrote:
> > >
> > > > Ping
> > > >
> > > > >>> > > > >> you win. - Mahatma Gandhi
> > > > >>> > > > >>
> > > > >>> > > > > --
> > > > >>> > > > >
> > > > >>> > > > >
> > > > >>> > > > > -- Enrico Olivelli
> > > > >>> > > > >
> > > > >>> > > >
> > > > >>> > >
> > > > >>> > >
> > > > >>> > >
> > > > >>> > > --
> > > > >>> > > Jvrao
> > > > >>> > > ---
> > > > >>> > > First they ignore you, then they laugh at you, then they
> fight
> > > you,
> > > > >>> then
> > > > >>> > > you win. - Mahatma Gandhi
> > > > >>> > >
> > > > >>> >
> > > > >>>
> > > > >>
> > > > >>
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] BP-14 Relax Durability

Posted by Enrico Olivelli <eo...@gmail.com>.
Thanks Sijie
I will do my best.

I can try to separate:
1) protocol changes (protobuf)
2) new client side API
3) LAC protocol changes and bookie-side changes
4) additional tests

Actually I already have a private work-in-progress branch with the full
stack. I will finish implementing what the document describes and then split
the work into pieces.

b.q.
I left one comment on the doc about the retention of the SyncCounter on the
bookie side

-- Enrico


2017-09-12 10:08 GMT+02:00 Sijie Guo <gu...@gmail.com>:

> Cool.
>
> I would expect this to be a big change. It would be good if you could divide
> it into smaller tasks, so people can review them more easily.
>
> - Sijie
>
> On Tue, Sep 12, 2017 at 1:05 AM, Enrico Olivelli <eo...@gmail.com>
> wrote:
>
> > Thank you all !
> >
> > I will copy the content of the Final draft to the Wiki and mark the
> > document as "Accepted"
> >
> > I will send a PR soon but it will depend on BP-15 New CreateLedger API
> >
> > I hope we could make it for 4.6
> >
> >
> > Enrico
> >
> >
> > 2017-09-11 18:58 GMT+02:00 Sijie Guo <gu...@gmail.com>:
> >
> > > Enrico,
> > >
> > > Feel free to close the thread and mark this BP as accepted, if there is
> > no
> > > -1.
> > >
> > > - Sijie
> > >
> > > On Mon, Sep 11, 2017 at 2:26 AM, Enrico Olivelli <eo...@gmail.com>
> > > wrote:
> > >
> > > > Ping
> > > >
> > > > 2017-09-07 9:32 GMT+02:00 Enrico Olivelli <eo...@gmail.com>:
> > > >
> > > > > Hi all,
> > > > >
> > > > >
> > > > > You can find the revised proposal here
> > > > > https://cwiki.apache.org/confluence/display/BOOKKEEPER/
> > > > > BP-14+Relax+durability
> > > > >
> > > > > The link to the document open for comments is this:
> > > > > https://docs.google.com/document/d/1yNi9t2_
> > > > deOOMXDaGzrnmaHTQeB3B3Fnym82DU
> > > > > ERH7LM/edit?usp=sharing
> > > > >
> > > > > Please check it out
> > > > > We are going to review this Proposal at the meeting
> > > > >
> > > > > -- Enrico
> > > > >
> > > > >
> > > > > 2017-08-30 8:56 GMT+02:00 Enrico Olivelli <eo...@gmail.com>:
> > > > >
> > > > >> Thank you Sijie for summarizing and thanks to the community for
> > > helping
> > > > >> in this important enhancement to BookKeeper
> > > > >>
> > > > >> I am convinced that as JV pointed out we need to declare at ledger
> > > > >> creation time that the ledger is going to perform no-sync writes.
> > > > >>
> > > > >> I think we need an explicit declaration currently to make things
> > > "clear"
> > > > > >> to the developer who is using the LedgerHandle API, even at ledger
> > > > > >> creation time.
> > > > >>
> > > > >> The case is that we are going to forbid "striping" ledgers
> (ensemble
> > > > size
> > > > >> > quorum size) for no-sync writes in the first implementation:
> > > > >> - one option is to  fail at the first no-sync addEntry, but this
> > will
> > > be
> > > > >> really uncomfortable because usually the ack/write/ensemble sizes
> > are
> > > > >> configured by the admin, and there will be configurations in which
> > > > errors
> > > > >> will come out only after starting the system.
> > > > >> - the second option is to make the developer explicitly enable
> > no-sync
> > > > >> writes at creation time and fail the creation of the ledger if the
> > > > > >> requested combination of options is not possible
> > > > >>
> > > > >> I am not sure that the changes to the bookie internals are a
> > > Client-API
> > > > >> matter, maybe we can leverage custom metadata (as JV said) in
> order
> > to
> > > > make
> > > > >> the bookie handle ledgers in a different manner, this way will be
> > > always
> > > > >> open as custom metadata are already here.
> > > > >>
> > > > >> JV preferred the ledger-type approach, the dual solution is to
> > > introduce
> > > > >> a list of "capabilities" or "ledger options".
> > > > > >> I think that this ability to perform no-sync writes is so important
> > > that
> > > > >> "custom metadata" is not the good place to declare it, same for
> > > "ledger
> > > > >> type"
> > > > >>
> > > > > >> So I am proposing to add a boolean 'allowNoSyncWrites' at ledger
> > > > creation
> > > > >> time, without writing in to ledger metadata on ZK,
> > > > >> I think that if further improvements will need ledger metadata
> > changes
> > > > we
> > > > >> will do.
> > > > >>
> > > > >> I have updated the BP-14 document, I have added an "Open issues"
> > > footer
> > > > >> with the open points,
> > > > >> please add comments and I will correct the document as soon as
> > > possible.
> > > > >>
> > > > >>
> > > > >> Enrico
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >> 2017-08-30 1:24 GMT+02:00 Sijie Guo <gu...@gmail.com>:
> > > > >>
> > > > >>> Thank you, Enrico, JV.
> > > > >>>
> > > > >>> These are great discussions.
> > > > >>>
> > > > >>> After reading these two proposals, I have a few very high-level
> > > > comments,
> > > > >>> dividing into three categories.
> > > > >>>
> > > > >>>
> > > > >>> *API*
> > > > >>>
> > > > > >>> - I think there are no fundamental differences between these
> two
> > > > >>> proposals.
> > > > >>> They are trying to achieve similar goals by exposing durability
> > > levels
> > > > in
> > > > >>> different way.
> > > > >>> So this will be a discussion on what API/interface should look
> like
> > > > from
> > > > >>> user / admin perspective.
> > > > >>> I would suggest focusing what would be the API itself, putting
> the
> > > > >>> implementation design aside when talking about this.
> > > > >>>
> > > > >>> *Core*
> > > > >>>
> > > > >>> - Both proposals need to deal with a core function - what happen
> to
> > > LAC
> > > > >>> and
> > > > >>> what semantic that bookkeeper provides.
> > > > >>> JV did a good summary in his proposal. However I am not a fan of
> > > > >>> maintaining two different semantics. So I am looking for
> > > > >>> a solution that bookkeeper can only maintain one semantic. The
> > > semantic
> > > > >>> is
> > > > >>> basically:
> > > > >>>
> > > > >>> 1) LAC only advanced when entries before LAC are committed to the
> > > > >>> persistent storage
> > > > >>> 2) All the entries until LAC are successfully committed to the
> > > > >>> persistence
> > > > >>> storage
> > > > >>> 3) Entries until LAC: all the entries must be readable all the
> > time.
> > > > >>>
> > > > >>> If we maintain such semantic, there is no need to change the auto
> > > > >>> recovery
> > > > >>> protocol in bookkeeper. All what we guarantee are the entries
> > durably
> > > > >>> persistent.
> > > > >>>
> > > > >>> In order to maintain such semantic, I think both me and JV
> proposed
> > > > >>> similar
> > > > >>> solution in either proposal. I am trying to finalize one here:
> > > > >>>
> > > > >>> * bookie maintains a LAS (Last Add Synced) point for each entry.
> > > > >>> * LAS can be piggybacked on AddResponses
> > > > >>> * Client uses the LAS to advance LAC.
> > > > >>>
> > > > >>> If we can agree on the core semantic we are going to provide, the
> > > other
> > > > >>> things are just logistics.
> > > > >>>
> > > > >>> *Others*
> > > > >>>
> > > > >>> - Regarding separating journal or bypassing journal, there is no
> > > > >>> difference
> > > > >>> when we talking from the core semantic. They are all non-durably
> > > writes
> > > > >>> (acknowledging before fsyncing).
> > > > >>> We can start with same journal approach (but just acknowledge
> > before
> > > > >>> fsyncing), implement the core and add other options later on.
> > > > >>>
> > > > >>>
> > > > > >>> From my point of view, I'd be more interested in providing a
> > single
> > > > >>> consistent durable semantic that application can rely on for both
> > > > durable
> > > > >>> writes and non-durable writes. The other stuffs seem to be more
> > > > logistics
> > > > >>> things.
> > > > >>>
> > > > >>>
> > > > >>> - Sijie
> > > > >>>
> > > > >>>
> > > > >>> On Mon, Aug 28, 2017 at 11:27 PM, Enrico Olivelli <
> > > eolivelli@gmail.com
> > > > >
> > > > >>> wrote:
> > > > >>>
> > > > >>> > 2017-08-29 8:01 GMT+02:00 Venkateswara Rao Jujjuri <
> > > > jujjuri@gmail.com
> > > > >>> >:
> > > > >>> >
> > > > >>> > > I don't believe I fully followed your second case. But even
> in
> > > this
> > > > >>> case,
> > > > >>> > > your major concern is about the additional 'sync' RPC?
> > > > >>> > >
> > > > >>> >
> > > > >>> > yes apart from that I am fine with your proposal too, that is
> to
> > > > have a
> > > > >>> > LedgerType which drives durability
> > > > >>> > and I think we need to add per-entry durability options
> > > > >>> >
> > > > >>> > I think that at least for the 'simple' no-sync addEntry we do
> not
> > > > need
> > > > >>> to
> > > > >>> > change many things, I am drafting a prototype, I will share it
> as
> > > > soon
> > > > >>> as
> > > > >>> > we all agree on the roadmap
> > > > >>> >
> > > > >>> > The first implementation can cover the first cases (no-sync
> > > addEntry)
> > > > >>> and
> > > > >>> > change the way the writer advances the LAC in order to support
> > > > 'relaxed
> > > > >>> > durability writes'.
> > > > >>> > This change will be compatible with future improvements and it
> > will
> > > > >>> open
> > > > >>> > the door for big changes on the bookie side like bypassing the
> > > > journal
> > > > >>> or
> > > > >>> > leveraging multiple journals.....
> > > > >>> >
> > > > >>> > -- Enrico
> > > > >>> >
> > > > >>> > or something else that the LedgerType proposal won't work?
> > > > >>> > >
> > > > >>> >
> > > > >>> > >
> > > > >>> > >
> > > > >>> > > On Mon, Aug 28, 2017 at 7:35 AM, Enrico Olivelli <
> > > > >>> eolivelli@gmail.com>
> > > > >>> > > wrote:
> > > > >>> > >
> > > > >>> > > > I think that having a set of options on the ledger metadata
> > > will
> > > > >>> be a
> > > > >>> > > good
> > > > >>> > > > enhancement and I am sure we will do it as soon as it will
> be
> > > > >>> needed,
> > > > >>> > > maybe
> > > > >>> > > > we do not need it now.
> > > > >>> > > >
> > > > >>> > > > Actually I think we will need to declare this
> > durability-level
> > > at
> > > > >>> entry
> > > > > >>> > > > level to support some use cases in BP-14 document, let me
> > > > explain
> > > > >>> two
> > > > >>> > of
> > > > >>> > > > my usecases for which I need it:
> > > > >>> > > >
> > > > > >>> > > > At a higher level we have two choices:
> > > > >>> > > >
> > > > >>> > > > A) per-ledger durability options (JV proposal)
> > > > >>> > > > all addEntry operations are durable or non-durable and
> there
> > is
> > > > an
> > > > >>> > > explicit
> > > > >>> > > > 'sync' API (+ forced sync at close)
> > > > >>> > > >
> > > > >>> > > > B) per-entry durability options (original BP-14 proposal)
> > > > >>> > > > every addEntry has an own durable/non-durable option
> > > > >>> (sync/no-sync),
> > > > >>> > with
> > > > >>> > > > the ability to call 'sync' without addEntry (+ forced sync
> at
> > > > >>> close)
> > > > >>> > > >
> > > > > >>> > > > I am speaking about the database WAL case, I am using
> the
> > > > >>> ledger as
> > > > >>> > > > segment for the WAL of a database and I am writing all data
> > > > >>> changes in
> > > > >>> > > the
> > > > >>> > > > scope of a 'transaction' with the relaxed-durability flag,
> > > then I
> > > > >>> am
> > > > >>> > > > writing the 'transaction committed' entry with "strict
> > > > durability"
> > > > >>> > > > requirement, this will in fact require that all previous
> > > entries
> > > > >>> are
> > > > >>> > > > persisted durably and so that the transaction will never be
> > > lost.
> > > > >>> > > >
> > > > >>> > > > In this scenario we would need an addEntry + sync API in
> > fact:
> > > > >>> > > >
> > > > >>> > > > using option  A) the WAL will look like:
> > > > >>> > > > - open ledger no-sync = true
> > > > >>> > > > - addEntry (set foo=bar)  (this will be no-sync)
> > > > >>> > > > - addEntry (set foo=bar2) (this will be no-sync)
> > > > >>> > > > - addEntry (commit)
> > > > >>> > > > - sync
> > > > >>> > > >
> > > > >>> > > > using option B) the WAL will look like
> > > > >>> > > > - open ledger
> > > > >>> > > > - addEntry (set foo=bar), no-sync
> > > > >>> > > > - addEntry (set foo=bar2), no-sync
> > > > >>> > > > - addEntry (commit), sync
> > > > >>> > > >
> > > > >>> > > > in case B) we are "saving" one RPC call to every bookie
> (the
> > > > 'sync'
> > > > >>> > one)
> > > > >>> > > > same for single data change entries, like updating a single
> > > > record
> > > > >>> on
> > > > >>> > the
> > > > >>> > > > database, this with BK 4.5 "costs" only a single RPC to
> every
> > > > >>> bookie
> > > > >>> > > >
> > > > >>> > > > Second case:
> > > > >>> > > > I am using BookKeeper to store binary objects, so I am
> > packing
> > > > more
> > > > >>> > > > 'objects' (named sequences of bytes) into a single ledger,
> > like
> > > > >>> you do
> > > > >>> > > when
> > > > >>> > > > you write many records to a file in a streaming fashion and
> > > keep
> > > > >>> track
> > > > >>> > of
> > > > > >>> > > > offsets of the beginning of every record (LedgerHandleAdv is
> > > > >>> perfect for
> > > > >>> > > > this case).
> > > > >>> > > > I am not using a single ledger per 'file' because it kills
> > > > >>> zookeeper to
> > > > >>> > > > create many ledgers very fast, in my systems I have big
> bursts
> > > of
> > > > >>> > writes,
> > > > >>> > > > which need to be really "fast", so I am writing multiple
> > > 'files'
> > > > to
> > > > >>> > every
> > > > >>> > > > single ledger. So the close-to-open consistency at ledger
> > level
> > > > is
> > > > >>> not
> > > > >>> > > > suitable for this case.
> > > > >>> > > > I have to write as fast as possible to this 'ledger-backed'
> > > > >>> stream, and
> > > > >>> > > as
> > > > >>> > > > with a 'traditional'  filesystem I am writing parts of each
> > > file
> > > > >>> and
> > > > > >>> > then
> > > > >>> > > > requiring 'sync' at the end of each file.
> > > > >>> > > > Using BookKeeper you need to split big 'files' into
> "little"
> > > > >>> parts, you
> > > > >>> > > > cannot transmit the contents as to "real" stream on
> network.
> > > > >>> > > >
> > > > >>> > > > I am not talking about bookie level implementation details
> I
> > > > would
> > > > >>> like
> > > > >>> > > to
> > > > >>> > > > define the high level API in order to support all the
> > relevant
> > > > >>> known
> > > > >>> > use
> > > > >>> > > > cases and keep space for the future,
> > > > >>> > > > at this moment adding a per-entry 'durability option' seems
> > to
> > > be
> > > > >>> very
> > > > >>> > > > flexible and simple to implement, it does not prevent us
> from
> > > > doing
> > > > >>> > > further
> > > > >>> > > > improvements, like namely skipping the journal.
> > > > >>> > > >
> > > > >>> > > > Enrico
> > > > >>> > > >
> > > > >>> > > >
> > > > >>> > > >
> > > > >>> > > > 2017-08-26 19:55 GMT+02:00 Enrico Olivelli <
> > > eolivelli@gmail.com
> > > > >:
> > > > >>> > > >
> > > > >>> > > > >
> > > > >>> > > > >
> > > > >>> > > > > On sab 26 ago 2017, 19:19 Venkateswara Rao Jujjuri <
> > > > >>> > jujjuri@gmail.com>
> > > > >>> > > > > wrote:
> > > > >>> > > > >
> > > > >>> > > > >> Hi all,
> > > > >>> > > > >>
> > > > >>> > > > >> As promised during Thursday call, here is my proposal.
> > > > >>> > > > >>
> > > > >>> > > > >> *NOTE*: Major difference in this proposal compared to
> > > Enrico’s
> > > > >>> > > > >> <https://docs.google.com/document/d/
> > 1JLYO3K3tZ5PJGmyS0YK_-
> > > > >>> > > > >> NW8VOUUgUWVBmswCUOG158/edit#heading=h.q2rewiqndr5v>
> > > > >>> > > > >> is
> > > > >>> > > > >> making the durability a property of the ledger(type) as
> > > > opposed
> > > > >>> to
> > > > >>> > > > >> addEntry(). Rest of the technical details have a lot of
> > > > >>> > similarities.
> > > > >>> > > > >>
> > > > >>> > > > >
> > > > >>> > > > > Thank you JV. I have just read quickly the doc and your
> > view
> > > is
> > > > > >>> > > certainly
> > > > >>> > > > > broader.
> > > > >>> > > > > I will dig into the doc as soon as possible on Monday.
> > > > >>> > > > > For me it is ok to have a ledger wide configuration I
> think
> > > > that
> > > > >>> the
> > > > >>> > > most
> > > > >>> > > > > important decision is about the API we will provide as in
> > the
> > > > >>> future
> > > > >>> > it
> > > > >>> > > > > will be difficult to change it.
> > > > >>> > > > >
> > > > >>> > > > >
> > > > >>> > > > > Cheers
> > > > >>> > > > > Enrico
> > > > >>> > > > >
> > > > >>> > > > >
> > > > >>> > > > >
> > > > >>> > > > >> https://docs.google.com/document/d/
> > > 1g1eBcVVCZrTG8YZliZP0LVqv
> > > > >>> Wpq43
> > > > >>> > > > >> 2ODEghrGVQ4d4Q/edit?usp=sharing
> > > > >>> > > > >>
> > > > >>> > > > >> On Thu, Aug 24, 2017 at 1:14 AM, Enrico Olivelli <
> > > > >>> > eolivelli@gmail.com
> > > > >>> > > >
> > > > >>> > > > >> wrote:
> > > > >>> > > > >>
> > > > >>> > > > >> > Thank you all for the comments and for taking a look
> to
> > > the
> > > > >>> > document
> > > > >>> > > > so
> > > > >>> > > > >> > soon.
> > > > >>> > > > >> > I have updated the doc, we will discuss the document
> at
> > > the
> > > > >>> > meeting,
> > > > >>> > > > >> >
> > > > >>> > > > >> >
> > > > >>> > > > >> > Enrico
> > > > >>> > > > >> >
> > > > >>> > > > >> > 2017-08-24 2:27 GMT+02:00 Sijie Guo <
> guosijie@gmail.com
> > >:
> > > > >>> > > > >> >
> > > > >>> > > > >> > > Enrico,
> > > > >>> > > > >> > >
> > > > >>> > > > >> > > Thank you so much! It is a great effort for putting
> > this
> > > > up.
> > > > >>> > > Overall
> > > > >>> > > > >> > looks
> > > > >>> > > > >> > > good. I made some comments, we can discuss at
> > tomorrow's
> > > > >>> > community
> > > > >>> > > > >> > meeting.
> > > > >>> > > > >> > >
> > > > >>> > > > >> > > - Sijie
> > > > >>> > > > >> > >
> > > > >>> > > > >> > > On Wed, Aug 23, 2017 at 8:25 AM, Enrico Olivelli <
> > > > >>> > > > eolivelli@gmail.com
> > > > >>> > > > >> >
> > > > >>> > > > >> > > wrote:
> > > > >>> > > > >> > >
> > > > >>> > > > >> > > > Hi all,
> > > > >>> > > > >> > > > I have drafted a first proposal for BP-14 - Relax
> > > > >>> Durability
> > > > >>> > > > >> > > >
> > > > >>> > > > >> > > > We are talking about limiting the number of fsync
> to
> > > the
> > > > >>> > journal
> > > > >>> > > > >> while
> > > > >>> > > > >> > > > preserving the correctness of the LAC protocol.
> > > > >>> > > > >> > > >
> > > > >>> > > > >> > > > This is the link to the wiki page, but as the
> issue
> > is
> > > > >>> huge we
> > > > >>> > > > >> prefer
> > > > >>> > > > >> > to
> > > > >>> > > > >> > > > use Google Documents for sharing comments
> > > > >>> > > > >> > > > https://cwiki.apache.org/
> > > confluence/display/BOOKKEEPER/
> > > > >>> > > > >> > > > BP+-+14+Relax+durability
> > > > >>> > > > >> > > >
> > > > >>> > > > >> > > > This is the document
> > > > >>> > > > >> > > > https://docs.google.com/document/d/
> > > > 1JLYO3K3tZ5PJGmyS0YK_-
> > > > >>> > > > >> > > > NW8VOUUgUWVBmswCUOG158/edit?usp=sharing
> > > > >>> > > > >> > > >
> > > > >>> > > > >> > > > All comments are welcome
> > > > >>> > > > >> > > >
> > > > >>> > > > >> > > > I have added DL dev list in cc as the discussion
> is
> > > > >>> > interesting
> > > > >>> > > > for
> > > > >>> > > > >> > both
> > > > >>> > > > >> > > > groups
> > > > >>> > > > >> > > >
> > > > >>> > > > >> > > > Enrico Olivelli
> > > > >>> > > > >> > > >
> > > > >>> > > > >> > >
> > > > >>> > > > >> >
> > > > >>> > > > >>
> > > > >>> > > > >>
> > > > >>> > > > >>
> > > > >>> > > > >> --
> > > > >>> > > > >> Jvrao
> > > > >>> > > > >> ---
> > > > >>> > > > >> First they ignore you, then they laugh at you, then they
> > > fight
> > > > >>> you,
> > > > >>> > > then
> > > > >>> > > > >> you win. - Mahatma Gandhi
> > > > >>> > > > >>
> > > > >>> > > > > --
> > > > >>> > > > >
> > > > >>> > > > >
> > > > >>> > > > > -- Enrico Olivelli
> > > > >>> > > > >
> > > > >>> > > >
> > > > >>> > >
> > > > >>> > >
> > > > >>> > >
> > > > >>> > > --
> > > > >>> > > Jvrao
> > > > >>> > > ---
> > > > >>> > > First they ignore you, then they laugh at you, then they
> fight
> > > you,
> > > > >>> then
> > > > >>> > > you win. - Mahatma Gandhi
> > > > >>> > >
> > > > >>> >
> > > > >>>
> > > > >>
> > > > >>
> > > > >
> > > >
> > >
> >
>
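
The LAS rule quoted above (each bookie keeps a Last Add Synced point,
piggybacks it on its add responses, and the client uses it to advance the
LAC) could look roughly like the following on the client side. This is only
a sketch under assumptions that are not stated in the thread: LacTracker and
onAddResponse are hypothetical names and not part of the BookKeeper API, the
LAS is tracked per ledger per bookie, the ledger is not striped (every bookie
in the ensemble receives every entry, as the thread requires for no-sync
writes), and the LAC is taken to be the highest entry id that at least
ackQuorum bookies report as synced.

```java
import java.util.Arrays;

/** Hypothetical client-side helper; not part of the BookKeeper API. */
final class LacTracker {
    // Last Add Synced reported by each bookie; -1 means nothing synced yet.
    private final long[] lastAddSynced;
    private final int ackQuorum;
    private long lac = -1L;

    LacTracker(int ensembleSize, int ackQuorum) {
        if (ackQuorum < 1 || ackQuorum > ensembleSize) {
            throw new IllegalArgumentException("ackQuorum must be in [1, ensembleSize]");
        }
        this.lastAddSynced = new long[ensembleSize];
        Arrays.fill(this.lastAddSynced, -1L);
        this.ackQuorum = ackQuorum;
    }

    /** Records the LAS piggybacked on an add response and returns the current LAC. */
    long onAddResponse(int bookieIndex, long las) {
        // A stale or reordered response must not move a bookie's LAS backwards.
        lastAddSynced[bookieIndex] = Math.max(lastAddSynced[bookieIndex], las);

        long[] sorted = lastAddSynced.clone();
        Arrays.sort(sorted); // ascending
        // At least ackQuorum bookies have synced everything up to this entry id.
        long candidate = sorted[sorted.length - ackQuorum];

        // The LAC only ever advances.
        lac = Math.max(lac, candidate);
        return lac;
    }

    long lac() {
        return lac;
    }

    public static void main(String[] args) {
        LacTracker tracker = new LacTracker(3, 2);
        tracker.onAddResponse(0, 5);
        tracker.onAddResponse(1, 3);
        tracker.onAddResponse(2, 7);
        System.out.println(tracker.lac()); // prints 5
    }
}
```

With an ensemble of 3 and an ack quorum of 2, the LAC stays at -1 while only
one bookie has synced anything, and it moves to 5 once two bookies report a
LAS of at least 5. A no-sync add that is acknowledged before fsync therefore
does not advance the LAC until enough bookies report it as synced.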

Re: [DISCUSS] BP-14 Relax Durability

Posted by Sijie Guo <gu...@gmail.com>.
Cool.

I would expect this to be a big change. It would be good if you could divide
it into smaller tasks, so people can review them more easily.

- Sijie

On Tue, Sep 12, 2017 at 1:05 AM, Enrico Olivelli <eo...@gmail.com>
wrote:

> Thank you all !
>
> I will copy the content of the Final draft to the Wiki and mark the
> document as "Accepted"
>
> I will send a PR soon but it will depend on BP-15 New CreateLedger API
>
> I hope we could make it for 4.6
>
>
> Enrico
>
>
> 2017-09-11 18:58 GMT+02:00 Sijie Guo <gu...@gmail.com>:
>
> > Enrico,
> >
> > Feel free to close the thread and mark this BP as accepted, if there is
> no
> > -1.
> >
> > - Sijie
> >
> > On Mon, Sep 11, 2017 at 2:26 AM, Enrico Olivelli <eo...@gmail.com>
> > wrote:
> >
> > > Ping
> > >
> > > 2017-09-07 9:32 GMT+02:00 Enrico Olivelli <eo...@gmail.com>:
> > >
> > > > Hi all,
> > > >
> > > >
> > > > You can find the revised proposal here
> > > > https://cwiki.apache.org/confluence/display/BOOKKEEPER/
> > > > BP-14+Relax+durability
> > > >
> > > > The link to the document open for comments is this:
> > > > https://docs.google.com/document/d/1yNi9t2_
> > > deOOMXDaGzrnmaHTQeB3B3Fnym82DU
> > > > ERH7LM/edit?usp=sharing
> > > >
> > > > Please check it out
> > > > We are going to review this Proposal at the meeting
> > > >
> > > > -- Enrico
> > > >
> > > >
> > > > 2017-08-30 8:56 GMT+02:00 Enrico Olivelli <eo...@gmail.com>:
> > > >
> > > >> Thank you Sijie for summarizing and thanks to the community for
> > helping
> > > >> in this important enhancement to BookKeeper
> > > >>
> > > >> I am convinced that as JV pointed out we need to declare at ledger
> > > >> creation time that the ledger is going to perform no-sync writes.
> > > >>
> > > >> I think we need an explicit declaration currently to make things
> > "clear"
> > > > >> to the developer who is using the LedgerHandle API, even at ledger
> > > > >> creation time.
> > > >>
> > > >> The case is that we are going to forbid "striping" ledgers (ensemble
> > > size
> > > >> > quorum size) for no-sync writes in the first implementation:
> > > >> - one option is to  fail at the first no-sync addEntry, but this
> will
> > be
> > > >> really uncomfortable because usually the ack/write/ensemble sizes
> are
> > > >> configured by the admin, and there will be configurations in which
> > > errors
> > > >> will come out only after starting the system.
> > > >> - the second option is to make the developer explicitly enable
> no-sync
> > > >> writes at creation time and fail the creation of the ledger if the
> > > > >> requested combination of options is not possible
> > > >>
> > > >> I am not sure that the changes to the bookie internals are a
> > Client-API
> > > >> matter, maybe we can leverage custom metadata (as JV said) in order
> to
> > > make
> > > >> the bookie handle ledgers in a different manner, this way will be
> > always
> > > >> open as custom metadata are already here.
> > > >>
> > > >> JV preferred the ledger-type approach, the dual solution is to
> > introduce
> > > >> a list of "capabilities" or "ledger options".
> > > > >> I think that this ability to perform no-sync writes is so important
> > that
> > > >> "custom metadata" is not the good place to declare it, same for
> > "ledger
> > > >> type"
> > > >>
> > > > >> So I am proposing to add a boolean 'allowNoSyncWrites' at ledger
> > > creation
> > > >> time, without writing in to ledger metadata on ZK,
> > > >> I think that if further improvements will need ledger metadata
> changes
> > > we
> > > >> will do.
> > > >>
> > > >> I have updated the BP-14 document, I have added an "Open issues"
> > footer
> > > >> with the open points,
> > > >> please add comments and I will correct the document as soon as
> > possible.
> > > >>
> > > >>
> > > >> Enrico
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> 2017-08-30 1:24 GMT+02:00 Sijie Guo <gu...@gmail.com>:
> > > >>
> > > >>> Thank you, Enrico, JV.
> > > >>>
> > > >>> These are great discussions.
> > > >>>
> > > >>> After reading these two proposals, I have a few very high-level
> > > comments,
> > > >>> dividing into three categories.
> > > >>>
> > > >>>
> > > >>> *API*
> > > >>>
> > > > >>> - I think there are no fundamental differences between these two
> > > >>> proposals.
> > > >>> They are trying to achieve similar goals by exposing durability
> > levels
> > > in
> > > >>> different way.
> > > >>> So this will be a discussion on what API/interface should look like
> > > from
> > > >>> user / admin perspective.
> > > >>> I would suggest focusing what would be the API itself, putting the
> > > >>> implementation design aside when talking about this.
> > > >>>
> > > >>> *Core*
> > > >>>
> > > >>> - Both proposals need to deal with a core function - what happen to
> > LAC
> > > >>> and
> > > >>> what semantic that bookkeeper provides.
> > > >>> JV did a good summary in his proposal. However I am not a fan of
> > > >>> maintaining two different semantics. So I am looking for
> > > >>> a solution that bookkeeper can only maintain one semantic. The
> > semantic
> > > >>> is
> > > >>> basically:
> > > >>>
> > > >>> 1) LAC only advanced when entries before LAC are committed to the
> > > >>> persistent storage
> > > >>> 2) All the entries until LAC are successfully committed to the
> > > >>> persistence
> > > >>> storage
> > > >>> 3) Entries until LAC: all the entries must be readable all the
> time.
> > > >>>
> > > >>> If we maintain such semantic, there is no need to change the auto
> > > >>> recovery
> > > >>> protocol in bookkeeper. All what we guarantee are the entries
> durably
> > > >>> persistent.
> > > >>>
> > > >>> In order to maintain such semantic, I think both me and JV proposed
> > > >>> similar
> > > >>> solution in either proposal. I am trying to finalize one here:
> > > >>>
> > > >>> * bookie maintains a LAS (Last Add Synced) point for each entry.
> > > >>> * LAS can be piggybacked on AddResponses
> > > >>> * Client uses the LAS to advance LAC.
> > > >>>
> > > >>> If we can agree on the core semantic we are going to provide, the
> > other
> > > >>> things are just logistics.
> > > >>>
> > > >>> *Others*
> > > >>>
> > > >>> - Regarding separating journal or bypassing journal, there is no
> > > >>> difference
> > > >>> when we talking from the core semantic. They are all non-durably
> > writes
> > > >>> (acknowledging before fsyncing).
> > > >>> We can start with same journal approach (but just acknowledge
> before
> > > >>> fsyncing), implement the core and add other options later on.
> > > >>>
> > > >>>
> > > > >>> From my point of view, I'd be more interested in providing a
> single
> > > >>> consistent durable semantic that application can rely on for both
> > > durable
> > > >>> writes and non-durable writes. The other stuffs seem to be more
> > > logistics
> > > >>> things.
> > > >>>
> > > >>>
> > > >>> - Sijie
> > > >>>
> > > >>>
> > > >>> On Mon, Aug 28, 2017 at 11:27 PM, Enrico Olivelli <
> > eolivelli@gmail.com
> > > >
> > > >>> wrote:
> > > >>>
> > > >>> > 2017-08-29 8:01 GMT+02:00 Venkateswara Rao Jujjuri <
> > > jujjuri@gmail.com
> > > >>> >:
> > > >>> >
> > > >>> > > I don't believe I fully followed your second case. But even in
> > this
> > > >>> case,
> > > >>> > > your major concern is about the additional 'sync' RPC?
> > > >>> > >
> > > >>> >
> > > >>> > yes apart from that I am fine with your proposal too, that is to
> > > have a
> > > >>> > LedgerType which drives durability
> > > >>> > and I think we need to add per-entry durability options
> > > >>> >
> > > >>> > I think that at least for the 'simple' no-sync addEntry we do not
> > > need
> > > >>> to
> > > >>> > change many things, I am drafting a prototype, I will share it as
> > > soon
> > > >>> as
> > > >>> > we all agree on the roadmap
> > > >>> >
> > > >>> > The first implementation can cover the first cases (no-sync
> > addEntry)
> > > >>> and
> > > >>> > change the way the writer advances the LAC in order to support
> > > 'relaxed
> > > >>> > durability writes'.
> > > >>> > This change will be compatible with future improvements and it
> will
> > > >>> open
> > > >>> > the door for big changes on the bookie side like bypassing the
> > > journal
> > > >>> or
> > > >>> > leveraging multiple journals.....
> > > >>> >
> > > >>> > -- Enrico
> > > >>> >
> > > >>> > or something else that the LedgerType proposal won't work?
> > > >>> > >
> > > >>> >
> > > >>> > >
> > > >>> > >
> > > >>> > > On Mon, Aug 28, 2017 at 7:35 AM, Enrico Olivelli <
> > > >>> eolivelli@gmail.com>
> > > >>> > > wrote:
> > > >>> > >
> > > >>> > > > I think that having a set of options on the ledger metadata
> > will
> > > >>> be a
> > > >>> > > good
> > > >>> > > > enhancement and I am sure we will do it as soon as it will be
> > > >>> needed,
> > > >>> > > maybe
> > > >>> > > > we do not need it now.
> > > >>> > > >
> > > >>> > > > Actually I think we will need to declare this
> durability-level
> > at
> > > >>> entry
> > > > >>> > > > level to support some use cases in BP-14 document, let me
> > > explain
> > > >>> two
> > > >>> > of
> > > >>> > > > my usecases for which I need it:
> > > >>> > > >
> > > > >>> > > > At a higher level we have two choices:
> > > >>> > > >
> > > >>> > > > A) per-ledger durability options (JV proposal)
> > > >>> > > > all addEntry operations are durable or non-durable and there
> is
> > > an
> > > >>> > > explicit
> > > >>> > > > 'sync' API (+ forced sync at close)
> > > >>> > > >
> > > >>> > > > B) per-entry durability options (original BP-14 proposal)
> > > >>> > > > every addEntry has an own durable/non-durable option
> > > >>> (sync/no-sync),
> > > >>> > with
> > > >>> > > > the ability to call 'sync' without addEntry (+ forced sync at
> > > >>> close)
> > > >>> > > >
> > > >>> > > > I am speaking about the database WAL case: I am using the
> > > >>> > > > ledger as a segment for the WAL of a database and I am writing
> > > >>> > > > all data changes in the scope of a 'transaction' with the
> > > >>> > > > relaxed-durability flag. Then I am writing the 'transaction
> > > >>> > > > committed' entry with the "strict durability" requirement;
> > > >>> > > > this will in fact require that all previous entries are
> > > >>> > > > persisted durably, so that the transaction will never be lost.
> > > >>> > > >
> > > >>> > > > In this scenario we would in fact need an addEntry + sync API:
> > > >>> > > >
> > > >>> > > > using option  A) the WAL will look like:
> > > >>> > > > - open ledger no-sync = true
> > > >>> > > > - addEntry (set foo=bar)  (this will be no-sync)
> > > >>> > > > - addEntry (set foo=bar2) (this will be no-sync)
> > > >>> > > > - addEntry (commit)
> > > >>> > > > - sync
> > > >>> > > >
> > > >>> > > > using option B) the WAL will look like
> > > >>> > > > - open ledger
> > > >>> > > > - addEntry (set foo=bar), no-sync
> > > >>> > > > - addEntry (set foo=bar2), no-sync
> > > >>> > > > - addEntry (commit), sync
> > > >>> > > >
> > > >>> > > > In case B) we are "saving" one RPC call to every bookie (the
> > > >>> > > > 'sync' one). The same holds for single data change entries,
> > > >>> > > > like updating a single record on the database: with BK 4.5
> > > >>> > > > this "costs" only a single RPC to every bookie.
> > > >>> > > >
> > > >>> > > > Second case:
> > > >>> > > > I am using BookKeeper to store binary objects, so I am packing
> > > >>> > > > multiple 'objects' (named sequences of bytes) into a single
> > > >>> > > > ledger, like you do when you write many records to a file in a
> > > >>> > > > streaming fashion and keep track of the offsets of the
> > > >>> > > > beginning of every record (LedgerHandleAdv is perfect for this
> > > >>> > > > case).
> > > >>> > > > I am not using a single ledger per 'file' because creating
> > > >>> > > > many ledgers very fast kills ZooKeeper. In my systems I have
> > > >>> > > > big bursts of writes, which need to be really "fast", so I am
> > > >>> > > > writing multiple 'files' to every single ledger. So the
> > > >>> > > > close-to-open consistency at ledger level is not suitable for
> > > >>> > > > this case.
> > > >>> > > > I have to write as fast as possible to this 'ledger-backed'
> > > >>> > > > stream, and as with a 'traditional' filesystem I am writing
> > > >>> > > > parts of each file and then requiring 'sync' at the end of
> > > >>> > > > each file.
> > > >>> > > > Using BookKeeper you need to split big 'files' into "little"
> > > >>> > > > parts; you cannot transmit the contents as a "real" stream on
> > > >>> > > > the network.
> > > >>> > > >
> > > >>> > > > I am not talking about bookie-level implementation details. I
> > > >>> > > > would like to define the high-level API in order to support
> > > >>> > > > all the relevant known use cases and keep space for the
> > > >>> > > > future. At this moment adding a per-entry 'durability option'
> > > >>> > > > seems to be very flexible and simple to implement, and it does
> > > >>> > > > not prevent us from doing further improvements, namely
> > > >>> > > > skipping the journal.
> > > >>> > > >
> > > >>> > > > Enrico
> > > >>> > > >
> > > >>> > > >
> > > >>> > > >
> > > >>> > > > 2017-08-26 19:55 GMT+02:00 Enrico Olivelli
> > > >>> > > > <eolivelli@gmail.com>:
> > > >>> > > >
> > > >>> > > > >
> > > >>> > > > >
> > > >>> > > > > On Sat, Aug 26, 2017, 19:19 Venkateswara Rao Jujjuri
> > > >>> > > > > <jujjuri@gmail.com> wrote:
> > > >>> > > > >
> > > >>> > > > >> Hi all,
> > > >>> > > > >>
> > > >>> > > > >> As promised during Thursday call, here is my proposal.
> > > >>> > > > >>
> > > >>> > > > >> *NOTE*: The major difference in this proposal compared to Enrico’s
> > > >>> > > > >> <https://docs.google.com/document/d/1JLYO3K3tZ5PJGmyS0YK_-NW8VOUUgUWVBmswCUOG158/edit#heading=h.q2rewiqndr5v>
> > > >>> > > > >> is making durability a property of the ledger (type) as opposed
> > > >>> > > > >> to addEntry(). The rest of the technical details have a lot of
> > > >>> > > > >> similarities.
> > > >>> > > > >>
> > > >>> > > > >
> > > >>> > > > > Thank you JV. I have just read the doc quickly and your view
> > > >>> > > > > is certainly broader.
> > > >>> > > > > I will dig into the doc as soon as possible on Monday.
> > > >>> > > > > For me it is OK to have a ledger-wide configuration. I think
> > > >>> > > > > that the most important decision is about the API we will
> > > >>> > > > > provide, as in the future it will be difficult to change it.
> > > >>> > > > >
> > > >>> > > > >
> > > >>> > > > > Cheers
> > > >>> > > > > Enrico
> > > >>> > > > >
> > > >>> > > > >
> > > >>> > > > >
> > > >>> > > > >> https://docs.google.com/document/d/1g1eBcVVCZrTG8YZliZP0LVqvWpq432ODEghrGVQ4d4Q/edit?usp=sharing
> > > >>> > > > >>
> > > >>> > > > >> On Thu, Aug 24, 2017 at 1:14 AM, Enrico Olivelli
> > > >>> > > > >> <eolivelli@gmail.com> wrote:
> > > >>> > > > >>
> > > >>> > > > >> > Thank you all for the comments and for taking a look at
> > > >>> > > > >> > the document so soon.
> > > >>> > > > >> > I have updated the doc; we will discuss the document at
> > > >>> > > > >> > the meeting.
> > > >>> > > > >> >
> > > >>> > > > >> >
> > > >>> > > > >> > Enrico
> > > >>> > > > >> >
> > > >>> > > > >> > 2017-08-24 2:27 GMT+02:00 Sijie Guo <guosijie@gmail.com>:
> > > >>> > > > >> >
> > > >>> > > > >> > > Enrico,
> > > >>> > > > >> > >
> > > >>> > > > >> > > Thank you so much! It is a great effort to put this up.
> > > >>> > > > >> > > Overall it looks good. I made some comments; we can
> > > >>> > > > >> > > discuss them at tomorrow's community meeting.
> > > >>> > > > >> > >
> > > >>> > > > >> > > - Sijie
> > > >>> > > > >> > >
> > > >>> > > > >> > > On Wed, Aug 23, 2017 at 8:25 AM, Enrico Olivelli
> > > >>> > > > >> > > <eolivelli@gmail.com> wrote:
> > > >>> > > > >> > >
> > > >>> > > > >> > > > Hi all,
> > > >>> > > > >> > > > I have drafted a first proposal for BP-14 - Relax
> > > >>> > > > >> > > > Durability
> > > >>> > > > >> > > >
> > > >>> > > > >> > > > We are talking about limiting the number of fsyncs to
> > > >>> > > > >> > > > the journal while preserving the correctness of the
> > > >>> > > > >> > > > LAC protocol.
> > > >>> > > > >> > > >
> > > >>> > > > >> > > > This is the link to the wiki page, but as the issue
> > > >>> > > > >> > > > is huge we prefer to use Google Documents for sharing
> > > >>> > > > >> > > > comments
> > > >>> > > > >> > > > https://cwiki.apache.org/confluence/display/BOOKKEEPER/BP+-+14+Relax+durability
> > > >>> > > > >> > > >
> > > >>> > > > >> > > > This is the document
> > > >>> > > > >> > > > https://docs.google.com/document/d/1JLYO3K3tZ5PJGmyS0YK_-NW8VOUUgUWVBmswCUOG158/edit?usp=sharing
> > > >>> > > > >> > > >
> > > >>> > > > >> > > > All comments are welcome
> > > >>> > > > >> > > >
> > > >>> > > > >> > > > I have added the DL dev list in cc as the discussion
> > > >>> > > > >> > > > is interesting for both groups
> > > >>> > > > >> > > >
> > > >>> > > > >> > > > Enrico Olivelli
> > > >>> > > > >> > > >
> > > >>> > > > >> > >
> > > >>> > > > >> >
> > > >>> > > > >>
> > > >>> > > > >>
> > > >>> > > > >>
> > > >>> > > > >> --
> > > >>> > > > >> Jvrao
> > > >>> > > > >> ---
> > > >>> > > > >> First they ignore you, then they laugh at you, then they
> > > >>> > > > >> fight you, then you win. - Mahatma Gandhi
> > > >>> > > > >>
> > > >>> > > > > --
> > > >>> > > > >
> > > >>> > > > >
> > > >>> > > > > -- Enrico Olivelli
> > > >>> > > > >
> > > >>> > > >
> > > >>> > >
> > > >>> > >
> > > >>> > >
> > > >>> > > --
> > > >>> > > Jvrao
> > > >>> > > ---
> > > >>> > > First they ignore you, then they laugh at you, then they fight
> > > >>> > > you, then you win. - Mahatma Gandhi
> > > >>> > >
> > > >>> >
> > > >>>
> > > >>
> > > >>
> > > >
> > >
> >
>
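
The two WAL sequences quoted above (option A with a ledger-wide no-sync
setting plus an explicit 'sync', option B with a per-entry flag) differ only
in the number of requests sent to each bookie. A minimal sketch of that
comparison follows. It is a toy model, not the BookKeeper client API: the
class ToyLedger and the functions wal_option_a / wal_option_b are
hypothetical names used only for illustration, and each recorded tuple stands
for one RPC to every bookie.

```python
from dataclasses import dataclass, field


@dataclass
class ToyLedger:
    """Hypothetical stand-in for a ledger writer; records the RPCs it would send."""

    rpcs: list = field(default_factory=list)

    def add_entry(self, payload, sync):
        # One addEntry RPC per bookie; the durability flag rides on the same call.
        self.rpcs.append(("addEntry", payload, sync))

    def sync(self):
        # A separate RPC per bookie whose only job is to force durability.
        self.rpcs.append(("sync", None, True))


def wal_option_a(changes):
    """Option A: every add is no-sync, followed by an explicit sync."""
    ledger = ToyLedger()
    for change in changes:
        ledger.add_entry(change, sync=False)
    ledger.add_entry("commit", sync=False)
    ledger.sync()
    return ledger.rpcs


def wal_option_b(changes):
    """Option B: per-entry flag, only the commit entry asks for sync."""
    ledger = ToyLedger()
    for change in changes:
        ledger.add_entry(change, sync=False)
    ledger.add_entry("commit", sync=True)
    return ledger.rpcs


if __name__ == "__main__":
    changes = ["set foo=bar", "set foo=bar2"]
    print("option A RPCs:", len(wal_option_a(changes)))
    print("option B RPCs:", len(wal_option_b(changes)))
```

With the two data changes from the example, option A records one request more
than option B, which is the extra 'sync' RPC discussed in the thread.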

Re: [DISCUSS] BP-14 Relax Durability

Posted by Sijie Guo <gu...@gmail.com>.
Cool.

I would expect this is a big change. It would be good if you can divide it
into smaller tasks, so people can review them easier.

- Sijie

On Tue, Sep 12, 2017 at 1:05 AM, Enrico Olivelli <eo...@gmail.com>
wrote:

> Thank you all !
>
> I will copy the content of the Final draft to the Wiki and mark the
> document as "Accepted"
>
> I will send a PR soon, but it will depend on BP-15 New CreateLedger API
>
> I hope we can make it for 4.6
>
>
> Enrico
>
>
> 2017-09-11 18:58 GMT+02:00 Sijie Guo <gu...@gmail.com>:
>
> > Enrico,
> >
> > Feel free to close the thread and mark this BP as accepted, if there is
> > no -1.
> >
> > - Sijie
> >
> > On Mon, Sep 11, 2017 at 2:26 AM, Enrico Olivelli <eo...@gmail.com>
> > wrote:
> >
> > > Ping
> > >
> > > 2017-09-07 9:32 GMT+02:00 Enrico Olivelli <eo...@gmail.com>:
> > >
> > > > Hi all,
> > > >
> > > >
> > > > You can find the revised proposal here
> > > > > https://cwiki.apache.org/confluence/display/BOOKKEEPER/BP-14+Relax+durability
> > > > >
> > > > > The link to the document open for comments is this:
> > > > > https://docs.google.com/document/d/1yNi9t2_deOOMXDaGzrnmaHTQeB3B3Fnym82DUERH7LM/edit?usp=sharing
> > > > >
> > > > > Please check it out.
> > > > > We are going to review this proposal at the meeting.
> > > >
> > > > -- Enrico
> > > >
> > > >
> > > > 2017-08-30 8:56 GMT+02:00 Enrico Olivelli <eo...@gmail.com>:
> > > >
> > > > >> Thank you Sijie for summarizing, and thanks to the community for
> > > > >> helping with this important enhancement to BookKeeper.
> > > > >>
> > > > >> I am convinced that, as JV pointed out, we need to declare at ledger
> > > > >> creation time that the ledger is going to perform no-sync writes.
> > > > >>
> > > > >> I think we currently need an explicit declaration to make things
> > > > >> "clear" to the developer who is using the LedgerHandle API, even at
> > > > >> ledger creation time.
> > > > >>
> > > > >> The case is that we are going to forbid "striping" ledgers (ensemble
> > > > >> size > quorum size) for no-sync writes in the first implementation:
> > > > >> - one option is to fail at the first no-sync addEntry, but this will
> > > > >> be really uncomfortable because usually the ack/write/ensemble sizes
> > > > >> are configured by the admin, and there will be configurations in
> > > > >> which errors will come out only after starting the system.
> > > > >> - the second option is to make the developer explicitly enable
> > > > >> no-sync writes at creation time and fail the creation of the ledger
> > > > >> if the requested combination of options is not possible
> > > >>
> > > > >> I am not sure that the changes to the bookie internals are a
> > > > >> Client-API matter. Maybe we can leverage custom metadata (as JV
> > > > >> said) in order to make the bookie handle ledgers in a different
> > > > >> manner; this way will always be open, as custom metadata are already
> > > > >> here.
> > > > >>
> > > > >> JV preferred the ledger-type approach; the dual solution is to
> > > > >> introduce a list of "capabilities" or "ledger options".
> > > > >> I think that this ability to perform no-sync writes is so important
> > > > >> that "custom metadata" is not a good place to declare it; the same
> > > > >> goes for "ledger type".
> > > > >>
> > > > >> So I am proposing to add a boolean 'allowNoSyncWrites' at ledger
> > > > >> creation time, without writing it to the ledger metadata on ZK.
> > > > >> I think that if further improvements need ledger metadata changes,
> > > > >> we will make them.
> > > > >>
> > > > >> I have updated the BP-14 document and added an "Open issues" footer
> > > > >> with the open points.
> > > > >> Please add comments and I will correct the document as soon as
> > > > >> possible.
> > > >>
> > > >>
> > > >> Enrico
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> 2017-08-30 1:24 GMT+02:00 Sijie Guo <gu...@gmail.com>:
> > > >>
> > > >>> Thank you, Enrico, JV.
> > > >>>
> > > >>> These are great discussions.
> > > >>>
> > > > >>> After reading these two proposals, I have a few very high-level
> > > > >>> comments, divided into three categories.
> > > > >>>
> > > > >>>
> > > > >>> *API*
> > > > >>>
> > > > >>> - I think there are no fundamental differences between these two
> > > > >>> proposals.
> > > > >>> They are trying to achieve similar goals by exposing durability
> > > > >>> levels in different ways.
> > > > >>> So this will be a discussion on what the API/interface should look
> > > > >>> like from the user / admin perspective.
> > > > >>> I would suggest focusing on what the API itself would be, putting
> > > > >>> the implementation design aside when talking about this.
> > > >>>
> > > >>> *Core*
> > > >>>
> > > > >>> - Both proposals need to deal with a core function: what happens to
> > > > >>> the LAC and what semantic BookKeeper provides.
> > > > >>> JV did a good summary in his proposal. However, I am not a fan of
> > > > >>> maintaining two different semantics, so I am looking for a solution
> > > > >>> where BookKeeper maintains only one semantic. The semantic is
> > > > >>> basically:
> > > > >>>
> > > > >>> 1) LAC is only advanced when entries before LAC are committed to
> > > > >>> persistent storage
> > > > >>> 2) All the entries until LAC are successfully committed to
> > > > >>> persistent storage
> > > > >>> 3) Entries until LAC: all the entries must be readable all the
> > > > >>> time.
> > > > >>>
> > > > >>> If we maintain such a semantic, there is no need to change the auto
> > > > >>> recovery protocol in BookKeeper. All we guarantee is that the
> > > > >>> entries are durably persisted.
> > > >>>
> > > > >>> In order to maintain such a semantic, I think both JV and I
> > > > >>> proposed a similar solution in our proposals. I am trying to
> > > > >>> finalize one here:
> > > > >>>
> > > > >>> * bookie maintains a LAS (Last Add Synced) point for each entry.
> > > > >>> * LAS can be piggybacked on AddResponses
> > > > >>> * Client uses the LAS to advance LAC.
> > > > >>>
> > > > >>> If we can agree on the core semantic we are going to provide, the
> > > > >>> other things are just logistics.
> > > >>>
> > > >>> *Others*
> > > >>>
> > > > >>> - Regarding separating the journal or bypassing the journal, there
> > > > >>> is no difference when we talk about the core semantic. They are all
> > > > >>> non-durable writes (acknowledging before fsyncing).
> > > > >>> We can start with the same-journal approach (but just acknowledge
> > > > >>> before fsyncing), implement the core and add other options later
> > > > >>> on.
> > > > >>>
> > > > >>>
> > > > >>> From my point of view, I'd be more interested in providing a single
> > > > >>> consistent durability semantic that applications can rely on for
> > > > >>> both durable writes and non-durable writes. The other stuff seems
> > > > >>> to be more about logistics.
> > > >>>
> > > >>>
> > > >>> - Sijie
> > > >>>
> > > >>>
> > > >>> On Mon, Aug 28, 2017 at 11:27 PM, Enrico Olivelli <
> > eolivelli@gmail.com
> > > >
> > > >>> wrote:
> > > >>>
> > > >>> > 2017-08-29 8:01 GMT+02:00 Venkateswara Rao Jujjuri <
> > > jujjuri@gmail.com
> > > >>> >:
> > > >>> >
> > > >>> > > I don't believe I fully followed your second case. But even in
> > this
> > > >>> case,
> > > >>> > > your major concern is about the additional 'sync' RPC?
> > > >>> > >
> > > >>> >
> > > >>> > yes apart from that I am fine with your proposal too, that is to
> > > have a
> > > >>> > LedgerType which drives durability
> > > >>> > and I think we need to add per-entry durability options
> > > >>> >
> > > >>> > I think that at least for the 'simple' no-sync addEntry we do not
> > > need
> > > >>> to
> > > >>> > change many things, I am drafting a prototype, I will share it as
> > > soon
> > > >>> as
> > > >>> > we all agree on the roadmap
> > > >>> >
> > > >>> > The first implementation can cover the first cases (no-sync
> > addEntry)
> > > >>> and
> > > >>> > change the way the writer advances the LAC in order to support
> > > 'relaxed
> > > >>> > durability writes'.
> > > >>> > This change will be compatible with future improvements and it
> will
> > > >>> open
> > > >>> > the door for big changes on the bookie side like bypassing the
> > > journal
> > > >>> or
> > > >>> > leveraging multiple journals.....
> > > >>> >
> > > >>> > -- Enrico
> > > >>> >
> > > >>> > or something else that the LedgerType proposal won't work?
> > > >>> > >
> > > >>> >
> > > >>> > >
> > > >>> > >
> > > >>> > > On Mon, Aug 28, 2017 at 7:35 AM, Enrico Olivelli <
> > > >>> eolivelli@gmail.com>
> > > >>> > > wrote:
> > > >>> > >
> > > >>> > > > I think that having a set of options on the ledger metadata
> > will
> > > >>> be a
> > > >>> > > good
> > > >>> > > > enhancement and I am sure we will do it as soon as it will be
> > > >>> needed,
> > > >>> > > maybe
> > > >>> > > > we do not need it now.
> > > >>> > > >
> > > >>> > > > Actually I think we will need to declare this
> durability-level
> > at
> > > >>> entry
> > > >>> > > > level to support some uses cases in BP-14 document, let me
> > > explain
> > > >>> two
> > > >>> > of
> > > >>> > > > my usecases for which I need it:
> > > >>> > > >
> > > >>> > > > At higher level we have to choices:
> > > >>> > > >
> > > >>> > > > A) per-ledger durability options (JV proposal)
> > > >>> > > > all addEntry operations are durable or non-durable and there
> is
> > > an
> > > >>> > > explicit
> > > >>> > > > 'sync' API (+ forced sync at close)
> > > >>> > > >
> > > >>> > > > B) per-entry durability options (original BP-14 proposal)
> > > >>> > > > every addEntry has an own durable/non-durable option
> > > >>> (sync/no-sync),
> > > >>> > with
> > > >>> > > > the ability to call 'sync' without addEntry (+ forced sync at
> > > >>> close)
> > > >>> > > >
> > > >>> > > > I am speaking about the the database WAL case, I am using the
> > > >>> ledger as
> > > >>> > > > segment for the WAL of a database and I am writing all data
> > > >>> changes in
> > > >>> > > the
> > > >>> > > > scope of a 'transaction' with the relaxed-durability flag,
> > then I
> > > >>> am
> > > >>> > > > writing the 'transaction committed' entry with "strict
> > > durability"
> > > >>> > > > requirement, this will in fact require that all previous
> > entries
> > > >>> are
> > > >>> > > > persisted durably and so that the transaction will never be
> > lost.
> > > >>> > > >
> > > >>> > > > In this scenario we would need an addEntry + sync API in
> fact:
> > > >>> > > >
> > > >>> > > > using option  A) the WAL will look like:
> > > >>> > > > - open ledger no-sync = true
> > > >>> > > > - addEntry (set foo=bar)  (this will be no-sync)
> > > >>> > > > - addEntry (set foo=bar2) (this will be no-sync)
> > > >>> > > > - addEntry (commit)
> > > >>> > > > - sync
> > > >>> > > >
> > > >>> > > > using option B) the WAL will look like
> > > >>> > > > - open ledger
> > > >>> > > > - addEntry (set foo=bar), no-sync
> > > >>> > > > - addEntry (set foo=bar2), no-sync
> > > >>> > > > - addEntry (commit), sync
> > > >>> > > >
> > > >>> > > > in case B) we are "saving" one RPC call to every bookie (the
> > > 'sync'
> > > >>> > one)
> > > >>> > > > same for single data change entries, like updating a single
> > > record
> > > >>> on
> > > >>> > the
> > > >>> > > > database, this with BK 4.5 "costs" only a single RPC to every
> > > >>> bookie
> > > >>> > > >
> > > >>> > > > Second case:
> > > >>> > > > I am using BookKeeper to store binary objects, so I am
> packing
> > > more
> > > >>> > > > 'objects' (named sequences of bytes) into a single ledger,
> like
> > > >>> you do
> > > >>> > > when
> > > >>> > > > you write many records to a file in a streaming fashion and
> > keep
> > > >>> track
> > > >>> > of
> > > >>> > > > offsets of the beginning of every record (LedgerHandeAdv is
> > > >>> perfect for
> > > >>> > > > this case).
> > > >>> > > > I am not using a single ledger per 'file' because it kills
> > > >>> zookeeper to
> > > >>> > > > create many ledgers very fast, in my systems I have big busts
> > of
> > > >>> > writes,
> > > >>> > > > which need to be really "fast", so I am writing multiple
> > 'files'
> > > to
> > > >>> > every
> > > >>> > > > single ledger. So the close-to-open consistency at ledger
> level
> > > is
> > > >>> not
> > > >>> > > > suitable for this case.
> > > >>> > > > I have to write as fast as possible to this 'ledger-backed'
> > > >>> stream, and
> > > >>> > > as
> > > >>> > > > with a 'traditional'  filesystem I am writing parts of each
> > file
> > > >>> and
> > > >>> > than
> > > >>> > > > requiring 'sync' at the end of each file.
> > > >>> > > > Using BookKeeper you need to split big 'files' into "little"
> > > >>> parts, you
> > > >>> > > > cannot transmit the contents as to "real" stream on network.
> > > >>> > > >
> > > >>> > > > I am not talking about bookie level implementation details I
> > > would
> > > >>> like
> > > >>> > > to
> > > >>> > > > define the high level API in order to support all the
> relevant
> > > >>> known
> > > >>> > use
> > > >>> > > > cases and keep space for the future,
> > > >>> > > > at this moment adding a per-entry 'durability option' seems
> to
> > be
> > > >>> very
> > > >>> > > > flexible and simple to implement, it does not prevent us from
> > > doing
> > > >>> > > further
> > > >>> > > > improvements, like namely skipping the journal.
> > > >>> > > >
> > > >>> > > > Enrico
> > > >>> > > >
> > > >>> > > >
> > > >>> > > >
> > > >>> > > > 2017-08-26 19:55 GMT+02:00 Enrico Olivelli <
> > eolivelli@gmail.com
> > > >:
> > > >>> > > >
> > > >>> > > > >
> > > >>> > > > >
> > > >>> > > > > On sab 26 ago 2017, 19:19 Venkateswara Rao Jujjuri <
> > > >>> > jujjuri@gmail.com>
> > > >>> > > > > wrote:
> > > >>> > > > >
> > > >>> > > > >> Hi all,
> > > >>> > > > >>
> > > >>> > > > >> As promised during Thursday call, here is my proposal.
> > > >>> > > > >>
> > > >>> > > > >> *NOTE*: Major difference in this proposal compared to
> > Enrico’s
> > > >>> > > > >> <https://docs.google.com/document/d/
> 1JLYO3K3tZ5PJGmyS0YK_-
> > > >>> > > > >> NW8VOUUgUWVBmswCUOG158/edit#heading=h.q2rewiqndr5v>
> > > >>> > > > >> is
> > > >>> > > > >> making the durability a property of the ledger(type) as
> > > opposed
> > > >>> to
> > > >>> > > > >> addEntry(). Rest of the technical details have a lot of
> > > >>> > similarities.
> > > >>> > > > >>
> > > >>> > > > >
> > > >>> > > > > Thank you JV. I have just read quickly the doc and your
> view
> > is
> > > >>> > > centantly
> > > >>> > > > > broader.
> > > >>> > > > > I will dig into the doc as soon as possible on Monday.
> > > >>> > > > > For me it is ok to have a ledger wide configuration I think
> > > that
> > > >>> the
> > > >>> > > most
> > > >>> > > > > important decision is about the API we will provide as in
> the
> > > >>> future
> > > >>> > it
> > > >>> > > > > will be difficult to change it.
> > > >>> > > > >
> > > >>> > > > >
> > > >>> > > > > Cheers
> > > >>> > > > > Enrico
> > > >>> > > > >
> > > >>> > > > >
> > > >>> > > > >
> > > >>> > > > >> https://docs.google.com/document/d/
> > 1g1eBcVVCZrTG8YZliZP0LVqv
> > > >>> Wpq43
> > > >>> > > > >> 2ODEghrGVQ4d4Q/edit?usp=sharing
> > > >>> > > > >>
> > > >>> > > > >> On Thu, Aug 24, 2017 at 1:14 AM, Enrico Olivelli <
> > > >>> > eolivelli@gmail.com
> > > >>> > > >
> > > >>> > > > >> wrote:
> > > >>> > > > >>
> > > >>> > > > >> > Thank you all for the comments and for taking a look to
> > the
> > > >>> > document
> > > >>> > > > so
> > > >>> > > > >> > soon.
> > > >>> > > > >> > I have updated the doc, we will discuss the document at
> > the
> > > >>> > meeting,
> > > >>> > > > >> >
> > > >>> > > > >> >
> > > >>> > > > >> > Enrico
> > > >>> > > > >> >
> > > >>> > > > >> > 2017-08-24 2:27 GMT+02:00 Sijie Guo <guosijie@gmail.com
> >:
> > > >>> > > > >> >
> > > >>> > > > >> > > Enrico,
> > > >>> > > > >> > >
> > > >>> > > > >> > > Thank you so much! It is a great effort for putting
> this
> > > up.
> > > >>> > > Overall
> > > >>> > > > >> > looks
> > > >>> > > > >> > > good. I made some comments, we can discuss at
> tomorrow's
> > > >>> > community
> > > >>> > > > >> > meeting.
> > > >>> > > > >> > >
> > > >>> > > > >> > > - Sijie
> > > >>> > > > >> > >
> > > >>> > > > >> > > On Wed, Aug 23, 2017 at 8:25 AM, Enrico Olivelli <
> > > >>> > > > eolivelli@gmail.com
> > > >>> > > > >> >
> > > >>> > > > >> > > wrote:
> > > >>> > > > >> > >
> > > >>> > > > >> > > > Hi all,
> > > >>> > > > >> > > > I have drafted a first proposal for BP-14 - Relax
> > > >>> Durability
> > > >>> > > > >> > > >
> > > >>> > > > >> > > > We are talking about limiting the number of fsync to
> > the
> > > >>> > journal
> > > >>> > > > >> while
> > > >>> > > > >> > > > preserving the correctness of the LAC protocol.
> > > >>> > > > >> > > >
> > > >>> > > > >> > > > This is the link to the wiki page, but as the issue
> is
> > > >>> huge we
> > > >>> > > > >> prefer
> > > >>> > > > >> > to
> > > >>> > > > >> > > > use Google Documents for sharing comments
> > > >>> > > > >> > > > https://cwiki.apache.org/
> > confluence/display/BOOKKEEPER/
> > > >>> > > > >> > > > BP+-+14+Relax+durability
> > > >>> > > > >> > > >
> > > >>> > > > >> > > > This is the document
> > > >>> > > > >> > > > https://docs.google.com/document/d/
> > > 1JLYO3K3tZ5PJGmyS0YK_-
> > > >>> > > > >> > > > NW8VOUUgUWVBmswCUOG158/edit?usp=sharing
> > > >>> > > > >> > > >
> > > >>> > > > >> > > > All comments are welcome
> > > >>> > > > >> > > >
> > > >>> > > > >> > > > I have added DL dev list in cc as the discussion is
> > > >>> > interesting
> > > >>> > > > for
> > > >>> > > > >> > both
> > > >>> > > > >> > > > groups
> > > >>> > > > >> > > >
> > > >>> > > > >> > > > Enrico Olivelli
> > > >>> > > > >> > > >
> > > >>> > > > >> > >
> > > >>> > > > >> >
> > > >>> > > > >>
> > > >>> > > > >>
> > > >>> > > > >>
> > > >>> > > > >> --
> > > >>> > > > >> Jvrao
> > > >>> > > > >> ---
> > > >>> > > > >> First they ignore you, then they laugh at you, then they
> > fight
> > > >>> you,
> > > >>> > > then
> > > >>> > > > >> you win. - Mahatma Gandhi
> > > >>> > > > >>
> > > >>> > > > > --
> > > >>> > > > >
> > > >>> > > > >
> > > >>> > > > > -- Enrico Olivelli
> > > >>> > > > >
> > > >>> > > >
> > > >>> > >
> > > >>> > >
> > > >>> > >
> > > >>> > > --
> > > >>> > > Jvrao
> > > >>> > > ---
> > > >>> > > First they ignore you, then they laugh at you, then they fight
> > you,
> > > >>> then
> > > >>> > > you win. - Mahatma Gandhi
> > > >>> > >
> > > >>> >
> > > >>>
> > > >>
> > > >>
> > > >
> > >
> >
>
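
The LAS idea in Sijie's summary above (the bookie tracks a Last Add Synced
point, piggybacks it on add responses, and the client uses it to advance the
LAC) can be sketched as follows. This is only an illustration under stated
assumptions, not the BookKeeper implementation: the class LacTracker and its
method names are hypothetical, ledgers are assumed not to be striped (every
bookie stores every entry, as the thread proposes for the first
implementation), and the rule "LAC is the ack-quorum-th highest LAS" is one
possible reading of "client uses the LAS to advance LAC".

```python
class LacTracker:
    """Hypothetical client-side tracker that derives the LAC from per-bookie LAS."""

    def __init__(self, bookies, ack_quorum):
        if not 1 <= ack_quorum <= len(bookies):
            raise ValueError("ack_quorum must be between 1 and the number of bookies")
        self.ack_quorum = ack_quorum
        # -1 means that no entry is known to be synced on that bookie yet.
        self.las = {bookie: -1 for bookie in bookies}
        self.lac = -1

    def on_add_response(self, bookie, las):
        """Record the LAS piggybacked on an add response and return the LAC."""
        # A delayed response must never move a bookie's LAS backwards.
        self.las[bookie] = max(self.las[bookie], las)
        # The ack_quorum-th highest LAS is synced on at least ack_quorum bookies.
        candidate = sorted(self.las.values(), reverse=True)[self.ack_quorum - 1]
        # The LAC only ever advances.
        self.lac = max(self.lac, candidate)
        return self.lac


if __name__ == "__main__":
    tracker = LacTracker(["bookie1", "bookie2", "bookie3"], ack_quorum=2)
    for bookie, las in [("bookie1", 5), ("bookie2", 3), ("bookie3", 4)]:
        print(bookie, "LAS", las, "-> LAC", tracker.on_add_response(bookie, las))
```

Under these assumptions the LAC never covers an entry that fewer than an ack
quorum of bookies have synced, which matches points 1) and 2) of the semantic
listed in the summary.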

Re: [DISCUSS] BP-14 Relax Durability

Posted by Enrico Olivelli <eo...@gmail.com>.
Thank you all !

I will copy the content of the Final draft to the Wiki and mark the
document as "Accepted"

I will send a PR soon, but it will depend on BP-15 New CreateLedger API

I hope we can make it for 4.6


Enrico


2017-09-11 18:58 GMT+02:00 Sijie Guo <gu...@gmail.com>:

> Enrico,
>
> Feel free to close the thread and mark this BP as accepted, if there is no
> -1.
>
> - Sijie
>
> On Mon, Sep 11, 2017 at 2:26 AM, Enrico Olivelli <eo...@gmail.com>
> wrote:
>
> > Ping
> >
> > 2017-09-07 9:32 GMT+02:00 Enrico Olivelli <eo...@gmail.com>:
> >
> > > Hi all,
> > >
> > >
> > > You can find the revised proposal here
> > > > https://cwiki.apache.org/confluence/display/BOOKKEEPER/BP-14+Relax+durability
> > > >
> > > > The link to the document open for comments is this:
> > > > https://docs.google.com/document/d/1yNi9t2_deOOMXDaGzrnmaHTQeB3B3Fnym82DUERH7LM/edit?usp=sharing
> > > >
> > > > Please check it out.
> > > > We are going to review this proposal at the meeting.
> > >
> > > -- Enrico
> > >
> > >
> > > 2017-08-30 8:56 GMT+02:00 Enrico Olivelli <eo...@gmail.com>:
> > >
> > > >> Thank you Sijie for summarizing, and thanks to the community for
> > > >> helping with this important enhancement to BookKeeper.
> > > >>
> > > >> I am convinced that, as JV pointed out, we need to declare at ledger
> > > >> creation time that the ledger is going to perform no-sync writes.
> > > >>
> > > >> I think we currently need an explicit declaration to make things
> > > >> "clear" to the developer who is using the LedgerHandle API, even at
> > > >> ledger creation time.
> > > >>
> > > >> The case is that we are going to forbid "striping" ledgers (ensemble
> > > >> size > quorum size) for no-sync writes in the first implementation:
> > > >> - one option is to fail at the first no-sync addEntry, but this will
> > > >> be really uncomfortable because usually the ack/write/ensemble sizes
> > > >> are configured by the admin, and there will be configurations in
> > > >> which errors will come out only after starting the system.
> > > >> - the second option is to make the developer explicitly enable no-sync
> > > >> writes at creation time and fail the creation of the ledger if the
> > > >> requested combination of options is not possible
> > >>
> > > >> I am not sure that the changes to the bookie internals are a
> > > >> Client-API matter. Maybe we can leverage custom metadata (as JV said)
> > > >> in order to make the bookie handle ledgers in a different manner;
> > > >> this way will always be open, as custom metadata are already here.
> > > >>
> > > >> JV preferred the ledger-type approach; the dual solution is to
> > > >> introduce a list of "capabilities" or "ledger options".
> > > >> I think that this ability to perform no-sync writes is so important
> > > >> that "custom metadata" is not a good place to declare it; the same
> > > >> goes for "ledger type".
> > > >>
> > > >> So I am proposing to add a boolean 'allowNoSyncWrites' at ledger
> > > >> creation time, without writing it to the ledger metadata on ZK.
> > > >> I think that if further improvements need ledger metadata changes,
> > > >> we will make them.
> > > >>
> > > >> I have updated the BP-14 document and added an "Open issues" footer
> > > >> with the open points.
> > > >> Please add comments and I will correct the document as soon as
> > > >> possible.
> > >>
> > >>
> > >> Enrico
> > >>
> > >>
> > >>
> > >>
> > >> 2017-08-30 1:24 GMT+02:00 Sijie Guo <gu...@gmail.com>:
> > >>
> > >>> Thank you, Enrico, JV.
> > >>>
> > >>> These are great discussions.
> > >>>
> > > >>> After reading these two proposals, I have a few very high-level
> > > >>> comments, divided into three categories.
> > > >>>
> > > >>>
> > > >>> *API*
> > > >>>
> > > >>> - I think there are no fundamental differences between these two
> > > >>> proposals.
> > > >>> They are trying to achieve similar goals by exposing durability
> > > >>> levels in different ways.
> > > >>> So this will be a discussion on what the API/interface should look
> > > >>> like from the user / admin perspective.
> > > >>> I would suggest focusing on what the API itself would be, putting the
> > > >>> implementation design aside when talking about this.
> > >>>
> > >>> *Core*
> > >>>
> > >>> - Both proposals need to deal with a core function - what happen to
> LAC
> > >>> and
> > >>> what semantic that bookkeeper provides.
> > >>> JV did a good summary in his proposal. However I am not a fan of
> > >>> maintaining two different semantics. So I am looking for
> > >>> a solution that bookkeeper can only maintain one semantic. The
> semantic
> > >>> is
> > >>> basically:
> > >>>
> > >>> 1) LAC only advanced when entries before LAC are committed to the
> > >>> persistent storage
> > >>> 2) All the entries until LAC are successfully committed to the
> > >>> persistence
> > >>> storage
> > >>> 3) Entries until LAC: all the entries must be readable all the time.
> > >>>
> > >>> If we maintain such semantic, there is no need to change the auto
> > >>> recovery
> > >>> protocol in bookkeeper. All what we guarantee are the entries durably
> > >>> persistent.
> > >>>
> > >>> In order to maintain such semantic, I think both me and JV proposed
> > >>> similar
> > >>> solution in either proposal. I am trying to finalize one here:
> > >>>
> > >>> * bookie maintains a LAS (Last Add Synced) point for each entry.
> > >>> * LAS can be piggybacked on AddResponses
> > >>> * Client uses the LAS to advance LAC.
> > >>>
> > >>> If we can agree on the core semantic we are going to provide, the
> other
> > >>> things are just logistics.
> > >>>
> > >>> *Others*
> > >>>
> > >>> - Regarding separating journal or bypassing journal, there is no
> > >>> difference
> > >>> when we talk about the core semantic. They are all non-durable
> writes
> > >>> (acknowledging before fsyncing).
> > >>> We can start with same journal approach (but just acknowledge before
> > >>> fsyncing), implement the core and add other options later on.
> > >>>
> > >>>
> > >>> From my point of view, I'd be more interested in providing a single
> > >>> consistent durable semantic that application can rely on for both
> > durable
> > >>> writes and non-durable writes. The other stuffs seem to be more
> > logistics
> > >>> things.
> > >>>
> > >>>
> > >>> - Sijie
> > >>>
> > >>>
> > >>> On Mon, Aug 28, 2017 at 11:27 PM, Enrico Olivelli <
> eolivelli@gmail.com
> > >
> > >>> wrote:
> > >>>
> > >>> > 2017-08-29 8:01 GMT+02:00 Venkateswara Rao Jujjuri <
> > jujjuri@gmail.com
> > >>> >:
> > >>> >
> > >>> > > I don't believe I fully followed your second case. But even in
> this
> > >>> case,
> > >>> > > your major concern is about the additional 'sync' RPC?
> > >>> > >
> > >>> >
> > >>> > yes apart from that I am fine with your proposal too, that is to
> > have a
> > >>> > LedgerType which drives durability
> > >>> > and I think we need to add per-entry durability options
> > >>> >
> > >>> > I think that at least for the 'simple' no-sync addEntry we do not
> > need
> > >>> to
> > >>> > change many things, I am drafting a prototype, I will share it as
> > soon
> > >>> as
> > >>> > we all agree on the roadmap
> > >>> >
> > >>> > The first implementation can cover the first cases (no-sync
> addEntry)
> > >>> and
> > >>> > change the way the writer advances the LAC in order to support
> > 'relaxed
> > >>> > durability writes'.
> > >>> > This change will be compatible with future improvements and it will
> > >>> open
> > >>> > the door for big changes on the bookie side like bypassing the
> > journal
> > >>> or
> > >>> > leveraging multiple journals.....
> > >>> >
> > >>> > -- Enrico
> > >>> >
> > >>> > or something else that the LedgerType proposal won't work?
> > >>> > >
> > >>> >
> > >>> > >
> > >>> > >
> > >>> > > On Mon, Aug 28, 2017 at 7:35 AM, Enrico Olivelli <
> > >>> eolivelli@gmail.com>
> > >>> > > wrote:
> > >>> > >
> > >>> > > > I think that having a set of options on the ledger metadata
> will
> > >>> be a
> > >>> > > good
> > >>> > > > enhancement and I am sure we will do it as soon as it will be
> > >>> needed,
> > >>> > > maybe
> > >>> > > > we do not need it now.
> > >>> > > >
> > >>> > > > Actually I think we will need to declare this durability-level
> at
> > >>> entry
> > >>> > > > level to support some use cases in the BP-14 document, let me
> > explain
> > >>> two
> > >>> > of
> > >>> > > > my usecases for which I need it:
> > >>> > > >
> > >>> > > > At a higher level we have two choices:
> > >>> > > >
> > >>> > > > A) per-ledger durability options (JV proposal)
> > >>> > > > all addEntry operations are durable or non-durable and there is
> > an
> > >>> > > explicit
> > >>> > > > 'sync' API (+ forced sync at close)
> > >>> > > >
> > >>> > > > B) per-entry durability options (original BP-14 proposal)
> > >>> > > > every addEntry has an own durable/non-durable option
> > >>> (sync/no-sync),
> > >>> > with
> > >>> > > > the ability to call 'sync' without addEntry (+ forced sync at
> > >>> close)
> > >>> > > >
> > >>> > > > I am speaking about the database WAL case, I am using the
> > >>> ledger as
> > >>> > > > segment for the WAL of a database and I am writing all data
> > >>> changes in
> > >>> > > the
> > >>> > > > scope of a 'transaction' with the relaxed-durability flag,
> then I
> > >>> am
> > >>> > > > writing the 'transaction committed' entry with "strict
> > durability"
> > >>> > > > requirement, this will in fact require that all previous
> entries
> > >>> are
> > >>> > > > persisted durably and so that the transaction will never be
> lost.
> > >>> > > >
> > >>> > > > In this scenario we would need an addEntry + sync API in fact:
> > >>> > > >
> > >>> > > > using option  A) the WAL will look like:
> > >>> > > > - open ledger no-sync = true
> > >>> > > > - addEntry (set foo=bar)  (this will be no-sync)
> > >>> > > > - addEntry (set foo=bar2) (this will be no-sync)
> > >>> > > > - addEntry (commit)
> > >>> > > > - sync
> > >>> > > >
> > >>> > > > using option B) the WAL will look like
> > >>> > > > - open ledger
> > >>> > > > - addEntry (set foo=bar), no-sync
> > >>> > > > - addEntry (set foo=bar2), no-sync
> > >>> > > > - addEntry (commit), sync
> > >>> > > >
> > >>> > > > in case B) we are "saving" one RPC call to every bookie (the
> > 'sync'
> > >>> > one)
> > >>> > > > same for single data change entries, like updating a single
> > record
> > >>> on
> > >>> > the
> > >>> > > > database, this with BK 4.5 "costs" only a single RPC to every
> > >>> bookie
> > >>> > > >
> > >>> > > > Second case:
> > >>> > > > I am using BookKeeper to store binary objects, so I am packing
> > more
> > >>> > > > 'objects' (named sequences of bytes) into a single ledger, like
> > >>> you do
> > >>> > > when
> > >>> > > > you write many records to a file in a streaming fashion and
> keep
> > >>> track
> > >>> > of
> > >>> > > > offsets of the beginning of every record (LedgerHandleAdv is
> > >>> perfect for
> > >>> > > > this case).
> > >>> > > > I am not using a single ledger per 'file' because it kills
> > >>> zookeeper to
> > >>> > > > create many ledgers very fast, in my systems I have big bursts
> of
> > >>> > writes,
> > >>> > > > which need to be really "fast", so I am writing multiple
> 'files'
> > to
> > >>> > every
> > >>> > > > single ledger. So the close-to-open consistency at ledger level
> > is
> > >>> not
> > >>> > > > suitable for this case.
> > >>> > > > I have to write as fast as possible to this 'ledger-backed'
> > >>> stream, and
> > >>> > > as
> > >>> > > > with a 'traditional'  filesystem I am writing parts of each
> file
> > >>> and
> > >>> > then
> > >>> > > > requiring 'sync' at the end of each file.
> > >>> > > > Using BookKeeper you need to split big 'files' into "little"
> > >>> parts, you
> > >>> > > > cannot transmit the contents as to "real" stream on network.
> > >>> > > >
> > >>> > > > I am not talking about bookie level implementation details I
> > would
> > >>> like
> > >>> > > to
> > >>> > > > define the high level API in order to support all the relevant
> > >>> known
> > >>> > use
> > >>> > > > cases and keep space for the future,
> > >>> > > > at this moment adding a per-entry 'durability option' seems to
> be
> > >>> very
> > >>> > > > flexible and simple to implement, it does not prevent us from
> > doing
> > >>> > > further
> > >>> > > > improvements, like namely skipping the journal.
> > >>> > > >
> > >>> > > > Enrico
> > >>> > > >
> > >>> > > >
> > >>> > > >
> > >>> > > > 2017-08-26 19:55 GMT+02:00 Enrico Olivelli <
> eolivelli@gmail.com
> > >:
> > >>> > > >
> > >>> > > > >
> > >>> > > > >
> > >>> > > > > On sab 26 ago 2017, 19:19 Venkateswara Rao Jujjuri <
> > >>> > jujjuri@gmail.com>
> > >>> > > > > wrote:
> > >>> > > > >
> > >>> > > > >> Hi all,
> > >>> > > > >>
> > >>> > > > >> As promised during Thursday call, here is my proposal.
> > >>> > > > >>
> > >>> > > > >> *NOTE*: Major difference in this proposal compared to
> Enrico’s
> > >>> > > > >> <https://docs.google.com/document/d/1JLYO3K3tZ5PJGmyS0YK_-
> > >>> > > > >> NW8VOUUgUWVBmswCUOG158/edit#heading=h.q2rewiqndr5v>
> > >>> > > > >> is
> > >>> > > > >> making the durability a property of the ledger(type) as
> > opposed
> > >>> to
> > >>> > > > >> addEntry(). Rest of the technical details have a lot of
> > >>> > similarities.
> > >>> > > > >>
> > >>> > > > >
> > >>> > > > > Thank you JV. I have just read quickly the doc and your view
> is
> > >>> > > certainly
> > >>> > > > > broader.
> > >>> > > > > I will dig into the doc as soon as possible on Monday.
> > >>> > > > > For me it is ok to have a ledger wide configuration I think
> > that
> > >>> the
> > >>> > > most
> > >>> > > > > important decision is about the API we will provide as in the
> > >>> future
> > >>> > it
> > >>> > > > > will be difficult to change it.
> > >>> > > > >
> > >>> > > > >
> > >>> > > > > Cheers
> > >>> > > > > Enrico
> > >>> > > > >
> > >>> > > > >
> > >>> > > > >
> > >>> > > > >> https://docs.google.com/document/d/
> 1g1eBcVVCZrTG8YZliZP0LVqv
> > >>> Wpq43
> > >>> > > > >> 2ODEghrGVQ4d4Q/edit?usp=sharing
> > >>> > > > >>
> > >>> > > > >> On Thu, Aug 24, 2017 at 1:14 AM, Enrico Olivelli <
> > >>> > eolivelli@gmail.com
> > >>> > > >
> > >>> > > > >> wrote:
> > >>> > > > >>
> > >>> > > > >> > Thank you all for the comments and for taking a look to
> the
> > >>> > document
> > >>> > > > so
> > >>> > > > >> > soon.
> > >>> > > > >> > I have updated the doc, we will discuss the document at
> the
> > >>> > meeting,
> > >>> > > > >> >
> > >>> > > > >> >
> > >>> > > > >> > Enrico
> > >>> > > > >> >
> > >>> > > > >> > 2017-08-24 2:27 GMT+02:00 Sijie Guo <gu...@gmail.com>:
> > >>> > > > >> >
> > >>> > > > >> > > Enrico,
> > >>> > > > >> > >
> > >>> > > > >> > > Thank you so much! It is a great effort for putting this
> > up.
> > >>> > > Overall
> > >>> > > > >> > looks
> > >>> > > > >> > > good. I made some comments, we can discuss at tomorrow's
> > >>> > community
> > >>> > > > >> > meeting.
> > >>> > > > >> > >
> > >>> > > > >> > > - Sijie
> > >>> > > > >> > >
> > >>> > > > >> > > On Wed, Aug 23, 2017 at 8:25 AM, Enrico Olivelli <
> > >>> > > > eolivelli@gmail.com
> > >>> > > > >> >
> > >>> > > > >> > > wrote:
> > >>> > > > >> > >
> > >>> > > > >> > > > Hi all,
> > >>> > > > >> > > > I have drafted a first proposal for BP-14 - Relax
> > >>> Durability
> > >>> > > > >> > > >
> > >>> > > > >> > > > We are talking about limiting the number of fsync to
> the
> > >>> > journal
> > >>> > > > >> while
> > >>> > > > >> > > > preserving the correctness of the LAC protocol.
> > >>> > > > >> > > >
> > >>> > > > >> > > > This is the link to the wiki page, but as the issue is
> > >>> huge we
> > >>> > > > >> prefer
> > >>> > > > >> > to
> > >>> > > > >> > > > use Google Documents for sharing comments
> > >>> > > > >> > > > https://cwiki.apache.org/
> confluence/display/BOOKKEEPER/
> > >>> > > > >> > > > BP+-+14+Relax+durability
> > >>> > > > >> > > >
> > >>> > > > >> > > > This is the document
> > >>> > > > >> > > > https://docs.google.com/document/d/
> > 1JLYO3K3tZ5PJGmyS0YK_-
> > >>> > > > >> > > > NW8VOUUgUWVBmswCUOG158/edit?usp=sharing
> > >>> > > > >> > > >
> > >>> > > > >> > > > All comments are welcome
> > >>> > > > >> > > >
> > >>> > > > >> > > > I have added DL dev list in cc as the discussion is
> > >>> > interesting
> > >>> > > > for
> > >>> > > > >> > both
> > >>> > > > >> > > > groups
> > >>> > > > >> > > >
> > >>> > > > >> > > > Enrico Olivelli
> > >>> > > > >> > > >
> > >>> > > > >> > >
> > >>> > > > >> >
> > >>> > > > >>
> > >>> > > > >>
> > >>> > > > >>
> > >>> > > > >> --
> > >>> > > > >> Jvrao
> > >>> > > > >> ---
> > >>> > > > >> First they ignore you, then they laugh at you, then they
> fight
> > >>> you,
> > >>> > > then
> > >>> > > > >> you win. - Mahatma Gandhi
> > >>> > > > >>
> > >>> > > > > --
> > >>> > > > >
> > >>> > > > >
> > >>> > > > > -- Enrico Olivelli
> > >>> > > > >
> > >>> > > >
> > >>> > >
> > >>> > >
> > >>> > >
> > >>> > > --
> > >>> > > Jvrao
> > >>> > > ---
> > >>> > > First they ignore you, then they laugh at you, then they fight
> you,
> > >>> then
> > >>> > > you win. - Mahatma Gandhi
> > >>> > >
> > >>> >
> > >>>
> > >>
> > >>
> > >
> >
>
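
The LAS rule summarized in the quoted message above (each bookie reports a Last Add Synced point on its add responses, and the client uses it to advance LAC) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the BookKeeper implementation: the class and method names are hypothetical, striping is assumed to be disabled (every bookie in the ensemble receives every entry, as the thread proposes for no-sync writes), and LAC is assumed to advance to the largest entry id that at least ack-quorum bookies report as synced.

```python
from dataclasses import dataclass, field


@dataclass
class LacTracker:
    """Hypothetical client-side tracker; not part of any BookKeeper API."""
    bookies: list          # ensemble members (no striping assumed)
    ack_quorum: int        # bookies that must have synced an entry
    las: dict = field(default_factory=dict)  # bookie -> last add synced
    lac: int = -1          # -1 means no entry is confirmed yet

    def on_add_response(self, bookie, reported_las):
        # A bookie's LAS never moves backwards, so keep the maximum seen.
        self.las[bookie] = max(self.las.get(bookie, -1), reported_las)
        # Sort the per-bookie LAS values from highest to lowest.
        synced = sorted((self.las.get(b, -1) for b in self.bookies),
                        reverse=True)
        # The ack_quorum-th highest value is synced on at least
        # ack_quorum bookies, so LAC may safely advance up to it.
        candidate = synced[self.ack_quorum - 1]
        self.lac = max(self.lac, candidate)
        return self.lac


tracker = LacTracker(bookies=["b1", "b2", "b3"], ack_quorum=2)
tracker.on_add_response("b1", 5)   # only one bookie synced: LAC stays -1
tracker.on_add_response("b2", 3)   # two bookies synced up to 3: LAC = 3
tracker.on_add_response("b3", 4)   # two bookies synced up to 4: LAC = 4
```

With this rule a no-sync add can be acknowledged early without moving LAC, which matches the single semantic described in the thread: LAC only covers entries that are already on persistent storage.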

Re: [DISCUSS] BP-14 Relax Durability

Posted by Enrico Olivelli <eo...@gmail.com>.
Thank you all!

I will copy the content of the final draft to the Wiki and mark the
document as "Accepted".

I will send a PR soon, but it will depend on BP-15 New CreateLedger API.

I hope we can make it for 4.6.


Enrico


2017-09-11 18:58 GMT+02:00 Sijie Guo <gu...@gmail.com>:

> Enrico,
>
> Feel free to close the thread and mark this BP as accepted, if there is no
> -1.
>
> - Sijie
>

Re: [DISCUSS] BP-14 Relax Durability

Posted by Sijie Guo <gu...@gmail.com>.
Enrico,

Feel free to close the thread and mark this BP as accepted, if there is no
-1.

- Sijie

On Mon, Sep 11, 2017 at 2:26 AM, Enrico Olivelli <eo...@gmail.com>
wrote:

> Ping
>
> 2017-09-07 9:32 GMT+02:00 Enrico Olivelli <eo...@gmail.com>:
>
> > Hi all,
> >
> >
> > You can find the revised proposal here
> > https://cwiki.apache.org/confluence/display/BOOKKEEPER/
> > BP-14+Relax+durability
> >
> > The link to the document open for comments is this:
> > https://docs.google.com/document/d/1yNi9t2_
> deOOMXDaGzrnmaHTQeB3B3Fnym82DU
> > ERH7LM/edit?usp=sharing
> >
> > Please check it out
> > We are going to review this Proposal at the meeting
> >
> > -- Enrico
> >
> >
> > 2017-08-30 8:56 GMT+02:00 Enrico Olivelli <eo...@gmail.com>:
> >
> >> Thank you Sijie for summarizing and thanks to the community for helping
> >> in this important enhancement to BookKeeper
> >>
> >> I am convinced that as JV pointed out we need to declare at ledger
> >> creation time that the ledger is going to perform no-sync writes.
> >>
> >> I think we need an explicit declaration currently to make things "clear"
> >> to the developer who is using the LedgerHandle API, even at ledger
> >> creation time.
> >>
> >> The point is that we are going to forbid "striping" ledgers
> >> (ensemble size > quorum size) for no-sync writes in the first
> >> implementation:
> >> - one option is to fail at the first no-sync addEntry, but this would be
> >> really uncomfortable because usually the ack/write/ensemble sizes are
> >> configured by the admin, and there would be configurations in which
> >> errors come out only after starting the system.
> >> - the second option is to make the developer explicitly enable no-sync
> >> writes at creation time and fail the creation of the ledger if the
> >> requested combination of options is not possible
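A minimal Java sketch of the second option in the quoted text (fail at ledger creation instead of at the first no-sync addEntry). LedgerCreateOptions and its fields are hypothetical names used only for illustration; they are not part of the BookKeeper client API. The sketch assumes that "striping" means ensemble size greater than write quorum size.

```java
/** Hypothetical holder for ledger creation options; not a BookKeeper class. */
final class LedgerCreateOptions {
    final int ensembleSize;
    final int writeQuorumSize;
    final int ackQuorumSize;
    final boolean allowNoSyncWrites;

    LedgerCreateOptions(int ensembleSize, int writeQuorumSize,
                        int ackQuorumSize, boolean allowNoSyncWrites) {
        this.ensembleSize = ensembleSize;
        this.writeQuorumSize = writeQuorumSize;
        this.ackQuorumSize = ackQuorumSize;
        this.allowNoSyncWrites = allowNoSyncWrites;
    }

    /** Rejects invalid combinations before any entry is written. */
    void validate() {
        if (ackQuorumSize < 1 || ackQuorumSize > writeQuorumSize
                || writeQuorumSize > ensembleSize) {
            throw new IllegalArgumentException(
                "need 1 <= ackQuorum <= writeQuorum <= ensemble");
        }
        // Assumption: a ledger is "striped" when not every bookie gets every entry.
        boolean striped = ensembleSize > writeQuorumSize;
        if (allowNoSyncWrites && striped) {
            throw new IllegalArgumentException(
                "no-sync writes are not allowed on striped ledgers");
        }
    }
}
```

With this shape a misconfigured ensemble/quorum combination surfaces when the ledger is created, not after the system has started writing.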
> >>
> >> I am not sure that the changes to the bookie internals are a Client-API
> >> matter. Maybe we can leverage custom metadata (as JV said) in order to
> >> make the bookie handle ledgers in a different manner; this way will
> >> always be open, as custom metadata are already here.
> >>
> >> JV preferred the ledger-type approach; the dual solution is to introduce
> >> a list of "capabilities" or "ledger options".
> >> I think that this ability to perform no-sync writes is so important that
> >> "custom metadata" is not a good place to declare it, and the same goes
> >> for "ledger type".
> >>
> >> So I am proposing to add a boolean 'allowNoSyncWrites' at ledger
> >> creation time, without writing it to the ledger metadata on ZK.
> >> I think that if further improvements need ledger metadata changes, we
> >> will make them then.
> >>
> >> I have updated the BP-14 document, I have added an "Open issues" footer
> >> with the open points,
> >> please add comments and I will correct the document as soon as possible.
> >>
> >>
> >> Enrico
> >>
> >>
> >>
> >>
> >> 2017-08-30 1:24 GMT+02:00 Sijie Guo <gu...@gmail.com>:
> >>
> >>> Thank you, Enrico, JV.
> >>>
> >>> These are great discussions.
> >>>
> >>> After reading these two proposals, I have a few very high-level
> >>> comments, divided into three categories.
> >>>
> >>>
> >>> *API*
> >>>
> >>> - I think there are no fundamental differences between these two
> >>> proposals.
> >>> They are trying to achieve similar goals by exposing durability levels
> >>> in different ways.
> >>> So this will be a discussion on what the API/interface should look like
> >>> from the user / admin perspective.
> >>> I would suggest focusing on what the API itself would be, putting the
> >>> implementation design aside when talking about this.
> >>>
> >>> *Core*
> >>>
> >>> - Both proposals need to deal with a core function: what happens to the
> >>> LAC and what semantics BookKeeper provides.
> >>> JV did a good summary in his proposal. However I am not a fan of
> >>> maintaining two different semantics. So I am looking for
> >>> a solution in which BookKeeper maintains only one semantic. The
> >>> semantic is basically:
> >>>
> >>> 1) LAC is only advanced when entries before LAC are committed to
> >>> persistent storage
> >>> 2) All the entries until LAC are successfully committed to persistent
> >>> storage
> >>> 3) Entries until LAC: all the entries must be readable all the time.
> >>>
> >>> If we maintain such a semantic, there is no need to change the auto
> >>> recovery protocol in BookKeeper. All we guarantee is that the entries
> >>> are durably persisted.
> >>>
> >>> In order to maintain such a semantic, I think both JV and I proposed a
> >>> similar solution in our proposals. I am trying to finalize one here:
> >>>
> >>> * bookie maintains a LAS (Last Add Synced) point for each entry.
> >>> * LAS can be piggybacked on AddResponses
> >>> * Client uses the LAS to advance LAC.
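The three bullets above can be sketched on the client side roughly as follows. This is only an illustration under stated assumptions: LacTracker is a hypothetical class, not BookKeeper code; the ledger is not striped, so every bookie receives every entry; and each bookie reports, as its LAS, the highest entry id up to which everything is fsynced for this ledger.

```java
import java.util.Arrays;

/** Hypothetical client-side tracker that derives the LAC from per-bookie LAS values. */
final class LacTracker {
    private final long[] lastAddSynced; // latest LAS reported by each bookie
    private final int ackQuorum;
    private long lac = -1L;             // -1 means nothing is confirmed yet

    LacTracker(int bookies, int ackQuorum) {
        if (ackQuorum < 1 || ackQuorum > bookies) {
            throw new IllegalArgumentException("need 1 <= ackQuorum <= bookies");
        }
        this.lastAddSynced = new long[bookies];
        Arrays.fill(this.lastAddSynced, -1L);
        this.ackQuorum = ackQuorum;
    }

    /** Records the LAS piggybacked on an add response and returns the current LAC. */
    long onAddResponse(int bookieIndex, long las) {
        // LAS never moves backwards, so stale responses are ignored.
        lastAddSynced[bookieIndex] = Math.max(lastAddSynced[bookieIndex], las);
        long[] sorted = lastAddSynced.clone();
        Arrays.sort(sorted);
        // The ackQuorum-th largest LAS is durable on at least ackQuorum bookies.
        long candidate = sorted[sorted.length - ackQuorum];
        lac = Math.max(lac, candidate);
        return lac;
    }

    long getLac() {
        return lac;
    }
}
```

With three bookies and an ack quorum of two, the LAC only moves once two bookies have reported a synced position, which matches rule 1) of the semantic listed earlier in this message.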
> >>>
> >>> If we can agree on the core semantic we are going to provide, the other
> >>> things are just logistics.
> >>>
> >>> *Others*
> >>>
> >>> - Regarding separating the journal or bypassing the journal, there is no
> >>> difference when we talk in terms of the core semantic. They are all
> >>> non-durable writes (acknowledging before fsyncing).
> >>> We can start with the same-journal approach (but just acknowledge before
> >>> fsyncing), implement the core and add other options later on.
> >>>
> >>>
> >>> From my point of view, I'd be more interested in providing a single
> >>> consistent durability semantic that applications can rely on for both
> >>> durable writes and non-durable writes. The other stuff seems to be more
> >>> of a logistics matter.
> >>>
> >>>
> >>> - Sijie
> >>>
> >>>
> >>> On Mon, Aug 28, 2017 at 11:27 PM, Enrico Olivelli <eolivelli@gmail.com>
> >>> wrote:
> >>>
> >>> > 2017-08-29 8:01 GMT+02:00 Venkateswara Rao Jujjuri <jujjuri@gmail.com>:
> >>> >
> >>> > > I don't believe I fully followed your second case. But even in this
> >>> > > case, your major concern is about the additional 'sync' RPC?
> >>> > >
> >>> >
> >>> > yes, apart from that I am fine with your proposal too, that is to
> >>> > have a LedgerType which drives durability,
> >>> > and I think we need to add per-entry durability options
> >>> >
> >>> > I think that at least for the 'simple' no-sync addEntry we do not
> >>> > need to change many things. I am drafting a prototype, and I will
> >>> > share it as soon as we all agree on the roadmap
> >>> >
> >>> > The first implementation can cover the first cases (no-sync addEntry)
> >>> > and change the way the writer advances the LAC in order to support
> >>> > 'relaxed durability writes'.
> >>> > This change will be compatible with future improvements and it will
> >>> > open the door for big changes on the bookie side, like bypassing the
> >>> > journal or leveraging multiple journals.
> >>> >
> >>> > -- Enrico
> >>> >
> >>> > or something else that the LedgerType proposal won't work?
> >>> > >
> >>> >
> >>> > >
> >>> > >
> >>> > > On Mon, Aug 28, 2017 at 7:35 AM, Enrico Olivelli <eolivelli@gmail.com>
> >>> > > wrote:
> >>> > >
> >>> > > > I think that having a set of options on the ledger metadata will
> >>> > > > be a good enhancement and I am sure we will do it as soon as it
> >>> > > > is needed; maybe we do not need it now.
> >>> > > >
> >>> > > > Actually I think we will need to declare this durability level at
> >>> > > > entry level to support some use cases in the BP-14 document. Let me
> >>> > > > explain two of my use cases for which I need it:
> >>> > > >
> >>> > > > At a higher level we have two choices:
> >>> > > >
> >>> > > > A) per-ledger durability options (JV proposal)
> >>> > > > all addEntry operations are durable or non-durable and there is
> >>> > > > an explicit 'sync' API (+ forced sync at close)
> >>> > > >
> >>> > > > B) per-entry durability options (original BP-14 proposal)
> >>> > > > every addEntry has its own durable/non-durable option
> >>> > > > (sync/no-sync), with the ability to call 'sync' without addEntry
> >>> > > > (+ forced sync at close)
> >>> > > >
> >>> > > > I am speaking about the database WAL case. I am using the
> >>> > > > ledger as a segment for the WAL of a database and I am writing all
> >>> > > > data changes in the scope of a 'transaction' with the
> >>> > > > relaxed-durability flag. Then I am writing the 'transaction
> >>> > > > committed' entry with a "strict durability" requirement, which
> >>> > > > will in fact require that all previous entries are persisted
> >>> > > > durably, so that the transaction will never be lost.
> >>> > > >
> >>> > > > In this scenario we would in fact need an addEntry + sync API:
> >>> > > >
> >>> > > > using option  A) the WAL will look like:
> >>> > > > - open ledger no-sync = true
> >>> > > > - addEntry (set foo=bar)  (this will be no-sync)
> >>> > > > - addEntry (set foo=bar2) (this will be no-sync)
> >>> > > > - addEntry (commit)
> >>> > > > - sync
> >>> > > >
> >>> > > > using option B) the WAL will look like
> >>> > > > - open ledger
> >>> > > > - addEntry (set foo=bar), no-sync
> >>> > > > - addEntry (set foo=bar2), no-sync
> >>> > > > - addEntry (commit), sync
> >>> > > >
> >>> > > > In case B) we are "saving" one RPC call to every bookie (the
> >>> > > > 'sync' one). The same holds for single data change entries, like
> >>> > > > updating a single record on the database: with BK 4.5 this "costs"
> >>> > > > only a single RPC to every bookie
> >>> > > >
> >>> > > > Second case:
> >>> > > > I am using BookKeeper to store binary objects, so I am packing
> >>> > > > several 'objects' (named sequences of bytes) into a single ledger,
> >>> > > > like you do when you write many records to a file in a streaming
> >>> > > > fashion and keep track of the offsets of the beginning of every
> >>> > > > record (LedgerHandleAdv is perfect for this case).
> >>> > > > I am not using a single ledger per 'file' because it kills
> >>> > > > ZooKeeper to create many ledgers very fast. In my systems I have
> >>> > > > big bursts of writes, which need to be really "fast", so I am
> >>> > > > writing multiple 'files' to every single ledger. So the
> >>> > > > close-to-open consistency at ledger level is not suitable for
> >>> > > > this case.
> >>> > > > I have to write as fast as possible to this 'ledger-backed'
> >>> > > > stream, and as with a 'traditional' filesystem I am writing parts
> >>> > > > of each file and then requiring 'sync' at the end of each file.
> >>> > > > Using BookKeeper you need to split big 'files' into "little"
> >>> > > > parts; you cannot transmit the contents as a "real" stream on the
> >>> > > > network.
> >>> > > >
> >>> > > > I am not talking about bookie-level implementation details. I
> >>> > > > would like to define the high-level API in order to support all
> >>> > > > the relevant known use cases and keep space for the future.
> >>> > > > At this moment adding a per-entry 'durability option' seems to be
> >>> > > > very flexible and simple to implement, and it does not prevent us
> >>> > > > from doing further improvements, such as skipping the journal.
> >>> > > >
> >>> > > > Enrico
> >>> > > >
> >>> > > >
> >>> > > >
> >>> > > > 2017-08-26 19:55 GMT+02:00 Enrico Olivelli <eolivelli@gmail.com>:
> >>> > > >
> >>> > > > >
> >>> > > > >
> >>> > > > > On Sat, Aug 26, 2017, 19:19 Venkateswara Rao Jujjuri
> >>> > > > > <jujjuri@gmail.com> wrote:
> >>> > > > >
> >>> > > > >> Hi all,
> >>> > > > >>
> >>> > > > >> As promised during Thursday call, here is my proposal.
> >>> > > > >>
> >>> > > > >> *NOTE*: The major difference in this proposal compared to Enrico’s
> >>> > > > >> <https://docs.google.com/document/d/1JLYO3K3tZ5PJGmyS0YK_-NW8VOUUgUWVBmswCUOG158/edit#heading=h.q2rewiqndr5v>
> >>> > > > >> is making the durability a property of the ledger (type) as
> >>> > > > >> opposed to addEntry(). The rest of the technical details have a
> >>> > > > >> lot of similarities.
> >>> > > > >>
> >>> > > > >
> >>> > > > > Thank you JV. I have just read the doc quickly and your view is
> >>> > > > > certainly broader.
> >>> > > > > I will dig into the doc as soon as possible on Monday.
> >>> > > > > For me it is ok to have a ledger-wide configuration. I think
> >>> > > > > that the most important decision is about the API we will
> >>> > > > > provide, as it will be difficult to change it in the future.
> >>> > > > >
> >>> > > > >
> >>> > > > > Cheers
> >>> > > > > Enrico
> >>> > > > >
> >>> > > > >
> >>> > > > >
> >>> > > > >> https://docs.google.com/document/d/1g1eBcVVCZrTG8YZliZP0LVqvWpq432ODEghrGVQ4d4Q/edit?usp=sharing
> >>> > > > >>
> >>> > > > >> On Thu, Aug 24, 2017 at 1:14 AM, Enrico Olivelli
> >>> > > > >> <eolivelli@gmail.com> wrote:
> >>> > > > >>
> >>> > > > >> > Thank you all for the comments and for taking a look at the
> >>> > > > >> > document so soon.
> >>> > > > >> > I have updated the doc; we will discuss the document at the
> >>> > > > >> > meeting.
> >>> > > > >> >
> >>> > > > >> >
> >>> > > > >> > Enrico
> >>> > > > >> >
> >>> > > > >> > 2017-08-24 2:27 GMT+02:00 Sijie Guo <gu...@gmail.com>:
> >>> > > > >> >
> >>> > > > >> > > Enrico,
> >>> > > > >> > >
> >>> > > > >> > > Thank you so much! It is a great effort to put this up.
> >>> > > > >> > > Overall it looks good. I made some comments; we can discuss
> >>> > > > >> > > them at tomorrow's community meeting.
> >>> > > > >> > >
> >>> > > > >> > > - Sijie
> >>> > > > >> > >
> >>> > > > >> > > On Wed, Aug 23, 2017 at 8:25 AM, Enrico Olivelli
> >>> > > > >> > > <eolivelli@gmail.com> wrote:
> >>> > > > >> > >
> >>> > > > >> > > > Hi all,
> >>> > > > >> > > > I have drafted a first proposal for BP-14 - Relax
> >>> > > > >> > > > Durability
> >>> > > > >> > > >
> >>> > > > >> > > > We are talking about limiting the number of fsync to the
> >>> > > > >> > > > journal while preserving the correctness of the LAC
> >>> > > > >> > > > protocol.
> >>> > > > >> > > >
> >>> > > > >> > > > This is the link to the wiki page, but as the issue is
> >>> > > > >> > > > huge we prefer to use Google Documents for sharing
> >>> > > > >> > > > comments
> >>> > > > >> > > > https://cwiki.apache.org/confluence/display/BOOKKEEPER/BP+-+14+Relax+durability
> >>> > > > >> > > >
> >>> > > > >> > > > This is the document
> >>> > > > >> > > > https://docs.google.com/document/d/1JLYO3K3tZ5PJGmyS0YK_-NW8VOUUgUWVBmswCUOG158/edit?usp=sharing
> >>> > > > >> > > >
> >>> > > > >> > > > All comments are welcome
> >>> > > > >> > > >
> >>> > > > >> > > > I have added DL dev list in cc as the discussion is
> >>> > > > >> > > > interesting for both groups
> >>> > > > >> > > >
> >>> > > > >> > > > Enrico Olivelli
> >>> > > > >> > > >
> >>> > > > >> > >
> >>> > > > >> >
> >>> > > > >>
> >>> > > > >>
> >>> > > > >>
> >>> > > > >> --
> >>> > > > >> Jvrao
> >>> > > > >> ---
> >>> > > > >> First they ignore you, then they laugh at you, then they fight
> >>> > > > >> you, then you win. - Mahatma Gandhi
> >>> > > > >>
> >>> > > > > --
> >>> > > > >
> >>> > > > >
> >>> > > > > -- Enrico Olivelli
> >>> > > > >
> >>> > > >
> >>> > >
> >>> > >
> >>> > >
> >>> > > --
> >>> > > Jvrao
> >>> > > ---
> >>> > > First they ignore you, then they laugh at you, then they fight you,
> >>> > > then you win. - Mahatma Gandhi
> >>> > >
> >>> >
> >>>
> >>
> >>
> >
>

Re: [DISCUSS] BP-14 Relax Durability

Posted by Enrico Olivelli <eo...@gmail.com>.
Ping

2017-09-07 9:32 GMT+02:00 Enrico Olivelli <eo...@gmail.com>:

> Hi all,
>
>
> You can find the revised proposal here
> https://cwiki.apache.org/confluence/display/BOOKKEEPER/BP-14+Relax+durability
>
> The link to the document open for comments is this:
> https://docs.google.com/document/d/1yNi9t2_deOOMXDaGzrnmaHTQeB3B3Fnym82DUERH7LM/edit?usp=sharing
>
> Please check it out
> We are going to review this Proposal at the meeting
>
> -- Enrico
>
>
> 2017-08-30 8:56 GMT+02:00 Enrico Olivelli <eo...@gmail.com>:
>
>> Thank you Sijie for summarizing and thanks to the community for helping
>> in this important enhancement to BookKeeper
>>
>> I am convinced that as JV pointed out we need to declare at ledger
>> creation time that the ledger is going to perform no-sync writes.
>>
>> I think we currently need an explicit declaration to make things "clear"
>> to the developer who is using the LedgerHandle API, even at ledger
>> creation time.
>>
>> The point is that we are going to forbid "striping" ledgers
>> (ensemble size > quorum size) for no-sync writes in the first
>> implementation:
>> - one option is to fail at the first no-sync addEntry, but this would be
>> really uncomfortable because usually the ack/write/ensemble sizes are
>> configured by the admin, and there would be configurations in which
>> errors come out only after starting the system.
>> - the second option is to make the developer explicitly enable no-sync
>> writes at creation time and fail the creation of the ledger if the
>> requested combination of options is not possible
>>
>> I am not sure that the changes to the bookie internals are a Client-API
>> matter. Maybe we can leverage custom metadata (as JV said) in order to
>> make the bookie handle ledgers in a different manner; this way will
>> always be open, as custom metadata are already here.
>>
>> JV preferred the ledger-type approach; the dual solution is to introduce
>> a list of "capabilities" or "ledger options".
>> I think that this ability to perform no-sync writes is so important that
>> "custom metadata" is not a good place to declare it, and the same goes
>> for "ledger type".
>>
>> So I am proposing to add a boolean 'allowNoSyncWrites' at ledger
>> creation time, without writing it to the ledger metadata on ZK.
>> I think that if further improvements need ledger metadata changes, we
>> will make them then.
>>
>> I have updated the BP-14 document, I have added an "Open issues" footer
>> with the open points,
>> please add comments and I will correct the document as soon as possible.
>>
>>
>> Enrico
>>
>>
>>
>>
>> 2017-08-30 1:24 GMT+02:00 Sijie Guo <gu...@gmail.com>:
>>
>>> Thank you, Enrico, JV.
>>>
>>> These are great discussions.
>>>
>>> After reading these two proposals, I have a few very high-level comments,
>>> divided into three categories.
>>>
>>>
>>> *API*
>>>
>>> - I think there are no fundamental differences between these two
>>> proposals.
>>> They are trying to achieve similar goals by exposing durability levels in
>>> different ways.
>>> So this will be a discussion on what the API/interface should look like
>>> from the user / admin perspective.
>>> I would suggest focusing on what the API itself would be, putting the
>>> implementation design aside when talking about this.
>>>
>>> *Core*
>>>
>>> - Both proposals need to deal with a core function: what happens to the
>>> LAC and what semantics BookKeeper provides.
>>> JV did a good summary in his proposal. However, I am not a fan of
>>> maintaining two different semantics. So I am looking for a solution in
>>> which BookKeeper maintains only one semantic. The semantic is basically:
>>>
>>> 1) The LAC is only advanced when the entries before the LAC are committed
>>> to persistent storage.
>>> 2) All the entries up to the LAC are successfully committed to persistent
>>> storage.
>>> 3) Entries up to the LAC: all of them must be readable at all times.
>>>
>>> If we maintain such a semantic, there is no need to change the auto
>>> recovery protocol in BookKeeper. All we guarantee is that the entries are
>>> durably persisted.
>>>
>>> In order to maintain such a semantic, I think JV and I both proposed
>>> similar solutions in our proposals. I am trying to finalize one here:
>>>
>>> * bookie maintains a LAS (Last Add Synced) point for each entry.
>>> * LAS can be piggybacked on AddResponses
>>> * Client uses the LAS to advance LAC.
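The three LAS points above can be sketched on the client side as follows. This is only an illustration under stated assumptions: the ledger is not striped (every bookie receives every entry), a LAS value reported by a bookie means that every entry with an id up to and including it is fsynced on that bookie, and the class and method names are hypothetical rather than BookKeeper API.

```java
import java.util.Arrays;

/** Hypothetical client-side tracker; illustrative only, not the BookKeeper API. */
final class LacTracker {
    private final long[] lastAddSynced; // latest LAS reported by each bookie, -1 = none
    private final int ackQuorumSize;
    private long lastAddConfirmed = -1L;

    LacTracker(int ensembleSize, int ackQuorumSize) {
        if (ackQuorumSize < 1 || ackQuorumSize > ensembleSize) {
            throw new IllegalArgumentException("need 1 <= ackQuorum <= ensemble");
        }
        this.lastAddSynced = new long[ensembleSize];
        Arrays.fill(this.lastAddSynced, -1L);
        this.ackQuorumSize = ackQuorumSize;
    }

    /**
     * Records the LAS piggybacked on an add response and returns the new LAC.
     * The LAC becomes the highest entry id that at least ackQuorumSize bookies
     * report as synced, and it never moves backwards.
     */
    long onAddResponse(int bookieIndex, long las) {
        // Ignore stale (out-of-order) responses: LAS per bookie only grows.
        lastAddSynced[bookieIndex] = Math.max(lastAddSynced[bookieIndex], las);
        long[] sorted = lastAddSynced.clone();
        Arrays.sort(sorted); // ascending
        // The ackQuorumSize-th largest LAS is synced on at least that many bookies.
        long syncedOnQuorum = sorted[sorted.length - ackQuorumSize];
        lastAddConfirmed = Math.max(lastAddConfirmed, syncedOnQuorum);
        return lastAddConfirmed;
    }

    long getLastAddConfirmed() {
        return lastAddConfirmed;
    }
}
```

For example, with three bookies and an ack quorum of two, LAS reports of 5, 3 and 4 leave the LAC at 4: entry 5 is synced on one bookie only, so it is not yet covered.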
>>>
>>> If we can agree on the core semantic we are going to provide, the other
>>> things are just logistics.
>>>
>>> *Others*
>>>
>>> - Regarding separating the journal or bypassing the journal, there is no
>>> difference from the point of view of the core semantic. They are all
>>> non-durable writes (acknowledging before fsyncing).
>>> We can start with the same-journal approach (but just acknowledge before
>>> fsyncing), implement the core and add other options later on.
>>>
>>>
>>> From my point of view, I'd be more interested in providing a single
>>> consistent durability semantic that applications can rely on for both
>>> durable writes and non-durable writes. The other items seem to be more a
>>> matter of logistics.
>>>
>>>
>>> - Sijie
>>>
>>>
>>> On Mon, Aug 28, 2017 at 11:27 PM, Enrico Olivelli <eo...@gmail.com>
>>> wrote:
>>>
>>> > 2017-08-29 8:01 GMT+02:00 Venkateswara Rao Jujjuri <jujjuri@gmail.com>:
>>> >
>>> > > I don't believe I fully followed your second case. But even in this
>>> > > case, your major concern is about the additional 'sync' RPC?
>>> > >
>>> >
>>> > Yes. Apart from that, I am fine with your proposal too, that is, to have
>>> > a LedgerType which drives durability,
>>> > and I think we need to add per-entry durability options.
>>> >
>>> > I think that at least for the 'simple' no-sync addEntry we do not need
>>> > to change many things. I am drafting a prototype and will share it as
>>> > soon as we all agree on the roadmap.
>>> >
>>> > The first implementation can cover the first cases (no-sync addEntry)
>>> > and change the way the writer advances the LAC in order to support
>>> > 'relaxed durability writes'.
>>> > This change will be compatible with future improvements and it will open
>>> > the door for big changes on the bookie side, like bypassing the journal
>>> > or leveraging multiple journals.
>>> >
>>> > -- Enrico
>>> >
>>> > > or something else for which the LedgerType proposal won't work?
>>> > >
>>> >
>>> > >
>>> > >
>>> > > On Mon, Aug 28, 2017 at 7:35 AM, Enrico Olivelli <eolivelli@gmail.com>
>>> > > wrote:
>>> > >
>>> > > > I think that having a set of options on the ledger metadata will be
>>> > > > a good enhancement and I am sure we will do it as soon as it is
>>> > > > needed; maybe we do not need it now.
>>> > > >
>>> > > > Actually I think we will need to declare this durability level at
>>> > > > entry level to support some use cases in the BP-14 document. Let me
>>> > > > explain two of my use cases for which I need it.
>>> > > >
>>> > > > At a high level we have two choices:
>>> > > >
>>> > > > A) per-ledger durability options (JV proposal)
>>> > > > all addEntry operations are durable or non-durable and there is an
>>> > > > explicit 'sync' API (+ forced sync at close)
>>> > > >
>>> > > > B) per-entry durability options (original BP-14 proposal)
>>> > > > every addEntry has its own durable/non-durable option
>>> > > > (sync/no-sync), with the ability to call 'sync' without addEntry
>>> > > > (+ forced sync at close)
>>> > > >
>>> > > > The first case is the database WAL case: I am using the ledger as a
>>> > > > segment of the WAL of a database. I write all data changes in the
>>> > > > scope of a 'transaction' with the relaxed-durability flag, then I
>>> > > > write the 'transaction committed' entry with the "strict durability"
>>> > > > requirement. This in fact requires that all previous entries are
>>> > > > persisted durably, so that the transaction will never be lost.
>>> > > >
>>> > > > In this scenario we would in fact need an addEntry + sync API:
>>> > > >
>>> > > > using option A) the WAL will look like:
>>> > > > - open ledger no-sync = true
>>> > > > - addEntry (set foo=bar)  (this will be no-sync)
>>> > > > - addEntry (set foo=bar2) (this will be no-sync)
>>> > > > - addEntry (commit)
>>> > > > - sync
>>> > > >
>>> > > > using option B) the WAL will look like
>>> > > > - open ledger
>>> > > > - addEntry (set foo=bar), no-sync
>>> > > > - addEntry (set foo=bar2), no-sync
>>> > > > - addEntry (commit), sync
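The two WAL sequences above can be compared with a small in-memory model. It is a sketch under assumptions: `WalBookieModel` and `WalOptionsDemo` are hypothetical names, one call stands for one RPC to a bookie, and a sync (explicit, or carried by an addEntry) is taken to make all previously added entries durable, as described for the 'transaction committed' entry.

```java
/** Hypothetical single-bookie model that only counts RPCs and tracks durability. */
final class WalBookieModel {
    private long lastAdded = -1L;   // highest entry id received
    private long lastSynced = -1L;  // highest entry id known to be fsynced
    private int rpcCount = 0;

    /** One addEntry RPC; when sync is true, everything up to this entry becomes durable. */
    long addEntry(boolean sync) {
        rpcCount++;
        lastAdded++;
        if (sync) {
            lastSynced = lastAdded;
        }
        return lastAdded;
    }

    /** One explicit sync RPC; everything added so far becomes durable. */
    void sync() {
        rpcCount++;
        lastSynced = lastAdded;
    }

    int rpcCount() { return rpcCount; }
    long lastSynced() { return lastSynced; }
}

final class WalOptionsDemo {
    /** Option A: per-ledger no-sync, followed by an explicit sync. */
    static WalBookieModel optionA() {
        WalBookieModel bookie = new WalBookieModel();
        bookie.addEntry(false); // set foo=bar
        bookie.addEntry(false); // set foo=bar2
        bookie.addEntry(false); // commit
        bookie.sync();
        return bookie;
    }

    /** Option B: per-entry flag, the commit entry carries the sync. */
    static WalBookieModel optionB() {
        WalBookieModel bookie = new WalBookieModel();
        bookie.addEntry(false); // set foo=bar
        bookie.addEntry(false); // set foo=bar2
        bookie.addEntry(true);  // commit, sync
        return bookie;
    }

    public static void main(String[] args) {
        System.out.println("A: rpcs=" + optionA().rpcCount()
            + " lastSynced=" + optionA().lastSynced());
        System.out.println("B: rpcs=" + optionB().rpcCount()
            + " lastSynced=" + optionB().lastSynced());
    }
}
```

In this model both options end with entries 0 to 2 durable; option A uses four RPCs per bookie and option B uses three.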
>>> > > >
>>> > > > In case B) we are "saving" one RPC call to every bookie (the 'sync'
>>> > > > one). The same holds for single data-change entries, like updating a
>>> > > > single record in the database: with BK 4.5 this "costs" only a
>>> > > > single RPC to every bookie.
>>> > > >
>>> > > > Second case:
>>> > > > I am using BookKeeper to store binary objects, so I am packing
>>> > > > several 'objects' (named sequences of bytes) into a single ledger,
>>> > > > like you do when you write many records to a file in a streaming
>>> > > > fashion and keep track of the offset of the beginning of every
>>> > > > record (LedgerHandleAdv is perfect for this case).
>>> > > > I am not using a single ledger per 'file' because creating many
>>> > > > ledgers very fast kills ZooKeeper. In my systems I have big bursts
>>> > > > of writes, which need to be really "fast", so I am writing multiple
>>> > > > 'files' to every single ledger. So the close-to-open consistency at
>>> > > > ledger level is not suitable for this case.
>>> > > > I have to write as fast as possible to this 'ledger-backed' stream,
>>> > > > and as with a 'traditional' filesystem I am writing parts of each
>>> > > > file and then requiring 'sync' at the end of each file.
>>> > > > Using BookKeeper you need to split big 'files' into "little" parts;
>>> > > > you cannot transmit the contents as a "real" stream over the network.
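The second case can be pictured with a small in-memory model: each 'file' is split into entries of bounded size, only the last part asks for a sync, and an index remembers where each file starts. Everything here is an assumption made for illustration: `PackedLedgerWriter`, `Extent` and the in-memory list standing in for the ledger are hypothetical and do not use the BookKeeper client.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;

/** Hypothetical model of packing many named 'files' into one ledger. */
final class PackedLedgerWriter {

    /** Where a file lives in the ledger: first entry id and number of entries. */
    static final class Extent {
        final long firstEntryId;
        final int entryCount;

        Extent(long firstEntryId, int entryCount) {
            this.firstEntryId = firstEntryId;
            this.entryCount = entryCount;
        }
    }

    private final int maxEntrySize;
    private final List<byte[]> entries = new ArrayList<>();   // stands in for the ledger
    private final Map<String, Extent> index = new LinkedHashMap<>();
    private long lastSyncedEntryId = -1L;

    PackedLedgerWriter(int maxEntrySize) {
        if (maxEntrySize < 1) {
            throw new IllegalArgumentException("maxEntrySize must be >= 1");
        }
        this.maxEntrySize = maxEntrySize;
    }

    /** Writes one file as a run of entries; only the end of the file is synced. */
    Extent writeFile(String name, byte[] content) {
        long first = entries.size();
        int offset = 0;
        do { // an empty file still occupies one (empty) entry
            int len = Math.min(maxEntrySize, content.length - offset);
            entries.add(Arrays.copyOfRange(content, offset, offset + len)); // no-sync add
            offset += len;
        } while (offset < content.length);
        lastSyncedEntryId = entries.size() - 1L; // the 'sync' at the end of the file
        Extent extent = new Extent(first, (int) (entries.size() - first));
        index.put(name, extent);
        return extent;
    }

    /** Reassembles a file from its run of entries. */
    byte[] readFile(String name) {
        Extent extent = index.get(name);
        if (extent == null) {
            throw new NoSuchElementException(name);
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (long id = extent.firstEntryId; id < extent.firstEntryId + extent.entryCount; id++) {
            byte[] part = entries.get((int) id);
            out.write(part, 0, part.length);
        }
        return out.toByteArray();
    }

    long lastSyncedEntryId() {
        return lastSyncedEntryId;
    }
}
```

With a maximum entry size of 4 bytes, a 10-byte file occupies entries 0 to 2 and a following 3-byte file occupies entry 3, so one sync per file is enough to cover all of its parts.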
>>> > > >
>>> > > > I am not talking about bookie-level implementation details. I would
>>> > > > like to define the high-level API in order to support all the
>>> > > > relevant known use cases and leave room for the future.
>>> > > > At this moment adding a per-entry 'durability option' seems to be
>>> > > > very flexible and simple to implement, and it does not prevent us
>>> > > > from doing further improvements, such as skipping the journal.
>>> > > >
>>> > > > Enrico
>>> > > >
>>> > > >
>>> > > >
>>> > > > 2017-08-26 19:55 GMT+02:00 Enrico Olivelli <eo...@gmail.com>:
>>> > > >
>>> > > > >
>>> > > > >
>>> > > > > On Sat, Aug 26, 2017, 19:19 Venkateswara Rao Jujjuri
>>> > > > > <jujjuri@gmail.com> wrote:
>>> > > > >
>>> > > > >> Hi all,
>>> > > > >>
>>> > > > >> As promised during Thursday's call, here is my proposal.
>>> > > > >>
>>> > > > >> *NOTE*: The major difference in this proposal compared to Enrico’s
>>> > > > >> <https://docs.google.com/document/d/1JLYO3K3tZ5PJGmyS0YK_-NW8VOUUgUWVBmswCUOG158/edit#heading=h.q2rewiqndr5v>
>>> > > > >> is making the durability a property of the ledger (type) as opposed
>>> > > > >> to addEntry(). The rest of the technical details have a lot of
>>> > > > >> similarities.
>>> > > > >>
>>> > > > >
>>> > > > > Thank you JV. I have just read the doc quickly and your view is
>>> > > > > certainly broader.
>>> > > > > I will dig into the doc as soon as possible on Monday.
>>> > > > > For me it is OK to have a ledger-wide configuration. I think that
>>> > > > > the most important decision is about the API we will provide, as in
>>> > > > > the future it will be difficult to change it.
>>> > > > >
>>> > > > >
>>> > > > > Cheers
>>> > > > > Enrico
>>> > > > >
>>> > > > >
>>> > > > >
>>> > > > >> https://docs.google.com/document/d/1g1eBcVVCZrTG8YZliZP0LVqvWpq432ODEghrGVQ4d4Q/edit?usp=sharing
>>> > > > >>
>>> > > > >> On Thu, Aug 24, 2017 at 1:14 AM, Enrico Olivelli
>>> > > > >> <eolivelli@gmail.com> wrote:
>>> > > > >>
>>> > > > >> > Thank you all for the comments and for taking a look at the
>>> > > > >> > document so soon.
>>> > > > >> > I have updated the doc; we will discuss the document at the
>>> > > > >> > meeting.
>>> > > > >> >
>>> > > > >> >
>>> > > > >> > Enrico
>>> > > > >> >
>>> > > > >> > 2017-08-24 2:27 GMT+02:00 Sijie Guo <gu...@gmail.com>:
>>> > > > >> >
>>> > > > >> > > Enrico,
>>> > > > >> > >
>>> > > > >> > > Thank you so much! It is a great effort putting this up.
>>> > > > >> > > Overall it looks good. I made some comments; we can discuss
>>> > > > >> > > them at tomorrow's community meeting.
>>> > > > >> > >
>>> > > > >> > > - Sijie
>>> > > > >> > >
>>> > > > >> > > On Wed, Aug 23, 2017 at 8:25 AM, Enrico Olivelli
>>> > > > >> > > <eolivelli@gmail.com> wrote:
>>> > > > >> > >
>>> > > > >> > > > Hi all,
>>> > > > >> > > > I have drafted a first proposal for BP-14 - Relax Durability
>>> > > > >> > > >
>>> > > > >> > > > We are talking about limiting the number of fsyncs to the
>>> > > > >> > > > journal while preserving the correctness of the LAC protocol.
>>> > > > >> > > >
>>> > > > >> > > > This is the link to the wiki page, but as the issue is huge
>>> > > > >> > > > we prefer to use Google Documents for sharing comments
>>> > > > >> > > > https://cwiki.apache.org/confluence/display/BOOKKEEPER/BP+-+14+Relax+durability
>>> > > > >> > > >
>>> > > > >> > > > This is the document
>>> > > > >> > > > https://docs.google.com/document/d/1JLYO3K3tZ5PJGmyS0YK_-NW8VOUUgUWVBmswCUOG158/edit?usp=sharing
>>> > > > >> > > >
>>> > > > >> > > > All comments are welcome
>>> > > > >> > > >
>>> > > > >> > > > I have added DL dev list in cc as the discussion is
>>> > > > >> > > > interesting for both groups
>>> > > > >> > > >
>>> > > > >> > > > Enrico Olivelli
>>> > > > >> > > >
>>> > > > >> > >
>>> > > > >> >
>>> > > > >>
>>> > > > >>
>>> > > > >>
>>> > > > >> --
>>> > > > >> Jvrao
>>> > > > >> ---
>>> > > > >> First they ignore you, then they laugh at you, then they fight
>>> > > > >> you, then you win. - Mahatma Gandhi
>>> > > > >>
>>> > > > > --
>>> > > > >
>>> > > > >
>>> > > > > -- Enrico Olivelli
>>> > > > >
>>> > > >
>>> > >
>>> > >
>>> > >
>>> > > --
>>> > > Jvrao
>>> > > ---
>>> > > First they ignore you, then they laugh at you, then they fight you,
>>> > > then you win. - Mahatma Gandhi
>>> > >
>>> >
>>>
>>
>>
>

Re: [DISCUSS] BP-14 Relax Durability

Posted by Enrico Olivelli <eo...@gmail.com>.
Ping
