
Posted to dev@bookkeeper.apache.org by Andrey Yegorov <an...@gmail.com> on 2017/05/01 21:13:59 UTC

why write to journal is happening after write to ledgerStorage?

Hi,

Looking at the code in Bookie.java, I noticed that the write to the journal
(which is supposed to be a write-ahead log, as I understand it) happens after
the write to ledger storage.
This looks counter-intuitive. Can someone explain why it is done in this
order?

My primary concern is that the ledger storage write can be delayed (e.g.
EntryMemTable's addEntry can call throttleWriters() in some cases), thus
driving up the client's overall view of add latency, even though the
journal write (e.g. with a dedicated journal disk) could complete
faster.

    private void addEntryInternal(LedgerDescriptor handle, ByteBuffer entry,
                                  WriteCallback cb, Object ctx)
            throws IOException, BookieException {
        long ledgerId = handle.getLedgerId();
        entry.rewind();
        // ledgerStorage.addEntry() is happening here
        long entryId = handle.addEntry(entry);

        entry.rewind();
        writeBytes.add(entry.remaining());

        LOG.trace("Adding {}@{}", entryId, ledgerId);
        // journal add entry is happening here
        // callback/response to client is sent after journal add is done.
        journal.logAddEntry(entry, cb, ctx);
    }



----------
Andrey Yegorov

Re: why write to journal is happening after write to ledgerStorage?

Posted by Sijie Guo <gu...@gmail.com>.

On Mon, May 1, 2017 at 6:03 PM, Venkateswara Rao Jujjuri <ju...@gmail.com>
wrote:

> On Mon, May 1, 2017 at 5:56 PM, Sijie Guo <gu...@gmail.com> wrote:
>
> > I don't think this is an inconsistency issue. The in-memory update is
> > updating the LAC, not the current entry. Even though the entry is added
> > into memory, it will not be readable until the LAC is advanced, and the
> > LAC is advanced only after the next entry is added, which happens after
> > the current entry is acked.
> >
>
> That is not true. You are talking about the piggy-backed LAC only. But with
> explicit LAC you don't need the next entry to move the LAC on the bookie.
>
>

It doesn't matter whether it is a piggy-backed LAC or an explicit LAC. The
LAC is only advanced after the client receives acknowledgements from the
bookies (which happens after the entry has been persisted in the journal).
So the order of adding the entry to memory and adding the entry to the
journal doesn't affect correctness here.
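
A minimal sketch of this argument, assuming a toy model with a single writer and a single bookie. The class and method names below are hypothetical and are not BookKeeper's API; the in-memory maps merely stand in for ledger storage and the journal. It shows that, with the current order (storage, then journal, then ack), an entry only becomes readable once a later add carries a LAC that covers it, and by then it has already been journaled.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of one writer and one bookie; not BookKeeper's real classes. */
class LacOrderingSketch {
    // Hypothetical stand-ins for ledger storage (memtable) and the journal.
    private final Map<Long, String> ledgerStorage = new HashMap<>();
    private final Map<Long, String> journal = new HashMap<>();
    private long bookieLac = -1; // LAC recorded on the bookie
    private long clientLac = -1; // LAC tracked by the writer

    /** Bookie side: storage first, then journal, then ack (the order in the thread). */
    boolean addEntry(long entryId, long piggybackedLac, String payload) {
        ledgerStorage.put(entryId, payload);
        bookieLac = Math.max(bookieLac, piggybackedLac);
        journal.put(entryId, payload);
        return true; // the ack is only sent after the journal write
    }

    /** Writer side: the LAC advances only after the ack comes back. */
    void write(long entryId, String payload) {
        if (addEntry(entryId, clientLac, payload)) {
            clientLac = entryId;
        }
    }

    /** Reader side: entries beyond the bookie's LAC are not readable. */
    String read(long entryId) {
        return entryId <= bookieLac ? ledgerStorage.get(entryId) : null;
    }

    boolean isJournaled(long entryId) {
        return journal.containsKey(entryId);
    }

    long getBookieLac() {
        return bookieLac;
    }

    public static void main(String[] args) {
        LacOrderingSketch s = new LacOrderingSketch();
        s.write(0, "a");
        // Entry 0 is in storage and journaled, but the bookie's LAC is still -1.
        System.out.println("read(0) before next add: " + s.read(0));
        s.write(1, "b"); // carries lac = 0
        System.out.println("read(0) after next add: " + s.read(0)
                + ", journaled: " + s.isJournaled(0));
    }
}
```

In this model, read(0) only succeeds after the add of entry 1 has delivered lac = 0, which the writer could only send after entry 0 was acked.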


>
>
>
> > So adding the entry to memory doesn't expose any consistency issue.
> >
> > On May 1, 2017 5:44 PM, "Venkateswara Rao Jujjuri" <ju...@gmail.com>
> > wrote:
> >
> > On Mon, May 1, 2017 at 2:31 PM, Yiming Zang <yz...@twitter.com.invalid>
> > wrote:
> >
> > > Hi Andrey,
> > >
> > > That's a good point, and you're correct that if the write to the
> > > memTable gets throttled somehow, the addEntry request latency will be
> > > affected a lot. This has actually happened a few times in a production
> > > cluster. Normally, the idea of using a journal is to write data to the
> > > write-ahead log and then persist the actual data to disk or add it to
> > > the memTable. However, my understanding of why we chose to write the
> > > entry to ledgerStorage first is that it improves tailing-read
> > > performance.
> > >
> > > In SortedLedgerStorage.java, we first add the entry to the memTable and
> > > then update lastAddConfirmed, which means that a long poll read request
> > > or a readLastAddConfirmed request will immediately be satisfied for the
> > > latest entry before we actually log the entry into the journal. So a
> > > tailing read doesn't need to wait for any disk operation in BookKeeper,
> > > including the journal operation.
> > >
> > > public long addEntry(ByteBuffer entry) throws IOException {
> > > long ledgerId = entry.getLong();
> > > long entryId = entry.getLong();
> > > long lac = entry.getLong();
> > > entry.rewind();
> > > memTable.addEntry(ledgerId, entryId, entry, this);
> > > ledgerCache.updateLastAddConfirmed(ledgerId, lac);
> > > return entryId;
> > > }
> > >
> > > But thinking about it, I'm wondering whether it's actually safe to
> > > update the LAC before we write the entry to the journal. What if we tell
> > > the client the LAC has been updated, but we then fail to write the entry
> > > to the journal and the bookie crashes at that time? Would this cause any
> > > inconsistency issue?
> > >
> >
> > Good point. This is indeed an inconsistency issue. BK guarantees "if you
> > read once you can read it all the time".
> > If it is really done for LAC, it is not really a good idea. Unless I am
> > missing something, this must be changed ASAP.
> >
> > Thanks,
> > JV
> >
> >

Re: why write to journal is happening after write to ledgerStorage?

Posted by Sijie Guo <gu...@gmail.com>.

On Mon, May 1, 2017 at 7:14 PM, Venkateswara Rao Jujjuri <ju...@gmail.com>
wrote:

> Maybe the issue is that we block the writer while we are flushing the
> ledger. I believe we should separate ledger writers and sync, and allow
> writes (to multiple buffers) while the disk sync is going on.


In theory, there will always be a point where memory is full and we have to
block waiting for it to be flushed, no matter how many buffers you are
using, no?

What is the real difference between using 2 large buffers and using
multiple smaller buffers?
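
As a back-of-envelope illustration of the first point, here is a sketch under a simplifying assumption that is not taken from the BookKeeper code: flushing runs continuously and concurrently with writes (a "fluid" model), so only the total buffer size and the gap between ingest rate and flush rate matter. The class name, method name, and the numbers are all hypothetical.

```java
/** Back-of-envelope model; the names and the fluid assumption are illustrative only. */
class ThrottleEstimate {
    /**
     * Seconds until the write buffers are full when entries arrive at
     * ingestBytesPerSec and are flushed at flushBytesPerSec.
     */
    static double secondsUntilFull(double totalBufferBytes,
                                   double ingestBytesPerSec,
                                   double flushBytesPerSec) {
        double netFillRate = ingestBytesPerSec - flushBytesPerSec;
        if (netFillRate <= 0) {
            // Flushing keeps up with ingest, so the buffers never fill.
            return Double.POSITIVE_INFINITY;
        }
        return totalBufferBytes / netFillRate;
    }

    public static void main(String[] args) {
        double mb = 1024 * 1024;
        // Two 32 MB buffers vs. eight 8 MB buffers: same total, same estimate.
        System.out.println(secondsUntilFull(2 * 32 * mb, 100 * mb, 60 * mb));
        System.out.println(secondsUntilFull(8 * 8 * mb, 100 * mb, 60 * mb));
    }
}
```

This model deliberately ignores buffer granularity, for example when a flush can start and how large each flush is, which is exactly where several smaller buffers could behave differently in practice. It only shows that when ingest stays above flush throughput, writers eventually block regardless of how the memory is split.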


> One can configure the number of buffers and their sizes based on the
> latency and throughput ratio of the journal and ledger disks.
>
> On Mon, May 1, 2017 at 7:04 PM Yiming Zang <yz...@twitter.com.invalid>
> wrote:
>
> > What about long poll read? I know it's not in the OSS version yet, but if
> > we're going to support long poll read in the open source version in the
> > future, we need to think about this. Based on my understanding, a long
> > poll read request will be satisfied once the LAC is advanced to
> > previousLAC+1.
> >
> > For example, say the current LAC is 1 and a client issues a long poll
> > read request for entry 2 with previousLAC=1. Now we send a write request
> > for entry 2. Once that entry is added to the memTable and the LAC is
> > advanced to 2, but before the entry is written to the journal, the long
> > poll read request can already be satisfied, and the client will get a
> > response for entry 2. But in fact, entry 2 could still fail to be written
> > to the journal, or the bookie could crash at that time. Will that cause
> > an inconsistency?
> >
> > Correct me if I'm wrong.
> >
> > Best,
> > Yiming
> >
> > On Mon, May 1, 2017 at 6:52 PM, Sijie Guo <gu...@gmail.com> wrote:
> >
> > > On Mon, May 1, 2017 at 6:35 PM, Venkateswara Rao Jujjuri <
> > > jujjuri@gmail.com>
> > > wrote:
> > >
> > > > The real problem/issue is that having an extremely fast journal disk
> > > > doesn't really mask write latencies from a slower ledger disk.
> > > >
> > >
> > > In most cases, it does mask the write latencies from a slower ledger
> > > disk, because the write should only happen in memory.
> > >
> > > The worst case here is that the write is being throttled, which
> > > typically means something really bad is happening.
> > >
> > > In this case, would a larger write buffer help?
> > >
> > >
> > > >
> > > > To address this rate correctness issue, can't we read from the journal
> > > > if the entryid >= LAC (as we cache now on bookie) and journal read
> > > > fails?
> > > >
> > >
> > > First, the correctness issue isn't in the entryid >= LAC case, as a
> > > client can't really read beyond the LAC. The correctness issue is in the
> > > entryid <= LAC case: the entry appears in the journal but not in ledger
> > > storage.
> > >
> > > Second, the purpose of having a separate journal disk is to avoid reads
> > > on the journal that would impact writes. If we route reads back to the
> > > journal, this would potentially cause performance degradation on writes
> > > as well.
> > >
> > > Last, in order to be able to read from the journal, you basically still
> > > need to add some index structures in memory, so you know where to look
> > > up the entries. No matter whether you store an entry in the memtable or
> > > just store an entry pointer pointing back to the journal, you will still
> > > hit the same problem: you can't keep everything in memory, so you have
> > > to write data back to disk, and the throttling would happen again.
> > >
> > > Given these three points, I don't see much value in changing this.
> > > Instead, the question is simpler: can you increase the memory buffer
> > > size? If you can't, that means your hardware's capacity basically can't
> > > keep up with the incoming write traffic. More capacity is needed then.
> > >
> > > - Sijie
> > >
> > >
> > >
> > > >
> > > > On Mon, May 1, 2017 at 6:33 PM, Sijie Guo <gu...@gmail.com>
> wrote:
> > > >
> > > > > Another way to think about this:
> > > > >
> > > > > when 'throttling' happens, it typically means:
> > > > >
> > > > > - the bookie doesn't have enough bandwidth/capacity to keep up with
> > > > > the traffic.
> > > > > - the disks on the bookie might have problems (e.g. slowdowns or
> > > > > other hardware issues).
> > > > >
> > > > > Either case can happen. It might be worth letting the throttling
> > > > > kick in, rather than letting the journal disk accept writes and
> > > > > putting ledger storage into a worse state.
> > > > >
> > > > > - Sijie
> > > > >
> > > > > On Mon, May 1, 2017 at 6:23 PM, Sijie Guo <gu...@gmail.com>
> > wrote:
> > > > >
> > > > > >
> > > > > >
> > > > > > On Mon, May 1, 2017 at 6:14 PM, Venkateswara Rao Jujjuri <
> > > > > > jujjuri@gmail.com> wrote:
> > > > > >
> > > > > >> On Mon, May 1, 2017 at 6:03 PM, Venkateswara Rao Jujjuri <
> > > > > >> jujjuri@gmail.com>
> > > > > >> wrote:
> > > > > >>
> > > > > >> >
> > > > > >> >
> > > > > >> > On Mon, May 1, 2017 at 5:56 PM, Sijie Guo <guosijie@gmail.com> wrote:
> > > > > >> >
> > > > > >> >> I don't think this is an inconsistency issue. The in-memory
> > > > > >> >> update is updating the LAC, not the current entry. Even though
> > > > > >> >> the entry is added into memory, it will not be readable until
> > > > > >> >> the LAC is advanced, and the LAC is advanced only after the
> > > > > >> >> next entry is added, which happens after the current entry is
> > > > > >> >> acked.
> > > > > >> >>
> > > > > >> >
> > > > > >> > That is not true. You are talking about the piggy-backed LAC
> > > > > >> > only. But with explicit LAC you don't need the next entry to
> > > > > >> > move the LAC on the bookie.
> > > > > >> >
> > > > > >>
> > > > > >> Sorry, I pushed send before finishing. :)
> > > > > >>
> > > > > >> So you don't need the next entry to move the LAC forward, but it
> > > > > >> is the client's job to move the LAC forward.
> > > > > >> Hence the client needs to send an explicit LAC to update the LAC
> > > > > >> after it hears back from the AckQuorum.
> > > > > >> Hence Sijie is right on this part; it is not a consistency issue. :)
> > > > > >>
> > > > > >>
> > > > > >> But nevertheless, I believe we need to change the order, as it is
> > > > > >> not completely shielding writes from other activity. @Sijie do
> > > > > >> you see any issue if we write to the journal, ack to the client,
> > > > > >> and then write to the ledger?
> > > > > >>
> > > > > >>
> > > > > >
> > > > > > Based on my understanding of this email thread, the concern comes
> > > > > > from the write latency. However, it doesn't change the latency
> > > > > > behavior if you add to the journal first and add to the memtable
> > > > > > later. 'Throttling' will still happen when you add the entry to
> > > > > > the memtable.
> > > > > >
> > > > > > So the question would be "can we write to the journal and ack
> > > > > > back immediately after the journal write, and add the entry to
> > > > > > the memtable in the background"?
> > > > > >
> > > > > > The answer would be "no", because this would violate correctness.
> > > > > > It might end up in a case where the LAC is already advanced but
> > > > > > the entry is not found. It can happen in the following sequence:
> > > > > >
> > > > > > - The client issues a write for entry N (lac = N-1).
> > > > > > - The bookie writes the entry to the journal and acknowledges.
> > > > > > Entry N is in the journal but hasn't been added to the memtable.
> > > > > > - The client receives the acknowledgement and advances the LAC
> > > > > > from N-1 to N.
> > > > > > - The client writes another entry N+1 (lac = N) to advance the LAC.
> > > > > > - Another client (a reader) detects that the LAC has advanced
> > > > > > from N-1 to N. It attempts to read entry N, but N hasn't been
> > > > > > added to ledger storage. (*Correctness is violated here*)
> > > > > >
> > > > > > So to summarize my thoughts on this:
> > > > > >
> > > > > > - The acknowledgement should happen after both writing the entry
> > > > > > to the journal and writing the entry to the memtable.
> > > > > > - The order of writing the entry to the journal and writing the
> > > > > > entry to the memtable doesn't matter here.
> > > > > > - Writing the entry to the memtable helps with tailing latency
> > > > > > (because it will advance the LAC first).
> > > > > >
> > > > > > - Sijie
> > > > > >
> > > > > >
> > > > > >>
> > > > > >> JV
> > > > > >>
> > > > > >>
> > > > > >> >
> > > > > >> >
> > > > > >> >> So adding the entry to memory doesn't expose any consistency
> > > issue.
> > > > > >> >>
> > > > > >> >> On May 1, 2017 5:44 PM, "Venkateswara Rao Jujjuri" <
> > > > > jujjuri@gmail.com>
> > > > > >> >> wrote:
> > > > > >> >>
> > > > > >> >> On Mon, May 1, 2017 at 2:31 PM, Yiming Zang
> > > > > <yzang@twitter.com.invalid
> > > > > >> >
> > > > > >> >> wrote:
> > > > > >> >>
> > > > > >> >> > Hi Andrey,
> > > > > >> >> >
> > > > > >> >> > That's a good point, and you're actually correct that if
> > write
> > > to
> > > > > >> >> memTable
> > > > > >> >> > got throttled somehow, the addEntry request latency will be
> > > > > affected
> > > > > >> a
> > > > > >> >> lot.
> > > > > >> >> > This actually happens a few times in production cluster.
> > > > Normally,
> > > > > >> the
> > > > > >> >> idea
> > > > > >> >> > of using Journal is to write data to the write-ahead log
> and
> > > then
> > > > > >> >> persist
> > > > > >> >> > the actual data to disks or add to memTable. However, my
> > > > > >> understanding
> > > > > >> >> of
> > > > > >> >> > why we choose to write entry to ledgerStorage first is to
> > > improve
> > > > > the
> > > > > >> >> > tailing-read performance.
> > > > > >> >> >
> > > > > >> >> > In SortedLedgerStorage.java, we first add entry to memTable
> > and
> > > > > then
> > > > > >> we
> > > > > >> >> > update lastAddConfirmed, which means if there's a long poll
> > > read
> > > > > >> request
> > > > > >> >> or
> > > > > >> >> > readLastAddConfirmed request, it will immediately get
> > satisfied
> > > > for
> > > > > >> the
> > > > > >> >> > latest entry before we actually log the entry into Journal.
> > So
> > > > > >> >> tailing-read
> > > > > >> >> > doesn't actually need to wait for any disk operation in
> > > > Bookkeeper
> > > > > >> >> > including Journal operation.
> > > > > >> >> >
> > > > > >> >> > public long addEntry(ByteBuffer entry) throws IOException {
> > > > > >> >> > long ledgerId = entry.getLong();
> > > > > >> >> > long entryId = entry.getLong();
> > > > > >> >> > long lac = entry.getLong();
> > > > > >> >> > entry.rewind();
> > > > > >> >> > memTable.addEntry(ledgerId, entryId, entry, this);
> > > > > >> >> > ledgerCache.updateLastAddConfirmed(ledgerId, lac);
> > > > > >> >> > return entryId;
> > > > > >> >> > }
> > > > > >> >> >
> > > > > >> >> > But thinking about here, I'm wondering if it's actually
> safe
> > to
> > > > > >> update
> > > > > >> >> the
> > > > > >> >> > LAC before we write the entry to Journal. What if we tell
> the
> > > > > client
> > > > > >> the
> > > > > >> >> > LAC has been updated but we actually failed to write the
> > entry
> > > to
> > > > > >> >> Journal
> > > > > >> >> > and Bookie crashed at that time? Would this bring any
> > > > inconsistency
> > > > > >> >> issue?
> > > > > >> >> >
> > > > > >> >>
> > > > > >> >> Good point. This is indeed an inconsistency issue. BK
> > guarantees
> > > > "if
> > > > > >> you
> > > > > >> >> read once you can read it all the time".
> > > > > >> >> If it is really done for LAC it is not really good idea.
> > Unless I
> > > > am
> > > > > >> >> missing something, this must be changed ASAP.
> > > > > >> >>
> > > > > >> >> Thanks,
> > > > > >> >> JV
> > > > > >> >>
> > > > > >> >>
> > > > > >> >> >
> > > > > >> >> > On Mon, May 1, 2017 at 2:13 PM, Andrey Yegorov <
> > > > > >> >> andrey.yegorov@gmail.com>
> > > > > >> >> > wrote:
> > > > > >> >> >
> > > > > >> >> > > Hi,
> > > > > >> >> > >
> > > > > >> >> > > Looking at the code in Bookie.java I noticed that write
> to
> > > > > journal
> > > > > >> >> (which
> > > > > >> >> > > is supposed to be a write-ahead log as I understand)
> > happened
> > > > > after
> > > > > >> >> write
> > > > > >> >> > > to ledger storage.
> > > > > >> >> > > This looks counter-intuitive, can someone explain why is
> it
> > > > done
> > > > > in
> > > > > >> >> this
> > > > > >> >> > > order?
> > > > > >> >> > >
> > > > > >> >> > > My primary concern is that ledger storage write can be
> > > delayed
> > > > > >> (i.e.
> > > > > >> >> > > EntryMemTable's addEntry can do throttleWriters() in some
> > > > cases)
> > > > > >> thus
> > > > > >> >> > > dragging overall client's view of add latency up even
> > though
> > > it
> > > > > is
> > > > > >> >> > possible
> > > > > >> >> > > that journal's write (i.e. in case of dedicated journal
> > disk)
> > > > > will
> > > > > >> >> > complete
> > > > > >> >> > > faster.
> > > > > >> >> > >
> > > > > >> >> > >     private void addEntryInternal(LedgerDescriptor
> handle,
> > > > > >> ByteBuffer
> > > > > >> >> > > entry, WriteCallback cb, Object ctx)
> > > > > >> >> > >
> > > > > >> >> > >             throws IOException, BookieException {
> > > > > >> >> > >
> > > > > >> >> > >         long ledgerId = handle.getLedgerId();
> > > > > >> >> > >
> > > > > >> >> > >         entry.rewind();
> > > > > >> >> > >
> > > > > >> >> > > *// ledgerStorage.addEntry() is happening here*
> > > > > >> >> > >
> > > > > >> >> > >         long entryId = handle.addEntry(entry);
> > > > > >> >> > >
> > > > > >> >> > >
> > > > > >> >> > >         entry.rewind();
> > > > > >> >> > >
> > > > > >> >> > >         writeBytes.add(entry.remaining());
> > > > > >> >> > >
> > > > > >> >> > >
> > > > > >> >> > >         LOG.trace("Adding {}@{}", entryId, ledgerId);
> > > > > >> >> > >
> > > > > >> >> > > *// journal add entry is happening here*
> > > > > >> >> > >
> > > > > >> >> > > *// callback/response to client is sent after journal add
> > is
> > > > > done.*
> > > > > >> >> > >
> > > > > >> >> > >         journal.logAddEntry(entry, cb, ctx);
> > > > > >> >> > >
> > > > > >> >> > >     }
> > > > > >> >> > >
> > > > > >> >> > >
> > > > > >> >> > >
> > > > > >> >> > > ----------
> > > > > >> >> > > Andrey Yegorov
> > > > > >> >> > >
> > > > > >> >> >
> > > > > >> >>
> > > > > >> >>
> > > > > >> >>
> > > > > >> >> --
> > > > > >> >> Jvrao
> > > > > >> >> ---
> > > > > >> >> First they ignore you, then they laugh at you, then they
> fight
> > > you,
> > > > > >> then
> > > > > >> >> you win. - Mahatma Gandhi
> > > > > >> >>
> > > > > >> >
> > > > > >> >
> > > > > >> >
> > > > > >> > --
> > > > > >> > Jvrao
> > > > > >> > ---
> > > > > >> > First they ignore you, then they laugh at you, then they fight
> > > you,
> > > > > then
> > > > > >> > you win. - Mahatma Gandhi
> > > > > >> >
> > > > > >> >
> > > > > >> >
> > > > > >>
> > > > > >>
> > > > > >> --
> > > > > >> Jvrao
> > > > > >> ---
> > > > > >> First they ignore you, then they laugh at you, then they fight
> > you,
> > > > then
> > > > > >> you win. - Mahatma Gandhi
> > > > > >>
> > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Jvrao
> > > > ---
> > > > First they ignore you, then they laugh at you, then they fight you,
> > then
> > > > you win. - Mahatma Gandhi
> > > >
> > >
> >
> --
> Sent from iPhone
>
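
The sequence quoted earlier in this message (ack right after the journal write, add to the memtable in the background) can be replayed with a toy model. This is a sketch under stated assumptions, not BookKeeper code: all class and method names are hypothetical, there are two bookies that each receive every entry, each bookie records the piggy-backed LAC when its background add runs, and the reader learns the LAC from the bookie that has caught up while reading the entry from the one that has not.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/** Toy replay of the "ack before ledger storage" sequence; names are hypothetical. */
class AckBeforeStorageSketch {
    /** Bookie variant that acks after the journal write and fills storage later. */
    static class ToyBookie {
        final Map<Long, String> journal = new HashMap<>();
        final Map<Long, String> ledgerStorage = new HashMap<>();
        final Queue<long[]> backlog = new ArrayDeque<>(); // {entryId, lac} awaiting storage
        long lac = -1;

        /** Hypothetical reordered add: journal, ack, storage in the background. */
        boolean addEntry(long entryId, long piggybackedLac, String payload) {
            journal.put(entryId, payload);
            backlog.add(new long[] {entryId, piggybackedLac});
            return true; // acked before the entry reaches ledger storage
        }

        /** Stands in for the background thread catching up. */
        void drainBacklog() {
            long[] item;
            while ((item = backlog.poll()) != null) {
                ledgerStorage.put(item[0], journal.get(item[0]));
                lac = Math.max(lac, item[1]);
            }
        }

        String read(long entryId) {
            return ledgerStorage.get(entryId);
        }
    }

    /** Returns true if the reader finds entry N on the slow bookie. */
    static boolean readerFindsEntryN(boolean slowBookieCaughtUp) {
        ToyBookie fast = new ToyBookie();
        ToyBookie slow = new ToyBookie();
        long n = 5;
        long clientLac = n - 1;

        // The client writes entry N (lac = N-1); both bookies journal it and ack.
        boolean ackedByFast = fast.addEntry(n, clientLac, "entry-N");
        boolean ackedBySlow = slow.addEntry(n, clientLac, "entry-N");
        if (ackedByFast && ackedBySlow) {
            clientLac = n; // the client advances the LAC from N-1 to N
        }
        // The client writes entry N+1 (lac = N).
        fast.addEntry(n + 1, clientLac, "entry-N+1");
        slow.addEntry(n + 1, clientLac, "entry-N+1");

        fast.drainBacklog();
        if (slowBookieCaughtUp) {
            slow.drainBacklog();
        }
        // The reader sees LAC = N on the fast bookie, then reads N from the slow one.
        long readerLac = fast.lac;
        return readerLac >= n && slow.read(n) != null;
    }

    public static void main(String[] args) {
        System.out.println("slow bookie lagging:   " + readerFindsEntryN(false));
        System.out.println("slow bookie caught up: " + readerFindsEntryN(true));
    }
}
```

While the slow bookie's background add is still pending, entry N sits in its journal but is missing from its ledger storage, even though the LAC already covers N. Acknowledging only after both writes removes that window in this model.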

Re: why write to journal is happening after write to ledgerStorage?

Posted by Venkateswara Rao Jujjuri <ju...@gmail.com>.

Maybe the issue is that we block the writer while we are flushing the
ledger. I believe we should separate ledger writers and sync, and allow
writes (to multiple buffers) while the disk sync is going on. One can
configure the number of buffers and their sizes based on the latency and
throughput ratio of the journal and ledger disks.

> > > > >> >> > > faster.
> > > > >> >> > >
> > > > >> >> > >     private void addEntryInternal(LedgerDescriptor handle,
> > > > >> ByteBuffer
> > > > >> >> > > entry, WriteCallback cb, Object ctx)
> > > > >> >> > >
> > > > >> >> > >             throws IOException, BookieException {
> > > > >> >> > >
> > > > >> >> > >         long ledgerId = handle.getLedgerId();
> > > > >> >> > >
> > > > >> >> > >         entry.rewind();
> > > > >> >> > >
> > > > >> >> > > *// ledgerStorage.addEntry() is happening here*
> > > > >> >> > >
> > > > >> >> > >         long entryId = handle.addEntry(entry);
> > > > >> >> > >
> > > > >> >> > >
> > > > >> >> > >         entry.rewind();
> > > > >> >> > >
> > > > >> >> > >         writeBytes.add(entry.remaining());
> > > > >> >> > >
> > > > >> >> > >
> > > > >> >> > >         LOG.trace("Adding {}@{}", entryId, ledgerId);
> > > > >> >> > >
> > > > >> >> > > *// journal add entry is happening here*
> > > > >> >> > >
> > > > >> >> > > *// callback/response to client is sent after journal add
> is
> > > > done.*
> > > > >> >> > >
> > > > >> >> > >         journal.logAddEntry(entry, cb, ctx);
> > > > >> >> > >
> > > > >> >> > >     }
> > > > >> >> > >
> > > > >> >> > >
> > > > >> >> > >
> > > > >> >> > > ----------
> > > > >> >> > > Andrey Yegorov
> > > > >> >> > >
> > > > >> >> >
> > > > >> >>
> > > > >> >>
> > > > >> >>
> > > > >> >> --
> > > > >> >> Jvrao
> > > > >> >> ---
> > > > >> >> First they ignore you, then they laugh at you, then they fight
> > you,
> > > > >> then
> > > > >> >> you win. - Mahatma Gandhi
> > > > >> >>
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> > --
> > > > >> > Jvrao
> > > > >> > ---
> > > > >> > First they ignore you, then they laugh at you, then they fight
> > you,
> > > > then
> > > > >> > you win. - Mahatma Gandhi
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >>
> > > > >>
> > > > >> --
> > > > >> Jvrao
> > > > >> ---
> > > > >> First they ignore you, then they laugh at you, then they fight
> you,
> > > then
> > > > >> you win. - Mahatma Gandhi
> > > > >>
> > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Jvrao
> > > ---
> > > First they ignore you, then they laugh at you, then they fight you,
> then
> > > you win. - Mahatma Gandhi
> > >
> >
>

Re: why write to journal is happening after write to ledgerStorage?

Posted by Sijie Guo <gu...@gmail.com>.

On Mon, May 1, 2017 at 7:04 PM, Yiming Zang <yz...@twitter.com.invalid>
wrote:

> What about long poll read? I know it's not in OSS version yet, but if we're
> going to support long poll read in open source version in the future we
> need to think about this. Based on my understanding, the long poll read
> request will be satisfied once the LAC is advanced to previousLAC+1.
>
> For example, the current lac=1 and client issue a long poll read request
> for entry 2 with previousLAC=1. Now we send a write request for entry 2,
> and when that entry is added to memTable and LAC is advanced to 2,


When adding entry 2, LAC is not 2. LAC is always less than the entry id.
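
To illustrate the point above, here is a minimal sketch of the piggy-backed
LAC behavior. It is a simplified, hypothetical model (the class and method
names are made up for this example and are not BookKeeper code), assuming
each add request carries the writer's LAC at send time:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of one ledger on a bookie. Not BookKeeper code.
final class PiggybackLacSketch {
    private final Map<Long, String> entries = new HashMap<>();
    private long lac = -1L; // nothing confirmed yet

    // Each add carries the writer's LAC at send time. The entry being added
    // has not been acknowledged yet, so that LAC is always below its entry id.
    void addEntry(long entryId, long piggybackedLac, String payload) {
        if (piggybackedLac >= entryId) {
            throw new IllegalArgumentException("LAC must be less than the entry id");
        }
        entries.put(entryId, payload);
        lac = Math.max(lac, piggybackedLac);
    }

    long getLac() {
        return lac;
    }

    // A reader that respects LAC never sees an entry beyond it.
    String readConfirmed(long entryId) {
        return entryId <= lac ? entries.get(entryId) : null;
    }

    public static void main(String[] args) {
        PiggybackLacSketch bookie = new PiggybackLacSketch();
        bookie.addEntry(1, 0, "e1");
        bookie.addEntry(2, 1, "e2"); // adding entry 2 moves LAC to 1, not 2
        System.out.println("LAC after adding entry 2: " + bookie.getLac());
        System.out.println("entry 2 readable: " + (bookie.readConfirmed(2) != null));
        bookie.addEntry(3, 2, "e3"); // only the next add confirms entry 2
        System.out.println("entry 2 readable: " + (bookie.readConfirmed(2) != null));
    }
}
```

In this model entry 2 sits in memory after the add, but a reader bounded by
LAC cannot see it until a later request advances LAC to 2.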


> and that
> entry is not yet written to Journal, the long poll read request can already
> be satisfied, and client will get response back for entry 2, but in fact,
> entry 2 could even failed to be written to Journal, or bookie could crash
> at that time. Will that bring inconsistency?
>
> Correct me if I'm wrong.
>
> Best,
> Yiming
>
> On Mon, May 1, 2017 at 6:52 PM, Sijie Guo <gu...@gmail.com> wrote:
>
> > On Mon, May 1, 2017 at 6:35 PM, Venkateswara Rao Jujjuri <
> > jujjuri@gmail.com>
> > wrote:
> >
> > > The real problem/issue is - having extremely fast journal disk doesn't
> > > really mask write latencies from a slower ledger disk.
> > >
> >
> > In most of the case, it mask the write latencies from a slower ledger
> disk.
> > Because the write should only happen in the memory.
> >
> > The worse case here is the write is being throttled - that typically
> means
> > something really bad happening.
> >
> > In this case, a larger write buffer would help?
> >
> >
> > >
> > > To address this rate correctness issue, cant we read from journal if
> the
> > > entryid >= LAC (as we cache now on bookie) and journal read fails?
> > >
> >
> > First, the correctness isn't entryid >= LAC case, as client can't really
> > read beyond LAC. The correctness issue is on entryid <= LAC case: the
> entry
> > appears on journal but not in ledger storage.
> >
> > Second, the purpose of having a separate journal disk is to avoid reads
> on
> > journal that would impact writes. If we circle back reads on journals,
> this
> > would potentially  cause performance degradation on writes as well.
> >
> > Last, in order to be able to read journals, you basically still need to
> add
> > some indexed structures into memory, so you know where to look up the
> > entries. No matter you store an entry in memtable or just store an entry
> > pointer pointing back to journal, you will still hit same problem - as
> you
> > can keep everything in memory, which you have to write data back to disks
> > and the throttling would happen again.
> >
> > From all these three points, I don't see too much value about changing
> > this. Instead, the question would be simpler - can you increase the
> memory
> > buffer size? If you can't, that means your hardware's capacity can't
> > basically keep up with the incoming write traffic. More capacity is
> needed
> > then.
> >
> > - Sijie
> >
> >
> >
> > >
> > > On Mon, May 1, 2017 at 6:33 PM, Sijie Guo <gu...@gmail.com> wrote:
> > >
> > > > In the other to think about this,
> > > >
> > > > when 'throttling' happens,  it typically means:
> > > >
> > > > - the bookie doesn't have enough bandwidth/capacity to keep up with
> the
> > > > traffic.
> > > > - the disks on the bookie might have problems (e.g. slow down or
> other
> > > > hardware issues).
> > > >
> > > > Either case can happen. It might be worth to let the throttling kick
> > in,
> > > > rather than let journal disk accepting writes and putting ledger
> > storage
> > > > into worse state.
> > > >
> > > > - Sijie
> > > >
> > > > On Mon, May 1, 2017 at 6:23 PM, Sijie Guo <gu...@gmail.com>
> wrote:
> > > >
> > > > >
> > > > >
> > > > > On Mon, May 1, 2017 at 6:14 PM, Venkateswara Rao Jujjuri <
> > > > > jujjuri@gmail.com> wrote:
> > > > >
> > > > >> On Mon, May 1, 2017 at 6:03 PM, Venkateswara Rao Jujjuri <
> > > > >> jujjuri@gmail.com>
> > > > >> wrote:
> > > > >>
> > > > >> >
> > > > >> >
> > > > >> > On Mon, May 1, 2017 at 5:56 PM, Sijie Guo <gu...@gmail.com>
> > > wrote:
> > > > >> >
> > > > >> >> I don't think this is an inconsistent issue. The in memory
> update
> > > is
> > > > >> >> updating lac not current entry. Even the entry is added into
> > memory
> > > > but
> > > > >> >> this entry will not be readable after lac is advanced, lac is
> > > > advanced
> > > > >> >> only
> > > > >> >> after the next entry is added which happened after current
> entry
> > is
> > > > >> acked.
> > > > >> >>
> > > > >> >
> > > > >> > That is not true. You are talking about piggy-backed LAC only.
> But
> > > > with
> > > > >> > Explicit LAC
> > > > >> > you don't need next entry to move LAC on bookie.
> > > > >> >
> > > > >>
> > > > >> Sorry, I pushed send before finishing. :)
> > > > >>
> > > > >> So you don't need next entry to move LAC forward, but its client
> job
> > > to
> > > > >> move LAC forward.
> > > > >> Hence client need to send explicit LAC to update LAC after it hear
> > > back
> > > > >> from AckQuorum.
> > > > >> Hence Sijie is right on this part, it is not a consistency issue.
> :)
> > > > >>
> > > > >>
> > > > >> But never the less, I believe we need to change the order as it is
> > not
> > > > >> completely shielding
> > > > >> writes from other activity. @Sijie do you see any issue if we
> write
> > to
> > > > >> journal, ack to client
> > > > >> and the write to ledger ?
> > > > >>
> > > > >
> > > > > Based on my understanding about this email thread, the concern
> comes
> > > from
> > > > > the latency on write. However, it doesn't change any latency
> behavior
> > > if
> > > > > you add to journal first and add to memtable later. 'Throttling'
> will
> > > > still
> > > > > happen when you add entry to memtable.
> > > > >
> > > > > So the question would be "can we write to journal and back back
> > > immediate
> > > > > after written to journal, and add the entry to memtable in
> > background"?
> > > > >
> > > > > The answer would be "no". Because this would volatile the
> > correctness.
> > > It
> > > > > might end up a case - the lac is already advanced but the entry is
> > not
> > > > > found - it can happen in following sequence.
> > > > >
> > > > > - Client issue write entry N (lac = N-1)
> > > > > - Bookie write the entry to the journal and acknowledge. Entry N is
> > in
> > > > the
> > > > > journal but haven't been added to the memtable.
> > > > > - Client received the acknowledge and advanced LAC from N-1 to N.
> > > > > - Client write another entry N+1 (lac = N) to advance LAC.
> > > > > - Another client (reader) detects LAC is advanced from N-1 to N. it
> > > > > attempts to read entry N but N isn't added to ledger storage. (*The
> > > > > correctness is volatiled here*)
> > > > >
> > > > > So to summarize my thoughts on this:
> > > > >
> > > > > - The acknowledge should happen after both writing the entry to
> > journal
> > > > > and write the entry to memtable.
> > > > > - The order of writing the entry to journal and writing entry to
> > > memtable
> > > > > doesn't matter here.
> > > > > - Writing the entry to the memtable helps with tailing latency
> > (because
> > > > it
> > > > > will advance LAC first).
> > > > >
> > > > > - Sijie
> > > > >
> > > > >
> > > > >>
> > > > >> JV
> > > > >>
> > > > >>
> > > > >> >
> > > > >> >
> > > > >> >> So adding the entry to memory doesn't expose any consistency
> > issue.
> > > > >> >>
> > > > >> >> On May 1, 2017 5:44 PM, "Venkateswara Rao Jujjuri" <
> > > > jujjuri@gmail.com>
> > > > >> >> wrote:
> > > > >> >>
> > > > >> >> On Mon, May 1, 2017 at 2:31 PM, Yiming Zang
> > > > <yzang@twitter.com.invalid
> > > > >> >
> > > > >> >> wrote:
> > > > >> >>
> > > > >> >> > Hi Andrey,
> > > > >> >> >
> > > > >> >> > That's a good point, and you're actually correct that if
> write
> > to
> > > > >> >> memTable
> > > > >> >> > got throttled somehow, the addEntry request latency will be
> > > > affected
> > > > >> a
> > > > >> >> lot.
> > > > >> >> > This actually happens a few times in production cluster.
> > > Normally,
> > > > >> the
> > > > >> >> idea
> > > > >> >> > of using Journal is to write data to the write-ahead log and
> > then
> > > > >> >> persist
> > > > >> >> > the actual data to disks or add to memTable. However, my
> > > > >> understanding
> > > > >> >> of
> > > > >> >> > why we choose to write entry to ledgerStorage first is to
> > improve
> > > > the
> > > > >> >> > tailing-read performance.
> > > > >> >> >
> > > > >> >> > In SortedLedgerStorage.java, we first add entry to memTable
> and
> > > > then
> > > > >> we
> > > > >> >> > update lastAddConfirmed, which means if there's a long poll
> > read
> > > > >> request
> > > > >> >> or
> > > > >> >> > readLastAddConfirmed request, it will immediately get
> satisfied
> > > for
> > > > >> the
> > > > >> >> > latest entry before we actually log the entry into Journal.
> So
> > > > >> >> tailing-read
> > > > >> >> > doesn't actually need to wait for any disk operation in
> > > Bookkeeper
> > > > >> >> > including Journal operation.
> > > > >> >> >
> > > > >> >> > public long addEntry(ByteBuffer entry) throws IOException {
> > > > >> >> > long ledgerId = entry.getLong();
> > > > >> >> > long entryId = entry.getLong();
> > > > >> >> > long lac = entry.getLong();
> > > > >> >> > entry.rewind();
> > > > >> >> > memTable.addEntry(ledgerId, entryId, entry, this);
> > > > >> >> > ledgerCache.updateLastAddConfirmed(ledgerId, lac);
> > > > >> >> > return entryId;
> > > > >> >> > }
> > > > >> >> >
> > > > >> >> > But thinking about here, I'm wondering if it's actually safe
> to
> > > > >> update
> > > > >> >> the
> > > > >> >> > LAC before we write the entry to Journal. What if we tell the
> > > > client
> > > > >> the
> > > > >> >> > LAC has been updated but we actually failed to write the
> entry
> > to
> > > > >> >> Journal
> > > > >> >> > and Bookie crashed at that time? Would this bring any
> > > inconsistency
> > > > >> >> issue?
> > > > >> >> >
> > > > >> >>
> > > > >> >> Good point. This is indeed an inconsistency issue. BK
> guarantees
> > > "if
> > > > >> you
> > > > >> >> read once you can read it all the time".
> > > > >> >> If it is really done for LAC it is not really good idea.
> Unless I
> > > am
> > > > >> >> missing something, this must be changed ASAP.
> > > > >> >>
> > > > >> >> Thanks,
> > > > >> >> JV
> > > > >> >>
> > > > >> >>
> > > > >> >> >
> > > > >> >> > On Mon, May 1, 2017 at 2:13 PM, Andrey Yegorov <
> > > > >> >> andrey.yegorov@gmail.com>
> > > > >> >> > wrote:
> > > > >> >> >
> > > > >> >> > > Hi,
> > > > >> >> > >
> > > > >> >> > > Looking at the code in Bookie.java I noticed that write to
> > > > journal
> > > > >> >> (which
> > > > >> >> > > is supposed to be a write-ahead log as I understand)
> happened
> > > > after
> > > > >> >> write
> > > > >> >> > > to ledger storage.
> > > > >> >> > > This looks counter-intuitive, can someone explain why is it
> > > done
> > > > in
> > > > >> >> this
> > > > >> >> > > order?
> > > > >> >> > >
> > > > >> >> > > My primary concern is that ledger storage write can be
> > delayed
> > > > >> (i.e.
> > > > >> >> > > EntryMemTable's addEntry can do throttleWriters() in some
> > > cases)
> > > > >> thus
> > > > >> >> > > dragging overall client's view of add latency up even
> though
> > it
> > > > is
> > > > >> >> > possible
> > > > >> >> > > that journal's write (i.e. in case of dedicated journal
> disk)
> > > > will
> > > > >> >> > complete
> > > > >> >> > > faster.
> > > > >> >> > >
> > > > >> >> > >     private void addEntryInternal(LedgerDescriptor handle,
> > > > >> ByteBuffer
> > > > >> >> > > entry, WriteCallback cb, Object ctx)
> > > > >> >> > >
> > > > >> >> > >             throws IOException, BookieException {
> > > > >> >> > >
> > > > >> >> > >         long ledgerId = handle.getLedgerId();
> > > > >> >> > >
> > > > >> >> > >         entry.rewind();
> > > > >> >> > >
> > > > >> >> > > *// ledgerStorage.addEntry() is happening here*
> > > > >> >> > >
> > > > >> >> > >         long entryId = handle.addEntry(entry);
> > > > >> >> > >
> > > > >> >> > >
> > > > >> >> > >         entry.rewind();
> > > > >> >> > >
> > > > >> >> > >         writeBytes.add(entry.remaining());
> > > > >> >> > >
> > > > >> >> > >
> > > > >> >> > >         LOG.trace("Adding {}@{}", entryId, ledgerId);
> > > > >> >> > >
> > > > >> >> > > *// journal add entry is happening here*
> > > > >> >> > >
> > > > >> >> > > *// callback/response to client is sent after journal add
> is
> > > > done.*
> > > > >> >> > >
> > > > >> >> > >         journal.logAddEntry(entry, cb, ctx);
> > > > >> >> > >
> > > > >> >> > >     }
> > > > >> >> > >
> > > > >> >> > >
> > > > >> >> > >
> > > > >> >> > > ----------
> > > > >> >> > > Andrey Yegorov
> > > > >> >> > >
> > > > >> >> >
> > > > >> >>
> > > > >> >>
> > > > >> >>
> > > > >> >> --
> > > > >> >> Jvrao
> > > > >> >> ---
> > > > >> >> First they ignore you, then they laugh at you, then they fight
> > you,
> > > > >> then
> > > > >> >> you win. - Mahatma Gandhi
> > > > >> >>
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> > --
> > > > >> > Jvrao
> > > > >> > ---
> > > > >> > First they ignore you, then they laugh at you, then they fight
> > you,
> > > > then
> > > > >> > you win. - Mahatma Gandhi
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >>
> > > > >>
> > > > >> --
> > > > >> Jvrao
> > > > >> ---
> > > > >> First they ignore you, then they laugh at you, then they fight
> you,
> > > then
> > > > >> you win. - Mahatma Gandhi
> > > > >>
> > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Jvrao
> > > ---
> > > First they ignore you, then they laugh at you, then they fight you,
> then
> > > you win. - Mahatma Gandhi
> > >
> >
>

Re: why write to journal is happening after write to ledgerStorage?

Posted by Yiming Zang <yz...@twitter.com.INVALID>.

What about long poll read? I know it's not in the OSS version yet, but if
we're going to support long poll read in the open source version in the
future, we need to think about this. Based on my understanding, a long poll
read request is satisfied once the LAC advances to previousLAC+1.

For example, suppose the current lac=1 and a client issues a long poll read
request for entry 2 with previousLAC=1. Now we send a write request for
entry 2. Once that entry is added to the memTable and the LAC is advanced to
2, but before the entry is written to the Journal, the long poll read
request can already be satisfied and the client will get a response for
entry 2. In fact, entry 2 could still fail to be written to the Journal, or
the bookie could crash at that time. Would that bring inconsistency?

Correct me if I'm wrong.

Best,
Yiming

On Mon, May 1, 2017 at 6:52 PM, Sijie Guo <gu...@gmail.com> wrote:

> On Mon, May 1, 2017 at 6:35 PM, Venkateswara Rao Jujjuri <
> jujjuri@gmail.com>
> wrote:
>
> > The real problem/issue is - having extremely fast journal disk doesn't
> > really mask write latencies from a slower ledger disk.
> >
>
> In most of the case, it mask the write latencies from a slower ledger disk.
> Because the write should only happen in the memory.
>
> The worse case here is the write is being throttled - that typically means
> something really bad happening.
>
> In this case, a larger write buffer would help?
>
>
> >
> > To address this rate correctness issue, cant we read from journal if the
> > entryid >= LAC (as we cache now on bookie) and journal read fails?
> >
>
> First, the correctness isn't entryid >= LAC case, as client can't really
> read beyond LAC. The correctness issue is on entryid <= LAC case: the entry
> appears on journal but not in ledger storage.
>
> Second, the purpose of having a separate journal disk is to avoid reads on
> journal that would impact writes. If we circle back reads on journals, this
> would potentially  cause performance degradation on writes as well.
>
> Last, in order to be able to read journals, you basically still need to add
> some indexed structures into memory, so you know where to look up the
> entries. No matter you store an entry in memtable or just store an entry
> pointer pointing back to journal, you will still hit same problem - as you
> can keep everything in memory, which you have to write data back to disks
> and the throttling would happen again.
>
> From all these three points, I don't see too much value about changing
> this. Instead, the question would be simpler - can you increase the memory
> buffer size? If you can't, that means your hardware's capacity can't
> basically keep up with the incoming write traffic. More capacity is needed
> then.
>
> - Sijie
>
>
>
> >
> > On Mon, May 1, 2017 at 6:33 PM, Sijie Guo <gu...@gmail.com> wrote:
> >
> > > In the other to think about this,
> > >
> > > when 'throttling' happens,  it typically means:
> > >
> > > - the bookie doesn't have enough bandwidth/capacity to keep up with the
> > > traffic.
> > > - the disks on the bookie might have problems (e.g. slow down or other
> > > hardware issues).
> > >
> > > Either case can happen. It might be worth to let the throttling kick
> in,
> > > rather than let journal disk accepting writes and putting ledger
> storage
> > > into worse state.
> > >
> > > - Sijie
> > >
> > > On Mon, May 1, 2017 at 6:23 PM, Sijie Guo <gu...@gmail.com> wrote:
> > >
> > > >
> > > >
> > > > On Mon, May 1, 2017 at 6:14 PM, Venkateswara Rao Jujjuri <
> > > > jujjuri@gmail.com> wrote:
> > > >
> > > >> On Mon, May 1, 2017 at 6:03 PM, Venkateswara Rao Jujjuri <
> > > >> jujjuri@gmail.com>
> > > >> wrote:
> > > >>
> > > >> >
> > > >> >
> > > >> > On Mon, May 1, 2017 at 5:56 PM, Sijie Guo <gu...@gmail.com>
> > wrote:
> > > >> >
> > > >> >> I don't think this is an inconsistent issue. The in memory update
> > is
> > > >> >> updating lac not current entry. Even the entry is added into
> memory
> > > but
> > > >> >> this entry will not be readable after lac is advanced, lac is
> > > advanced
> > > >> >> only
> > > >> >> after the next entry is added which happened after current entry
> is
> > > >> acked.
> > > >> >>
> > > >> >
> > > >> > That is not true. You are talking about piggy-backed LAC only. But
> > > with
> > > >> > Explicit LAC
> > > >> > you don't need next entry to move LAC on bookie.
> > > >> >
> > > >>
> > > >> Sorry, I pushed send before finishing. :)
> > > >>
> > > >> So you don't need next entry to move LAC forward, but its client job
> > to
> > > >> move LAC forward.
> > > >> Hence client need to send explicit LAC to update LAC after it hear
> > back
> > > >> from AckQuorum.
> > > >> Hence Sijie is right on this part, it is not a consistency issue. :)
> > > >>
> > > >>
> > > >> But never the less, I believe we need to change the order as it is
> not
> > > >> completely shielding
> > > >> writes from other activity. @Sijie do you see any issue if we write
> to
> > > >> journal, ack to client
> > > >> and the write to ledger ?
> > > >>
> > > >
> > > > > Based on my understanding of this email thread, the concern comes from
> > > > > the latency on write. However, it doesn't change any latency behavior if
> > > > > you add to the journal first and add to the memtable later. 'Throttling'
> > > > > will still happen when you add the entry to the memtable.
> > > > >
> > > > > So the question would be "can we write to the journal, ack back
> > > > > immediately after the journal write, and add the entry to the memtable
> > > > > in the background"?
> > > > >
> > > > > The answer would be "no", because this would violate correctness. It
> > > > > might end up in a case where the lac is already advanced but the entry
> > > > > is not found. It can happen in the following sequence:
> > > > >
> > > > > - Client issues a write for entry N (lac = N-1).
> > > > > - Bookie writes the entry to the journal and acknowledges. Entry N is in
> > > > > the journal but hasn't been added to the memtable.
> > > > > - Client receives the acknowledgement and advances LAC from N-1 to N.
> > > > > - Client writes another entry N+1 (lac = N) to advance LAC.
> > > > > - Another client (reader) detects that LAC has advanced from N-1 to N. It
> > > > > attempts to read entry N, but N hasn't been added to ledger storage. (*The
> > > > > correctness is violated here*)
> > > > >
> > > > > So to summarize my thoughts on this:
> > > > >
> > > > > - The acknowledgement should happen after both writing the entry to the
> > > > > journal and writing the entry to the memtable.
> > > > > - The order of writing the entry to the journal and writing the entry to
> > > > > the memtable doesn't matter here.
> > > > > - Writing the entry to the memtable first helps with tailing latency
> > > > > (because it will advance LAC first).
> > > >
> > > > - Sijie
> > > >
> > > >
> > > >>
> > > >> JV
> > > >>
> > > >>
> > > >> >
> > > >> >
> > > >> >> So adding the entry to memory doesn't expose any consistency
> issue.
> > > >> >>
> > > >> >> On May 1, 2017 5:44 PM, "Venkateswara Rao Jujjuri" <
> > > jujjuri@gmail.com>
> > > >> >> wrote:
> > > >> >>
> > > >> >> On Mon, May 1, 2017 at 2:31 PM, Yiming Zang
> > > <yzang@twitter.com.invalid
> > > >> >
> > > >> >> wrote:
> > > >> >>
> > > >> >> > Hi Andrey,
> > > >> >> >
> > > >> >> > That's a good point, and you're actually correct that if write
> to
> > > >> >> memTable
> > > >> >> > got throttled somehow, the addEntry request latency will be
> > > affected
> > > >> a
> > > >> >> lot.
> > > >> >> > This actually happens a few times in production cluster.
> > Normally,
> > > >> the
> > > >> >> idea
> > > >> >> > of using Journal is to write data to the write-ahead log and
> then
> > > >> >> persist
> > > >> >> > the actual data to disks or add to memTable. However, my
> > > >> understanding
> > > >> >> of
> > > >> >> > why we choose to write entry to ledgerStorage first is to
> improve
> > > the
> > > >> >> > tailing-read performance.
> > > >> >> >
> > > >> >> > In SortedLedgerStorage.java, we first add entry to memTable and
> > > then
> > > >> we
> > > >> >> > update lastAddConfirmed, which means if there's a long poll
> read
> > > >> request
> > > >> >> or
> > > >> >> > readLastAddConfirmed request, it will immediately get satisfied
> > for
> > > >> the
> > > >> >> > latest entry before we actually log the entry into Journal. So
> > > >> >> tailing-read
> > > >> >> > doesn't actually need to wait for any disk operation in
> > Bookkeeper
> > > >> >> > including Journal operation.
> > > >> >> >
> > > >> >> > public long addEntry(ByteBuffer entry) throws IOException {
> > > >> >> > long ledgerId = entry.getLong();
> > > >> >> > long entryId = entry.getLong();
> > > >> >> > long lac = entry.getLong();
> > > >> >> > entry.rewind();
> > > >> >> > memTable.addEntry(ledgerId, entryId, entry, this);
> > > >> >> > ledgerCache.updateLastAddConfirmed(ledgerId, lac);
> > > >> >> > return entryId;
> > > >> >> > }
> > > >> >> >
> > > >> >> > But thinking about here, I'm wondering if it's actually safe to
> > > >> update
> > > >> >> the
> > > >> >> > LAC before we write the entry to Journal. What if we tell the
> > > client
> > > >> the
> > > >> >> > LAC has been updated but we actually failed to write the entry
> to
> > > >> >> Journal
> > > >> >> > and Bookie crashed at that time? Would this bring any
> > inconsistency
> > > >> >> issue?
> > > >> >> >
> > > >> >>
> > > >> >> Good point. This is indeed an inconsistency issue. BK guarantees
> > "if
> > > >> you
> > > >> >> read once you can read it all the time".
> > > >> >> If it is really done for LAC it is not really good idea. Unless I
> > am
> > > >> >> missing something, this must be changed ASAP.
> > > >> >>
> > > >> >> Thanks,
> > > >> >> JV
> > > >> >>
> > > >> >>
> > > >> >> >
> > > >> >> > On Mon, May 1, 2017 at 2:13 PM, Andrey Yegorov <
> > > >> >> andrey.yegorov@gmail.com>
> > > >> >> > wrote:
> > > >> >> >
> > > >> >> > > Hi,
> > > >> >> > >
> > > >> >> > > Looking at the code in Bookie.java I noticed that write to
> > > journal
> > > >> >> (which
> > > >> >> > > is supposed to be a write-ahead log as I understand) happened
> > > after
> > > >> >> write
> > > >> >> > > to ledger storage.
> > > >> >> > > This looks counter-intuitive, can someone explain why is it
> > done
> > > in
> > > >> >> this
> > > >> >> > > order?
> > > >> >> > >
> > > >> >> > > My primary concern is that ledger storage write can be
> delayed
> > > >> (i.e.
> > > >> >> > > EntryMemTable's addEntry can do throttleWriters() in some
> > cases)
> > > >> thus
> > > >> >> > > dragging overall client's view of add latency up even though
> it
> > > is
> > > >> >> > possible
> > > >> >> > > that journal's write (i.e. in case of dedicated journal disk)
> > > will
> > > >> >> > complete
> > > >> >> > > faster.
> > > >> >> > >
> > > >> >> > >     private void addEntryInternal(LedgerDescriptor handle,
> > > >> ByteBuffer
> > > >> >> > > entry, WriteCallback cb, Object ctx)
> > > >> >> > >
> > > >> >> > >             throws IOException, BookieException {
> > > >> >> > >
> > > >> >> > >         long ledgerId = handle.getLedgerId();
> > > >> >> > >
> > > >> >> > >         entry.rewind();
> > > >> >> > >
> > > >> >> > > *// ledgerStorage.addEntry() is happening here*
> > > >> >> > >
> > > >> >> > >         long entryId = handle.addEntry(entry);
> > > >> >> > >
> > > >> >> > >
> > > >> >> > >         entry.rewind();
> > > >> >> > >
> > > >> >> > >         writeBytes.add(entry.remaining());
> > > >> >> > >
> > > >> >> > >
> > > >> >> > >         LOG.trace("Adding {}@{}", entryId, ledgerId);
> > > >> >> > >
> > > >> >> > > *// journal add entry is happening here*
> > > >> >> > >
> > > >> >> > > *// callback/response to client is sent after journal add is
> > > done.*
> > > >> >> > >
> > > >> >> > >         journal.logAddEntry(entry, cb, ctx);
> > > >> >> > >
> > > >> >> > >     }
> > > >> >> > >
> > > >> >> > >
> > > >> >> > >
> > > >> >> > > ----------
> > > >> >> > > Andrey Yegorov
> > > >> >> > >
> > > >> >> >
> > > >> >>
> > > >> >>
> > > >> >>
> > > >> >> --
> > > >> >> Jvrao
> > > >> >> ---
> > > >> >> First they ignore you, then they laugh at you, then they fight
> you,
> > > >> then
> > > >> >> you win. - Mahatma Gandhi
> > > >> >>
> > > >> >
> > > >> >
> > > >> >
> > > >> > --
> > > >> > Jvrao
> > > >> > ---
> > > >> > First they ignore you, then they laugh at you, then they fight
> you,
> > > then
> > > >> > you win. - Mahatma Gandhi
> > > >> >
> > > >> >
> > > >> >
> > > >>
> > > >>
> > > >> --
> > > >> Jvrao
> > > >> ---
> > > >> First they ignore you, then they laugh at you, then they fight you,
> > then
> > > >> you win. - Mahatma Gandhi
> > > >>
> > > >
> > > >
> > >
> >
> >
> >
> > --
> > Jvrao
> > ---
> > First they ignore you, then they laugh at you, then they fight you, then
> > you win. - Mahatma Gandhi
> >
>

Re: why write to journal is happening after write to ledgerStorage?

Posted by Sijie Guo <gu...@gmail.com>.

On Mon, May 1, 2017 at 6:35 PM, Venkateswara Rao Jujjuri <ju...@gmail.com>
wrote:

> The real problem/issue is - having extremely fast journal disk doesn't
> really mask write latencies from a slower ledger disk.
>

In most cases it does mask the write latencies from a slower ledger disk,
because the write should only happen in memory.

The worst case here is when the write is being throttled, which typically
means something really bad is happening.

In this case, would a larger write buffer help?


>
> To address this rate correctness issue, cant we read from journal if the
> entryid >= LAC (as we cache now on bookie) and journal read fails?
>

First, the correctness issue isn't in the entryid >= LAC case, as a client
can't really read beyond LAC. The correctness issue is in the entryid <= LAC
case: the entry appears in the journal but not in ledger storage.

Second, the purpose of having a separate journal disk is to avoid reads on
journal that would impact writes. If we circle back reads on journals, this
would potentially  cause performance degradation on writes as well.

Last, in order to be able to read journals, you basically still need to add
some indexed structures into memory, so you know where to look up the
entries. No matter you store an entry in memtable or just store an entry
pointer pointing back to journal, you will still hit same problem - as you
can keep everything in memory, which you have to write data back to disks
and the throttling would happen again.

From all these three points, I don't see too much value about changing
this. Instead, the question would be simpler - can you increase the memory
buffer size? If you can't, that means your hardware's capacity can't
basically keep up with the incoming write traffic. More capacity is needed
then.
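
To make the last point concrete, here is a minimal sketch of a bounded
in-memory write buffer. It is not BookKeeper code: the class and method
names (BoundedWriteBufferSketch, tryAdd, flushed) are made up for
illustration, and the byte budget stands in for whatever is held in memory,
whether that is full entries or pointers into the journal. Once the budget
is used up, a writer has to wait until a flush frees space.

```java
import java.util.concurrent.Semaphore;

// Hypothetical sketch; names are illustrative and not part of BookKeeper.
class BoundedWriteBufferSketch {
    // Permits represent free bytes in the in-memory buffer.
    private final Semaphore freeBytes;

    BoundedWriteBufferSketch(int capacityBytes) {
        this.freeBytes = new Semaphore(capacityBytes);
    }

    // Returns true if the entry fits right now. A false result is the point
    // where a real writer would be throttled until a flush completes.
    boolean tryAdd(int entryBytes) {
        return freeBytes.tryAcquire(entryBytes);
    }

    // Called after buffered data has been written back to the ledger disk.
    void flushed(int bytes) {
        freeBytes.release(bytes);
    }
}
```

A larger capacity only delays the point where tryAdd starts failing; if
flushes are consistently slower than incoming writes, the budget still runs
out.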

- Sijie




Re: why write to journal is happening after write to ledgerStorage?

Posted by Venkateswara Rao Jujjuri <ju...@gmail.com>.

The real problem/issue is that having an extremely fast journal disk doesn't
really mask write latencies from a slower ledger disk.

To address this rare correctness issue, can't we read from the journal if the
entryid >= LAC (as we cache it now on the bookie) and journal read fails?

On Mon, May 1, 2017 at 6:33 PM, Sijie Guo <gu...@gmail.com> wrote:

> In the other to think about this,
>
> when 'throttling' happens,  it typically means:
>
> - the bookie doesn't have enough bandwidth/capacity to keep up with the
> traffic.
> - the disks on the bookie might have problems (e.g. slow down or other
> hardware issues).
>
> Either case can happen. It might be worth to let the throttling kick in,
> rather than let journal disk accepting writes and putting ledger storage
> into worse state.
>
> - Sijie



-- 
Jvrao
---
First they ignore you, then they laugh at you, then they fight you, then
you win. - Mahatma Gandhi

Re: why write to journal is happening after write to ledgerStorage?

Posted by Sijie Guo <gu...@gmail.com>.

Another way to think about this:

When 'throttling' happens, it typically means one of:

- the bookie doesn't have enough bandwidth/capacity to keep up with the
traffic.
- the disks on the bookie might have problems (e.g. slowdowns or other
hardware issues).

Either case can happen. It might be worth letting the throttling kick in,
rather than letting the journal disk keep accepting writes and putting
ledger storage into a worse state.

- Sijie


Re: why write to journal is happening after write to ledgerStorage?

Posted by Sijie Guo <gu...@gmail.com>.

On Mon, May 1, 2017 at 6:14 PM, Venkateswara Rao Jujjuri <ju...@gmail.com>
wrote:

> On Mon, May 1, 2017 at 6:03 PM, Venkateswara Rao Jujjuri <
> jujjuri@gmail.com>
> wrote:
>
> >
> >
> > On Mon, May 1, 2017 at 5:56 PM, Sijie Guo <gu...@gmail.com> wrote:
> >
> >> I don't think this is an inconsistent issue. The in memory update is
> >> updating lac not current entry. Even the entry is added into memory but
> >> this entry will not be readable after lac is advanced, lac is advanced
> >> only
> >> after the next entry is added which happened after current entry is
> acked.
> >>
> >
> > That is not true. You are talking about piggy-backed LAC only. But with
> > Explicit LAC
> > you don't need next entry to move LAC on bookie.
> >
>
> Sorry, I pushed send before finishing. :)
>
> So you don't need next entry to move LAC forward, but its client job to
> move LAC forward.
> Hence client need to send explicit LAC to update LAC after it hear back
> from AckQuorum.
> Hence Sijie is right on this part, it is not a consistency issue. :)
>
>
> But never the less, I believe we need to change the order as it is not
> completely shielding
> writes from other activity. @Sijie do you see any issue if we write to
> journal, ack to client
> and the write to ledger ?
>

Based on my understanding of this email thread, the concern is write
latency. However, adding to the journal first and to the memtable later
doesn't change the latency behavior. 'Throttling' will still happen when you
add the entry to the memtable.

So the question would be "can we write to the journal, ack back immediately
after the journal write, and add the entry to the memtable in the
background"?

The answer would be "no", because this would violate correctness. We might
end up in a case where the LAC is already advanced but the entry is not
found. It can happen in the following sequence:

- Client issues a write of entry N (lac = N-1).
- Bookie writes the entry to the journal and acknowledges. Entry N is in the
journal but hasn't been added to the memtable.
- Client receives the acknowledgement and advances LAC from N-1 to N.
- Client writes another entry N+1 (lac = N) to advance LAC.
- Another client (reader) detects that LAC advanced from N-1 to N. It
attempts to read entry N but N hasn't been added to ledger storage. (*The
correctness is violated here*)
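
The sequence above can be replayed with a small single-threaded sketch. This
is a toy model, not the Bookie code: the class, its fields and its methods
are all hypothetical, and it assumes the bookie records the piggy-backed LAC
as soon as it accepts an add, while the memtable insert is left to a
separate background step.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical toy bookie; none of these names are BookKeeper APIs.
class AckBeforeMemtableSketch {
    final Map<Long, String> journal = new HashMap<>();   // durable log
    final Map<Long, String> memtable = new HashMap<>();  // what reads are served from
    long lac = -1;                                        // LAC as known to the bookie

    // Journal the entry, record the piggy-backed LAC, and ack immediately.
    // The memtable insert has NOT happened when this returns.
    void addEntryAckAfterJournalOnly(long entryId, long piggybackedLac, String payload) {
        journal.put(entryId, payload);
        lac = Math.max(lac, piggybackedLac);
    }

    // The deferred background step that moves an entry into the memtable.
    void applyToMemtable(long entryId) {
        memtable.put(entryId, journal.get(entryId));
    }

    // A reader may ask for any entry up to LAC; null means "entry not found".
    String read(long entryId) {
        if (entryId > lac) {
            throw new IllegalArgumentException("cannot read beyond LAC");
        }
        return memtable.get(entryId);
    }

    // Replays the sequence: returns true if entry N is missing at LAC = N.
    static boolean reproducesMissingEntry() {
        AckBeforeMemtableSketch bookie = new AckBeforeMemtableSketch();
        long n = 5;
        bookie.lac = n - 1;
        bookie.addEntryAckAfterJournalOnly(n, n - 1, "entry-N");    // acked, not in memtable
        bookie.addEntryAckAfterJournalOnly(n + 1, n, "entry-N+1");  // LAC advances to N
        return bookie.read(n) == null;                              // reader misses entry N
    }
}
```

In this model the entry only becomes readable once applyToMemtable(n) has
run, which is exactly the window the ack must not be sent in.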

So to summarize my thoughts on this:

- The acknowledgement should happen after both writing the entry to the
journal and writing the entry to the memtable.
- The order of writing the entry to the journal and writing the entry to the
memtable doesn't matter here.
- Writing the entry to the memtable first helps with tailing latency
(because it will advance LAC first).
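
As a rough illustration of the first two points, the sketch below completes
the ack only after both writes have finished, regardless of which finishes
first. It is an assumption-laden model rather than the actual Bookie
implementation: AckAfterBothSketch and its two maps are hypothetical
stand-ins for the journal and the memtable.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; the maps stand in for the real journal and memtable.
class AckAfterBothSketch {
    final Map<Long, String> memtable = new ConcurrentHashMap<>();
    final Map<Long, String> journal = new ConcurrentHashMap<>();

    // The returned future plays the role of the ack to the client. It
    // completes only when both writes are done, in whatever order they ran.
    CompletableFuture<Long> addEntry(long entryId, String payload) {
        CompletableFuture<Void> memWrite =
                CompletableFuture.runAsync(() -> memtable.put(entryId, payload));
        CompletableFuture<Void> journalWrite =
                CompletableFuture.runAsync(() -> journal.put(entryId, payload));
        return CompletableFuture.allOf(memWrite, journalWrite)
                .thenApply(ignored -> entryId);
    }
}
```

By the time the ack future completes, a reader that sees the advanced LAC
can find the entry in the memtable, and the entry is also in the journal.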

- Sijie



Re: why write to journal is happening after write to ledgerStorage?

Posted by Venkateswara Rao Jujjuri <ju...@gmail.com>.

On Mon, May 1, 2017 at 6:03 PM, Venkateswara Rao Jujjuri <ju...@gmail.com>
wrote:

>
>
> On Mon, May 1, 2017 at 5:56 PM, Sijie Guo <gu...@gmail.com> wrote:
>
>> I don't think this is an inconsistency issue. The in-memory update is
>> updating the LAC, not the current entry. Even though the entry is added
>> into memory, it will not be readable until the LAC is advanced, and the
>> LAC is advanced only after the next entry is added, which happens after
>> the current entry is acked.
>>
>
> That is not true. You are talking about piggy-backed LAC only. But with
> Explicit LAC you don't need the next entry to move the LAC on the bookie.
>

Sorry, I pushed send before finishing. :)

So you don't need the next entry to move the LAC forward, but it is the
client's job to move the LAC forward.
Hence the client needs to send an explicit LAC to update the LAC after it
hears back from the AckQuorum.
Hence Sijie is right on this part, it is not a consistency issue. :)
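
To make the client-side rule concrete, here is a minimal sketch. It assumes the client advances its LAC to entry N only once entry N and every entry before it have been acked by an ack quorum of bookies. The class and method names are hypothetical and are not the BookKeeper client API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical client-side tracker: the LAC moves only on ack-quorum responses.
class ClientLacTracker {
    private final int ackQuorum;
    private final Map<Long, Integer> ackCounts = new HashMap<>();
    private long lac = -1L; // -1 means "nothing confirmed yet"

    ClientLacTracker(int ackQuorum) {
        this.ackQuorum = ackQuorum;
    }

    // Records one bookie ack for entryId and returns the (possibly advanced) LAC.
    long onBookieAck(long entryId) {
        if (entryId <= lac) {
            return lac; // late ack for an entry that is already confirmed
        }
        ackCounts.merge(entryId, 1, Integer::sum);
        // Advance over every consecutive entry that has reached the quorum.
        while (ackCounts.getOrDefault(lac + 1, 0) >= ackQuorum) {
            ackCounts.remove(lac + 1);
            lac++;
        }
        return lac;
    }

    long getLac() {
        return lac;
    }
}
```

With an ack quorum of 2, two acks for entry 1 alone do not move the LAC; it moves past both entries only when entry 0 also reaches its quorum. The value returned here is what the client would then piggy-back on the next add or send as an explicit LAC.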


But nevertheless, I believe we need to change the order, as it is not
completely shielding writes from other activity. @Sijie, do you see any
issue if we write to the journal, ack to the client, and then write to the
ledger?
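
A rough sketch of the ordering being proposed, for discussion only. All names are hypothetical stand-ins rather than the real Bookie, Journal or LedgerStorage classes, and the journal append is modelled as a future that has already completed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch of "journal first, ack, then ledger storage".
class JournalFirstAdd {
    // Records the order in which the steps ran, for inspection.
    static final List<String> events = new ArrayList<>();

    interface WriteCallback {
        void writeComplete(long entryId);
    }

    // Stand-in for a durable journal append.
    static CompletableFuture<Void> journalAppend(long entryId) {
        events.add("journal");
        return CompletableFuture.completedFuture(null);
    }

    // Stand-in for the (possibly throttled) memtable / ledger storage write.
    static void ledgerStorageAdd(long entryId) {
        events.add("ledgerStorage");
    }

    static void addEntry(long entryId, WriteCallback cb) {
        journalAppend(entryId).thenRun(() -> {
            cb.writeComplete(entryId);  // ack as soon as the journal has the entry
            ledgerStorageAdd(entryId);  // ledger storage no longer delays the ack
        });
    }

    public static void main(String[] args) {
        addEntry(1L, id -> events.add("ack"));
        System.out.println(events);
    }
}
```

In this ordering a throttled ledger-storage write no longer adds to the latency the client sees. What it leaves open is how a read that arrives between the ack and the ledger-storage write should be served.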

JV




-- 
Jvrao
---
First they ignore you, then they laugh at you, then they fight you, then
you win. - Mahatma Gandhi

Re: why write to journal is happening after write to ledgerStorage?

Posted by Venkateswara Rao Jujjuri <ju...@gmail.com>.

On Mon, May 1, 2017 at 5:56 PM, Sijie Guo <gu...@gmail.com> wrote:

> I don't think this is an inconsistency issue. The in-memory update is
> updating the LAC, not the current entry. Even though the entry is added
> into memory, it will not be readable until the LAC is advanced, and the
> LAC is advanced only after the next entry is added, which happens after
> the current entry is acked.
>

That is not true. You are talking about piggy-backed LAC only. But with
Explicit LAC you don't need the next entry to move the LAC on the bookie.



> So adding the entry to memory doesn't expose any consistency issue.
>



-- 
Jvrao
---
First they ignore you, then they laugh at you, then they fight you, then
you win. - Mahatma Gandhi

Re: why write to journal is happening after write to ledgerStorage?

Posted by Sijie Guo <gu...@gmail.com>.

I don't think this is an inconsistency issue. The in-memory update is
updating the LAC, not the current entry. Even though the entry is added
into memory, it will not be readable until the LAC is advanced, and the
LAC is advanced only after the next entry is added, which happens after
the current entry is acked.
So adding the entry to memory doesn't expose any consistency issue.


Re: why write to journal is happening after write to ledgerStorage?

Posted by Venkateswara Rao Jujjuri <ju...@gmail.com>.

On Mon, May 1, 2017 at 2:31 PM, Yiming Zang <yz...@twitter.com.invalid>
wrote:

> Hi Andrey,
>
> That's a good point, and you're actually correct that if write to memTable
> got throttled somehow, the addEntry request latency will be affected a lot.
> This actually happens a few times in production cluster. Normally, the idea
> of using Journal is to write data to the write-ahead log and then persist
> the actual data to disks or add to memTable. However, my understanding of
> why we choose to write entry to ledgerStorage first is to improve the
> tailing-read performance.
>
> In SortedLedgerStorage.java, we first add entry to memTable and then we
> update lastAddConfirmed, which means if there's a long poll read request or
> readLastAddConfirmed request, it will immediately get satisfied for the
> latest entry before we actually log the entry into Journal. So tailing-read
> doesn't actually need to wait for any disk operation in Bookkeeper
> including Journal operation.
>
> public long addEntry(ByteBuffer entry) throws IOException {
> long ledgerId = entry.getLong();
> long entryId = entry.getLong();
> long lac = entry.getLong();
> entry.rewind();
> memTable.addEntry(ledgerId, entryId, entry, this);
> ledgerCache.updateLastAddConfirmed(ledgerId, lac);
> return entryId;
> }
>
> But thinking about here, I'm wondering if it's actually safe to update the
> LAC before we write the entry to Journal. What if we tell the client the
> LAC has been updated but we actually failed to write the entry to Journal
> and Bookie crashed at that time? Would this bring any inconsistency issue?
>

Good point. This is indeed an inconsistency issue. BK guarantees "if you
read once you can read it all the time".
If it is really done for LAC, it is not really a good idea. Unless I am
missing something, this must be changed ASAP.

Thanks,
JV





-- 
Jvrao
---
First they ignore you, then they laugh at you, then they fight you, then
you win. - Mahatma Gandhi

Re: why write to journal is happening after write to ledgerStorage?

Posted by Yiming Zang <yz...@twitter.com.INVALID>.

Hi Andrey,

That's a good point, and you're correct that if the write to the memTable
gets throttled, the addEntry request latency will be affected a lot. This
has actually happened a few times in production clusters. Normally, the idea
of using a Journal is to write data to the write-ahead log first and then
persist the actual data to disk or add it to the memTable. However, my
understanding of why we choose to write the entry to ledgerStorage first is
that it improves tailing-read performance.

In SortedLedgerStorage.java, we first add the entry to the memTable and then
update lastAddConfirmed, which means that a long poll read request or a
readLastAddConfirmed request will be satisfied immediately for the latest
entry, before we actually log the entry into the Journal. So a tailing read
doesn't need to wait for any disk operation in BookKeeper, including the
Journal operation.

    public long addEntry(ByteBuffer entry) throws IOException {
        long ledgerId = entry.getLong();
        long entryId = entry.getLong();
        long lac = entry.getLong();
        entry.rewind();
        memTable.addEntry(ledgerId, entryId, entry, this);
        ledgerCache.updateLastAddConfirmed(ledgerId, lac);
        return entryId;
    }
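
To illustrate why updating lastAddConfirmed in memory can satisfy a long poll immediately, here is a simplified sketch. It is not the actual ledgerCache or long-poll implementation; the class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch: long-poll readers are parked until the in-memory LAC moves.
class LacLongPoll {
    private long lac = -1L;
    private final List<Waiter> waiters = new ArrayList<>();

    private static final class Waiter {
        final long previousLac;
        final CompletableFuture<Long> future = new CompletableFuture<>();

        Waiter(long previousLac) {
            this.previousLac = previousLac;
        }
    }

    // A reader that has already seen previousLac waits for a larger value.
    synchronized CompletableFuture<Long> waitForLacBeyond(long previousLac) {
        if (lac > previousLac) {
            return CompletableFuture.completedFuture(Long.valueOf(lac));
        }
        Waiter w = new Waiter(previousLac);
        waiters.add(w);
        return w.future;
    }

    // Called on the add path; wakes readers without touching any disk.
    synchronized void updateLastAddConfirmed(long newLac) {
        if (newLac <= lac) {
            return; // the LAC never moves backwards
        }
        lac = newLac;
        waiters.removeIf(w -> {
            if (lac > w.previousLac) {
                w.future.complete(Long.valueOf(lac));
                return true;
            }
            return false;
        });
    }
}
```

A reader parked in waitForLacBeyond is released as soon as updateLastAddConfirmed runs, which in the add path shown above happens before the journal write is issued.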

But thinking about it here, I'm wondering whether it's actually safe to
update the LAC before we write the entry to the Journal. What if we tell the
client the LAC has been updated, but we actually fail to write the entry to
the Journal and the Bookie crashes at that point? Would this cause any
inconsistency issue?
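
One way to reason about this scenario is a toy model. It assumes piggy-backed LAC only, so that entry N carries a LAC of at most N-1 (explicit LAC is not modelled), and it assumes the memTable and the in-memory LAC are lost on a crash while the journal survives. The names are hypothetical and this is not the Bookie recovery code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of a crash between the memTable update and the journal write.
class CrashBeforeJournal {
    final List<long[]> journal = new ArrayList<>();  // durable: {entryId, lac}
    Map<Long, Long> memTable = new HashMap<>();      // volatile: entryId -> lac
    long lac = -1L;                                  // volatile in-memory LAC

    void addEntry(long entryId, long piggyBackedLac, boolean crashBeforeJournal) {
        memTable.put(entryId, piggyBackedLac);
        lac = Math.max(lac, piggyBackedLac);
        if (crashBeforeJournal) {
            return; // simulated crash: the journal never sees this entry
        }
        journal.add(new long[] {entryId, piggyBackedLac});
    }

    // Simulated restart: volatile state is rebuilt from the journal only.
    void restart() {
        memTable = new HashMap<>();
        lac = -1L;
        for (long[] record : journal) {
            memTable.put(record[0], record[1]);
            lac = Math.max(lac, record[1]);
        }
    }

    public static void main(String[] args) {
        CrashBeforeJournal bookie = new CrashBeforeJournal();
        bookie.addEntry(0L, -1L, false);
        bookie.addEntry(1L, 0L, false);
        bookie.addEntry(2L, 1L, true); // crash before journaling entry 2
        long lacSeenByReader = bookie.lac;

        bookie.restart();
        System.out.println("LAC seen before crash: " + lacSeenByReader);
        System.out.println("that entry survives: " + bookie.memTable.containsKey(lacSeenByReader));
        System.out.println("entry 2 survives: " + bookie.memTable.containsKey(2L));
    }
}
```

Under these assumptions the entry lost in the crash is the one that was never acked, while the entry the advertised LAC points at had already been journaled.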
