Posted to dev@distributedlog.apache.org by Sijie Guo <si...@apache.org> on 2016/11/01 10:05:33 UTC

Re: Proxy Client - Batch Ordering / Commit

I created https://issues.apache.org/jira/browse/DL-63 for tracking the
proposed idea here.



On Wed, Oct 26, 2016 at 4:53 PM, Sijie Guo <si...@twitter.com.invalid>
wrote:

> On Tue, Oct 25, 2016 at 11:30 AM, Cameron Hatfield <ki...@gmail.com>
> wrote:
>
> > Yes, we are reading the HBase WAL (from their replication plugin
> support),
> > and writing that into DL.
> >
>
> Gotcha.
>
>
> >
> > From the sounds of it, yes, it would. Only thing I would say is make the
> > epoch requirement optional, so that if a client doesn't care about dupes
> > they don't have to deal with the process of getting a new epoch.
> >
>
> Yup. This should be optional. I can start a wiki page on how we want to
> implement this. Are you interested in contributing to this?
>
>
> >
> > -Cameron
> >
> > On Wed, Oct 19, 2016 at 7:43 PM, Sijie Guo <si...@twitter.com.invalid>
> > wrote:
> >
> > > On Wed, Oct 19, 2016 at 7:17 PM, Sijie Guo <si...@twitter.com> wrote:
> > >
> > > >
> > > >
> > > > On Monday, October 17, 2016, Cameron Hatfield <ki...@gmail.com>
> > wrote:
> > > >
> > > >> Answer inline:
> > > >>
> > > >> On Mon, Oct 17, 2016 at 11:46 AM, Sijie Guo <si...@apache.org>
> wrote:
> > > >>
> > > >> > Cameron,
> > > >> >
> > > >> > Thank you for your summary. I liked the discussion here. I also
> > liked
> > > >> the
> > > >> > summary of your requirement - 'single-writer-per-key,
> > > >> > multiple-writers-per-log'. If I understand correctly, the core
> > concern
> > > >> here
> > > >> > is almost 'exactly-once' write (or a way to explicitly tell if a write
> > can
> > > >> be
> > > >> > retried or not).
> > > >> >
> > > >> > Comments inline.
> > > >> >
> > > >> > On Fri, Oct 14, 2016 at 11:17 AM, Cameron Hatfield <
> > kinguy@gmail.com>
> > > >> > wrote:
> > > >> >
> > > >> > > > Ah- yes good point (to be clear we're not using the proxy this
> > way
> > > >> > > today).
> > > >> > >
> > > >> > > > > Due to the source of the
> > > >> > > > > data (HBase Replication), we cannot guarantee that a single
> > > >> partition
> > > >> > > will
> > > >> > > > > be owned for writes by the same client.
> > > >> > >
> > > >> > > > Do you mean you *need* to support multiple writers issuing
> > > >> interleaved
> > > >> > > > writes or is it just that they might sometimes interleave
> writes
> > > and
> > > >> > you
> > > >> > > >don't care?
> > > >> > > How HBase partitions the keys being written wouldn't have a
> > one->one
> > > >> > > mapping with the partitions we would have in DL. Even if we
> did
> > > >> have
> > > >> > > that alignment when the cluster first started, HBase will
> > rebalance
> > > >> what
> > > >> > > servers own what partitions, as well as split and merge
> partitions
> > > >> that
> > > >> > > already exist, causing eventual drift from one log per
> partition.
> > > >> > > Because we want ordering guarantees per key (row in hbase), we
> > > >> partition
> > > >> > > the logs by the key. Since multiple writers are possible per
> range
> > > of
> > > >> > keys
> > > >> > > (due to the aforementioned rebalancing / splitting / etc of
> > hbase),
> > > we
> > > >> > > cannot use the core library due to requiring a single writer for
> > > >> > ordering.
> > > >> > >
> > > >> > > But, for a single log, we don't really care about ordering aside
> > > from
> > > >> at
> > > >> > > the per-key level. So all we really need to be able to handle is
> > > >> > preventing
> > > >> > > duplicates when a failure occurs, and ordering consistency
> across
> > > >> > requests
> > > >> > > from a single client.
> > > >> > >
> > > >> > > So our general requirements are:
> > > >> > > Write A, Write B
> > > >> > > Timeline: A -> B
> > > >> > > Request B is only made after A has successfully returned
> (possibly
> > > >> after
> > > >> > > retries)
> > > >> > >
> > > >> > > 1) If the write succeeds, it will be durably exposed to clients
> > > within
> > > >> > some
> > > >> > > bounded time frame
> > > >> > >
> > > >> >
> > > >> > Guaranteed.
> > > >> >
> > > >>
> > > >> > > 2) If A succeeds and B succeeds, the ordering for the log will
> be
> > A
> > > >> and
> > > >> > > then B
> > > >> > >
> > > >> >
> > > >> > If I understand correctly here, B is only sent after A is
> returned,
> > > >> right?
> > > >> > If that's the case, It is guaranteed.
> > > >>
> > > >>
> > > >>
> > > >> >
> > > >> >
> > > >> >
> > > >> > > 3) If A fails due to an error that can be relied on to *not* be
> a
> > > lost
> > > >> > ack
> > > >> > > problem, it will never be exposed to the client, so it may
> > > (depending
> > > >> on
> > > >> > > the error) be retried immediately
> > > >> > >
> > > >> >
> > > >> > If it is not a lost-ack problem, the entry will be exposed. it is
> > > >> > guaranteed.
> > > >>
> > > >> Let me try rephrasing the questions, to make sure I'm understanding
> > > >> correctly:
> > > >> If A fails, with an error such as "Unable to create connection to
> > > >> bookkeeper server", that would be the type of error we would expect
> to
> > > be
> > > >> able to retry immediately, as that result means no action was taken
> on
> > > any
> > > >> log / etc, so no entry could have been created. This is different
> > than a
> > > >> "Connection Timeout" exception, as we just might not have gotten a
> > > >> response
> > > >> in time.
> > > >>
> > > >>
> > > > Gotcha.
> > > >
> > > > The response code returned from the proxy can tell if a failure can be
> > > retried
> > > > safely or not. (We might need to make them well documented)
> > > >
> > > >
> > > >
> > > >>
> > > >> >
> > > >> >
> > > >> > > 4) If A fails due to an error that could be a lost ack problem
> > > >> (network
> > > >> > > connectivity / etc), within a bounded time frame it should be
> > > >> possible to
> > > >> > > find out if the write succeeded or failed. Either by reading from
> > some
> > > >> > > checkpoint of the log for the changes that should have been made
> > or
> > > >> some
> > > >> > > other possible server-side support.
> > > >> > >
> > > >> >
> > > >> > If I understand this correctly, it is a duplication issue, right?
> > > >> >
> > > >> > Can a de-duplication solution work here? Either DL or your client
> > does
> > > >> the
> > > >> > de-duplication?
> > > >> >
> > > >>
> > > >> The requirements I'm mentioning are the ones needed for client-side
> > > >> dedupping. If I can guarantee writes are exposed within some
> > > time
> > > >> frame, and I can never get into an inconsistently ordered state when
> > > >> successes happen, then when an error occurs, I can always wait for the max
> time
> > > >> frame, read the latest writes, and then dedup locally against the
> > > request
> > > >> I
> > > >> just made.
> > > >>
> > > >> The main thing about that timeframe is that it's basically the
> addition
> > > of
> > > >> every timeout, all the way down in the system, combined with
> whatever
> > > >> flushing / caching / etc times are at the bookkeeper / client level
> > for
> > > >> when values are exposed
> > > >
> > > >
> > > > Gotcha.
> > > >
> > > >>
> > > >>
> > > >> >
> > > >> > Is there any ways to identify your write?
> > > >> >
> > > >> > I can think of a case as follows - I want to know what is your
> > expected
> > > >> > behavior from the log.
> > > >> >
> > > >> > a)
> > > >> >
> > > >> > If an HBase region server A writes a change of key K to the log,
> the
> > > >> change
> > > >> > is successfully made to the log;
> > > >> > but server A is down before receiving the ack for the change.
> > > >> > region server B took over the region that contains K, what will B
> > do?
> > > >> >
> > > >>
> > > >> HBase writes in large chunks (WAL Logs), which its replication
> system
> > > then
> > > >> handles by replaying in the case of failure. If I'm in the middle of a
> > > log,
> > > >> and the whole region goes down and gets rescheduled elsewhere, I
> will
> > > >> start
> > > >> back up from the beginning of the log I was in the middle of. Using
> > > >> checkpointing + deduping, we should be able to find out where we
> left
> > > off
> > > >> in the log.
> > > >
> > > >
> > > >> >
> > > >> >
> > > >> > b) same as a). but server A was just network partitioned. will
> both
> > A
> > > >> and B
> > > >> > write the change of key K?
> > > >> >
> > > >>
> > > >> HBase gives us some guarantees around network partitions
> (Consistency
> > > over
> > > >> availability for HBase). HBase is a single-master failover recovery
> > type
> > > >> of
> > > >> system, with zookeeper-based guarantees for single owners (writers)
> > of a
> > > >> range of data.
> > > >>
> > > >> >
> > > >> >
> > > >> >
> > > >> > > 5) If A is turned into multiple batches (one large request gets
> > > split
> > > >> > into
> > > >> > > multiple smaller ones to the bookkeeper backend, due to log
> > rolling
> > > /
> > > >> > size
> > > >> > > / etc):
> > > >> > >   a) The ordering of entries within batches has ordering
> > > consistency
> > > >> > with
> > > >> > > the original request, when exposed in the log (though they may
> be
> > > >> > > interleaved with other requests)
> > > >> > >   b) The ordering across batches has ordering consistency with
> > the
> > > >> > > original request, when exposed in the log (though they may be
> > > >> interleaved
> > > >> > > with other requests)
> > > >> > >   c) If a batch fails, and cannot be retried / is unsuccessfully
> > > >> retried,
> > > >> > > all batches after the failed batch should not be exposed in the
> > log.
> > > >> > Note:
> > > >> > > The batches before and including the failed batch, that ended up
> > > >> > > succeeding, can show up in the log, again within some bounded
> time
> > > >> range
> > > >> > > for reads by a client.
> > > >> > >
> > > >> >
> > > >> > There is a method 'writeBulk' in DistributedLogClient that can achieve
> > this
> > > >> > guarantee.
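> > > >> >
> > > >> > For illustration, a minimal sketch of using it from the proxy client
> > > >> > (assuming the writeBulk(stream, List<ByteBuffer>) signature; the stream
> > > >> > name and payloads here are placeholders):
> > > >> >
> > > >> >   // one Future<DLSN> per record, in the same order as the input list
> > > >> >   List<ByteBuffer> batch = Arrays.asList(
> > > >> >       ByteBuffer.wrap(recordABytes), ByteBuffer.wrap(recordBBytes));
> > > >> >   List<Future<DLSN>> results = client.writeBulk("my-stream", batch);
> > > >> >   // per the guarantee above, records after a failed record are not exposed
> > > >> >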
> > > >> >
> > > >> > However, I am not very sure about how you will turn A into
> batches.
> > If
> > > >> you
> > > >> > are dividing A into batches,
> > > >> > you can simply control the application write sequence to achieve
> the
> > > >> > guarantee here.
> > > >> >
> > > >> > Can you explain more about this?
> > > >> >
> > > >>
> > > >> In this case, by batches I mean what the proxy does with the single
> > > >> request
> > > >> that I send it. If the proxy decides it needs to turn my single
> > request
> > > >> into multiple batches of requests, due to log rolling, size
> > limitations,
> > > >> etc, those would be the guarantees I need to be able to deduplicate
> on
> > > the
> > > >> client side.
> > > >
> > > >
> > > > A single record written by #write and a record set (set of records)
> > > > written by #writeRecordSet are atomic - they will not be broken down
> > into
> > > > entries (batches). With the correct response code, you would be able
> to
> > > > tell if it is a lost-ack failure or not. However there is a size
> > > limitation
> > > > for this - it cannot go beyond 1MB in the current implementation.
> > > >
> > > > What is your expected record size?
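> > > >
> > > > As a rough guard sketch for that limit (the 1MB figure above; write(stream,
> > > > ByteBuffer) is the plain proxy-client call, everything else is illustrative):
> > > >
> > > >   static final int MAX_RECORD_BYTES = 1024 * 1024; // limit from the discussion above
> > > >
> > > >   Future<DLSN> writeChecked(DistributedLogClient client, String stream, byte[] payload) {
> > > >     if (payload.length > MAX_RECORD_BYTES) {
> > > >       // split at the application level, or wait for the proposed transaction support
> > > >       throw new IllegalArgumentException("payload exceeds the 1MB limit");
> > > >     }
> > > >     return client.write(stream, ByteBuffer.wrap(payload));
> > > >   }
> > > >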
> > > >
> > > >
> > > >>
> > > >> >
> > > >> >
> > > >> > >
> > > >> > > Since we can guarantee per-key ordering on the client side, we
> > > >> guarantee
> > > >> > > that there is a single writer per-key, just not per log.
> > > >> >
> > > >> >
> > > >> > Do you need a fencing guarantee in the case of a network partition
> > causing
> > > >> > two writers?
> > > >> >
> > > >> >
> > > >> > > So if there was a
> > > >> > > way to guarantee a single write request as being written or not,
> > > >> within a
> > > >> > > certain time frame (since failures should be rare anyways, this
> is
> > > >> fine
> > > >> > if
> > > >> > > it is expensive), we can then have the client guarantee the
> > ordering
> > > >> it
> > > >> > > needs.
> > > >> > >
> > > >> >
> > > >> > This sounds like an 'exactly-once' write (regarding retries) requirement
> to
> > > me,
> > > >> > right?
> > > >> >
> > > >> Yes. I'm curious how this issue is handled by Manhattan, since
> you
> > > can
> > > >> imagine a data store that ends up getting multiple writes for the
> same
> > > put
> > > >> / get / etc, would be harder to use, and we are basically trying to
> > > create
> > > >> a log like that for HBase.
> > > >
> > > >
> > > > Are you guys replacing HBase WAL?
> > > >
> > > > In the Manhattan case, the request will be first written to DL streams by
> > > > the Manhattan coordinator. The Manhattan replica will then read from the
> DL
> > > > streams and apply the change. In the lost-ack case, the MH
> coordinator
> > > will
> > > > just fail the request to the client.
> > > >
> > > > My feeling here is your usage for HBase is a bit different from how
> we
> > > use
> > > > DL in Manhattan. It sounds like you read from a source (HBase WAL)
> and
> > > > write to DL. But I might be wrong.
> > > >
> > > >
> > > >>
> > > >> >
> > > >> >
> > > >> > >
> > > >> > >
> > > >> > > > Cameron:
> > > >> > > > Another thing we've discussed but haven't really thought
> > through -
> > > >> > > > We might be able to support some kind of epoch write request,
> > > where
> > > >> the
> > > >> > > > epoch is guaranteed to have changed if the writer has changed
> or
> > > the
> > > >> > > ledger
> > > >> > > > was ever fenced off. Writes include an epoch and are rejected
> if
> > > the
> > > >> > > epoch
> > > >> > > > has changed.
> > > >> > > > With a mechanism like this, fencing the ledger off after a
> > failure
> > > >> > would
> > > >> > > > ensure any pending writes had either been written or would be
> > > >> rejected.
> > > >> > >
> > > >> > > The issue would be how I guarantee the write I wrote to the
> server
> > > was
> > > >> > > written. Since a network issue could happen on the send of the
> > > >> request,
> > > >> > or
> > > >> > > on the receive of the success response, an epoch wouldn't tell
> me
> > > if I
> > > >> > can
> > > >> > > successfully retry, as it could be successfully written but AWS
> > > >> dropped
> > > >> > the
> > > >> > > connection for the success response. Since the epoch would be
> the
> > > same
> > > >> > > (same ledger), I could write duplicates.
> > > >> > >
> > > >> > >
> > > >> > > > We are currently proposing adding a transaction semantic to dl
> > to
> > > >> get
> > > >> > rid
> > > >> > > > of the size limitation and the unaware-ness in the proxy
> client.
> > > >> Here
> > > >> > is
> > > >> > > > our idea -
> > > >> > > > http://mail-archives.apache.org/mod_mbox/incubator-distributedlog-dev/201609.mbox/%3cCAAC6BxP5YyEHwG0ZCF5soh42X=xuYwYmL4nXsYBYiofzxpVk6g@mail.gmail.com%3e
> > > >> > >
> > > >> > > > I am not sure if your idea is similar to ours. But we'd like
> to
> > > >> > > collaborate
> > > >> > > > with the community if anyone has a similar idea.
> > > >> > >
> > > >> > > Our use case would be covered by transaction support, but I'm
> > unsure
> > > >> if
> > > >> > we
> > > >> > > would need something that heavyweight for the guarantees we
> need.
> > > >> > >
> > > >> >
> > > >> > >
> > > >> > > Basically, the high level requirement here is "Support
> consistent
> > > >> write
> > > >> > > ordering for single-writer-per-key, multi-writer-per-log". My
> > hunch
> > > is
> > > >> > > that, with some added guarantees to the proxy (if it isn't
> already
> > > >> > > supported), and some custom client code on our side for removing
> > the
> > > >> > > entries that actually succeeded in writing to DistributedLog from
> the
> > > >> request
> > > >> > > that failed, it should be a relatively easy thing to support.
> > > >> > >
> > > >> >
> > > >> > Yup. I think it should not be very difficult to support. There
> might
> > > be
> > > >> > some changes on the server side.
> > > >> > Let's figure out what the changes will be. Are you guys interested
> > in
> > > >> > contributing?
> > > >> >
> > > >> > Yes, we would be.
> > > >>
> > > >> As a note, the one thing that we see as an issue with the client
> side
> > > >> dedupping is how to bound the range of data that needs to be looked
> at
> > > for
> > > >> deduplication. As you can imagine, it is pretty easy to bound the
> > bottom
> > > >> of
> > > >> the range, as that is just regular checkpointing of the DLSN that is
> > > >> returned. I'm still not sure if there is any nice way to time bound
> > the
> > > >> top
> > > >> end of the range, especially since the proxy owns sequence numbers
> > > (which
> > > >> makes sense). I am curious if there is more that can be done if
> > > >> deduplication is on the server side. However the main minus I see of
> > > >> server
> > > >> side deduplication is that instead of running contingent on there
> > being
> > > a
> > > >> failed client request, it would have to run every time a
> write
> > > >> happens.
> > > >
> > > >
> > > > For a reliable dedup, we probably need a fence-then-getLastDLSN
> > operation -
> > > > so it would guarantee that any non-completed requests issued
> (lost-ack
> > > > requests) before this fence-then-getLastDLSN operation will be failed
> > and
> > > > they will never land in the log.
> > > >
> > > > the pseudo code would look like this -
> > > >
> > > > write(request) onFailure { t =>
> > > >
> > > >   if (t is timeout exception) {
> > > >
> > > >     DLSN lastDLSN = fenceThenGetLastDLSN()
> > > >     DLSN lastCheckpointedDLSN = ...;
> > > >     // find if the request lands between [lastCheckpointedDLSN, lastDLSN].
> > > >     // if it exists, the write succeeded; otherwise retry.
> > > >
> > > >   }
> > > >
> > > > }
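> > > >
> > > > In Java, with the proxy client, that flow could look roughly like this
> > > > (fenceThenGetLastDLSN is only the proposed RPC, not an existing call;
> > > > checkpoint, foundInLog, retryWrite, isLostAckCandidate and
> > > > lastCheckpointedDLSN are hypothetical application-side helpers):
> > > >
> > > >   client.write(stream, ByteBuffer.wrap(payload))
> > > >         .addEventListener(new FutureEventListener<DLSN>() {
> > > >           @Override
> > > >           public void onSuccess(DLSN dlsn) {
> > > >             checkpoint(dlsn); // lower bound for any later dedup scan
> > > >           }
> > > >           @Override
> > > >           public void onFailure(Throwable cause) {
> > > >             if (isLostAckCandidate(cause)) { // e.g. timeout: outcome unknown
> > > >               // proposed call: fails outstanding writes, returns the tail DLSN
> > > >               DLSN upper = fenceThenGetLastDLSN(stream);
> > > >               if (!foundInLog(stream, lastCheckpointedDLSN, upper, payload)) {
> > > >                 retryWrite(payload); // definitely not in the log, safe to retry
> > > >               }
> > > >             } else {
> > > >               retryWrite(payload); // definite failure, never exposed
> > > >             }
> > > >           }
> > > >         });
> > > >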
> > > >
> > >
> > >
> > > Just realized the idea is the same as what Leigh raised in the previous
> email
> > > about 'epoch write'. Let me explain more about this idea (Leigh, feel
> > free
> > > to jump in to fill up your idea).
> > >
> > > - when a log stream is owned, the proxy uses the last transaction id as
> > the
> > > epoch
> > > - when a client connects (handshake with the proxy), it will get the
> > epoch
> > > for the stream.
> > > - the writes issued by this client will carry the epoch to the proxy.
> > > - add a new rpc - fenceThenGetLastDLSN - it would force the proxy to
> bump
> > > the epoch.
> > > - if fenceThenGetLastDLSN happens, all the outstanding writes with the old
> > > epoch will be rejected with exceptions (e.g. EpochFenced).
> > > - The DLSN returned from fenceThenGetLastDLSN can be used as the bound
> > for
> > > deduplication on failures.
> > >
> > > Cameron, does this sound like a solution to your use case?
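> > >
> > > For illustration, the client-side flow sketched above might look like the
> > > following (every name here is hypothetical; the shapes just mirror the
> > > bullets above):
> > >
> > >   long epoch = client.handshake(stream).getEpoch();  // epoch handed out on handshake
> > >
> > >   // normal writes carry the epoch; rejected (e.g. EpochFenced) once it is bumped
> > >   client.writeWithEpoch(stream, ByteBuffer.wrap(payload), epoch);
> > >
> > >   // on a lost-ack failure: bump the epoch and get the dedup upper bound
> > >   FenceResult fence = client.fenceThenGetLastDLSN(stream);
> > >   DLSN upperBound = fence.getLastDLSN(); // dedup within [lastCheckpointedDLSN, upperBound]
> > >   long newEpoch = fence.getEpoch();      // used for subsequent writes
> > >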
> > >
> > >
> > >
> > > >
> > > >
> > > >>
> > > >> Maybe something that could fit a similar need that Kafka does (the
> > last
> > > >> stored value for a particular key in a log), such that on a per key
> > basis
> > > >> there could be a sequence number that supports deduplication? Cost
> > seems
> > > >> like it would be high however, and I'm not even sure if bookkeeper
> > > >> supports
> > > >> it.
> > > >
> > > >
> > > >> Cheers,
> > > >> Cameron
> > > >>
> > > >> >
> > > >> > >
> > > >> > > Thanks,
> > > >> > > Cameron
> > > >> > >
> > > >> > >
> > > >> > > On Sat, Oct 8, 2016 at 7:35 AM, Leigh Stewart
> > > >> > <lstewart@twitter.com.invalid
> > > >> > > >
> > > >> > > wrote:
> > > >> > >
> > > >> > > > Cameron:
> > > >> > > > Another thing we've discussed but haven't really thought
> > through -
> > > >> > > > We might be able to support some kind of epoch write request,
> > > where
> > > >> the
> > > >> > > > epoch is guaranteed to have changed if the writer has changed
> or
> > > the
> > > >> > > ledger
> > > >> > > > was ever fenced off. Writes include an epoch and are rejected
> if
> > > the
> > > >> > > epoch
> > > >> > > > has changed.
> > > >> > > > With a mechanism like this, fencing the ledger off after a
> > failure
> > > >> > would
> > > >> > > > ensure any pending writes had either been written or would be
> > > >> rejected.
> > > >> > > >
> > > >> > > >
> > > >> > > > On Sat, Oct 8, 2016 at 7:10 AM, Sijie Guo <si...@apache.org>
> > > wrote:
> > > >> > > >
> > > >> > > > > Cameron,
> > > >> > > > >
> > > >> > > > > I think both Leigh and Xi had made a few good points about
> > your
> > > >> > > question.
> > > >> > > > >
> > > >> > > > > To add one more point to your question - "but I am not
> > > >> > > > > 100% of how all of the futures in the code handle failures.
> > > >> > > > > If not, where in the code would be the relevant places to
> add
> > > the
> > > >> > > ability
> > > >> > > > > to do this, and would the project be interested in a pull
> > > >> request?"
> > > >> > > > >
> > > >> > > > > The current proxy and client logic isn't perfect at
> > > >> handling
> > > >> > > > > failures (duplicates) - the strategy now is the client will
> > > retry
> > > >> as
> > > >> > > best
> > > >> > > > > as it can before throwing exceptions to users. The code you
> > are
> > > >> > looking
> > > >> > > > for
> > > >> > > > > - it is on BKLogSegmentWriter for the proxy handling writes
> > and
> > > >> it is
> > > >> > > on
> > > >> > > > > DistributedLogClientImpl for the proxy client handling
> > responses
> > > >> from
> > > >> > > > > proxies. Does this help you?
> > > >> > > > >
> > > >> > > > > And also, you are welcome to contribute the pull requests.
> > > >> > > > >
> > > >> > > > > - Sijie
> > > >> > > > >
> > > >> > > > >
> > > >> > > > >
> > > >> > > > > On Tue, Oct 4, 2016 at 3:39 PM, Cameron Hatfield <
> > > >> kinguy@gmail.com>
> > > >> > > > wrote:
> > > >> > > > >
> > > >> > > > > > I have a question about the Proxy Client. Basically, for
> our
> > > use
> > > >> > > cases,
> > > >> > > > > we
> > > >> > > > > > want to guarantee ordering at the key level, irrespective
> of
> > > the
> > > >> > > > ordering
> > > >> > > > > > of the partition it may be assigned to as a whole. Due to
> > the
> > > >> > source
> > > >> > > of
> > > >> > > > > the
> > > >> > > > > > data (HBase Replication), we cannot guarantee that a
> single
> > > >> > partition
> > > >> > > > > will
> > > >> > > > > > be owned for writes by the same client. This means the
> proxy
> > > >> client
> > > >> > > > works
> > > >> > > > > > well (since we don't care which proxy owns the partition
> we
> > > are
> > > >> > > writing
> > > >> > > > > > to).
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > > > However, the guarantees we need when writing a batch
> > consist
> > > >> of:
> > > >> > > > > > Definition of a Batch: The set of records sent to the
> > > writeBatch
> > > >> > > > endpoint
> > > >> > > > > > on the proxy
> > > >> > > > > >
> > > >> > > > > > 1. Batch success: If the client receives a success from
> the
> > > >> proxy,
> > > >> > > then
> > > >> > > > > > that batch is successfully written
> > > >> > > > > >
> > > >> > > > > > 2. Inter-Batch ordering : Once a batch has been written
> > > >> > successfully
> > > >> > > by
> > > >> > > > > the
> > > >> > > > > > client, when another batch is written, it will be
> guaranteed
> > > to
> > > >> be
> > > >> > > > > ordered
> > > >> > > > > > after the last batch (if it is the same stream).
> > > >> > > > > >
> > > >> > > > > > 3. Intra-Batch ordering: Within a batch of writes, the
> > records
> > > >> will
> > > >> > > be
> > > >> > > > > > committed in order
> > > >> > > > > >
> > > >> > > > > > 4. Intra-Batch failure ordering: If an individual record
> > fails
> > > >> to
> > > >> > > write
> > > >> > > > > > within a batch, all records after that record will not be
> > > >> written.
> > > >> > > > > >
> > > >> > > > > > 5. Batch Commit: Guarantee that if a batch returns a
> > success,
> > > it
> > > >> > will
> > > >> > > > be
> > > >> > > > > > written
> > > >> > > > > >
> > > >> > > > > > 6. Read-after-write: Once a batch is committed, within a
> > > limited
> > > >> > > > > time-frame
> > > >> > > > > > it will be able to be read. This is required in the case
> of
> > > >> > failure,
> > > >> > > so
> > > >> > > > > > that the client can see what actually got committed. I
> > believe
> > > >> the
> > > >> > > > > > time-frame part could be removed if the client can send in
> > the
> > > >> same
> > > >> > > > > > sequence number that was written previously, since it
> would
> > > then
> > > >> > fail
> > > >> > > > and
> > > >> > > > > > we would know that a read needs to occur.
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > > > So, my basic question is if this is currently possible in
> > the
> > > >> > proxy?
> > > >> > > I
> > > >> > > > > > don't believe it gives these guarantees as it stands
> today,
> > > but
> > > >> I
> > > >> > am
> > > >> > > > not
> > > >> > > > > > 100% of how all of the futures in the code handle
> failures.
> > > >> > > > > > If not, where in the code would be the relevant places to
> > add
> > > >> the
> > > >> > > > ability
> > > >> > > > > > to do this, and would the project be interested in a pull
> > > >> request?
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > > > Thanks,
> > > >> > > > > > Cameron
> > > >> > > > > >
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > > >
> > >
> >
>

Re: Proxy Client - Batch Ordering / Commit

Posted by Xi Liu <xi...@gmail.com>.
On Mon, Dec 5, 2016 at 9:05 AM, Leigh Stewart <ls...@twitter.com.invalid>
wrote:

> Great discussion here :)
>
> Have you started any work for this? I just updated the proposal page -
> > https://cwiki.apache.org/confluence/display/DL/DP-2+-+Epoch+Write+Support
> > Maybe we can work together with this.
>
>
> Looks good Xi.
>
> If I understand the proposal correctly, this gets us, in the thin client:
> a) ability to achieve a very strict form of consistency, supporting e.g.
> exactly-once updates
> b) exclusive ownership
>

Yes, this is correct. Also the other proposal DP-1 (transaction support)
would leverage this proposal here for achieving idempotent writes on thin
clients.


>
> I think there may also be a need for some kind of large atomic update
> functionality, which is a different way of looking at
> consistent/transactional updates.
>
> Cameron, earlier Sijie asked if you need > 1MB writes - is that a
> requirement for you? If so, epoch write may not meet all of your
> requirements above.
>

If I understand correctly, with fencing operation, the client itself can
de-duplicate 'lost-ack' records by checking them after fencing.


>
> We actually have a fairly urgent business need to support something like
> this. Our two use cases are:
> 1. apply a set of logically separate writes as an atomic batch (we want
> them all to be visible at the same time, or not at all). It might be useful to
> be able to aggregate updates and apply at a time of our choosing.
> 2. large writes: DL writes now have a 1MB limit. We will need to apply
> updates which are larger than 1MB atomically.
>

I am working on finalizing my proposal on DP-1 (transaction support). I
probably can send it out by end of this week.
We can collaborate on this.


>
> On Fri, Nov 18, 2016 at 2:37 PM, Sijie Guo <si...@apache.org> wrote:
>
> > On Thu, Nov 17, 2016 at 2:30 AM, Xi Liu <xi...@gmail.com> wrote:
> >
> > > Cameron,
> > >
> > > Thank you for your comments. It's very helpful. My replies are inline.
> > >
> > > On Wed, Nov 16, 2016 at 11:59 AM, Cameron Hatfield <ki...@gmail.com>
> > > wrote:
> > >
> > > > "A couple of questions" is what I originally wrote, and then the
> > > following
> > > > happened. Sorry about the large swath of them, making sure my
> > > understanding
> > > > of the code base, as well as the DL/Bookkeeper/ZK ecosystem
> > interaction,
> > > > makes sense.
> > > >
> > > > ==General:
> > > > What is an exclusive session? What is it providing over a regular
> > > session?
> > > >
> > >
> > > The idea here is to provide exclusive writer semantics for the
> > > distributedlog (thin) client use case.
> > > So it can have similar behavior to using the storage
> > (distributedlog-core)
> > > library directly.
> > >
> >
> > +1 for an exclusive writer feature. Leigh and I were talking about this
> > before. Glad to see it is happening now. However it might be worth
> > separating the exclusive writer feature into a separate task once we have
> > the 'fencing' feature available.
> >
> >
> > >
> > >
> > > >
> > > >
> > > > ==Proxy side:
> > > > Should a new streamop be added for the fencing operation, or does it
> > make
> > > > sense to piggyback on an existing one (such as write)?
> > >
> > >
> > > > ====getLastDLSN:
> > > > What should be the return result for:
> > > > A new stream
> > > > A new session, after successful fencing
> > > > A new session, after a change in ownership / first starting up
> > > >
> > > > What is the main use case for getLastDLSN(<stream>, false)? Is this
> to
> > > > guarantee that the recovery process has happened in case of ownership
> > > > failure (I don't have a good understanding of what causes the
> recovery
> > > > process to happen, especially from the reader side)? Or is it to
> handle
> > > the
> > > > lost-ack problem? Since all the rest of the read related things go
> > > through
> > > > the read client, I'm not sure if I see the use case, but it seems
> like
> > > > there would be a large potential for confusion on which to use. What
> > > about
> > > > just a fenceSession op, that always fences, returning the DLSN of the
> > > > fence, and leave the normal getLastDLSN for the regular read client.
> > > >
> > >
> > >
> > > Ah, your point is valid. I was following the style of bookkeeper. The
> > > bookkeeper client supplies a fencing flag on readLastAddConfirmed
> request
> > > during recovery.
> > >
> > > But you are right. It is clearer to have just a fenceSessionOp and return
> > the
> > > DLSN of the fence request.
> > >
> >
> >
> > Correct. In the bookkeeper client, the fencing flag is set with a read op.
> > However it is part of the recovery procedure and internal to the client.
> I
> > agree with Cameron that we should hide the details from the public.
> Otherwise,
> > it will cause confusion.
> >
> >
> > >
> > >
> > >
> > > >
> > > > ====Fencing:
> > > > When a fence session occurs, what call needs to be made to make sure
> > any
> > > > outstanding writes are flushed and committed (so that we guarantee
> the
> > > > client will be able to read anything that was in the write queue)?
> > > > Is there a guaranteed ordering for things written in the future queue
> > for
> > > > AsyncLogWriter (I'm not quite confident that I was able to accurately
> > > > follow the logic, as there are many parts of the code that write,
> have
> > > > queues, heartbeat, etc)?
> > > >
> > >
> > > I believe that when calling AsyncLogWriter asyncWrite in order, the
> > > records will be written in order. Sijie or Leigh can confirm that.
> > >
> >
> > That's right. The writes are guaranteed to be written in the order in which
> > they are issued.
> >
> >
> > >
> > > Since the writes are in order, when a fence session occurs, the control
> > > record written successfully by the writer will guarantee the writes
> > called
> > > before the fence session write are flushed and committed.
> > >
> > > We need to invalidate the session when we fence the session. So any
> > writes
> > > with the old session that come after writing the control record will be
> rejected.
> > In
> > > this way, it can guarantee the client will be able to read anything in
> a
> > > consistent way.
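> > >
> > > As a rough sketch of that sequence on the proxy side (the names are
> > > illustrative only, not the actual proxy code):
> > >
> > >   synchronized Future<DLSN> fenceSession() {
> > >     long newSessionId = txnIdGenerator.nextId();  // advance the transaction id generator
> > >     LogRecord control = new LogRecord(newSessionId, new byte[0]);
> > >     control.setControl();                         // mark it as a control record
> > >     // earlier writes are ordered ahead of this record, so its DLSN bounds
> > >     // everything the old session could have written
> > >     Future<DLSN> fenceDlsn = writer.write(control);
> > >     currentSessionId = newSessionId;              // invalidate the old session;
> > >     return fenceDlsn;                             // later writes carrying it are rejected
> > >   }
> > >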
> > >
> > >
> > > >
> > > > ====SessionID:
> > > > What is the default sessionid / transactionid for a new stream? I
> > assume
> > > > this would just be the first control record
> > > >
> > >
> > > The initial session id will be the transaction id of the first control
> > > record.
> > >
> > >
> > > >
> > > > ======Should all streams have a sessionid by default, regardless of
> > > > whether it is ever used by a client (aka, every time ownership changes,
> > > > a new control record is generated, and a sessionid is stored)?
> > > > Main edge case that would have to be handled is if a client writes
> with
> > > an
> > > > old sessionid, but the owner has changed and has yet to create a
> > > sessionid.
> > > > This should be handled by the "non-matching sessionid" rule, since
> the
> > > > invalid sessionid wouldn't match the passed sessionid, which should
> > cause
> > > > the client to get a new sessionid.
> > > >
> > >
> > > I think all streams should just have a session id by default. The
> session
> > > id is changed when ownership is changed or it is explicitly bumped by a
> > > fence session op.
> > >
> > >
> > > >
> > > > ======Where in the code does it make sense to own the session, the
> > stream
> > > > interfaces / classes? Should they pass that information down to the
> > ops,
> > > or
> > > > do the sessionid check within?
> > > > My first thought would be Stream owns the sessionid, passes it into
> the
> > > ops
> > > > (as either a nullable value, or an invalid default value), which then
> > do
> > > > the sessionid check if they care. The main issue is updating the
> > > sessionid
> > > > is a bit backwards, as either every op has the ability to update it
> > > through
> > > > some type of return value / direct stream access / etc, or there is a
> > > > special case in the stream for the fence operation / any other
> > operation
> > > > that can update the session.
> > > >
> > >
> > > My thought is to add session id in Stream (StreamImpl.java). The stream
> > > validates the session id before submitting a stream op. If it is a
> fence
> > > session op, it would just invalidate the current session, so the
> > subsequent
> > > requests with the old session will be rejected.
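> > >
> > > i.e. roughly (illustrative only, not the existing StreamImpl code):
> > >
> > >   // hypothetical check before submitting a stream op
> > >   if (op.hasSessionId() && op.getSessionId() != currentSessionId) {
> > >     op.fail(new SessionFencedException(currentSessionId)); // carry the new session id back
> > >     return;
> > >   }
> > >   submitOp(op);
> > >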
> > >
> >
> > >
> > > >
> > > > ======For "the owner of the log stream will first advance the
> > transaction
> > > > id generator to claim a new transaction id and write a control record
> > to
> > > > the log stream. ":
> > > > Should "DistributedLogConstants.CONTROL_RECORD_CONTENT" be the type
> of
> > > > control record written?
> > > > Should the "writeControlRecord" on the BKAsyncLogWriter be exposed in
> > the
> > > > AsyncLogWriter interface?  Or even in the one within the
> > > segment
> > > > writer? Or should the code be duplicated / pulled out into a helper /
> > > etc?
> > > > (Not a big Java person, so any suggestions on the "Java Way", or at
> > least
> > > > the DL way, to do it would be appreciated)
> > > >
> > >
> > > I believe we can just construct a log record and set it to be a control
> > > record and write it.
> > >
> > > LogRecord record = ...
> > > record.setControl();
> > > writer.write(record);
> > >
> > > (Can anyone from the community confirm that it is okay to write a control
> > > record in this way?)
> > >
> >
> > Ideally we would like to hide the logic from public usage. However, since
> > it is a code change at proxy side, it is absolutely fine.
> >
> >
> > >
> > >
> > >
> > > >
> > > > ======Transaction ID:
> > > > The BKLogSegmentWriter ignores the transaction ids from control
> records
> > > > when it records the "LastTXId." Would that be an issue here for
> > anything?
> > > > It looks like it may do that because it assumes you're calling its
> > local
> > > > function for writing a control record, which uses the lastTxId.
> > > >
> > >
> > > I think the lastTxId is for the last transaction id of user records. So
> > we
> > > probably don't change the behavior on how we record the lastTxId.
> However
> > > we can change how we fetch the last tx id for id generation after
> > > recovery.
> > >
> >
> >
> > getLastTxId should have a flag to include control records or not.
> >
> >
> > >
> > >
> > > >
> > > >
> > > > ==Thrift Interface:
> > > > ====Should the write response be split out for different calls?
> > > > It seems odd to have a single struct with many optional items that
> are
> > > > filled depending on the call made for every rpc call. This is mostly
> a
> > > > curiosity question, since I assume it comes from the general
> practices
> > > from
> > > > using thrift for a while. Would it at least make sense for the
> > > > getLastDLSN/fence endpoint to have a new struct?
> > > >
> > >
> > > I don't have any preference here. It might make sense to have a new
> > struct.
> > >
> > >
> > > >
> > > > ====Any particular error code that makes sense for session fenced? If
> > we
> > > > want to be close to the HTTP errors, looks like 412 (PRECONDITION
> > FAILED)
> > > > might make the most sense, if a bit generic.
> > > >
> > > > 412 def:
> > > > "The precondition given in one or more of the request-header fields
> > > > evaluated to false when it was tested on the server. This response
> code
> > > > allows the client to place preconditions on the current resource
> > > > metainformation (header field data) and thus prevent the requested
> > method
> > > > from being applied to a resource other than the one intended."
> > > >
> > > > 412 excerpt from the If-match doc:
> > > > "This behavior is most useful when the client wants to prevent an
> > > updating
> > > > method, such as PUT, from modifying a resource that has changed since
> > the
> > > > client last retrieved it."
> > > >
> > >
> > > I liked your idea.
> > >
> > >
> > > >
> > > > ====Should we return the sessionid to the client in the
> "fencesession"
> > > > calls?
> > > > Seems like it may be useful when you fence, especially if you have
> some
> > > > type of custom sequencer where it would make sense for search, or for
> > > > debugging.
> > > > Main minus is that it would be easy for users to create an implicit
> > > > requirement that the sessionid is forever a valid transactionid,
> which
> > > may
> > > > not always be the case long term for the project.
> > > >
> > >
> > > I think we should probably start with not returning it and add it if we really
> > need
> > > it.
> > >
> > >
> > > >
> > > >
> > > > ==Client:
> > > > ====What is the proposed process for the client retrieving the new
> > > > sessionid?
> > > > A full reconnect? No special case code, but intrusive on the client
> > side,
> > > > and possibly expensive garbage/processing wise. (though this type of
> > > > failure should hopefully be rare enough to not be a problem)
> > > > A call to reset the sessionid? Less intrusive, all the issues you get
> > > with
> > > > mutable object methods that need to be called in a certain order,
> edge
> > > > cases such as outstanding/buffered requests to the old stream, etc.
> > > > The call could also return the new sessionid, making it a good call
> for
> > > > storing or debugging the value.
> > > >
> > >
> > > It is probably not good to tie a session to a connection, especially since
> > > the underlying communication is an RPC framework.
> > >
> > > I think the new session id can be piggybacked with the rejected response. The
> > > client doesn't have to explicitly retrieve a new session id.
> > >
> >
> > +1 for separating session from connection. Especially the 'session' here
> is
> > more a lifecycle concept for a stream.
> >
> >
> >
> > >
> > >
> > > >
> > > > ====Session Fenced failure:
> > > > Will this put the client into a failure state, stopping all future
> > writes
> > > > until fixed?
> > > >
> > >
> > > My thought is the new session id will be piggybacked with the fence response.
> > So
> > > the client will know the new session id and all the future writes will
> > just
> > > carry the new session id.
> > >
> > >
> > > > Is it even possible to get this error when ownership changes? The
> > > > connection to the new owner should get a new sessionid on connect,
> so I
> > > > would expect not.
> > > >
> > >
> > > I think it will. But the client should handle it and retry, as a session
> > > fenced exception indicates that the write was never attempted.
> > >
> > >
> > > >
> > > > Cheers,
> > > > Cameron
> > > >
> > > > On Tue, Nov 15, 2016 at 2:01 AM, Xi Liu <xi...@gmail.com>
> wrote:
> > > >
> > > > > Thank you, Cameron. Look forward to your comments.
> > > > >
> > > > > - Xi
> > > > >
> > > > > On Sun, Nov 13, 2016 at 1:21 PM, Cameron Hatfield <
> kinguy@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Sorry, I've been on vacation for the past week, and heads down
> for
> > a
> > > > > > release that is using DL at the end of Nov. I'll take a look at
> > this
> > > > over
> > > > > > the next week, and add any relevant comments. After we are
> finished
> > > > with
> > > > > > dev for this release, I am hoping to tackle this next.
> > > > > >
> > > > > > -Cameron
> > > > > >
> > > > > > On Fri, Nov 11, 2016 at 12:07 PM, Sijie Guo <si...@apache.org>
> > > wrote:
> > > > > >
> > > > > > > Xi,
> > > > > > >
> > > > > > > Thank you so much for your proposal. I took a look. It looks
> fine
> > > to
> > > > > me.
> > > > > > > Cameron, do you have any comments?
> > > > > > >
> > > > > > > Look forward to your pull requests.
> > > > > > >
> > > > > > > - Sijie
> > > > > > >
> > > > > > >
> > > > > > > On Wed, Nov 9, 2016 at 2:34 AM, Xi Liu <xi...@gmail.com>
> > > wrote:
> > > > > > >
> > > > > > > > Cameron,
> > > > > > > >
> > > > > > > > Have you started any work for this? I just updated the
> proposal
> > > > page
> > > > > -
> > > > > > > > https://cwiki.apache.org/confluence/display/DL/DP-2+-+Epoch+Write+Support
> > > > > > > > Maybe we can work together with this.
> > > > > > > >
> > > > > > > > Sijie, Leigh,
> > > > > > > >
> > > > > > > > can you guys help review this to make sure our proposal is in
> > the
> > > > > right
> > > > > > > > direction?
> > > > > > > >
> > > > > > > > - Xi
> > > > > > > >
> > > > > > > > On Tue, Nov 1, 2016 at 3:05 AM, Sijie Guo <si...@apache.org>
> > > > wrote:
> > > > > > > >
> > > > > > > > > I created https://issues.apache.org/jira/browse/DL-63 for
> > > > tracking
> > > > > > the
> > > > > > > > > proposed idea here.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Wed, Oct 26, 2016 at 4:53 PM, Sijie Guo
> > > > > > <sijieg@twitter.com.invalid
> > > > > > > >
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > On Tue, Oct 25, 2016 at 11:30 AM, Cameron Hatfield <
> > > > > > kinguy@gmail.com
> > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Yes, we are reading the HBase WAL (from their
> replication
> > > > > plugin
> > > > > > > > > > support),
> > > > > > > > > > > and writing that into DL.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Gotcha.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > From the sounds of it, yes, it would. Only thing I
> would
> > > say
> > > > is
> > > > > > > make
> > > > > > > > > the
> > > > > > > > > > > epoch requirement optional, so that if I client doesn't
> > > care
> > > > > > about
> > > > > > > > > dupes
> > > > > > > > > > > they don't have to deal with the process of getting a
> new
> > > > > epoch.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Yup. This should be optional. I can start a wiki page on
> > how
> > > we
> > > > > > want
> > > > > > > to
> > > > > > > > > > implement this. Are you interested in contributing to
> this?
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > -Cameron
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Oct 19, 2016 at 7:43 PM, Sijie Guo
> > > > > > > > <sijieg@twitter.com.invalid
> > > > > > > > > >
> > > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > On Wed, Oct 19, 2016 at 7:17 PM, Sijie Guo <
> > > > > sijieg@twitter.com
> > > > > > >
> > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Monday, October 17, 2016, Cameron Hatfield <
> > > > > > > kinguy@gmail.com>
> > > > > > > > > > > wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > >> Answer inline:
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> On Mon, Oct 17, 2016 at 11:46 AM, Sijie Guo <
> > > > > > sijie@apache.org
> > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> > Cameron,
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > Thank you for your summary. I liked the
> discussion
> > > > > here. I
> > > > > > > > also
> > > > > > > > > > > liked
> > > > > > > > > > > > >> the
> > > > > > > > > > > > >> > summary of your requirement -
> > > 'single-writer-per-key,
> > > > > > > > > > > > >> > multiple-writers-per-log'. If I understand
> > > correctly,
> > > > > the
> > > > > > > core
> > > > > > > > > > > concern
> > > > > > > > > > > > >> here
> > > > > > > > > > > > >> > is almost 'exact-once' write (or a way to
> explicit
> > > > tell
> > > > > > if a
> > > > > > > > > write
> > > > > > > > > > > can
> > > > > > > > > > > > >> be
> > > > > > > > > > > > >> > retried or not).
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > Comments inline.
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > On Fri, Oct 14, 2016 at 11:17 AM, Cameron
> > Hatfield <
> > > > > > > > > > > kinguy@gmail.com>
> > > > > > > > > > > > >> > wrote:
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > > > Ah- yes good point (to be clear we're not
> > using
> > > > the
> > > > > > > proxy
> > > > > > > > > this
> > > > > > > > > > > way
> > > > > > > > > > > > >> > > today).
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> > > > > Due to the source of the
> > > > > > > > > > > > >> > > > > data (HBase Replication), we cannot
> > guarantee
> > > > > that a
> > > > > > > > > single
> > > > > > > > > > > > >> partition
> > > > > > > > > > > > >> > > will
> > > > > > > > > > > > >> > > > > be owned for writes by the same client.
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> > > > Do you mean you *need* to support multiple
> > > writers
> > > > > > > issuing
> > > > > > > > > > > > >> interleaved
> > > > > > > > > > > > >> > > > writes or is it just that they might
> sometimes
> > > > > > > interleave
> > > > > > > > > > writes
> > > > > > > > > > > > and
> > > > > > > > > > > > >> > you
> > > > > > > > > > > > >> > > >don't care?
> > > > > > > > > > > > >> > > How HBase partitions the keys being written
> > > wouldn't
> > > > > > have
> > > > > > > a
> > > > > > > > > > > one->one
> > > > > > > > > > > > >> > > mapping with the partitions we would have in
> > > HBase.
> > > > > Even
> > > > > > > if
> > > > > > > > we
> > > > > > > > > > did
> > > > > > > > > > > > >> have
> > > > > > > > > > > > >> > > that alignment when the cluster first started,
> > > HBase
> > > > > > will
> > > > > > > > > > > rebalance
> > > > > > > > > > > > >> what
> > > > > > > > > > > > >> > > servers own what partitions, as well as split
> > and
> > > > > merge
> > > > > > > > > > partitions
> > > > > > > > > > > > >> that
> > > > > > > > > > > > >> > > already exist, causing eventual drift from one
> > log
> > > > per
> > > > > > > > > > partition.
> > > > > > > > > > > > >> > > Because we want ordering guarantees per key
> (row
> > > in
> > > > > > > hbase),
> > > > > > > > we
> > > > > > > > > > > > >> partition
> > > > > > > > > > > > >> > > the logs by the key. Since multiple writers
> are
> > > > > possible
> > > > > > > per
> > > > > > > > > > range
> > > > > > > > > > > > of
> > > > > > > > > > > > >> > keys
> > > > > > > > > > > > >> > > (due to the aforementioned rebalancing /
> > > splitting /
> > > > > etc
> > > > > > > of
> > > > > > > > > > > hbase),
> > > > > > > > > > > > we
> > > > > > > > > > > > >> > > cannot use the core library due to requiring a
> > > > single
> > > > > > > writer
> > > > > > > > > for
> > > > > > > > > > > > >> > ordering.
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> > > But, for a single log, we don't really care
> > about
> > > > > > ordering
> > > > > > > > > aside
> > > > > > > > > > > > from
> > > > > > > > > > > > >> at
> > > > > > > > > > > > >> > > the per-key level. So all we really need to be
> > > able
> > > > to
> > > > > > > > handle
> > > > > > > > > is
> > > > > > > > > > > > >> > preventing
> > > > > > > > > > > > >> > > duplicates when a failure occurs, and ordering
> > > > > > consistency
> > > > > > > > > > across
> > > > > > > > > > > > >> > requests
> > > > > > > > > > > > >> > > from a single client.
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> > > So our general requirements are:
> > > > > > > > > > > > >> > > Write A, Write B
> > > > > > > > > > > > >> > > Timeline: A -> B
> > > > > > > > > > > > >> > > Request B is only made after A has
> successfully
> > > > > returned
> > > > > > > > > > (possibly
> > > > > > > > > > > > >> after
> > > > > > > > > > > > >> > > retries)
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> > > 1) If the write succeeds, it will be durably
> > > exposed
> > > > > to
> > > > > > > > > clients
> > > > > > > > > > > > within
> > > > > > > > > > > > >> > some
> > > > > > > > > > > > >> > > bounded time frame
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > Guaranteed.
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> > > 2) If A succeeds and B succeeds, the ordering
> > for
> > > > the
> > > > > > log
> > > > > > > > will
> > > > > > > > > > be
> > > > > > > > > > > A
> > > > > > > > > > > > >> and
> > > > > > > > > > > > >> > > then B
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > If I understand correctly here, B is only sent
> > > after A
> > > > > is
> > > > > > > > > > returned,
> > > > > > > > > > > > >> right?
> > > > > > > > > > > > >> > If that's the case, It is guaranteed.
> > > > > > > > > > > > >>
> > > > > > > > > > > > >>
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > > 3) If A fails due to an error that can be
> relied
> > > on
> > > > to
> > > > > > > *not*
> > > > > > > > > be
> > > > > > > > > > a
> > > > > > > > > > > > lost
> > > > > > > > > > > > >> > ack
> > > > > > > > > > > > >> > > problem, it will never be exposed to the
> client,
> > > so
> > > > it
> > > > > > may
> > > > > > > > > > > > (depending
> > > > > > > > > > > > >> on
> > > > > > > > > > > > >> > > the error) be retried immediately
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > If it is not a lost-ack problem, the entry will
> be
> > > > > > exposed.
> > > > > > > it
> > > > > > > > > is
> > > > > > > > > > > > >> > guaranteed.
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> Let me try rephrasing the questions, to make sure
> > I'm
> > > > > > > > > understanding
> > > > > > > > > > > > >> correctly:
> > > > > > > > > > > > >> If A fails, with an error such as "Unable to
> create
> > > > > > connection
> > > > > > > > to
> > > > > > > > > > > > >> bookkeeper server", that would be the type of
> error
> > we
> > > > > would
> > > > > > > > > expect
> > > > > > > > > > to
> > > > > > > > > > > > be
> > > > > > > > > > > > >> able to retry immediately, as that result means no
> > > > action
> > > > > > was
> > > > > > > > > taken
> > > > > > > > > > on
> > > > > > > > > > > > any
> > > > > > > > > > > > >> log / etc, so no entry could have been created.
> This
> > > is
> > > > > > > > different
> > > > > > > > > > > then a
> > > > > > > > > > > > >> "Connection Timeout" exception, as we just might
> not
> > > > have
> > > > > > > > gotten a
> > > > > > > > > > > > >> response
> > > > > > > > > > > > >> in time.
> > > > > > > > > > > > >>
> > > > > > > > > > > > >>
> > > > > > > > > > > > > Gotcha.
> > > > > > > > > > > > >
> > > > > > > > > > > > > The response code returned from proxy can tell if a
> > > > failure
> > > > > > can
> > > > > > > > be
> > > > > > > > > > > > retried
> > > > > > > > > > > > > safely or not. (We might need to make them well
> > > > documented)
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > > 4) If A fails due to an error that could be a
> > lost
> > > > ack
> > > > > > > > problem
> > > > > > > > > > > > >> (network
> > > > > > > > > > > > >> > > connectivity / etc), within a bounded time
> frame
> > > it
> > > > > > should
> > > > > > > > be
> > > > > > > > > > > > >> possible to
> > > > > > > > > > > > >> > > find out if the write succeed or failed.
> Either
> > by
> > > > > > reading
> > > > > > > > > from
> > > > > > > > > > > some
> > > > > > > > > > > > >> > > checkpoint of the log for the changes that
> > should
> > > > have
> > > > > > > been
> > > > > > > > > made
> > > > > > > > > > > or
> > > > > > > > > > > > >> some
> > > > > > > > > > > > >> > > other possible server-side support.
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > If I understand this correctly, it is a
> > duplication
> > > > > issue,
> > > > > > > > > right?
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > Can a de-duplication solution work here? Either
> DL
> > > or
> > > > > your
> > > > > > > > > client
> > > > > > > > > > > does
> > > > > > > > > > > > >> the
> > > > > > > > > > > > >> > de-duplication?
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> The requirements I'm mentioning are the ones
> needed
> > > for
> > > > > > > > > client-side
> > > > > > > > > > > > >> dedupping. Since if I can guarantee writes being
> > > exposed
> > > > > > > within
> > > > > > > > > some
> > > > > > > > > > > > time
> > > > > > > > > > > > >> frame, and I can never get into an inconsistently
> > > > ordered
> > > > > > > state
> > > > > > > > > when
> > > > > > > > > > > > >> successes happen, when an error occurs, I can
> always
> > > > wait
> > > > > > for
> > > > > > > > max
> > > > > > > > > > time
> > > > > > > > > > > > >> frame, read the latest writes, and then dedup
> > locally
> > > > > > against
> > > > > > > > the
> > > > > > > > > > > > request
> > > > > > > > > > > > >> I
> > > > > > > > > > > > >> just made.
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> The main thing about that timeframe is that its
> > > > basically
> > > > > > the
> > > > > > > > > > addition
> > > > > > > > > > > > of
> > > > > > > > > > > > >> every timeout, all the way down in the system,
> > > combined
> > > > > with
> > > > > > > > > > whatever
> > > > > > > > > > > > >> flushing / caching / etc times are at the
> > bookkeeper /
> > > > > > client
> > > > > > > > > level
> > > > > > > > > > > for
> > > > > > > > > > > > >> when values are exposed
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Gotcha.
> > > > > > > > > > > > >
> > > > > > > > > > > > >>
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > Is there any ways to identify your write?
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > I can think of a case as follow - I want to know
> > > what
> > > > is
> > > > > > > your
> > > > > > > > > > > expected
> > > > > > > > > > > > >> > behavior from the log.
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > a)
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > If a hbase region server A writes a change of
> key
> > K
> > > to
> > > > > the
> > > > > > > > log,
> > > > > > > > > > the
> > > > > > > > > > > > >> change
> > > > > > > > > > > > >> > is successfully made to the log;
> > > > > > > > > > > > >> > but server A is down before receiving the
> change.
> > > > > > > > > > > > >> > region server B took over the region that
> contains
> > > K,
> > > > > what
> > > > > > > > will
> > > > > > > > > B
> > > > > > > > > > > do?
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> HBase writes in large chunks (WAL Logs), which its
> > > > > > replication
> > > > > > > > > > system
> > > > > > > > > > > > then
> > > > > > > > > > > > >> handles by replaying in the case of failure. If
> I'm
> > > in a
> > > > > > > middle
> > > > > > > > > of a
> > > > > > > > > > > > log,
> > > > > > > > > > > > >> and the whole region goes down and gets
> rescheduled
> > > > > > > elsewhere, I
> > > > > > > > > > will
> > > > > > > > > > > > >> start
> > > > > > > > > > > > >> back up from the beginning of the log I was in the
> > > > middle
> > > > > > of.
> > > > > > > > > Using
> > > > > > > > > > > > >> checkpointing + deduping, we should be able to
> find
> > > out
> > > > > > where
> > > > > > > we
> > > > > > > > > > left
> > > > > > > > > > > > off
> > > > > > > > > > > > >> in the log.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > b) same as a). but server A was just network
> > > > > partitioned.
> > > > > > > will
> > > > > > > > > > both
> > > > > > > > > > > A
> > > > > > > > > > > > >> and B
> > > > > > > > > > > > >> > write the change of key K?
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> HBase gives us some guarantees around network
> > > partitions
> > > > > > > > > > (Consistency
> > > > > > > > > > > > over
> > > > > > > > > > > > >> availability for HBase). HBase is a single-master
> > > > failover
> > > > > > > > > recovery
> > > > > > > > > > > type
> > > > > > > > > > > > >> of
> > > > > > > > > > > > >> system, with zookeeper-based guarantees for single
> > > > owners
> > > > > > > > > (writers)
> > > > > > > > > > > of a
> > > > > > > > > > > > >> range of data.
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > > 5) If A is turned into multiple batches (one
> > large
> > > > > > request
> > > > > > > > > gets
> > > > > > > > > > > > split
> > > > > > > > > > > > >> > into
> > > > > > > > > > > > >> > > multiple smaller ones to the bookkeeper
> backend,
> > > due
> > > > > to
> > > > > > > log
> > > > > > > > > > > rolling
> > > > > > > > > > > > /
> > > > > > > > > > > > >> > size
> > > > > > > > > > > > >> > > / etc):
> > > > > > > > > > > > >> > >   a) The ordering of entries within batches
> have
> > > > > > ordering
> > > > > > > > > > > > consistence
> > > > > > > > > > > > >> > with
> > > > > > > > > > > > >> > > the original request, when exposed in the log
> > > > (though
> > > > > > they
> > > > > > > > may
> > > > > > > > > > be
> > > > > > > > > > > > >> > > interleaved with other requests)
> > > > > > > > > > > > >> > >   b) The ordering across batches have ordering
> > > > > > consistence
> > > > > > > > > with
> > > > > > > > > > > the
> > > > > > > > > > > > >> > > original request, when exposed in the log
> > (though
> > > > they
> > > > > > may
> > > > > > > > be
> > > > > > > > > > > > >> interleaved
> > > > > > > > > > > > >> > > with other requests)
> > > > > > > > > > > > >> > >   c) If a batch fails, and cannot be retried /
> > is
> > > > > > > > > unsuccessfully
> > > > > > > > > > > > >> retried,
> > > > > > > > > > > > >> > > all batches after the failed batch should not
> be
> > > > > exposed
> > > > > > > in
> > > > > > > > > the
> > > > > > > > > > > log.
> > > > > > > > > > > > >> > Note:
> > > > > > > > > > > > >> > > The batches before and including the failed
> > batch,
> > > > > that
> > > > > > > > ended
> > > > > > > > > up
> > > > > > > > > > > > >> > > succeeding, can show up in the log, again
> within
> > > > some
> > > > > > > > bounded
> > > > > > > > > > time
> > > > > > > > > > > > >> range
> > > > > > > > > > > > >> > > for reads by a client.
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > There is a method 'writeBulk' in
> > > DistributedLogClient
> > > > > can
> > > > > > > > > achieve
> > > > > > > > > > > this
> > > > > > > > > > > > >> > guarantee.
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > However, I am not very sure about how you will
> > turn
> > > A
> > > > > into
> > > > > > > > > > batches.
> > > > > > > > > > > If
> > > > > > > > > > > > >> you
> > > > > > > > > > > > >> > are dividing A into batches,
> > > > > > > > > > > > >> > you can simply control the application write
> > > sequence
> > > > to
> > > > > > > > achieve
> > > > > > > > > > the
> > > > > > > > > > > > >> > guarantee here.
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > Can you explain more about this?
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> In this case, by batches I mean what the proxy
> does
> > > with
> > > > > the
> > > > > > > > > single
> > > > > > > > > > > > >> request
> > > > > > > > > > > > >> that I send it. If the proxy decides it needs to
> > turn
> > > my
> > > > > > > single
> > > > > > > > > > > request
> > > > > > > > > > > > >> into multiple batches of requests, due to log
> > rolling,
> > > > > size
> > > > > > > > > > > limitations,
> > > > > > > > > > > > >> etc, those would be the guarantees I need to be
> able
> > > to
> > > > > > > > > deduplicate
> > > > > > > > > > on
> > > > > > > > > > > > the
> > > > > > > > > > > > >> client side.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > A single record written by #write and A record set
> > (set
> > > > of
> > > > > > > > records)
> > > > > > > > > > > > > written by #writeRecordSet are atomic - they will
> not
> > > be
> > > > > > broken
> > > > > > > > > down
> > > > > > > > > > > into
> > > > > > > > > > > > > entries (batches). With the correct response code,
> > you
> > > > > would
> > > > > > be
> > > > > > > > > able
> > > > > > > > > > to
> > > > > > > > > > > > > tell if it is a lost-ack failure or not. However
> > there
> > > > is a
> > > > > > > size
> > > > > > > > > > > > limitation
> > > > > > > > > > > > > for this - it cannot go beyond 1MB in the current
> > > > > > > implementation.
> > > > > > > > > > > > >
> > > > > > > > > > > > > What is your expected record size?
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> > > Since we can guarantee per-key ordering on the
> > > > client
> > > > > > > side,
> > > > > > > > we
> > > > > > > > > > > > >> guarantee
> > > > > > > > > > > > >> > > that there is a single writer per-key, just
> not
> > > per
> > > > > log.
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > Do you need fencing guarantee in the case of
> > network
> > > > > > > partition
> > > > > > > > > > > causing
> > > > > > > > > > > > >> > two-writers?
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > > So if there was a
> > > > > > > > > > > > >> > > way to guarantee a single write request as
> being
> > > > > written
> > > > > > > or
> > > > > > > > > not,
> > > > > > > > > > > > >> within a
> > > > > > > > > > > > >> > > certain time frame (since failures should be
> > rare
> > > > > > anyways,
> > > > > > > > > this
> > > > > > > > > > is
> > > > > > > > > > > > >> fine
> > > > > > > > > > > > >> > if
> > > > > > > > > > > > >> > > it is expensive), we can then have the client
> > > > > guarantee
> > > > > > > the
> > > > > > > > > > > ordering
> > > > > > > > > > > > >> it
> > > > > > > > > > > > >> > > needs.
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > This sounds an 'exact-once' write (regarding
> > > retries)
> > > > > > > > > requirement
> > > > > > > > > > to
> > > > > > > > > > > > me,
> > > > > > > > > > > > >> > right?
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> Yes. I'm curious of how this issue is handled by
> > > > > Manhattan,
> > > > > > > > since
> > > > > > > > > > you
> > > > > > > > > > > > can
> > > > > > > > > > > > >> imagine a data store that ends up getting multiple
> > > > writes
> > > > > > for
> > > > > > > > the
> > > > > > > > > > same
> > > > > > > > > > > > put
> > > > > > > > > > > > >> / get / etc, would be harder to use, and we are
> > > > basically
> > > > > > > trying
> > > > > > > > > to
> > > > > > > > > > > > create
> > > > > > > > > > > > >> a log like that for HBase.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > Are you guys replacing HBase WAL?
> > > > > > > > > > > > >
> > > > > > > > > > > > > In Manhattan case, the request will be first
> written
> > to
> > > > DL
> > > > > > > > streams
> > > > > > > > > by
> > > > > > > > > > > > > Manhattan coordinator. The Manhattan replica then
> > will
> > > > read
> > > > > > > from
> > > > > > > > > the
> > > > > > > > > > DL
> > > > > > > > > > > > > streams and apply the change. In the lost-ack case,
> > the
> > > > MH
> > > > > > > > > > coordinator
> > > > > > > > > > > > will
> > > > > > > > > > > > > just fail the request to client.
> > > > > > > > > > > > >
> > > > > > > > > > > > > My feeling here is your usage for HBase is a bit
> > > > different
> > > > > > from
> > > > > > > > how
> > > > > > > > > > we
> > > > > > > > > > > > use
> > > > > > > > > > > > > DL in Manhattan. It sounds like you read from a
> > source
> > > > > (HBase
> > > > > > > > WAL)
> > > > > > > > > > and
> > > > > > > > > > > > > write to DL. But I might be wrong.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> > > > Cameron:
> > > > > > > > > > > > >> > > > Another thing we've discussed but haven't
> > really
> > > > > > thought
> > > > > > > > > > > through -
> > > > > > > > > > > > >> > > > We might be able to support some kind of
> epoch
> > > > write
> > > > > > > > > request,
> > > > > > > > > > > > where
> > > > > > > > > > > > >> the
> > > > > > > > > > > > >> > > > epoch is guaranteed to have changed if the
> > > writer
> > > > > has
> > > > > > > > > changed
> > > > > > > > > > or
> > > > > > > > > > > > the
> > > > > > > > > > > > >> > > ledger
> > > > > > > > > > > > >> > > > was ever fenced off. Writes include an epoch
> > and
> > > > are
> > > > > > > > > rejected
> > > > > > > > > > if
> > > > > > > > > > > > the
> > > > > > > > > > > > >> > > epoch
> > > > > > > > > > > > >> > > > has changed.
> > > > > > > > > > > > >> > > > With a mechanism like this, fencing the
> ledger
> > > off
> > > > > > > after a
> > > > > > > > > > > failure
> > > > > > > > > > > > >> > would
> > > > > > > > > > > > >> > > > ensure any pending writes had either been
> > > written
> > > > or
> > > > > > > would
> > > > > > > > > be
> > > > > > > > > > > > >> rejected.
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> > > The issue would be how I guarantee the write I
> > > wrote
> > > > > to
> > > > > > > the
> > > > > > > > > > server
> > > > > > > > > > > > was
> > > > > > > > > > > > >> > > written. Since a network issue could happen on
> > the
> > > > > send
> > > > > > of
> > > > > > > > the
> > > > > > > > > > > > >> request,
> > > > > > > > > > > > >> > or
> > > > > > > > > > > > >> > > on the receive of the success response, an
> epoch
> > > > > > wouldn't
> > > > > > > > tell
> > > > > > > > > > me
> > > > > > > > > > > > if I
> > > > > > > > > > > > >> > can
> > > > > > > > > > > > >> > > successfully retry, as it could be
> successfully
> > > > > written
> > > > > > > but
> > > > > > > > > AWS
> > > > > > > > > > > > >> dropped
> > > > > > > > > > > > >> > the
> > > > > > > > > > > > >> > > connection for the success response. Since the
> > > epoch
> > > > > > would
> > > > > > > > be
> > > > > > > > > > the
> > > > > > > > > > > > same
> > > > > > > > > > > > >> > > (same ledger), I could write duplicates.
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> > > > We are currently proposing adding a
> > transaction
> > > > > > semantic
> > > > > > > > to
> > > > > > > > > dl
> > > > > > > > > > > to
> > > > > > > > > > > > >> get
> > > > > > > > > > > > >> > rid
> > > > > > > > > > > > >> > > > of the size limitation and the unaware-ness
> in
> > > the
> > > > > > proxy
> > > > > > > > > > client.
> > > > > > > > > > > > >> Here
> > > > > > > > > > > > >> > is
> > > > > > > > > > > > >> > > > our idea -
> > > > > > > > > > > > >> > > > http://mail-archives.apache.
> > > > org/mod_mbox/incubator-
> > > > > > > > > > > distributedlog
> > > > > > > > > > > > >> > > -dev/201609.mbox/%
> 3cCAAC6BxP5YyEHwG0ZCF5soh42X=
> > > > xuYwYm
> > > > > > > > > > > > >> > > <http://mail-archives.apache.
> > > > org/mod_mbox/incubator-
> > > > > > > > > > > > >> > distributedlog%0A-dev/201609.mbox/%
> > > > > > > 3cCAAC6BxP5YyEHwG0ZCF5soh
> > > > > > > > > > > > 42X=xuYwYm>
> > > > > > > > > > > > >> > > L4nXsYBYiofzxpVk6g@mail.gmail.com%3e
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> > > > I am not sure if your idea is similar as
> ours.
> > > but
> > > > > > we'd
> > > > > > > > like
> > > > > > > > > > to
> > > > > > > > > > > > >> > > collaborate
> > > > > > > > > > > > >> > > > with the community if anyone has the similar
> > > idea.
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> > > Our use case would be covered by transaction
> > > > support,
> > > > > > but
> > > > > > > > I'm
> > > > > > > > > > > unsure
> > > > > > > > > > > > >> if
> > > > > > > > > > > > >> > we
> > > > > > > > > > > > >> > > would need something that heavy weight for the
> > > > > > guarantees
> > > > > > > we
> > > > > > > > > > need.
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> > > Basically, the high level requirement here is
> > > > "Support
> > > > > > > > > > consistent
> > > > > > > > > > > > >> write
> > > > > > > > > > > > >> > > ordering for single-writer-per-key,
> > > > > > multi-writer-per-log".
> > > > > > > > My
> > > > > > > > > > > hunch
> > > > > > > > > > > > is
> > > > > > > > > > > > >> > > that, with some added guarantees to the proxy
> > (if
> > > it
> > > > > > isn't
> > > > > > > > > > already
> > > > > > > > > > > > >> > > supported), and some custom client code on our
> > > side
> > > > > for
> > > > > > > > > removing
> > > > > > > > > > > the
> > > > > > > > > > > > >> > > entries that actually succeed to write to
> > > > > DistributedLog
> > > > > > > > from
> > > > > > > > > > the
> > > > > > > > > > > > >> request
> > > > > > > > > > > > >> > > that failed, it should be a relatively easy
> > thing
> > > to
> > > > > > > > support.
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > Yup. I think it should not be very difficult to
> > > > support.
> > > > > > > There
> > > > > > > > > > might
> > > > > > > > > > > > be
> > > > > > > > > > > > >> > some changes in the server side.
> > > > > > > > > > > > >> > Let's figure out what will the changes be. Are
> you
> > > > guys
> > > > > > > > > interested
> > > > > > > > > > > in
> > > > > > > > > > > > >> > contributing?
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > Yes, we would be.
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> As a note, the one thing that we see as an issue
> > with
> > > > the
> > > > > > > client
> > > > > > > > > > side
> > > > > > > > > > > > >> dedupping is how to bound the range of data that
> > needs
> > > > to
> > > > > be
> > > > > > > > > looked
> > > > > > > > > > at
> > > > > > > > > > > > for
> > > > > > > > > > > > >> deduplication. As you can imagine, it is pretty
> easy
> > > to
> > > > > > bound
> > > > > > > > the
> > > > > > > > > > > bottom
> > > > > > > > > > > > >> of
> > > > > > > > > > > > >> the range, as that it just regular checkpointing
> of
> > > the
> > > > > DSLN
> > > > > > > > that
> > > > > > > > > is
> > > > > > > > > > > > >> returned. I'm still not sure if there is any nice
> > way
> > > to
> > > > > > time
> > > > > > > > > bound
> > > > > > > > > > > the
> > > > > > > > > > > > >> top
> > > > > > > > > > > > >> end of the range, especially since the proxy owns
> > > > sequence
> > > > > > > > numbers
> > > > > > > > > > > > (which
> > > > > > > > > > > > >> makes sense). I am curious if there is more that
> can
> > > be
> > > > > done
> > > > > > > if
> > > > > > > > > > > > >> deduplication is on the server side. However the
> > main
> > > > > minus
> > > > > > I
> > > > > > > > see
> > > > > > > > > of
> > > > > > > > > > > > >> server
> > > > > > > > > > > > >> side deduplication is that instead of running
> > > contingent
> > > > > on
> > > > > > > > there
> > > > > > > > > > > being
> > > > > > > > > > > > a
> > > > > > > > > > > > >> failed client request, it would have to
> run
> > > > every
> > > > > > > time a
> > > > > > > > > > write
> > > > > > > > > > > > >> happens.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > For a reliable dedup, we probably need
> > > > > fence-then-getLastDLSN
> > > > > > > > > > > operation -
> > > > > > > > > > > > > so it would guarantee that any non-completed
> requests
> > > > > issued
> > > > > > > > > > (lost-ack
> > > > > > > > > > > > > requests) before this fence-then-getLastDLSN
> > operation
> > > > will
> > > > > > be
> > > > > > > > > failed
> > > > > > > > > > > and
> > > > > > > > > > > > > they will never land at the log.
> > > > > > > > > > > > >
> > > > > > > > > > > > > the pseudo code would look like below -
> > > > > > > > > > > > >
> > > > > > > > > > > > > write(request) onFailure { t =>
> > > > > > > > > > > > >   if (t is a timeout exception) {  // lost-ack: outcome unknown
> > > > > > > > > > > > >     DLSN lastDLSN = fenceThenGetLastDLSN();
> > > > > > > > > > > > >     DLSN lastCheckpointedDLSN = ...;  // the client's own checkpoint
> > > > > > > > > > > > >     // scan [lastCheckpointedDLSN, lastDLSN]: if the request is found
> > > > > > > > > > > > >     // there, the write succeeded; otherwise it is safe to retry.
> > > > > > > > > > > > >   }
> > > > > > > > > > > > > }
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Just realized the idea is same as what Leigh raised
> in
> > > the
> > > > > > > previous
> > > > > > > > > > email
> > > > > > > > > > > > about 'epoch write'. Let me explain more about this
> > idea
> > > > > > (Leigh,
> > > > > > > > feel
> > > > > > > > > > > free
> > > > > > > > > > > > to jump in to fill up your idea).
> > > > > > > > > > > >
> > > > > > > > > > > > - when a log stream is owned, the proxy uses the last
> > > > > > transaction
> > > > > > > > id
> > > > > > > > > as
> > > > > > > > > > > the
> > > > > > > > > > > > epoch
> > > > > > > > > > > > - when a client connects (handshake with the proxy),
> it
> > > > will
> > > > > > get
> > > > > > > > the
> > > > > > > > > > > epoch
> > > > > > > > > > > > for the stream.
> > > > > > > > > > > > - the writes issued by this client will carry the
> epoch
> > > to
> > > > > the
> > > > > > > > proxy.
> > > > > > > > > > > > - add a new rpc - fenceThenGetLastDLSN - it would
> force
> > > the
> > > > > > proxy
> > > > > > > > to
> > > > > > > > > > bump
> > > > > > > > > > > > the epoch.
> > > > > > > > > > > > - if fenceThenGetLastDLSN happened, all the
> outstanding
> > > > > writes
> > > > > > > with
> > > > > > > > > old
> > > > > > > > > > > > epoch will be rejected with exceptions (e.g.
> > > EpochFenced).
> > > > > > > > > > > > - The DLSN returned from fenceThenGetLastDLSN can be
> > used
> > > > as
> > > > > > the
> > > > > > > > > bound
> > > > > > > > > > > for
> > > > > > > > > > > > deduplications on failures.
> > > > > > > > > > > >
> > > > > > > > > > > > Cameron, does this sound a solution to your use case?
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> Maybe something that could fit a similar need that
> > > Kafka
> > > > > > does
> > > > > > > > (the
> > > > > > > > > > > last
> > > > > > > > > > > > >> stored value for a particular key in a log), such
> > that
> > > > on a
> > > > > > per
> > > > > > > > key
> > > > > > > > > > > basis
> > > > > > > > > > > > >> there could be a sequence number that supports
> > > > > deduplication?
> > > > > > > > Cost
> > > > > > > > > > > seems
> > > > > > > > > > > > >> like it would be high however, and I'm not even
> sure
> > > if
> > > > > > > > bookkeeper
> > > > > > > > > > > > >> supports
> > > > > > > > > > > > >> it.
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >> Cheers,
> > > > > > > > > > > > >> Cameron
> > > > > > > > > > > > >>
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> > > Thanks,
> > > > > > > > > > > > >> > > Cameron
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> > > On Sat, Oct 8, 2016 at 7:35 AM, Leigh Stewart
> > > > > > > > > > > > >> > <lstewart@twitter.com.invalid
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > > wrote:
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> > > > Cameron:
> > > > > > > > > > > > >> > > > Another thing we've discussed but haven't
> > really
> > > > > > thought
> > > > > > > > > > > through -
> > > > > > > > > > > > >> > > > We might be able to support some kind of
> epoch
> > > > write
> > > > > > > > > request,
> > > > > > > > > > > > where
> > > > > > > > > > > > >> the
> > > > > > > > > > > > >> > > > epoch is guaranteed to have changed if the
> > > writer
> > > > > has
> > > > > > > > > changed
> > > > > > > > > > or
> > > > > > > > > > > > the
> > > > > > > > > > > > >> > > ledger
> > > > > > > > > > > > >> > > > was ever fenced off. Writes include an epoch
> > and
> > > > are
> > > > > > > > > rejected
> > > > > > > > > > if
> > > > > > > > > > > > the
> > > > > > > > > > > > >> > > epoch
> > > > > > > > > > > > >> > > > has changed.
> > > > > > > > > > > > >> > > > With a mechanism like this, fencing the
> ledger
> > > off
> > > > > > > after a
> > > > > > > > > > > failure
> > > > > > > > > > > > >> > would
> > > > > > > > > > > > >> > > > ensure any pending writes had either been
> > > written
> > > > or
> > > > > > > would
> > > > > > > > > be
> > > > > > > > > > > > >> rejected.
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > > > On Sat, Oct 8, 2016 at 7:10 AM, Sijie Guo <
> > > > > > > > sijie@apache.org
> > > > > > > > > >
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > > > > Cameron,
> > > > > > > > > > > > >> > > > >
> > > > > > > > > > > > >> > > > > I think both Leigh and Xi had made a few
> > good
> > > > > points
> > > > > > > > about
> > > > > > > > > > > your
> > > > > > > > > > > > >> > > question.
> > > > > > > > > > > > >> > > > >
> > > > > > > > > > > > >> > > > > To add one more point to your question -
> > "but
> > > I
> > > > am
> > > > > > not
> > > > > > > > > > > > >> > > > > 100% of how all of the futures in the code
> > > > handle
> > > > > > > > > failures.
> > > > > > > > > > > > >> > > > > If not, where in the code would be the
> > > relevant
> > > > > > places
> > > > > > > > to
> > > > > > > > > > add
> > > > > > > > > > > > the
> > > > > > > > > > > > >> > > ability
> > > > > > > > > > > > >> > > > > to do this, and would the project be
> > > interested
> > > > > in a
> > > > > > > > pull
> > > > > > > > > > > > >> request?"
> > > > > > > > > > > > >> > > > >
> > > > > > > > > > > > >> > > > > The current proxy and client logic doesn't
> > do
> > > > > > > perfectly
> > > > > > > > on
> > > > > > > > > > > > >> handling
> > > > > > > > > > > > >> > > > > failures (duplicates) - the strategy now
> is
> > > the
> > > > > > client
> > > > > > > > > will
> > > > > > > > > > > > retry
> > > > > > > > > > > > >> as
> > > > > > > > > > > > >> > > best
> > > > > > > > > > > > >> > > > > at it can before throwing exceptions to
> > users.
> > > > The
> > > > > > > code
> > > > > > > > > you
> > > > > > > > > > > are
> > > > > > > > > > > > >> > looking
> > > > > > > > > > > > >> > > > for
> > > > > > > > > > > > >> > > > > - it is on BKLogSegmentWriter for the
> proxy
> > > > > handling
> > > > > > > > > writes
> > > > > > > > > > > and
> > > > > > > > > > > > >> it is
> > > > > > > > > > > > >> > > on
> > > > > > > > > > > > >> > > > > DistributedLogClientImpl for the proxy
> > client
> > > > > > handling
> > > > > > > > > > > responses
> > > > > > > > > > > > >> from
> > > > > > > > > > > > >> > > > > proxies. Does this help you?
> > > > > > > > > > > > >> > > > >
> > > > > > > > > > > > >> > > > > And also, you are welcome to contribute
> the
> > > pull
> > > > > > > > requests.
> > > > > > > > > > > > >> > > > >
> > > > > > > > > > > > >> > > > > - Sijie
> > > > > > > > > > > > >> > > > >
> > > > > > > > > > > > >> > > > >
> > > > > > > > > > > > >> > > > >
> > > > > > > > > > > > >> > > > > On Tue, Oct 4, 2016 at 3:39 PM, Cameron
> > > > Hatfield <
> > > > > > > > > > > > >> kinguy@gmail.com>
> > > > > > > > > > > > >> > > > wrote:
> > > > > > > > > > > > >> > > > >
> > > > > > > > > > > > >> > > > > > I have a question about the Proxy
> Client.
> > > > > > Basically,
> > > > > > > > for
> > > > > > > > > > our
> > > > > > > > > > > > use
> > > > > > > > > > > > >> > > cases,
> > > > > > > > > > > > >> > > > > we
> > > > > > > > > > > > >> > > > > > want to guarantee ordering at the key
> > level,
> > > > > > > > > irrespective
> > > > > > > > > > of
> > > > > > > > > > > > the
> > > > > > > > > > > > >> > > > ordering
> > > > > > > > > > > > >> > > > > > of the partition it may be assigned to
> as
> > a
> > > > > whole.
> > > > > > > Due
> > > > > > > > > to
> > > > > > > > > > > the
> > > > > > > > > > > > >> > source
> > > > > > > > > > > > >> > > of
> > > > > > > > > > > > >> > > > > the
> > > > > > > > > > > > >> > > > > > data (HBase Replication), we cannot
> > > guarantee
> > > > > > that a
> > > > > > > > > > single
> > > > > > > > > > > > >> > partition
> > > > > > > > > > > > >> > > > > will
> > > > > > > > > > > > >> > > > > > be owned for writes by the same client.
> > This
> > > > > means
> > > > > > > the
> > > > > > > > > > proxy
> > > > > > > > > > > > >> client
> > > > > > > > > > > > >> > > > works
> > > > > > > > > > > > >> > > > > > well (since we don't care which proxy
> owns
> > > the
> > > > > > > > partition
> > > > > > > > > > we
> > > > > > > > > > > > are
> > > > > > > > > > > > >> > > writing
> > > > > > > > > > > > >> > > > > > to).
> > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > >> > > > > > However, the guarantees we need when
> > > writing a
> > > > > > batch
> > > > > > > > > > > consists
> > > > > > > > > > > > >> of:
> > > > > > > > > > > > >> > > > > > Definition of a Batch: The set of
> records
> > > sent
> > > > > to
> > > > > > > the
> > > > > > > > > > > > writeBatch
> > > > > > > > > > > > >> > > > endpoint
> > > > > > > > > > > > >> > > > > > on the proxy
> > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > >> > > > > > 1. Batch success: If the client
> receives a
> > > > > success
> > > > > > > > from
> > > > > > > > > > the
> > > > > > > > > > > > >> proxy,
> > > > > > > > > > > > >> > > then
> > > > > > > > > > > > >> > > > > > that batch is successfully written
> > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > >> > > > > > 2. Inter-Batch ordering : Once a batch
> has
> > > > been
> > > > > > > > written
> > > > > > > > > > > > >> > successfully
> > > > > > > > > > > > >> > > by
> > > > > > > > > > > > >> > > > > the
> > > > > > > > > > > > >> > > > > > client, when another batch is written,
> it
> > > will
> > > > > be
> > > > > > > > > > guaranteed
> > > > > > > > > > > > to
> > > > > > > > > > > > >> be
> > > > > > > > > > > > >> > > > > ordered
> > > > > > > > > > > > >> > > > > > after the last batch (if it is the same
> > > > stream).
> > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > >> > > > > > 3. Intra-Batch ordering: Within a batch
> of
> > > > > writes,
> > > > > > > the
> > > > > > > > > > > records
> > > > > > > > > > > > >> will
> > > > > > > > > > > > >> > > be
> > > > > > > > > > > > >> > > > > > committed in order
> > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > >> > > > > > 4. Intra-Batch failure ordering: If an
> > > > > individual
> > > > > > > > record
> > > > > > > > > > > fails
> > > > > > > > > > > > >> to
> > > > > > > > > > > > >> > > write
> > > > > > > > > > > > >> > > > > > within a batch, all records after that
> > > record
> > > > > will
> > > > > > > not
> > > > > > > > > be
> > > > > > > > > > > > >> written.
> > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > >> > > > > > 5. Batch Commit: Guarantee that if a
> batch
> > > > > > returns a
> > > > > > > > > > > success,
> > > > > > > > > > > > it
> > > > > > > > > > > > >> > will
> > > > > > > > > > > > >> > > > be
> > > > > > > > > > > > >> > > > > > written
> > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > >> > > > > > 6. Read-after-write: Once a batch is
> > > > committed,
> > > > > > > > within a
> > > > > > > > > > > > limited
> > > > > > > > > > > > >> > > > > time-frame
> > > > > > > > > > > > >> > > > > > it will be able to be read. This is
> > required
> > > > in
> > > > > > the
> > > > > > > > case
> > > > > > > > > > of
> > > > > > > > > > > > >> > failure,
> > > > > > > > > > > > >> > > so
> > > > > > > > > > > > >> > > > > > that the client can see what actually
> got
> > > > > > > committed. I
> > > > > > > > > > > believe
> > > > > > > > > > > > >> the
> > > > > > > > > > > > >> > > > > > time-frame part could be removed if the
> > > client
> > > > > can
> > > > > > > > send
> > > > > > > > > in
> > > > > > > > > > > the
> > > > > > > > > > > > >> same
> > > > > > > > > > > > >> > > > > > sequence number that was written
> > previously,
> > > > > since
> > > > > > > it
> > > > > > > > > > would
> > > > > > > > > > > > then
> > > > > > > > > > > > >> > fail
> > > > > > > > > > > > >> > > > and
> > > > > > > > > > > > >> > > > > > we would know that a read needs to
> occur.
> > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > >> > > > > > So, my basic question is if this is
> > > currently
> > > > > > > possible
> > > > > > > > > in
> > > > > > > > > > > the
> > > > > > > > > > > > >> > proxy?
> > > > > > > > > > > > >> > > I
> > > > > > > > > > > > >> > > > > > don't believe it gives these guarantees
> as
> > > it
> > > > > > stands
> > > > > > > > > > today,
> > > > > > > > > > > > but
> > > > > > > > > > > > >> I
> > > > > > > > > > > > >> > am
> > > > > > > > > > > > >> > > > not
> > > > > > > > > > > > >> > > > > > 100% of how all of the futures in the
> code
> > > > > handle
> > > > > > > > > > failures.
> > > > > > > > > > > > >> > > > > > If not, where in the code would be the
> > > > relevant
> > > > > > > places
> > > > > > > > > to
> > > > > > > > > > > add
> > > > > > > > > > > > >> the
> > > > > > > > > > > > >> > > > ability
> > > > > > > > > > > > >> > > > > > to do this, and would the project be
> > > > interested
> > > > > > in a
> > > > > > > > > pull
> > > > > > > > > > > > >> request?
> > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > >> > > > > > Thanks,
> > > > > > > > > > > > >> > > > > > Cameron
> > > > > > > > > > > > >> > > > > >
> > > > > > > > > > > > >> > > > >
> > > > > > > > > > > > >> > > >
> > > > > > > > > > > > >> > >
> > > > > > > > > > > > >> >
> > > > > > > > > > > > >>
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Proxy Client - Batch Ordering / Commit

Posted by Leigh Stewart <ls...@twitter.com.INVALID>.
Great discussion here :)

> Have you started any work for this? I just updated the proposal page -
> https://cwiki.apache.org/confluence/display/DL/DP-2+-+Epoch+Write+Support
> Maybe we can work together with this.


Looks good, Xi.

If I understand the proposal correctly, this gets us, in the thin client:
a) ability to achieve a very strict form of consistency, supporting e.g.
exactly-once updates
b) exclusive ownership

I think there may also be a need for some kind of large atomic update
functionality, which is a different way of looking at
consistent/transactional updates.

Cameron, earlier Sijie asked if you need > 1MB writes - is that a
requirement for you? If so, epoch write may not meet all of your
requirements above.

We actually have a fairly urgent business need to support something like
this. Our two use cases are:
1. apply a set of logically separate writes as an atomic batch (we want
them all to be visible at the same time, or not at all). It might be useful
to be able to aggregate updates and apply them at a time of our choosing.
2. large writes: a DL write currently has a 1MB limit. We will need to apply
updates which are larger than 1MB atomically.
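
To make use case 1 a bit more concrete, here is a very rough sketch of the
kind of client-facing API we have in mind. To be clear, nothing like this
exists in the proxy client today - TransactionalLogClient, LogTransaction and
all of the method names below are hypothetical placeholders:

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// hypothetical API sketch - staged records become visible all at once on commit
interface TransactionalLogClient {
    LogTransaction beginTransaction(String stream);
}

interface LogTransaction {
    void write(ByteBuffer record);     // stage a record, not yet visible to readers
    void commit() throws Exception;    // atomically expose all staged records
    void abort();                      // drop all staged records
}

class AtomicBatchExample {
    static void applyBatch(TransactionalLogClient client,
                           Iterable<String> updates) throws Exception {
        LogTransaction txn = client.beginTransaction("my-stream");
        try {
            for (String update : updates) {
                txn.write(ByteBuffer.wrap(update.getBytes(StandardCharsets.UTF_8)));
            }
            // readers see all of the updates at the same time, or none of them
            txn.commit();
        } catch (Exception e) {
            txn.abort();
            throw e;
        }
    }
}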

On Fri, Nov 18, 2016 at 2:37 PM, Sijie Guo <si...@apache.org> wrote:

> On Thu, Nov 17, 2016 at 2:30 AM, Xi Liu <xi...@gmail.com> wrote:
>
> > Cameron,
> >
> > Thank you for your comments. It's very helpful. My replies are inline.
> >
> > On Wed, Nov 16, 2016 at 11:59 AM, Cameron Hatfield <ki...@gmail.com>
> > wrote:
> >
> > > "A couple of questions" is what I originally wrote, and then the
> > following
> > > happened. Sorry about the large swath of them, making sure my
> > understanding
> > > of the code base, as well as the DL/Bookkeeper/ZK ecosystem
> interaction,
> > > makes sense.
> > >
> > > ==General:
> > > What is an exclusive session? What is it providing over a regular
> > session?
> > >
> >
> > The idea here is to provide exclusive writer semantics for the
> > distributedlog (thin) client use case.
> > So it can have similar behavior to using the storage (distributedlog-core)
> > library directly.
> >
>
> +1 for an exclusive writer feature. Leigh and I were talking about this
> before. I am glad to see it is happening now. However, it might be worth
> separating the exclusive writer feature into a separate task once we have
> the 'fencing' feature available.
>
>
> >
> >
> > >
> > >
> > > ==Proxy side:
> > > Should a new streamop be added for the fencing operation, or does it
> make
> > > sense to piggyback on an existing one (such as write)?
> >
> >
> > > ====getLastDLSN:
> > > What should be the return result for:
> > > A new stream
> > > A new session, after successful fencing
> > > A new session, after a change in ownership / first starting up
> > >
> > > What is the main use case for getLastDLSN(<stream>, false)? Is this to
> > > guarantee that the recovery process has happened in case of ownership
> > > failure (I don't have a good understanding of what causes the recovery
> > > process to happen, especially from the reader side)? Or is it to handle
> > the
> > > lost-ack problem? Since all the rest of the read related things go
> > through
> > > the read client, I'm not sure if I see the use case, but it seems like
> > > there would be a large potential for confusion on which to use. What
> > about
> > > just a fenceSession op, that always fences, returning the DLSN of the
> > > fence, and leave the normal getLastDLSN for the regular read client.
> > >
> >
> >
> > Ah, your point is valid. I was following the style of bookkeeper. The
> > bookkeeper client supplies a fencing flag on readLastAddConfirmed request
> > during recovery.
> >
> > But you are right. It is clearer to have just a fenceSessionOp and return
> > the DLSN of the fence request.
> >
>
>
> Correct. In bookkeeper client, the fencing flag is set with a read op.
> However it is part of the recovery procedure and internal to the client. I
> agree with Cameron that we should hide the details from public usage. Otherwise,
> it will cause confusion.
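>
> To make sure we are on the same page, here is a minimal sketch of the op
> shape (just an illustration - the interface and method names below are
> placeholders, not an existing API; DLSN is the existing DistributedLog DLSN
> class):
>
> interface SessionOps {
>     // bumps (invalidates) the current session and returns the DLSN of the
>     // fence control record; a lost-ack write issued under the old session
>     // either landed at or before this DLSN, or will never land in the log.
>     DLSN fenceSession(String stream);
> }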
>
>
> >
> >
> >
> > >
> > > ====Fencing:
> > > When a fence session occurs, what call needs to be made to make sure
> any
> > > outstanding writes are flushed and committed (so that we guarantee the
> > > client will be able to read anything that was in the write queue)?
> > > Is there a guaranteed ordering for things written in the future queue
> for
> > > AsyncLogWriter (I'm not quite confident that I was able to accurately
> > > follow the logic, as their are many parts of the code that write, have
> > > queues, heartbeat, etc)?
> > >
> >
> > I believe that when calling AsyncLogWriter asyncWrite in order, the
> > records will be written in order. Sijie or Leigh can confirm that.
> >
>
> That's right. The writes are guaranteed to be written in the order in which
> they are issued.
>
>
> >
> > Since the writes are in order, when a fence session occurs, the control
> > record written successfully by the writer will guarantee the writes
> called
> > before the fence session write are flushed and committed.
> >
> > We need to invalidate the session when we fence the session. So any writes
> > with the old session that come after the control record is written will be
> > rejected. In this way, it can guarantee the client will be able to read
> > anything in a consistent way.
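> >
> > To illustrate the ordering argument, here is a small sketch (the names are
> > simplified placeholders; the only assumption is that the writer persists
> > records in the order the write calls are issued):
> >
> > import java.util.concurrent.CompletableFuture;
> >
> > interface OrderedWriter {
> >     CompletableFuture<Long> write(byte[] record);  // completes in issue order
> > }
> >
> > class FenceOrderingSketch {
> >     static CompletableFuture<Long> fence(OrderedWriter writer,
> >                                          byte[] pendingUserRecord,
> >                                          byte[] fenceControlRecord) {
> >         CompletableFuture<Long> user = writer.write(pendingUserRecord);
> >         CompletableFuture<Long> fenced = writer.write(fenceControlRecord);
> >         // once `fenced` completes, `user` has also been flushed and
> >         // committed, because it was issued before the control record;
> >         // any later write carrying the old session id is rejected.
> >         return fenced;
> >     }
> > }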
> >
> >
> > >
> > > ====SessionID:
> > > What is the default sessionid / transactionid for a new stream? I
> assume
> > > this would just be the first control record
> > >
> >
> > The initial session id will be the transaction id of the first control
> > record.
> >
> >
> > >
> > > ======Should all streams have a sessionid by default, regardless of
> > > whether it is ever used by a client (aka, every time ownership changes, a
> > > new control record is generated, and a sessionid is stored)?
> > > Main edge case that would have to be handled is if a client writes with
> > an
> > > old sessionid, but the owner has changed and has yet to create a
> > sessionid.
> > > This should be handled by the "non-matching sessionid" rule, since the
> > > invalid sessionid wouldn't match the passed sessionid, which should
> cause
> > > the client to get a new sessionid.
> > >
> >
> > I think all streams should just have a session id by default. The session
> > id is changed when ownership is changed or it is explicitly bumped by a
> > fence session op.
> >
> >
> > >
> > > ======Where in the code does it make sense to own the session, the
> stream
> > > interfaces / classes? Should they pass that information down to the
> ops,
> > or
> > > do the sessionid check within?
> > > My first thought would be Stream owns the sessionid, passes it into the
> > ops
> > > (as either a nullable value, or an invalid default value), which then
> do
> > > the sessionid check if they care. The main issue is updating the
> > sessionid
> > > is a bit backwards, as either every op has the ability to update it
> > through
> > > some type of return value / direct stream access / etc, or there is a
> > > special case in the stream for the fence operation / any other
> operation
> > > that can update the session.
> > >
> >
> > My thought is to add the session id in Stream (StreamImpl.java). The stream
> > validates the session id before submitting a stream op. If it is a fence
> > session op, it would just invalidate the current session, so the subsequent
> > requests with the old session will be rejected.
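> >
> > A rough sketch of what that validation could look like (the field and
> > method names here are made up for illustration, this is not the actual
> > StreamImpl code):
> >
> > class StreamSessionSketch {
> >     private volatile long currentSessionId;
> >
> >     // checked before a stream op is submitted
> >     boolean isValid(long requestSessionId) {
> >         return requestSessionId == currentSessionId;
> >     }
> >
> >     // a fence-session op just installs a new session id, so subsequent
> >     // requests carrying the old session id fail the check above
> >     long fenceSession(long newSessionId) {
> >         this.currentSessionId = newSessionId;
> >         return newSessionId;
> >     }
> > }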
> >
>
> >
> > >
> > > ======For "the owner of the log stream will first advance the
> transaction
> > > id generator to claim a new transaction id and write a control record
> to
> > > the log stream. ":
> > > Should "DistributedLogConstants.CONTROL_RECORD_CONTENT" be the type of
> > > control record written?
> > > Should the "writeControlRecord" on the BKAsyncLogWriter be exposed in
> the
> > > AsyncLogWriter interface be exposed?  Or even in the one within the
> > segment
> > > writer? Or should the code be duplicated / pulled out into a helper /
> > etc?
> > > (Not a big Java person, so any suggestions on the "Java Way", or at
> least
> > > the DL way, to do it would be appreciated)
> > >
> >
> > I believe we can just construct a log record and set it to be a control
> > record and write it.
> >
> > LogRecord record = ...
> > record.setControl();
> > writer.write(record);
> >
> > (Can anyone from the community confirm that it is okay to write a control
> > record in this way?)
> >
>
> Ideally we would like to hide the logic from public usage. However, since
> it is a code change on the proxy side, it is absolutely fine.
>
>
> >
> >
> >
> > >
> > > ======Transaction ID:
> > > The BKLogSegmentWriter ignores the transaction ids from control records
> > > when it records the "LastTXId." Would that be an issue here for
> anything?
> > > It looks like it may do that because it assumes you're calling its local
> > > function for writing a control record, which uses the lastTxId.
> > >
> >
> > I think the lastTxId is for the last transaction id of user records. So we
> > probably don't change the behavior on how we record the lastTxId. However
> > we can change how we fetch the last tx id for id generation after
> > recovery.
> >
>
>
> getLastTxId should have a flag to include control records or not.
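>
> Something along these lines (just a sketch of the proposed signature, not
> the existing API):
>
> interface LogMetadataReader {
>     // when includeControlRecords is true, the returned transaction id also
>     // reflects control records (e.g. the session/fence control record),
>     // which is what session id generation after recovery needs.
>     long getLastTxId(boolean includeControlRecords);
> }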
>
>
> >
> >
> > >
> > >
> > > ==Thrift Interface:
> > > ====Should the write response be split out for different calls?
> > > It seems odd to have a single struct with many optional items that are
> > > filled depending on the call made for every rpc call. This is mostly a
> > > curiosity question, since I assume it comes from the general practices
> > from
> > > using thrift for a while. Would it at least make sense for the
> > > getLastDLSN/fence endpoint to have a new struct?
> > >
> >
> > I don't have any preference here. It might make sense to have a new
> struct.
> >
> >
> > >
> > > ====Any particular error code that makes sense for session fenced? If
> we
> > > want to be close to the HTTP errors, looks like 412 (PRECONDITION
> FAILED)
> > > might make the most sense, if a bit generic.
> > >
> > > 412 def:
> > > "The precondition given in one or more of the request-header fields
> > > evaluated to false when it was tested on the server. This response code
> > > allows the client to place preconditions on the current resource
> > > metainformation (header field data) and thus prevent the requested
> method
> > > from being applied to a resource other than the one intended."
> > >
> > > 412 excerpt from the If-match doc:
> > > "This behavior is most useful when the client wants to prevent an
> > updating
> > > method, such as PUT, from modifying a resource that has changed since
> the
> > > client last retrieved it."
> > >
> >
> > I liked your idea.
> >
> >
> > >
> > > ====Should we return the sessionid to the client in the "fencesession"
> > > calls?
> > > Seems like it may be useful when you fence, especially if you have some
> > > type of custom sequencer where it would make sense for search, or for
> > > debugging.
> > > Main minus is that it would be easy for users to create an implicit
> > > requirement that the sessionid is forever a valid transactionid, which
> > may
> > > not always be the case long term for the project.
> > >
> >
> > I think we should probably start with not returning it, and add it if we
> > really need it.
> >
> >
> > >
> > >
> > > ==Client:
> > > ====What is the proposed process for the client retrieving the new
> > > sessionid?
> > > A full reconnect? No special case code, but intrusive on the client
> side,
> > > and possibly expensive garbage/processing wise. (though this type of
> > > failure should hopefully be rare enough to not be a problem)
> > > A call to reset the sessionid? Less intrusive, all the issues you get
> > with
> > > mutable object methods that need to be called in a certain order, edge
> > > cases such as outstanding/buffered requests to the old stream, etc.
> > > The call could also return the new sessionid, making it a good call for
> > > storing or debugging the value.
> > >
> >
> > It is probably not good to tie a session to a connection, especially since
> > the underlying communication is an RPC framework.
> >
> > I think the new session id can be piggybacked with the rejected response.
> > The client doesn't have to explicitly retrieve a new session id.
> >
>
> +1 for separating session from connection. Especially the 'session' here is
> more a lifecycle concept for a stream.
>
>
>
> >
> >
> > >
> > > ====Session Fenced failure:
> > > Will this put the client into a failure state, stopping all future
> writes
> > > until fixed?
> > >
> >
> > My thought is the new session id will be piggybacked with the fence
> > response, so the client will know the new session id and all the future
> > writes will just carry the new session id.
> >
> >
> > > Is it even possible to get this error when ownership changes? The
> > > connection to the new owner should get a new sessionid on connect, so I
> > > would expect not.
> > >
> >
> > I think it will. But the client should handle it and retry, as the session
> > fenced exception indicates that the write was never attempted.
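> >
> > On the client side the handling could look roughly like this (a sketch
> > only - ProxyClient, SessionFencedException and newSessionId() are
> > placeholders for whatever we end up exposing; the new session id is
> > assumed to be piggybacked on the rejected response as described above):
> >
> > class SessionAwareWriterSketch {
> >     interface ProxyClient {
> >         long write(String stream, byte[] record, long sessionId)
> >             throws SessionFencedException;
> >     }
> >
> >     static class SessionFencedException extends Exception {
> >         private final long newSessionId;
> >         SessionFencedException(long newSessionId) { this.newSessionId = newSessionId; }
> >         long newSessionId() { return newSessionId; }
> >     }
> >
> >     private final ProxyClient client;
> >     private volatile long sessionId;
> >
> >     SessionAwareWriterSketch(ProxyClient client, long initialSessionId) {
> >         this.client = client;
> >         this.sessionId = initialSessionId;
> >     }
> >
> >     long write(String stream, byte[] record) throws SessionFencedException {
> >         try {
> >             return client.write(stream, record, sessionId);
> >         } catch (SessionFencedException sfe) {
> >             // safe to retry: a fenced write was never attempted on the log,
> >             // and the rejection carries the new session id
> >             sessionId = sfe.newSessionId();
> >             return client.write(stream, record, sessionId);
> >         }
> >     }
> > }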
> >
> >
> > >
> > > Cheers,
> > > Cameron
> > >
> > > On Tue, Nov 15, 2016 at 2:01 AM, Xi Liu <xi...@gmail.com> wrote:
> > >
> > > > Thank you, Cameron. Look forward to your comments.
> > > >
> > > > - Xi
> > > >
> > > > On Sun, Nov 13, 2016 at 1:21 PM, Cameron Hatfield <ki...@gmail.com>
> > > > wrote:
> > > >
> > > > > Sorry, I've been on vacation for the past week, and heads down for
> a
> > > > > release that is using DL at the end of Nov. I'll take a look at
> this
> > > over
> > > > > the next week, and add any relevant comments. After we are finished
> > > with
> > > > > dev for this release, I am hoping to tackle this next.
> > > > >
> > > > > -Cameron
> > > > >
> > > > > On Fri, Nov 11, 2016 at 12:07 PM, Sijie Guo <si...@apache.org>
> > wrote:
> > > > >
> > > > > > Xi,
> > > > > >
> > > > > > Thank you so much for your proposal. I took a look. It looks fine
> > to
> > > > me.
> > > > > > Cameron, do you have any comments?
> > > > > >
> > > > > > Look forward to your pull requests.
> > > > > >
> > > > > > - Sijie
> > > > > >
> > > > > >
> > > > > > On Wed, Nov 9, 2016 at 2:34 AM, Xi Liu <xi...@gmail.com>
> > wrote:
> > > > > >
> > > > > > > Cameron,
> > > > > > >
> > > > > > > Have you started any work for this? I just updated the proposal
> > > page
> > > > -
> > > > > > > https://cwiki.apache.org/confluence/display/DL/DP-2+-+
> > > > > > Epoch+Write+Support
> > > > > > > Maybe we can work together with this.
> > > > > > >
> > > > > > > Sijie, Leigh,
> > > > > > >
> > > > > > > can you guys help review this to make sure our proposal is in
> the
> > > > right
> > > > > > > direction?
> > > > > > >
> > > > > > > - Xi
> > > > > > >
> > > > > > > On Tue, Nov 1, 2016 at 3:05 AM, Sijie Guo <si...@apache.org>
> > > wrote:
> > > > > > >
> > > > > > > > I created https://issues.apache.org/jira/browse/DL-63 for
> > > tracking
> > > > > the
> > > > > > > > proposed idea here.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Wed, Oct 26, 2016 at 4:53 PM, Sijie Guo
> > > > > <sijieg@twitter.com.invalid
> > > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > On Tue, Oct 25, 2016 at 11:30 AM, Cameron Hatfield <
> > > > > kinguy@gmail.com
> > > > > > >
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Yes, we are reading the HBase WAL (from their replication
> > > > plugin
> > > > > > > > > support),
> > > > > > > > > > and writing that into DL.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Gotcha.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > From the sounds of it, yes, it would. Only thing I would
> > say
> > > is
> > > > > > make
> > > > > > > > the
> > > > > > > > > > epoch requirement optional, so that if I client doesn't
> > care
> > > > > about
> > > > > > > > dupes
> > > > > > > > > > they don't have to deal with the process of getting a new
> > > > epoch.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Yup. This should be optional. I can start a wiki page on
> how
> > we
> > > > > want
> > > > > > to
> > > > > > > > > implement this. Are you interested in contributing to this?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > -Cameron
> > > > > > > > > >
> > > > > > > > > > On Wed, Oct 19, 2016 at 7:43 PM, Sijie Guo
> > > > > > > <sijieg@twitter.com.invalid
> > > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > On Wed, Oct 19, 2016 at 7:17 PM, Sijie Guo <
> > > > sijieg@twitter.com
> > > > > >
> > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Monday, October 17, 2016, Cameron Hatfield <
> > > > > > kinguy@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > >> Answer inline:
> > > > > > > > > > > >>
> > > > > > > > > > > >> On Mon, Oct 17, 2016 at 11:46 AM, Sijie Guo <
> > > > > sijie@apache.org
> > > > > > >
> > > > > > > > > wrote:
> > > > > > > > > > > >>
> > > > > > > > > > > >> > Cameron,
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > Thank you for your summary. I liked the discussion
> > > > here. I
> > > > > > > also
> > > > > > > > > > liked
> > > > > > > > > > > >> the
> > > > > > > > > > > >> > summary of your requirement -
> > 'single-writer-per-key,
> > > > > > > > > > > >> > multiple-writers-per-log'. If I understand
> > correctly,
> > > > the
> > > > > > core
> > > > > > > > > > concern
> > > > > > > > > > > >> here
> > > > > > > > > > > >> > is almost 'exact-once' write (or a way to explicit
> > > tell
> > > > > if a
> > > > > > > > write
> > > > > > > > > > can
> > > > > > > > > > > >> be
> > > > > > > > > > > >> > retried or not).
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > Comments inline.
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > On Fri, Oct 14, 2016 at 11:17 AM, Cameron
> Hatfield <
> > > > > > > > > > kinguy@gmail.com>
> > > > > > > > > > > >> > wrote:
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > > > Ah- yes good point (to be clear we're not
> using
> > > the
> > > > > > proxy
> > > > > > > > this
> > > > > > > > > > way
> > > > > > > > > > > >> > > today).
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > > > > Due to the source of the
> > > > > > > > > > > >> > > > > data (HBase Replication), we cannot
> guarantee
> > > > that a
> > > > > > > > single
> > > > > > > > > > > >> partition
> > > > > > > > > > > >> > > will
> > > > > > > > > > > >> > > > > be owned for writes by the same client.
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > > > Do you mean you *need* to support multiple
> > writers
> > > > > > issuing
> > > > > > > > > > > >> interleaved
> > > > > > > > > > > >> > > > writes or is it just that they might sometimes
> > > > > > interleave
> > > > > > > > > writes
> > > > > > > > > > > and
> > > > > > > > > > > >> > you
> > > > > > > > > > > >> > > >don't care?
> > > > > > > > > > > >> > > How HBase partitions the keys being written
> > wouldn't
> > > > > have
> > > > > > a
> > > > > > > > > > one->one
> > > > > > > > > > > >> > > mapping with the partitions we would have in
> > HBase.
> > > > Even
> > > > > > if
> > > > > > > we
> > > > > > > > > did
> > > > > > > > > > > >> have
> > > > > > > > > > > >> > > that alignment when the cluster first started,
> > HBase
> > > > > will
> > > > > > > > > > rebalance
> > > > > > > > > > > >> what
> > > > > > > > > > > >> > > servers own what partitions, as well as split
> and
> > > > merge
> > > > > > > > > partitions
> > > > > > > > > > > >> that
> > > > > > > > > > > >> > > already exist, causing eventual drift from one
> log
> > > per
> > > > > > > > > partition.
> > > > > > > > > > > >> > > Because we want ordering guarantees per key (row
> > in
> > > > > > hbase),
> > > > > > > we
> > > > > > > > > > > >> partition
> > > > > > > > > > > >> > > the logs by the key. Since multiple writers are
> > > > possible
> > > > > > per
> > > > > > > > > range
> > > > > > > > > > > of
> > > > > > > > > > > >> > keys
> > > > > > > > > > > >> > > (due to the aforementioned rebalancing /
> > splitting /
> > > > etc
> > > > > > of
> > > > > > > > > > hbase),
> > > > > > > > > > > we
> > > > > > > > > > > >> > > cannot use the core library due to requiring a
> > > single
> > > > > > writer
> > > > > > > > for
> > > > > > > > > > > >> > ordering.
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > > But, for a single log, we don't really care
> about
> > > > > ordering
> > > > > > > > aside
> > > > > > > > > > > from
> > > > > > > > > > > >> at
> > > > > > > > > > > >> > > the per-key level. So all we really need to be
> > able
> > > to
> > > > > > > handle
> > > > > > > > is
> > > > > > > > > > > >> > preventing
> > > > > > > > > > > >> > > duplicates when a failure occurs, and ordering
> > > > > consistency
> > > > > > > > > across
> > > > > > > > > > > >> > requests
> > > > > > > > > > > >> > > from a single client.
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > > So our general requirements are:
> > > > > > > > > > > >> > > Write A, Write B
> > > > > > > > > > > >> > > Timeline: A -> B
> > > > > > > > > > > >> > > Request B is only made after A has successfully
> > > > returned
> > > > > > > > > (possibly
> > > > > > > > > > > >> after
> > > > > > > > > > > >> > > retries)
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > > 1) If the write succeeds, it will be durably
> > exposed
> > > > to
> > > > > > > > clients
> > > > > > > > > > > within
> > > > > > > > > > > >> > some
> > > > > > > > > > > >> > > bounded time frame
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > Guaranteed.
> > > > > > > > > > > >> >
> > > > > > > > > > > >>
> > > > > > > > > > > >> > > 2) If A succeeds and B succeeds, the ordering
> for
> > > the
> > > > > log
> > > > > > > will
> > > > > > > > > be
> > > > > > > > > > A
> > > > > > > > > > > >> and
> > > > > > > > > > > >> > > then B
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > If I understand correctly here, B is only sent
> > after A
> > > > is
> > > > > > > > > returned,
> > > > > > > > > > > >> right?
> > > > > > > > > > > >> > If that's the case, It is guaranteed.
> > > > > > > > > > > >>
> > > > > > > > > > > >>
> > > > > > > > > > > >>
> > > > > > > > > > > >> >
> > > > > > > > > > > >> >
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > > 3) If A fails due to an error that can be relied
> > on
> > > to
> > > > > > *not*
> > > > > > > > be
> > > > > > > > > a
> > > > > > > > > > > lost
> > > > > > > > > > > >> > ack
> > > > > > > > > > > >> > > problem, it will never be exposed to the client,
> > so
> > > it
> > > > > may
> > > > > > > > > > > (depending
> > > > > > > > > > > >> on
> > > > > > > > > > > >> > > the error) be retried immediately
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > If it is not a lost-ack problem, the entry will be
> > > > > exposed.
> > > > > > it
> > > > > > > > is
> > > > > > > > > > > >> > guaranteed.
> > > > > > > > > > > >>
> > > > > > > > > > > >> Let me try rephrasing the questions, to make sure
> I'm
> > > > > > > > understanding
> > > > > > > > > > > >> correctly:
> > > > > > > > > > > >> If A fails, with an error such as "Unable to create
> > > > > connection
> > > > > > > to
> > > > > > > > > > > >> bookkeeper server", that would be the type of error
> we
> > > > would
> > > > > > > > expect
> > > > > > > > > to
> > > > > > > > > > > be
> > > > > > > > > > > >> able to retry immediately, as that result means no
> > > action
> > > > > was
> > > > > > > > taken
> > > > > > > > > on
> > > > > > > > > > > any
> > > > > > > > > > > >> log / etc, so no entry could have been created. This
> > is
> > > > > > > different
> > > > > > > > > > then a
> > > > > > > > > > > >> "Connection Timeout" exception, as we just might not
> > > have
> > > > > > > gotten a
> > > > > > > > > > > >> response
> > > > > > > > > > > >> in time.
> > > > > > > > > > > >>
> > > > > > > > > > > >>
> > > > > > > > > > > > Gotcha.
> > > > > > > > > > > >
> > > > > > > > > > > > The response code returned from proxy can tell if a
> > > failure
> > > > > can
> > > > > > > be
> > > > > > > > > > > retried
> > > > > > > > > > > > safely or not. (We might need to make them well
> > > documented)
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >>
> > > > > > > > > > > >> >
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > > 4) If A fails due to an error that could be a
> lost
> > > ack
> > > > > > > problem
> > > > > > > > > > > >> (network
> > > > > > > > > > > >> > > connectivity / etc), within a bounded time frame
> > it
> > > > > should
> > > > > > > be
> > > > > > > > > > > >> possible to
> > > > > > > > > > > >> > > find out if the write succeed or failed. Either
> by
> > > > > reading
> > > > > > > > from
> > > > > > > > > > some
> > > > > > > > > > > >> > > checkpoint of the log for the changes that
> should
> > > have
> > > > > > been
> > > > > > > > made
> > > > > > > > > > or
> > > > > > > > > > > >> some
> > > > > > > > > > > >> > > other possible server-side support.
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > If I understand this correctly, it is a
> duplication
> > > > issue,
> > > > > > > > right?
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > Can a de-duplication solution work here? Either DL
> > or
> > > > your
> > > > > > > > client
> > > > > > > > > > does
> > > > > > > > > > > >> the
> > > > > > > > > > > >> > de-duplication?
> > > > > > > > > > > >> >
> > > > > > > > > > > >>
> > > > > > > > > > > >> The requirements I'm mentioning are the ones needed
> > for
> > > > > > > > client-side
> > > > > > > > > > > >> dedupping. Since if I can guarantee writes being
> > exposed
> > > > > > within
> > > > > > > > some
> > > > > > > > > > > time
> > > > > > > > > > > >> frame, and I can never get into an inconsistently
> > > ordered
> > > > > > state
> > > > > > > > when
> > > > > > > > > > > >> successes happen, when an error occurs, I can always
> > > wait
> > > > > for
> > > > > > > max
> > > > > > > > > time
> > > > > > > > > > > >> frame, read the latest writes, and then dedup
> locally
> > > > > against
> > > > > > > the
> > > > > > > > > > > request
> > > > > > > > > > > >> I
> > > > > > > > > > > >> just made.
> > > > > > > > > > > >>
> > > > > > > > > > > >> The main thing about that timeframe is that its
> > > basically
> > > > > the
> > > > > > > > > addition
> > > > > > > > > > > of
> > > > > > > > > > > >> every timeout, all the way down in the system,
> > combined
> > > > with
> > > > > > > > > whatever
> > > > > > > > > > > >> flushing / caching / etc times are at the
> bookkeeper /
> > > > > client
> > > > > > > > level
> > > > > > > > > > for
> > > > > > > > > > > >> when values are exposed
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Gotcha.
> > > > > > > > > > > >
> > > > > > > > > > > >>
> > > > > > > > > > > >>
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > Is there any ways to identify your write?
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > I can think of a case as follow - I want to know
> > what
> > > is
> > > > > > your
> > > > > > > > > > expected
> > > > > > > > > > > >> > behavior from the log.
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > a)
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > If a hbase region server A writes a change of key
> K
> > to
> > > > the
> > > > > > > log,
> > > > > > > > > the
> > > > > > > > > > > >> change
> > > > > > > > > > > >> > is successfully made to the log;
> > > > > > > > > > > >> > but server A is down before receiving the change.
> > > > > > > > > > > >> > region server B took over the region that contains
> > K,
> > > > what
> > > > > > > will
> > > > > > > > B
> > > > > > > > > > do?
> > > > > > > > > > > >> >
> > > > > > > > > > > >>
> > > > > > > > > > > >> HBase writes in large chunks (WAL Logs), which its
> > > > > replication
> > > > > > > > > system
> > > > > > > > > > > then
> > > > > > > > > > > >> handles by replaying in the case of failure. If I'm
> > in a
> > > > > > middle
> > > > > > > > of a
> > > > > > > > > > > log,
> > > > > > > > > > > >> and the whole region goes down and gets rescheduled
> > > > > > elsewhere, I
> > > > > > > > > will
> > > > > > > > > > > >> start
> > > > > > > > > > > >> back up from the beginning of the log I was in the
> > > middle
> > > > > of.
> > > > > > > > Using
> > > > > > > > > > > >> checkpointing + deduping, we should be able to find
> > out
> > > > > where
> > > > > > we
> > > > > > > > > left
> > > > > > > > > > > off
> > > > > > > > > > > >> in the log.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >> >
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > b) same as a). but server A was just network
> > > > partitioned.
> > > > > > will
> > > > > > > > > both
> > > > > > > > > > A
> > > > > > > > > > > >> and B
> > > > > > > > > > > >> > write the change of key K?
> > > > > > > > > > > >> >
> > > > > > > > > > > >>
> > > > > > > > > > > >> HBase gives us some guarantees around network
> > partitions
> > > > > > > > > (Consistency
> > > > > > > > > > > over
> > > > > > > > > > > >> availability for HBase). HBase is a single-master
> > > failover
> > > > > > > > recovery
> > > > > > > > > > type
> > > > > > > > > > > >> of
> > > > > > > > > > > >> system, with zookeeper-based guarantees for single
> > > owners
> > > > > > > > (writers)
> > > > > > > > > > of a
> > > > > > > > > > > >> range of data.
> > > > > > > > > > > >>
> > > > > > > > > > > >> >
> > > > > > > > > > > >> >
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > > 5) If A is turned into multiple batches (one
> large
> > > > > request
> > > > > > > > gets
> > > > > > > > > > > split
> > > > > > > > > > > >> > into
> > > > > > > > > > > >> > > multiple smaller ones to the bookkeeper backend,
> > due
> > > > to
> > > > > > log
> > > > > > > > > > rolling
> > > > > > > > > > > /
> > > > > > > > > > > >> > size
> > > > > > > > > > > >> > > / etc):
> > > > > > > > > > > >> > >   a) The ordering of entries within batches have
> > > > > ordering
> > > > > > > > > > > consistence
> > > > > > > > > > > >> > with
> > > > > > > > > > > >> > > the original request, when exposed in the log
> > > (though
> > > > > they
> > > > > > > may
> > > > > > > > > be
> > > > > > > > > > > >> > > interleaved with other requests)
> > > > > > > > > > > >> > >   b) The ordering across batches have ordering
> > > > > consistence
> > > > > > > > with
> > > > > > > > > > the
> > > > > > > > > > > >> > > original request, when exposed in the log
> (though
> > > they
> > > > > may
> > > > > > > be
> > > > > > > > > > > >> interleaved
> > > > > > > > > > > >> > > with other requests)
> > > > > > > > > > > >> > >   c) If a batch fails, and cannot be retried /
> is
> > > > > > > > unsuccessfully
> > > > > > > > > > > >> retried,
> > > > > > > > > > > >> > > all batches after the failed batch should not be
> > > > exposed
> > > > > > in
> > > > > > > > the
> > > > > > > > > > log.
> > > > > > > > > > > >> > Note:
> > > > > > > > > > > >> > > The batches before and including the failed
> batch,
> > > > that
> > > > > > > ended
> > > > > > > > up
> > > > > > > > > > > >> > > succeeding, can show up in the log, again within
> > > some
> > > > > > > bounded
> > > > > > > > > time
> > > > > > > > > > > >> range
> > > > > > > > > > > >> > > for reads by a client.
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > There is a method 'writeBulk' in
> > DistributedLogClient
> > > > can
> > > > > > > > achieve
> > > > > > > > > > this
> > > > > > > > > > > >> > guarantee.
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > However, I am not very sure about how will you
> turn
> > A
> > > > into
> > > > > > > > > batches.
> > > > > > > > > > If
> > > > > > > > > > > >> you
> > > > > > > > > > > >> > are dividing A into batches,
> > > > > > > > > > > >> > you can simply control the application write
> > sequence
> > > to
> > > > > > > achieve
> > > > > > > > > the
> > > > > > > > > > > >> > guarantee here.
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > Can you explain more about this?
> > > > > > > > > > > >> >
> > > > > > > > > > > >>
> > > > > > > > > > > >> In this case, by batches I mean what the proxy does
> > with
> > > > the
> > > > > > > > single
> > > > > > > > > > > >> request
> > > > > > > > > > > >> that I send it. If the proxy decides it needs to
> turn
> > my
> > > > > > single
> > > > > > > > > > request
> > > > > > > > > > > >> into multiple batches of requests, due to log
> rolling,
> > > > size
> > > > > > > > > > limitations,
> > > > > > > > > > > >> etc, those would be the guarantees I need to be able
> > to
> > > > > > > > deduplicate
> > > > > > > > > on
> > > > > > > > > > > the
> > > > > > > > > > > >> client side.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > A single record written by #write and A record set
> (set
> > > of
> > > > > > > records)
> > > > > > > > > > > > written by #writeRecordSet are atomic - they will not
> > be
> > > > > broken
> > > > > > > > down
> > > > > > > > > > into
> > > > > > > > > > > > entries (batches). With the correct response code,
> you
> > > > would
> > > > > be
> > > > > > > > able
> > > > > > > > > to
> > > > > > > > > > > > tell if it is a lost-ack failure or not. However
> there
> > > is a
> > > > > > size
> > > > > > > > > > > limitation
> > > > > > > > > > > > for this - it cannot go beyond 1MB in the current
> > > > > > implementation.
> > > > > > > > > > > >
> > > > > > > > > > > > What is your expected record size?
> > > > > > > > > > > >
> > > > > > > > > > > >
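To make the retry decision above concrete, here is a minimal sketch of a
proxy-client write that distinguishes a definitely-not-written failure (safe
to retry immediately) from a possible lost-ack failure (needs read-back and
dedup). It assumes the DistributedLogClient#write(stream, ByteBuffer) call
discussed in this thread; isDefinitelyNotWritten is a placeholder for the
documented response codes, not an existing API.

    import java.nio.ByteBuffer;

    import com.twitter.distributedlog.DLSN;
    import com.twitter.distributedlog.service.DistributedLogClient;
    import com.twitter.util.Await;

    final class RetryDecision {

        // Write one payload through the proxy client and decide how a failure
        // may be retried.
        DLSN writeOnce(DistributedLogClient client, String stream, byte[] payload)
                throws Exception {
            try {
                // Success: the record (or record set) was committed atomically.
                return Await.result(client.write(stream, ByteBuffer.wrap(payload)));
            } catch (Exception cause) {
                if (isDefinitelyNotWritten(cause)) {
                    // The request never reached the log, so an immediate retry
                    // cannot introduce a duplicate.
                    return writeOnce(client, stream, payload);
                }
                // Possible lost ack (e.g. timeout): the record may already be in
                // the log; the caller must read back and dedup before retrying.
                throw cause;
            }
        }

        private boolean isDefinitelyNotWritten(Exception cause) {
            // Placeholder: map the documented proxy response codes here.
            return false;
        }
    }
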
> > > > > > > > > > > >>
> > > > > > > > > > > >> >
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > > Since we can guarantee per-key ordering on the
> > > client
> > > > > > side,
> > > > > > > we
> > > > > > > > > > > >> guarantee
> > > > > > > > > > > >> > > that there is a single writer per-key, just not
> > per
> > > > log.
> > > > > > > > > > > >> >
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > Do you need fencing guarantee in the case of
> network
> > > > > > partition
> > > > > > > > > > causing
> > > > > > > > > > > >> > two-writers?
> > > > > > > > > > > >> >
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > > So if there was a
> > > > > > > > > > > >> > > way to guarantee a single write request as being
> > > > written
> > > > > > or
> > > > > > > > not,
> > > > > > > > > > > >> within a
> > > > > > > > > > > >> > > certain time frame (since failures should be
> rare
> > > > > anyways,
> > > > > > > > this
> > > > > > > > > is
> > > > > > > > > > > >> fine
> > > > > > > > > > > >> > if
> > > > > > > > > > > >> > > it is expensive), we can then have the client
> > > > guarantee
> > > > > > the
> > > > > > > > > > ordering
> > > > > > > > > > > >> it
> > > > > > > > > > > >> > > needs.
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > This sounds an 'exact-once' write (regarding
> > retries)
> > > > > > > > requirement
> > > > > > > > > to
> > > > > > > > > > > me,
> > > > > > > > > > > >> > right?
> > > > > > > > > > > >> >
> > > > > > > > > > > >> Yes. I'm curious of how this issue is handled by
> > > > Manhattan,
> > > > > > > since
> > > > > > > > > you
> > > > > > > > > > > can
> > > > > > > > > > > >> imagine a data store that ends up getting multiple
> > > writes
> > > > > for
> > > > > > > the
> > > > > > > > > same
> > > > > > > > > > > put
> > > > > > > > > > > >> / get / etc, would be harder to use, and we are
> > > basically
> > > > > > trying
> > > > > > > > to
> > > > > > > > > > > create
> > > > > > > > > > > >> a log like that for HBase.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Are you guys replacing HBase WAL?
> > > > > > > > > > > >
> > > > > > > > > > > > In Manhattan case, the request will be first written
> to
> > > DL
> > > > > > > streams
> > > > > > > > by
> > > > > > > > > > > > Manhattan coordinator. The Manhattan replica then
> will
> > > read
> > > > > > from
> > > > > > > > the
> > > > > > > > > DL
> > > > > > > > > > > > streams and apply the change. In the lost-ack case,
> the
> > > MH
> > > > > > > > > coordinator
> > > > > > > > > > > will
> > > > > > > > > > > > just fail the request to client.
> > > > > > > > > > > >
> > > > > > > > > > > > My feeling here is your usage for HBase is a bit
> > > different
> > > > > from
> > > > > > > how
> > > > > > > > > we
> > > > > > > > > > > use
> > > > > > > > > > > > DL in Manhattan. It sounds like you read from a
> source
> > > > (HBase
> > > > > > > WAL)
> > > > > > > > > and
> > > > > > > > > > > > write to DL. But I might be wrong.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >>
> > > > > > > > > > > >> >
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > > > Cameron:
> > > > > > > > > > > >> > > > Another thing we've discussed but haven't
> really
> > > > > thought
> > > > > > > > > > through -
> > > > > > > > > > > >> > > > We might be able to support some kind of epoch
> > > write
> > > > > > > > request,
> > > > > > > > > > > where
> > > > > > > > > > > >> the
> > > > > > > > > > > >> > > > epoch is guaranteed to have changed if the
> > writer
> > > > has
> > > > > > > > changed
> > > > > > > > > or
> > > > > > > > > > > the
> > > > > > > > > > > >> > > ledger
> > > > > > > > > > > >> > > > was ever fenced off. Writes include an epoch
> and
> > > are
> > > > > > > > rejected
> > > > > > > > > if
> > > > > > > > > > > the
> > > > > > > > > > > >> > > epoch
> > > > > > > > > > > >> > > > has changed.
> > > > > > > > > > > >> > > > With a mechanism like this, fencing the ledger
> > off
> > > > > > after a
> > > > > > > > > > failure
> > > > > > > > > > > >> > would
> > > > > > > > > > > >> > > > ensure any pending writes had either been
> > written
> > > or
> > > > > > would
> > > > > > > > be
> > > > > > > > > > > >> rejected.
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > > The issue would be how I guarantee the write I
> > wrote
> > > > to
> > > > > > the
> > > > > > > > > server
> > > > > > > > > > > was
> > > > > > > > > > > >> > > written. Since a network issue could happen on
> the
> > > > send
> > > > > of
> > > > > > > the
> > > > > > > > > > > >> request,
> > > > > > > > > > > >> > or
> > > > > > > > > > > >> > > on the receive of the success response, an epoch
> > > > > wouldn't
> > > > > > > tell
> > > > > > > > > me
> > > > > > > > > > > if I
> > > > > > > > > > > >> > can
> > > > > > > > > > > >> > > successfully retry, as it could be successfully
> > > > written
> > > > > > but
> > > > > > > > AWS
> > > > > > > > > > > >> dropped
> > > > > > > > > > > >> > the
> > > > > > > > > > > >> > > connection for the success response. Since the
> > epoch
> > > > > would
> > > > > > > be
> > > > > > > > > the
> > > > > > > > > > > same
> > > > > > > > > > > >> > > (same ledger), I could write duplicates.
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > > > We are currently proposing adding a
> transaction
> > > > > semantic
> > > > > > > to
> > > > > > > > dl
> > > > > > > > > > to
> > > > > > > > > > > >> get
> > > > > > > > > > > >> > rid
> > > > > > > > > > > >> > > > of the size limitation and the unaware-ness in
> > the
> > > > > proxy
> > > > > > > > > client.
> > > > > > > > > > > >> Here
> > > > > > > > > > > >> > is
> > > > > > > > > > > >> > > > our idea -
> > > > > > > > > > > >> > > > http://mail-archives.apache.
> > > org/mod_mbox/incubator-
> > > > > > > > > > distributedlog
> > > > > > > > > > > >> > > -dev/201609.mbox/%3cCAAC6BxP5YyEHwG0ZCF5soh42X=
> > > xuYwYm
> > > > > > > > > > > >> > > <http://mail-archives.apache.
> > > org/mod_mbox/incubator-
> > > > > > > > > > > >> > distributedlog%0A-dev/201609.mbox/%
> > > > > > 3cCAAC6BxP5YyEHwG0ZCF5soh
> > > > > > > > > > > 42X=xuYwYm>
> > > > > > > > > > > >> > > L4nXsYBYiofzxpVk6g@mail.gmail.com%3e
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > > > I am not sure if your idea is similar as ours.
> > but
> > > > > we'd
> > > > > > > like
> > > > > > > > > to
> > > > > > > > > > > >> > > collaborate
> > > > > > > > > > > >> > > > with the community if anyone has the similar
> > idea.
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > > Our use case would be covered by transaction
> > > support,
> > > > > but
> > > > > > > I'm
> > > > > > > > > > unsure
> > > > > > > > > > > >> if
> > > > > > > > > > > >> > we
> > > > > > > > > > > >> > > would need something that heavy weight for the
> > > > > guarantees
> > > > > > we
> > > > > > > > > need.
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > > Basically, the high level requirement here is
> > > "Support
> > > > > > > > > consistent
> > > > > > > > > > > >> write
> > > > > > > > > > > >> > > ordering for single-writer-per-key,
> > > > > multi-writer-per-log".
> > > > > > > My
> > > > > > > > > > hunch
> > > > > > > > > > > is
> > > > > > > > > > > >> > > that, with some added guarantees to the proxy
> (if
> > it
> > > > > isn't
> > > > > > > > > already
> > > > > > > > > > > >> > > supported), and some custom client code on our
> > side
> > > > for
> > > > > > > > removing
> > > > > > > > > > the
> > > > > > > > > > > >> > > entries that actually succeed to write to
> > > > DistributedLog
> > > > > > > from
> > > > > > > > > the
> > > > > > > > > > > >> request
> > > > > > > > > > > >> > > that failed, it should be a relatively easy
> thing
> > to
> > > > > > > support.
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > Yup. I think it should not be very difficult to
> > > support.
> > > > > > There
> > > > > > > > > might
> > > > > > > > > > > be
> > > > > > > > > > > >> > some changes in the server side.
> > > > > > > > > > > >> > Let's figure out what will the changes be. Are you
> > > guys
> > > > > > > > interested
> > > > > > > > > > in
> > > > > > > > > > > >> > contributing?
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > Yes, we would be.
> > > > > > > > > > > >>
> > > > > > > > > > > >> As a note, the one thing that we see as an issue
> with
> > > the
> > > > > > client
> > > > > > > > > side
> > > > > > > > > > > >> dedupping is how to bound the range of data that
> needs
> > > to
> > > > be
> > > > > > > > looked
> > > > > > > > > at
> > > > > > > > > > > for
> > > > > > > > > > > >> deduplication. As you can imagine, it is pretty easy
> > to
> > > > > bound
> > > > > > > the
> > > > > > > > > > bottom
> > > > > > > > > > > >> of
> > > > > > > > > > > >> the range, as that it just regular checkpointing of
> > the
> > > > DSLN
> > > > > > > that
> > > > > > > > is
> > > > > > > > > > > >> returned. I'm still not sure if there is any nice
> way
> > to
> > > > > time
> > > > > > > > bound
> > > > > > > > > > the
> > > > > > > > > > > >> top
> > > > > > > > > > > >> end of the range, especially since the proxy owns
> > > sequence
> > > > > > > numbers
> > > > > > > > > > > (which
> > > > > > > > > > > >> makes sense). I am curious if there is more that can
> > be
> > > > done
> > > > > > if
> > > > > > > > > > > >> deduplication is on the server side. However the
> main
> > > > minus
> > > > > I
> > > > > > > see
> > > > > > > > of
> > > > > > > > > > > >> server
> > > > > > > > > > > >> side deduplication is that instead of running
> > contingent
> > > > on
> > > > > > > there
> > > > > > > > > > being
> > > > > > > > > > > a
> > > > > > > > > > > >> failed client request, instead it would have to run
> > > every
> > > > > > time a
> > > > > > > > > write
> > > > > > > > > > > >> happens.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > For a reliable dedup, we probably need
> > > > fence-then-getLastDLSN
> > > > > > > > > > operation -
> > > > > > > > > > > > so it would guarantee that any non-completed requests
> > > > issued
> > > > > > > > > (lost-ack
> > > > > > > > > > > > requests) before this fence-then-getLastDLSN
> operation
> > > will
> > > > > be
> > > > > > > > failed
> > > > > > > > > > and
> > > > > > > > > > > > they will never land at the log.
> > > > > > > > > > > >
> > > > > > > > > > > > the pseudo code would look like below -
> > > > > > > > > > > >
> > > > > > > > > > > > write(request) onFailure { t =>
> > > > > > > > > > > >
> > > > > > > > > > > > if (t is timeout exception) {
> > > > > > > > > > > >
> > > > > > > > > > > > DLSN lastDLSN = fenceThenGetLastDLSN()
> > > > > > > > > > > > DLSN lastCheckpointedDLSN = ...;
> > > > > > > > > > > > // find if the request lands between [lastDLSN, lastCheckpointedDLSN].
> > > > > > > > > > > > // if it exists, the write succeeded; otherwise retry.
> > > > > > > > > > > >
> > > > > > > > > > > > }
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > }
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Just realized the idea is the same as what Leigh raised in
> > > > > > > > > > > the previous email about 'epoch write'. Let me explain more
> > > > > > > > > > > about this idea (Leigh, feel free to jump in to fill up your
> > > > > > > > > > > idea).
> > > > > > > > > > >
> > > > > > > > > > > - when a log stream is owned, the proxy uses the last
> > > > > > > > > > > transaction id as the epoch
> > > > > > > > > > > - when a client connects (handshakes with the proxy), it will
> > > > > > > > > > > get the epoch for the stream.
> > > > > > > > > > > - the writes issued by this client will carry the epoch to
> > > > > > > > > > > the proxy.
> > > > > > > > > > > - add a new rpc - fenceThenGetLastDLSN - it would force the
> > > > > > > > > > > proxy to bump the epoch.
> > > > > > > > > > > - if fenceThenGetLastDLSN happened, all the outstanding
> > > > > > > > > > > writes with the old epoch will be rejected with exceptions
> > > > > > > > > > > (e.g. EpochFenced).
> > > > > > > > > > > - The DLSN returned from fenceThenGetLastDLSN can be used as
> > > > > > > > > > > the bound for deduplication on failures.
> > > > > > > > > > >
> > > > > > > > > > > Cameron, does this sound like a solution to your use case?
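As a rough illustration of how a client could use this proposal to resolve a
lost-ack failure, here is a self-contained sketch. The EpochProxy interface
below is a stand-in for the proposed handshake / epoch-write /
fenceThenGetLastDLSN RPCs; none of these methods exist yet.

    import java.util.Arrays;
    import java.util.List;

    final class EpochDedupFlow {

        // Stand-ins for the proposed RPCs.
        interface EpochProxy {
            long handshake(String stream);                        // current epoch
            Dlsn write(String stream, byte[] data, long epoch);   // rejected if epoch changed
            Dlsn fenceThenGetLastDLSN(String stream);             // bumps the epoch
            List<Record> read(String stream, Dlsn fromExclusive, Dlsn toInclusive);
        }

        static final class Dlsn { }                               // opaque log position

        static final class Record {
            final Dlsn dlsn;
            final byte[] payload;
            Record(Dlsn dlsn, byte[] payload) { this.dlsn = dlsn; this.payload = payload; }
        }

        // Called after a lost-ack failure: fence first, so the old-epoch write
        // can no longer land, then scan the bounded range since the last
        // checkpoint to see whether the write actually made it.
        Dlsn resolveLostAck(EpochProxy proxy, String stream,
                            Dlsn lastCheckpointed, byte[] payload) {
            Dlsn lastDLSN = proxy.fenceThenGetLastDLSN(stream);
            for (Record record : proxy.read(stream, lastCheckpointed, lastDLSN)) {
                if (Arrays.equals(record.payload, payload)) {
                    return record.dlsn;                           // the write did succeed
                }
            }
            // Not found within the bound, so it will never appear: safe to
            // retry under the new epoch.
            long newEpoch = proxy.handshake(stream);
            return proxy.write(stream, payload, newEpoch);
        }
    }
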
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >>
> > > > > > > > > > > >> Maybe something that could fit a similar need that
> > Kafka
> > > > > does
> > > > > > > (the
> > > > > > > > > > last
> > > > > > > > > > > >> store value for a particular key in a log), such
> that
> > > on a
> > > > > per
> > > > > > > key
> > > > > > > > > > basis
> > > > > > > > > > > >> there could be a sequence number that support
> > > > deduplication?
> > > > > > > Cost
> > > > > > > > > > seems
> > > > > > > > > > > >> like it would be high however, and I'm not even sure
> > if
> > > > > > > bookkeeper
> > > > > > > > > > > >> supports
> > > > > > > > > > > >> it.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >> Cheers,
> > > > > > > > > > > >> Cameron
> > > > > > > > > > > >>
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > > Thanks,
> > > > > > > > > > > >> > > Cameron
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > > On Sat, Oct 8, 2016 at 7:35 AM, Leigh Stewart
> > > > > > > > > > > >> > <lstewart@twitter.com.invalid
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > wrote:
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > > > Cameron:
> > > > > > > > > > > >> > > > Another thing we've discussed but haven't
> really
> > > > > thought
> > > > > > > > > > through -
> > > > > > > > > > > >> > > > We might be able to support some kind of epoch
> > > write
> > > > > > > > request,
> > > > > > > > > > > where
> > > > > > > > > > > >> the
> > > > > > > > > > > >> > > > epoch is guaranteed to have changed if the
> > writer
> > > > has
> > > > > > > > changed
> > > > > > > > > or
> > > > > > > > > > > the
> > > > > > > > > > > >> > > ledger
> > > > > > > > > > > >> > > > was ever fenced off. Writes include an epoch
> and
> > > are
> > > > > > > > rejected
> > > > > > > > > if
> > > > > > > > > > > the
> > > > > > > > > > > >> > > epoch
> > > > > > > > > > > >> > > > has changed.
> > > > > > > > > > > >> > > > With a mechanism like this, fencing the ledger
> > off
> > > > > > after a
> > > > > > > > > > failure
> > > > > > > > > > > >> > would
> > > > > > > > > > > >> > > > ensure any pending writes had either been
> > written
> > > or
> > > > > > would
> > > > > > > > be
> > > > > > > > > > > >> rejected.
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > > On Sat, Oct 8, 2016 at 7:10 AM, Sijie Guo <
> > > > > > > sijie@apache.org
> > > > > > > > >
> > > > > > > > > > > wrote:
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > > > Cameron,
> > > > > > > > > > > >> > > > >
> > > > > > > > > > > >> > > > > I think both Leigh and Xi had made a few
> good
> > > > points
> > > > > > > about
> > > > > > > > > > your
> > > > > > > > > > > >> > > question.
> > > > > > > > > > > >> > > > >
> > > > > > > > > > > >> > > > > To add one more point to your question -
> "but
> > I
> > > am
> > > > > not
> > > > > > > > > > > >> > > > > 100% of how all of the futures in the code
> > > handle
> > > > > > > > failures.
> > > > > > > > > > > >> > > > > If not, where in the code would be the
> > relevant
> > > > > places
> > > > > > > to
> > > > > > > > > add
> > > > > > > > > > > the
> > > > > > > > > > > >> > > ability
> > > > > > > > > > > >> > > > > to do this, and would the project be
> > interested
> > > > in a
> > > > > > > pull
> > > > > > > > > > > >> request?"
> > > > > > > > > > > >> > > > >
> > > > > > > > > > > >> > > > > The current proxy and client logic doesn't
> do
> > > > > > perfectly
> > > > > > > on
> > > > > > > > > > > >> handling
> > > > > > > > > > > >> > > > > failures (duplicates) - the strategy now is
> > the
> > > > > client
> > > > > > > > will
> > > > > > > > > > > retry
> > > > > > > > > > > >> as
> > > > > > > > > > > >> > > best
> > > > > > > > > > > >> > > > > at it can before throwing exceptions to
> users.
> > > The
> > > > > > code
> > > > > > > > you
> > > > > > > > > > are
> > > > > > > > > > > >> > looking
> > > > > > > > > > > >> > > > for
> > > > > > > > > > > >> > > > > - it is on BKLogSegmentWriter for the proxy
> > > > handling
> > > > > > > > writes
> > > > > > > > > > and
> > > > > > > > > > > >> it is
> > > > > > > > > > > >> > > on
> > > > > > > > > > > >> > > > > DistributedLogClientImpl for the proxy
> client
> > > > > handling
> > > > > > > > > > responses
> > > > > > > > > > > >> from
> > > > > > > > > > > >> > > > > proxies. Does this help you?
> > > > > > > > > > > >> > > > >
> > > > > > > > > > > >> > > > > And also, you are welcome to contribute the
> > pull
> > > > > > > requests.
> > > > > > > > > > > >> > > > >
> > > > > > > > > > > >> > > > > - Sijie
> > > > > > > > > > > >> > > > >
> > > > > > > > > > > >> > > > >
> > > > > > > > > > > >> > > > >
> > > > > > > > > > > >> > > > > On Tue, Oct 4, 2016 at 3:39 PM, Cameron
> > > Hatfield <
> > > > > > > > > > > >> kinguy@gmail.com>
> > > > > > > > > > > >> > > > wrote:
> > > > > > > > > > > >> > > > >
> > > > > > > > > > > >> > > > > > I have a question about the Proxy Client.
> > > > > Basically,
> > > > > > > for
> > > > > > > > > our
> > > > > > > > > > > use
> > > > > > > > > > > >> > > cases,
> > > > > > > > > > > >> > > > > we
> > > > > > > > > > > >> > > > > > want to guarantee ordering at the key
> level,
> > > > > > > > irrespective
> > > > > > > > > of
> > > > > > > > > > > the
> > > > > > > > > > > >> > > > ordering
> > > > > > > > > > > >> > > > > > of the partition it may be assigned to as
> a
> > > > whole.
> > > > > > Due
> > > > > > > > to
> > > > > > > > > > the
> > > > > > > > > > > >> > source
> > > > > > > > > > > >> > > of
> > > > > > > > > > > >> > > > > the
> > > > > > > > > > > >> > > > > > data (HBase Replication), we cannot
> > guarantee
> > > > > that a
> > > > > > > > > single
> > > > > > > > > > > >> > partition
> > > > > > > > > > > >> > > > > will
> > > > > > > > > > > >> > > > > > be owned for writes by the same client.
> This
> > > > means
> > > > > > the
> > > > > > > > > proxy
> > > > > > > > > > > >> client
> > > > > > > > > > > >> > > > works
> > > > > > > > > > > >> > > > > > well (since we don't care which proxy owns
> > the
> > > > > > > partition
> > > > > > > > > we
> > > > > > > > > > > are
> > > > > > > > > > > >> > > writing
> > > > > > > > > > > >> > > > > > to).
> > > > > > > > > > > >> > > > > >
> > > > > > > > > > > >> > > > > >
> > > > > > > > > > > >> > > > > > However, the guarantees we need when
> > writing a
> > > > > batch
> > > > > > > > > > consists
> > > > > > > > > > > >> of:
> > > > > > > > > > > >> > > > > > Definition of a Batch: The set of records
> > sent
> > > > to
> > > > > > the
> > > > > > > > > > > writeBatch
> > > > > > > > > > > >> > > > endpoint
> > > > > > > > > > > >> > > > > > on the proxy
> > > > > > > > > > > >> > > > > >
> > > > > > > > > > > >> > > > > > 1. Batch success: If the client receives a
> > > > success
> > > > > > > from
> > > > > > > > > the
> > > > > > > > > > > >> proxy,
> > > > > > > > > > > >> > > then
> > > > > > > > > > > >> > > > > > that batch is successfully written
> > > > > > > > > > > >> > > > > >
> > > > > > > > > > > >> > > > > > 2. Inter-Batch ordering : Once a batch has
> > > been
> > > > > > > written
> > > > > > > > > > > >> > successfully
> > > > > > > > > > > >> > > by
> > > > > > > > > > > >> > > > > the
> > > > > > > > > > > >> > > > > > client, when another batch is written, it
> > will
> > > > be
> > > > > > > > > guaranteed
> > > > > > > > > > > to
> > > > > > > > > > > >> be
> > > > > > > > > > > >> > > > > ordered
> > > > > > > > > > > >> > > > > > after the last batch (if it is the same
> > > stream).
> > > > > > > > > > > >> > > > > >
> > > > > > > > > > > >> > > > > > 3. Intra-Batch ordering: Within a batch of
> > > > writes,
> > > > > > the
> > > > > > > > > > records
> > > > > > > > > > > >> will
> > > > > > > > > > > >> > > be
> > > > > > > > > > > >> > > > > > committed in order
> > > > > > > > > > > >> > > > > >
> > > > > > > > > > > >> > > > > > 4. Intra-Batch failure ordering: If an
> > > > individual
> > > > > > > record
> > > > > > > > > > fails
> > > > > > > > > > > >> to
> > > > > > > > > > > >> > > write
> > > > > > > > > > > >> > > > > > within a batch, all records after that
> > record
> > > > will
> > > > > > not
> > > > > > > > be
> > > > > > > > > > > >> written.
> > > > > > > > > > > >> > > > > >
> > > > > > > > > > > >> > > > > > 5. Batch Commit: Guarantee that if a batch
> > > > > returns a
> > > > > > > > > > success,
> > > > > > > > > > > it
> > > > > > > > > > > >> > will
> > > > > > > > > > > >> > > > be
> > > > > > > > > > > >> > > > > > written
> > > > > > > > > > > >> > > > > >
> > > > > > > > > > > >> > > > > > 6. Read-after-write: Once a batch is
> > > committed,
> > > > > > > within a
> > > > > > > > > > > limited
> > > > > > > > > > > >> > > > > time-frame
> > > > > > > > > > > >> > > > > > it will be able to be read. This is
> required
> > > in
> > > > > the
> > > > > > > case
> > > > > > > > > of
> > > > > > > > > > > >> > failure,
> > > > > > > > > > > >> > > so
> > > > > > > > > > > >> > > > > > that the client can see what actually got
> > > > > > committed. I
> > > > > > > > > > believe
> > > > > > > > > > > >> the
> > > > > > > > > > > >> > > > > > time-frame part could be removed if the
> > client
> > > > can
> > > > > > > send
> > > > > > > > in
> > > > > > > > > > the
> > > > > > > > > > > >> same
> > > > > > > > > > > >> > > > > > sequence number that was written
> previously,
> > > > since
> > > > > > it
> > > > > > > > > would
> > > > > > > > > > > then
> > > > > > > > > > > >> > fail
> > > > > > > > > > > >> > > > and
> > > > > > > > > > > >> > > > > > we would know that a read needs to occur.
> > > > > > > > > > > >> > > > > >
> > > > > > > > > > > >> > > > > >
> > > > > > > > > > > >> > > > > > So, my basic question is if this is
> > currently
> > > > > > possible
> > > > > > > > in
> > > > > > > > > > the
> > > > > > > > > > > >> > proxy?
> > > > > > > > > > > >> > > I
> > > > > > > > > > > >> > > > > > don't believe it gives these guarantees as
> > it
> > > > > stands
> > > > > > > > > today,
> > > > > > > > > > > but
> > > > > > > > > > > >> I
> > > > > > > > > > > >> > am
> > > > > > > > > > > >> > > > not
> > > > > > > > > > > >> > > > > > 100% of how all of the futures in the code
> > > > handle
> > > > > > > > > failures.
> > > > > > > > > > > >> > > > > > If not, where in the code would be the
> > > relevant
> > > > > > places
> > > > > > > > to
> > > > > > > > > > add
> > > > > > > > > > > >> the
> > > > > > > > > > > >> > > > ability
> > > > > > > > > > > >> > > > > > to do this, and would the project be
> > > interested
> > > > > in a
> > > > > > > > pull
> > > > > > > > > > > >> request?
> > > > > > > > > > > >> > > > > >
> > > > > > > > > > > >> > > > > >
> > > > > > > > > > > >> > > > > > Thanks,
> > > > > > > > > > > >> > > > > > Cameron
> > > > > > > > > > > >> > > > > >
> > > > > > > > > > > >> > > > >
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> >
> > > > > > > > > > > >>
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Proxy Client - Batch Ordering / Commit

Posted by Jon Derrick <jo...@gmail.com>.
On Fri, Nov 18, 2016 at 2:37 PM, Sijie Guo <si...@apache.org> wrote:

> On Thu, Nov 17, 2016 at 2:30 AM, Xi Liu <xi...@gmail.com> wrote:
>
> > Cameron,
> >
> > Thank you for your comments. It's very helpful. My replies are inline.
> >
> > On Wed, Nov 16, 2016 at 11:59 AM, Cameron Hatfield <ki...@gmail.com>
> > wrote:
> >
> > > "A couple of questions" is what I originally wrote, and then the
> > following
> > > happened. Sorry about the large swath of them, making sure my
> > understanding
> > > of the code base, as well as the DL/Bookkeeper/ZK ecosystem
> interaction,
> > > makes sense.
> > >
> > > ==General:
> > > What is an exclusive session? What is it providing over a regular
> > session?
> > >
> >
> > The idea here is to provide exclusive writer semantics for the
> > distributedlog (thin) client use case, so it can behave similarly to using
> > the storage (distributedlog-core) library directly.
> >
>
> +1 for an exclusive writer feature. Leigh and I were talking about this
> before. I am glad to see it is happening now. However, it might be worth
> splitting the exclusive writer feature into a separate task once we have the
> 'fencing' feature available.
>

The exclusive writer feature is exactly what I am looking for. +1


>
>
> >
> >
> > >
> > >
> > > ==Proxy side:
> > > Should a new streamop be added for the fencing operation, or does it
> make
> > > sense to piggyback on an existing one (such as write)?
> >
> >
> > > ====getLastDLSN:
> > > What should be the return result for:
> > > A new stream
> > > A new session, after successful fencing
> > > A new session, after a change in ownership / first starting up
> > >
> > > What is the main use case for getLastDLSN(<stream>, false)? Is this to
> > > guarantee that the recovery process has happened in case of ownership
> > > failure (I don't have a good understanding of what causes the recovery
> > > process to happen, especially from the reader side)? Or is it to handle
> > the
> > > lost-ack problem? Since all the rest of the read related things go
> > through
> > > the read client, I'm not sure if I see the use case, but it seems like
> > > there would be a large potential for confusion on which to use. What
> > about
> > > just a fenceSession op that always fences, returning the DLSN of the
> > > fence, and leaving the normal getLastDLSN for the regular read client?
> > >
> >
> >
> > Ah, your point is valid. I was following the style of bookkeeper. The
> > bookkeeper client supplies a fencing flag on readLastAddConfirmed request
> > during recovery.
> >
> > But you are right. It is cleaner to have just a fenceSessionOp and return
> > the DLSN of the fence request.
> >
>
>
> Correct. In the bookkeeper client, the fencing flag is set with a read op.
> However, it is part of the recovery procedure and internal to the client. I
> agree with Cameron that we should hide the details from the public API.
> Otherwise, it will cause confusion.
>
>
> >
> >
> >
> > >
> > > ====Fencing:
> > > When a fence session occurs, what call needs to be made to make sure
> any
> > > outstanding writes are flushed and committed (so that we guarantee the
> > > client will be able to read anything that was in the write queue)?
> > > Is there a guaranteed ordering for things written in the future queue
> for
> > > AsyncLogWriter (I'm not quite confident that I was able to accurately
> > > follow the logic, as there are many parts of the code that write, have
> > > queues, heartbeat, etc)?
> > >
> >
> > I believe that when calling AsyncLogWriter's asyncWrite in order, the
> > records will be written in order. Sijie or Leigh can confirm that.
> >
>
> That's right. The writes are guaranteed to be written in the order in which
> they are issued.
>
>
> >
> > Since the writes are in order, when a fence session occurs, the control
> > record written successfully by the writer guarantees that the writes
> > issued before the fence session write are flushed and committed.
> >
> > We need to invalidate the session when we fence it, so any writes carrying
> > the old session that arrive after the control record is written will be
> > rejected. In this way, the client is guaranteed to be able to read
> > everything in a consistent way.
> >
> >
> > >
> > > ====SessionID:
> > > What is the default sessionid / transactionid for a new stream? I
> assume
> > > this would just be the first control record
> > >
> >
> > The initial session id will be the transaction id of the first control
> > record.
> >
> >
> > >
> > > ======Should all streams have a sessionid by default, regardless of
> > > whether it is ever used by a client (aka, every time ownership changes, a
> > > new control record is generated, and a sessionid is stored)?
> > > The main edge case that would have to be handled is if a client writes
> > > with an old sessionid, but the owner has changed and has yet to create a
> > > sessionid. This should be handled by the "non-matching sessionid" rule,
> > > since the invalid sessionid wouldn't match the passed sessionid, which
> > > should cause the client to get a new sessionid.
> > >
> >
> > I think all streams should just have a session id by default. The session
> > id is changed when ownership is changed or it is explicitly bumped by a
> > fence session op.
> >
> >
> > >
> > > ======Where in the code does it make sense to own the session, the
> stream
> > > interfaces / classes? Should they pass that information down to the
> ops,
> > or
> > > do the sessionid check within?
> > > My first thought would be Stream owns the sessionid, passes it into the
> > ops
> > > (as either a nullable value, or an invalid default value), which then
> do
> > > the sessionid check if they care. The main issue is updating the
> > sessionid
> > > is a bit backwards, as either every op has the ability to update it
> > through
> > > some type of return value / direct stream access / etc, or there is a
> > > special case in the stream for the fence operation / any other
> operation
> > > that can update the session.
> > >
> >
> > My thought is to add the session id in Stream (StreamImpl.java). The stream
> > validates the session id before submitting a stream op. If it is a fence
> > session op, it would just invalidate the current session, so subsequent
> > requests with the old session will be rejected.
> >
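A self-contained sketch of that validation path (the names below are
illustrative, not the actual StreamImpl API): the stream keeps the current
session id, rejects ops carrying a stale id without attempting the write, and
a fence op simply bumps the id.

    import java.util.concurrent.atomic.AtomicLong;

    // Illustrative sketch; not the actual StreamImpl API.
    final class SessionGate {

        private final AtomicLong sessionId;

        SessionGate(long initialSessionId) {
            this.sessionId = new AtomicLong(initialSessionId);
        }

        // Validate a normal write op before it is submitted. A stale session id
        // is rejected without any attempt to write, so the client can safely
        // adopt the current id and retry.
        boolean validate(long requestSessionId) {
            return requestSessionId == sessionId.get();
        }

        // Fence: bump the session id (e.g. to the transaction id of the control
        // record just written), so in-flight ops carrying the old id are rejected.
        long fence(long newSessionId) {
            sessionId.set(newSessionId);
            return newSessionId;
        }

        long current() {
            return sessionId.get();
        }
    }
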
>
> >
> > >
> > > ======For "the owner of the log stream will first advance the
> transaction
> > > id generator to claim a new transaction id and write a control record
> to
> > > the log stream. ":
> > > Should "DistributedLogConstants.CONTROL_RECORD_CONTENT" be the type of
> > > control record written?
> > > Should the "writeControlRecord" on BKAsyncLogWriter be exposed in the
> > > AsyncLogWriter interface? Or even the one within the segment writer? Or
> > > should the code be duplicated / pulled out into a helper / etc?
> > > (Not a big Java person, so any suggestions on the "Java Way", or at
> least
> > > the DL way, to do it would be appreciated)
> > >
> >
> > I believe we can just construct a log record and set it to be a control
> > record and write it.
> >
> > LogRecord record = ...
> > record.setControl();
> > writer.write(record);
> >
> > (Can anyone from the community confirm that it is okay to write a control
> > record in this way?)
> >
>
> Ideally we would like to hide the logic from public usage. However, since
> it is a code change at proxy side, it is absolutely fine.
>
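A slightly fuller sketch of that snippet, assuming the owner's AsyncLogWriter
and the DistributedLogConstants.CONTROL_RECORD_CONTENT payload mentioned
earlier, and assuming setControl() is accessible from the proxy code; the
write is resolved synchronously with Await only for brevity.

    import com.twitter.distributedlog.AsyncLogWriter;
    import com.twitter.distributedlog.DLSN;
    import com.twitter.distributedlog.DistributedLogConstants;
    import com.twitter.distributedlog.LogRecord;
    import com.twitter.util.Await;

    final class ControlRecordWriter {

        // Write a session control record through the owner's writer. Because
        // writes complete in issue order, the returned DLSN also implies that
        // every record issued before it is flushed and committed.
        static DLSN writeSessionControlRecord(AsyncLogWriter writer, long newSessionTxId)
                throws Exception {
            LogRecord record = new LogRecord(newSessionTxId,
                    DistributedLogConstants.CONTROL_RECORD_CONTENT);
            record.setControl();               // mark it as a control record
            return Await.result(writer.write(record));
        }
    }
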
>
> >
> >
> >
> > >
> > > ======Transaction ID:
> > > The BKLogSegmentWriter ignores the transaction ids from control records
> > > when it records the "LastTXId." Would that be an issue here for
> anything?
> > > It looks like it may do that because it assumes you're calling its local
> > > function for writing a control record, which uses the lastTxId.
> > >
> >
> > I think the lastTxId is for the last transaction id of user records, so we
> > probably shouldn't change the behavior of how we record the lastTxId.
> > However, we can change how we fetch the last tx id for id generation after
> > recovery.
> >
>
>
> getLastTxId should have a flag to include control records or not.
>
>
> >
> >
> > >
> > >
> > > ==Thrift Interface:
> > > ====Should the write response be split out for different calls?
> > > It seems odd to have a single struct with many optional items that are
> > > filled depending on the call made for every rpc call. This is mostly a
> > > curiosity question, since I assume it comes from the general practices
> > from
> > > using thrift for a while. Would it at least make sense for the
> > > getLastDLSN/fence endpoint to have a new struct?
> > >
> >
> > I don't have any preference here. It might make sense to have a new
> struct.
> >
> >
> > >
> > > ====Any particular error code that makes sense for session fenced? If
> we
> > > want to be close to the HTTP errors, looks like 412 (PRECONDITION
> FAILED)
> > > might make the most sense, if a bit generic.
> > >
> > > 412 def:
> > > "The precondition given in one or more of the request-header fields
> > > evaluated to false when it was tested on the server. This response code
> > > allows the client to place preconditions on the current resource
> > > metainformation (header field data) and thus prevent the requested
> method
> > > from being applied to a resource other than the one intended."
> > >
> > > 412 excerpt from the If-match doc:
> > > "This behavior is most useful when the client wants to prevent an
> > updating
> > > method, such as PUT, from modifying a resource that has changed since
> the
> > > client last retrieved it."
> > >
> >
> > I liked your idea.
> >
> >
> > >
> > > ====Should we return the sessionid to the client in the "fencesession"
> > > calls?
> > > Seems like it may be useful when you fence, especially if you have some
> > > type of custom sequencer where it would make sense for search, or for
> > > debugging.
> > > Main minus is that it would be easy for users to create an implicit
> > > requirement that the sessionid is forever a valid transactionid, which
> > may
> > > not always be the case long term for the project.
> > >
> >
> > I think we should probably start with not returning it, and add it if we
> > really need it.
> >
> >
> > >
> > >
> > > ==Client:
> > > ====What is the proposed process for the client retrieving the new
> > > sessionid?
> > > A full reconnect? No special case code, but intrusive on the client
> side,
> > > and possibly expensive garbage/processing wise. (though this type of
> > > failure should hopefully be rare enough to not be a problem)
> > > A call to reset the sessionid? Less intrusive, but has all the issues you
> > > get with mutable object methods that need to be called in a certain order,
> > > edge cases such as outstanding/buffered requests to the old stream, etc.
> > > The call could also return the new sessionid, making it a good call for
> > > storing or debugging the value.
> > >
> >
> > It is probably not good to tie a session to a connection, especially since
> > the underlying communication is an RPC framework.
> >
> > I think the new session id can be piggybacked on the rejected response. The
> > client doesn't have to explicitly retrieve a new session id.
> >
>
> +1 for separating session from connection. Especially since the 'session'
> here is more of a lifecycle concept for a stream.
>
>
>
> >
> >
> > >
> > > ====Session Fenced failure:
> > > Will this put the client into a failure state, stopping all future
> writes
> > > until fixed?
> > >
> >
> > My thought is that the new session id will be piggybacked on the fence
> > response, so the client will know the new session id and all future writes
> > will just carry it.
> >
> >
> > > Is it even possible to get this error when ownership changes? The
> > > connection to the new owner should get a new sessionid on connect, so I
> > > would expect not.
> > >
> >
> > I think it will. But the client should handle it and retry, as the
> > session-fenced exception indicates that the write was never attempted.
> >
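Putting the last few points together, here is a self-contained sketch of the
client-side handling: a session-fenced rejection means the write was never
attempted, so the client adopts the piggybacked session id and retries. All
names here (Proxy, WriteResult, SESSION_FENCED) are illustrative, not an
existing API.

    final class SessionAwareWriter {

        static final int OK = 0;
        static final int SESSION_FENCED = 412;

        interface Proxy {
            WriteResult write(String stream, byte[] data, long sessionId);
        }

        static final class WriteResult {
            final int code;
            final long currentSessionId;   // piggybacked by the proxy on rejection
            WriteResult(int code, long currentSessionId) {
                this.code = code;
                this.currentSessionId = currentSessionId;
            }
        }

        private long sessionId;

        SessionAwareWriter(long handshakeSessionId) {
            this.sessionId = handshakeSessionId;
        }

        WriteResult write(Proxy proxy, String stream, byte[] data) {
            WriteResult result = proxy.write(stream, data, sessionId);
            if (result.code == SESSION_FENCED) {
                // The rejected write was never attempted, so retrying cannot
                // create a duplicate; just adopt the new session id.
                sessionId = result.currentSessionId;
                result = proxy.write(stream, data, sessionId);
            }
            return result;
        }
    }
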
> >
> > >
> > > Cheers,
> > > Cameron
> > >
> > > On Tue, Nov 15, 2016 at 2:01 AM, Xi Liu <xi...@gmail.com> wrote:
> > >
> > > > Thank you, Cameron. Look forward to your comments.
> > > >
> > > > - Xi
> > > >
> > > > On Sun, Nov 13, 2016 at 1:21 PM, Cameron Hatfield <ki...@gmail.com>
> > > > wrote:
> > > >
> > > > > Sorry, I've been on vacation for the past week, and heads down for
> a
> > > > > release that is using DL at the end of Nov. I'll take a look at
> this
> > > over
> > > > > the next week, and add any relevant comments. After we are finished
> > > with
> > > > > dev for this release, I am hoping to tackle this next.
> > > > >
> > > > > -Cameron
> > > > >
> > > > > On Fri, Nov 11, 2016 at 12:07 PM, Sijie Guo <si...@apache.org>
> > wrote:
> > > > >
> > > > > > Xi,
> > > > > >
> > > > > > Thank you so much for your proposal. I took a look. It looks fine
> > to
> > > > me.
> > > > > > Cameron, do you have any comments?
> > > > > >
> > > > > > Look forward to your pull requests.
> > > > > >
> > > > > > - Sijie
> > > > > >
> > > > > >
> > > > > > On Wed, Nov 9, 2016 at 2:34 AM, Xi Liu <xi...@gmail.com>
> > wrote:
> > > > > >
> > > > > > > Cameron,
> > > > > > >
> > > > > > > Have you started any work for this? I just updated the proposal
> > > page
> > > > -
> > > > > > > https://cwiki.apache.org/confluence/display/DL/DP-2+-+
> > > > > > Epoch+Write+Support
> > > > > > > Maybe we can work together with this.
> > > > > > >
> > > > > > > Sijie, Leigh,
> > > > > > >
> > > > > > > can you guys help review this to make sure our proposal is in
> the
> > > > right
> > > > > > > direction?
> > > > > > >
> > > > > > > - Xi
> > > > > > >
> > > > > > > On Tue, Nov 1, 2016 at 3:05 AM, Sijie Guo <si...@apache.org>
> > > wrote:
> > > > > > >
> > > > > > > > I created https://issues.apache.org/jira/browse/DL-63 for
> > > tracking
> > > > > the
> > > > > > > > proposed idea here.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Wed, Oct 26, 2016 at 4:53 PM, Sijie Guo
> > > > > <sijieg@twitter.com.invalid
> > > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > On Tue, Oct 25, 2016 at 11:30 AM, Cameron Hatfield <
> > > > > kinguy@gmail.com
> > > > > > >
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Yes, we are reading the HBase WAL (from their replication
> > > > plugin
> > > > > > > > > support),
> > > > > > > > > > and writing that into DL.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Gotcha.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > From the sounds of it, yes, it would. Only thing I would
> > say
> > > is
> > > > > > make
> > > > > > > > the
> > > > > > > > > > epoch requirement optional, so that if I client doesn't
> > care
> > > > > about
> > > > > > > > dupes
> > > > > > > > > > they don't have to deal with the process of getting a new
> > > > epoch.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Yup. This should be optional. I can start a wiki page on
> how
> > we
> > > > > want
> > > > > > to
> > > > > > > > > implement this. Are you interested in contributing to this?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > -Cameron
> > > > > > > > > >
> > > > > > > > > > On Wed, Oct 19, 2016 at 7:43 PM, Sijie Guo
> > > > > > > <sijieg@twitter.com.invalid
> > > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > On Wed, Oct 19, 2016 at 7:17 PM, Sijie Guo <
> > > > sijieg@twitter.com
> > > > > >
> > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On Monday, October 17, 2016, Cameron Hatfield <
> > > > > > kinguy@gmail.com>
> > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > >> Answer inline:
> > > > > > > > > > > >>
> > > > > > > > > > > >> On Mon, Oct 17, 2016 at 11:46 AM, Sijie Guo <
> > > > > sijie@apache.org
> > > > > > >
> > > > > > > > > wrote:
> > > > > > > > > > > >>
> > > > > > > > > > > >> > Cameron,
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > Thank you for your summary. I liked the discussion
> > > > here. I
> > > > > > > also
> > > > > > > > > > liked
> > > > > > > > > > > >> the
> > > > > > > > > > > >> > summary of your requirement -
> > 'single-writer-per-key,
> > > > > > > > > > > >> > multiple-writers-per-log'. If I understand
> > correctly,
> > > > the
> > > > > > core
> > > > > > > > > > concern
> > > > > > > > > > > >> here
> > > > > > > > > > > >> > is almost 'exact-once' write (or a way to explicit
> > > tell
> > > > > if a
> > > > > > > > write
> > > > > > > > > > can
> > > > > > > > > > > >> be
> > > > > > > > > > > >> > retried or not).
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > Comments inline.
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > On Fri, Oct 14, 2016 at 11:17 AM, Cameron
> Hatfield <
> > > > > > > > > > kinguy@gmail.com>
> > > > > > > > > > > >> > wrote:
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > > > Ah- yes good point (to be clear we're not
> using
> > > the
> > > > > > proxy
> > > > > > > > this
> > > > > > > > > > way
> > > > > > > > > > > >> > > today).
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > > > > Due to the source of the
> > > > > > > > > > > >> > > > > data (HBase Replication), we cannot
> guarantee
> > > > that a
> > > > > > > > single
> > > > > > > > > > > >> partition
> > > > > > > > > > > >> > > will
> > > > > > > > > > > >> > > > > be owned for writes by the same client.
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > > > Do you mean you *need* to support multiple
> > writers
> > > > > > issuing
> > > > > > > > > > > >> interleaved
> > > > > > > > > > > >> > > > writes or is it just that they might sometimes
> > > > > > interleave
> > > > > > > > > writes
> > > > > > > > > > > and
> > > > > > > > > > > >> > you
> > > > > > > > > > > >> > > >don't care?
> > > > > > > > > > > >> > > How HBase partitions the keys being written
> > wouldn't
> > > > > have
> > > > > > a
> > > > > > > > > > one->one
> > > > > > > > > > > >> > > mapping with the partitions we would have in
> > HBase.
> > > > Even
> > > > > > if
> > > > > > > we
> > > > > > > > > did
> > > > > > > > > > > >> have
> > > > > > > > > > > >> > > that alignment when the cluster first started,
> > HBase
> > > > > will
> > > > > > > > > > rebalance
> > > > > > > > > > > >> what
> > > > > > > > > > > >> > > servers own what partitions, as well as split
> and
> > > > merge
> > > > > > > > > partitions
> > > > > > > > > > > >> that
> > > > > > > > > > > >> > > already exist, causing eventual drift from one
> log
> > > per
> > > > > > > > > partition.
> > > > > > > > > > > >> > > Because we want ordering guarantees per key (row
> > in
> > > > > > hbase),
> > > > > > > we
> > > > > > > > > > > >> partition
> > > > > > > > > > > >> > > the logs by the key. Since multiple writers are
> > > > possible
> > > > > > per
> > > > > > > > > range
> > > > > > > > > > > of
> > > > > > > > > > > >> > keys
> > > > > > > > > > > >> > > (due to the aforementioned rebalancing /
> > splitting /
> > > > etc
> > > > > > of
> > > > > > > > > > hbase),
> > > > > > > > > > > we
> > > > > > > > > > > >> > > cannot use the core library due to requiring a
> > > single
> > > > > > writer
> > > > > > > > for
> > > > > > > > > > > >> > ordering.
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > > But, for a single log, we don't really care
> about
> > > > > ordering
> > > > > > > > aside
> > > > > > > > > > > from
> > > > > > > > > > > >> at
> > > > > > > > > > > >> > > the per-key level. So all we really need to be
> > able
> > > to
> > > > > > > handle
> > > > > > > > is
> > > > > > > > > > > >> > preventing
> > > > > > > > > > > >> > > duplicates when a failure occurs, and ordering
> > > > > consistency
> > > > > > > > > across
> > > > > > > > > > > >> > requests
> > > > > > > > > > > >> > > from a single client.
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > > So our general requirements are:
> > > > > > > > > > > >> > > Write A, Write B
> > > > > > > > > > > >> > > Timeline: A -> B
> > > > > > > > > > > >> > > Request B is only made after A has successfully
> > > > returned
> > > > > > > > > (possibly
> > > > > > > > > > > >> after
> > > > > > > > > > > >> > > retries)
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > > 1) If the write succeeds, it will be durably
> > exposed
> > > > to
> > > > > > > > clients
> > > > > > > > > > > within
> > > > > > > > > > > >> > some
> > > > > > > > > > > >> > > bounded time frame
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > Guaranteed.
> > > > > > > > > > > >> >
> > > > > > > > > > > >>
> > > > > > > > > > > >> > > 2) If A succeeds and B succeeds, the ordering
> for
> > > the
> > > > > log
> > > > > > > will
> > > > > > > > > be
> > > > > > > > > > A
> > > > > > > > > > > >> and
> > > > > > > > > > > >> > > then B
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > If I understand correctly here, B is only sent
> > after A
> > > > is
> > > > > > > > > returned,
> > > > > > > > > > > >> right?
> > > > > > > > > > > >> > If that's the case, It is guaranteed.
> > > > > > > > > > > >>
> > > > > > > > > > > >>
> > > > > > > > > > > >>
> > > > > > > > > > > >> >
> > > > > > > > > > > >> >
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > > 3) If A fails due to an error that can be relied
> > on
> > > to
> > > > > > *not*
> > > > > > > > be
> > > > > > > > > a
> > > > > > > > > > > lost
> > > > > > > > > > > >> > ack
> > > > > > > > > > > >> > > problem, it will never be exposed to the client,
> > so
> > > it
> > > > > may
> > > > > > > > > > > (depending
> > > > > > > > > > > >> on
> > > > > > > > > > > >> > > the error) be retried immediately
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > If it is not a lost-ack problem, the entry will be
> > > > > exposed.
> > > > > > it
> > > > > > > > is
> > > > > > > > > > > >> > guaranteed.
> > > > > > > > > > > >>
> > > > > > > > > > > >> Let me try rephrasing the questions, to make sure
> I'm
> > > > > > > > understanding
> > > > > > > > > > > >> correctly:
> > > > > > > > > > > >> If A fails, with an error such as "Unable to create
> > > > > connection
> > > > > > > to
> > > > > > > > > > > >> bookkeeper server", that would be the type of error
> we
> > > > would
> > > > > > > > expect
> > > > > > > > > to
> > > > > > > > > > > be
> > > > > > > > > > > >> able to retry immediately, as that result means no
> > > action
> > > > > was
> > > > > > > > taken
> > > > > > > > > on
> > > > > > > > > > > any
> > > > > > > > > > > >> log / etc, so no entry could have been created. This
> > is
> > > > > > > different
> > > > > > > > > > then a
> > > > > > > > > > > >> "Connection Timeout" exception, as we just might not
> > > have
> > > > > > > gotten a
> > > > > > > > > > > >> response
> > > > > > > > > > > >> in time.
> > > > > > > > > > > >>
> > > > > > > > > > > >>
> > > > > > > > > > > > Gotcha.
> > > > > > > > > > > >
> > > > > > > > > > > > The response code returned from proxy can tell if a
> > > failure
> > > > > can
> > > > > > > be
> > > > > > > > > > > retried
> > > > > > > > > > > > safely or not. (We might need to make them well
> > > documented)
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >>
> > > > > > > > > > > >> >
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > > 4) If A fails due to an error that could be a
> lost
> > > ack
> > > > > > > problem
> > > > > > > > > > > >> (network
> > > > > > > > > > > >> > > connectivity / etc), within a bounded time frame
> > it
> > > > > should
> > > > > > > be
> > > > > > > > > > > >> possible to
> > > > > > > > > > > >> > > find out if the write succeed or failed. Either
> by
> > > > > reading
> > > > > > > > from
> > > > > > > > > > some
> > > > > > > > > > > >> > > checkpoint of the log for the changes that
> should
> > > have
> > > > > > been
> > > > > > > > made
> > > > > > > > > > or
> > > > > > > > > > > >> some
> > > > > > > > > > > >> > > other possible server-side support.
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > If I understand this correctly, it is a
> duplication
> > > > issue,
> > > > > > > > right?
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > Can a de-duplication solution work here? Either DL
> > or
> > > > your
> > > > > > > > client
> > > > > > > > > > does
> > > > > > > > > > > >> the
> > > > > > > > > > > >> > de-duplication?
> > > > > > > > > > > >> >
> > > > > > > > > > > >>
> > > > > > > > > > > >> The requirements I'm mentioning are the ones needed
> > for
> > > > > > > > client-side
> > > > > > > > > > > >> dedupping. Since if I can guarantee writes being
> > exposed
> > > > > > within
> > > > > > > > some
> > > > > > > > > > > time
> > > > > > > > > > > >> frame, and I can never get into an inconsistently
> > > ordered
> > > > > > state
> > > > > > > > when
> > > > > > > > > > > >> successes happen, when an error occurs, I can always
> > > wait
> > > > > for
> > > > > > > max
> > > > > > > > > time
> > > > > > > > > > > >> frame, read the latest writes, and then dedup
> locally
> > > > > against
> > > > > > > the
> > > > > > > > > > > request
> > > > > > > > > > > >> I
> > > > > > > > > > > >> just made.
> > > > > > > > > > > >>
> > > > > > > > > > > >> The main thing about that timeframe is that its
> > > basically
> > > > > the
> > > > > > > > > addition
> > > > > > > > > > > of
> > > > > > > > > > > >> every timeout, all the way down in the system,
> > combined
> > > > with
> > > > > > > > > whatever
> > > > > > > > > > > >> flushing / caching / etc times are at the
> bookkeeper /
> > > > > client
> > > > > > > > level
> > > > > > > > > > for
> > > > > > > > > > > >> when values are exposed
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Gotcha.
> > > > > > > > > > > >
> > > > > > > > > > > >>
> > > > > > > > > > > >>
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > Is there any ways to identify your write?
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > I can think of a case as follow - I want to know
> > what
> > > is
> > > > > > your
> > > > > > > > > > expected
> > > > > > > > > > > >> > behavior from the log.
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > a)
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > If a hbase region server A writes a change of key
> K
> > to
> > > > the
> > > > > > > log,
> > > > > > > > > the
> > > > > > > > > > > >> change
> > > > > > > > > > > >> > is successfully made to the log;
> > > > > > > > > > > >> > but server A is down before receiving the change.
> > > > > > > > > > > >> > region server B took over the region that contains
> > K,
> > > > what
> > > > > > > will
> > > > > > > > B
> > > > > > > > > > do?
> > > > > > > > > > > >> >
> > > > > > > > > > > >>
> > > > > > > > > > > >> HBase writes in large chunks (WAL Logs), which its
> > > > > replication
> > > > > > > > > system
> > > > > > > > > > > then
> > > > > > > > > > > >> handles by replaying in the case of failure. If I'm
> > in a
> > > > > > middle
> > > > > > > > of a
> > > > > > > > > > > log,
> > > > > > > > > > > >> and the whole region goes down and gets rescheduled
> > > > > > elsewhere, I
> > > > > > > > > will
> > > > > > > > > > > >> start
> > > > > > > > > > > >> back up from the beginning of the log I was in the
> > > middle
> > > > > of.
> > > > > > > > Using
> > > > > > > > > > > >> checkpointing + deduping, we should be able to find
> > out
> > > > > where
> > > > > > we
> > > > > > > > > left
> > > > > > > > > > > off
> > > > > > > > > > > >> in the log.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >> >
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > b) same as a). but server A was just network
> > > > partitioned.
> > > > > > will
> > > > > > > > > both
> > > > > > > > > > A
> > > > > > > > > > > >> and B
> > > > > > > > > > > >> > write the change of key K?
> > > > > > > > > > > >> >
> > > > > > > > > > > >>
> > > > > > > > > > > >> HBase gives us some guarantees around network
> > partitions
> > > > > > > > > (Consistency
> > > > > > > > > > > over
> > > > > > > > > > > >> availability for HBase). HBase is a single-master
> > > failover
> > > > > > > > recovery
> > > > > > > > > > type
> > > > > > > > > > > >> of
> > > > > > > > > > > >> system, with zookeeper-based guarantees for single
> > > owners
> > > > > > > > (writers)
> > > > > > > > > > of a
> > > > > > > > > > > >> range of data.
> > > > > > > > > > > >>
> > > > > > > > > > > >> >
> > > > > > > > > > > >> >
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > > 5) If A is turned into multiple batches (one
> large
> > > > > request
> > > > > > > > gets
> > > > > > > > > > > split
> > > > > > > > > > > >> > into
> > > > > > > > > > > >> > > multiple smaller ones to the bookkeeper backend,
> > due
> > > > to
> > > > > > log
> > > > > > > > > > rolling
> > > > > > > > > > > /
> > > > > > > > > > > >> > size
> > > > > > > > > > > >> > > / etc):
> > > > > > > > > > > >> > >   a) The ordering of entries within batches have
> > > > > ordering
> > > > > > > > > > > consistence
> > > > > > > > > > > >> > with
> > > > > > > > > > > >> > > the original request, when exposed in the log
> > > (though
> > > > > they
> > > > > > > may
> > > > > > > > > be
> > > > > > > > > > > >> > > interleaved with other requests)
> > > > > > > > > > > >> > >   b) The ordering across batches have ordering
> > > > > consistence
> > > > > > > > with
> > > > > > > > > > the
> > > > > > > > > > > >> > > original request, when exposed in the log
> (though
> > > they
> > > > > may
> > > > > > > be
> > > > > > > > > > > >> interleaved
> > > > > > > > > > > >> > > with other requests)
> > > > > > > > > > > >> > >   c) If a batch fails, and cannot be retried /
> is
> > > > > > > > unsuccessfully
> > > > > > > > > > > >> retried,
> > > > > > > > > > > >> > > all batches after the failed batch should not be
> > > > exposed
> > > > > > in
> > > > > > > > the
> > > > > > > > > > log.
> > > > > > > > > > > >> > Note:
> > > > > > > > > > > >> > > The batches before and including the failed
> batch,
> > > > that
> > > > > > > ended
> > > > > > > > up
> > > > > > > > > > > >> > > succeeding, can show up in the log, again within
> > > some
> > > > > > > bounded
> > > > > > > > > time
> > > > > > > > > > > >> range
> > > > > > > > > > > >> > > for reads by a client.
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > There is a method 'writeBulk' in
> > DistributedLogClient
> > > > can
> > > > > > > > achieve
> > > > > > > > > > this
> > > > > > > > > > > >> > guarantee.
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > However, I am not very sure about how will you
> turn
> > A
> > > > into
> > > > > > > > > batches.
> > > > > > > > > > If
> > > > > > > > > > > >> you
> > > > > > > > > > > >> > are dividing A into batches,
> > > > > > > > > > > >> > you can simply control the application write
> > sequence
> > > to
> > > > > > > achieve
> > > > > > > > > the
> > > > > > > > > > > >> > guarantee here.
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > Can you explain more about this?
> > > > > > > > > > > >> >
> > > > > > > > > > > >>
> > > > > > > > > > > >> In this case, by batches I mean what the proxy does
> > with
> > > > the
> > > > > > > > single
> > > > > > > > > > > >> request
> > > > > > > > > > > >> that I send it. If the proxy decides it needs to
> turn
> > my
> > > > > > single
> > > > > > > > > > request
> > > > > > > > > > > >> into multiple batches of requests, due to log
> rolling,
> > > > size
> > > > > > > > > > limitations,
> > > > > > > > > > > >> etc, those would be the guarantees I need to be able
> > to
> > > > > > > > reduplicate
> > > > > > > > > on
> > > > > > > > > > > the
> > > > > > > > > > > >> client side.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > A single record written by #write and A record set
> (set
> > > of
> > > > > > > records)
> > > > > > > > > > > > written by #writeRecordSet are atomic - they will not
> > be
> > > > > broken
> > > > > > > > down
> > > > > > > > > > into
> > > > > > > > > > > > entries (batches). With the correct response code,
> you
> > > > would
> > > > > be
> > > > > > > > able
> > > > > > > > > to
> > > > > > > > > > > > tell if it is a lost-ack failure or not. However
> there
> > > is a
> > > > > > size
> > > > > > > > > > > limitation
> > > > > > > > > > > > for this - it can't not go beyond 1MB for current
> > > > > > implementation.
> > > > > > > > > > > >
> > > > > > > > > > > > What is your expected record size?
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >>
> > > > > > > > > > > >> >
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > > Since we can guarantee per-key ordering on the
> > > client
> > > > > > side,
> > > > > > > we
> > > > > > > > > > > >> guarantee
> > > > > > > > > > > >> > > that there is a single writer per-key, just not
> > per
> > > > log.
> > > > > > > > > > > >> >
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > Do you need fencing guarantee in the case of
> network
> > > > > > partition
> > > > > > > > > > causing
> > > > > > > > > > > >> > two-writers?
> > > > > > > > > > > >> >
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > > So if there was a
> > > > > > > > > > > >> > > way to guarantee a single write request as being
> > > > written
> > > > > > or
> > > > > > > > not,
> > > > > > > > > > > >> within a
> > > > > > > > > > > >> > > certain time frame (since failures should be
> rare
> > > > > anyways,
> > > > > > > > this
> > > > > > > > > is
> > > > > > > > > > > >> fine
> > > > > > > > > > > >> > if
> > > > > > > > > > > >> > > it is expensive), we can then have the client
> > > > guarantee
> > > > > > the
> > > > > > > > > > ordering
> > > > > > > > > > > >> it
> > > > > > > > > > > >> > > needs.
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > This sounds an 'exact-once' write (regarding
> > retries)
> > > > > > > > requirement
> > > > > > > > > to
> > > > > > > > > > > me,
> > > > > > > > > > > >> > right?
> > > > > > > > > > > >> >
> > > > > > > > > > > >> Yes. I'm curious of how this issue is handled by
> > > > Manhattan,
> > > > > > > since
> > > > > > > > > you
> > > > > > > > > > > can
> > > > > > > > > > > >> imagine a data store that ends up getting multiple
> > > writes
> > > > > for
> > > > > > > the
> > > > > > > > > same
> > > > > > > > > > > put
> > > > > > > > > > > >> / get / etc, would be harder to use, and we are
> > > basically
> > > > > > trying
> > > > > > > > to
> > > > > > > > > > > create
> > > > > > > > > > > >> a log like that for HBase.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Are you guys replacing HBase WAL?
> > > > > > > > > > > >
> > > > > > > > > > > > In Manhattan case, the request will be first written
> to
> > > DL
> > > > > > > streams
> > > > > > > > by
> > > > > > > > > > > > Manhattan coordinator. The Manhattan replica then
> will
> > > read
> > > > > > from
> > > > > > > > the
> > > > > > > > > DL
> > > > > > > > > > > > streams and apply the change. In the lost-ack case,
> the
> > > MH
> > > > > > > > > coordinator
> > > > > > > > > > > will
> > > > > > > > > > > > just fail the request to client.
> > > > > > > > > > > >
> > > > > > > > > > > > My feeling here is your usage for HBase is a bit
> > > different
> > > > > from
> > > > > > > how
> > > > > > > > > we
> > > > > > > > > > > use
> > > > > > > > > > > > DL in Manhattan. It sounds like you read from a
> source
> > > > (HBase
> > > > > > > WAL)
> > > > > > > > > and
> > > > > > > > > > > > write to DL. But I might be wrong.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >>
> > > > > > > > > > > >> >
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > > > Cameron:
> > > > > > > > > > > >> > > > Another thing we've discussed but haven't
> really
> > > > > thought
> > > > > > > > > > through -
> > > > > > > > > > > >> > > > We might be able to support some kind of epoch
> > > write
> > > > > > > > request,
> > > > > > > > > > > where
> > > > > > > > > > > >> the
> > > > > > > > > > > >> > > > epoch is guaranteed to have changed if the
> > writer
> > > > has
> > > > > > > > changed
> > > > > > > > > or
> > > > > > > > > > > the
> > > > > > > > > > > >> > > ledger
> > > > > > > > > > > >> > > > was ever fenced off. Writes include an epoch
> and
> > > are
> > > > > > > > rejected
> > > > > > > > > if
> > > > > > > > > > > the
> > > > > > > > > > > >> > > epoch
> > > > > > > > > > > >> > > > has changed.
> > > > > > > > > > > >> > > > With a mechanism like this, fencing the ledger
> > off
> > > > > > after a
> > > > > > > > > > failure
> > > > > > > > > > > >> > would
> > > > > > > > > > > >> > > > ensure any pending writes had either been
> > written
> > > or
> > > > > > would
> > > > > > > > be
> > > > > > > > > > > >> rejected.
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > > The issue would be how I guarantee the write I
> > wrote
> > > > to
> > > > > > the
> > > > > > > > > server
> > > > > > > > > > > was
> > > > > > > > > > > >> > > written. Since a network issue could happen on
> the
> > > > send
> > > > > of
> > > > > > > the
> > > > > > > > > > > >> request,
> > > > > > > > > > > >> > or
> > > > > > > > > > > >> > > on the receive of the success response, an epoch
> > > > > wouldn't
> > > > > > > tell
> > > > > > > > > me
> > > > > > > > > > > if I
> > > > > > > > > > > >> > can
> > > > > > > > > > > >> > > successfully retry, as it could be successfully
> > > > written
> > > > > > but
> > > > > > > > AWS
> > > > > > > > > > > >> dropped
> > > > > > > > > > > >> > the
> > > > > > > > > > > >> > > connection for the success response. Since the
> > epoch
> > > > > would
> > > > > > > be
> > > > > > > > > the
> > > > > > > > > > > same
> > > > > > > > > > > >> > > (same ledger), I could write duplicates.
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > > > We are currently proposing adding a
> transaction
> > > > > semantic
> > > > > > > to
> > > > > > > > dl
> > > > > > > > > > to
> > > > > > > > > > > >> get
> > > > > > > > > > > >> > rid
> > > > > > > > > > > >> > > > of the size limitation and the unaware-ness in
> > the
> > > > > proxy
> > > > > > > > > client.
> > > > > > > > > > > >> Here
> > > > > > > > > > > >> > is
> > > > > > > > > > > >> > > > our idea -
> > > > > > > > > > > >> > > > http://mail-archives.apache.
> > > org/mod_mbox/incubator-
> > > > > > > > > > distributedlog
> > > > > > > > > > > >> > > -dev/201609.mbox/%3cCAAC6BxP5YyEHwG0ZCF5soh42X=
> > > xuYwYm
> > > > > > > > > > > >> > > <http://mail-archives.apache.
> > > org/mod_mbox/incubator-
> > > > > > > > > > > >> > distributedlog%0A-dev/201609.mbox/%
> > > > > > 3cCAAC6BxP5YyEHwG0ZCF5soh
> > > > > > > > > > > 42X=xuYwYm>
> > > > > > > > > > > >> > > L4nXsYBYiofzxpVk6g@mail.gmail.com%3e
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > > > I am not sure if your idea is similar as ours.
> > but
> > > > > we'd
> > > > > > > like
> > > > > > > > > to
> > > > > > > > > > > >> > > collaborate
> > > > > > > > > > > >> > > > with the community if anyone has the similar
> > idea.
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > > Our use case would be covered by transaction
> > > support,
> > > > > but
> > > > > > > I'm
> > > > > > > > > > unsure
> > > > > > > > > > > >> if
> > > > > > > > > > > >> > we
> > > > > > > > > > > >> > > would need something that heavy weight for the
> > > > > guarantees
> > > > > > we
> > > > > > > > > need.
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > > Basically, the high level requirement here is
> > > "Support
> > > > > > > > > consistent
> > > > > > > > > > > >> write
> > > > > > > > > > > >> > > ordering for single-writer-per-key,
> > > > > multi-writer-per-log".
> > > > > > > My
> > > > > > > > > > hunch
> > > > > > > > > > > is
> > > > > > > > > > > >> > > that, with some added guarantees to the proxy
> (if
> > it
> > > > > isn't
> > > > > > > > > already
> > > > > > > > > > > >> > > supported), and some custom client code on our
> > side
> > > > for
> > > > > > > > removing
> > > > > > > > > > the
> > > > > > > > > > > >> > > entries that actually succeed to write to
> > > > DistributedLog
> > > > > > > from
> > > > > > > > > the
> > > > > > > > > > > >> request
> > > > > > > > > > > >> > > that failed, it should be a relatively easy
> thing
> > to
> > > > > > > support.
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > Yup. I think it should not be very difficult to
> > > support.
> > > > > > There
> > > > > > > > > might
> > > > > > > > > > > be
> > > > > > > > > > > >> > some changes in the server side.
> > > > > > > > > > > >> > Let's figure out what will the changes be. Are you
> > > guys
> > > > > > > > interested
> > > > > > > > > > in
> > > > > > > > > > > >> > contributing?
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > Yes, we would be.
> > > > > > > > > > > >>
> > > > > > > > > > > >> As a note, the one thing that we see as an issue
> with
> > > the
> > > > > > client
> > > > > > > > > side
> > > > > > > > > > > >> dedupping is how to bound the range of data that
> needs
> > > to
> > > > be
> > > > > > > > looked
> > > > > > > > > at
> > > > > > > > > > > for
> > > > > > > > > > > >> deduplication. As you can imagine, it is pretty easy
> > to
> > > > > bound
> > > > > > > the
> > > > > > > > > > bottom
> > > > > > > > > > > >> of
> > > > > > > > > > > >> the range, as that it just regular checkpointing of
> > the
> > > > DSLN
> > > > > > > that
> > > > > > > > is
> > > > > > > > > > > >> returned. I'm still not sure if there is any nice
> way
> > to
> > > > > time
> > > > > > > > bound
> > > > > > > > > > the
> > > > > > > > > > > >> top
> > > > > > > > > > > >> end of the range, especially since the proxy owns
> > > sequence
> > > > > > > numbers
> > > > > > > > > > > (which
> > > > > > > > > > > >> makes sense). I am curious if there is more that can
> > be
> > > > done
> > > > > > if
> > > > > > > > > > > >> deduplication is on the server side. However the
> main
> > > > minus
> > > > > I
> > > > > > > see
> > > > > > > > of
> > > > > > > > > > > >> server
> > > > > > > > > > > >> side deduplication is that instead of running
> > contingent
> > > > on
> > > > > > > there
> > > > > > > > > > being
> > > > > > > > > > > a
> > > > > > > > > > > >> failed client request, instead it would have to run
> > > every
> > > > > > time a
> > > > > > > > > write
> > > > > > > > > > > >> happens.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > For a reliable dedup, we probably need
> > > > fence-then-getLastDLSN
> > > > > > > > > > operation -
> > > > > > > > > > > > so it would guarantee that any non-completed requests
> > > > issued
> > > > > > > > > (lost-ack
> > > > > > > > > > > > requests) before this fence-then-getLastDLSN
> operation
> > > will
> > > > > be
> > > > > > > > failed
> > > > > > > > > > and
> > > > > > > > > > > > they will never land at the log.
> > > > > > > > > > > >
> > > > > > > > > > > > the pseudo code would look like below -
> > > > > > > > > > > >
> > > > > > > > > > > > write(request) onFailure { t =>
> > > > > > > > > > > >
> > > > > > > > > > > > if (t is timeout exception) {
> > > > > > > > > > > >
> > > > > > > > > > > > DLSN lastDLSN = fenceThenGetLastDLSN()
> > > > > > > > > > > > DLSN lastCheckpointedDLSN = ...;
> > > > > > > > > > > > // find if the request lands between [lastDLSN,
> > > > > > > > > lastCheckpointedDLSN].
> > > > > > > > > > > > // if it exists, the write succeed; otherwise retry.
> > > > > > > > > > > >
> > > > > > > > > > > > }
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > }
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Just realized the idea is same as what Leigh raised in
> > the
> > > > > > previous
> > > > > > > > > email
> > > > > > > > > > > about 'epoch write'. Let me explain more about this
> idea
> > > > > (Leigh,
> > > > > > > feel
> > > > > > > > > > free
> > > > > > > > > > > to jump in to fill up your idea).
> > > > > > > > > > >
> > > > > > > > > > > - when a log stream is owned,  the proxy use the last
> > > > > transaction
> > > > > > > id
> > > > > > > > as
> > > > > > > > > > the
> > > > > > > > > > > epoch
> > > > > > > > > > > - when a client connects (handshake with the proxy), it
> > > will
> > > > > get
> > > > > > > the
> > > > > > > > > > epoch
> > > > > > > > > > > for the stream.
> > > > > > > > > > > - the writes issued by this client will carry the epoch
> > to
> > > > the
> > > > > > > proxy.
> > > > > > > > > > > - add a new rpc - fenceThenGetLastDLSN - it would force
> > the
> > > > > proxy
> > > > > > > to
> > > > > > > > > bump
> > > > > > > > > > > the epoch.
> > > > > > > > > > > - if fenceThenGetLastDLSN happened, all the outstanding
> > > > writes
> > > > > > with
> > > > > > > > old
> > > > > > > > > > > epoch will be rejected with exceptions (e.g.
> > EpochFenced).
> > > > > > > > > > > - The DLSN returned from fenceThenGetLastDLSN can be
> used
> > > as
> > > > > the
> > > > > > > > bound
> > > > > > > > > > for
> > > > > > > > > > > deduplications on failures.
> > > > > > > > > > >
> > > > > > > > > > > Cameron, does this sound a solution to your use case?
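
To make the epoch idea sketched above concrete, here is a minimal client-side
sketch of how the epoch-fenced deduplication could work. It is only an
illustration of the proposal in this thread: fenceThenGetLastDLSN(),
isLostAckFailure() and findRecordInRange() are hypothetical names, not part of
the current DistributedLog client API.

    // Sketch only, assuming a hypothetical fenceThenGetLastDLSN() RPC that bumps
    // the epoch and returns the last DLSN, plus a reader-side helper
    // findRecordInRange() that scans [lastCheckpointedDLSN, lastDLSN] for a
    // record matching the payload.
    DLSN writeWithDedup(DistributedLogClient client, String stream,
                        ByteBuffer payload, DLSN lastCheckpointedDLSN) throws Exception {
        try {
            return Await.result(client.write(stream, payload));
        } catch (Exception cause) {
            if (!isLostAckFailure(cause)) {
                throw cause;              // definite failure, caller may retry immediately
            }
            // Fence first: outstanding writes carrying the old epoch can no longer land.
            DLSN lastDLSN = fenceThenGetLastDLSN(client, stream);
            DLSN found = findRecordInRange(stream, lastCheckpointedDLSN, lastDLSN, payload);
            if (found != null) {
                return found;             // the "failed" write actually landed
            }
            return writeWithDedup(client, stream, payload, lastCheckpointedDLSN);
        }
    }

The point of the fence call is that it turns the "wait for the maximum timeout"
window into a hard upper bound for the dedup scan.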
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >>
> > > > > > > > > > > >> Maybe something that could fit a similar need that
> > Kafka
> > > > > does
> > > > > > > (the
> > > > > > > > > > last
> > > > > > > > > > > >> store value for a particular key in a log), such
> that
> > > on a
> > > > > per
> > > > > > > key
> > > > > > > > > > basis
> > > > > > > > > > > >> there could be a sequence number that support
> > > > deduplication?
> > > > > > > Cost
> > > > > > > > > > seems
> > > > > > > > > > > >> like it would be high however, and I'm not even sure
> > if
> > > > > > > bookkeeper
> > > > > > > > > > > >> supports
> > > > > > > > > > > >> it.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >> Cheers,
> > > > > > > > > > > >> Cameron
> > > > > > > > > > > >>
> > > > > > > > > > > >> >
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > > Thanks,
> > > > > > > > > > > >> > > Cameron
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > > On Sat, Oct 8, 2016 at 7:35 AM, Leigh Stewart
> > > > > > > > > > > >> > <lstewart@twitter.com.invalid
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > wrote:
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> > > > Cameron:
> > > > > > > > > > > >> > > > Another thing we've discussed but haven't
> really
> > > > > thought
> > > > > > > > > > through -
> > > > > > > > > > > >> > > > We might be able to support some kind of epoch
> > > write
> > > > > > > > request,
> > > > > > > > > > > where
> > > > > > > > > > > >> the
> > > > > > > > > > > >> > > > epoch is guaranteed to have changed if the
> > writer
> > > > has
> > > > > > > > changed
> > > > > > > > > or
> > > > > > > > > > > the
> > > > > > > > > > > >> > > ledger
> > > > > > > > > > > >> > > > was ever fenced off. Writes include an epoch
> and
> > > are
> > > > > > > > rejected
> > > > > > > > > if
> > > > > > > > > > > the
> > > > > > > > > > > >> > > epoch
> > > > > > > > > > > >> > > > has changed.
> > > > > > > > > > > >> > > > With a mechanism like this, fencing the ledger
> > off
> > > > > > after a
> > > > > > > > > > failure
> > > > > > > > > > > >> > would
> > > > > > > > > > > >> > > > ensure any pending writes had either been
> > written
> > > or
> > > > > > would
> > > > > > > > be
> > > > > > > > > > > >> rejected.
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > > On Sat, Oct 8, 2016 at 7:10 AM, Sijie Guo <
> > > > > > > sijie@apache.org
> > > > > > > > >
> > > > > > > > > > > wrote:
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > > > > Cameron,
> > > > > > > > > > > >> > > > >
> > > > > > > > > > > >> > > > > I think both Leigh and Xi had made a few
> good
> > > > points
> > > > > > > about
> > > > > > > > > > your
> > > > > > > > > > > >> > > question.
> > > > > > > > > > > >> > > > >
> > > > > > > > > > > >> > > > > To add one more point to your question -
> "but
> > I
> > > am
> > > > > not
> > > > > > > > > > > >> > > > > 100% of how all of the futures in the code
> > > handle
> > > > > > > > failures.
> > > > > > > > > > > >> > > > > If not, where in the code would be the
> > relevant
> > > > > places
> > > > > > > to
> > > > > > > > > add
> > > > > > > > > > > the
> > > > > > > > > > > >> > > ability
> > > > > > > > > > > >> > > > > to do this, and would the project be
> > interested
> > > > in a
> > > > > > > pull
> > > > > > > > > > > >> request?"
> > > > > > > > > > > >> > > > >
> > > > > > > > > > > >> > > > > The current proxy and client logic doesn't
> do
> > > > > > perfectly
> > > > > > > on
> > > > > > > > > > > >> handling
> > > > > > > > > > > >> > > > > failures (duplicates) - the strategy now is
> > the
> > > > > client
> > > > > > > > will
> > > > > > > > > > > retry
> > > > > > > > > > > >> as
> > > > > > > > > > > >> > > best
> > > > > > > > > > > >> > > > > at it can before throwing exceptions to
> users.
> > > The
> > > > > > code
> > > > > > > > you
> > > > > > > > > > are
> > > > > > > > > > > >> > looking
> > > > > > > > > > > >> > > > for
> > > > > > > > > > > >> > > > > - it is on BKLogSegmentWriter for the proxy
> > > > handling
> > > > > > > > writes
> > > > > > > > > > and
> > > > > > > > > > > >> it is
> > > > > > > > > > > >> > > on
> > > > > > > > > > > >> > > > > DistributedLogClientImpl for the proxy
> client
> > > > > handling
> > > > > > > > > > responses
> > > > > > > > > > > >> from
> > > > > > > > > > > >> > > > > proxies. Does this help you?
> > > > > > > > > > > >> > > > >
> > > > > > > > > > > >> > > > > And also, you are welcome to contribute the
> > pull
> > > > > > > requests.
> > > > > > > > > > > >> > > > >
> > > > > > > > > > > >> > > > > - Sijie
> > > > > > > > > > > >> > > > >
> > > > > > > > > > > >> > > > >
> > > > > > > > > > > >> > > > >
> > > > > > > > > > > >> > > > > On Tue, Oct 4, 2016 at 3:39 PM, Cameron
> > > Hatfield <
> > > > > > > > > > > >> kinguy@gmail.com>
> > > > > > > > > > > >> > > > wrote:
> > > > > > > > > > > >> > > > >
> > > > > > > > > > > >> > > > > > I have a question about the Proxy Client.
> > > > > Basically,
> > > > > > > for
> > > > > > > > > our
> > > > > > > > > > > use
> > > > > > > > > > > >> > > cases,
> > > > > > > > > > > >> > > > > we
> > > > > > > > > > > >> > > > > > want to guarantee ordering at the key
> level,
> > > > > > > > irrespective
> > > > > > > > > of
> > > > > > > > > > > the
> > > > > > > > > > > >> > > > ordering
> > > > > > > > > > > >> > > > > > of the partition it may be assigned to as
> a
> > > > whole.
> > > > > > Due
> > > > > > > > to
> > > > > > > > > > the
> > > > > > > > > > > >> > source
> > > > > > > > > > > >> > > of
> > > > > > > > > > > >> > > > > the
> > > > > > > > > > > >> > > > > > data (HBase Replication), we cannot
> > guarantee
> > > > > that a
> > > > > > > > > single
> > > > > > > > > > > >> > partition
> > > > > > > > > > > >> > > > > will
> > > > > > > > > > > >> > > > > > be owned for writes by the same client.
> This
> > > > means
> > > > > > the
> > > > > > > > > proxy
> > > > > > > > > > > >> client
> > > > > > > > > > > >> > > > works
> > > > > > > > > > > >> > > > > > well (since we don't care which proxy owns
> > the
> > > > > > > partition
> > > > > > > > > we
> > > > > > > > > > > are
> > > > > > > > > > > >> > > writing
> > > > > > > > > > > >> > > > > > to).
> > > > > > > > > > > >> > > > > >
> > > > > > > > > > > >> > > > > >
> > > > > > > > > > > >> > > > > > However, the guarantees we need when
> > writing a
> > > > > batch
> > > > > > > > > > consists
> > > > > > > > > > > >> of:
> > > > > > > > > > > >> > > > > > Definition of a Batch: The set of records
> > sent
> > > > to
> > > > > > the
> > > > > > > > > > > writeBatch
> > > > > > > > > > > >> > > > endpoint
> > > > > > > > > > > >> > > > > > on the proxy
> > > > > > > > > > > >> > > > > >
> > > > > > > > > > > >> > > > > > 1. Batch success: If the client receives a
> > > > success
> > > > > > > from
> > > > > > > > > the
> > > > > > > > > > > >> proxy,
> > > > > > > > > > > >> > > then
> > > > > > > > > > > >> > > > > > that batch is successfully written
> > > > > > > > > > > >> > > > > >
> > > > > > > > > > > >> > > > > > 2. Inter-Batch ordering : Once a batch has
> > > been
> > > > > > > written
> > > > > > > > > > > >> > successfully
> > > > > > > > > > > >> > > by
> > > > > > > > > > > >> > > > > the
> > > > > > > > > > > >> > > > > > client, when another batch is written, it
> > will
> > > > be
> > > > > > > > > guaranteed
> > > > > > > > > > > to
> > > > > > > > > > > >> be
> > > > > > > > > > > >> > > > > ordered
> > > > > > > > > > > >> > > > > > after the last batch (if it is the same
> > > stream).
> > > > > > > > > > > >> > > > > >
> > > > > > > > > > > >> > > > > > 3. Intra-Batch ordering: Within a batch of
> > > > writes,
> > > > > > the
> > > > > > > > > > records
> > > > > > > > > > > >> will
> > > > > > > > > > > >> > > be
> > > > > > > > > > > >> > > > > > committed in order
> > > > > > > > > > > >> > > > > >
> > > > > > > > > > > >> > > > > > 4. Intra-Batch failure ordering: If an
> > > > individual
> > > > > > > record
> > > > > > > > > > fails
> > > > > > > > > > > >> to
> > > > > > > > > > > >> > > write
> > > > > > > > > > > >> > > > > > within a batch, all records after that
> > record
> > > > will
> > > > > > not
> > > > > > > > be
> > > > > > > > > > > >> written.
> > > > > > > > > > > >> > > > > >
> > > > > > > > > > > >> > > > > > 5. Batch Commit: Guarantee that if a batch
> > > > > returns a
> > > > > > > > > > success,
> > > > > > > > > > > it
> > > > > > > > > > > >> > will
> > > > > > > > > > > >> > > > be
> > > > > > > > > > > >> > > > > > written
> > > > > > > > > > > >> > > > > >
> > > > > > > > > > > >> > > > > > 6. Read-after-write: Once a batch is
> > > committed,
> > > > > > > within a
> > > > > > > > > > > limited
> > > > > > > > > > > >> > > > > time-frame
> > > > > > > > > > > >> > > > > > it will be able to be read. This is
> required
> > > in
> > > > > the
> > > > > > > case
> > > > > > > > > of
> > > > > > > > > > > >> > failure,
> > > > > > > > > > > >> > > so
> > > > > > > > > > > >> > > > > > that the client can see what actually got
> > > > > > committed. I
> > > > > > > > > > believe
> > > > > > > > > > > >> the
> > > > > > > > > > > >> > > > > > time-frame part could be removed if the
> > client
> > > > can
> > > > > > > send
> > > > > > > > in
> > > > > > > > > > the
> > > > > > > > > > > >> same
> > > > > > > > > > > >> > > > > > sequence number that was written
> previously,
> > > > since
> > > > > > it
> > > > > > > > > would
> > > > > > > > > > > then
> > > > > > > > > > > >> > fail
> > > > > > > > > > > >> > > > and
> > > > > > > > > > > >> > > > > > we would know that a read needs to occur.
> > > > > > > > > > > >> > > > > >
> > > > > > > > > > > >> > > > > >
> > > > > > > > > > > >> > > > > > So, my basic question is if this is
> > currently
> > > > > > possible
> > > > > > > > in
> > > > > > > > > > the
> > > > > > > > > > > >> > proxy?
> > > > > > > > > > > >> > > I
> > > > > > > > > > > >> > > > > > don't believe it gives these guarantees as
> > it
> > > > > stands
> > > > > > > > > today,
> > > > > > > > > > > but
> > > > > > > > > > > >> I
> > > > > > > > > > > >> > am
> > > > > > > > > > > >> > > > not
> > > > > > > > > > > >> > > > > > 100% of how all of the futures in the code
> > > > handle
> > > > > > > > > failures.
> > > > > > > > > > > >> > > > > > If not, where in the code would be the
> > > relevant
> > > > > > places
> > > > > > > > to
> > > > > > > > > > add
> > > > > > > > > > > >> the
> > > > > > > > > > > >> > > > ability
> > > > > > > > > > > >> > > > > > to do this, and would the project be
> > > interested
> > > > > in a
> > > > > > > > pull
> > > > > > > > > > > >> request?
> > > > > > > > > > > >> > > > > >
> > > > > > > > > > > >> > > > > >
> > > > > > > > > > > >> > > > > > Thanks,
> > > > > > > > > > > >> > > > > > Cameron
> > > > > > > > > > > >> > > > > >
> > > > > > > > > > > >> > > > >
> > > > > > > > > > > >> > > >
> > > > > > > > > > > >> > >
> > > > > > > > > > > >> >
> > > > > > > > > > > >>
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>



-- 
- jderrick

Re: Proxy Client - Batch Ordering / Commit

Posted by Sijie Guo <si...@apache.org>.
On Thu, Nov 17, 2016 at 2:30 AM, Xi Liu <xi...@gmail.com> wrote:

> Cameron,
>
> Thank you for your comments. It's very helpful. My replies are inline.
>
> On Wed, Nov 16, 2016 at 11:59 AM, Cameron Hatfield <ki...@gmail.com>
> wrote:
>
> > "A couple of questions" is what I originally wrote, and then the
> following
> > happened. Sorry about the large swath of them, making sure my
> understanding
> > of the code base, as well as the DL/Bookkeeper/ZK ecosystem interaction,
> > makes sense.
> >
> > ==General:
> > What is an exclusive session? What is it providing over a regular
> session?
> >
>
> The idea here is to provide exclusive writer semantics for the
> distributedlog (thin) client use case,
> so it can have behavior similar to using the storage (distributedlog-core)
> library directly.
>

+1 for an exclusive writer feature. Leigh and I were talking about this
before, and it is good to see it happening now. However, it might be worth
splitting the exclusive writer feature into a separate task once we have the
'fencing' feature available.


>
>
> >
> >
> > ==Proxy side:
> > Should a new streamop be added for the fencing operation, or does it make
> > sense to piggyback on an existing one (such as write)?
>
>
> > ====getLastDLSN:
> > What should be the return result for:
> > A new stream
> > A new session, after successful fencing
> > A new session, after a change in ownership / first starting up
> >
> > What is the main use case for getLastDLSN(<stream>, false)? Is this to
> > guarantee that the recovery process has happened in case of ownership
> > failure (I don't have a good understanding of what causes the recovery
> > process to happen, especially from the reader side)? Or is it to handle
> the
> > lost-ack problem? Since all the rest of the read related things go
> through
> > the read client, I'm not sure if I see the use case, but it seems like
> > there would be a large potential for confusion on which to use. What
> about
> > just a fenceSession op, that always fences, returning the DLSN of the
> > fence, and leave the normal getLastDLSN for the regular read client.
> >
>
>
> Ah, your point is valid. I was following the style of bookkeeper. The
> bookkeeper client supplies a fencing flag on readLastAddConfirmed request
> during recovery.
>
> But you are right. It is cleaner to have just a fenceSessionOp and return
> the DLSN of the fence request.
>


Correct. In the bookkeeper client, the fencing flag is set with a read op.
However, it is part of the recovery procedure and internal to the client. I
agree with Cameron that we should hide the details from the public API.
Otherwise, it will cause confusion.


>
>
>
> >
> > ====Fencing:
> > When a fence session occurs, what call needs to be made to make sure any
> > outstanding writes are flushed and committed (so that we guarantee the
> > client will be able to read anything that was in the write queue)?
> > Is there a guaranteed ordering for things written in the future queue for
> > AsyncLogWriter (I'm not quite confident that I was able to accurately
> > follow the logic, as their are many parts of the code that write, have
> > queues, heartbeat, etc)?
> >
>
> I believe that when calling AsyncLogWriter#asyncWrite in order, the
> records will be written in order. Sijie or Leigh can confirm that.
>

That's right. The writes are guaranteed to be written in the order in which
they are issued.


>
> Since the writes are in order, when a fence session occurs, a control
> record written successfully by the writer guarantees that the writes issued
> before the fence session are flushed and committed.
>
> We need to invalidate the session when we fence the session. So any writes
> carrying the old session that arrive after the control record is written
> will be rejected. In this way, it guarantees the client will be able to
> read anything in a consistent way.
>
>
> >
> > ====SessionID:
> > What is the default sessionid / transactionid for a new stream? I assume
> > this would just be the first control record
> >
>
> The initial session id will be the transaction id of the first control
> record.
>
>
> >
> > ======Should all streams have a sessionid by default, regardless if it is
> > never used by a client (aka, everytime ownership changes, a new control
> > record is generated, and a sessionid is stored)?
> > Main edge case that would have to be handled is if a client writes with
> an
> > old sessionid, but the owner has changed and has yet to create a
> sessionid.
> > This should be handled by the "non-matching sessionid" rule, since the
> > invalid sessionid wouldn't match the passed sessionid, which should cause
> > the client to get a new sessionid.
> >
>
> I think all streams should just have a session id by default. The session
> id is changed when ownership is changed or it is explicitly bumped by a
> fence session op.
>
>
> >
> > ======Where in the code does it make sense to own the session, the stream
> > interfaces / classes? Should they pass that information down to the ops,
> or
> > do the sessionid check within?
> > My first thought would be Stream owns the sessionid, passes it into the
> ops
> > (as either a nullable value, or an invalid default value), which then do
> > the sessionid check if they care. The main issue is updating the
> sessionid
> > is a bit backwards, as either every op has the ability to update it
> through
> > some type of return value / direct stream access / etc, or there is a
> > special case in the stream for the fence operation / any other operation
> > that can update the session.
> >
>
> My thought is to add a session id to Stream (StreamImpl.java). The stream
> validates the session id before submitting a stream op. If it is a fence
> session op, it would just invalidate the current session, so subsequent
> requests with the old session will be rejected.
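
A rough sketch of how that per-stream validation might look on the proxy side;
the session field, FenceSessionOp and the helper names below are illustrative
only, they are not the current StreamImpl code.

    // Illustrative only: session handling does not exist in StreamImpl today.
    private volatile long sessionId;   // bumped on ownership change or fence

    Future<WriteResponse> submitOp(StreamOp op, long clientSessionId) {
        if (clientSessionId != sessionId) {
            // Reject without attempting the write; the new session id can be
            // piggybacked on the response so the client can retry safely.
            return Future.value(sessionFencedResponse(sessionId));
        }
        if (op instanceof FenceSessionOp) {
            // Invalidate the current session; later requests that still carry
            // the old session id will fail the check above.
            sessionId = nextSessionId();
        }
        return submit(op);
    }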
>

>
> >
> > ======For "the owner of the log stream will first advance the transaction
> > id generator to claim a new transaction id and write a control record to
> > the log stream. ":
> > Should "DistributedLogConstants.CONTROL_RECORD_CONTENT" be the type of
> > control record written?
> > Should the "writeControlRecord" on the BKAsyncLogWriter be exposed in the
> > AsyncLogWriter interface be exposed?  Or even in the one within the
> segment
> > writer? Or should the code be duplicated / pulled out into a helper /
> etc?
> > (Not a big Java person, so any suggestions on the "Java Way", or at least
> > the DL way, to do it would be appreciated)
> >
>
> I believe we can just construct a log record and set it to be a control
> record and write it.
>
> LogRecord record = ...
> record.setControl();
> writer.write(record);
>
> (Can anyone from community confirm that it is okay to write a control
> record in this way?)
>

Ideally we would like to hide this logic from public usage. However, since
it is a code change on the proxy side, it is absolutely fine.
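
For completeness, a hedged sketch of the whole fence-session step at the proxy,
building on the snippet above. The helper shape is an assumption, and
LogRecord#setControl() may not be publicly visible in the current code, so a
small package-level helper might be needed.

    // Sketch only: advance the transaction id, write a control record, and use
    // its DLSN as both the new session marker and the dedup upper bound.
    Future<DLSN> fenceSession(AsyncLogWriter writer, long newSessionTxId) {
        LogRecord record = new LogRecord(newSessionTxId,
                DistributedLogConstants.CONTROL_RECORD_CONTENT);
        record.setControl();               // mark it as a control record
        // Because writes are applied in order, this control record also acts as
        // a barrier: everything queued before it is flushed and committed once
        // the returned future is satisfied.
        return writer.write(record);
    }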


>
>
>
> >
> > ======Transaction ID:
> > The BKLogSegmentWriter ignores the transaction ids from control records
> > when it records the "LastTXId." Would that be an issue here for anything?
> > It looks like it may do that because it assumes you're calling it's local
> > function for writing a controlrecord, which uses the lastTxId.
> >
>
> I think the lastTxId is for the last transaction id of user records, so we
> probably shouldn't change how we record the lastTxId. However, we can
> change how we fetch the last tx id for id generation after recovery.
>


getLastTxId should have a flag to include control records or not.
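
Something along these lines on the log handler interface would be enough; the
exact signature below is a suggestion, not the existing API.

    // Suggested shape only. After recovery the proxy would call
    // getLastTxId(true) so that control records (session bumps) are included
    // when seeding the transaction id generator.
    long getLastTxId(boolean includeControlRecords) throws IOException;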


>
>
> >
> >
> > ==Thrift Interface:
> > ====Should the write response be split out for different calls?
> > It seems odd to have a single struct with many optional items that are
> > filled depending on the call made for every rpc call. This is mostly a
> > curiosity question, since I assume it comes from the general practices
> from
> > using thrift for a while. Would it at least make sense for the
> > getLastDLSN/fence endpoint to have a new struct?
> >
>
> I don't have any preference here. It might make sense to have a new struct.
>
>
> >
> > ====Any particular error code that makes sense for session fenced? If we
> > want to be close to the HTTP errors, looks like 412 (PRECONDITION FAILED)
> > might make the most sense, if a bit generic.
> >
> > 412 def:
> > "The precondition given in one or more of the request-header fields
> > evaluated to false when it was tested on the server. This response code
> > allows the client to place preconditions on the current resource
> > metainformation (header field data) and thus prevent the requested method
> > from being applied to a resource other than the one intended."
> >
> > 412 excerpt from the If-match doc:
> > "This behavior is most useful when the client wants to prevent an
> updating
> > method, such as PUT, from modifying a resource that has changed since the
> > client last retrieved it."
> >
>
> I liked your idea.
>
>
> >
> > ====Should we return the sessionid to the client in the "fencesession"
> > calls?
> > Seems like it may be useful when you fence, especially if you have some
> > type of custom sequencer where it would make sense for search, or for
> > debugging.
> > Main minus is that it would be easy for users to create an implicit
> > requirement that the sessionid is forever a valid transactionid, which
> may
> > not always be the case long term for the project.
> >
>
> I think we should probably start by not returning it, and add it later if
> we really need it.
>
>
> >
> >
> > ==Client:
> > ====What is the proposed process for the client retrieving the new
> > sessionid?
> > A full reconnect? No special case code, but intrusive on the client side,
> > and possibly expensive garbage/processing wise. (though this type of
> > failure should hopefully be rare enough to not be a problem)
> > A call to reset the sessionid? Less intrusive, all the issues you get
> with
> > mutable object methods that need to be called in a certain order, edge
> > cases such as outstanding/buffered requests to the old stream, etc.
> > The call could also return the new sessionid, making it a good call for
> > storing or debugging the value.
> >
>
> It is probably not good to tie a session to a connection, especially since
> the underlying communication is an RPC framework.
>
> I think the new session id can be piggybacked on the rejected response. The
> client doesn't have to explicitly retrieve a new session id.
>

+1 for separating session from connection, especially since the 'session'
here is more of a lifecycle concept for a stream.



>
>
> >
> > ====Session Fenced failure:
> > Will this put the client into a failure state, stopping all future writes
> > until fixed?
> >
>
> My thought is that the new session id will be piggybacked on the fence
> response, so the client will know the new session id and all future writes
> will just carry it.
>
>
> > Is it even possible to get this error when ownership changes? The
> > connection to the new owner should get a new sessionid on connect, so I
> > would expect not.
> >
>
> I think it will. But the client should handle it and retry, since a session
> fenced exception indicates that the write was never attempted.
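
A minimal sketch of the client-side handling described here; the session-aware
write() overload, SessionFencedException and getNewSessionId() are placeholders
for whatever the final thrift definitions end up being.

    // Sketch only: retry on a fenced session using the piggybacked session id.
    DLSN writeWithSession(DistributedLogClient client, String stream,
                          ByteBuffer payload) throws Exception {
        while (true) {
            try {
                return Await.result(client.write(stream, payload, currentSessionId));
            } catch (SessionFencedException sfe) {
                // The write was never attempted, so retrying is safe once the
                // client picks up the new session id from the rejection.
                currentSessionId = sfe.getNewSessionId();
            }
        }
    }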
>
>
> >
> > Cheers,
> > Cameron
> >
> > On Tue, Nov 15, 2016 at 2:01 AM, Xi Liu <xi...@gmail.com> wrote:
> >
> > > Thank you, Cameron. Look forward to your comments.
> > >
> > > - Xi
> > >
> > > On Sun, Nov 13, 2016 at 1:21 PM, Cameron Hatfield <ki...@gmail.com>
> > > wrote:
> > >
> > > > Sorry, I've been on vacation for the past week, and heads down for a
> > > > release that is using DL at the end of Nov. I'll take a look at this
> > over
> > > > the next week, and add any relevant comments. After we are finished
> > with
> > > > dev for this release, I am hoping to tackle this next.
> > > >
> > > > -Cameron
> > > >
> > > > On Fri, Nov 11, 2016 at 12:07 PM, Sijie Guo <si...@apache.org>
> wrote:
> > > >
> > > > > Xi,
> > > > >
> > > > > Thank you so much for your proposal. I took a look. It looks fine
> to
> > > me.
> > > > > Cameron, do you have any comments?
> > > > >
> > > > > Look forward to your pull requests.
> > > > >
> > > > > - Sijie
> > > > >
> > > > >
> > > > > On Wed, Nov 9, 2016 at 2:34 AM, Xi Liu <xi...@gmail.com>
> wrote:
> > > > >
> > > > > > Cameron,
> > > > > >
> > > > > > Have you started any work for this? I just updated the proposal
> > page
> > > -
> > > > > > https://cwiki.apache.org/confluence/display/DL/DP-2+-+
> > > > > Epoch+Write+Support
> > > > > > Maybe we can work together with this.
> > > > > >
> > > > > > Sijie, Leigh,
> > > > > >
> > > > > > can you guys help review this to make sure our proposal is in the
> > > right
> > > > > > direction?
> > > > > >
> > > > > > - Xi
> > > > > >
> > > > > > On Tue, Nov 1, 2016 at 3:05 AM, Sijie Guo <si...@apache.org>
> > wrote:
> > > > > >
> > > > > > > I created https://issues.apache.org/jira/browse/DL-63 for
> > tracking
> > > > the
> > > > > > > proposed idea here.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Wed, Oct 26, 2016 at 4:53 PM, Sijie Guo
> > > > <sijieg@twitter.com.invalid
> > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > On Tue, Oct 25, 2016 at 11:30 AM, Cameron Hatfield <
> > > > kinguy@gmail.com
> > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Yes, we are reading the HBase WAL (from their replication
> > > plugin
> > > > > > > > support),
> > > > > > > > > and writing that into DL.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Gotcha.
> > > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > From the sounds of it, yes, it would. Only thing I would
> say
> > is
> > > > > make
> > > > > > > the
> > > > > > > > > epoch requirement optional, so that if I client doesn't
> care
> > > > about
> > > > > > > dupes
> > > > > > > > > they don't have to deal with the process of getting a new
> > > epoch.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Yup. This should be optional. I can start a wiki page on how
> we
> > > > want
> > > > > to
> > > > > > > > implement this. Are you interested in contributing to this?
> > > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > -Cameron
> > > > > > > > >
> > > > > > > > > On Wed, Oct 19, 2016 at 7:43 PM, Sijie Guo
> > > > > > <sijieg@twitter.com.invalid
> > > > > > > >
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > On Wed, Oct 19, 2016 at 7:17 PM, Sijie Guo <
> > > sijieg@twitter.com
> > > > >
> > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > On Monday, October 17, 2016, Cameron Hatfield <
> > > > > kinguy@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > >> Answer inline:
> > > > > > > > > > >>
> > > > > > > > > > >> On Mon, Oct 17, 2016 at 11:46 AM, Sijie Guo <
> > > > sijie@apache.org
> > > > > >
> > > > > > > > wrote:
> > > > > > > > > > >>
> > > > > > > > > > >> > Cameron,
> > > > > > > > > > >> >
> > > > > > > > > > >> > Thank you for your summary. I liked the discussion
> > > here. I
> > > > > > also
> > > > > > > > > liked
> > > > > > > > > > >> the
> > > > > > > > > > >> > summary of your requirement -
> 'single-writer-per-key,
> > > > > > > > > > >> > multiple-writers-per-log'. If I understand
> correctly,
> > > the
> > > > > core
> > > > > > > > > concern
> > > > > > > > > > >> here
> > > > > > > > > > >> > is almost 'exact-once' write (or a way to explicit
> > tell
> > > > if a
> > > > > > > write
> > > > > > > > > can
> > > > > > > > > > >> be
> > > > > > > > > > >> > retried or not).
> > > > > > > > > > >> >
> > > > > > > > > > >> > Comments inline.
> > > > > > > > > > >> >
> > > > > > > > > > >> > On Fri, Oct 14, 2016 at 11:17 AM, Cameron Hatfield <
> > > > > > > > > kinguy@gmail.com>
> > > > > > > > > > >> > wrote:
> > > > > > > > > > >> >
> > > > > > > > > > >> > > > Ah- yes good point (to be clear we're not using
> > the
> > > > > proxy
> > > > > > > this
> > > > > > > > > way
> > > > > > > > > > >> > > today).
> > > > > > > > > > >> > >
> > > > > > > > > > >> > > > > Due to the source of the
> > > > > > > > > > >> > > > > data (HBase Replication), we cannot guarantee
> > > that a
> > > > > > > single
> > > > > > > > > > >> partition
> > > > > > > > > > >> > > will
> > > > > > > > > > >> > > > > be owned for writes by the same client.
> > > > > > > > > > >> > >
> > > > > > > > > > >> > > > Do you mean you *need* to support multiple
> writers
> > > > > issuing
> > > > > > > > > > >> interleaved
> > > > > > > > > > >> > > > writes or is it just that they might sometimes
> > > > > interleave
> > > > > > > > writes
> > > > > > > > > > and
> > > > > > > > > > >> > you
> > > > > > > > > > >> > > >don't care?
> > > > > > > > > > >> > > How HBase partitions the keys being written
> wouldn't
> > > > have
> > > > > a
> > > > > > > > > one->one
> > > > > > > > > > >> > > mapping with the partitions we would have in
> HBase.
> > > Even
> > > > > if
> > > > > > we
> > > > > > > > did
> > > > > > > > > > >> have
> > > > > > > > > > >> > > that alignment when the cluster first started,
> HBase
> > > > will
> > > > > > > > > rebalance
> > > > > > > > > > >> what
> > > > > > > > > > >> > > servers own what partitions, as well as split and
> > > merge
> > > > > > > > partitions
> > > > > > > > > > >> that
> > > > > > > > > > >> > > already exist, causing eventual drift from one log
> > per
> > > > > > > > partition.
> > > > > > > > > > >> > > Because we want ordering guarantees per key (row
> in
> > > > > hbase),
> > > > > > we
> > > > > > > > > > >> partition
> > > > > > > > > > >> > > the logs by the key. Since multiple writers are
> > > possible
> > > > > per
> > > > > > > > range
> > > > > > > > > > of
> > > > > > > > > > >> > keys
> > > > > > > > > > >> > > (due to the aforementioned rebalancing /
> splitting /
> > > etc
> > > > > of
> > > > > > > > > hbase),
> > > > > > > > > > we
> > > > > > > > > > >> > > cannot use the core library due to requiring a
> > single
> > > > > writer
> > > > > > > for
> > > > > > > > > > >> > ordering.
> > > > > > > > > > >> > >
> > > > > > > > > > >> > > But, for a single log, we don't really care about
> > > > ordering
> > > > > > > aside
> > > > > > > > > > from
> > > > > > > > > > >> at
> > > > > > > > > > >> > > the per-key level. So all we really need to be
> able
> > to
> > > > > > handle
> > > > > > > is
> > > > > > > > > > >> > preventing
> > > > > > > > > > >> > > duplicates when a failure occurs, and ordering
> > > > consistency
> > > > > > > > across
> > > > > > > > > > >> > requests
> > > > > > > > > > >> > > from a single client.
> > > > > > > > > > >> > >
> > > > > > > > > > >> > > So our general requirements are:
> > > > > > > > > > >> > > Write A, Write B
> > > > > > > > > > >> > > Timeline: A -> B
> > > > > > > > > > >> > > Request B is only made after A has successfully
> > > returned
> > > > > > > > (possibly
> > > > > > > > > > >> after
> > > > > > > > > > >> > > retries)
> > > > > > > > > > >> > >
> > > > > > > > > > >> > > 1) If the write succeeds, it will be durably
> exposed
> > > to
> > > > > > > clients
> > > > > > > > > > within
> > > > > > > > > > >> > some
> > > > > > > > > > >> > > bounded time frame
> > > > > > > > > > >> > >
> > > > > > > > > > >> >
> > > > > > > > > > >> > Guaranteed.
> > > > > > > > > > >> >
> > > > > > > > > > >>
> > > > > > > > > > >> > > 2) If A succeeds and B succeeds, the ordering for
> > the
> > > > log
> > > > > > will
> > > > > > > > be
> > > > > > > > > A
> > > > > > > > > > >> and
> > > > > > > > > > >> > > then B
> > > > > > > > > > >> > >
> > > > > > > > > > >> >
> > > > > > > > > > >> > If I understand correctly here, B is only sent
> after A
> > > is
> > > > > > > > returned,
> > > > > > > > > > >> right?
> > > > > > > > > > >> > If that's the case, It is guaranteed.
> > > > > > > > > > >>
> > > > > > > > > > >>
> > > > > > > > > > >>
> > > > > > > > > > >> >
> > > > > > > > > > >> >
> > > > > > > > > > >> >
> > > > > > > > > > >> > > 3) If A fails due to an error that can be relied
> on
> > to
> > > > > *not*
> > > > > > > be
> > > > > > > > a
> > > > > > > > > > lost
> > > > > > > > > > >> > ack
> > > > > > > > > > >> > > problem, it will never be exposed to the client,
> so
> > it
> > > > may
> > > > > > > > > > (depending
> > > > > > > > > > >> on
> > > > > > > > > > >> > > the error) be retried immediately
> > > > > > > > > > >> > >
> > > > > > > > > > >> >
> > > > > > > > > > >> > If it is not a lost-ack problem, the entry will be
> > > > exposed.
> > > > > it
> > > > > > > is
> > > > > > > > > > >> > guaranteed.
> > > > > > > > > > >>
> > > > > > > > > > >> Let me try rephrasing the questions, to make sure I'm
> > > > > > > understanding
> > > > > > > > > > >> correctly:
> > > > > > > > > > >> If A fails, with an error such as "Unable to create
> > > > connection
> > > > > > to
> > > > > > > > > > >> bookkeeper server", that would be the type of error we
> > > would
> > > > > > > expect
> > > > > > > > to
> > > > > > > > > > be
> > > > > > > > > > >> able to retry immediately, as that result means no
> > action
> > > > was
> > > > > > > taken
> > > > > > > > on
> > > > > > > > > > any
> > > > > > > > > > >> log / etc, so no entry could have been created. This
> is
> > > > > > different
> > > > > > > > > then a
> > > > > > > > > > >> "Connection Timeout" exception, as we just might not
> > have
> > > > > > gotten a
> > > > > > > > > > >> response
> > > > > > > > > > >> in time.
> > > > > > > > > > >>
> > > > > > > > > > >>
> > > > > > > > > > > Gotcha.
> > > > > > > > > > >
> > > > > > > > > > > The response code returned from proxy can tell if a
> > failure
> > > > can
> > > > > > be
> > > > > > > > > > retried
> > > > > > > > > > > safely or not. (We might need to make them well
> > documented)
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >>
> > > > > > > > > > >> >
> > > > > > > > > > >> >
> > > > > > > > > > >> > > 4) If A fails due to an error that could be a lost
> > ack
> > > > > > problem
> > > > > > > > > > >> (network
> > > > > > > > > > >> > > connectivity / etc), within a bounded time frame
> it
> > > > should
> > > > > > be
> > > > > > > > > > >> possible to
> > > > > > > > > > >> > > find out if the write succeed or failed. Either by
> > > > reading
> > > > > > > from
> > > > > > > > > some
> > > > > > > > > > >> > > checkpoint of the log for the changes that should
> > have
> > > > > been
> > > > > > > made
> > > > > > > > > or
> > > > > > > > > > >> some
> > > > > > > > > > >> > > other possible server-side support.
> > > > > > > > > > >> > >
> > > > > > > > > > >> >
> > > > > > > > > > >> > If I understand this correctly, it is a duplication
> > > issue,
> > > > > > > right?
> > > > > > > > > > >> >
> > > > > > > > > > >> > Can a de-duplication solution work here? Either DL
> or
> > > your
> > > > > > > client
> > > > > > > > > does
> > > > > > > > > > >> the
> > > > > > > > > > >> > de-duplication?
> > > > > > > > > > >> >
> > > > > > > > > > >>
> > > > > > > > > > >> The requirements I'm mentioning are the ones needed
> for
> > > > > > > client-side
> > > > > > > > > > >> dedupping. Since if I can guarantee writes being
> exposed
> > > > > within
> > > > > > > some
> > > > > > > > > > time
> > > > > > > > > > >> frame, and I can never get into an inconsistently
> > ordered
> > > > > state
> > > > > > > when
> > > > > > > > > > >> successes happen, when an error occurs, I can always
> > wait
> > > > for
> > > > > > max
> > > > > > > > time
> > > > > > > > > > >> frame, read the latest writes, and then dedup locally
> > > > against
> > > > > > the
> > > > > > > > > > request
> > > > > > > > > > >> I
> > > > > > > > > > >> just made.
> > > > > > > > > > >>
> > > > > > > > > > >> The main thing about that timeframe is that its
> > basically
> > > > the
> > > > > > > > addition
> > > > > > > > > > of
> > > > > > > > > > >> every timeout, all the way down in the system,
> combined
> > > with
> > > > > > > > whatever
> > > > > > > > > > >> flushing / caching / etc times are at the bookkeeper /
> > > > client
> > > > > > > level
> > > > > > > > > for
> > > > > > > > > > >> when values are exposed
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Gotcha.
> > > > > > > > > > >
> > > > > > > > > > >>
> > > > > > > > > > >>
> > > > > > > > > > >> >
> > > > > > > > > > >> > Is there any ways to identify your write?
> > > > > > > > > > >> >
> > > > > > > > > > >> > I can think of a case as follow - I want to know
> what
> > is
> > > > > your
> > > > > > > > > expected
> > > > > > > > > > >> > behavior from the log.
> > > > > > > > > > >> >
> > > > > > > > > > >> > a)
> > > > > > > > > > >> >
> > > > > > > > > > >> > If a hbase region server A writes a change of key K
> to
> > > the
> > > > > > log,
> > > > > > > > the
> > > > > > > > > > >> change
> > > > > > > > > > >> > is successfully made to the log;
> > > > > > > > > > >> > but server A is down before receiving the change.
> > > > > > > > > > >> > region server B took over the region that contains
> K,
> > > what
> > > > > > will
> > > > > > > B
> > > > > > > > > do?
> > > > > > > > > > >> >
> > > > > > > > > > >>
> > > > > > > > > > >> HBase writes in large chunks (WAL Logs), which its
> > > > replication
> > > > > > > > system
> > > > > > > > > > then
> > > > > > > > > > >> handles by replaying in the case of failure. If I'm
> in a
> > > > > middle
> > > > > > > of a
> > > > > > > > > > log,
> > > > > > > > > > >> and the whole region goes down and gets rescheduled
> > > > > elsewhere, I
> > > > > > > > will
> > > > > > > > > > >> start
> > > > > > > > > > >> back up from the beginning of the log I was in the
> > middle
> > > > of.
> > > > > > > Using
> > > > > > > > > > >> checkpointing + deduping, we should be able to find
> out
> > > > where
> > > > > we
> > > > > > > > left
> > > > > > > > > > off
> > > > > > > > > > >> in the log.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >> >
> > > > > > > > > > >> >
> > > > > > > > > > >> > b) same as a). but server A was just network
> > > partitioned.
> > > > > will
> > > > > > > > both
> > > > > > > > > A
> > > > > > > > > > >> and B
> > > > > > > > > > >> > write the change of key K?
> > > > > > > > > > >> >
> > > > > > > > > > >>
> > > > > > > > > > >> HBase gives us some guarantees around network
> partitions
> > > > > > > > (Consistency
> > > > > > > > > > over
> > > > > > > > > > >> availability for HBase). HBase is a single-master
> > failover
> > > > > > > recovery
> > > > > > > > > type
> > > > > > > > > > >> of
> > > > > > > > > > >> system, with zookeeper-based guarantees for single
> > owners
> > > > > > > (writers)
> > > > > > > > > of a
> > > > > > > > > > >> range of data.
> > > > > > > > > > >>
> > > > > > > > > > >> >
> > > > > > > > > > >> >
> > > > > > > > > > >> >
> > > > > > > > > > >> > > 5) If A is turned into multiple batches (one large
> > > > request
> > > > > > > gets
> > > > > > > > > > split
> > > > > > > > > > >> > into
> > > > > > > > > > >> > > multiple smaller ones to the bookkeeper backend,
> due
> > > to
> > > > > log
> > > > > > > > > rolling
> > > > > > > > > > /
> > > > > > > > > > >> > size
> > > > > > > > > > >> > > / etc):
> > > > > > > > > > >> > >   a) The ordering of entries within batches have
> > > > ordering
> > > > > > > > > > consistence
> > > > > > > > > > >> > with
> > > > > > > > > > >> > > the original request, when exposed in the log
> > (though
> > > > they
> > > > > > may
> > > > > > > > be
> > > > > > > > > > >> > > interleaved with other requests)
> > > > > > > > > > >> > >   b) The ordering across batches have ordering
> > > > consistence
> > > > > > > with
> > > > > > > > > the
> > > > > > > > > > >> > > original request, when exposed in the log (though
> > they
> > > > may
> > > > > > be
> > > > > > > > > > >> interleaved
> > > > > > > > > > >> > > with other requests)
> > > > > > > > > > >> > >   c) If a batch fails, and cannot be retried / is
> > > > > > > unsuccessfully
> > > > > > > > > > >> retried,
> > > > > > > > > > >> > > all batches after the failed batch should not be
> > > exposed
> > > > > in
> > > > > > > the
> > > > > > > > > log.
> > > > > > > > > > >> > Note:
> > > > > > > > > > >> > > The batches before and including the failed batch,
> > > that
> > > > > > ended
> > > > > > > up
> > > > > > > > > > >> > > succeeding, can show up in the log, again within
> > some
> > > > > > bounded
> > > > > > > > time
> > > > > > > > > > >> range
> > > > > > > > > > >> > > for reads by a client.
> > > > > > > > > > >> > >
> > > > > > > > > > >> >
> > > > > > > > > > >> > There is a method 'writeBulk' in
> DistributedLogClient
> > > can
> > > > > > > achieve
> > > > > > > > > this
> > > > > > > > > > >> > guarantee.
> > > > > > > > > > >> >
> > > > > > > > > > >> > However, I am not very sure about how will you turn
> A
> > > into
> > > > > > > > batches.
> > > > > > > > > If
> > > > > > > > > > >> you
> > > > > > > > > > >> > are dividing A into batches,
> > > > > > > > > > >> > you can simply control the application write
> sequence
> > to
> > > > > > achieve
> > > > > > > > the
> > > > > > > > > > >> > guarantee here.
> > > > > > > > > > >> >
> > > > > > > > > > >> > Can you explain more about this?
> > > > > > > > > > >> >
> > > > > > > > > > >>
> > > > > > > > > > >> In this case, by batches I mean what the proxy does
> with
> > > the
> > > > > > > single
> > > > > > > > > > >> request
> > > > > > > > > > >> that I send it. If the proxy decides it needs to turn
> my
> > > > > single
> > > > > > > > > request
> > > > > > > > > > >> into multiple batches of requests, due to log rolling,
> > > size
> > > > > > > > > limitations,
> > > > > > > > > > >> etc, those would be the guarantees I need to be able
> to
> > > > > > > reduplicate
> > > > > > > > on
> > > > > > > > > > the
> > > > > > > > > > >> client side.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > A single record written by #write and A record set (set
> > of
> > > > > > records)
> > > > > > > > > > > written by #writeRecordSet are atomic - they will not
> be
> > > > broken
> > > > > > > down
> > > > > > > > > into
> > > > > > > > > > > entries (batches). With the correct response code, you
> > > would
> > > > be
> > > > > > > able
> > > > > > > > to
> > > > > > > > > > > tell if it is a lost-ack failure or not. However there
> > is a
> > > > > size
> > > > > > > > > > limitation
> > > > > > > > > > > for this - it can't not go beyond 1MB for current
> > > > > implementation.
> > > > > > > > > > >
> > > > > > > > > > > What is your expected record size?
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >>
> > > > > > > > > > >> >
> > > > > > > > > > >> >
> > > > > > > > > > >> > >
> > > > > > > > > > >> > > Since we can guarantee per-key ordering on the
> > client
> > > > > side,
> > > > > > we
> > > > > > > > > > >> guarantee
> > > > > > > > > > >> > > that there is a single writer per-key, just not
> per
> > > log.
> > > > > > > > > > >> >
> > > > > > > > > > >> >
> > > > > > > > > > >> > Do you need fencing guarantee in the case of network
> > > > > partition
> > > > > > > > > causing
> > > > > > > > > > >> > two-writers?
> > > > > > > > > > >> >
> > > > > > > > > > >> >
> > > > > > > > > > >> > > So if there was a
> > > > > > > > > > >> > > way to guarantee a single write request as being
> > > written
> > > > > or
> > > > > > > not,
> > > > > > > > > > >> within a
> > > > > > > > > > >> > > certain time frame (since failures should be rare
> > > > anyways,
> > > > > > > this
> > > > > > > > is
> > > > > > > > > > >> fine
> > > > > > > > > > >> > if
> > > > > > > > > > >> > > it is expensive), we can then have the client
> > > guarantee
> > > > > the
> > > > > > > > > ordering
> > > > > > > > > > >> it
> > > > > > > > > > >> > > needs.
> > > > > > > > > > >> > >
> > > > > > > > > > >> >
> > > > > > > > > > >> > This sounds an 'exact-once' write (regarding
> retries)
> > > > > > > requirement
> > > > > > > > to
> > > > > > > > > > me,
> > > > > > > > > > >> > right?
> > > > > > > > > > >> >
> > > > > > > > > > >> Yes. I'm curious of how this issue is handled by
> > > Manhattan,
> > > > > > since
> > > > > > > > you
> > > > > > > > > > can
> > > > > > > > > > >> imagine a data store that ends up getting multiple
> > writes
> > > > for
> > > > > > the
> > > > > > > > same
> > > > > > > > > > put
> > > > > > > > > > >> / get / etc, would be harder to use, and we are
> > basically
> > > > > trying
> > > > > > > to
> > > > > > > > > > create
> > > > > > > > > > >> a log like that for HBase.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Are you guys replacing HBase WAL?
> > > > > > > > > > >
> > > > > > > > > > > In Manhattan case, the request will be first written to
> > DL
> > > > > > streams
> > > > > > > by
> > > > > > > > > > > Manhattan coordinator. The Manhattan replica then will
> > read
> > > > > from
> > > > > > > the
> > > > > > > > DL
> > > > > > > > > > > streams and apply the change. In the lost-ack case, the
> > MH
> > > > > > > > coordinator
> > > > > > > > > > will
> > > > > > > > > > > just fail the request to client.
> > > > > > > > > > >
> > > > > > > > > > > My feeling here is your usage for HBase is a bit
> > different
> > > > from
> > > > > > how
> > > > > > > > we
> > > > > > > > > > use
> > > > > > > > > > > DL in Manhattan. It sounds like you read from a source
> > > (HBase
> > > > > > WAL)
> > > > > > > > and
> > > > > > > > > > > write to DL. But I might be wrong.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >>
> > > > > > > > > > >> >
> > > > > > > > > > >> >
> > > > > > > > > > >> > >
> > > > > > > > > > >> > >
> > > > > > > > > > >> > > > Cameron:
> > > > > > > > > > >> > > > Another thing we've discussed but haven't really
> > > > thought
> > > > > > > > > through -
> > > > > > > > > > >> > > > We might be able to support some kind of epoch
> > write
> > > > > > > request,
> > > > > > > > > > where
> > > > > > > > > > >> the
> > > > > > > > > > >> > > > epoch is guaranteed to have changed if the
> writer
> > > has
> > > > > > > changed
> > > > > > > > or
> > > > > > > > > > the
> > > > > > > > > > >> > > ledger
> > > > > > > > > > >> > > > was ever fenced off. Writes include an epoch and
> > are
> > > > > > > rejected
> > > > > > > > if
> > > > > > > > > > the
> > > > > > > > > > >> > > epoch
> > > > > > > > > > >> > > > has changed.
> > > > > > > > > > >> > > > With a mechanism like this, fencing the ledger
> off
> > > > > after a
> > > > > > > > > failure
> > > > > > > > > > >> > would
> > > > > > > > > > >> > > > ensure any pending writes had either been
> written
> > or
> > > > > would
> > > > > > > be
> > > > > > > > > > >> rejected.
> > > > > > > > > > >> > >
> > > > > > > > > > >> > > The issue would be how I guarantee the write I
> wrote
> > > to
> > > > > the
> > > > > > > > server
> > > > > > > > > > was
> > > > > > > > > > >> > > written. Since a network issue could happen on the
> > > send
> > > > of
> > > > > > the
> > > > > > > > > > >> request,
> > > > > > > > > > >> > or
> > > > > > > > > > >> > > on the receive of the success response, an epoch
> > > > wouldn't
> > > > > > tell
> > > > > > > > me
> > > > > > > > > > if I
> > > > > > > > > > >> > can
> > > > > > > > > > >> > > successfully retry, as it could be successfully
> > > written
> > > > > but
> > > > > > > AWS
> > > > > > > > > > >> dropped
> > > > > > > > > > >> > the
> > > > > > > > > > >> > > connection for the success response. Since the
> epoch
> > > > would
> > > > > > be
> > > > > > > > the
> > > > > > > > > > same
> > > > > > > > > > >> > > (same ledger), I could write duplicates.
> > > > > > > > > > >> > >
> > > > > > > > > > >> > >
> > > > > > > > > > >> > > > We are currently proposing adding a transaction
> > > > semantic
> > > > > > to
> > > > > > > dl
> > > > > > > > > to
> > > > > > > > > > >> get
> > > > > > > > > > >> > rid
> > > > > > > > > > >> > > > of the size limitation and the unaware-ness in
> the
> > > > proxy
> > > > > > > > client.
> > > > > > > > > > >> Here
> > > > > > > > > > >> > is
> > > > > > > > > > >> > > > our idea -
> > > > > > > > > > >> > > > http://mail-archives.apache.
> > org/mod_mbox/incubator-
> > > > > > > > > distributedlog
> > > > > > > > > > >> > > -dev/201609.mbox/%3cCAAC6BxP5YyEHwG0ZCF5soh42X=
> > xuYwYm
> > > > > > > > > > >> > > <http://mail-archives.apache.
> > org/mod_mbox/incubator-
> > > > > > > > > > >> > distributedlog%0A-dev/201609.mbox/%
> > > > > 3cCAAC6BxP5YyEHwG0ZCF5soh
> > > > > > > > > > 42X=xuYwYm>
> > > > > > > > > > >> > > L4nXsYBYiofzxpVk6g@mail.gmail.com%3e
> > > > > > > > > > >> > >
> > > > > > > > > > >> > > > I am not sure if your idea is similar as ours.
> but
> > > > we'd
> > > > > > like
> > > > > > > > to
> > > > > > > > > > >> > > collaborate
> > > > > > > > > > >> > > > with the community if anyone has the similar
> idea.
> > > > > > > > > > >> > >
> > > > > > > > > > >> > > Our use case would be covered by transaction
> > support,
> > > > but
> > > > > > I'm
> > > > > > > > > unsure
> > > > > > > > > > >> if
> > > > > > > > > > >> > we
> > > > > > > > > > >> > > would need something that heavy weight for the
> > > > guarantees
> > > > > we
> > > > > > > > need.
> > > > > > > > > > >> > >
> > > > > > > > > > >> >
> > > > > > > > > > >> > >
> > > > > > > > > > >> > > Basically, the high level requirement here is
> > "Support
> > > > > > > > consistent
> > > > > > > > > > >> write
> > > > > > > > > > >> > > ordering for single-writer-per-key,
> > > > multi-writer-per-log".
> > > > > > My
> > > > > > > > > hunch
> > > > > > > > > > is
> > > > > > > > > > >> > > that, with some added guarantees to the proxy (if
> it
> > > > isn't
> > > > > > > > already
> > > > > > > > > > >> > > supported), and some custom client code on our
> side
> > > for
> > > > > > > removing
> > > > > > > > > the
> > > > > > > > > > >> > > entries that actually succeed to write to
> > > DistributedLog
> > > > > > from
> > > > > > > > the
> > > > > > > > > > >> request
> > > > > > > > > > >> > > that failed, it should be a relatively easy thing
> to
> > > > > > support.
> > > > > > > > > > >> > >
> > > > > > > > > > >> >
> > > > > > > > > > >> > Yup. I think it should not be very difficult to
> > support.
> > > > > There
> > > > > > > > might
> > > > > > > > > > be
> > > > > > > > > > >> > some changes in the server side.
> > > > > > > > > > >> > Let's figure out what will the changes be. Are you
> > guys
> > > > > > > interested
> > > > > > > > > in
> > > > > > > > > > >> > contributing?
> > > > > > > > > > >> >
> > > > > > > > > > >> > Yes, we would be.
> > > > > > > > > > >>
> > > > > > > > > > >> As a note, the one thing that we see as an issue with
> > the
> > > > > client
> > > > > > > > side
> > > > > > > > > > >> dedupping is how to bound the range of data that needs
> > to
> > > be
> > > > > > > looked
> > > > > > > > at
> > > > > > > > > > for
> > > > > > > > > > >> deduplication. As you can imagine, it is pretty easy
> to
> > > > bound
> > > > > > the
> > > > > > > > > bottom
> > > > > > > > > > >> of
> > > > > > > > > > >> the range, as that it just regular checkpointing of
> the
> > > DSLN
> > > > > > that
> > > > > > > is
> > > > > > > > > > >> returned. I'm still not sure if there is any nice way
> to
> > > > time
> > > > > > > bound
> > > > > > > > > the
> > > > > > > > > > >> top
> > > > > > > > > > >> end of the range, especially since the proxy owns
> > sequence
> > > > > > numbers
> > > > > > > > > > (which
> > > > > > > > > > >> makes sense). I am curious if there is more that can
> be
> > > done
> > > > > if
> > > > > > > > > > >> deduplication is on the server side. However the main
> > > minus
> > > > I
> > > > > > see
> > > > > > > of
> > > > > > > > > > >> server
> > > > > > > > > > >> side deduplication is that instead of running
> contingent
> > > on
> > > > > > there
> > > > > > > > > being
> > > > > > > > > > a
> > > > > > > > > > >> failed client request, instead it would have to run
> > every
> > > > > time a
> > > > > > > > write
> > > > > > > > > > >> happens.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > For a reliable dedup, we probably need
> > > fence-then-getLastDLSN
> > > > > > > > > operation -
> > > > > > > > > > > so it would guarantee that any non-completed requests
> > > issued
> > > > > > > > (lost-ack
> > > > > > > > > > > requests) before this fence-then-getLastDLSN operation
> > will
> > > > be
> > > > > > > failed
> > > > > > > > > and
> > > > > > > > > > > they will never land at the log.
> > > > > > > > > > >
> > > > > > > > > > > the pseudo code would look like below -
> > > > > > > > > > >
> > > > > > > > > > > write(request) onFailure { t =>
> > > > > > > > > > >
> > > > > > > > > > > if (t is timeout exception) {
> > > > > > > > > > >
> > > > > > > > > > > DLSN lastDLSN = fenceThenGetLastDLSN()
> > > > > > > > > > > DLSN lastCheckpointedDLSN = ...;
> > > > > > > > > > > // find if the request lands between [lastDLSN,
> > > > > > > > lastCheckpointedDLSN].
> > > > > > > > > > > // if it exists, the write succeed; otherwise retry.
> > > > > > > > > > >
> > > > > > > > > > > }
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > }
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Just realized the idea is same as what Leigh raised in
> the
> > > > > previous
> > > > > > > > email
> > > > > > > > > > about 'epoch write'. Let me explain more about this idea
> > > > (Leigh,
> > > > > > feel
> > > > > > > > > free
> > > > > > > > > > to jump in to fill up your idea).
> > > > > > > > > >
> > > > > > > > > > - when a log stream is owned,  the proxy use the last
> > > > transaction
> > > > > > id
> > > > > > > as
> > > > > > > > > the
> > > > > > > > > > epoch
> > > > > > > > > > - when a client connects (handshake with the proxy), it
> > will
> > > > get
> > > > > > the
> > > > > > > > > epoch
> > > > > > > > > > for the stream.
> > > > > > > > > > - the writes issued by this client will carry the epoch
> to
> > > the
> > > > > > proxy.
> > > > > > > > > > - add a new rpc - fenceThenGetLastDLSN - it would force
> the
> > > > proxy
> > > > > > to
> > > > > > > > bump
> > > > > > > > > > the epoch.
> > > > > > > > > > - if fenceThenGetLastDLSN happened, all the outstanding
> > > writes
> > > > > with
> > > > > > > old
> > > > > > > > > > epoch will be rejected with exceptions (e.g.
> EpochFenced).
> > > > > > > > > > - The DLSN returned from fenceThenGetLastDLSN can be used
> > as
> > > > the
> > > > > > > bound
> > > > > > > > > for
> > > > > > > > > > deduplications on failures.
> > > > > > > > > >
> > > > > > > > > > Cameron, does this sound a solution to your use case?
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >>
> > > > > > > > > > >> Maybe something that could fit a similar need that
> Kafka
> > > > does
> > > > > > (the
> > > > > > > > > last
> > > > > > > > > > >> store value for a particular key in a log), such that
> > on a
> > > > per
> > > > > > key
> > > > > > > > > basis
> > > > > > > > > > >> there could be a sequence number that support
> > > deduplication?
> > > > > > Cost
> > > > > > > > > seems
> > > > > > > > > > >> like it would be high however, and I'm not even sure
> if
> > > > > > bookkeeper
> > > > > > > > > > >> supports
> > > > > > > > > > >> it.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >> Cheers,
> > > > > > > > > > >> Cameron
> > > > > > > > > > >>
> > > > > > > > > > >> >
> > > > > > > > > > >> > >
> > > > > > > > > > >> > > Thanks,
> > > > > > > > > > >> > > Cameron
> > > > > > > > > > >> > >
> > > > > > > > > > >> > >
> > > > > > > > > > >> > > On Sat, Oct 8, 2016 at 7:35 AM, Leigh Stewart
> > > > > > > > > > >> > <lstewart@twitter.com.invalid
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > wrote:
> > > > > > > > > > >> > >
> > > > > > > > > > >> > > > Cameron:
> > > > > > > > > > >> > > > Another thing we've discussed but haven't really
> > > > thought
> > > > > > > > > through -
> > > > > > > > > > >> > > > We might be able to support some kind of epoch
> > write
> > > > > > > request,
> > > > > > > > > > where
> > > > > > > > > > >> the
> > > > > > > > > > >> > > > epoch is guaranteed to have changed if the
> writer
> > > has
> > > > > > > changed
> > > > > > > > or
> > > > > > > > > > the
> > > > > > > > > > >> > > ledger
> > > > > > > > > > >> > > > was ever fenced off. Writes include an epoch and
> > are
> > > > > > > rejected
> > > > > > > > if
> > > > > > > > > > the
> > > > > > > > > > >> > > epoch
> > > > > > > > > > >> > > > has changed.
> > > > > > > > > > >> > > > With a mechanism like this, fencing the ledger
> off
> > > > > after a
> > > > > > > > > failure
> > > > > > > > > > >> > would
> > > > > > > > > > >> > > > ensure any pending writes had either been
> written
> > or
> > > > > would
> > > > > > > be
> > > > > > > > > > >> rejected.
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > > On Sat, Oct 8, 2016 at 7:10 AM, Sijie Guo <
> > > > > > sijie@apache.org
> > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > > > > Cameron,
> > > > > > > > > > >> > > > >
> > > > > > > > > > >> > > > > I think both Leigh and Xi had made a few good
> > > points
> > > > > > about
> > > > > > > > > your
> > > > > > > > > > >> > > question.
> > > > > > > > > > >> > > > >
> > > > > > > > > > >> > > > > To add one more point to your question - "but
> I
> > am
> > > > not
> > > > > > > > > > >> > > > > 100% of how all of the futures in the code
> > handle
> > > > > > > failures.
> > > > > > > > > > >> > > > > If not, where in the code would be the
> relevant
> > > > places
> > > > > > to
> > > > > > > > add
> > > > > > > > > > the
> > > > > > > > > > >> > > ability
> > > > > > > > > > >> > > > > to do this, and would the project be
> interested
> > > in a
> > > > > > pull
> > > > > > > > > > >> request?"
> > > > > > > > > > >> > > > >
> > > > > > > > > > >> > > > > The current proxy and client logic doesn't do
> > > > > perfectly
> > > > > > on
> > > > > > > > > > >> handling
> > > > > > > > > > >> > > > > failures (duplicates) - the strategy now is
> the
> > > > client
> > > > > > > will
> > > > > > > > > > retry
> > > > > > > > > > >> as
> > > > > > > > > > >> > > best
> > > > > > > > > > >> > > > > at it can before throwing exceptions to users.
> > The
> > > > > code
> > > > > > > you
> > > > > > > > > are
> > > > > > > > > > >> > looking
> > > > > > > > > > >> > > > for
> > > > > > > > > > >> > > > > - it is on BKLogSegmentWriter for the proxy
> > > handling
> > > > > > > writes
> > > > > > > > > and
> > > > > > > > > > >> it is
> > > > > > > > > > >> > > on
> > > > > > > > > > >> > > > > DistributedLogClientImpl for the proxy client
> > > > handling
> > > > > > > > > responses
> > > > > > > > > > >> from
> > > > > > > > > > >> > > > > proxies. Does this help you?
> > > > > > > > > > >> > > > >
> > > > > > > > > > >> > > > > And also, you are welcome to contribute the
> pull
> > > > > > requests.
> > > > > > > > > > >> > > > >
> > > > > > > > > > >> > > > > - Sijie
> > > > > > > > > > >> > > > >
> > > > > > > > > > >> > > > >
> > > > > > > > > > >> > > > >
> > > > > > > > > > >> > > > > On Tue, Oct 4, 2016 at 3:39 PM, Cameron
> > Hatfield <
> > > > > > > > > > >> kinguy@gmail.com>
> > > > > > > > > > >> > > > wrote:
> > > > > > > > > > >> > > > >
> > > > > > > > > > >> > > > > > I have a question about the Proxy Client.
> > > > Basically,
> > > > > > for
> > > > > > > > our
> > > > > > > > > > use
> > > > > > > > > > >> > > cases,
> > > > > > > > > > >> > > > > we
> > > > > > > > > > >> > > > > > want to guarantee ordering at the key level,
> > > > > > > irrespective
> > > > > > > > of
> > > > > > > > > > the
> > > > > > > > > > >> > > > ordering
> > > > > > > > > > >> > > > > > of the partition it may be assigned to as a
> > > whole.
> > > > > Due
> > > > > > > to
> > > > > > > > > the
> > > > > > > > > > >> > source
> > > > > > > > > > >> > > of
> > > > > > > > > > >> > > > > the
> > > > > > > > > > >> > > > > > data (HBase Replication), we cannot
> guarantee
> > > > that a
> > > > > > > > single
> > > > > > > > > > >> > partition
> > > > > > > > > > >> > > > > will
> > > > > > > > > > >> > > > > > be owned for writes by the same client. This
> > > means
> > > > > the
> > > > > > > > proxy
> > > > > > > > > > >> client
> > > > > > > > > > >> > > > works
> > > > > > > > > > >> > > > > > well (since we don't care which proxy owns
> the
> > > > > > partition
> > > > > > > > we
> > > > > > > > > > are
> > > > > > > > > > >> > > writing
> > > > > > > > > > >> > > > > > to).
> > > > > > > > > > >> > > > > >
> > > > > > > > > > >> > > > > >
> > > > > > > > > > >> > > > > > However, the guarantees we need when
> writing a
> > > > batch
> > > > > > > > > consists
> > > > > > > > > > >> of:
> > > > > > > > > > >> > > > > > Definition of a Batch: The set of records
> sent
> > > to
> > > > > the
> > > > > > > > > > writeBatch
> > > > > > > > > > >> > > > endpoint
> > > > > > > > > > >> > > > > > on the proxy
> > > > > > > > > > >> > > > > >
> > > > > > > > > > >> > > > > > 1. Batch success: If the client receives a
> > > success
> > > > > > from
> > > > > > > > the
> > > > > > > > > > >> proxy,
> > > > > > > > > > >> > > then
> > > > > > > > > > >> > > > > > that batch is successfully written
> > > > > > > > > > >> > > > > >
> > > > > > > > > > >> > > > > > 2. Inter-Batch ordering : Once a batch has
> > been
> > > > > > written
> > > > > > > > > > >> > successfully
> > > > > > > > > > >> > > by
> > > > > > > > > > >> > > > > the
> > > > > > > > > > >> > > > > > client, when another batch is written, it
> will
> > > be
> > > > > > > > guaranteed
> > > > > > > > > > to
> > > > > > > > > > >> be
> > > > > > > > > > >> > > > > ordered
> > > > > > > > > > >> > > > > > after the last batch (if it is the same
> > stream).
> > > > > > > > > > >> > > > > >
> > > > > > > > > > >> > > > > > 3. Intra-Batch ordering: Within a batch of
> > > writes,
> > > > > the
> > > > > > > > > records
> > > > > > > > > > >> will
> > > > > > > > > > >> > > be
> > > > > > > > > > >> > > > > > committed in order
> > > > > > > > > > >> > > > > >
> > > > > > > > > > >> > > > > > 4. Intra-Batch failure ordering: If an
> > > individual
> > > > > > record
> > > > > > > > > fails
> > > > > > > > > > >> to
> > > > > > > > > > >> > > write
> > > > > > > > > > >> > > > > > within a batch, all records after that
> record
> > > will
> > > > > not
> > > > > > > be
> > > > > > > > > > >> written.
> > > > > > > > > > >> > > > > >
> > > > > > > > > > >> > > > > > 5. Batch Commit: Guarantee that if a batch
> > > > returns a
> > > > > > > > > success,
> > > > > > > > > > it
> > > > > > > > > > >> > will
> > > > > > > > > > >> > > > be
> > > > > > > > > > >> > > > > > written
> > > > > > > > > > >> > > > > >
> > > > > > > > > > >> > > > > > 6. Read-after-write: Once a batch is
> > committed,
> > > > > > within a
> > > > > > > > > > limited
> > > > > > > > > > >> > > > > time-frame
> > > > > > > > > > >> > > > > > it will be able to be read. This is required
> > in
> > > > the
> > > > > > case
> > > > > > > > of
> > > > > > > > > > >> > failure,
> > > > > > > > > > >> > > so
> > > > > > > > > > >> > > > > > that the client can see what actually got
> > > > > committed. I
> > > > > > > > > believe
> > > > > > > > > > >> the
> > > > > > > > > > >> > > > > > time-frame part could be removed if the
> client
> > > can
> > > > > > send
> > > > > > > in
> > > > > > > > > the
> > > > > > > > > > >> same
> > > > > > > > > > >> > > > > > sequence number that was written previously,
> > > since
> > > > > it
> > > > > > > > would
> > > > > > > > > > then
> > > > > > > > > > >> > fail
> > > > > > > > > > >> > > > and
> > > > > > > > > > >> > > > > > we would know that a read needs to occur.
> > > > > > > > > > >> > > > > >
> > > > > > > > > > >> > > > > >
> > > > > > > > > > >> > > > > > So, my basic question is if this is
> currently
> > > > > possible
> > > > > > > in
> > > > > > > > > the
> > > > > > > > > > >> > proxy?
> > > > > > > > > > >> > > I
> > > > > > > > > > >> > > > > > don't believe it gives these guarantees as
> it
> > > > stands
> > > > > > > > today,
> > > > > > > > > > but
> > > > > > > > > > >> I
> > > > > > > > > > >> > am
> > > > > > > > > > >> > > > not
> > > > > > > > > > >> > > > > > 100% of how all of the futures in the code
> > > handle
> > > > > > > > failures.
> > > > > > > > > > >> > > > > > If not, where in the code would be the
> > relevant
> > > > > places
> > > > > > > to
> > > > > > > > > add
> > > > > > > > > > >> the
> > > > > > > > > > >> > > > ability
> > > > > > > > > > >> > > > > > to do this, and would the project be
> > interested
> > > > in a
> > > > > > > pull
> > > > > > > > > > >> request?
> > > > > > > > > > >> > > > > >
> > > > > > > > > > >> > > > > >
> > > > > > > > > > >> > > > > > Thanks,
> > > > > > > > > > >> > > > > > Cameron
> > > > > > > > > > >> > > > > >
> > > > > > > > > > >> > > > >
> > > > > > > > > > >> > > >
> > > > > > > > > > >> > >
> > > > > > > > > > >> >
> > > > > > > > > > >>
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Proxy Client - Batch Ordering / Commit

Posted by Xi Liu <xi...@gmail.com>.
Cameron,

Thank you for your comments. They are very helpful. My replies are inline.

On Wed, Nov 16, 2016 at 11:59 AM, Cameron Hatfield <ki...@gmail.com> wrote:

> "A couple of questions" is what I originally wrote, and then the following
> happened. Sorry about the large swath of them, making sure my understanding
> of the code base, as well as the DL/Bookkeeper/ZK ecosystem interaction,
> makes sense.
>
> ==General:
> What is an exclusive session? What is it providing over a regular session?
>

The idea here is to provide exclusive-writer semantics for the
distributedlog (thin) client use case, so that it behaves similarly to
using the storage (distributedlog-core) library directly.


>
>
> ==Proxy side:
> Should a new streamop be added for the fencing operation, or does it make
> sense to piggyback on an existing one (such as write)?


> ====getLastDLSN:
> What should be the return result for:
> A new stream
> A new session, after successful fencing
> A new session, after a change in ownership / first starting up
>
> What is the main use case for getLastDLSN(<stream>, false)? Is this to
> guarantee that the recovery process has happened in case of ownership
> failure (I don't have a good understanding of what causes the recovery
> process to happen, especially from the reader side)? Or is it to handle the
> lost-ack problem? Since all the rest of the read related things go through
> the read client, I'm not sure if I see the use case, but it seems like
> there would be a large potential for confusion on which to use. What about
> just a fenceSession op, that always fences, returning the DLSN of the
> fence, and leave the normal getLastDLSN for the regular read client.
>


Ah, your point is valid. I was following the style of bookkeeper. The
bookkeeper client supplies a fencing flag on the readLastAddConfirmed
request during recovery.

But you are right. It is clearer to have just a fenceSessionOp and return
the DLSN of the fence request.
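
To make the shape of that API concrete, here is a minimal sketch of what
the client-facing call could look like (the method name, the signature, and
the use of Twitter futures are assumptions for illustration, not a final
interface):

// Hypothetical proxy-client call: fence the current session of <stream> and
// return the DLSN of the fence control record. Everything written before
// this DLSN is readable; outstanding writes carrying the old session id
// will be rejected.
Future<DLSN> fenceSession(String stream);

// Usage after a lost-ack failure: bound the dedup range with the fence DLSN,
// read back [lastCheckpointedDLSN, fenceDLSN], dedup locally, then resume.
DLSN fenceDLSN = Await.result(client.fenceSession("mystream"));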



>
> ====Fencing:
> When a fence session occurs, what call needs to be made to make sure any
> outstanding writes are flushed and committed (so that we guarantee the
> client will be able to read anything that was in the write queue)?
> Is there a guaranteed ordering for things written in the future queue for
> AsyncLogWriter (I'm not quite confident that I was able to accurately
> follow the logic, as there are many parts of the code that write, have
> queues, heartbeat, etc)?
>

I believe that when AsyncLogWriter asyncWrite is called in order, the
records will be written in order. Sijie or Leigh can confirm that.

Since the writes are in order, when a fence session occurs, a control
record written successfully by the writer guarantees that the writes issued
before the fence-session write are flushed and committed.

We need to invalidate the session when we fence it, so any writes carrying
the old session id that arrive after the control record is written will be
rejected. In this way, the client is guaranteed to be able to read
everything in a consistent way.
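
To sketch that flow end to end on the proxy side (the transaction id
generator, the currentSessionId field, and the method shape are
placeholders, not existing DL code):

// Rough sketch of a fence-session op inside the stream owner.
DLSN fenceSession() throws Exception {
    // claim a new transaction id; it becomes the new session id
    long newSessionId = txnIdGenerator.nextId();
    LogRecord control = new LogRecord(
        newSessionId, DistributedLogConstants.CONTROL_RECORD_CONTENT);
    control.setControl();                    // readers skip control records
    DLSN fenceDLSN = Await.result(writer.write(control));
    currentSessionId = newSessionId;         // writes carrying the old session
                                             // id are rejected from here on
    return fenceDLSN;                        // upper bound for client-side dedup
}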


>
> ====SessionID:
> What is the default sessionid / transactionid for a new stream? I assume
> this would just be the first control record
>

The initial session id will be the transaction id of the first control
record.


>
> ======Should all streams have a sessionid by default, even if it is never
> used by a client (aka, every time ownership changes, a new control record
> is generated, and a sessionid is stored)?
> Main edge case that would have to be handled is if a client writes with an
> old sessionid, but the owner has changed and has yet to create a sessionid.
> This should be handled by the "non-matching sessionid" rule, since the
> invalid sessionid wouldn't match the passed sessionid, which should cause
> the client to get a new sessionid.
>

I think all streams should just have a session id by default. The session
id is changed when ownership is changed or it is explicitly bumped by a
fence session op.


>
> ======Where in the code does it make sense to own the session, the stream
> interfaces / classes? Should they pass that information down to the ops, or
> do the sessionid check within?
> My first thought would be Stream owns the sessionid, passes it into the ops
> (as either a nullable value, or an invalid default value), which then do
> the sessionid check if they care. The main issue is updating the sessionid
> is a bit backwards, as either every op has the ability to update it through
> some type of return value / direct stream access / etc, or there is a
> special case in the stream for the fence operation / any other operation
> that can update the session.
>

My thought is to add the session id to Stream (StreamImpl.java). The stream
validates the session id before submitting a stream op. If the op is a
fence-session op, it would just invalidate the current session, so
subsequent requests carrying the old session id will be rejected.
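
As a rough illustration of that validation step (the sessionId field, the
request accessors, and the SESSION_FENCED status code are all hypothetical
names, just to show the shape of the check):

// Sketch of per-op session validation inside StreamImpl.
// Requests that carry no session id skip the check, which keeps the
// session requirement optional for clients that don't care about dupes.
boolean validateSession(StreamOp op) {
    if (!op.hasSessionId()) {
        return true;                          // session-less writes always pass
    }
    if (op.getSessionId() != this.sessionId) {
        op.fail(StatusCode.SESSION_FENCED);   // reject with the fencing error code
        return false;
    }
    return true;
}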


>
> ======For "the owner of the log stream will first advance the transaction
> id generator to claim a new transaction id and write a control record to
> the log stream. ":
> Should "DistributedLogConstants.CONTROL_RECORD_CONTENT" be the type of
> control record written?
> Should the "writeControlRecord" on the BKAsyncLogWriter be exposed in the
> AsyncLogWriter interface? Or even in the one within the segment
> writer? Or should the code be duplicated / pulled out into a helper / etc?
> (Not a big Java person, so any suggestions on the "Java Way", or at least
> the DL way, to do it would be appreciated)
>

I believe we can just construct a log record and set it to be a control
record and write it.

// sketch: claim a new transaction id and write it as a control record
LogRecord record = ...   // e.g. new LogRecord(newTxnId, CONTROL_RECORD_CONTENT)
record.setControl();     // mark it as a control record so readers skip it
writer.write(record);

(Can anyone from community confirm that it is okay to write a control
record in this way?)



>
> ======Transaction ID:
> The BKLogSegmentWriter ignores the transaction ids from control records
> when it records the "LastTXId." Would that be an issue here for anything?
> It looks like it may do that because it assumes you're calling its local
> function for writing a controlrecord, which uses the lastTxId.
>

I think the lastTxId tracks the last transaction id of user records, so we
probably shouldn't change how we record the lastTxId. However, we can
change how we fetch the last tx id for id generation after recovery.


>
>
> ==Thrift Interface:
> ====Should the write response be split out for different calls?
> It seems odd to have a single struct with many optional items that are
> filled depending on the call made for every rpc call. This is mostly a
> curiosity question, since I assume it comes from the general practices from
> using thrift for a while. Would it at least make sense for the
> getLastDLSN/fence endpoint to have a new struct?
>

I don't have any preference here. It might make sense to have a new struct.


>
> ====Any particular error code that makes sense for session fenced? If we
> want to be close to the HTTP errors, looks like 412 (PRECONDITION FAILED)
> might make the most sense, if a bit generic.
>
> 412 def:
> "The precondition given in one or more of the request-header fields
> evaluated to false when it was tested on the server. This response code
> allows the client to place preconditions on the current resource
> metainformation (header field data) and thus prevent the requested method
> from being applied to a resource other than the one intended."
>
> 412 excerpt from the If-match doc:
> "This behavior is most useful when the client wants to prevent an updating
> method, such as PUT, from modifying a resource that has changed since the
> client last retrieved it."
>

I liked your idea.


>
> ====Should we return the sessionid to the client in the "fencesession"
> calls?
> Seems like it may be useful when you fence, especially if you have some
> type of custom sequencer where it would make sense for search, or for
> debugging.
> Main minus is that it would be easy for users to create an implicit
> requirement that the sessionid is forever a valid transactionid, which may
> not always be the case long term for the project.
>

I think we should probably start without returning it and add it later if
we really need it.


>
>
> ==Client:
> ====What is the proposed process for the client retrieving the new
> sessionid?
> A full reconnect? No special case code, but intrusive on the client side,
> and possibly expensive garbage/processing wise. (though this type of
> failure should hopefully be rare enough to not be a problem)
> A call to reset the sessionid? Less intrusive, all the issues you get with
> mutable object methods that need to be called in a certain order, edge
> cases such as outstanding/buffered requests to the old stream, etc.
> The call could also return the new sessionid, making it a good call for
> storing or debugging the value.
>

It is probably not good to tie a session to a connection, especially since
the underlying communication is an RPC framework.

I think the new session id can be piggybacked on the rejected response, so
the client doesn't have to explicitly retrieve a new session id.


>
> ====Session Fenced failure:
> Will this put the client into a failure state, stopping all future writes
> until fixed?
>

My thought is that the new session id will be piggybacked on the fence
response, so the client will learn the new session id and all future writes
will just carry it.


> Is it even possible to get this error when ownership changes? The
> connection to the new owner should get a new sessionid on connect, so I
> would expect not.
>

I think it can happen. But the client should handle it and retry, since the
session-fenced exception indicates that the write was never attempted.
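
A minimal sketch of that handle-and-retry path on the client side (the
session-carrying write signature, the SESSION_FENCED code, and the
getNewSessionId accessor are assumed names for the piggybacked field
discussed above):

// Sketch: client-side handling of a fenced write.
WriteResponse resp = write(stream, record, sessionId);
while (resp.getHeader().getCode() == StatusCode.SESSION_FENCED) {
    // the rejection carries the new session id, so no extra round trip is needed
    sessionId = resp.getHeader().getNewSessionId();
    // safe to retry: a session-fenced rejection means the write was never attempted
    resp = write(stream, record, sessionId);
}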


>
> Cheers,
> Cameron
>
> On Tue, Nov 15, 2016 at 2:01 AM, Xi Liu <xi...@gmail.com> wrote:
>
> > Thank you, Cameron. Look forward to your comments.
> >
> > - Xi
> >
> > On Sun, Nov 13, 2016 at 1:21 PM, Cameron Hatfield <ki...@gmail.com>
> > wrote:
> >
> > > Sorry, I've been on vacation for the past week, and heads down for a
> > > release that is using DL at the end of Nov. I'll take a look at this
> over
> > > the next week, and add any relevant comments. After we are finished
> with
> > > dev for this release, I am hoping to tackle this next.
> > >
> > > -Cameron
> > >
> > > On Fri, Nov 11, 2016 at 12:07 PM, Sijie Guo <si...@apache.org> wrote:
> > >
> > > > Xi,
> > > >
> > > > Thank you so much for your proposal. I took a look. It looks fine to
> > me.
> > > > Cameron, do you have any comments?
> > > >
> > > > Look forward to your pull requests.
> > > >
> > > > - Sijie
> > > >
> > > >
> > > > On Wed, Nov 9, 2016 at 2:34 AM, Xi Liu <xi...@gmail.com> wrote:
> > > >
> > > > > Cameron,
> > > > >
> > > > > Have you started any work for this? I just updated the proposal
> page
> > -
> > > > > https://cwiki.apache.org/confluence/display/DL/DP-2+-+
> > > > Epoch+Write+Support
> > > > > Maybe we can work together with this.
> > > > >
> > > > > Sijie, Leigh,
> > > > >
> > > > > can you guys help review this to make sure our proposal is in the
> > right
> > > > > direction?
> > > > >
> > > > > - Xi
> > > > >
> > > > > On Tue, Nov 1, 2016 at 3:05 AM, Sijie Guo <si...@apache.org>
> wrote:
> > > > >
> > > > > > I created https://issues.apache.org/jira/browse/DL-63 for
> tracking
> > > the
> > > > > > proposed idea here.
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Wed, Oct 26, 2016 at 4:53 PM, Sijie Guo
> > > <sijieg@twitter.com.invalid
> > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > On Tue, Oct 25, 2016 at 11:30 AM, Cameron Hatfield <
> > > kinguy@gmail.com
> > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Yes, we are reading the HBase WAL (from their replication
> > plugin
> > > > > > > support),
> > > > > > > > and writing that into DL.
> > > > > > > >
> > > > > > >
> > > > > > > Gotcha.
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > From the sounds of it, yes, it would. Only thing I would say
> is
> > > > make
> > > > > > the
> > > > > > > > epoch requirement optional, so that if I client doesn't care
> > > about
> > > > > > dupes
> > > > > > > > they don't have to deal with the process of getting a new
> > epoch.
> > > > > > > >
> > > > > > >
> > > > > > > Yup. This should be optional. I can start a wiki page on how we
> > > want
> > > > to
> > > > > > > implement this. Are you interested in contributing to this?
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > -Cameron
> > > > > > > >
> > > > > > > > On Wed, Oct 19, 2016 at 7:43 PM, Sijie Guo
> > > > > <sijieg@twitter.com.invalid
> > > > > > >
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > On Wed, Oct 19, 2016 at 7:17 PM, Sijie Guo <
> > sijieg@twitter.com
> > > >
> > > > > > wrote:
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Monday, October 17, 2016, Cameron Hatfield <
> > > > kinguy@gmail.com>
> > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > >> Answer inline:
> > > > > > > > > >>
> > > > > > > > > >> On Mon, Oct 17, 2016 at 11:46 AM, Sijie Guo <
> > > sijie@apache.org
> > > > >
> > > > > > > wrote:
> > > > > > > > > >>
> > > > > > > > > >> > Cameron,
> > > > > > > > > >> >
> > > > > > > > > >> > Thank you for your summary. I liked the discussion
> > here. I
> > > > > also
> > > > > > > > liked
> > > > > > > > > >> the
> > > > > > > > > >> > summary of your requirement - 'single-writer-per-key,
> > > > > > > > > >> > multiple-writers-per-log'. If I understand correctly,
> > the
> > > > core
> > > > > > > > concern
> > > > > > > > > >> here
> > > > > > > > > >> > is almost 'exact-once' write (or a way to explicit
> tell
> > > if a
> > > > > > write
> > > > > > > > can
> > > > > > > > > >> be
> > > > > > > > > >> > retried or not).
> > > > > > > > > >> >
> > > > > > > > > >> > Comments inline.
> > > > > > > > > >> >
> > > > > > > > > >> > On Fri, Oct 14, 2016 at 11:17 AM, Cameron Hatfield <
> > > > > > > > kinguy@gmail.com>
> > > > > > > > > >> > wrote:
> > > > > > > > > >> >
> > > > > > > > > >> > > > Ah- yes good point (to be clear we're not using
> the
> > > > proxy
> > > > > > this
> > > > > > > > way
> > > > > > > > > >> > > today).
> > > > > > > > > >> > >
> > > > > > > > > >> > > > > Due to the source of the
> > > > > > > > > >> > > > > data (HBase Replication), we cannot guarantee
> > that a
> > > > > > single
> > > > > > > > > >> partition
> > > > > > > > > >> > > will
> > > > > > > > > >> > > > > be owned for writes by the same client.
> > > > > > > > > >> > >
> > > > > > > > > >> > > > Do you mean you *need* to support multiple writers
> > > > issuing
> > > > > > > > > >> interleaved
> > > > > > > > > >> > > > writes or is it just that they might sometimes
> > > > interleave
> > > > > > > writes
> > > > > > > > > and
> > > > > > > > > >> > you
> > > > > > > > > >> > > >don't care?
> > > > > > > > > >> > > How HBase partitions the keys being written wouldn't
> > > have
> > > > a
> > > > > > > > one->one
> > > > > > > > > >> > > mapping with the partitions we would have in HBase.
> > Even
> > > > if
> > > > > we
> > > > > > > did
> > > > > > > > > >> have
> > > > > > > > > >> > > that alignment when the cluster first started, HBase
> > > will
> > > > > > > > rebalance
> > > > > > > > > >> what
> > > > > > > > > >> > > servers own what partitions, as well as split and
> > merge
> > > > > > > partitions
> > > > > > > > > >> that
> > > > > > > > > >> > > already exist, causing eventual drift from one log
> per
> > > > > > > partition.
> > > > > > > > > >> > > Because we want ordering guarantees per key (row in
> > > > hbase),
> > > > > we
> > > > > > > > > >> partition
> > > > > > > > > >> > > the logs by the key. Since multiple writers are
> > possible
> > > > per
> > > > > > > range
> > > > > > > > > of
> > > > > > > > > >> > keys
> > > > > > > > > >> > > (due to the aforementioned rebalancing / splitting /
> > etc
> > > > of
> > > > > > > > hbase),
> > > > > > > > > we
> > > > > > > > > >> > > cannot use the core library due to requiring a
> single
> > > > writer
> > > > > > for
> > > > > > > > > >> > ordering.
> > > > > > > > > >> > >
> > > > > > > > > >> > > But, for a single log, we don't really care about
> > > ordering
> > > > > > aside
> > > > > > > > > from
> > > > > > > > > >> at
> > > > > > > > > >> > > the per-key level. So all we really need to be able
> to
> > > > > handle
> > > > > > is
> > > > > > > > > >> > preventing
> > > > > > > > > >> > > duplicates when a failure occurs, and ordering
> > > consistency
> > > > > > > across
> > > > > > > > > >> > requests
> > > > > > > > > >> > > from a single client.
> > > > > > > > > >> > >
> > > > > > > > > >> > > So our general requirements are:
> > > > > > > > > >> > > Write A, Write B
> > > > > > > > > >> > > Timeline: A -> B
> > > > > > > > > >> > > Request B is only made after A has successfully
> > returned
> > > > > > > (possibly
> > > > > > > > > >> after
> > > > > > > > > >> > > retries)
> > > > > > > > > >> > >
> > > > > > > > > >> > > 1) If the write succeeds, it will be durably exposed
> > to
> > > > > > clients
> > > > > > > > > within
> > > > > > > > > >> > some
> > > > > > > > > >> > > bounded time frame
> > > > > > > > > >> > >
> > > > > > > > > >> >
> > > > > > > > > >> > Guaranteed.
> > > > > > > > > >> >
> > > > > > > > > >>
> > > > > > > > > >> > > 2) If A succeeds and B succeeds, the ordering for
> the
> > > log
> > > > > will
> > > > > > > be
> > > > > > > > A
> > > > > > > > > >> and
> > > > > > > > > >> > > then B
> > > > > > > > > >> > >
> > > > > > > > > >> >
> > > > > > > > > >> > If I understand correctly here, B is only sent after A
> > is
> > > > > > > returned,
> > > > > > > > > >> right?
> > > > > > > > > >> > If that's the case, It is guaranteed.
> > > > > > > > > >>
> > > > > > > > > >>
> > > > > > > > > >>
> > > > > > > > > >> >
> > > > > > > > > >> >
> > > > > > > > > >> >
> > > > > > > > > >> > > 3) If A fails due to an error that can be relied on
> to
> > > > *not*
> > > > > > be
> > > > > > > a
> > > > > > > > > lost
> > > > > > > > > >> > ack
> > > > > > > > > >> > > problem, it will never be exposed to the client, so
> it
> > > may
> > > > > > > > > (depending
> > > > > > > > > >> on
> > > > > > > > > >> > > the error) be retried immediately
> > > > > > > > > >> > >
> > > > > > > > > >> >
> > > > > > > > > >> > If it is not a lost-ack problem, the entry will be
> > > exposed.
> > > > it
> > > > > > is
> > > > > > > > > >> > guaranteed.
> > > > > > > > > >>
> > > > > > > > > >> Let me try rephrasing the questions, to make sure I'm
> > > > > > understanding
> > > > > > > > > >> correctly:
> > > > > > > > > >> If A fails, with an error such as "Unable to create
> > > connection
> > > > > to
> > > > > > > > > >> bookkeeper server", that would be the type of error we
> > would
> > > > > > expect
> > > > > > > to
> > > > > > > > > be
> > > > > > > > > >> able to retry immediately, as that result means no
> action
> > > was
> > > > > > taken
> > > > > > > on
> > > > > > > > > any
> > > > > > > > > >> log / etc, so no entry could have been created. This is
> > > > > different
> > > > > > > > then a
> > > > > > > > > >> "Connection Timeout" exception, as we just might not
> have
> > > > > gotten a
> > > > > > > > > >> response
> > > > > > > > > >> in time.
> > > > > > > > > >>
> > > > > > > > > >>
> > > > > > > > > > Gotcha.
> > > > > > > > > >
> > > > > > > > > > The response code returned from proxy can tell if a
> failure
> > > can
> > > > > be
> > > > > > > > > retried
> > > > > > > > > > safely or not. (We might need to make them well
> documented)
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >>
> > > > > > > > > >> >
> > > > > > > > > >> >
> > > > > > > > > >> > > 4) If A fails due to an error that could be a lost
> ack
> > > > > problem
> > > > > > > > > >> (network
> > > > > > > > > >> > > connectivity / etc), within a bounded time frame it
> > > should
> > > > > be
> > > > > > > > > >> possible to
> > > > > > > > > >> > > find out if the write succeed or failed. Either by
> > > reading
> > > > > > from
> > > > > > > > some
> > > > > > > > > >> > > checkpoint of the log for the changes that should
> have
> > > > been
> > > > > > made
> > > > > > > > or
> > > > > > > > > >> some
> > > > > > > > > >> > > other possible server-side support.
> > > > > > > > > >> > >
> > > > > > > > > >> >
> > > > > > > > > >> > If I understand this correctly, it is a duplication
> > issue,
> > > > > > right?
> > > > > > > > > >> >
> > > > > > > > > >> > Can a de-duplication solution work here? Either DL or
> > your
> > > > > > client
> > > > > > > > does
> > > > > > > > > >> the
> > > > > > > > > >> > de-duplication?
> > > > > > > > > >> >
> > > > > > > > > >>
> > > > > > > > > >> The requirements I'm mentioning are the ones needed for
> > > > > > client-side
> > > > > > > > > >> dedupping. Since if I can guarantee writes being exposed
> > > > within
> > > > > > some
> > > > > > > > > time
> > > > > > > > > >> frame, and I can never get into an inconsistently
> ordered
> > > > state
> > > > > > when
> > > > > > > > > >> successes happen, when an error occurs, I can always
> wait
> > > for
> > > > > max
> > > > > > > time
> > > > > > > > > >> frame, read the latest writes, and then dedup locally
> > > against
> > > > > the
> > > > > > > > > request
> > > > > > > > > >> I
> > > > > > > > > >> just made.
> > > > > > > > > >>
> > > > > > > > > >> The main thing about that timeframe is that its
> basically
> > > the
> > > > > > > addition
> > > > > > > > > of
> > > > > > > > > >> every timeout, all the way down in the system, combined
> > with
> > > > > > > whatever
> > > > > > > > > >> flushing / caching / etc times are at the bookkeeper /
> > > client
> > > > > > level
> > > > > > > > for
> > > > > > > > > >> when values are exposed
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Gotcha.
> > > > > > > > > >
> > > > > > > > > >>
> > > > > > > > > >>
> > > > > > > > > >> >
> > > > > > > > > >> > Is there any ways to identify your write?
> > > > > > > > > >> >
> > > > > > > > > >> > I can think of a case as follow - I want to know what
> is
> > > > your
> > > > > > > > expected
> > > > > > > > > >> > behavior from the log.
> > > > > > > > > >> >
> > > > > > > > > >> > a)
> > > > > > > > > >> >
> > > > > > > > > >> > If a hbase region server A writes a change of key K to
> > the
> > > > > log,
> > > > > > > the
> > > > > > > > > >> change
> > > > > > > > > >> > is successfully made to the log;
> > > > > > > > > >> > but server A is down before receiving the change.
> > > > > > > > > >> > region server B took over the region that contains K,
> > what
> > > > > will
> > > > > > B
> > > > > > > > do?
> > > > > > > > > >> >
> > > > > > > > > >>
> > > > > > > > > >> HBase writes in large chunks (WAL Logs), which its
> > > replication
> > > > > > > system
> > > > > > > > > then
> > > > > > > > > >> handles by replaying in the case of failure. If I'm in a
> > > > middle
> > > > > > of a
> > > > > > > > > log,
> > > > > > > > > >> and the whole region goes down and gets rescheduled
> > > > elsewhere, I
> > > > > > > will
> > > > > > > > > >> start
> > > > > > > > > >> back up from the beginning of the log I was in the
> middle
> > > of.
> > > > > > Using
> > > > > > > > > >> checkpointing + deduping, we should be able to find out
> > > where
> > > > we
> > > > > > > left
> > > > > > > > > off
> > > > > > > > > >> in the log.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >> >
> > > > > > > > > >> >
> > > > > > > > > >> > b) same as a). but server A was just network
> > partitioned.
> > > > will
> > > > > > > both
> > > > > > > > A
> > > > > > > > > >> and B
> > > > > > > > > >> > write the change of key K?
> > > > > > > > > >> >
> > > > > > > > > >>
> > > > > > > > > >> HBase gives us some guarantees around network partitions
> > > > > > > (Consistency
> > > > > > > > > over
> > > > > > > > > >> availability for HBase). HBase is a single-master
> failover
> > > > > > recovery
> > > > > > > > type
> > > > > > > > > >> of
> > > > > > > > > >> system, with zookeeper-based guarantees for single
> owners
> > > > > > (writers)
> > > > > > > > of a
> > > > > > > > > >> range of data.
> > > > > > > > > >>
> > > > > > > > > >> >
> > > > > > > > > >> >
> > > > > > > > > >> >
> > > > > > > > > >> > > 5) If A is turned into multiple batches (one large
> > > request
> > > > > > gets
> > > > > > > > > split
> > > > > > > > > >> > into
> > > > > > > > > >> > > multiple smaller ones to the bookkeeper backend, due
> > to
> > > > log
> > > > > > > > rolling
> > > > > > > > > /
> > > > > > > > > >> > size
> > > > > > > > > >> > > / etc):
> > > > > > > > > >> > >   a) The ordering of entries within batches have
> > > ordering
> > > > > > > > > consistence
> > > > > > > > > >> > with
> > > > > > > > > >> > > the original request, when exposed in the log
> (though
> > > they
> > > > > may
> > > > > > > be
> > > > > > > > > >> > > interleaved with other requests)
> > > > > > > > > >> > >   b) The ordering across batches have ordering
> > > consistence
> > > > > > with
> > > > > > > > the
> > > > > > > > > >> > > original request, when exposed in the log (though
> they
> > > may
> > > > > be
> > > > > > > > > >> interleaved
> > > > > > > > > >> > > with other requests)
> > > > > > > > > >> > >   c) If a batch fails, and cannot be retried / is
> > > > > > unsuccessfully
> > > > > > > > > >> retried,
> > > > > > > > > >> > > all batches after the failed batch should not be
> > exposed
> > > > in
> > > > > > the
> > > > > > > > log.
> > > > > > > > > >> > Note:
> > > > > > > > > >> > > The batches before and including the failed batch,
> > that
> > > > > ended
> > > > > > up
> > > > > > > > > >> > > succeeding, can show up in the log, again within
> some
> > > > > bounded
> > > > > > > time
> > > > > > > > > >> range
> > > > > > > > > >> > > for reads by a client.
> > > > > > > > > >> > >
> > > > > > > > > >> >
> > > > > > > > > >> > There is a method 'writeBulk' in DistributedLogClient
> > can
> > > > > > achieve
> > > > > > > > this
> > > > > > > > > >> > guarantee.
> > > > > > > > > >> >
> > > > > > > > > >> > However, I am not very sure about how will you turn A
> > into
> > > > > > > batches.
> > > > > > > > If
> > > > > > > > > >> you
> > > > > > > > > >> > are dividing A into batches,
> > > > > > > > > >> > you can simply control the application write sequence
> to
> > > > > achieve
> > > > > > > the
> > > > > > > > > >> > guarantee here.
> > > > > > > > > >> >
> > > > > > > > > >> > Can you explain more about this?
> > > > > > > > > >> >
> > > > > > > > > >>
> > > > > > > > > >> In this case, by batches I mean what the proxy does with
> > the
> > > > > > single
> > > > > > > > > >> request
> > > > > > > > > >> that I send it. If the proxy decides it needs to turn my
> > > > single
> > > > > > > > request
> > > > > > > > > >> into multiple batches of requests, due to log rolling,
> > size
> > > > > > > > limitations,
> > > > > > > > > >> etc, those would be the guarantees I need to be able to
> > > > > > deduplicate
> > > > > > > on
> > > > > > > > > the
> > > > > > > > > >> client side.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > A single record written by #write and A record set (set
> of
> > > > > records)
> > > > > > > > > > written by #writeRecordSet are atomic - they will not be
> > > broken
> > > > > > down
> > > > > > > > into
> > > > > > > > > > entries (batches). With the correct response code, you
> > would
> > > be
> > > > > > able
> > > > > > > to
> > > > > > > > > > tell if it is a lost-ack failure or not. However there
> is a
> > > > size
> > > > > > > > > limitation
> > > > > > > > > > for this - it cannot go beyond 1MB for the current
> > > > implementation.
> > > > > > > > > >
> > > > > > > > > > What is your expected record size?
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >>
> > > > > > > > > >> >
> > > > > > > > > >> >
> > > > > > > > > >> > >
> > > > > > > > > >> > > Since we can guarantee per-key ordering on the
> client
> > > > side,
> > > > > we
> > > > > > > > > >> guarantee
> > > > > > > > > >> > > that there is a single writer per-key, just not per
> > log.
> > > > > > > > > >> >
> > > > > > > > > >> >
> > > > > > > > > >> > Do you need fencing guarantee in the case of network
> > > > partition
> > > > > > > > causing
> > > > > > > > > >> > two-writers?
> > > > > > > > > >> >
> > > > > > > > > >> >
> > > > > > > > > >> > > So if there was a
> > > > > > > > > >> > > way to guarantee a single write request as being
> > written
> > > > or
> > > > > > not,
> > > > > > > > > >> within a
> > > > > > > > > >> > > certain time frame (since failures should be rare
> > > anyways,
> > > > > > this
> > > > > > > is
> > > > > > > > > >> fine
> > > > > > > > > >> > if
> > > > > > > > > >> > > it is expensive), we can then have the client
> > guarantee
> > > > the
> > > > > > > > ordering
> > > > > > > > > >> it
> > > > > > > > > >> > > needs.
> > > > > > > > > >> > >
> > > > > > > > > >> >
> > > > > > > > > >> > This sounds like an 'exact-once' write (regarding retries)
> > > > > > requirement
> > > > > > > to
> > > > > > > > > me,
> > > > > > > > > >> > right?
> > > > > > > > > >> >
> > > > > > > > > >> Yes. I'm curious how this issue is handled by
> > Manhattan,
> > > > > since
> > > > > > > you
> > > > > > > > > can
> > > > > > > > > >> imagine a data store that ends up getting multiple
> writes
> > > for
> > > > > the
> > > > > > > same
> > > > > > > > > put
> > > > > > > > > >> / get / etc, would be harder to use, and we are
> basically
> > > > trying
> > > > > > to
> > > > > > > > > create
> > > > > > > > > >> a log like that for HBase.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Are you guys replacing HBase WAL?
> > > > > > > > > >
> > > > > > > > > > In Manhattan case, the request will be first written to
> DL
> > > > > streams
> > > > > > by
> > > > > > > > > > Manhattan coordinator. The Manhattan replica then will
> read
> > > > from
> > > > > > the
> > > > > > > DL
> > > > > > > > > > streams and apply the change. In the lost-ack case, the
> MH
> > > > > > > coordinator
> > > > > > > > > will
> > > > > > > > > > just fail the request to client.
> > > > > > > > > >
> > > > > > > > > > My feeling here is your usage for HBase is a bit
> different
> > > from
> > > > > how
> > > > > > > we
> > > > > > > > > use
> > > > > > > > > > DL in Manhattan. It sounds like you read from a source
> > (HBase
> > > > > WAL)
> > > > > > > and
> > > > > > > > > > write to DL. But I might be wrong.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >>
> > > > > > > > > >> >
> > > > > > > > > >> >
> > > > > > > > > >> > >
> > > > > > > > > >> > >
> > > > > > > > > >> > > > Cameron:
> > > > > > > > > >> > > > Another thing we've discussed but haven't really
> > > thought
> > > > > > > > through -
> > > > > > > > > >> > > > We might be able to support some kind of epoch
> write
> > > > > > request,
> > > > > > > > > where
> > > > > > > > > >> the
> > > > > > > > > >> > > > epoch is guaranteed to have changed if the writer
> > has
> > > > > > changed
> > > > > > > or
> > > > > > > > > the
> > > > > > > > > >> > > ledger
> > > > > > > > > >> > > > was ever fenced off. Writes include an epoch and
> are
> > > > > > rejected
> > > > > > > if
> > > > > > > > > the
> > > > > > > > > >> > > epoch
> > > > > > > > > >> > > > has changed.
> > > > > > > > > >> > > > With a mechanism like this, fencing the ledger off
> > > > after a
> > > > > > > > failure
> > > > > > > > > >> > would
> > > > > > > > > >> > > > ensure any pending writes had either been written
> or
> > > > would
> > > > > > be
> > > > > > > > > >> rejected.
> > > > > > > > > >> > >
> > > > > > > > > >> > > The issue would be how I guarantee the write I wrote
> > to
> > > > the
> > > > > > > server
> > > > > > > > > was
> > > > > > > > > >> > > written. Since a network issue could happen on the
> > send
> > > of
> > > > > the
> > > > > > > > > >> request,
> > > > > > > > > >> > or
> > > > > > > > > >> > > on the receive of the success response, an epoch
> > > wouldn't
> > > > > tell
> > > > > > > me
> > > > > > > > > if I
> > > > > > > > > >> > can
> > > > > > > > > >> > > successfully retry, as it could be successfully
> > written
> > > > but
> > > > > > AWS
> > > > > > > > > >> dropped
> > > > > > > > > >> > the
> > > > > > > > > >> > > connection for the success response. Since the epoch
> > > would
> > > > > be
> > > > > > > the
> > > > > > > > > same
> > > > > > > > > >> > > (same ledger), I could write duplicates.
> > > > > > > > > >> > >
> > > > > > > > > >> > >
> > > > > > > > > >> > > > We are currently proposing adding a transaction
> > > semantic
> > > > > to
> > > > > > dl
> > > > > > > > to
> > > > > > > > > >> get
> > > > > > > > > >> > rid
> > > > > > > > > >> > > > of the size limitation and the unaware-ness in the
> > > proxy
> > > > > > > client.
> > > > > > > > > >> Here
> > > > > > > > > >> > is
> > > > > > > > > >> > > > our idea -
> > > > > > > > > >> > > > http://mail-archives.apache.
> org/mod_mbox/incubator-
> > > > > > > > distributedlog
> > > > > > > > > >> > > -dev/201609.mbox/%3cCAAC6BxP5YyEHwG0ZCF5soh42X=
> xuYwYm
> > > > > > > > > >> > > <http://mail-archives.apache.
> org/mod_mbox/incubator-
> > > > > > > > > >> > distributedlog%0A-dev/201609.mbox/%
> > > > 3cCAAC6BxP5YyEHwG0ZCF5soh
> > > > > > > > > 42X=xuYwYm>
> > > > > > > > > >> > > L4nXsYBYiofzxpVk6g@mail.gmail.com%3e
> > > > > > > > > >> > >
> > > > > > > > > >> > > > I am not sure if your idea is similar as ours. but
> > > we'd
> > > > > like
> > > > > > > to
> > > > > > > > > >> > > collaborate
> > > > > > > > > >> > > > with the community if anyone has the similar idea.
> > > > > > > > > >> > >
> > > > > > > > > >> > > Our use case would be covered by transaction
> support,
> > > but
> > > > > I'm
> > > > > > > > unsure
> > > > > > > > > >> if
> > > > > > > > > >> > we
> > > > > > > > > >> > > would need something that heavy weight for the
> > > guarantees
> > > > we
> > > > > > > need.
> > > > > > > > > >> > >
> > > > > > > > > >> >
> > > > > > > > > >> > >
> > > > > > > > > >> > > Basically, the high level requirement here is
> "Support
> > > > > > > consistent
> > > > > > > > > >> write
> > > > > > > > > >> > > ordering for single-writer-per-key,
> > > multi-writer-per-log".
> > > > > My
> > > > > > > > hunch
> > > > > > > > > is
> > > > > > > > > >> > > that, with some added guarantees to the proxy (if it
> > > isn't
> > > > > > > already
> > > > > > > > > >> > > supported), and some custom client code on our side
> > for
> > > > > > removing
> > > > > > > > the
> > > > > > > > > >> > > entries that actually succeed to write to
> > DistributedLog
> > > > > from
> > > > > > > the
> > > > > > > > > >> request
> > > > > > > > > >> > > that failed, it should be a relatively easy thing to
> > > > > support.
> > > > > > > > > >> > >
> > > > > > > > > >> >
> > > > > > > > > >> > Yup. I think it should not be very difficult to
> support.
> > > > There
> > > > > > > might
> > > > > > > > > be
> > > > > > > > > >> > some changes in the server side.
> > > > > > > > > >> > Let's figure out what the changes will be. Are you
> guys
> > > > > > interested
> > > > > > > > in
> > > > > > > > > >> > contributing?
> > > > > > > > > >> >
> > > > > > > > > >> > Yes, we would be.
> > > > > > > > > >>
> > > > > > > > > >> As a note, the one thing that we see as an issue with
> the
> > > > client
> > > > > > > side
> > > > > > > > > >> dedupping is how to bound the range of data that needs
> to
> > be
> > > > > > looked
> > > > > > > at
> > > > > > > > > for
> > > > > > > > > >> deduplication. As you can imagine, it is pretty easy to
> > > bound
> > > > > the
> > > > > > > > bottom
> > > > > > > > > >> of
> > > > > > > > > >> the range, as that is just regular checkpointing of the
> > DLSN
> > > > > that
> > > > > > is
> > > > > > > > > >> returned. I'm still not sure if there is any nice way to
> > > time
> > > > > > bound
> > > > > > > > the
> > > > > > > > > >> top
> > > > > > > > > >> end of the range, especially since the proxy owns
> sequence
> > > > > numbers
> > > > > > > > > (which
> > > > > > > > > >> makes sense). I am curious if there is more that can be
> > done
> > > > if
> > > > > > > > > >> deduplication is on the server side. However the main
> > minus
> > > I
> > > > > see
> > > > > > of
> > > > > > > > > >> server
> > > > > > > > > >> side deduplication is that instead of running contingent
> > on
> > > > > there
> > > > > > > > being
> > > > > > > > > a
> > > > > > > > > >> failed client request, it would have to run
> every
> > > > time a
> > > > > > > write
> > > > > > > > > >> happens.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > For a reliable dedup, we probably need a
> > fence-then-getLastDLSN
> > > > > > > > operation -
> > > > > > > > > > so it would guarantee that any non-completed requests
> > issued
> > > > > > > (lost-ack
> > > > > > > > > > requests) before this fence-then-getLastDLSN operation
> will
> > > be
> > > > > > failed
> > > > > > > > and
> > > > > > > > > > they will never land at the log.
> > > > > > > > > >
> > > > > > > > > > the pseudo code would look like below -
> > > > > > > > > >
> > > > > > > > > > write(request) onFailure { t =>
> > > > > > > > > >
> > > > > > > > > > if (t is timeout exception) {
> > > > > > > > > >
> > > > > > > > > > DLSN lastDLSN = fenceThenGetLastDLSN()
> > > > > > > > > > DLSN lastCheckpointedDLSN = ...;
> > > > > > > > > > // find if the request lands between [lastDLSN,
> > > > > > > lastCheckpointedDLSN].
> > > > > > > > > > // if it exists, the write succeeded; otherwise retry.
> > > > > > > > > >
> > > > > > > > > > }
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > }
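A minimal runnable sketch of that retry-with-dedup flow in Java. The ProxyClient interface is a placeholder: fenceThenGetLastDLSN() and readEntriesBetween() are the hypothetical pieces this thread is discussing (no such APIs exist in DL today), and DLSNs are simplified to longs.

import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeoutException;

// Retry a proxy write after a timeout without duplicating it.
public class DedupOnTimeoutSketch {

    interface ProxyClient {
        long write(String stream, byte[] payload) throws Exception;
        long fenceThenGetLastDLSN(String stream) throws Exception;
        List<byte[]> readEntriesBetween(String stream, long fromExclusive, long toInclusive) throws Exception;
    }

    private final ProxyClient client;
    private long lastCheckpointedDLSN; // last position the application knows was written

    DedupOnTimeoutSketch(ProxyClient client, long checkpoint) {
        this.client = client;
        this.lastCheckpointedDLSN = checkpoint;
    }

    // Write payload exactly once: on a lost-ack timeout, fence the stream so the
    // in-flight write can no longer land, then scan the bounded range to decide
    // whether the write succeeded or should be retried.
    long writeOnce(String stream, byte[] payload) throws Exception {
        while (true) {
            try {
                long dlsn = client.write(stream, payload);
                lastCheckpointedDLSN = dlsn;
                return dlsn;
            } catch (TimeoutException lostAck) {
                long lastDLSN = client.fenceThenGetLastDLSN(stream);
                for (byte[] entry : client.readEntriesBetween(stream, lastCheckpointedDLSN, lastDLSN)) {
                    if (Arrays.equals(entry, payload)) {
                        lastCheckpointedDLSN = lastDLSN;
                        return lastDLSN; // the write landed; do not retry
                    }
                }
                // Not in (checkpoint, lastDLSN] and the old write is fenced out: safe to retry.
            }
        }
    }
}

The key point is that once the stream is fenced, the scan over (checkpoint, lastDLSN] is a complete answer: the lost-ack write either appears there or can never appear.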
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Just realized the idea is the same as what Leigh raised in the
> > > > previous
> > > > > > > email
> > > > > > > > > about 'epoch write'. Let me explain more about this idea
> > > (Leigh,
> > > > > feel
> > > > > > > > free
> > > > > > > > > to jump in to fill up your idea).
> > > > > > > > >
> > > > > > > > > - when a log stream is owned, the proxy uses the last
> > > transaction
> > > > > id
> > > > > > as
> > > > > > > > the
> > > > > > > > > epoch
> > > > > > > > > - when a client connects (handshake with the proxy), it
> will
> > > get
> > > > > the
> > > > > > > > epoch
> > > > > > > > > for the stream.
> > > > > > > > > - the writes issued by this client will carry the epoch to
> > the
> > > > > proxy.
> > > > > > > > > - add a new rpc - fenceThenGetLastDLSN - it would force the
> > > proxy
> > > > > to
> > > > > > > bump
> > > > > > > > > the epoch.
> > > > > > > > > - if fenceThenGetLastDLSN happened, all the outstanding
> > writes
> > > > with
> > > > > > old
> > > > > > > > > epoch will be rejected with exceptions (e.g. EpochFenced).
> > > > > > > > > - The DLSN returned from fenceThenGetLastDLSN can be used
> as
> > > the
> > > > > > bound
> > > > > > > > for
> > > > > > > > > deduplication on failures.
> > > > > > > > >
> > > > > > > > > Cameron, does this sound like a solution to your use case?
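For concreteness, a rough sketch of the per-stream epoch bookkeeping this proposal implies on the proxy side; the class and exception names (StreamEpochState, EpochFencedException) are assumptions for illustration, not existing DL code.

import java.util.concurrent.atomic.AtomicLong;

// Per-stream epoch state held by the owning proxy.
public class StreamEpochState {

    public static class EpochFencedException extends Exception {
        EpochFencedException(String message) { super(message); }
    }

    // Seeded from the last transaction id when the proxy acquires the stream.
    private final AtomicLong currentEpoch;

    public StreamEpochState(long lastTransactionId) {
        this.currentEpoch = new AtomicLong(lastTransactionId);
    }

    // Handed to the client at handshake time; epoch-aware writes carry it back.
    public long epoch() {
        return currentEpoch.get();
    }

    // fenceThenGetLastDLSN bumps the epoch so outstanding writes carrying the old
    // epoch are rejected from here on.
    public long fence() {
        return currentEpoch.incrementAndGet();
    }

    // Called for each write that carries an epoch; writes without one skip the check.
    public void checkEpoch(Long writeEpoch) throws EpochFencedException {
        if (writeEpoch != null && writeEpoch != currentEpoch.get()) {
            throw new EpochFencedException(
                "write epoch " + writeEpoch + " != current epoch " + currentEpoch.get());
        }
    }
}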
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >>
> > > > > > > > > >> Maybe something that could fit a similar need that Kafka
> > > does
> > > > > (the
> > > > > > > > last
> > > > > > > > > >> store value for a particular key in a log), such that
> on a
> > > per
> > > > > key
> > > > > > > > basis
> > > > > > > > > >> there could be a sequence number that supports
> > deduplication?
> > > > > Cost
> > > > > > > > seems
> > > > > > > > > >> like it would be high however, and I'm not even sure if
> > > > > bookkeeper
> > > > > > > > > >> supports
> > > > > > > > > >> it.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >> Cheers,
> > > > > > > > > >> Cameron
> > > > > > > > > >>
> > > > > > > > > >> >
> > > > > > > > > >> > >
> > > > > > > > > >> > > Thanks,
> > > > > > > > > >> > > Cameron
> > > > > > > > > >> > >
> > > > > > > > > >> > >
> > > > > > > > > >> > > On Sat, Oct 8, 2016 at 7:35 AM, Leigh Stewart
> > > > > > > > > >> > <lstewart@twitter.com.invalid
> > > > > > > > > >> > > >
> > > > > > > > > >> > > wrote:
> > > > > > > > > >> > >
> > > > > > > > > >> > > > Cameron:
> > > > > > > > > >> > > > Another thing we've discussed but haven't really
> > > thought
> > > > > > > > through -
> > > > > > > > > >> > > > We might be able to support some kind of epoch
> write
> > > > > > request,
> > > > > > > > > where
> > > > > > > > > >> the
> > > > > > > > > >> > > > epoch is guaranteed to have changed if the writer
> > has
> > > > > > changed
> > > > > > > or
> > > > > > > > > the
> > > > > > > > > >> > > ledger
> > > > > > > > > >> > > > was ever fenced off. Writes include an epoch and
> are
> > > > > > rejected
> > > > > > > if
> > > > > > > > > the
> > > > > > > > > >> > > epoch
> > > > > > > > > >> > > > has changed.
> > > > > > > > > >> > > > With a mechanism like this, fencing the ledger off
> > > > after a
> > > > > > > > failure
> > > > > > > > > >> > would
> > > > > > > > > >> > > > ensure any pending writes had either been written
> or
> > > > would
> > > > > > be
> > > > > > > > > >> rejected.
> > > > > > > > > >> > > >
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > On Sat, Oct 8, 2016 at 7:10 AM, Sijie Guo <
> > > > > sijie@apache.org
> > > > > > >
> > > > > > > > > wrote:
> > > > > > > > > >> > > >
> > > > > > > > > >> > > > > Cameron,
> > > > > > > > > >> > > > >
> > > > > > > > > >> > > > > I think both Leigh and Xi had made a few good
> > points
> > > > > about
> > > > > > > > your
> > > > > > > > > >> > > question.
> > > > > > > > > >> > > > >
> > > > > > > > > >> > > > > To add one more point to your question - "but I
> am
> > > not
> > > > > > > > > >> > > > > 100% of how all of the futures in the code
> handle
> > > > > > failures.
> > > > > > > > > >> > > > > If not, where in the code would be the relevant
> > > places
> > > > > to
> > > > > > > add
> > > > > > > > > the
> > > > > > > > > >> > > ability
> > > > > > > > > >> > > > > to do this, and would the project be interested
> > in a
> > > > > pull
> > > > > > > > > >> request?"
> > > > > > > > > >> > > > >
> > > > > > > > > >> > > > > The current proxy and client logic doesn't do
> > > > perfectly
> > > > > on
> > > > > > > > > >> handling
> > > > > > > > > >> > > > > failures (duplicates) - the strategy now is the
> > > client
> > > > > > will
> > > > > > > > > retry
> > > > > > > > > >> as
> > > > > > > > > >> > > best
> > > > > > > > > >> > > > > at it can before throwing exceptions to users.
> The
> > > > code
> > > > > > you
> > > > > > > > are
> > > > > > > > > >> > looking
> > > > > > > > > >> > > > for
> > > > > > > > > >> > > > > - it is on BKLogSegmentWriter for the proxy
> > handling
> > > > > > writes
> > > > > > > > and
> > > > > > > > > >> it is
> > > > > > > > > >> > > on
> > > > > > > > > >> > > > > DistributedLogClientImpl for the proxy client
> > > handling
> > > > > > > > responses
> > > > > > > > > >> from
> > > > > > > > > >> > > > > proxies. Does this help you?
> > > > > > > > > >> > > > >
> > > > > > > > > >> > > > > And also, you are welcome to contribute the pull
> > > > > requests.
> > > > > > > > > >> > > > >
> > > > > > > > > >> > > > > - Sijie
> > > > > > > > > >> > > > >
> > > > > > > > > >> > > > >
> > > > > > > > > >> > > > >
> > > > > > > > > >> > > > > On Tue, Oct 4, 2016 at 3:39 PM, Cameron
> Hatfield <
> > > > > > > > > >> kinguy@gmail.com>
> > > > > > > > > >> > > > wrote:
> > > > > > > > > >> > > > >
> > > > > > > > > >> > > > > > I have a question about the Proxy Client.
> > > Basically,
> > > > > for
> > > > > > > our
> > > > > > > > > use
> > > > > > > > > >> > > cases,
> > > > > > > > > >> > > > > we
> > > > > > > > > >> > > > > > want to guarantee ordering at the key level,
> > > > > > irrespective
> > > > > > > of
> > > > > > > > > the
> > > > > > > > > >> > > > ordering
> > > > > > > > > >> > > > > > of the partition it may be assigned to as a
> > whole.
> > > > Due
> > > > > > to
> > > > > > > > the
> > > > > > > > > >> > source
> > > > > > > > > >> > > of
> > > > > > > > > >> > > > > the
> > > > > > > > > >> > > > > > data (HBase Replication), we cannot guarantee
> > > that a
> > > > > > > single
> > > > > > > > > >> > partition
> > > > > > > > > >> > > > > will
> > > > > > > > > >> > > > > > be owned for writes by the same client. This
> > means
> > > > the
> > > > > > > proxy
> > > > > > > > > >> client
> > > > > > > > > >> > > > works
> > > > > > > > > >> > > > > > well (since we don't care which proxy owns the
> > > > > partition
> > > > > > > we
> > > > > > > > > are
> > > > > > > > > >> > > writing
> > > > > > > > > >> > > > > > to).
> > > > > > > > > >> > > > > >
> > > > > > > > > >> > > > > >
> > > > > > > > > >> > > > > > However, the guarantees we need when writing a
> > > batch
> > > > > > > > consists
> > > > > > > > > >> of:
> > > > > > > > > >> > > > > > Definition of a Batch: The set of records sent
> > to
> > > > the
> > > > > > > > > writeBatch
> > > > > > > > > >> > > > endpoint
> > > > > > > > > >> > > > > > on the proxy
> > > > > > > > > >> > > > > >
> > > > > > > > > >> > > > > > 1. Batch success: If the client receives a
> > success
> > > > > from
> > > > > > > the
> > > > > > > > > >> proxy,
> > > > > > > > > >> > > then
> > > > > > > > > >> > > > > > that batch is successfully written
> > > > > > > > > >> > > > > >
> > > > > > > > > >> > > > > > 2. Inter-Batch ordering : Once a batch has
> been
> > > > > written
> > > > > > > > > >> > successfully
> > > > > > > > > >> > > by
> > > > > > > > > >> > > > > the
> > > > > > > > > >> > > > > > client, when another batch is written, it will
> > be
> > > > > > > guaranteed
> > > > > > > > > to
> > > > > > > > > >> be
> > > > > > > > > >> > > > > ordered
> > > > > > > > > >> > > > > > after the last batch (if it is the same
> stream).
> > > > > > > > > >> > > > > >
> > > > > > > > > >> > > > > > 3. Intra-Batch ordering: Within a batch of
> > writes,
> > > > the
> > > > > > > > records
> > > > > > > > > >> will
> > > > > > > > > >> > > be
> > > > > > > > > >> > > > > > committed in order
> > > > > > > > > >> > > > > >
> > > > > > > > > >> > > > > > 4. Intra-Batch failure ordering: If an
> > individual
> > > > > record
> > > > > > > > fails
> > > > > > > > > >> to
> > > > > > > > > >> > > write
> > > > > > > > > >> > > > > > within a batch, all records after that record
> > will
> > > > not
> > > > > > be
> > > > > > > > > >> written.
> > > > > > > > > >> > > > > >
> > > > > > > > > >> > > > > > 5. Batch Commit: Guarantee that if a batch
> > > returns a
> > > > > > > > success,
> > > > > > > > > it
> > > > > > > > > >> > will
> > > > > > > > > >> > > > be
> > > > > > > > > >> > > > > > written
> > > > > > > > > >> > > > > >
> > > > > > > > > >> > > > > > 6. Read-after-write: Once a batch is
> committed,
> > > > > within a
> > > > > > > > > limited
> > > > > > > > > >> > > > > time-frame
> > > > > > > > > >> > > > > > it will be able to be read. This is required
> in
> > > the
> > > > > case
> > > > > > > of
> > > > > > > > > >> > failure,
> > > > > > > > > >> > > so
> > > > > > > > > >> > > > > > that the client can see what actually got
> > > > committed. I
> > > > > > > > believe
> > > > > > > > > >> the
> > > > > > > > > >> > > > > > time-frame part could be removed if the client
> > can
> > > > > send
> > > > > > in
> > > > > > > > the
> > > > > > > > > >> same
> > > > > > > > > >> > > > > > sequence number that was written previously,
> > since
> > > > it
> > > > > > > would
> > > > > > > > > then
> > > > > > > > > >> > fail
> > > > > > > > > >> > > > and
> > > > > > > > > >> > > > > > we would know that a read needs to occur.
> > > > > > > > > >> > > > > >
> > > > > > > > > >> > > > > >
> > > > > > > > > >> > > > > > So, my basic question is if this is currently
> > > > possible
> > > > > > in
> > > > > > > > the
> > > > > > > > > >> > proxy?
> > > > > > > > > >> > > I
> > > > > > > > > >> > > > > > don't believe it gives these guarantees as it
> > > stands
> > > > > > > today,
> > > > > > > > > but
> > > > > > > > > >> I
> > > > > > > > > >> > am
> > > > > > > > > >> > > > not
> > > > > > > > > >> > > > > > 100% of how all of the futures in the code
> > handle
> > > > > > > failures.
> > > > > > > > > >> > > > > > If not, where in the code would be the
> relevant
> > > > places
> > > > > > to
> > > > > > > > add
> > > > > > > > > >> the
> > > > > > > > > >> > > > ability
> > > > > > > > > >> > > > > > to do this, and would the project be
> interested
> > > in a
> > > > > > pull
> > > > > > > > > >> request?
> > > > > > > > > >> > > > > >
> > > > > > > > > >> > > > > >
> > > > > > > > > >> > > > > > Thanks,
> > > > > > > > > >> > > > > > Cameron
> > > > > > > > > >> > > > > >
> > > > > > > > > >> > > > >
> > > > > > > > > >> > > >
> > > > > > > > > >> > >
> > > > > > > > > >> >
> > > > > > > > > >>
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Proxy Client - Batch Ordering / Commit

Posted by Sijie Guo <si...@twitter.com.INVALID>.
Cameron,

I just granted you the permissions. You should be able to edit the wiki
pages now. Let me know if you encounter any issues.

- Sijie

On Wed, Nov 16, 2016 at 12:20 PM, Cameron Hatfield <ki...@gmail.com> wrote:

> I believe it is:
> https://cwiki.apache.org/confluence/display/~cameron.hatfield
>
> On Wed, Nov 16, 2016 at 12:14 PM, Sijie Guo <si...@twitter.com.invalid>
> wrote:
>
> > Cameron,
> >
> > Can you send me your wiki account name? I can grant you the permission to
> > edit it.
> >
> > - Sijie
> >
> > On Wed, Nov 16, 2016 at 12:11 PM, Cameron Hatfield <ki...@gmail.com>
> > wrote:
> >
> > > Also, would it be possible for me to get wiki access so I will be able
> to
> > > update it / etc?
> > >
> > > -Cameron
> > >
> > > On Wed, Nov 16, 2016 at 11:59 AM, Cameron Hatfield <ki...@gmail.com>
> > > wrote:
> > >
> > > > "A couple of questions" is what I originally wrote, and then the
> > > following
> > > > happened. Sorry about the large swath of them; I'm making sure my
> > > understanding
> > > > of the code base, as well as the DL/Bookkeeper/ZK ecosystem
> > interaction,
> > > > makes sense.
> > > >
> > > > ==General:
> > > > What is an exclusive session? What is it providing over a regular
> > > session?
> > > >
> > > >
> > > > ==Proxy side:
> > > > Should a new streamop be added for the fencing operation, or does it
> > make
> > > > sense to piggyback on an existing one (such as write)?
> > > >
> > > > ====getLastDLSN:
> > > > What should be the return result for:
> > > > A new stream
> > > > A new session, after successful fencing
> > > > A new session, after a change in ownership / first starting up
> > > >
> > > > What is the main use case for getLastDLSN(<stream>, false)? Is this
> to
> > > > guarantee that the recovery process has happened in case of ownership
> > > > failure (I don't have a good understanding of what causes the
> recovery
> > > > process to happen, especially from the reader side)? Or is it to
> handle
> > > the
> > > > lost-ack problem? Since all the rest of the read related things go
> > > through
> > > > the read client, I'm not sure if I see the use case, but it seems
> like
> > > > there would be a large potential for confusion on which to use. What
> > > about
> > > > just a fenceSession op, that always fences, returning the DLSN of the
> > > > fence, and leaving the normal getLastDLSN for the regular read client?
> > > >
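For comparison, the two alternatives above sketched as a hypothetical Java interface; neither signature exists in DL today, and DLSNs are simplified to long positions.

public interface FencingOps {

    // Alternative 1: overload getLastDLSN with a fence flag, so
    // getLastDLSN(stream, false) is a plain read and getLastDLSN(stream, true) fences.
    long getLastDLSN(String stream, boolean fence) throws Exception;

    // Alternative 2: a dedicated op that always fences and returns the DLSN of the
    // fence, leaving the plain getLastDLSN to the regular read client.
    long fenceSession(String stream) throws Exception;
}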
> > > > ====Fencing:
> > > > When a fence session occurs, what call needs to be made to make sure
> > any
> > > > outstanding writes are flushed and committed (so that we guarantee
> the
> > > > client will be able to read anything that was in the write queue)?
> > > > Is there a guaranteed ordering for things written in the future queue
> > for
> > > > AsyncLogWriter (I'm not quite confident that I was able to accurately
> > > > follow the logic, as there are many parts of the code that write,
> have
> > > > queues, heartbeat, etc)?
> > > >
> > > > ====SessionID:
> > > > What is the default sessionid / transactionid for a new stream? I
> > assume
> > > > this would just be the first control record
> > > >
> > > > ======Should all streams have a sessionid by default, regardless if
> it
> > is
> > > > never used by a client (aka, every time ownership changes, a new
> control
> > > > record is generated, and a sessionid is stored)?
> > > > Main edge case that would have to be handled is if a client writes
> with
> > > an
> > > > old sessionid, but the owner has changed and has yet to create a
> > > sessionid.
> > > > This should be handled by the "non-matching sessionid" rule, since
> the
> > > > invalid sessionid wouldn't match the passed sessionid, which should
> > cause
> > > > the client to get a new sessionid.
> > > >
> > > > ======Where in the code does it make sense to own the session, the
> > stream
> > > > interfaces / classes? Should they pass that information down to the
> > ops,
> > > or
> > > > do the sessionid check within?
> > > > My first thought would be that the Stream owns the sessionid and passes it into
> the
> > > > ops (as either a nullable value, or an invalid default value), which
> > then
> > > > do the sessionid check if they care. The main issue is updating the
> > > > sessionid is a bit backwards, as either every op has the ability to
> > > update
> > > > it through some type of return value / direct stream access / etc, or
> > > there
> > > > is a special case in the stream for the fence operation / any other
> > > > operation that can update the session.
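A minimal sketch of that first option (the Stream owning the sessionid and handing it to each op); the names here (Stream, StreamOp, SessionFencedException) are simplified stand-ins rather than the actual proxy classes.

public class SessionOwnershipSketch {

    public static class SessionFencedException extends Exception {}

    interface StreamOp {
        // Ops that care about the session validate the passed id; others ignore it.
        // The nullable return value is the awkward part called out above: only the
        // fence op ever returns a new sessionid, every other op returns null.
        Long execute(Long currentSessionId) throws SessionFencedException;
    }

    static class Stream {
        private Long sessionId; // null until a fence / first control record establishes one

        void submit(StreamOp op) throws SessionFencedException {
            Long updated = op.execute(sessionId);
            if (updated != null) {
                sessionId = updated;
            }
        }
    }
}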
> > > >
> > > > ======For "the owner of the log stream will first advance the
> > transaction
> > > > id generator to claim a new transaction id and write a control record
> > to
> > > > the log stream. ":
> > > > Should "DistributedLogConstants.CONTROL_RECORD_CONTENT" be the type
> of
> > > > control record written?
> > > > Should the "writeControlRecord" on the BKAsyncLogWriter be exposed in
> > the
> > > > AsyncLogWriter interface? Or even in the one within the
> > > segment
> > > > writer? Or should the code be duplicated / pulled out into a helper /
> > > etc?
> > > > (Not a big Java person, so any suggestions on the "Java Way", or at
> > least
> > > > the DL way, to do it would be appreciated)
> > > >
> > > > ======Transaction ID:
> > > > The BKLogSegmentWriter ignores the transaction ids from control
> records
> > > > when it records the "LastTXId." Would that be an issue here for
> > anything?
> > > > It looks like it may do that because it assumes you're calling its
> > local
> > > > function for writing a controlrecord, which uses the lastTxId.
> > > >
> > > >
> > > > ==Thrift Interface:
> > > > ====Should the write response be split out for different calls?
> > > > It seems odd to have a single struct with many optional items that
> are
> > > > filled depending on the call made for every rpc call. This is mostly
> a
> > > > curiosity question, since I assume it comes from the general
> practice
> > > of
> > > > using thrift for a while. Would it at least make sense for the
> > > > getLastDLSN/fence endpoint to have a new struct?
> > > >
> > > > ====Any particular error code that makes sense for session fenced? If
> > we
> > > > want to be close to the HTTP errors, looks like 412 (PRECONDITION
> > FAILED)
> > > > might make the most sense, if a bit generic.
> > > >
> > > > 412 def:
> > > > "The precondition given in one or more of the request-header fields
> > > > evaluated to false when it was tested on the server. This response
> code
> > > > allows the client to place preconditions on the current resource
> > > > metainformation (header field data) and thus prevent the requested
> > method
> > > > from being applied to a resource other than the one intended."
> > > >
> > > > 412 excerpt from the If-match doc:
> > > > "This behavior is most useful when the client wants to prevent an
> > > updating
> > > > method, such as PUT, from modifying a resource that has changed since
> > the
> > > > client last retrieved it."
> > > >
> > > > ====Should we return the sessionid to the client in the
> "fencesession"
> > > > calls?
> > > > Seems like it may be useful when you fence, especially if you have
> some
> > > > type of custom sequencer where it would make sense for search, or for
> > > > debugging.
> > > > Main minus is that it would be easy for users to create an implicit
> > > > requirement that the sessionid is forever a valid transactionid,
> which
> > > may
> > > > not always be the case long term for the project.
> > > >
> > > >
> > > > ==Client:
> > > > ====What is the proposed process for the client retrieving the new
> > > > sessionid?
> > > > A full reconnect? No special case code, but intrusive on the client
> > side,
> > > > and possibly expensive garbage/processing wise. (though this type of
> > > > failure should hopefully be rare enough to not be a problem)
> > > > A call to reset the sessionid? Less intrusive, but all the issues you get
> > > with
> > > > mutable object methods that need to be called in a certain order,
> edge
> > > > cases such as outstanding/buffered requests to the old stream, etc.
> > > > The call could also return the new sessionid, making it a good call
> for
> > > > storing or debugging the value.
> > > >
> > > > ====Session Fenced failure:
> > > > Will this put the client into a failure state, stopping all future
> > writes
> > > > until fixed?
> > > > Is it even possible to get this error when ownership changes? The
> > > > connection to the new owner should get a new sessionid on connect,
> so I
> > > > would expect not.
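A minimal client-side sketch of the "reset the sessionid" option: on a session-fenced response, refetch the sessionid and resume. The handshake() call, SessionFencedException and the -1 sentinel are assumed names, not the current proxy client API.

public class SessionResetSketch {

    public static class SessionFencedException extends Exception {}

    interface ProxyConnection {
        long handshake(String stream) throws Exception; // returns the stream's current sessionid
        void write(String stream, long sessionId, byte[] payload) throws Exception;
    }

    private final ProxyConnection conn;
    private long sessionId = -1L; // -1 == no session established yet

    public SessionResetSketch(ProxyConnection conn) {
        this.conn = conn;
    }

    void writeWithSession(String stream, byte[] payload) throws Exception {
        if (sessionId < 0) {
            sessionId = conn.handshake(stream);
        }
        try {
            conn.write(stream, sessionId, payload);
        } catch (SessionFencedException fenced) {
            // The session was fenced (or ownership changed): refresh the sessionid,
            // dedup any buffered requests against the old session, then resume writing.
            sessionId = conn.handshake(stream);
            conn.write(stream, sessionId, payload);
        }
    }
}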
> > > >
> > > > Cheers,
> > > > Cameron
> > > >
> > > > On Tue, Nov 15, 2016 at 2:01 AM, Xi Liu <xi...@gmail.com>
> wrote:
> > > >
> > > >> Thank you, Cameron. Look forward to your comments.
> > > >>
> > > >> - Xi
> > > >>
> > > >> On Sun, Nov 13, 2016 at 1:21 PM, Cameron Hatfield <kinguy@gmail.com
> >
> > > >> wrote:
> > > >>
> > > >> > Sorry, I've been on vacation for the past week, and heads down
> for a
> > > >> > release that is using DL at the end of Nov. I'll take a look at
> this
> > > >> over
> > > >> > the next week, and add any relevant comments. After we are
> finished
> > > with
> > > >> > dev for this release, I am hoping to tackle this next.
> > > >> >
> > > >> > -Cameron
> > > >> >
> > > >> > On Fri, Nov 11, 2016 at 12:07 PM, Sijie Guo <si...@apache.org>
> > wrote:
> > > >> >
> > > >> > > Xi,
> > > >> > >
> > > >> > > Thank you so much for your proposal. I took a look. It looks
> fine
> > to
> > > >> me.
> > > >> > > Cameron, do you have any comments?
> > > >> > >
> > > >> > > Look forward to your pull requests.
> > > >> > >
> > > >> > > - Sijie
> > > >> > >
> > > >> > >
> > > >> > > On Wed, Nov 9, 2016 at 2:34 AM, Xi Liu <xi...@gmail.com>
> > > wrote:
> > > >> > >
> > > >> > > > Cameron,
> > > >> > > >
> > > >> > > > Have you started any work for this? I just updated the
> proposal
> > > >> page -
> > > >> > > > https://cwiki.apache.org/confluence/display/DL/DP-2+-+
> > > >> > > Epoch+Write+Support
> > > >> > > > Maybe we can work together with this.
> > > >> > > >
> > > >> > > > Sijie, Leigh,
> > > >> > > >
> > > >> > > > can you guys help review this to make sure our proposal is in
> > the
> > > >> right
> > > >> > > > direction?
> > > >> > > >
> > > >> > > > - Xi
> > > >> > > >
> > > >> > > > On Tue, Nov 1, 2016 at 3:05 AM, Sijie Guo <si...@apache.org>
> > > wrote:
> > > >> > > >
> > > >> > > > > I created https://issues.apache.org/jira/browse/DL-63 for
> > > >> tracking
> > > >> > the
> > > >> > > > > proposed idea here.
> > > >> > > > >
> > > >> > > > >
> > > >> > > > >
> > > >> > > > > On Wed, Oct 26, 2016 at 4:53 PM, Sijie Guo
> > > >> > <sijieg@twitter.com.invalid
> > > >> > > >
> > > >> > > > > wrote:
> > > >> > > > >
> > > >> > > > > > On Tue, Oct 25, 2016 at 11:30 AM, Cameron Hatfield <
> > > >> > kinguy@gmail.com
> > > >> > > >
> > > >> > > > > > wrote:
> > > >> > > > > >
> > > >> > > > > > > Yes, we are reading the HBase WAL (from their
> replication
> > > >> plugin
> > > >> > > > > > support),
> > > >> > > > > > > and writing that into DL.
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > > > Gotcha.
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > > > >
> > > >> > > > > > > From the sounds of it, yes, it would. Only thing I would
> > say
> > > >> is
> > > >> > > make
> > > >> > > > > the
> > > >> > > > > > > epoch requirement optional, so that if I client doesn't
> > care
> > > >> > about
> > > >> > > > > dupes
> > > >> > > > > > > they don't have to deal with the process of getting a
> new
> > > >> epoch.
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > > > Yup. This should be optional. I can start a wiki page on
> how
> > > we
> > > >> > want
> > > >> > > to
> > > >> > > > > > implement this. Are you interested in contributing to
> this?
> > > >> > > > > >
> > > >> > > > > >
> > > >> > > > > > >
> > > >> > > > > > > -Cameron
> > > >> > > > > > >
> > > >> > > > > > > On Wed, Oct 19, 2016 at 7:43 PM, Sijie Guo
> > > >> > > > <sijieg@twitter.com.invalid
> > > >> > > > > >
> > > >> > > > > > > wrote:
> > > >> > > > > > >
> > > >> > > > > > > > On Wed, Oct 19, 2016 at 7:17 PM, Sijie Guo <
> > > >> sijieg@twitter.com
> > > >> > >
> > > >> > > > > wrote:
> > > >> > > > > > > >
> > > >> > > > > > > > >
> > > >> > > > > > > > >
> > > >> > > > > > > > > On Monday, October 17, 2016, Cameron Hatfield <
> > > >> > > kinguy@gmail.com>
> > > >> > > > > > > wrote:
> > > >> > > > > > > > >
> > > >> > > > > > > > >> Answer inline:
> > > >> > > > > > > > >>
> > > >> > > > > > > > >> On Mon, Oct 17, 2016 at 11:46 AM, Sijie Guo <
> > > >> > sijie@apache.org
> > > >> > > >
> > > >> > > > > > wrote:
> > > >> > > > > > > > >>
> > > >> > > > > > > > >> > Cameron,
> > > >> > > > > > > > >> >
> > > >> > > > > > > > >> > Thank you for your summary. I liked the
> discussion
> > > >> here. I
> > > >> > > > also
> > > >> > > > > > > liked
> > > >> > > > > > > > >> the
> > > >> > > > > > > > >> > summary of your requirement -
> > 'single-writer-per-key,
> > > >> > > > > > > > >> > multiple-writers-per-log'. If I understand
> > correctly,
> > > >> the
> > > >> > > core
> > > >> > > > > > > concern
> > > >> > > > > > > > >> here
> > > >> > > > > > > > >> > is almost 'exact-once' write (or a way to
> explicit
> > > tell
> > > >> > if a
> > > >> > > > > write
> > > >> > > > > > > can
> > > >> > > > > > > > >> be
> > > >> > > > > > > > >> > retried or not).
> > > >> > > > > > > > >> >
> > > >> > > > > > > > >> > Comments inline.
> > > >> > > > > > > > >> >
> > > >> > > > > > > > >> > On Fri, Oct 14, 2016 at 11:17 AM, Cameron
> Hatfield
> > <
> > > >> > > > > > > kinguy@gmail.com>
> > > >> > > > > > > > >> > wrote:
> > > >> > > > > > > > >> >
> > > >> > > > > > > > >> > > > Ah- yes good point (to be clear we're not
> using
> > > the
> > > >> > > proxy
> > > >> > > > > this
> > > >> > > > > > > way
> > > >> > > > > > > > >> > > today).
> > > >> > > > > > > > >> > >
> > > >> > > > > > > > >> > > > > Due to the source of the
> > > >> > > > > > > > >> > > > > data (HBase Replication), we cannot
> guarantee
> > > >> that a
> > > >> > > > > single
> > > >> > > > > > > > >> partition
> > > >> > > > > > > > >> > > will
> > > >> > > > > > > > >> > > > > be owned for writes by the same client.
> > > >> > > > > > > > >> > >
> > > >> > > > > > > > >> > > > Do you mean you *need* to support multiple
> > > writers
> > > >> > > issuing
> > > >> > > > > > > > >> interleaved
> > > >> > > > > > > > >> > > > writes or is it just that they might
> sometimes
> > > >> > > interleave
> > > >> > > > > > writes
> > > >> > > > > > > > and
> > > >> > > > > > > > >> > you
> > > >> > > > > > > > >> > > >don't care?
> > > >> > > > > > > > >> > > How HBase partitions the keys being written
> > > wouldn't
> > > >> > have
> > > >> > > a
> > > >> > > > > > > one->one
> > > >> > > > > > > > >> > > mapping with the partitions we would have in
> > HBase.
> > > >> Even
> > > >> > > if
> > > >> > > > we
> > > >> > > > > > did
> > > >> > > > > > > > >> have
> > > >> > > > > > > > >> > > that alignment when the cluster first started,
> > > HBase
> > > >> > will
> > > >> > > > > > > rebalance
> > > >> > > > > > > > >> what
> > > >> > > > > > > > >> > > servers own what partitions, as well as split
> and
> > > >> merge
> > > >> > > > > > partitions
> > > >> > > > > > > > >> that
> > > >> > > > > > > > >> > > already exist, causing eventual drift from one
> > log
> > > >> per
> > > >> > > > > > partition.
> > > >> > > > > > > > >> > > Because we want ordering guarantees per key
> (row
> > in
> > > >> > > hbase),
> > > >> > > > we
> > > >> > > > > > > > >> partition
> > > >> > > > > > > > >> > > the logs by the key. Since multiple writers are
> > > >> possible
> > > >> > > per
> > > >> > > > > > range
> > > >> > > > > > > > of
> > > >> > > > > > > > >> > keys
> > > >> > > > > > > > >> > > (due to the aforementioned rebalancing /
> > splitting
> > > /
> > > >> etc
> > > >> > > of
> > > >> > > > > > > hbase),
> > > >> > > > > > > > we
> > > >> > > > > > > > >> > > cannot use the core library due to requiring a
> > > single
> > > >> > > writer
> > > >> > > > > for
> > > >> > > > > > > > >> > ordering.
> > > >> > > > > > > > >> > >
> > > >> > > > > > > > >> > > But, for a single log, we don't really care
> about
> > > >> > ordering
> > > >> > > > > aside
> > > >> > > > > > > > from
> > > >> > > > > > > > >> at
> > > >> > > > > > > > >> > > the per-key level. So all we really need to be
> > able
> > > >> to
> > > >> > > > handle
> > > >> > > > > is
> > > >> > > > > > > > >> > preventing
> > > >> > > > > > > > >> > > duplicates when a failure occurs, and ordering
> > > >> > consistency
> > > >> > > > > > across
> > > >> > > > > > > > >> > requests
> > > >> > > > > > > > >> > > from a single client.
> > > >> > > > > > > > >> > >
> > > >> > > > > > > > >> > > So our general requirements are:
> > > >> > > > > > > > >> > > Write A, Write B
> > > >> > > > > > > > >> > > Timeline: A -> B
> > > >> > > > > > > > >> > > Request B is only made after A has successfully
> > > >> returned
> > > >> > > > > > (possibly
> > > >> > > > > > > > >> after
> > > >> > > > > > > > >> > > retries)
> > > >> > > > > > > > >> > >
> > > >> > > > > > > > >> > > 1) If the write succeeds, it will be durably
> > > exposed
> > > >> to
> > > >> > > > > clients
> > > >> > > > > > > > within
> > > >> > > > > > > > >> > some
> > > >> > > > > > > > >> > > bounded time frame
> > > >> > > > > > > > >> > >
> > > >> > > > > > > > >> >
> > > >> > > > > > > > >> > Guaranteed.
> > > >> > > > > > > > >> >
> > > >> > > > > > > > >>
> > > >> > > > > > > > >> > > 2) If A succeeds and B succeeds, the ordering
> for
> > > the
> > > >> > log
> > > >> > > > will
> > > >> > > > > > be
> > > >> > > > > > > A
> > > >> > > > > > > > >> and
> > > >> > > > > > > > >> > > then B
> > > >> > > > > > > > >> > >
> > > >> > > > > > > > >> >
> > > >> > > > > > > > >> > If I understand correctly here, B is only sent
> > after
> > > A
> > > >> is
> > > >> > > > > > returned,
> > > >> > > > > > > > >> right?
> > > >> > > > > > > > >> > If that's the case, It is guaranteed.
> > > >> > > > > > > > >>
> > > >> > > > > > > > >>
> > > >> > > > > > > > >>
> > > >> > > > > > > > >> >
> > > >> > > > > > > > >> >
> > > >> > > > > > > > >> >
> > > >> > > > > > > > >> > > 3) If A fails due to an error that can be
> relied
> > on
> > > >> to
> > > >> > > *not*
> > > >> > > > > be
> > > >> > > > > > a
> > > >> > > > > > > > lost
> > > >> > > > > > > > >> > ack
> > > >> > > > > > > > >> > > problem, it will never be exposed to the
> client,
> > so
> > > >> it
> > > >> > may
> > > >> > > > > > > > (depending
> > > >> > > > > > > > >> on
> > > >> > > > > > > > >> > > the error) be retried immediately
> > > >> > > > > > > > >> > >
> > > >> > > > > > > > >> >
> > > >> > > > > > > > >> > If it is not a lost-ack problem, the entry will
> be
> > > >> > exposed.
> > > >> > > it
> > > >> > > > > is
> > > >> > > > > > > > >> > guaranteed.
> > > >> > > > > > > > >>
> > > >> > > > > > > > >> Let me try rephrasing the questions, to make sure
> I'm
> > > >> > > > > understanding
> > > >> > > > > > > > >> correctly:
> > > >> > > > > > > > >> If A fails, with an error such as "Unable to create
> > > >> > connection
> > > >> > > > to
> > > >> > > > > > > > >> bookkeeper server", that would be the type of error
> > we
> > > >> would
> > > >> > > > > expect
> > > >> > > > > > to
> > > >> > > > > > > > be
> > > >> > > > > > > > >> able to retry immediately, as that result means no
> > > action
> > > >> > was
> > > >> > > > > taken
> > > >> > > > > > on
> > > >> > > > > > > > any
> > > >> > > > > > > > >> log / etc, so no entry could have been created.
> This
> > is
> > > >> > > > different
> > > >> > > > > > > then a
> > > >> > > > > > > > >> "Connection Timeout" exception, as we just might
> not
> > > have
> > > >> > > > gotten a
> > > >> > > > > > > > >> response
> > > >> > > > > > > > >> in time.
> > > >> > > > > > > > >>
> > > >> > > > > > > > >>
> > > >> > > > > > > > > Gotcha.
> > > >> > > > > > > > >
> > > >> > > > > > > > > The response code returned from proxy can tell if a
> > > >> failure
> > > >> > can
> > > >> > > > be
> > > >> > > > > > > > retried
> > > >> > > > > > > > > safely or not. (We might need to make them well
> > > >> documented)
> > > >> > > > > > > > >
> > > >> > > > > > > > >
> > > >> > > > > > > > >
> > > >> > > > > > > > >>
> > > >> > > > > > > > >> >
> > > >> > > > > > > > >> >
> > > >> > > > > > > > >> > > 4) If A fails due to an error that could be a
> > lost
> > > >> ack
> > > >> > > > problem
> > > >> > > > > > > > >> (network
> > > >> > > > > > > > >> > > connectivity / etc), within a bounded time
> frame
> > it
> > > >> > should
> > > >> > > > be
> > > >> > > > > > > > >> possible to
> > > >> > > > > > > > >> > > find out if the write succeed or failed. Either
> > by
> > > >> > reading
> > > >> > > > > from
> > > >> > > > > > > some
> > > >> > > > > > > > >> > > checkpoint of the log for the changes that
> should
> > > >> have
> > > >> > > been
> > > >> > > > > made
> > > >> > > > > > > or
> > > >> > > > > > > > >> some
> > > >> > > > > > > > >> > > other possible server-side support.
> > > >> > > > > > > > >> > >
> > > >> > > > > > > > >> >
> > > >> > > > > > > > >> > If I understand this correctly, it is a
> duplication
> > > >> issue,
> > > >> > > > > right?
> > > >> > > > > > > > >> >
> > > >> > > > > > > > >> > Can a de-duplication solution work here? Either
> DL
> > or
> > > >> your
> > > >> > > > > client
> > > >> > > > > > > does
> > > >> > > > > > > > >> the
> > > >> > > > > > > > >> > de-duplication?
> > > >> > > > > > > > >> >
> > > >> > > > > > > > >>
> > > >> > > > > > > > >> The requirements I'm mentioning are the ones needed
> > for
> > > >> > > > > client-side
> > > >> > > > > > > > >> dedupping. Since if I can guarantee writes being
> > > exposed
> > > >> > > within
> > > >> > > > > some
> > > >> > > > > > > > time
> > > >> > > > > > > > >> frame, and I can never get into an inconsistently
> > > ordered
> > > >> > > state
> > > >> > > > > when
> > > >> > > > > > > > >> successes happen, when an error occurs, I can
> always
> > > wait
> > > >> > for
> > > >> > > > max
> > > >> > > > > > time
> > > >> > > > > > > > >> frame, read the latest writes, and then dedup
> locally
> > > >> > against
> > > >> > > > the
> > > >> > > > > > > > request
> > > >> > > > > > > > >> I
> > > >> > > > > > > > >> just made.
> > > >> > > > > > > > >>
> > > >> > > > > > > > >> The main thing about that timeframe is that its
> > > basically
> > > >> > the
> > > >> > > > > > addition
> > > >> > > > > > > > of
> > > >> > > > > > > > >> every timeout, all the way down in the system,
> > combined
> > > >> with
> > > >> > > > > > whatever
> > > >> > > > > > > > >> flushing / caching / etc times are at the
> bookkeeper
> > /
> > > >> > client
> > > >> > > > > level
> > > >> > > > > > > for
> > > >> > > > > > > > >> when values are exposed
> > > >> > > > > > > > >
> > > >> > > > > > > > >
> > > >> > > > > > > > > Gotcha.
> > > >> > > > > > > > >
> > > >> > > > > > > > >>
> > > >> > > > > > > > >>
> > > >> > > > > > > > >> >
> > > >> > > > > > > > >> > Is there any ways to identify your write?
> > > >> > > > > > > > >> >
> > > >> > > > > > > > >> > I can think of a case as follow - I want to know
> > what
> > > >> is
> > > >> > > your
> > > >> > > > > > > expected
> > > >> > > > > > > > >> > behavior from the log.
> > > >> > > > > > > > >> >
> > > >> > > > > > > > >> > a)
> > > >> > > > > > > > >> >
> > > >> > > > > > > > >> > If a hbase region server A writes a change of
> key K
> > > to
> > > >> the
> > > >> > > > log,
> > > >> > > > > > the
> > > >> > > > > > > > >> change
> > > >> > > > > > > > >> > is successfully made to the log;
> > > >> > > > > > > > >> > but server A is down before receiving the change.
> > > >> > > > > > > > >> > region server B took over the region that
> contains
> > K,
> > > >> what
> > > >> > > > will
> > > >> > > > > B
> > > >> > > > > > > do?
> > > >> > > > > > > > >> >
> > > >> > > > > > > > >>
> > > >> > > > > > > > >> HBase writes in large chunks (WAL Logs), which its
> > > >> > replication
> > > >> > > > > > system
> > > >> > > > > > > > then
> > > >> > > > > > > > >> handles by replaying in the case of failure. If I'm
> > in
> > > a
> > > >> > > middle
> > > >> > > > > of a
> > > >> > > > > > > > log,
> > > >> > > > > > > > >> and the whole region goes down and gets rescheduled
> > > >> > > elsewhere, I
> > > >> > > > > > will
> > > >> > > > > > > > >> start
> > > >> > > > > > > > >> back up from the beginning of the log I was in the
> > > middle
> > > >> > of.
> > > >> > > > > Using
> > > >> > > > > > > > >> checkpointing + deduping, we should be able to find
> > out
> > > >> > where
> > > >> > > we
> > > >> > > > > > left
> > > >> > > > > > > > off
> > > >> > > > > > > > >> in the log.
> > > >> > > > > > > > >
> > > >> > > > > > > > >
> > > >> > > > > > > > >> >
> > > >> > > > > > > > >> >
> > > >> > > > > > > > >> > b) same as a). but server A was just network
> > > >> partitioned.
> > > >> > > will
> > > >> > > > > > both
> > > >> > > > > > > A
> > > >> > > > > > > > >> and B
> > > >> > > > > > > > >> > write the change of key K?
> > > >> > > > > > > > >> >
> > > >> > > > > > > > >>
> > > >> > > > > > > > >> HBase gives us some guarantees around network
> > > partitions
> > > >> > > > > > (Consistency
> > > >> > > > > > > > over
> > > >> > > > > > > > >> availability for HBase). HBase is a single-master
> > > >> failover
> > > >> > > > > recovery
> > > >> > > > > > > type
> > > >> > > > > > > > >> of
> > > >> > > > > > > > >> system, with zookeeper-based guarantees for single
> > > owners
> > > >> > > > > (writers)
> > > >> > > > > > > of a
> > > >> > > > > > > > >> range of data.
> > > >> > > > > > > > >>
> > > >> > > > > > > > >> >
> > > >> > > > > > > > >> >
> > > >> > > > > > > > >> >
> > > >> > > > > > > > >> > > 5) If A is turned into multiple batches (one
> > large
> > > >> > request
> > > >> > > > > gets
> > > >> > > > > > > > split
> > > >> > > > > > > > >> > into
> > > >> > > > > > > > >> > > multiple smaller ones to the bookkeeper
> backend,
> > > due
> > > >> to
> > > >> > > log
> > > >> > > > > > > rolling
> > > >> > > > > > > > /
> > > >> > > > > > > > >> > size
> > > >> > > > > > > > >> > > / etc):
> > > >> > > > > > > > >> > >   a) The ordering of entries within batches
> have
> > > >> > ordering
> > > >> > > > > > > > consistence
> > > >> > > > > > > > >> > with
> > > >> > > > > > > > >> > > the original request, when exposed in the log
> > > (though
> > > >> > they
> > > >> > > > may
> > > >> > > > > > be
> > > >> > > > > > > > >> > > interleaved with other requests)
> > > >> > > > > > > > >> > >   b) The ordering across batches have ordering
> > > >> > consistence
> > > >> > > > > with
> > > >> > > > > > > the
> > > >> > > > > > > > >> > > original request, when exposed in the log
> (though
> > > >> they
> > > >> > may
> > > >> > > > be
> > > >> > > > > > > > >> interleaved
> > > >> > > > > > > > >> > > with other requests)
> > > >> > > > > > > > >> > >   c) If a batch fails, and cannot be retried /
> is
> > > >> > > > > unsuccessfully
> > > >> > > > > > > > >> retried,
> > > >> > > > > > > > >> > > all batches after the failed batch should not
> be
> > > >> exposed
> > > >> > > in
> > > >> > > > > the
> > > >> > > > > > > log.
> > > >> > > > > > > > >> > Note:
> > > >> > > > > > > > >> > > The batches before and including the failed
> > batch,
> > > >> that
> > > >> > > > ended
> > > >> > > > > up
> > > >> > > > > > > > >> > > succeeding, can show up in the log, again
> within
> > > some
> > > >> > > > bounded
> > > >> > > > > > time
> > > >> > > > > > > > >> range
> > > >> > > > > > > > >> > > for reads by a client.
> > > >> > > > > > > > >> > >
> > > >> > > > > > > > >> >
> > > >> > > > > > > > >> > There is a method 'writeBulk' in
> > DistributedLogClient
> > > >> can
> > > >> > > > > achieve
> > > >> > > > > > > this
> > > >> > > > > > > > >> > guarantee.
> > > >> > > > > > > > >> >
> > > >> > > > > > > > >> > However, I am not very sure about how will you
> > turn A
> > > >> into
> > > >> > > > > > batches.
> > > >> > > > > > > If
> > > >> > > > > > > > >> you
> > > >> > > > > > > > >> > are dividing A into batches,
> > > >> > > > > > > > >> > you can simply control the application write
> > sequence
> > > >> to
> > > >> > > > achieve
> > > >> > > > > > the
> > > >> > > > > > > > >> > guarantee here.
> > > >> > > > > > > > >> >
> > > >> > > > > > > > >> > Can you explain more about this?
> > > >> > > > > > > > >> >
> > > >> > > > > > > > >>
> > > >> > > > > > > > >> In this case, by batches I mean what the proxy does
> > > with
> > > >> the
> > > >> > > > > single
> > > >> > > > > > > > >> request
> > > >> > > > > > > > >> that I send it. If the proxy decides it needs to
> turn
> > > my
> > > >> > > single
> > > >> > > > > > > request
> > > >> > > > > > > > >> into multiple batches of requests, due to log
> > rolling,
> > > >> size
> > > >> > > > > > > limitations,
> > > >> > > > > > > > >> etc, those would be the guarantees I need to be
> able
> > to
> > > >> > > > > reduplicate
> > > >> > > > > > on
> > > >> > > > > > > > the
> > > >> > > > > > > > >> client side.
> > > >> > > > > > > > >
> > > >> > > > > > > > >
> > > >> > > > > > > > > A single record written by #write and A record set
> > (set
> > > of
> > > >> > > > records)
> > > >> > > > > > > > > written by #writeRecordSet are atomic - they will
> not
> > be
> > > >> > broken
> > > >> > > > > down
> > > >> > > > > > > into
> > > >> > > > > > > > > entries (batches). With the correct response code,
> you
> > > >> would
> > > >> > be
> > > >> > > > > able
> > > >> > > > > > to
> > > >> > > > > > > > > tell if it is a lost-ack failure or not. However
> there
> > > is
> > > >> a
> > > >> > > size
> > > >> > > > > > > > limitation
> > > >> > > > > > > > > for this - it can't not go beyond 1MB for current
> > > >> > > implementation.
> > > >> > > > > > > > >
> > > >> > > > > > > > > What is your expected record size?
> > > >> > > > > > > > >
> > > >> > > > > > > > >
> > > >> > > > > > > > >>
> > > >> > > > > > > > >> >
> > > >> > > > > > > > >> >
> > > >> > > > > > > > >> > >
> > > >> > > > > > > > >> > > Since we can guarantee per-key ordering on the
> > > client
> > > >> > > side,
> > > >> > > > we
> > > >> > > > > > > > >> guarantee
> > > >> > > > > > > > >> > > that there is a single writer per-key, just not
> > per
> > > >> log.
> > > >> > > > > > > > >> >
> > > >> > > > > > > > >> >
> > > >> > > > > > > > >> > Do you need fencing guarantee in the case of
> > network
> > > >> > > partition
> > > >> > > > > > > causing
> > > >> > > > > > > > >> > two-writers?
> > > >> > > > > > > > >> >
> > > >> > > > > > > > >> >
> > > >> > > > > > > > >> > > So if there was a
> > > >> > > > > > > > >> > > way to guarantee a single write request as
> being
> > > >> written
> > > >> > > or
> > > >> > > > > not,
> > > >> > > > > > > > >> within a
> > > >> > > > > > > > >> > > certain time frame (since failures should be
> rare
> > > >> > anyways,
> > > >> > > > > this
> > > >> > > > > > is
> > > >> > > > > > > > >> fine
> > > >> > > > > > > > >> > if
> > > >> > > > > > > > >> > > it is expensive), we can then have the client
> > > >> guarantee
> > > >> > > the
> > > >> > > > > > > ordering
> > > >> > > > > > > > >> it
> > > >> > > > > > > > >> > > needs.
> > > >> > > > > > > > >> > >
> > > >> > > > > > > > >> >
> > > >> > > > > > > > >> > This sounds an 'exact-once' write (regarding
> > retries)
> > > >> > > > > requirement
> > > >> > > > > > to
> > > >> > > > > > > > me,
> > > >> > > > > > > > >> > right?
> > > >> > > > > > > > >> >
> > > >> > > > > > > > >> Yes. I'm curious of how this issue is handled by
> > > >> Manhattan,
> > > >> > > > since
> > > >> > > > > > you
> > > >> > > > > > > > can
> > > >> > > > > > > > >> imagine a data store that ends up getting multiple
> > > writes
> > > >> > for
> > > >> > > > the
> > > >> > > > > > same
> > > >> > > > > > > > put
> > > >> > > > > > > > >> / get / etc, would be harder to use, and we are
> > > basically
> > > >> > > trying
> > > >> > > > > to
> > > >> > > > > > > > create
> > > >> > > > > > > > >> a log like that for HBase.
> > > >> > > > > > > > >
> > > >> > > > > > > > >
> > > >> > > > > > > > > Are you guys replacing HBase WAL?
> > > >> > > > > > > > >
> > > >> > > > > > > > > In Manhattan case, the request will be first written
> > to
> > > DL
> > > >> > > > streams
> > > >> > > > > by
> > > >> > > > > > > > > Manhattan coordinator. The Manhattan replica then
> will
> > > >> read
> > > >> > > from
> > > >> > > > > the
> > > >> > > > > > DL
> > > >> > > > > > > > > streams and apply the change. In the lost-ack case,
> > the
> > > MH
> > > >> > > > > > coordinator
> > > >> > > > > > > > will
> > > >> > > > > > > > > just fail the request to client.
> > > >> > > > > > > > >
> > > >> > > > > > > > > My feeling here is your usage for HBase is a bit
> > > different
> > > >> > from
> > > >> > > > how
> > > >> > > > > > we
> > > >> > > > > > > > use
> > > >> > > > > > > > > DL in Manhattan. It sounds like you read from a
> source
> > > >> (HBase
> > > >> > > > WAL)
> > > >> > > > > > and
> > > >> > > > > > > > > write to DL. But I might be wrong.
> > > >> > > > > > > > >
> > > >> > > > > > > > >
> > > >> > > > > > > > >>
> > > >> > > > > > > > >> >
> > > >> > > > > > > > >> >
> > > >> > > > > > > > >> > >
> > > >> > > > > > > > >> > >
> > > >> > > > > > > > >> > > > Cameron:
> > > >> > > > > > > > >> > > > Another thing we've discussed but haven't
> > really
> > > >> > thought
> > > >> > > > > > > through -
> > > >> > > > > > > > >> > > > We might be able to support some kind of
> epoch
> > > >> write
> > > >> > > > > request,
> > > >> > > > > > > > where
> > > >> > > > > > > > >> the
> > > >> > > > > > > > >> > > > epoch is guaranteed to have changed if the
> > writer
> > > >> has
> > > >> > > > > changed
> > > >> > > > > > or
> > > >> > > > > > > > the
> > > >> > > > > > > > >> > > ledger
> > > >> > > > > > > > >> > > > was ever fenced off. Writes include an epoch
> > and
> > > >> are
> > > >> > > > > rejected
> > > >> > > > > > if
> > > >> > > > > > > > the
> > > >> > > > > > > > >> > > epoch
> > > >> > > > > > > > >> > > > has changed.
> > > >> > > > > > > > >> > > > With a mechanism like this, fencing the
> ledger
> > > off
> > > >> > > after a
> > > >> > > > > > > failure
> > > >> > > > > > > > >> > would
> > > >> > > > > > > > >> > > > ensure any pending writes had either been
> > written
> > > >> or
> > > >> > > would
> > > >> > > > > be
> > > >> > > > > > > > >> rejected.
> > > >> > > > > > > > >> > >
> > > >> > > > > > > > >> > > The issue would be how I guarantee the write I
> > > wrote
> > > >> to
> > > >> > > the
> > > >> > > > > > server
> > > >> > > > > > > > was
> > > >> > > > > > > > >> > > written. Since a network issue could happen on
> > the
> > > >> send
> > > >> > of
> > > >> > > > the
> > > >> > > > > > > > >> request,
> > > >> > > > > > > > >> > or
> > > >> > > > > > > > >> > > on the receive of the success response, an
> epoch
> > > >> > wouldn't
> > > >> > > > tell
> > > >> > > > > > me
> > > >> > > > > > > > if I
> > > >> > > > > > > > >> > can
> > > >> > > > > > > > >> > > successfully retry, as it could be successfully
> > > >> written
> > > >> > > but
> > > >> > > > > AWS
> > > >> > > > > > > > >> dropped
> > > >> > > > > > > > >> > the
> > > >> > > > > > > > >> > > connection for the success response. Since the
> > > epoch
> > > >> > would
> > > >> > > > be
> > > >> > > > > > the
> > > >> > > > > > > > same
> > > >> > > > > > > > >> > > (same ledger), I could write duplicates.
> > > >> > > > > > > > >> > >
> > > >> > > > > > > > >> > >
> > > >> > > > > > > > >> > > > We are currently proposing adding a
> transaction
> > > >> > semantic
> > > >> > > > to
> > > >> > > > > dl
> > > >> > > > > > > to
> > > >> > > > > > > > >> get
> > > >> > > > > > > > >> > rid
> > > >> > > > > > > > >> > > > of the size limitation and the unaware-ness
> in
> > > the
> > > >> > proxy
> > > >> > > > > > client.
> > > >> > > > > > > > >> Here
> > > >> > > > > > > > >> > is
> > > >> > > > > > > > >> > > > our idea -
> > > >> > > > > > > > >> > > > http://mail-archives.apache.or
> > > >> g/mod_mbox/incubator-
> > > >> > > > > > > distributedlog
> > > >> > > > > > > > >> > > -dev/201609.mbox/%3cCAAC6BxP5Y
> > > >> yEHwG0ZCF5soh42X=xuYwYm
> > > >> > > > > > > > >> > > <http://mail-archives.apache.
> > > org/mod_mbox/incubator-
> > > >> > > > > > > > >> > distributedlog%0A-dev/201609.mbox/%
> > > >> > > 3cCAAC6BxP5YyEHwG0ZCF5soh
> > > >> > > > > > > > 42X=xuYwYm>
> > > >> > > > > > > > >> > > L4nXsYBYiofzxpVk6g@mail.gmail.com%3e
> > > >> > > > > > > > >> > >
> > > >> > > > > > > > >> > > > I am not sure if your idea is similar as
> ours.
> > > but
> > > >> > we'd
> > > >> > > > like
> > > >> > > > > > to
> > > >> > > > > > > > >> > > collaborate
> > > >> > > > > > > > >> > > > with the community if anyone has the similar
> > > idea.
> > > >> > > > > > > > >> > >
> > > >> > > > > > > > >> > > Our use case would be covered by transaction
> > > support,
> > > >> > but
> > > >> > > > I'm
> > > >> > > > > > > unsure
> > > >> > > > > > > > >> if
> > > >> > > > > > > > >> > we
> > > >> > > > > > > > >> > > would need something that heavy weight for the
> > > >> > guarantees
> > > >> > > we
> > > >> > > > > > need.
> > > >> > > > > > > > >> > >
> > > >> > > > > > > > >> >
> > > >> > > > > > > > >> > >
> > > >> > > > > > > > >> > > Basically, the high level requirement here is
> > > >> "Support
> > > >> > > > > > consistent
> > > >> > > > > > > > >> write
> > > >> > > > > > > > >> > > ordering for single-writer-per-key,
> > > >> > multi-writer-per-log".
> > > >> > > > My
> > > >> > > > > > > hunch
> > > >> > > > > > > > is
> > > >> > > > > > > > >> > > that, with some added guarantees to the proxy
> (if
> > > it
> > > >> > isn't
> > > >> > > > > > already
> > > >> > > > > > > > >> > > supported), and some custom client code on our
> > side
> > > >> for
> > > >> > > > > removing
> > > >> > > > > > > the
> > > >> > > > > > > > >> > > entries that actually succeed to write to
> > > >> DistributedLog
> > > >> > > > from
> > > >> > > > > > the
> > > >> > > > > > > > >> request
> > > >> > > > > > > > >> > > that failed, it should be a relatively easy
> thing
> > > to
> > > >> > > > support.
> > > >> > > > > > > > >> > >
> > > >> > > > > > > > >> >
> > > >> > > > > > > > >> > Yup. I think it should not be very difficult to
> > > >> support.
> > > >> > > There
> > > >> > > > > > might
> > > >> > > > > > > > be
> > > >> > > > > > > > >> > some changes in the server side.
> > > >> > > > > > > > >> > Let's figure out what will the changes be. Are
> you
> > > guys
> > > >> > > > > interested
> > > >> > > > > > > in
> > > >> > > > > > > > >> > contributing?
> > > >> > > > > > > > >> >
> > > >> > > > > > > > >> > Yes, we would be.
> > > >> > > > > > > > >>
> > > >> > > > > > > > >> As a note, the one thing that we see as an issue
> with
> > > the
> > > >> > > client
> > > >> > > > > > side
> > > >> > > > > > > > >> dedupping is how to bound the range of data that
> > needs
> > > >> to be
> > > >> > > > > looked
> > > >> > > > > > at
> > > >> > > > > > > > for
> > > >> > > > > > > > >> deduplication. As you can imagine, it is pretty
> easy
> > to
> > > >> > bound
> > > >> > > > the
> > > >> > > > > > > bottom
> > > >> > > > > > > > >> of
> > > >> > > > > > > > >> the range, as that it just regular checkpointing of
> > the
> > > >> DSLN
> > > >> > > > that
> > > >> > > > > is
> > > >> > > > > > > > >> returned. I'm still not sure if there is any nice
> way
> > > to
> > > >> > time
> > > >> > > > > bound
> > > >> > > > > > > the
> > > >> > > > > > > > >> top
> > > >> > > > > > > > >> end of the range, especially since the proxy owns
> > > >> sequence
> > > >> > > > numbers
> > > >> > > > > > > > (which
> > > >> > > > > > > > >> makes sense). I am curious if there is more that
> can
> > be
> > > >> done
> > > >> > > if
> > > >> > > > > > > > >> deduplication is on the server side. However the
> main
> > > >> minus
> > > >> > I
> > > >> > > > see
> > > >> > > > > of
> > > >> > > > > > > > >> server
> > > >> > > > > > > > >> side deduplication is that instead of running
> > > contingent
> > > >> on
> > > >> > > > there
> > > >> > > > > > > being
> > > >> > > > > > > > a
> > > >> > > > > > > > >> failed client request, instead it would have to run
> > > every
> > > >> > > time a
> > > >> > > > > > write
> > > >> > > > > > > > >> happens.
> > > >> > > > > > > > >
> > > >> > > > > > > > >
> > > >> > > > > > > > > For a reliable dedup, we probably need
> > > >> fence-then-getLastDLSN
> > > >> > > > > > > operation -
> > > >> > > > > > > > > so it would guarantee that any non-completed
> requests
> > > >> issued
> > > >> > > > > > (lost-ack
> > > >> > > > > > > > > requests) before this fence-then-getLastDLSN
> operation
> > > >> will
> > > >> > be
> > > >> > > > > failed
> > > >> > > > > > > and
> > > >> > > > > > > > > they will never land at the log.
> > > >> > > > > > > > >
> > > >> > > > > > > > > the pseudo code would look like below -
> > > >> > > > > > > > >
> > > >> > > > > > > > > write(request) onFailure { t =>
> > > >> > > > > > > > >
> > > >> > > > > > > > > if (t is timeout exception) {
> > > >> > > > > > > > >
> > > >> > > > > > > > > DLSN lastDLSN = fenceThenGetLastDLSN()
> > > >> > > > > > > > > DLSN lastCheckpointedDLSN = ...;
> > > >> > > > > > > > > // find if the request lands between [lastDLSN,
> > > >> > > > > > lastCheckpointedDLSN].
> > > >> > > > > > > > > // if it exists, the write succeed; otherwise retry.
> > > >> > > > > > > > >
> > > >> > > > > > > > > }
> > > >> > > > > > > > >
> > > >> > > > > > > > >
> > > >> > > > > > > > > }
> > > >> > > > > > > > >
> > > >> > > > > > > >
> > > >> > > > > > > >
> > > >> > > > > > > > Just realized the idea is same as what Leigh raised in
> > the
> > > >> > > previous
> > > >> > > > > > email
> > > >> > > > > > > > about 'epoch write'. Let me explain more about this
> idea
> > > >> > (Leigh,
> > > >> > > > feel
> > > >> > > > > > > free
> > > >> > > > > > > > to jump in to fill up your idea).
> > > >> > > > > > > >
> > > >> > > > > > > > - when a log stream is owned,  the proxy use the last
> > > >> > transaction
> > > >> > > > id
> > > >> > > > > as
> > > >> > > > > > > the
> > > >> > > > > > > > epoch
> > > >> > > > > > > > - when a client connects (handshake with the proxy),
> it
> > > will
> > > >> > get
> > > >> > > > the
> > > >> > > > > > > epoch
> > > >> > > > > > > > for the stream.
> > > >> > > > > > > > - the writes issued by this client will carry the
> epoch
> > to
> > > >> the
> > > >> > > > proxy.
> > > >> > > > > > > > - add a new rpc - fenceThenGetLastDLSN - it would
> force
> > > the
> > > >> > proxy
> > > >> > > > to
> > > >> > > > > > bump
> > > >> > > > > > > > the epoch.
> > > >> > > > > > > > - if fenceThenGetLastDLSN happened, all the
> outstanding
> > > >> writes
> > > >> > > with
> > > >> > > > > old
> > > >> > > > > > > > epoch will be rejected with exceptions (e.g.
> > EpochFenced).
> > > >> > > > > > > > - The DLSN returned from fenceThenGetLastDLSN can be
> > used
> > > as
> > > >> > the
> > > >> > > > > bound
> > > >> > > > > > > for
> > > >> > > > > > > > deduplications on failures.
> > > >> > > > > > > >
> > > >> > > > > > > > Cameron, does this sound a solution to your use case?
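
A rough sketch of how a writer could use the epoch/fence flow described above on a
lost-ack timeout. The interface and method names here (EpochAwareClient, writeWithEpoch,
fenceThenGetLastDLSN, readRange) are illustrative assumptions only, not the current
proxy client API; DLSN is the only real DL type referenced.

    import java.nio.ByteBuffer;
    import java.util.List;
    import java.util.concurrent.TimeoutException;

    import com.twitter.distributedlog.DLSN;

    // Hypothetical client surface for the proposed epoch write support.
    interface EpochAwareClient {
        long handshake(String stream) throws Exception;            // returns the current epoch
        DLSN writeWithEpoch(String stream, ByteBuffer data, long epoch) throws Exception;
        DLSN fenceThenGetLastDLSN(String stream) throws Exception; // bumps the epoch, fails old-epoch writes
        List<ByteBuffer> readRange(String stream, DLSN fromExclusive, DLSN toInclusive) throws Exception;
    }

    class DedupWriter {
        private final EpochAwareClient client;
        private final String stream;
        private long epoch;
        private DLSN lastCheckpointedDLSN = DLSN.InitialDLSN;      // last write known to have succeeded

        DedupWriter(EpochAwareClient client, String stream) throws Exception {
            this.client = client;
            this.stream = stream;
            this.epoch = client.handshake(stream);
        }

        DLSN write(ByteBuffer record) throws Exception {
            try {
                DLSN dlsn = client.writeWithEpoch(stream, record.duplicate(), epoch);
                lastCheckpointedDLSN = dlsn;
                return dlsn;
            } catch (TimeoutException lostAck) {
                // Lost-ack case: fence so no outstanding old-epoch write can land later,
                // then scan (lastCheckpointedDLSN, lastDLSN] to see whether this record made it.
                DLSN lastDLSN = client.fenceThenGetLastDLSN(stream);
                epoch = client.handshake(stream);                  // pick up the bumped epoch
                for (ByteBuffer committed : client.readRange(stream, lastCheckpointedDLSN, lastDLSN)) {
                    if (committed.equals(record)) {
                        // The write landed; its exact DLSN could be recovered from the read.
                        return lastDLSN;
                    }
                }
                return write(record);                              // definitely not written; safe to retry
            }
        }
    }
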
> > > >> > > > > > > >
> > > >> > > > > > > >
> > > >> > > > > > > >
> > > >> > > > > > > > >
> > > >> > > > > > > > >
> > > >> > > > > > > > >>
> > > >> > > > > > > > >> Maybe something that could fit a similar need that
> > > Kafka
> > > >> > does
> > > >> > > > (the
> > > >> > > > > > > last
> > > >> > > > > > > > >> store value for a particular key in a log), such
> that
> > > on
> > > >> a
> > > >> > per
> > > >> > > > key
> > > >> > > > > > > basis
> > > >> > > > > > > > >> there could be a sequence number that support
> > > >> deduplication?
> > > >> > > > Cost
> > > >> > > > > > > seems
> > > >> > > > > > > > >> like it would be high however, and I'm not even
> sure
> > if
> > > >> > > > bookkeeper
> > > >> > > > > > > > >> supports
> > > >> > > > > > > > >> it.
> > > >> > > > > > > > >
> > > >> > > > > > > > >
> > > >> > > > > > > > >> Cheers,
> > > >> > > > > > > > >> Cameron
> > > >> > > > > > > > >>
> > > >> > > > > > > > >> >
> > > >> > > > > > > > >> > >
> > > >> > > > > > > > >> > > Thanks,
> > > >> > > > > > > > >> > > Cameron
> > > >> > > > > > > > >> > >
> > > >> > > > > > > > >> > >
> > > >> > > > > > > > >> > > On Sat, Oct 8, 2016 at 7:35 AM, Leigh Stewart
> > > >> > > > > > > > >> > <lstewart@twitter.com.invalid
> > > >> > > > > > > > >> > > >
> > > >> > > > > > > > >> > > wrote:
> > > >> > > > > > > > >> > >
> > > >> > > > > > > > >> > > > Cameron:
> > > >> > > > > > > > >> > > > Another thing we've discussed but haven't
> > really
> > > >> > thought
> > > >> > > > > > > through -
> > > >> > > > > > > > >> > > > We might be able to support some kind of
> epoch
> > > >> write
> > > >> > > > > request,
> > > >> > > > > > > > where
> > > >> > > > > > > > >> the
> > > >> > > > > > > > >> > > > epoch is guaranteed to have changed if the
> > writer
> > > >> has
> > > >> > > > > changed
> > > >> > > > > > or
> > > >> > > > > > > > the
> > > >> > > > > > > > >> > > ledger
> > > >> > > > > > > > >> > > > was ever fenced off. Writes include an epoch
> > and
> > > >> are
> > > >> > > > > rejected
> > > >> > > > > > if
> > > >> > > > > > > > the
> > > >> > > > > > > > >> > > epoch
> > > >> > > > > > > > >> > > > has changed.
> > > >> > > > > > > > >> > > > With a mechanism like this, fencing the
> ledger
> > > off
> > > >> > > after a
> > > >> > > > > > > failure
> > > >> > > > > > > > >> > would
> > > >> > > > > > > > >> > > > ensure any pending writes had either been
> > written
> > > >> or
> > > >> > > would
> > > >> > > > > be
> > > >> > > > > > > > >> rejected.
> > > >> > > > > > > > >> > > >
> > > >> > > > > > > > >> > > >
> > > >> > > > > > > > >> > > > On Sat, Oct 8, 2016 at 7:10 AM, Sijie Guo <
> > > >> > > > sijie@apache.org
> > > >> > > > > >
> > > >> > > > > > > > wrote:
> > > >> > > > > > > > >> > > >
> > > >> > > > > > > > >> > > > > Cameron,
> > > >> > > > > > > > >> > > > >
> > > >> > > > > > > > >> > > > > I think both Leigh and Xi had made a few
> good
> > > >> points
> > > >> > > > about
> > > >> > > > > > > your
> > > >> > > > > > > > >> > > question.
> > > >> > > > > > > > >> > > > >
> > > >> > > > > > > > >> > > > > To add one more point to your question -
> > "but I
> > > >> am
> > > >> > not
> > > >> > > > > > > > >> > > > > 100% of how all of the futures in the code
> > > handle
> > > >> > > > > failures.
> > > >> > > > > > > > >> > > > > If not, where in the code would be the
> > relevant
> > > >> > places
> > > >> > > > to
> > > >> > > > > > add
> > > >> > > > > > > > the
> > > >> > > > > > > > >> > > ability
> > > >> > > > > > > > >> > > > > to do this, and would the project be
> > interested
> > > >> in a
> > > >> > > > pull
> > > >> > > > > > > > >> request?"
> > > >> > > > > > > > >> > > > >
> > > >> > > > > > > > >> > > > > The current proxy and client logic doesn't
> do
> > > >> > > perfectly
> > > >> > > > on
> > > >> > > > > > > > >> handling
> > > >> > > > > > > > >> > > > > failures (duplicates) - the strategy now is
> > the
> > > >> > client
> > > >> > > > > will
> > > >> > > > > > > > retry
> > > >> > > > > > > > >> as
> > > >> > > > > > > > >> > > best
> > > >> > > > > > > > >> > > > > at it can before throwing exceptions to
> > users.
> > > >> The
> > > >> > > code
> > > >> > > > > you
> > > >> > > > > > > are
> > > >> > > > > > > > >> > looking
> > > >> > > > > > > > >> > > > for
> > > >> > > > > > > > >> > > > > - it is on BKLogSegmentWriter for the proxy
> > > >> handling
> > > >> > > > > writes
> > > >> > > > > > > and
> > > >> > > > > > > > >> it is
> > > >> > > > > > > > >> > > on
> > > >> > > > > > > > >> > > > > DistributedLogClientImpl for the proxy
> client
> > > >> > handling
> > > >> > > > > > > responses
> > > >> > > > > > > > >> from
> > > >> > > > > > > > >> > > > > proxies. Does this help you?
> > > >> > > > > > > > >> > > > >
> > > >> > > > > > > > >> > > > > And also, you are welcome to contribute the
> > > pull
> > > >> > > > requests.
> > > >> > > > > > > > >> > > > >
> > > >> > > > > > > > >> > > > > - Sijie
> > > >> > > > > > > > >> > > > >
> > > >> > > > > > > > >> > > > >
> > > >> > > > > > > > >> > > > >
> > > >> > > > > > > > >> > > > > On Tue, Oct 4, 2016 at 3:39 PM, Cameron
> > > Hatfield
> > > >> <
> > > >> > > > > > > > >> kinguy@gmail.com>
> > > >> > > > > > > > >> > > > wrote:
> > > >> > > > > > > > >> > > > >
> > > >> > > > > > > > >> > > > > > I have a question about the Proxy Client.
> > > >> > Basically,
> > > >> > > > for
> > > >> > > > > > our
> > > >> > > > > > > > use
> > > >> > > > > > > > >> > > cases,
> > > >> > > > > > > > >> > > > > we
> > > >> > > > > > > > >> > > > > > want to guarantee ordering at the key
> > level,
> > > >> > > > > irrespective
> > > >> > > > > > of
> > > >> > > > > > > > the
> > > >> > > > > > > > >> > > > ordering
> > > >> > > > > > > > >> > > > > > of the partition it may be assigned to
> as a
> > > >> whole.
> > > >> > > Due
> > > >> > > > > to
> > > >> > > > > > > the
> > > >> > > > > > > > >> > source
> > > >> > > > > > > > >> > > of
> > > >> > > > > > > > >> > > > > the
> > > >> > > > > > > > >> > > > > > data (HBase Replication), we cannot
> > guarantee
> > > >> > that a
> > > >> > > > > > single
> > > >> > > > > > > > >> > partition
> > > >> > > > > > > > >> > > > > will
> > > >> > > > > > > > >> > > > > > be owned for writes by the same client.
> > This
> > > >> means
> > > >> > > the
> > > >> > > > > > proxy
> > > >> > > > > > > > >> client
> > > >> > > > > > > > >> > > > works
> > > >> > > > > > > > >> > > > > > well (since we don't care which proxy
> owns
> > > the
> > > >> > > > partition
> > > >> > > > > > we
> > > >> > > > > > > > are
> > > >> > > > > > > > >> > > writing
> > > >> > > > > > > > >> > > > > > to).
> > > >> > > > > > > > >> > > > > >
> > > >> > > > > > > > >> > > > > >
> > > >> > > > > > > > >> > > > > > However, the guarantees we need when
> > writing
> > > a
> > > >> > batch
> > > >> > > > > > > consists
> > > >> > > > > > > > >> of:
> > > >> > > > > > > > >> > > > > > Definition of a Batch: The set of records
> > > sent
> > > >> to
> > > >> > > the
> > > >> > > > > > > > writeBatch
> > > >> > > > > > > > >> > > > endpoint
> > > >> > > > > > > > >> > > > > > on the proxy
> > > >> > > > > > > > >> > > > > >
> > > >> > > > > > > > >> > > > > > 1. Batch success: If the client receives
> a
> > > >> success
> > > >> > > > from
> > > >> > > > > > the
> > > >> > > > > > > > >> proxy,
> > > >> > > > > > > > >> > > then
> > > >> > > > > > > > >> > > > > > that batch is successfully written
> > > >> > > > > > > > >> > > > > >
> > > >> > > > > > > > >> > > > > > 2. Inter-Batch ordering : Once a batch
> has
> > > been
> > > >> > > > written
> > > >> > > > > > > > >> > successfully
> > > >> > > > > > > > >> > > by
> > > >> > > > > > > > >> > > > > the
> > > >> > > > > > > > >> > > > > > client, when another batch is written, it
> > > will
> > > >> be
> > > >> > > > > > guaranteed
> > > >> > > > > > > > to
> > > >> > > > > > > > >> be
> > > >> > > > > > > > >> > > > > ordered
> > > >> > > > > > > > >> > > > > > after the last batch (if it is the same
> > > >> stream).
> > > >> > > > > > > > >> > > > > >
> > > >> > > > > > > > >> > > > > > 3. Intra-Batch ordering: Within a batch
> of
> > > >> writes,
> > > >> > > the
> > > >> > > > > > > records
> > > >> > > > > > > > >> will
> > > >> > > > > > > > >> > > be
> > > >> > > > > > > > >> > > > > > committed in order
> > > >> > > > > > > > >> > > > > >
> > > >> > > > > > > > >> > > > > > 4. Intra-Batch failure ordering: If an
> > > >> individual
> > > >> > > > record
> > > >> > > > > > > fails
> > > >> > > > > > > > >> to
> > > >> > > > > > > > >> > > write
> > > >> > > > > > > > >> > > > > > within a batch, all records after that
> > record
> > > >> will
> > > >> > > not
> > > >> > > > > be
> > > >> > > > > > > > >> written.
> > > >> > > > > > > > >> > > > > >
> > > >> > > > > > > > >> > > > > > 5. Batch Commit: Guarantee that if a
> batch
> > > >> > returns a
> > > >> > > > > > > success,
> > > >> > > > > > > > it
> > > >> > > > > > > > >> > will
> > > >> > > > > > > > >> > > > be
> > > >> > > > > > > > >> > > > > > written
> > > >> > > > > > > > >> > > > > >
> > > >> > > > > > > > >> > > > > > 6. Read-after-write: Once a batch is
> > > committed,
> > > >> > > > within a
> > > >> > > > > > > > limited
> > > >> > > > > > > > >> > > > > time-frame
> > > >> > > > > > > > >> > > > > > it will be able to be read. This is
> > required
> > > in
> > > >> > the
> > > >> > > > case
> > > >> > > > > > of
> > > >> > > > > > > > >> > failure,
> > > >> > > > > > > > >> > > so
> > > >> > > > > > > > >> > > > > > that the client can see what actually got
> > > >> > > committed. I
> > > >> > > > > > > believe
> > > >> > > > > > > > >> the
> > > >> > > > > > > > >> > > > > > time-frame part could be removed if the
> > > client
> > > >> can
> > > >> > > > send
> > > >> > > > > in
> > > >> > > > > > > the
> > > >> > > > > > > > >> same
> > > >> > > > > > > > >> > > > > > sequence number that was written
> > previously,
> > > >> since
> > > >> > > it
> > > >> > > > > > would
> > > >> > > > > > > > then
> > > >> > > > > > > > >> > fail
> > > >> > > > > > > > >> > > > and
> > > >> > > > > > > > >> > > > > > we would know that a read needs to occur.
> > > >> > > > > > > > >> > > > > >
> > > >> > > > > > > > >> > > > > >
> > > >> > > > > > > > >> > > > > > So, my basic question is if this is
> > currently
> > > >> > > possible
> > > >> > > > > in
> > > >> > > > > > > the
> > > >> > > > > > > > >> > proxy?
> > > >> > > > > > > > >> > > I
> > > >> > > > > > > > >> > > > > > don't believe it gives these guarantees
> as
> > it
> > > >> > stands
> > > >> > > > > > today,
> > > >> > > > > > > > but
> > > >> > > > > > > > >> I
> > > >> > > > > > > > >> > am
> > > >> > > > > > > > >> > > > not
> > > >> > > > > > > > >> > > > > > 100% of how all of the futures in the
> code
> > > >> handle
> > > >> > > > > > failures.
> > > >> > > > > > > > >> > > > > > If not, where in the code would be the
> > > relevant
> > > >> > > places
> > > >> > > > > to
> > > >> > > > > > > add
> > > >> > > > > > > > >> the
> > > >> > > > > > > > >> > > > ability
> > > >> > > > > > > > >> > > > > > to do this, and would the project be
> > > interested
> > > >> > in a
> > > >> > > > > pull
> > > >> > > > > > > > >> request?
> > > >> > > > > > > > >> > > > > >
> > > >> > > > > > > > >> > > > > >
> > > >> > > > > > > > >> > > > > > Thanks,
> > > >> > > > > > > > >> > > > > > Cameron
> > > >> > > > > > > > >> > > > > >
> > > >> > > > > > > > >> > > > >
> > > >> > > > > > > > >> > > >
> > > >> > > > > > > > >> > >
> > > >> > > > > > > > >> >
> > > >> > > > > > > > >>
> > > >> > > > > > > > >
> > > >> > > > > > > >
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > > >
> > > >
> > >
> >
>

Re: Proxy Client - Batch Ordering / Commit

Posted by Cameron Hatfield <ki...@gmail.com>.
I believe it is:
https://cwiki.apache.org/confluence/display/~cameron.hatfield

On Wed, Nov 16, 2016 at 12:14 PM, Sijie Guo <si...@twitter.com.invalid>
wrote:

> Cameron,
>
> Can you send me your wiki account name? I can grant you the permission to
> edit it.
>
> - Sijie
>
> On Wed, Nov 16, 2016 at 12:11 PM, Cameron Hatfield <ki...@gmail.com>
> wrote:
>
> > Also, would it be possible for me to get wiki access so I will be able to
> > update it / etc?
> >
> > -Cameron
> >
> > On Wed, Nov 16, 2016 at 11:59 AM, Cameron Hatfield <ki...@gmail.com>
> > wrote:
> >
> > > "A couple of questions" is what I originally wrote, and then the
> > following
> > > happened. Sorry about the large swath of them, making sure my
> > understanding
> > > of the code base, as well as the DL/Bookkeeper/ZK ecosystem
> interaction,
> > > makes sense.
> > >
> > > ==General:
> > > What is an exclusive session? What is it providing over a regular
> > session?
> > >
> > >
> > > ==Proxy side:
> > > Should a new streamop be added for the fencing operation, or does it
> make
> > > sense to piggyback on an existing one (such as write)?
> > >
> > > ====getLastDLSN:
> > > What should be the return result for:
> > > A new stream
> > > A new session, after successful fencing
> > > A new session, after a change in ownership / first starting up
> > >
> > > What is the main use case for getLastDLSN(<stream>, false)? Is this to
> > > guarantee that the recovery process has happened in case of ownership
> > > failure (I don't have a good understanding of what causes the recovery
> > > process to happen, especially from the reader side)? Or is it to handle
> > the
> > > lost-ack problem? Since all the rest of the read related things go
> > through
> > > the read client, I'm not sure if I see the use case, but it seems like
> > > there would be a large potential for confusion on which to use. What
> > about
> > > just a fenceSession op, that always fences, returning the DLSN of the
> > > fence, and leaving the normal getLastDLSN for the regular read client?
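
Purely for illustration, the two shapes under discussion might look something like this
as a Java-style interface (names are hypothetical, not the existing thrift service;
DLSN as in com.twitter.distributedlog.DLSN):

    // Option A: a single endpoint where a flag controls whether to fence.
    interface ProxyReadOpsA {
        DLSN getLastDLSN(String stream, boolean fence);
    }

    // Option B: keep getLastDLSN for plain reads and add an explicit op that
    // always fences the session and returns the DLSN at which it fenced.
    interface ProxyReadOpsB {
        DLSN getLastDLSN(String stream);
        DLSN fenceSession(String stream);
    }
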
> > >
> > > ====Fencing:
> > > When a fence session occurs, what call needs to be made to make sure
> any
> > > outstanding writes are flushed and committed (so that we guarantee the
> > > client will be able to read anything that was in the write queue)?
> > > Is there a guaranteed ordering for things written in the future queue
> for
> > > AsyncLogWriter (I'm not quite confident that I was able to accurately
> > > follow the logic, as there are many parts of the code that write, have
> > > queues, heartbeat, etc)?
> > >
> > > ====SessionID:
> > > What is the default sessionid / transactionid for a new stream? I
> assume
> > > this would just be the first control record
> > >
> > > ======Should all streams have a sessionid by default, regardless of whether it
> is
> > > ever used by a client (aka, every time ownership changes, a new control
> > > record is generated, and a sessionid is stored)?
> > > Main edge case that would have to be handled is if a client writes with
> > an
> > > old sessionid, but the owner has changed and has yet to create a
> > sessionid.
> > > This should be handled by the "non-matching sessionid" rule, since the
> > > invalid sessionid wouldn't match the passed sessionid, which should
> cause
> > > the client to get a new sessionid.
> > >
> > > ======Where in the code does it make sense to own the session, the
> stream
> > > interfaces / classes? Should they pass that information down to the
> ops,
> > or
> > > do the sessionid check within?
> > > My first thought would be that Stream owns the sessionid and passes it into the
> > > ops (as either a nullable value, or an invalid default value), which
> then
> > > do the sessionid check if they care. The main issue is that updating the
> > > sessionid is a bit backwards, as either every op has the ability to
> > update
> > > it through some type of return value / direct stream access / etc, or
> > there
> > > is a special case in the stream for the fence operation / any other
> > > operation that can update the session.
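
A minimal sketch of that split, assuming hypothetical Stream/StreamOp shapes (the real
classes in the proxy differ):

    // The stream owns the session id and hands it to each op; an op that can
    // change the session (e.g. the fence op) hands the new id back.
    class StreamSession {
        static final long INVALID_SESSION = -1L;
        private volatile long sessionId = INVALID_SESSION;
        long current() { return sessionId; }
        void update(long newSessionId) { sessionId = newSessionId; }
    }

    interface StreamOp {
        // Returns the (possibly unchanged) session id after executing.
        long execute(long currentSessionId) throws Exception;
    }

    class Stream {
        private final StreamSession session = new StreamSession();

        void submit(StreamOp op) throws Exception {
            long maybeNew = op.execute(session.current());
            if (maybeNew != session.current()) {
                session.update(maybeNew);   // only session-changing ops take this path
            }
        }
    }
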
> > >
> > > ======For "the owner of the log stream will first advance the
> transaction
> > > id generator to claim a new transaction id and write a control record
> to
> > > the log stream. ":
> > > Should "DistributedLogConstants.CONTROL_RECORD_CONTENT" be the type of
> > > control record written?
> > > Should the "writeControlRecord" on the BKAsyncLogWriter be exposed in
> the
> > > AsyncLogWriter interface be exposed?  Or even in the one within the
> > segment
> > > writer? Or should the code be duplicated / pulled out into a helper /
> > etc?
> > > (Not a big Java person, so any suggestions on the "Java Way", or at
> least
> > > the DL way, to do it would be appreciated)
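
Very roughly, that session-establishment step might look like the following. The
interfaces here are stand-ins for the real transaction id generator and writer, and the
empty payload is just a placeholder for the CONTROL_RECORD_CONTENT idea mentioned above:

    // Hypothetical: on acquiring ownership of a stream, claim a fresh transaction
    // id and persist it as a control record; that id becomes the session/epoch.
    final class SessionBootstrap {
        interface TransactionIdGenerator { long advanceAndGet(); }
        interface ControlRecordWriter {
            void writeControlRecord(long txnId, byte[] payload) throws Exception;
        }

        private static final byte[] CONTROL_PAYLOAD = new byte[0];     // stand-in payload

        static long establishSession(TransactionIdGenerator txnIds,
                                     ControlRecordWriter writer) throws Exception {
            long sessionId = txnIds.advanceAndGet();                    // claim a new txn id
            writer.writeControlRecord(sessionId, CONTROL_PAYLOAD);      // make it durable
            return sessionId;                                           // the stream's new session id
        }
    }
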
> > >
> > > ======Transaction ID:
> > > The BKLogSegmentWriter ignores the transaction ids from control records
> > > when it records the "LastTXId." Would that be an issue here for
> anything?
> > > It looks like it may do that because it assumes you're calling its
> local
> > > function for writing a control record, which uses the lastTxId.
> > >
> > >
> > > ==Thrift Interface:
> > > ====Should the write response be split out for different calls?
> > > It seems odd to have a single struct with many optional items that are
> > > filled depending on the call made for every rpc call. This is mostly a
> > > curiosity question, since I assume it comes from the general practices
> > from
> > > using thrift for a while. Would it at least make sense for the
> > > getLastDLSN/fence endpoint to have a new struct?
> > >
> > > ====Any particular error code that makes sense for session fenced? If
> we
> > > want to be close to the HTTP errors, looks like 412 (PRECONDITION
> FAILED)
> > > might make the most sense, if a bit generic.
> > >
> > > 412 def:
> > > "The precondition given in one or more of the request-header fields
> > > evaluated to false when it was tested on the server. This response code
> > > allows the client to place preconditions on the current resource
> > > metainformation (header field data) and thus prevent the requested
> method
> > > from being applied to a resource other than the one intended."
> > >
> > > 412 excerpt from the If-match doc:
> > > "This behavior is most useful when the client wants to prevent an
> > updating
> > > method, such as PUT, from modifying a resource that has changed since
> the
> > > client last retrieved it."
> > >
> > > ====Should we return the sessionid to the client in the "fencesession"
> > > calls?
> > > Seems like it may be useful when you fence, especially if you have some
> > > type of custom sequencer where it would make sense for search, or for
> > > debugging.
> > > Main minus is that it would be easy for users to create an implicit
> > > requirement that the sessionid is forever a valid transactionid, which
> > may
> > > not always be the case long term for the project.
> > >
> > >
> > > ==Client:
> > > ====What is the proposed process for the client retrieving the new
> > > sessionid?
> > > A full reconnect? No special case code, but intrusive on the client
> side,
> > > and possibly expensive garbage/processing wise. (though this type of
> > > failure should hopefully be rare enough to not be a problem)
> > > A call to reset the sessionid? Less intrusive, but with all the issues you get
> > with
> > > mutable object methods that need to be called in a certain order, edge
> > > cases such as outstanding/buffered requests to the old stream, etc.
> > > The call could also return the new sessionid, making it a good call for
> > > storing or debugging the value.
> > >
> > > ====Session Fenced failure:
> > > Will this put the client into a failure state, stopping all future
> writes
> > > until fixed?
> > > Is it even possible to get this error when ownership changes? The
> > > connection to the new owner should get a new sessionid on connect, so I
> > > would expect not.
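
For the client side, one possibility, assuming a reset-style call rather than a full
reconnect (names again hypothetical), is that a session-fenced error never puts the
writer into a terminal failure state; it just refreshes the session, dedups, and
carries on:

    class SessionFencedException extends Exception {}

    interface ProxyClient {
        long resetSession(String stream) throws Exception;    // fetch/refresh the session id
        void write(String stream, byte[] record, long sessionId) throws Exception;
        boolean alreadyWritten(String stream, byte[] record) throws Exception;  // dedup check
    }

    class SessionAwareWriter {
        private final ProxyClient client;
        private final String stream;
        private long sessionId;

        SessionAwareWriter(ProxyClient client, String stream) throws Exception {
            this.client = client;
            this.stream = stream;
            this.sessionId = client.resetSession(stream);      // initial handshake
        }

        void write(byte[] record) throws Exception {
            while (true) {
                try {
                    client.write(stream, record, sessionId);
                    return;
                } catch (SessionFencedException fenced) {
                    // Refresh the session and dedup before retrying this record.
                    sessionId = client.resetSession(stream);
                    if (client.alreadyWritten(stream, record)) {
                        return;
                    }
                }
            }
        }
    }
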
> > >
> > > Cheers,
> > > Cameron
> > >
> > > On Tue, Nov 15, 2016 at 2:01 AM, Xi Liu <xi...@gmail.com> wrote:
> > >
> > >> Thank you, Cameron. Look forward to your comments.
> > >>
> > >> - Xi
> > >>
> > >> On Sun, Nov 13, 2016 at 1:21 PM, Cameron Hatfield <ki...@gmail.com>
> > >> wrote:
> > >>
> > >> > Sorry, I've been on vacation for the past week, and heads down for a
> > >> > release that is using DL at the end of Nov. I'll take a look at this
> > >> over
> > >> > the next week, and add any relevant comments. After we are finished
> > with
> > >> > dev for this release, I am hoping to tackle this next.
> > >> >
> > >> > -Cameron
> > >> >
> > >> > On Fri, Nov 11, 2016 at 12:07 PM, Sijie Guo <si...@apache.org>
> wrote:
> > >> >
> > >> > > Xi,
> > >> > >
> > >> > > Thank you so much for your proposal. I took a look. It looks fine
> to
> > >> me.
> > >> > > Cameron, do you have any comments?
> > >> > >
> > >> > > Look forward to your pull requests.
> > >> > >
> > >> > > - Sijie
> > >> > >
> > >> > >
> > >> > > On Wed, Nov 9, 2016 at 2:34 AM, Xi Liu <xi...@gmail.com>
> > wrote:
> > >> > >
> > >> > > > Cameron,
> > >> > > >
> > >> > > > Have you started any work for this? I just updated the proposal
> > >> page -
> > >> > > > https://cwiki.apache.org/confluence/display/DL/DP-2+-+
> > >> > > Epoch+Write+Support
> > >> > > > Maybe we can work together with this.
> > >> > > >
> > >> > > > Sijie, Leigh,
> > >> > > >
> > >> > > > can you guys help review this to make sure our proposal is in
> the
> > >> right
> > >> > > > direction?
> > >> > > >
> > >> > > > - Xi
> > >> > > >
> > >> > > > On Tue, Nov 1, 2016 at 3:05 AM, Sijie Guo <si...@apache.org>
> > wrote:
> > >> > > >
> > >> > > > > I created https://issues.apache.org/jira/browse/DL-63 for
> > >> tracking
> > >> > the
> > >> > > > > proposed idea here.
> > >> > > > >
> > >> > > > >
> > >> > > > >
> > >> > > > > On Wed, Oct 26, 2016 at 4:53 PM, Sijie Guo
> > >> > <sijieg@twitter.com.invalid
> > >> > > >
> > >> > > > > wrote:
> > >> > > > >
> > >> > > > > > On Tue, Oct 25, 2016 at 11:30 AM, Cameron Hatfield <
> > >> > kinguy@gmail.com
> > >> > > >
> > >> > > > > > wrote:
> > >> > > > > >
> > >> > > > > > > Yes, we are reading the HBase WAL (from their replication
> > >> plugin
> > >> > > > > > support),
> > >> > > > > > > and writing that into DL.
> > >> > > > > > >
> > >> > > > > >
> > >> > > > > > Gotcha.
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > >
> > >> > > > > > > From the sounds of it, yes, it would. Only thing I would
> say
> > >> is
> > >> > > make
> > >> > > > > the
> > >> > > > > > > epoch requirement optional, so that if I client doesn't
> care
> > >> > about
> > >> > > > > dupes
> > >> > > > > > > they don't have to deal with the process of getting a new
> > >> epoch.
> > >> > > > > > >
> > >> > > > > >
> > >> > > > > > Yup. This should be optional. I can start a wiki page on how
> > we
> > >> > want
> > >> > > to
> > >> > > > > > implement this. Are you interested in contributing to this?
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > >
> > >> > > > > > > -Cameron
> > >> > > > > > >
> > >> > > > > > > On Wed, Oct 19, 2016 at 7:43 PM, Sijie Guo
> > >> > > > <sijieg@twitter.com.invalid
> > >> > > > > >
> > >> > > > > > > wrote:
> > >> > > > > > >
> > >> > > > > > > > On Wed, Oct 19, 2016 at 7:17 PM, Sijie Guo <
> > >> sijieg@twitter.com
> > >> > >
> > >> > > > > wrote:
> > >> > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > > > On Monday, October 17, 2016, Cameron Hatfield <
> > >> > > kinguy@gmail.com>
> > >> > > > > > > wrote:
> > >> > > > > > > > >
> > >> > > > > > > > >> Answer inline:
> > >> > > > > > > > >>
> > >> > > > > > > > >> On Mon, Oct 17, 2016 at 11:46 AM, Sijie Guo <
> > >> > sijie@apache.org
> > >> > > >
> > >> > > > > > wrote:
> > >> > > > > > > > >>
> > >> > > > > > > > >> > Cameron,
> > >> > > > > > > > >> >
> > >> > > > > > > > >> > Thank you for your summary. I liked the discussion
> > >> here. I
> > >> > > > also
> > >> > > > > > > liked
> > >> > > > > > > > >> the
> > >> > > > > > > > >> > summary of your requirement -
> 'single-writer-per-key,
> > >> > > > > > > > >> > multiple-writers-per-log'. If I understand
> correctly,
> > >> the
> > >> > > core
> > >> > > > > > > concern
> > >> > > > > > > > >> here
> > >> > > > > > > > >> > is almost 'exact-once' write (or a way to explicit
> > tell
> > >> > if a
> > >> > > > > write
> > >> > > > > > > can
> > >> > > > > > > > >> be
> > >> > > > > > > > >> > retried or not).
> > >> > > > > > > > >> >
> > >> > > > > > > > >> > Comments inline.
> > >> > > > > > > > >> >
> > >> > > > > > > > >> > On Fri, Oct 14, 2016 at 11:17 AM, Cameron Hatfield
> <
> > >> > > > > > > kinguy@gmail.com>
> > >> > > > > > > > >> > wrote:
> > >> > > > > > > > >> >
> > >> > > > > > > > >> > > > Ah- yes good point (to be clear we're not using
> > the
> > >> > > proxy
> > >> > > > > this
> > >> > > > > > > way
> > >> > > > > > > > >> > > today).
> > >> > > > > > > > >> > >
> > >> > > > > > > > >> > > > > Due to the source of the
> > >> > > > > > > > >> > > > > data (HBase Replication), we cannot guarantee
> > >> that a
> > >> > > > > single
> > >> > > > > > > > >> partition
> > >> > > > > > > > >> > > will
> > >> > > > > > > > >> > > > > be owned for writes by the same client.
> > >> > > > > > > > >> > >
> > >> > > > > > > > >> > > > Do you mean you *need* to support multiple
> > writers
> > >> > > issuing
> > >> > > > > > > > >> interleaved
> > >> > > > > > > > >> > > > writes or is it just that they might sometimes
> > >> > > interleave
> > >> > > > > > writes
> > >> > > > > > > > and
> > >> > > > > > > > >> > you
> > >> > > > > > > > >> > > >don't care?
> > >> > > > > > > > >> > > How HBase partitions the keys being written
> > wouldn't
> > >> > have
> > >> > > a
> > >> > > > > > > one->one
> > >> > > > > > > > >> > > mapping with the partitions we would have in
> HBase.
> > >> Even
> > >> > > if
> > >> > > > we
> > >> > > > > > did
> > >> > > > > > > > >> have
> > >> > > > > > > > >> > > that alignment when the cluster first started,
> > HBase
> > >> > will
> > >> > > > > > > rebalance
> > >> > > > > > > > >> what
> > >> > > > > > > > >> > > servers own what partitions, as well as split and
> > >> merge
> > >> > > > > > partitions
> > >> > > > > > > > >> that
> > >> > > > > > > > >> > > already exist, causing eventual drift from one
> log
> > >> per
> > >> > > > > > partition.
> > >> > > > > > > > >> > > Because we want ordering guarantees per key (row
> in
> > >> > > hbase),
> > >> > > > we
> > >> > > > > > > > >> partition
> > >> > > > > > > > >> > > the logs by the key. Since multiple writers are
> > >> possible
> > >> > > per
> > >> > > > > > range
> > >> > > > > > > > of
> > >> > > > > > > > >> > keys
> > >> > > > > > > > >> > > (due to the aforementioned rebalancing /
> splitting
> > /
> > >> etc
> > >> > > of
> > >> > > > > > > hbase),
> > >> > > > > > > > we
> > >> > > > > > > > >> > > cannot use the core library due to requiring a
> > single
> > >> > > writer
> > >> > > > > for
> > >> > > > > > > > >> > ordering.
> > >> > > > > > > > >> > >
> > >> > > > > > > > >> > > But, for a single log, we don't really care about
> > >> > ordering
> > >> > > > > aside
> > >> > > > > > > > from
> > >> > > > > > > > >> at
> > >> > > > > > > > >> > > the per-key level. So all we really need to be
> able
> > >> to
> > >> > > > handle
> > >> > > > > is
> > >> > > > > > > > >> > preventing
> > >> > > > > > > > >> > > duplicates when a failure occurs, and ordering
> > >> > consistency
> > >> > > > > > across
> > >> > > > > > > > >> > requests
> > >> > > > > > > > >> > > from a single client.
> > >> > > > > > > > >> > >
> > >> > > > > > > > >> > > So our general requirements are:
> > >> > > > > > > > >> > > Write A, Write B
> > >> > > > > > > > >> > > Timeline: A -> B
> > >> > > > > > > > >> > > Request B is only made after A has successfully
> > >> returned
> > >> > > > > > (possibly
> > >> > > > > > > > >> after
> > >> > > > > > > > >> > > retries)
> > >> > > > > > > > >> > >
> > >> > > > > > > > >> > > 1) If the write succeeds, it will be durably
> > exposed
> > >> to
> > >> > > > > clients
> > >> > > > > > > > within
> > >> > > > > > > > >> > some
> > >> > > > > > > > >> > > bounded time frame
> > >> > > > > > > > >> > >
> > >> > > > > > > > >> >
> > >> > > > > > > > >> > Guaranteed.
> > >> > > > > > > > >> >
> > >> > > > > > > > >>
> > >> > > > > > > > >> > > 2) If A succeeds and B succeeds, the ordering for
> > the
> > >> > log
> > >> > > > will
> > >> > > > > > be
> > >> > > > > > > A
> > >> > > > > > > > >> and
> > >> > > > > > > > >> > > then B
> > >> > > > > > > > >> > >
> > >> > > > > > > > >> >
> > >> > > > > > > > >> > If I understand correctly here, B is only sent
> after
> > A
> > >> is
> > >> > > > > > returned,
> > >> > > > > > > > >> right?
> > >> > > > > > > > >> > If that's the case, It is guaranteed.
> > >> > > > > > > > >>
> > >> > > > > > > > >>
> > >> > > > > > > > >>
> > >> > > > > > > > >> >
> > >> > > > > > > > >> >
> > >> > > > > > > > >> >
> > >> > > > > > > > >> > > 3) If A fails due to an error that can be relied
> on
> > >> to
> > >> > > *not*
> > >> > > > > be
> > >> > > > > > a
> > >> > > > > > > > lost
> > >> > > > > > > > >> > ack
> > >> > > > > > > > >> > > problem, it will never be exposed to the client,
> so
> > >> it
> > >> > may
> > >> > > > > > > > (depending
> > >> > > > > > > > >> on
> > >> > > > > > > > >> > > the error) be retried immediately
> > >> > > > > > > > >> > >
> > >> > > > > > > > >> >
> > >> > > > > > > > >> > If it is not a lost-ack problem, the entry will be
> > >> > exposed.
> > >> > > it
> > >> > > > > is
> > >> > > > > > > > >> > guaranteed.
> > >> > > > > > > > >>
> > >> > > > > > > > >> Let me try rephrasing the questions, to make sure I'm
> > >> > > > > understanding
> > >> > > > > > > > >> correctly:
> > >> > > > > > > > >> If A fails, with an error such as "Unable to create
> > >> > connection
> > >> > > > to
> > >> > > > > > > > >> bookkeeper server", that would be the type of error
> we
> > >> would
> > >> > > > > expect
> > >> > > > > > to
> > >> > > > > > > > be
> > >> > > > > > > > >> able to retry immediately, as that result means no
> > action
> > >> > was
> > >> > > > > taken
> > >> > > > > > on
> > >> > > > > > > > any
> > >> > > > > > > > >> log / etc, so no entry could have been created. This
> is
> > >> > > > different
> > >> > > > > > > then a
> > >> > > > > > > > >> "Connection Timeout" exception, as we just might not
> > have
> > >> > > > gotten a
> > >> > > > > > > > >> response
> > >> > > > > > > > >> in time.
> > >> > > > > > > > >>
> > >> > > > > > > > >>
> > >> > > > > > > > > Gotcha.
> > >> > > > > > > > >
> > >> > > > > > > > > The response code returned from proxy can tell if a
> > >> failure
> > >> > can
> > >> > > > be
> > >> > > > > > > > retried
> > >> > > > > > > > > safely or not. (We might need to make them well
> > >> documented)
> > >> > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > > >>
> > >> > > > > > > > >> >
> > >> > > > > > > > >> >
> > >> > > > > > > > >> > > 4) If A fails due to an error that could be a
> lost
> > >> ack
> > >> > > > problem
> > >> > > > > > > > >> (network
> > >> > > > > > > > >> > > connectivity / etc), within a bounded time frame
> it
> > >> > should
> > >> > > > be
> > >> > > > > > > > >> possible to
> > >> > > > > > > > >> > > find out if the write succeed or failed. Either
> by
> > >> > reading
> > >> > > > > from
> > >> > > > > > > some
> > >> > > > > > > > >> > > checkpoint of the log for the changes that should
> > >> have
> > >> > > been
> > >> > > > > made
> > >> > > > > > > or
> > >> > > > > > > > >> some
> > >> > > > > > > > >> > > other possible server-side support.
> > >> > > > > > > > >> > >
> > >> > > > > > > > >> >
> > >> > > > > > > > >> > If I understand this correctly, it is a duplication
> > >> issue,
> > >> > > > > right?
> > >> > > > > > > > >> >
> > >> > > > > > > > >> > Can a de-duplication solution work here? Either DL
> or
> > >> your
> > >> > > > > client
> > >> > > > > > > does
> > >> > > > > > > > >> the
> > >> > > > > > > > >> > de-duplication?
> > >> > > > > > > > >> >
> > >> > > > > > > > >>
> > >> > > > > > > > >> The requirements I'm mentioning are the ones needed
> for
> > >> > > > > client-side
> > >> > > > > > > > >> dedupping. Since if I can guarantee writes being
> > exposed
> > >> > > within
> > >> > > > > some
> > >> > > > > > > > time
> > >> > > > > > > > >> frame, and I can never get into an inconsistently
> > ordered
> > >> > > state
> > >> > > > > when
> > >> > > > > > > > >> successes happen, when an error occurs, I can always
> > wait
> > >> > for
> > >> > > > max
> > >> > > > > > time
> > >> > > > > > > > >> frame, read the latest writes, and then dedup locally
> > >> > against
> > >> > > > the
> > >> > > > > > > > request
> > >> > > > > > > > >> I
> > >> > > > > > > > >> just made.
> > >> > > > > > > > >>
> > >> > > > > > > > >> The main thing about that timeframe is that its
> > basically
> > >> > the
> > >> > > > > > addition
> > >> > > > > > > > of
> > >> > > > > > > > >> every timeout, all the way down in the system,
> combined
> > >> with
> > >> > > > > > whatever
> > >> > > > > > > > >> flushing / caching / etc times are at the bookkeeper
> /
> > >> > client
> > >> > > > > level
> > >> > > > > > > for
> > >> > > > > > > > >> when values are exposed
> > >> > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > > > Gotcha.
> > >> > > > > > > > >
> > >> > > > > > > > >>
> > >> > > > > > > > >>
> > >> > > > > > > > >> >
> > >> > > > > > > > >> > Is there any ways to identify your write?
> > >> > > > > > > > >> >
> > >> > > > > > > > >> > I can think of a case as follow - I want to know
> what
> > >> is
> > >> > > your
> > >> > > > > > > expected
> > >> > > > > > > > >> > behavior from the log.
> > >> > > > > > > > >> >
> > >> > > > > > > > >> > a)
> > >> > > > > > > > >> >
> > >> > > > > > > > >> > If a hbase region server A writes a change of key K
> > to
> > >> the
> > >> > > > log,
> > >> > > > > > the
> > >> > > > > > > > >> change
> > >> > > > > > > > >> > is successfully made to the log;
> > >> > > > > > > > >> > but server A is down before receiving the change.
> > >> > > > > > > > >> > region server B took over the region that contains
> K,
> > >> what
> > >> > > > will
> > >> > > > > B
> > >> > > > > > > do?
> > >> > > > > > > > >> >
> > >> > > > > > > > >>
> > >> > > > > > > > >> HBase writes in large chunks (WAL Logs), which its
> > >> > replication
> > >> > > > > > system
> > >> > > > > > > > then
> > >> > > > > > > > >> handles by replaying in the case of failure. If I'm
> in
> > a
> > >> > > middle
> > >> > > > > of a
> > >> > > > > > > > log,
> > >> > > > > > > > >> and the whole region goes down and gets rescheduled
> > >> > > elsewhere, I
> > >> > > > > > will
> > >> > > > > > > > >> start
> > >> > > > > > > > >> back up from the beginning of the log I was in the
> > middle
> > >> > of.
> > >> > > > > Using
> > >> > > > > > > > >> checkpointing + deduping, we should be able to find
> out
> > >> > where
> > >> > > we
> > >> > > > > > left
> > >> > > > > > > > off
> > >> > > > > > > > >> in the log.
> > >> > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > > >> >
> > >> > > > > > > > >> >
> > >> > > > > > > > >> > b) same as a). but server A was just network
> > >> partitioned.
> > >> > > will
> > >> > > > > > both
> > >> > > > > > > A
> > >> > > > > > > > >> and B
> > >> > > > > > > > >> > write the change of key K?
> > >> > > > > > > > >> >
> > >> > > > > > > > >>
> > >> > > > > > > > >> HBase gives us some guarantees around network
> > partitions
> > >> > > > > > (Consistency
> > >> > > > > > > > over
> > >> > > > > > > > >> availability for HBase). HBase is a single-master
> > >> failover
> > >> > > > > recovery
> > >> > > > > > > type
> > >> > > > > > > > >> of
> > >> > > > > > > > >> system, with zookeeper-based guarantees for single
> > owners
> > >> > > > > (writers)
> > >> > > > > > > of a
> > >> > > > > > > > >> range of data.
> > >> > > > > > > > >>
> > >> > > > > > > > >> >
> > >> > > > > > > > >> >
> > >> > > > > > > > >> >
> > >> > > > > > > > >> > > 5) If A is turned into multiple batches (one
> large
> > >> > request
> > >> > > > > gets
> > >> > > > > > > > split
> > >> > > > > > > > >> > into
> > >> > > > > > > > >> > > multiple smaller ones to the bookkeeper backend,
> > due
> > >> to
> > >> > > log
> > >> > > > > > > rolling
> > >> > > > > > > > /
> > >> > > > > > > > >> > size
> > >> > > > > > > > >> > > / etc):
> > >> > > > > > > > >> > >   a) The ordering of entries within batches have
> > >> > ordering
> > >> > > > > > > > consistence
> > >> > > > > > > > >> > with
> > >> > > > > > > > >> > > the original request, when exposed in the log
> > (though
> > >> > they
> > >> > > > may
> > >> > > > > > be
> > >> > > > > > > > >> > > interleaved with other requests)
> > >> > > > > > > > >> > >   b) The ordering across batches have ordering
> > >> > consistence
> > >> > > > > with
> > >> > > > > > > the
> > >> > > > > > > > >> > > original request, when exposed in the log (though
> > >> they
> > >> > may
> > >> > > > be
> > >> > > > > > > > >> interleaved
> > >> > > > > > > > >> > > with other requests)
> > >> > > > > > > > >> > >   c) If a batch fails, and cannot be retried / is
> > >> > > > > unsuccessfully
> > >> > > > > > > > >> retried,
> > >> > > > > > > > >> > > all batches after the failed batch should not be
> > >> exposed
> > >> > > in
> > >> > > > > the
> > >> > > > > > > log.
> > >> > > > > > > > >> > Note:
> > >> > > > > > > > >> > > The batches before and including the failed
> batch,
> > >> that
> > >> > > > ended
> > >> > > > > up
> > >> > > > > > > > >> > > succeeding, can show up in the log, again within
> > some
> > >> > > > bounded
> > >> > > > > > time
> > >> > > > > > > > >> range
> > >> > > > > > > > >> > > for reads by a client.
> > >> > > > > > > > >> > >
> > >> > > > > > > > >> >
> > >> > > > > > > > >> > There is a method 'writeBulk' in
> DistributedLogClient
> > >> can
> > >> > > > > achieve
> > >> > > > > > > this
> > >> > > > > > > > >> > guarantee.
> > >> > > > > > > > >> >
> > >> > > > > > > > >> > However, I am not very sure about how will you
> turn A
> > >> into
> > >> > > > > > batches.
> > >> > > > > > > If
> > >> > > > > > > > >> you
> > >> > > > > > > > >> > are dividing A into batches,
> > >> > > > > > > > >> > you can simply control the application write
> sequence
> > >> to
> > >> > > > achieve
> > >> > > > > > the
> > >> > > > > > > > >> > guarantee here.
> > >> > > > > > > > >> >
> > >> > > > > > > > >> > Can you explain more about this?
> > >> > > > > > > > >> >
> > >> > > > > > > > >>
> > >> > > > > > > > >> In this case, by batches I mean what the proxy does
> > with
> > >> the
> > >> > > > > single
> > >> > > > > > > > >> request
> > >> > > > > > > > >> that I send it. If the proxy decides it needs to turn
> > my
> > >> > > single
> > >> > > > > > > request
> > >> > > > > > > > >> into multiple batches of requests, due to log
> rolling,
> > >> size
> > >> > > > > > > limitations,
> > >> > > > > > > > >> etc, those would be the guarantees I need to be able
> to
> > >> > > > > reduplicate
> > >> > > > > > on
> > >> > > > > > > > the
> > >> > > > > > > > >> client side.
> > >> > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > > > A single record written by #write and A record set
> (set
> > of
> > >> > > > records)
> > >> > > > > > > > > written by #writeRecordSet are atomic - they will not
> be
> > >> > broken
> > >> > > > > down
> > >> > > > > > > into
> > >> > > > > > > > > entries (batches). With the correct response code, you
> > >> would
> > >> > be
> > >> > > > > able
> > >> > > > > > to
> > >> > > > > > > > > tell if it is a lost-ack failure or not. However there
> > is
> > >> a
> > >> > > size
> > >> > > > > > > > limitation
> > >> > > > > > > > > for this - it can't not go beyond 1MB for current
> > >> > > implementation.
> > >> > > > > > > > >
> > >> > > > > > > > > What is your expected record size?
> > >> > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > > >>
> > >> > > > > > > > >> >
> > >> > > > > > > > >> >
> > >> > > > > > > > >> > >
> > >> > > > > > > > >> > > Since we can guarantee per-key ordering on the
> > client
> > >> > > side,
> > >> > > > we
> > >> > > > > > > > >> guarantee
> > >> > > > > > > > >> > > that there is a single writer per-key, just not
> per
> > >> log.
> > >> > > > > > > > >> >
> > >> > > > > > > > >> >
> > >> > > > > > > > >> > Do you need fencing guarantee in the case of
> network
> > >> > > partition
> > >> > > > > > > causing
> > >> > > > > > > > >> > two-writers?
> > >> > > > > > > > >> >
> > >> > > > > > > > >> >
> > >> > > > > > > > >> > > So if there was a
> > >> > > > > > > > >> > > way to guarantee a single write request as being
> > >> written
> > >> > > or
> > >> > > > > not,
> > >> > > > > > > > >> within a
> > >> > > > > > > > >> > > certain time frame (since failures should be rare
> > >> > anyways,
> > >> > > > > this
> > >> > > > > > is
> > >> > > > > > > > >> fine
> > >> > > > > > > > >> > if
> > >> > > > > > > > >> > > it is expensive), we can then have the client
> > >> guarantee
> > >> > > the
> > >> > > > > > > ordering
> > >> > > > > > > > >> it
> > >> > > > > > > > >> > > needs.
> > >> > > > > > > > >> > >
> > >> > > > > > > > >> >
> > >> > > > > > > > >> > This sounds an 'exact-once' write (regarding
> retries)
> > >> > > > > requirement
> > >> > > > > > to
> > >> > > > > > > > me,
> > >> > > > > > > > >> > right?
> > >> > > > > > > > >> >
> > >> > > > > > > > >> Yes. I'm curious of how this issue is handled by
> > >> Manhattan,
> > >> > > > since
> > >> > > > > > you
> > >> > > > > > > > can
> > >> > > > > > > > >> imagine a data store that ends up getting multiple
> > writes
> > >> > for
> > >> > > > the
> > >> > > > > > same
> > >> > > > > > > > put
> > >> > > > > > > > >> / get / etc, would be harder to use, and we are
> > basically
> > >> > > trying
> > >> > > > > to
> > >> > > > > > > > create
> > >> > > > > > > > >> a log like that for HBase.
> > >> > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > > > Are you guys replacing HBase WAL?
> > >> > > > > > > > >
> > >> > > > > > > > > In Manhattan case, the request will be first written
> to
> > DL
> > >> > > > streams
> > >> > > > > by
> > >> > > > > > > > > Manhattan coordinator. The Manhattan replica then will
> > >> read
> > >> > > from
> > >> > > > > the
> > >> > > > > > DL
> > >> > > > > > > > > streams and apply the change. In the lost-ack case,
> the
> > MH
> > >> > > > > > coordinator
> > >> > > > > > > > will
> > >> > > > > > > > > just fail the request to client.
> > >> > > > > > > > >
> > >> > > > > > > > > My feeling here is your usage for HBase is a bit
> > different
> > >> > from
> > >> > > > how
> > >> > > > > > we
> > >> > > > > > > > use
> > >> > > > > > > > > DL in Manhattan. It sounds like you read from a source
> > >> (HBase
> > >> > > > WAL)
> > >> > > > > > and
> > >> > > > > > > > > write to DL. But I might be wrong.
> > >> > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > > >>
> > >> > > > > > > > >> >
> > >> > > > > > > > >> >
> > >> > > > > > > > >> > >
> > >> > > > > > > > >> > >
> > >> > > > > > > > >> > > > Cameron:
> > >> > > > > > > > >> > > > Another thing we've discussed but haven't
> really
> > >> > thought
> > >> > > > > > > through -
> > >> > > > > > > > >> > > > We might be able to support some kind of epoch
> > >> write
> > >> > > > > request,
> > >> > > > > > > > where
> > >> > > > > > > > >> the
> > >> > > > > > > > >> > > > epoch is guaranteed to have changed if the
> writer
> > >> has
> > >> > > > > changed
> > >> > > > > > or
> > >> > > > > > > > the
> > >> > > > > > > > >> > > ledger
> > >> > > > > > > > >> > > > was ever fenced off. Writes include an epoch
> and
> > >> are
> > >> > > > > rejected
> > >> > > > > > if
> > >> > > > > > > > the
> > >> > > > > > > > >> > > epoch
> > >> > > > > > > > >> > > > has changed.
> > >> > > > > > > > >> > > > With a mechanism like this, fencing the ledger
> > off
> > >> > > after a
> > >> > > > > > > failure
> > >> > > > > > > > >> > would
> > >> > > > > > > > >> > > > ensure any pending writes had either been
> written
> > >> or
> > >> > > would
> > >> > > > > be
> > >> > > > > > > > >> rejected.
> > >> > > > > > > > >> > >
> > >> > > > > > > > >> > > The issue would be how I guarantee the write I
> > wrote
> > >> to
> > >> > > the
> > >> > > > > > server
> > >> > > > > > > > was
> > >> > > > > > > > >> > > written. Since a network issue could happen on
> the
> > >> send
> > >> > of
> > >> > > > the
> > >> > > > > > > > >> request,
> > >> > > > > > > > >> > or
> > >> > > > > > > > >> > > on the receive of the success response, an epoch
> > >> > wouldn't
> > >> > > > tell
> > >> > > > > > me
> > >> > > > > > > > if I
> > >> > > > > > > > >> > can
> > >> > > > > > > > >> > > successfully retry, as it could be successfully
> > >> written
> > >> > > but
> > >> > > > > AWS
> > >> > > > > > > > >> dropped
> > >> > > > > > > > >> > the
> > >> > > > > > > > >> > > connection for the success response. Since the
> > epoch
> > >> > would
> > >> > > > be
> > >> > > > > > the
> > >> > > > > > > > same
> > >> > > > > > > > >> > > (same ledger), I could write duplicates.
> > >> > > > > > > > >> > >
> > >> > > > > > > > >> > >
> > >> > > > > > > > >> > > > We are currently proposing adding a transaction
> > >> > semantic
> > >> > > > to
> > >> > > > > dl
> > >> > > > > > > to
> > >> > > > > > > > >> get
> > >> > > > > > > > >> > rid
> > >> > > > > > > > >> > > > of the size limitation and the unaware-ness in
> > the
> > >> > proxy
> > >> > > > > > client.
> > >> > > > > > > > >> Here
> > >> > > > > > > > >> > is
> > >> > > > > > > > >> > > > our idea -
> > >> > > > > > > > >> > > > http://mail-archives.apache.or
> > >> g/mod_mbox/incubator-
> > >> > > > > > > distributedlog
> > >> > > > > > > > >> > > -dev/201609.mbox/%3cCAAC6BxP5Y
> > >> yEHwG0ZCF5soh42X=xuYwYm
> > >> > > > > > > > >> > > <http://mail-archives.apache.
> > org/mod_mbox/incubator-
> > >> > > > > > > > >> > distributedlog%0A-dev/201609.mbox/%
> > >> > > 3cCAAC6BxP5YyEHwG0ZCF5soh
> > >> > > > > > > > 42X=xuYwYm>
> > >> > > > > > > > >> > > L4nXsYBYiofzxpVk6g@mail.gmail.com%3e
> > >> > > > > > > > >> > >
> > >> > > > > > > > >> > > > I am not sure if your idea is similar as ours.
> > but
> > >> > we'd
> > >> > > > like
> > >> > > > > > to
> > >> > > > > > > > >> > > collaborate
> > >> > > > > > > > >> > > > with the community if anyone has the similar
> > idea.
> > >> > > > > > > > >> > >
> > >> > > > > > > > >> > > Our use case would be covered by transaction
> > support,
> > >> > but
> > >> > > > I'm
> > >> > > > > > > unsure
> > >> > > > > > > > >> if
> > >> > > > > > > > >> > we
> > >> > > > > > > > >> > > would need something that heavy weight for the
> > >> > guarantees
> > >> > > we
> > >> > > > > > need.
> > >> > > > > > > > >> > >
> > >> > > > > > > > >> >
> > >> > > > > > > > >> > >
> > >> > > > > > > > >> > > Basically, the high level requirement here is
> > >> "Support
> > >> > > > > > consistent
> > >> > > > > > > > >> write
> > >> > > > > > > > >> > > ordering for single-writer-per-key,
> > >> > multi-writer-per-log".
> > >> > > > My
> > >> > > > > > > hunch
> > >> > > > > > > > is
> > >> > > > > > > > >> > > that, with some added guarantees to the proxy (if
> > it
> > >> > isn't
> > >> > > > > > already
> > >> > > > > > > > >> > > supported), and some custom client code on our
> side
> > >> for
> > >> > > > > removing
> > >> > > > > > > the
> > >> > > > > > > > >> > > entries that actually succeed to write to
> > >> DistributedLog
> > >> > > > from
> > >> > > > > > the
> > >> > > > > > > > >> request
> > >> > > > > > > > >> > > that failed, it should be a relatively easy thing
> > to
> > >> > > > support.
> > >> > > > > > > > >> > >
> > >> > > > > > > > >> >
> > >> > > > > > > > >> > Yup. I think it should not be very difficult to
> > >> support.
> > >> > > There
> > >> > > > > > might
> > >> > > > > > > > be
> > >> > > > > > > > >> > some changes in the server side.
> > >> > > > > > > > >> > Let's figure out what will the changes be. Are you
> > guys
> > >> > > > > interested
> > >> > > > > > > in
> > >> > > > > > > > >> > contributing?
> > >> > > > > > > > >> >
> > >> > > > > > > > >> > Yes, we would be.
> > >> > > > > > > > >>
> > >> > > > > > > > >> As a note, the one thing that we see as an issue with
> > the
> > >> > > client
> > >> > > > > > side
> > >> > > > > > > > >> dedupping is how to bound the range of data that
> needs
> > >> to be
> > >> > > > > looked
> > >> > > > > > at
> > >> > > > > > > > for
> > >> > > > > > > > >> deduplication. As you can imagine, it is pretty easy
> to
> > >> > bound
> > >> > > > the
> > >> > > > > > > bottom
> > >> > > > > > > > >> of
> > >> > > > > > > > >> the range, as that it just regular checkpointing of
> the
> > >> DSLN
> > >> > > > that
> > >> > > > > is
> > >> > > > > > > > >> returned. I'm still not sure if there is any nice way
> > to
> > >> > time
> > >> > > > > bound
> > >> > > > > > > the
> > >> > > > > > > > >> top
> > >> > > > > > > > >> end of the range, especially since the proxy owns
> > >> sequence
> > >> > > > numbers
> > >> > > > > > > > (which
> > >> > > > > > > > >> makes sense). I am curious if there is more that can
> be
> > >> done
> > >> > > if
> > >> > > > > > > > >> deduplication is on the server side. However the main
> > >> minus
> > >> > I
> > >> > > > see
> > >> > > > > of
> > >> > > > > > > > >> server
> > >> > > > > > > > >> side deduplication is that instead of running
> > contingent
> > >> on
> > >> > > > there
> > >> > > > > > > being
> > >> > > > > > > > a
> > >> > > > > > > > >> failed client request, instead it would have to run
> > every
> > >> > > time a
> > >> > > > > > write
> > >> > > > > > > > >> happens.
> > >> > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > > > For a reliable dedup, we probably need
> > >> fence-then-getLastDLSN
> > >> > > > > > > operation -
> > >> > > > > > > > > so it would guarantee that any non-completed requests
> > >> issued
> > >> > > > > > (lost-ack
> > >> > > > > > > > > requests) before this fence-then-getLastDLSN operation
> > >> will
> > >> > be
> > >> > > > > failed
> > >> > > > > > > and
> > >> > > > > > > > > they will never land at the log.
> > >> > > > > > > > >
> > >> > > > > > > > > the pseudo code would look like below -
> > >> > > > > > > > >
> > >> > > > > > > > > write(request) onFailure { t =>
> > >> > > > > > > > >
> > >> > > > > > > > > if (t is timeout exception) {
> > >> > > > > > > > >
> > >> > > > > > > > > DLSN lastDLSN = fenceThenGetLastDLSN()
> > >> > > > > > > > > DLSN lastCheckpointedDLSN = ...;
> > >> > > > > > > > > // find if the request lands between [lastDLSN,
> > >> > > > > > lastCheckpointedDLSN].
> > >> > > > > > > > > // if it exists, the write succeed; otherwise retry.
> > >> > > > > > > > >
> > >> > > > > > > > > }
> > >> > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > > > }
> > >> > > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > > Just realized the idea is same as what Leigh raised in
> the
> > >> > > previous
> > >> > > > > > email
> > >> > > > > > > > about 'epoch write'. Let me explain more about this idea
> > >> > (Leigh,
> > >> > > > feel
> > >> > > > > > > free
> > >> > > > > > > > to jump in to fill up your idea).
> > >> > > > > > > >
> > >> > > > > > > > - when a log stream is owned,  the proxy use the last
> > >> > transaction
> > >> > > > id
> > >> > > > > as
> > >> > > > > > > the
> > >> > > > > > > > epoch
> > >> > > > > > > > - when a client connects (handshake with the proxy), it
> > will
> > >> > get
> > >> > > > the
> > >> > > > > > > epoch
> > >> > > > > > > > for the stream.
> > >> > > > > > > > - the writes issued by this client will carry the epoch
> to
> > >> the
> > >> > > > proxy.
> > >> > > > > > > > - add a new rpc - fenceThenGetLastDLSN - it would force
> > the
> > >> > proxy
> > >> > > > to
> > >> > > > > > bump
> > >> > > > > > > > the epoch.
> > >> > > > > > > > - if fenceThenGetLastDLSN happened, all the outstanding
> > >> writes
> > >> > > with
> > >> > > > > old
> > >> > > > > > > > epoch will be rejected with exceptions (e.g.
> EpochFenced).
> > >> > > > > > > > - The DLSN returned from fenceThenGetLastDLSN can be
> used
> > as
> > >> > the
> > >> > > > > bound
> > >> > > > > > > for
> > >> > > > > > > > deduplications on failures.
> > >> > > > > > > >
> > >> > > > > > > > Cameron, does this sound like a solution for your use case?
> > >> > > > > > > >
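> > >> > > > > > > > To make it concrete, here is a rough client-side sketch of how the
> > >> > > > > > > > pieces could fit together (pseudo-code only; handshake(),
> > >> > > > > > > > fenceThenGetLastDLSN() and EpochFencedException are the proposed
> > >> > > > > > > > additions, not an existing API):
> > >> > > > > > > >
> > >> > > > > > > > long epoch = client.handshake(stream).getEpoch();   // epoch handed out on connect
> > >> > > > > > > > try {
> > >> > > > > > > >     DLSN dlsn = client.write(stream, record, epoch).get();
> > >> > > > > > > >     checkpoint(dlsn);                                // remember the last acked DLSN
> > >> > > > > > > > } catch (TimeoutException te) {
> > >> > > > > > > >     // lost-ack case: fence first so no old-epoch write can land later,
> > >> > > > > > > >     // then dedup within the bounded range.
> > >> > > > > > > >     DLSN fencedDLSN = client.fenceThenGetLastDLSN(stream).get();
> > >> > > > > > > >     if (!foundBetween(record, lastCheckpointedDLSN(), fencedDLSN)) {
> > >> > > > > > > >         epoch = client.handshake(stream).getEpoch(); // pick up the bumped epoch
> > >> > > > > > > >         client.write(stream, record, epoch).get();   // safe to retry
> > >> > > > > > > >     }
> > >> > > > > > > > } catch (EpochFencedException efe) {
> > >> > > > > > > >     epoch = client.handshake(stream).getEpoch();     // refresh and continue
> > >> > > > > > > > }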
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > > >>
> > >> > > > > > > > >> Maybe something that could fit a similar need that
> > Kafka
> > >> > does
> > >> > > > (the
> > >> > > > > > > last
> > >> > > > > > > > >> store value for a particular key in a log), such that
> > on
> > >> a
> > >> > per
> > >> > > > key
> > >> > > > > > > basis
> > >> > > > > > > > >> there could be a sequence number that support
> > >> deduplication?
> > >> > > > Cost
> > >> > > > > > > seems
> > >> > > > > > > > >> like it would be high however, and I'm not even sure
> if
> > >> > > > bookkeeper
> > >> > > > > > > > >> supports
> > >> > > > > > > > >> it.
> > >> > > > > > > > >
> > >> > > > > > > > >
> > >> > > > > > > > >> Cheers,
> > >> > > > > > > > >> Cameron
> > >> > > > > > > > >>
> > >> > > > > > > > >> >
> > >> > > > > > > > >> > >
> > >> > > > > > > > >> > > Thanks,
> > >> > > > > > > > >> > > Cameron
> > >> > > > > > > > >> > >
> > >> > > > > > > > >> > >
> > >> > > > > > > > >> > > On Sat, Oct 8, 2016 at 7:35 AM, Leigh Stewart
> > >> > > > > > > > >> > <lstewart@twitter.com.invalid
> > >> > > > > > > > >> > > >
> > >> > > > > > > > >> > > wrote:
> > >> > > > > > > > >> > >
> > >> > > > > > > > >> > > > Cameron:
> > >> > > > > > > > >> > > > Another thing we've discussed but haven't
> really
> > >> > thought
> > >> > > > > > > through -
> > >> > > > > > > > >> > > > We might be able to support some kind of epoch
> > >> write
> > >> > > > > request,
> > >> > > > > > > > where
> > >> > > > > > > > >> the
> > >> > > > > > > > >> > > > epoch is guaranteed to have changed if the
> writer
> > >> has
> > >> > > > > changed
> > >> > > > > > or
> > >> > > > > > > > the
> > >> > > > > > > > >> > > ledger
> > >> > > > > > > > >> > > > was ever fenced off. Writes include an epoch
> and
> > >> are
> > >> > > > > rejected
> > >> > > > > > if
> > >> > > > > > > > the
> > >> > > > > > > > >> > > epoch
> > >> > > > > > > > >> > > > has changed.
> > >> > > > > > > > >> > > > With a mechanism like this, fencing the ledger
> > off
> > >> > > after a
> > >> > > > > > > failure
> > >> > > > > > > > >> > would
> > >> > > > > > > > >> > > > ensure any pending writes had either been
> written
> > >> or
> > >> > > would
> > >> > > > > be
> > >> > > > > > > > >> rejected.
> > >> > > > > > > > >> > > >
> > >> > > > > > > > >> > > >
> > >> > > > > > > > >> > > > On Sat, Oct 8, 2016 at 7:10 AM, Sijie Guo <
> > >> > > > sijie@apache.org
> > >> > > > > >
> > >> > > > > > > > wrote:
> > >> > > > > > > > >> > > >
> > >> > > > > > > > >> > > > > Cameron,
> > >> > > > > > > > >> > > > >
> > >> > > > > > > > >> > > > > I think both Leigh and Xi had made a few good
> > >> points
> > >> > > > about
> > >> > > > > > > your
> > >> > > > > > > > >> > > question.
> > >> > > > > > > > >> > > > >
> > >> > > > > > > > >> > > > > To add one more point to your question -
> "but I
> > >> am
> > >> > not
> > >> > > > > > > > >> > > > > 100% of how all of the futures in the code
> > handle
> > >> > > > > failures.
> > >> > > > > > > > >> > > > > If not, where in the code would be the
> relevant
> > >> > places
> > >> > > > to
> > >> > > > > > add
> > >> > > > > > > > the
> > >> > > > > > > > >> > > ability
> > >> > > > > > > > >> > > > > to do this, and would the project be
> interested
> > >> in a
> > >> > > > pull
> > >> > > > > > > > >> request?"
> > >> > > > > > > > >> > > > >
> > >> > > > > > > > >> > > > > The current proxy and client logic doesn't do
> > >> > > perfectly
> > >> > > > on
> > >> > > > > > > > >> handling
> > >> > > > > > > > >> > > > > failures (duplicates) - the strategy now is
> the
> > >> > client
> > >> > > > > will
> > >> > > > > > > > retry
> > >> > > > > > > > >> as
> > >> > > > > > > > >> > > best
> > >> > > > > > > > >> > > > > at it can before throwing exceptions to
> users.
> > >> The
> > >> > > code
> > >> > > > > you
> > >> > > > > > > are
> > >> > > > > > > > >> > looking
> > >> > > > > > > > >> > > > for
> > >> > > > > > > > >> > > > > - it is on BKLogSegmentWriter for the proxy
> > >> handling
> > >> > > > > writes
> > >> > > > > > > and
> > >> > > > > > > > >> it is
> > >> > > > > > > > >> > > on
> > >> > > > > > > > >> > > > > DistributedLogClientImpl for the proxy client
> > >> > handling
> > >> > > > > > > responses
> > >> > > > > > > > >> from
> > >> > > > > > > > >> > > > > proxies. Does this help you?
> > >> > > > > > > > >> > > > >
> > >> > > > > > > > >> > > > > And also, you are welcome to contribute the
> > pull
> > >> > > > requests.
> > >> > > > > > > > >> > > > >
> > >> > > > > > > > >> > > > > - Sijie
> > >> > > > > > > > >> > > > >
> > >> > > > > > > > >> > > > >
> > >> > > > > > > > >> > > > >
> > >> > > > > > > > >> > > > > On Tue, Oct 4, 2016 at 3:39 PM, Cameron
> > Hatfield
> > >> <
> > >> > > > > > > > >> kinguy@gmail.com>
> > >> > > > > > > > >> > > > wrote:
> > >> > > > > > > > >> > > > >
> > >> > > > > > > > >> > > > > > I have a question about the Proxy Client.
> > >> > Basically,
> > >> > > > for
> > >> > > > > > our
> > >> > > > > > > > use
> > >> > > > > > > > >> > > cases,
> > >> > > > > > > > >> > > > > we
> > >> > > > > > > > >> > > > > > want to guarantee ordering at the key
> level,
> > >> > > > > irrespective
> > >> > > > > > of
> > >> > > > > > > > the
> > >> > > > > > > > >> > > > ordering
> > >> > > > > > > > >> > > > > > of the partition it may be assigned to as a
> > >> whole.
> > >> > > Due
> > >> > > > > to
> > >> > > > > > > the
> > >> > > > > > > > >> > source
> > >> > > > > > > > >> > > of
> > >> > > > > > > > >> > > > > the
> > >> > > > > > > > >> > > > > > data (HBase Replication), we cannot
> guarantee
> > >> > that a
> > >> > > > > > single
> > >> > > > > > > > >> > partition
> > >> > > > > > > > >> > > > > will
> > >> > > > > > > > >> > > > > > be owned for writes by the same client.
> This
> > >> means
> > >> > > the
> > >> > > > > > proxy
> > >> > > > > > > > >> client
> > >> > > > > > > > >> > > > works
> > >> > > > > > > > >> > > > > > well (since we don't care which proxy owns
> > the
> > >> > > > partition
> > >> > > > > > we
> > >> > > > > > > > are
> > >> > > > > > > > >> > > writing
> > >> > > > > > > > >> > > > > > to).
> > >> > > > > > > > >> > > > > >
> > >> > > > > > > > >> > > > > >
> > >> > > > > > > > >> > > > > > However, the guarantees we need when
> writing
> > a
> > >> > batch
> > >> > > > > > > consists
> > >> > > > > > > > >> of:
> > >> > > > > > > > >> > > > > > Definition of a Batch: The set of records
> > sent
> > >> to
> > >> > > the
> > >> > > > > > > > writeBatch
> > >> > > > > > > > >> > > > endpoint
> > >> > > > > > > > >> > > > > > on the proxy
> > >> > > > > > > > >> > > > > >
> > >> > > > > > > > >> > > > > > 1. Batch success: If the client receives a
> > >> success
> > >> > > > from
> > >> > > > > > the
> > >> > > > > > > > >> proxy,
> > >> > > > > > > > >> > > then
> > >> > > > > > > > >> > > > > > that batch is successfully written
> > >> > > > > > > > >> > > > > >
> > >> > > > > > > > >> > > > > > 2. Inter-Batch ordering : Once a batch has
> > been
> > >> > > > written
> > >> > > > > > > > >> > successfully
> > >> > > > > > > > >> > > by
> > >> > > > > > > > >> > > > > the
> > >> > > > > > > > >> > > > > > client, when another batch is written, it
> > will
> > >> be
> > >> > > > > > guaranteed
> > >> > > > > > > > to
> > >> > > > > > > > >> be
> > >> > > > > > > > >> > > > > ordered
> > >> > > > > > > > >> > > > > > after the last batch (if it is the same
> > >> stream).
> > >> > > > > > > > >> > > > > >
> > >> > > > > > > > >> > > > > > 3. Intra-Batch ordering: Within a batch of
> > >> writes,
> > >> > > the
> > >> > > > > > > records
> > >> > > > > > > > >> will
> > >> > > > > > > > >> > > be
> > >> > > > > > > > >> > > > > > committed in order
> > >> > > > > > > > >> > > > > >
> > >> > > > > > > > >> > > > > > 4. Intra-Batch failure ordering: If an
> > >> individual
> > >> > > > record
> > >> > > > > > > fails
> > >> > > > > > > > >> to
> > >> > > > > > > > >> > > write
> > >> > > > > > > > >> > > > > > within a batch, all records after that
> record
> > >> will
> > >> > > not
> > >> > > > > be
> > >> > > > > > > > >> written.
> > >> > > > > > > > >> > > > > >
> > >> > > > > > > > >> > > > > > 5. Batch Commit: Guarantee that if a batch
> > >> > returns a
> > >> > > > > > > success,
> > >> > > > > > > > it
> > >> > > > > > > > >> > will
> > >> > > > > > > > >> > > > be
> > >> > > > > > > > >> > > > > > written
> > >> > > > > > > > >> > > > > >
> > >> > > > > > > > >> > > > > > 6. Read-after-write: Once a batch is
> > committed,
> > >> > > > within a
> > >> > > > > > > > limited
> > >> > > > > > > > >> > > > > time-frame
> > >> > > > > > > > >> > > > > > it will be able to be read. This is
> required
> > in
> > >> > the
> > >> > > > case
> > >> > > > > > of
> > >> > > > > > > > >> > failure,
> > >> > > > > > > > >> > > so
> > >> > > > > > > > >> > > > > > that the client can see what actually got
> > >> > > committed. I
> > >> > > > > > > believe
> > >> > > > > > > > >> the
> > >> > > > > > > > >> > > > > > time-frame part could be removed if the
> > client
> > >> can
> > >> > > > send
> > >> > > > > in
> > >> > > > > > > the
> > >> > > > > > > > >> same
> > >> > > > > > > > >> > > > > > sequence number that was written
> previously,
> > >> since
> > >> > > it
> > >> > > > > > would
> > >> > > > > > > > then
> > >> > > > > > > > >> > fail
> > >> > > > > > > > >> > > > and
> > >> > > > > > > > >> > > > > > we would know that a read needs to occur.
> > >> > > > > > > > >> > > > > >
> > >> > > > > > > > >> > > > > >
> > >> > > > > > > > >> > > > > > So, my basic question is if this is
> currently
> > >> > > possible
> > >> > > > > in
> > >> > > > > > > the
> > >> > > > > > > > >> > proxy?
> > >> > > > > > > > >> > > I
> > >> > > > > > > > >> > > > > > don't believe it gives these guarantees as
> it
> > >> > stands
> > >> > > > > > today,
> > >> > > > > > > > but
> > >> > > > > > > > >> I
> > >> > > > > > > > >> > am
> > >> > > > > > > > >> > > > not
> > >> > > > > > > > >> > > > > > 100% of how all of the futures in the code
> > >> handle
> > >> > > > > > failures.
> > >> > > > > > > > >> > > > > > If not, where in the code would be the
> > relevant
> > >> > > places
> > >> > > > > to
> > >> > > > > > > add
> > >> > > > > > > > >> the
> > >> > > > > > > > >> > > > ability
> > >> > > > > > > > >> > > > > > to do this, and would the project be
> > interested
> > >> > in a
> > >> > > > > pull
> > >> > > > > > > > >> request?
> > >> > > > > > > > >> > > > > >
> > >> > > > > > > > >> > > > > >
> > >> > > > > > > > >> > > > > > Thanks,
> > >> > > > > > > > >> > > > > > Cameron
> > >> > > > > > > > >> > > > > >
> > >> > > > > > > > >> > > > >
> > >> > > > > > > > >> > > >
> > >> > > > > > > > >> > >
> > >> > > > > > > > >> >
> > >> > > > > > > > >>
> > >> > > > > > > > >
> > >> > > > > > > >
> > >> > > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> > >
> >
>

Re: Proxy Client - Batch Ordering / Commit

Posted by Sijie Guo <si...@twitter.com.INVALID>.
Cameron,

Can you send me your wiki account name? I can grant you the permission to
edit it.

- Sijie

On Wed, Nov 16, 2016 at 12:11 PM, Cameron Hatfield <ki...@gmail.com> wrote:

> Also, would it be possible for me to get wiki access so I will be able to
> update it / etc?
>
> -Cameron
>
> On Wed, Nov 16, 2016 at 11:59 AM, Cameron Hatfield <ki...@gmail.com>
> wrote:
>
> > "A couple of questions" is what I originally wrote, and then the
> following
> > happened. Sorry about the large swath of them, making sure my
> understanding
> > of the code base, as well as the DL/Bookkeeper/ZK ecosystem interaction,
> > makes sense.
> >
> > ==General:
> > What is an exclusive session? What is it providing over a regular session?
> >
> >
> > ==Proxy side:
> > Should a new streamop be added for the fencing operation, or does it make
> > sense to piggyback on an existing one (such as write)?
> >
> > ====getLastDLSN:
> > What should be the return result for:
> > A new stream
> > A new session, after successful fencing
> > A new session, after a change in ownership / first starting up
> >
> > What is the main use case for getLastDLSN(<stream>, false)? Is this to
> > guarantee that the recovery process has happened in case of ownership
> > failure (I don't have a good understanding of what causes the recovery
> > process to happen, especially from the reader side)? Or is it to handle the
> > lost-ack problem? Since all the rest of the read related things go through
> > the read client, I'm not sure if I see the use case, but it seems like
> > there would be a large potential for confusion on which to use. What about
> > just a fenceSession op that always fences, returning the DLSN of the fence,
> > and leaving the normal getLastDLSN to the regular read client?
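> >
> > For discussion, roughly what I mean on the client interface (illustrative
> > signatures, not the existing DistributedLogClient API):
> >
> > // fenceSession always bumps the session and returns the last DLSN written
> > // under the old session, giving the caller a hard upper bound for dedup.
> > Future<DLSN> fenceSession(String stream);
> >
> > // getLastDLSN stays a plain read with no fencing side effects and keeps
> > // being served through the regular read client.
> > Future<DLSN> getLastDLSN(String stream);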
> >
> > ====Fencing:
> > When a fence session occurs, what call needs to be made to make sure any
> > outstanding writes are flushed and committed (so that we guarantee the
> > client will be able to read anything that was in the write queue)?
> > Is there a guaranteed ordering for things written in the future queue for
> > AsyncLogWriter (I'm not quite confident that I was able to accurately
> > follow the logic, as there are many parts of the code that write, have
> > queues, heartbeat, etc)?
> >
> > ====SessionID:
> > What is the default sessionid / transactionid for a new stream? I assume
> > this would just be the first control record
> >
> > ======Should all streams have a sessionid by default, regardless of whether
> > it is ever used by a client (aka, every time ownership changes, a new
> > control record is generated, and a sessionid is stored)?
> > Main edge case that would have to be handled is if a client writes with an
> > old sessionid, but the owner has changed and has yet to create a sessionid.
> > This should be handled by the "non-matching sessionid" rule, since the
> > invalid sessionid wouldn't match the passed sessionid, which should cause
> > the client to get a new sessionid.
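> >
> > Roughly the proxy-side check I have in mind (pseudo-Java; hasSessionId,
> > currentSessionId, NO_SESSION and sessionFencedResponse are made-up names):
> >
> > if (request.hasSessionId() && request.getSessionId() != currentSessionId()) {
> >     // covers both a stale sessionid and the case where the new owner has
> >     // not created one yet (currentSessionId() returns NO_SESSION, which
> >     // never matches): the client gets a session-fenced error, re-handshakes
> >     // for the new sessionid, and retries.
> >     return Future.value(sessionFencedResponse(currentSessionId()));
> > }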
> >
> > ======Where in the code does it make sense to own the session, the stream
> > interfaces / classes? Should they pass that information down to the ops, or
> > do the sessionid check within?
> > My first thought would be that Stream owns the sessionid and passes it into
> > the ops (as either a nullable value, or an invalid default value), which
> > then do the sessionid check if they care. The main issue is that updating
> > the sessionid is a bit backwards, as either every op has the ability to
> > update it through some type of return value / direct stream access / etc,
> > or there is a special case in the stream for the fence operation / any
> > other operation that can update the session.
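> >
> > Something like this sketch is what I was picturing (not tied to the real
> > class layout; names are placeholders):
> >
> > class Stream {
> >     private AsyncLogWriter writer;        // the underlying log writer
> >     private volatile long sessionId;      // updated only on fence / ownership change
> >
> >     void submit(StreamOp op) {
> >         op.execute(sessionId, writer);    // ops that don't care just ignore it
> >     }
> >
> >     void onFenceSession(long newSessionId) {  // the special-cased update path
> >         this.sessionId = newSessionId;
> >     }
> > }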
> >
> > ======For "the owner of the log stream will first advance the transaction
> > id generator to claim a new transaction id and write a control record to
> > the log stream. ":
> > Should "DistributedLogConstants.CONTROL_RECORD_CONTENT" be the type of
> > control record written?
> > Should the "writeControlRecord" on the BKAsyncLogWriter be exposed in the
> > AsyncLogWriter interface be exposed?  Or even in the one within the
> segment
> > writer? Or should the code be duplicated / pulled out into a helper /
> etc?
> > (Not a big Java person, so any suggestions on the "Java Way", or at least
> > the DL way, to do it would be appreciated)
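> >
> > i.e. something along these lines when ownership is acquired (placeholders
> > only; I'm not sure this matches the real BKAsyncLogWriter surface):
> >
> > long sessionId = sequencer.nextId();           // advance the transaction id generator
> > LogRecord control = new LogRecord(sessionId,
> >         DistributedLogConstants.CONTROL_RECORD_CONTENT);
> > control.setControl();                          // mark it as a control record
> > writer.writeControlRecord(control);            // today this lives on BKAsyncLogWriter only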
> >
> > ======Transaction ID:
> > The BKLogSegmentWriter ignores the transaction ids from control records
> > when it records the "LastTXId." Would that be an issue here for anything?
> > It looks like it may do that because it assumes you're calling its local
> > function for writing a control record, which uses the lastTxId.
> >
> >
> > ==Thrift Interface:
> > ====Should the write response be split out for different calls?
> > It seems odd to have a single struct with many optional items that are
> > filled depending on the call made for every rpc call. This is mostly a
> > curiosity question, since I assume it comes from the general practice of
> > using thrift for a while. Would it at least make sense for the
> > getLastDLSN/fence endpoint to have a new struct?
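> >
> > e.g. a dedicated response, instead of more optional fields on the shared
> > write response (sketch of the generated-Java shape; the field set is just a
> > suggestion):
> >
> > class FenceSessionResponse {
> >     ResponseHeader header;   // status code - see the 412-style question below
> >     String lastDLSN;         // last DLSN written under the fenced session
> >     long sessionId;          // optionally the new sessionid - see question below
> > }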
> >
> > ====Any particular error code that makes sense for session fenced? If we
> > want to be close to the HTTP errors, looks like 412 (PRECONDITION FAILED)
> > might make the most sense, if a bit generic.
> >
> > 412 def:
> > "The precondition given in one or more of the request-header fields
> > evaluated to false when it was tested on the server. This response code
> > allows the client to place preconditions on the current resource
> > metainformation (header field data) and thus prevent the requested method
> > from being applied to a resource other than the one intended."
> >
> > 412 excerpt from the If-match doc:
> > "This behavior is most useful when the client wants to prevent an
> updating
> > method, such as PUT, from modifying a resource that has changed since the
> > client last retrieved it."
> >
> > ====Should we return the sessionid to the client in the "fencesession"
> > calls?
> > Seems like it may be useful when you fence, especially if you have some
> > type of custom sequencer where it would make sense for search, or for
> > debugging.
> > Main minus is that it would be easy for users to create an implicit
> > requirement that the sessionid is forever a valid transactionid, which may
> > not always be the case long term for the project.
> >
> >
> > ==Client:
> > ====What is the proposed process for the client retrieving the new
> > sessionid?
> > A full reconnect? No special case code, but intrusive on the client side,
> > and possibly expensive garbage/processing wise. (though this type of
> > failure should hopefully be rare enough to not be a problem)
> > A call to reset the sessionid? Less intrusive, but it has all the issues
> > you get with mutable object methods that need to be called in a certain
> > order, edge cases such as outstanding/buffered requests to the old stream,
> > etc.
> > The call could also return the new sessionid, making it a good call for
> > storing or debugging the value.
> >
> > ====Session Fenced failure:
> > Will this put the client into a failure state, stopping all future writes
> > until fixed?
> > Is it even possible to get this error when ownership changes? The
> > connection to the new owner should get a new sessionid on connect, so I
> > would expect not.
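> >
> > In other words, on the client I'd expect the handling to look roughly like
> > this (hypothetical names; resetSession could equally be a full reconnect):
> >
> > try {
> >     client.write(stream, record).get();
> > } catch (SessionFencedException sfe) {
> >     // not a terminal failure state: refresh the sessionid, dedup against
> >     // the fence DLSN if this was a retry of a lost-ack write, then resume.
> >     long newSessionId = client.resetSession(stream);
> >     client.write(stream, record).get();
> > }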
> >
> > Cheers,
> > Cameron
> >
> > On Tue, Nov 15, 2016 at 2:01 AM, Xi Liu <xi...@gmail.com> wrote:
> >
> >> Thank you, Cameron. Look forward to your comments.
> >>
> >> - Xi
> >>
> >> On Sun, Nov 13, 2016 at 1:21 PM, Cameron Hatfield <ki...@gmail.com>
> >> wrote:
> >>
> >> > Sorry, I've been on vacation for the past week, and heads down for a
> >> > release that is using DL at the end of Nov. I'll take a look at this
> >> over
> >> > the next week, and add any relevant comments. After we are finished
> with
> >> > dev for this release, I am hoping to tackle this next.
> >> >
> >> > -Cameron
> >> >
> >> > On Fri, Nov 11, 2016 at 12:07 PM, Sijie Guo <si...@apache.org> wrote:
> >> >
> >> > > Xi,
> >> > >
> >> > > Thank you so much for your proposal. I took a look. It looks fine to
> >> me.
> >> > > Cameron, do you have any comments?
> >> > >
> >> > > Look forward to your pull requests.
> >> > >
> >> > > - Sijie
> >> > >
> >> > >
> >> > > On Wed, Nov 9, 2016 at 2:34 AM, Xi Liu <xi...@gmail.com>
> wrote:
> >> > >
> >> > > > Cameron,
> >> > > >
> >> > > > Have you started any work for this? I just updated the proposal
> >> page -
> >> > > > https://cwiki.apache.org/confluence/display/DL/DP-2+-+
> >> > > Epoch+Write+Support
> >> > > > Maybe we can work together with this.
> >> > > >
> >> > > > Sijie, Leigh,
> >> > > >
> >> > > > can you guys help review this to make sure our proposal is in the
> >> right
> >> > > > direction?
> >> > > >
> >> > > > - Xi
> >> > > >
> >> > > > On Tue, Nov 1, 2016 at 3:05 AM, Sijie Guo <si...@apache.org>
> wrote:
> >> > > >
> >> > > > > I created https://issues.apache.org/jira/browse/DL-63 for
> >> tracking
> >> > the
> >> > > > > proposed idea here.
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > > On Wed, Oct 26, 2016 at 4:53 PM, Sijie Guo
> >> > <sijieg@twitter.com.invalid
> >> > > >
> >> > > > > wrote:
> >> > > > >
> >> > > > > > On Tue, Oct 25, 2016 at 11:30 AM, Cameron Hatfield <
> >> > kinguy@gmail.com
> >> > > >
> >> > > > > > wrote:
> >> > > > > >
> >> > > > > > > Yes, we are reading the HBase WAL (from their replication
> >> plugin
> >> > > > > > support),
> >> > > > > > > and writing that into DL.
> >> > > > > > >
> >> > > > > >
> >> > > > > > Gotcha.
> >> > > > > >
> >> > > > > >
> >> > > > > > >
> >> > > > > > > From the sounds of it, yes, it would. Only thing I would say
> >> is
> >> > > make
> >> > > > > the
> >> > > > > > > epoch requirement optional, so that if I client doesn't care
> >> > about
> >> > > > > dupes
> >> > > > > > > they don't have to deal with the process of getting a new
> >> epoch.
> >> > > > > > >
> >> > > > > >
> >> > > > > > Yup. This should be optional. I can start a wiki page on how
> we
> >> > want
> >> > > to
> >> > > > > > implement this. Are you interested in contributing to this?
> >> > > > > >
> >> > > > > >
> >> > > > > > >
> >> > > > > > > -Cameron
> >> > > > > > >
> >> > > > > > > On Wed, Oct 19, 2016 at 7:43 PM, Sijie Guo
> >> > > > <sijieg@twitter.com.invalid
> >> > > > > >
> >> > > > > > > wrote:
> >> > > > > > >
> >> > > > > > > > On Wed, Oct 19, 2016 at 7:17 PM, Sijie Guo <
> >> sijieg@twitter.com
> >> > >
> >> > > > > wrote:
> >> > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > > On Monday, October 17, 2016, Cameron Hatfield <
> >> > > kinguy@gmail.com>
> >> > > > > > > wrote:
> >> > > > > > > > >
> >> > > > > > > > >> Answer inline:
> >> > > > > > > > >>
> >> > > > > > > > >> On Mon, Oct 17, 2016 at 11:46 AM, Sijie Guo <
> >> > sijie@apache.org
> >> > > >
> >> > > > > > wrote:
> >> > > > > > > > >>
> >> > > > > > > > >> > Cameron,
> >> > > > > > > > >> >
> >> > > > > > > > >> > Thank you for your summary. I liked the discussion
> >> here. I
> >> > > > also
> >> > > > > > > liked
> >> > > > > > > > >> the
> >> > > > > > > > >> > summary of your requirement - 'single-writer-per-key,
> >> > > > > > > > >> > multiple-writers-per-log'. If I understand correctly,
> >> the
> >> > > core
> >> > > > > > > concern
> >> > > > > > > > >> here
> >> > > > > > > > >> > is almost 'exact-once' write (or a way to explicit
> tell
> >> > if a
> >> > > > > write
> >> > > > > > > can
> >> > > > > > > > >> be
> >> > > > > > > > >> > retried or not).
> >> > > > > > > > >> >
> >> > > > > > > > >> > Comments inline.
> >> > > > > > > > >> >
> >> > > > > > > > >> > On Fri, Oct 14, 2016 at 11:17 AM, Cameron Hatfield <
> >> > > > > > > kinguy@gmail.com>
> >> > > > > > > > >> > wrote:
> >> > > > > > > > >> >
> >> > > > > > > > >> > > > Ah- yes good point (to be clear we're not using
> the
> >> > > proxy
> >> > > > > this
> >> > > > > > > way
> >> > > > > > > > >> > > today).
> >> > > > > > > > >> > >
> >> > > > > > > > >> > > > > Due to the source of the
> >> > > > > > > > >> > > > > data (HBase Replication), we cannot guarantee
> >> that a
> >> > > > > single
> >> > > > > > > > >> partition
> >> > > > > > > > >> > > will
> >> > > > > > > > >> > > > > be owned for writes by the same client.
> >> > > > > > > > >> > >
> >> > > > > > > > >> > > > Do you mean you *need* to support multiple
> writers
> >> > > issuing
> >> > > > > > > > >> interleaved
> >> > > > > > > > >> > > > writes or is it just that they might sometimes
> >> > > interleave
> >> > > > > > writes
> >> > > > > > > > and
> >> > > > > > > > >> > you
> >> > > > > > > > >> > > >don't care?
> >> > > > > > > > >> > > How HBase partitions the keys being written
> wouldn't
> >> > have
> >> > > a
> >> > > > > > > one->one
> >> > > > > > > > >> > > mapping with the partitions we would have in HBase.
> >> Even
> >> > > if
> >> > > > we
> >> > > > > > did
> >> > > > > > > > >> have
> >> > > > > > > > >> > > that alignment when the cluster first started,
> HBase
> >> > will
> >> > > > > > > rebalance
> >> > > > > > > > >> what
> >> > > > > > > > >> > > servers own what partitions, as well as split and
> >> merge
> >> > > > > > partitions
> >> > > > > > > > >> that
> >> > > > > > > > >> > > already exist, causing eventual drift from one log
> >> per
> >> > > > > > partition.
> >> > > > > > > > >> > > Because we want ordering guarantees per key (row in
> >> > > hbase),
> >> > > > we
> >> > > > > > > > >> partition
> >> > > > > > > > >> > > the logs by the key. Since multiple writers are
> >> possible
> >> > > per
> >> > > > > > range
> >> > > > > > > > of
> >> > > > > > > > >> > keys
> >> > > > > > > > >> > > (due to the aforementioned rebalancing / splitting
> /
> >> etc
> >> > > of
> >> > > > > > > hbase),
> >> > > > > > > > we
> >> > > > > > > > >> > > cannot use the core library due to requiring a
> single
> >> > > writer
> >> > > > > for
> >> > > > > > > > >> > ordering.
> >> > > > > > > > >> > >
> >> > > > > > > > >> > > But, for a single log, we don't really care about
> >> > ordering
> >> > > > > aside
> >> > > > > > > > from
> >> > > > > > > > >> at
> >> > > > > > > > >> > > the per-key level. So all we really need to be able
> >> to
> >> > > > handle
> >> > > > > is
> >> > > > > > > > >> > preventing
> >> > > > > > > > >> > > duplicates when a failure occurs, and ordering
> >> > consistency
> >> > > > > > across
> >> > > > > > > > >> > requests
> >> > > > > > > > >> > > from a single client.
> >> > > > > > > > >> > >
> >> > > > > > > > >> > > So our general requirements are:
> >> > > > > > > > >> > > Write A, Write B
> >> > > > > > > > >> > > Timeline: A -> B
> >> > > > > > > > >> > > Request B is only made after A has successfully
> >> returned
> >> > > > > > (possibly
> >> > > > > > > > >> after
> >> > > > > > > > >> > > retries)
> >> > > > > > > > >> > >
> >> > > > > > > > >> > > 1) If the write succeeds, it will be durably
> exposed
> >> to
> >> > > > > clients
> >> > > > > > > > within
> >> > > > > > > > >> > some
> >> > > > > > > > >> > > bounded time frame
> >> > > > > > > > >> > >
> >> > > > > > > > >> >
> >> > > > > > > > >> > Guaranteed.
> >> > > > > > > > >> >
> >> > > > > > > > >>
> >> > > > > > > > >> > > 2) If A succeeds and B succeeds, the ordering for
> the
> >> > log
> >> > > > will
> >> > > > > > be
> >> > > > > > > A
> >> > > > > > > > >> and
> >> > > > > > > > >> > > then B
> >> > > > > > > > >> > >
> >> > > > > > > > >> >
> >> > > > > > > > >> > If I understand correctly here, B is only sent after
> A
> >> is
> >> > > > > > returned,
> >> > > > > > > > >> right?
> >> > > > > > > > >> > If that's the case, It is guaranteed.
> >> > > > > > > > >>
> >> > > > > > > > >>
> >> > > > > > > > >>
> >> > > > > > > > >> >
> >> > > > > > > > >> >
> >> > > > > > > > >> >
> >> > > > > > > > >> > > 3) If A fails due to an error that can be relied on
> >> to
> >> > > *not*
> >> > > > > be
> >> > > > > > a
> >> > > > > > > > lost
> >> > > > > > > > >> > ack
> >> > > > > > > > >> > > problem, it will never be exposed to the client, so
> >> it
> >> > may
> >> > > > > > > > (depending
> >> > > > > > > > >> on
> >> > > > > > > > >> > > the error) be retried immediately
> >> > > > > > > > >> > >
> >> > > > > > > > >> >
> >> > > > > > > > >> > If it is not a lost-ack problem, the entry will be
> >> > exposed.
> >> > > it
> >> > > > > is
> >> > > > > > > > >> > guaranteed.
> >> > > > > > > > >>
> >> > > > > > > > >> Let me try rephrasing the questions, to make sure I'm
> >> > > > > understanding
> >> > > > > > > > >> correctly:
> >> > > > > > > > >> If A fails, with an error such as "Unable to create
> >> > connection
> >> > > > to
> >> > > > > > > > >> bookkeeper server", that would be the type of error we
> >> would
> >> > > > > expect
> >> > > > > > to
> >> > > > > > > > be
> >> > > > > > > > >> able to retry immediately, as that result means no
> action
> >> > was
> >> > > > > taken
> >> > > > > > on
> >> > > > > > > > any
> >> > > > > > > > >> log / etc, so no entry could have been created. This is
> >> > > > different
> >> > > > > > > then a
> >> > > > > > > > >> "Connection Timeout" exception, as we just might not
> have
> >> > > > gotten a
> >> > > > > > > > >> response
> >> > > > > > > > >> in time.
> >> > > > > > > > >>
> >> > > > > > > > >>
> >> > > > > > > > > Gotcha.
> >> > > > > > > > >
> >> > > > > > > > > The response code returned from proxy can tell if a
> >> failure
> >> > can
> >> > > > be
> >> > > > > > > > retried
> >> > > > > > > > > safely or not. (We might need to make them well
> >> documented)
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > >>
> >> > > > > > > > >> >
> >> > > > > > > > >> >
> >> > > > > > > > >> > > 4) If A fails due to an error that could be a lost
> >> ack
> >> > > > problem
> >> > > > > > > > >> (network
> >> > > > > > > > >> > > connectivity / etc), within a bounded time frame it
> >> > should
> >> > > > be
> >> > > > > > > > >> possible to
> >> > > > > > > > >> > > find out if the write succeed or failed. Either by
> >> > reading
> >> > > > > from
> >> > > > > > > some
> >> > > > > > > > >> > > checkpoint of the log for the changes that should
> >> have
> >> > > been
> >> > > > > made
> >> > > > > > > or
> >> > > > > > > > >> some
> >> > > > > > > > >> > > other possible server-side support.
> >> > > > > > > > >> > >
> >> > > > > > > > >> >
> >> > > > > > > > >> > If I understand this correctly, it is a duplication
> >> issue,
> >> > > > > right?
> >> > > > > > > > >> >
> >> > > > > > > > >> > Can a de-duplication solution work here? Either DL or
> >> your
> >> > > > > client
> >> > > > > > > does
> >> > > > > > > > >> the
> >> > > > > > > > >> > de-duplication?
> >> > > > > > > > >> >
> >> > > > > > > > >>
> >> > > > > > > > >> The requirements I'm mentioning are the ones needed for
> >> > > > > client-side
> >> > > > > > > > >> dedupping. Since if I can guarantee writes being
> exposed
> >> > > within
> >> > > > > some
> >> > > > > > > > time
> >> > > > > > > > >> frame, and I can never get into an inconsistently
> ordered
> >> > > state
> >> > > > > when
> >> > > > > > > > >> successes happen, when an error occurs, I can always
> wait
> >> > for
> >> > > > max
> >> > > > > > time
> >> > > > > > > > >> frame, read the latest writes, and then dedup locally
> >> > against
> >> > > > the
> >> > > > > > > > request
> >> > > > > > > > >> I
> >> > > > > > > > >> just made.
> >> > > > > > > > >>
> >> > > > > > > > >> The main thing about that timeframe is that its
> basically
> >> > the
> >> > > > > > addition
> >> > > > > > > > of
> >> > > > > > > > >> every timeout, all the way down in the system, combined
> >> with
> >> > > > > > whatever
> >> > > > > > > > >> flushing / caching / etc times are at the bookkeeper /
> >> > client
> >> > > > > level
> >> > > > > > > for
> >> > > > > > > > >> when values are exposed
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > > Gotcha.
> >> > > > > > > > >
> >> > > > > > > > >>
> >> > > > > > > > >>
> >> > > > > > > > >> >
> >> > > > > > > > >> > Is there any ways to identify your write?
> >> > > > > > > > >> >
> >> > > > > > > > >> > I can think of a case as follow - I want to know what
> >> is
> >> > > your
> >> > > > > > > expected
> >> > > > > > > > >> > behavior from the log.
> >> > > > > > > > >> >
> >> > > > > > > > >> > a)
> >> > > > > > > > >> >
> >> > > > > > > > >> > If a hbase region server A writes a change of key K
> to
> >> the
> >> > > > log,
> >> > > > > > the
> >> > > > > > > > >> change
> >> > > > > > > > >> > is successfully made to the log;
> >> > > > > > > > >> > but server A is down before receiving the change.
> >> > > > > > > > >> > region server B took over the region that contains K,
> >> what
> >> > > > will
> >> > > > > B
> >> > > > > > > do?
> >> > > > > > > > >> >
> >> > > > > > > > >>
> >> > > > > > > > >> HBase writes in large chunks (WAL Logs), which its
> >> > replication
> >> > > > > > system
> >> > > > > > > > then
> >> > > > > > > > >> handles by replaying in the case of failure. If I'm in
> a
> >> > > middle
> >> > > > > of a
> >> > > > > > > > log,
> >> > > > > > > > >> and the whole region goes down and gets rescheduled
> >> > > elsewhere, I
> >> > > > > > will
> >> > > > > > > > >> start
> >> > > > > > > > >> back up from the beginning of the log I was in the
> middle
> >> > of.
> >> > > > > Using
> >> > > > > > > > >> checkpointing + deduping, we should be able to find out
> >> > where
> >> > > we
> >> > > > > > left
> >> > > > > > > > off
> >> > > > > > > > >> in the log.
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > >> >
> >> > > > > > > > >> >
> >> > > > > > > > >> > b) same as a). but server A was just network
> >> partitioned.
> >> > > will
> >> > > > > > both
> >> > > > > > > A
> >> > > > > > > > >> and B
> >> > > > > > > > >> > write the change of key K?
> >> > > > > > > > >> >
> >> > > > > > > > >>
> >> > > > > > > > >> HBase gives us some guarantees around network
> partitions
> >> > > > > > (Consistency
> >> > > > > > > > over
> >> > > > > > > > >> availability for HBase). HBase is a single-master
> >> failover
> >> > > > > recovery
> >> > > > > > > type
> >> > > > > > > > >> of
> >> > > > > > > > >> system, with zookeeper-based guarantees for single
> owners
> >> > > > > (writers)
> >> > > > > > > of a
> >> > > > > > > > >> range of data.
> >> > > > > > > > >>
> >> > > > > > > > >> >
> >> > > > > > > > >> >
> >> > > > > > > > >> >
> >> > > > > > > > >> > > 5) If A is turned into multiple batches (one large
> >> > request
> >> > > > > gets
> >> > > > > > > > split
> >> > > > > > > > >> > into
> >> > > > > > > > >> > > multiple smaller ones to the bookkeeper backend,
> due
> >> to
> >> > > log
> >> > > > > > > rolling
> >> > > > > > > > /
> >> > > > > > > > >> > size
> >> > > > > > > > >> > > / etc):
> >> > > > > > > > >> > >   a) The ordering of entries within batches have
> >> > ordering
> >> > > > > > > > consistence
> >> > > > > > > > >> > with
> >> > > > > > > > >> > > the original request, when exposed in the log
> (though
> >> > they
> >> > > > may
> >> > > > > > be
> >> > > > > > > > >> > > interleaved with other requests)
> >> > > > > > > > >> > >   b) The ordering across batches have ordering
> >> > consistence
> >> > > > > with
> >> > > > > > > the
> >> > > > > > > > >> > > original request, when exposed in the log (though
> >> they
> >> > may
> >> > > > be
> >> > > > > > > > >> interleaved
> >> > > > > > > > >> > > with other requests)
> >> > > > > > > > >> > >   c) If a batch fails, and cannot be retried / is
> >> > > > > unsuccessfully
> >> > > > > > > > >> retried,
> >> > > > > > > > >> > > all batches after the failed batch should not be
> >> exposed
> >> > > in
> >> > > > > the
> >> > > > > > > log.
> >> > > > > > > > >> > Note:
> >> > > > > > > > >> > > The batches before and including the failed batch,
> >> that
> >> > > > ended
> >> > > > > up
> >> > > > > > > > >> > > succeeding, can show up in the log, again within
> some
> >> > > > bounded
> >> > > > > > time
> >> > > > > > > > >> range
> >> > > > > > > > >> > > for reads by a client.
> >> > > > > > > > >> > >
> >> > > > > > > > >> >
> >> > > > > > > > >> > There is a method 'writeBulk' in DistributedLogClient
> >> can
> >> > > > > achieve
> >> > > > > > > this
> >> > > > > > > > >> > guarantee.
> >> > > > > > > > >> >
> >> > > > > > > > >> > However, I am not very sure about how will you turn A
> >> into
> >> > > > > > batches.
> >> > > > > > > If
> >> > > > > > > > >> you
> >> > > > > > > > >> > are dividing A into batches,
> >> > > > > > > > >> > you can simply control the application write sequence
> >> to
> >> > > > achieve
> >> > > > > > the
> >> > > > > > > > >> > guarantee here.
> >> > > > > > > > >> >
> >> > > > > > > > >> > Can you explain more about this?
> >> > > > > > > > >> >
> >> > > > > > > > >>
> >> > > > > > > > >> In this case, by batches I mean what the proxy does
> with
> >> the
> >> > > > > single
> >> > > > > > > > >> request
> >> > > > > > > > >> that I send it. If the proxy decides it needs to turn
> my
> >> > > single
> >> > > > > > > request
> >> > > > > > > > >> into multiple batches of requests, due to log rolling,
> >> size
> >> > > > > > > limitations,
> >> > > > > > > > >> etc, those would be the guarantees I need to be able to
> >> > > > > reduplicate
> >> > > > > > on
> >> > > > > > > > the
> >> > > > > > > > >> client side.
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > > A single record written by #write and A record set (set
> of
> >> > > > records)
> >> > > > > > > > > written by #writeRecordSet are atomic - they will not be
> >> > broken
> >> > > > > down
> >> > > > > > > into
> >> > > > > > > > > entries (batches). With the correct response code, you
> >> would
> >> > be
> >> > > > > able
> >> > > > > > to
> >> > > > > > > > > tell if it is a lost-ack failure or not. However there
> is
> >> a
> >> > > size
> >> > > > > > > > limitation
> >> > > > > > > > > for this - it can't not go beyond 1MB for current
> >> > > implementation.
> >> > > > > > > > >
> >> > > > > > > > > What is your expected record size?
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > >>
> >> > > > > > > > >> >
> >> > > > > > > > >> >
> >> > > > > > > > >> > >
> >> > > > > > > > >> > > Since we can guarantee per-key ordering on the
> client
> >> > > side,
> >> > > > we
> >> > > > > > > > >> guarantee
> >> > > > > > > > >> > > that there is a single writer per-key, just not per
> >> log.
> >> > > > > > > > >> >
> >> > > > > > > > >> >
> >> > > > > > > > >> > Do you need fencing guarantee in the case of network
> >> > > partition
> >> > > > > > > causing
> >> > > > > > > > >> > two-writers?
> >> > > > > > > > >> >
> >> > > > > > > > >> >
> >> > > > > > > > >> > > So if there was a
> >> > > > > > > > >> > > way to guarantee a single write request as being
> >> written
> >> > > or
> >> > > > > not,
> >> > > > > > > > >> within a
> >> > > > > > > > >> > > certain time frame (since failures should be rare
> >> > anyways,
> >> > > > > this
> >> > > > > > is
> >> > > > > > > > >> fine
> >> > > > > > > > >> > if
> >> > > > > > > > >> > > it is expensive), we can then have the client
> >> guarantee
> >> > > the
> >> > > > > > > ordering
> >> > > > > > > > >> it
> >> > > > > > > > >> > > needs.
> >> > > > > > > > >> > >
> >> > > > > > > > >> >
> >> > > > > > > > >> > This sounds an 'exact-once' write (regarding retries)
> >> > > > > requirement
> >> > > > > > to
> >> > > > > > > > me,
> >> > > > > > > > >> > right?
> >> > > > > > > > >> >
> >> > > > > > > > >> Yes. I'm curious of how this issue is handled by
> >> Manhattan,
> >> > > > since
> >> > > > > > you
> >> > > > > > > > can
> >> > > > > > > > >> imagine a data store that ends up getting multiple
> writes
> >> > for
> >> > > > the
> >> > > > > > same
> >> > > > > > > > put
> >> > > > > > > > >> / get / etc, would be harder to use, and we are
> basically
> >> > > trying
> >> > > > > to
> >> > > > > > > > create
> >> > > > > > > > >> a log like that for HBase.
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > > Are you guys replacing HBase WAL?
> >> > > > > > > > >
> >> > > > > > > > > In Manhattan case, the request will be first written to
> DL
> >> > > > streams
> >> > > > > by
> >> > > > > > > > > Manhattan coordinator. The Manhattan replica then will
> >> read
> >> > > from
> >> > > > > the
> >> > > > > > DL
> >> > > > > > > > > streams and apply the change. In the lost-ack case, the
> MH
> >> > > > > > coordinator
> >> > > > > > > > will
> >> > > > > > > > > just fail the request to client.
> >> > > > > > > > >
> >> > > > > > > > > My feeling here is your usage for HBase is a bit
> different
> >> > from
> >> > > > how
> >> > > > > > we
> >> > > > > > > > use
> >> > > > > > > > > DL in Manhattan. It sounds like you read from a source
> >> (HBase
> >> > > > WAL)
> >> > > > > > and
> >> > > > > > > > > write to DL. But I might be wrong.
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > >>
> >> > > > > > > > >> >
> >> > > > > > > > >> >
> >> > > > > > > > >> > >
> >> > > > > > > > >> > >
> >> > > > > > > > >> > > > Cameron:
> >> > > > > > > > >> > > > Another thing we've discussed but haven't really
> >> > thought
> >> > > > > > > through -
> >> > > > > > > > >> > > > We might be able to support some kind of epoch
> >> write
> >> > > > > request,
> >> > > > > > > > where
> >> > > > > > > > >> the
> >> > > > > > > > >> > > > epoch is guaranteed to have changed if the writer
> >> has
> >> > > > > changed
> >> > > > > > or
> >> > > > > > > > the
> >> > > > > > > > >> > > ledger
> >> > > > > > > > >> > > > was ever fenced off. Writes include an epoch and
> >> are
> >> > > > > rejected
> >> > > > > > if
> >> > > > > > > > the
> >> > > > > > > > >> > > epoch
> >> > > > > > > > >> > > > has changed.
> >> > > > > > > > >> > > > With a mechanism like this, fencing the ledger
> off
> >> > > after a
> >> > > > > > > failure
> >> > > > > > > > >> > would
> >> > > > > > > > >> > > > ensure any pending writes had either been written
> >> or
> >> > > would
> >> > > > > be
> >> > > > > > > > >> rejected.
> >> > > > > > > > >> > >
> >> > > > > > > > >> > > The issue would be how I guarantee the write I
> wrote
> >> to
> >> > > the
> >> > > > > > server
> >> > > > > > > > was
> >> > > > > > > > >> > > written. Since a network issue could happen on the
> >> send
> >> > of
> >> > > > the
> >> > > > > > > > >> request,
> >> > > > > > > > >> > or
> >> > > > > > > > >> > > on the receive of the success response, an epoch
> >> > wouldn't
> >> > > > tell
> >> > > > > > me
> >> > > > > > > > if I
> >> > > > > > > > >> > can
> >> > > > > > > > >> > > successfully retry, as it could be successfully
> >> written
> >> > > but
> >> > > > > AWS
> >> > > > > > > > >> dropped
> >> > > > > > > > >> > the
> >> > > > > > > > >> > > connection for the success response. Since the
> epoch
> >> > would
> >> > > > be
> >> > > > > > the
> >> > > > > > > > same
> >> > > > > > > > >> > > (same ledger), I could write duplicates.
> >> > > > > > > > >> > >
> >> > > > > > > > >> > >
> >> > > > > > > > >> > > > We are currently proposing adding a transaction
> >> > semantic
> >> > > > to
> >> > > > > dl
> >> > > > > > > to
> >> > > > > > > > >> get
> >> > > > > > > > >> > rid
> >> > > > > > > > >> > > > of the size limitation and the unaware-ness in
> the
> >> > proxy
> >> > > > > > client.
> >> > > > > > > > >> Here
> >> > > > > > > > >> > is
> >> > > > > > > > >> > > > our idea -
> >> > > > > > > > >> > > > http://mail-archives.apache.or
> >> g/mod_mbox/incubator-
> >> > > > > > > distributedlog
> >> > > > > > > > >> > > -dev/201609.mbox/%3cCAAC6BxP5Y
> >> yEHwG0ZCF5soh42X=xuYwYm
> >> > > > > > > > >> > > <http://mail-archives.apache.
> org/mod_mbox/incubator-
> >> > > > > > > > >> > distributedlog%0A-dev/201609.mbox/%
> >> > > 3cCAAC6BxP5YyEHwG0ZCF5soh
> >> > > > > > > > 42X=xuYwYm>
> >> > > > > > > > >> > > L4nXsYBYiofzxpVk6g@mail.gmail.com%3e
> >> > > > > > > > >> > >
> >> > > > > > > > >> > > > I am not sure if your idea is similar as ours.
> but
> >> > we'd
> >> > > > like
> >> > > > > > to
> >> > > > > > > > >> > > collaborate
> >> > > > > > > > >> > > > with the community if anyone has the similar
> idea.
> >> > > > > > > > >> > >
> >> > > > > > > > >> > > Our use case would be covered by transaction
> support,
> >> > but
> >> > > > I'm
> >> > > > > > > unsure
> >> > > > > > > > >> if
> >> > > > > > > > >> > we
> >> > > > > > > > >> > > would need something that heavy weight for the
> >> > guarantees
> >> > > we
> >> > > > > > need.
> >> > > > > > > > >> > >
> >> > > > > > > > >> >
> >> > > > > > > > >> > >
> >> > > > > > > > >> > > Basically, the high level requirement here is
> >> "Support
> >> > > > > > consistent
> >> > > > > > > > >> write
> >> > > > > > > > >> > > ordering for single-writer-per-key,
> >> > multi-writer-per-log".
> >> > > > My
> >> > > > > > > hunch
> >> > > > > > > > is
> >> > > > > > > > >> > > that, with some added guarantees to the proxy (if
> it
> >> > isn't
> >> > > > > > already
> >> > > > > > > > >> > > supported), and some custom client code on our side
> >> for
> >> > > > > removing
> >> > > > > > > the
> >> > > > > > > > >> > > entries that actually succeed to write to
> >> DistributedLog
> >> > > > from
> >> > > > > > the
> >> > > > > > > > >> request
> >> > > > > > > > >> > > that failed, it should be a relatively easy thing
> to
> >> > > > support.
> >> > > > > > > > >> > >
> >> > > > > > > > >> >
> >> > > > > > > > >> > Yup. I think it should not be very difficult to
> >> support.
> >> > > There
> >> > > > > > might
> >> > > > > > > > be
> >> > > > > > > > >> > some changes in the server side.
> >> > > > > > > > >> > Let's figure out what will the changes be. Are you
> guys
> >> > > > > interested
> >> > > > > > > in
> >> > > > > > > > >> > contributing?
> >> > > > > > > > >> >
> >> > > > > > > > >> > Yes, we would be.
> >> > > > > > > > >>
> >> > > > > > > > >> As a note, the one thing that we see as an issue with
> the
> >> > > client
> >> > > > > > side
> >> > > > > > > > >> dedupping is how to bound the range of data that needs
> >> to be
> >> > > > > looked
> >> > > > > > at
> >> > > > > > > > for
> >> > > > > > > > >> deduplication. As you can imagine, it is pretty easy to
> >> > bound
> >> > > > the
> >> > > > > > > bottom
> >> > > > > > > > >> of
> >> > > > > > > > >> the range, as that it just regular checkpointing of the
> >> DSLN
> >> > > > that
> >> > > > > is
> >> > > > > > > > >> returned. I'm still not sure if there is any nice way
> to
> >> > time
> >> > > > > bound
> >> > > > > > > the
> >> > > > > > > > >> top
> >> > > > > > > > >> end of the range, especially since the proxy owns
> >> sequence
> >> > > > numbers
> >> > > > > > > > (which
> >> > > > > > > > >> makes sense). I am curious if there is more that can be
> >> done
> >> > > if
> >> > > > > > > > >> deduplication is on the server side. However the main
> >> minus
> >> > I
> >> > > > see
> >> > > > > of
> >> > > > > > > > >> server
> >> > > > > > > > >> side deduplication is that instead of running
> contingent
> >> on
> >> > > > there
> >> > > > > > > being
> >> > > > > > > > a
> >> > > > > > > > >> failed client request, instead it would have to run
> every
> >> > > time a
> >> > > > > > write
> >> > > > > > > > >> happens.
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > > For a reliable dedup, we probably need
> >> fence-then-getLastDLSN
> >> > > > > > > operation -
> >> > > > > > > > > so it would guarantee that any non-completed requests
> >> issued
> >> > > > > > (lost-ack
> >> > > > > > > > > requests) before this fence-then-getLastDLSN operation
> >> will
> >> > be
> >> > > > > failed
> >> > > > > > > and
> >> > > > > > > > > they will never land at the log.
> >> > > > > > > > >
> >> > > > > > > > > the pseudo code would look like below -
> >> > > > > > > > >
> >> > > > > > > > > write(request) onFailure { t =>
> >> > > > > > > > >
> >> > > > > > > > > if (t is timeout exception) {
> >> > > > > > > > >
> >> > > > > > > > > DLSN lastDLSN = fenceThenGetLastDLSN()
> >> > > > > > > > > DLSN lastCheckpointedDLSN = ...;
> >> > > > > > > > > // find if the request lands between [lastDLSN,
> >> > > > > > lastCheckpointedDLSN].
> >> > > > > > > > > // if it exists, the write succeed; otherwise retry.
> >> > > > > > > > >
> >> > > > > > > > > }
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > > }
> >> > > > > > > > >
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > Just realized the idea is same as what Leigh raised in the
> >> > > previous
> >> > > > > > email
> >> > > > > > > > about 'epoch write'. Let me explain more about this idea
> >> > (Leigh,
> >> > > > feel
> >> > > > > > > free
> >> > > > > > > > to jump in to fill up your idea).
> >> > > > > > > >
> >> > > > > > > > - when a log stream is owned,  the proxy use the last
> >> > transaction
> >> > > > id
> >> > > > > as
> >> > > > > > > the
> >> > > > > > > > epoch
> >> > > > > > > > - when a client connects (handshake with the proxy), it
> will
> >> > get
> >> > > > the
> >> > > > > > > epoch
> >> > > > > > > > for the stream.
> >> > > > > > > > - the writes issued by this client will carry the epoch to
> >> the
> >> > > > proxy.
> >> > > > > > > > - add a new rpc - fenceThenGetLastDLSN - it would force
> the
> >> > proxy
> >> > > > to
> >> > > > > > bump
> >> > > > > > > > the epoch.
> >> > > > > > > > - if fenceThenGetLastDLSN happened, all the outstanding
> >> writes
> >> > > with
> >> > > > > old
> >> > > > > > > > epoch will be rejected with exceptions (e.g. EpochFenced).
> >> > > > > > > > - The DLSN returned from fenceThenGetLastDLSN can be used
> as
> >> > the
> >> > > > > bound
> >> > > > > > > for
> >> > > > > > > > deduplications on failures.
> >> > > > > > > >
> >> > > > > > > > Cameron, does this sound like a solution for your use case?
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > >>
> >> > > > > > > > >> Maybe something that could fit a similar need that
> Kafka
> >> > does
> >> > > > (the
> >> > > > > > > last
> >> > > > > > > > >> store value for a particular key in a log), such that
> on
> >> a
> >> > per
> >> > > > key
> >> > > > > > > basis
> >> > > > > > > > >> there could be a sequence number that support
> >> deduplication?
> >> > > > Cost
> >> > > > > > > seems
> >> > > > > > > > >> like it would be high however, and I'm not even sure if
> >> > > > bookkeeper
> >> > > > > > > > >> supports
> >> > > > > > > > >> it.
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > >> Cheers,
> >> > > > > > > > >> Cameron
> >> > > > > > > > >>
> >> > > > > > > > >> >
> >> > > > > > > > >> > >
> >> > > > > > > > >> > > Thanks,
> >> > > > > > > > >> > > Cameron
> >> > > > > > > > >> > >
> >> > > > > > > > >> > >
> >> > > > > > > > >> > > On Sat, Oct 8, 2016 at 7:35 AM, Leigh Stewart
> >> > > > > > > > >> > <lstewart@twitter.com.invalid
> >> > > > > > > > >> > > >
> >> > > > > > > > >> > > wrote:
> >> > > > > > > > >> > >
> >> > > > > > > > >> > > > Cameron:
> >> > > > > > > > >> > > > Another thing we've discussed but haven't really
> >> > thought
> >> > > > > > > through -
> >> > > > > > > > >> > > > We might be able to support some kind of epoch
> >> write
> >> > > > > request,
> >> > > > > > > > where
> >> > > > > > > > >> the
> >> > > > > > > > >> > > > epoch is guaranteed to have changed if the writer
> >> has
> >> > > > > changed
> >> > > > > > or
> >> > > > > > > > the
> >> > > > > > > > >> > > ledger
> >> > > > > > > > >> > > > was ever fenced off. Writes include an epoch and
> >> are
> >> > > > > rejected
> >> > > > > > if
> >> > > > > > > > the
> >> > > > > > > > >> > > epoch
> >> > > > > > > > >> > > > has changed.
> >> > > > > > > > >> > > > With a mechanism like this, fencing the ledger
> off
> >> > > after a
> >> > > > > > > failure
> >> > > > > > > > >> > would
> >> > > > > > > > >> > > > ensure any pending writes had either been written
> >> or
> >> > > would
> >> > > > > be
> >> > > > > > > > >> rejected.
> >> > > > > > > > >> > > >
> >> > > > > > > > >> > > >
> >> > > > > > > > >> > > > On Sat, Oct 8, 2016 at 7:10 AM, Sijie Guo <
> >> > > > sijie@apache.org
> >> > > > > >
> >> > > > > > > > wrote:
> >> > > > > > > > >> > > >
> >> > > > > > > > >> > > > > Cameron,
> >> > > > > > > > >> > > > >
> >> > > > > > > > >> > > > > I think both Leigh and Xi had made a few good
> >> points
> >> > > > about
> >> > > > > > > your
> >> > > > > > > > >> > > question.
> >> > > > > > > > >> > > > >
> >> > > > > > > > >> > > > > To add one more point to your question - "but I
> >> am
> >> > not
> >> > > > > > > > >> > > > > 100% of how all of the futures in the code
> handle
> >> > > > > failures.
> >> > > > > > > > >> > > > > If not, where in the code would be the relevant
> >> > places
> >> > > > to
> >> > > > > > add
> >> > > > > > > > the
> >> > > > > > > > >> > > ability
> >> > > > > > > > >> > > > > to do this, and would the project be interested
> >> in a
> >> > > > pull
> >> > > > > > > > >> request?"
> >> > > > > > > > >> > > > >
> >> > > > > > > > >> > > > > The current proxy and client logic doesn't do
> >> > > perfectly
> >> > > > on
> >> > > > > > > > >> handling
> >> > > > > > > > >> > > > > failures (duplicates) - the strategy now is the
> >> > client
> >> > > > > will
> >> > > > > > > > retry
> >> > > > > > > > >> as
> >> > > > > > > > >> > > best
> >> > > > > > > > >> > > > > at it can before throwing exceptions to users.
> >> The
> >> > > code
> >> > > > > you
> >> > > > > > > are
> >> > > > > > > > >> > looking
> >> > > > > > > > >> > > > for
> >> > > > > > > > >> > > > > - it is on BKLogSegmentWriter for the proxy
> >> handling
> >> > > > > writes
> >> > > > > > > and
> >> > > > > > > > >> it is
> >> > > > > > > > >> > > on
> >> > > > > > > > >> > > > > DistributedLogClientImpl for the proxy client
> >> > handling
> >> > > > > > > responses
> >> > > > > > > > >> from
> >> > > > > > > > >> > > > > proxies. Does this help you?
> >> > > > > > > > >> > > > >
> >> > > > > > > > >> > > > > And also, you are welcome to contribute the
> pull
> >> > > > requests.
> >> > > > > > > > >> > > > >
> >> > > > > > > > >> > > > > - Sijie
> >> > > > > > > > >> > > > >
> >> > > > > > > > >> > > > >
> >> > > > > > > > >> > > > >
> >> > > > > > > > >> > > > > On Tue, Oct 4, 2016 at 3:39 PM, Cameron
> Hatfield
> >> <
> >> > > > > > > > >> kinguy@gmail.com>
> >> > > > > > > > >> > > > wrote:
> >> > > > > > > > >> > > > >
> >> > > > > > > > >> > > > > > I have a question about the Proxy Client.
> >> > Basically,
> >> > > > for
> >> > > > > > our
> >> > > > > > > > use
> >> > > > > > > > >> > > cases,
> >> > > > > > > > >> > > > > we
> >> > > > > > > > >> > > > > > want to guarantee ordering at the key level,
> >> > > > > irrespective
> >> > > > > > of
> >> > > > > > > > the
> >> > > > > > > > >> > > > ordering
> >> > > > > > > > >> > > > > > of the partition it may be assigned to as a
> >> whole.
> >> > > Due
> >> > > > > to
> >> > > > > > > the
> >> > > > > > > > >> > source
> >> > > > > > > > >> > > of
> >> > > > > > > > >> > > > > the
> >> > > > > > > > >> > > > > > data (HBase Replication), we cannot guarantee
> >> > that a
> >> > > > > > single
> >> > > > > > > > >> > partition
> >> > > > > > > > >> > > > > will
> >> > > > > > > > >> > > > > > be owned for writes by the same client. This
> >> means
> >> > > the
> >> > > > > > proxy
> >> > > > > > > > >> client
> >> > > > > > > > >> > > > works
> >> > > > > > > > >> > > > > > well (since we don't care which proxy owns
> the
> >> > > > partition
> >> > > > > > we
> >> > > > > > > > are
> >> > > > > > > > >> > > writing
> >> > > > > > > > >> > > > > > to).
> >> > > > > > > > >> > > > > >
> >> > > > > > > > >> > > > > >
> >> > > > > > > > >> > > > > > However, the guarantees we need when writing
> a
> >> > batch
> >> > > > > > > consists
> >> > > > > > > > >> of:
> >> > > > > > > > >> > > > > > Definition of a Batch: The set of records
> sent
> >> to
> >> > > the
> >> > > > > > > > writeBatch
> >> > > > > > > > >> > > > endpoint
> >> > > > > > > > >> > > > > > on the proxy
> >> > > > > > > > >> > > > > >
> >> > > > > > > > >> > > > > > 1. Batch success: If the client receives a
> >> success
> >> > > > from
> >> > > > > > the
> >> > > > > > > > >> proxy,
> >> > > > > > > > >> > > then
> >> > > > > > > > >> > > > > > that batch is successfully written
> >> > > > > > > > >> > > > > >
> >> > > > > > > > >> > > > > > 2. Inter-Batch ordering : Once a batch has
> been
> >> > > > written
> >> > > > > > > > >> > successfully
> >> > > > > > > > >> > > by
> >> > > > > > > > >> > > > > the
> >> > > > > > > > >> > > > > > client, when another batch is written, it
> will
> >> be
> >> > > > > > guaranteed
> >> > > > > > > > to
> >> > > > > > > > >> be
> >> > > > > > > > >> > > > > ordered
> >> > > > > > > > >> > > > > > after the last batch (if it is the same
> >> stream).
> >> > > > > > > > >> > > > > >
> >> > > > > > > > >> > > > > > 3. Intra-Batch ordering: Within a batch of
> >> writes,
> >> > > the
> >> > > > > > > records
> >> > > > > > > > >> will
> >> > > > > > > > >> > > be
> >> > > > > > > > >> > > > > > committed in order
> >> > > > > > > > >> > > > > >
> >> > > > > > > > >> > > > > > 4. Intra-Batch failure ordering: If an
> >> individual
> >> > > > record
> >> > > > > > > fails
> >> > > > > > > > >> to
> >> > > > > > > > >> > > write
> >> > > > > > > > >> > > > > > within a batch, all records after that record
> >> will
> >> > > not
> >> > > > > be
> >> > > > > > > > >> written.
> >> > > > > > > > >> > > > > >
> >> > > > > > > > >> > > > > > 5. Batch Commit: Guarantee that if a batch
> >> > returns a
> >> > > > > > > success,
> >> > > > > > > > it
> >> > > > > > > > >> > will
> >> > > > > > > > >> > > > be
> >> > > > > > > > >> > > > > > written
> >> > > > > > > > >> > > > > >
> >> > > > > > > > >> > > > > > 6. Read-after-write: Once a batch is
> committed,
> >> > > > within a
> >> > > > > > > > limited
> >> > > > > > > > >> > > > > time-frame
> >> > > > > > > > >> > > > > > it will be able to be read. This is required
> in
> >> > the
> >> > > > case
> >> > > > > > of
> >> > > > > > > > >> > failure,
> >> > > > > > > > >> > > so
> >> > > > > > > > >> > > > > > that the client can see what actually got
> >> > > committed. I
> >> > > > > > > believe
> >> > > > > > > > >> the
> >> > > > > > > > >> > > > > > time-frame part could be removed if the
> client
> >> can
> >> > > > send
> >> > > > > in
> >> > > > > > > the
> >> > > > > > > > >> same
> >> > > > > > > > >> > > > > > sequence number that was written previously,
> >> since
> >> > > it
> >> > > > > > would
> >> > > > > > > > then
> >> > > > > > > > >> > fail
> >> > > > > > > > >> > > > and
> >> > > > > > > > >> > > > > > we would know that a read needs to occur.
> >> > > > > > > > >> > > > > >
> >> > > > > > > > >> > > > > >
> >> > > > > > > > >> > > > > > So, my basic question is if this is currently
> >> > > possible
> >> > > > > in
> >> > > > > > > the
> >> > > > > > > > >> > proxy?
> >> > > > > > > > >> > > I
> >> > > > > > > > >> > > > > > don't believe it gives these guarantees as it
> >> > stands
> >> > > > > > today,
> >> > > > > > > > but
> >> > > > > > > > >> I
> >> > > > > > > > >> > am
> >> > > > > > > > >> > > > not
> >> > > > > > > > >> > > > > > 100% of how all of the futures in the code
> >> handle
> >> > > > > > failures.
> >> > > > > > > > >> > > > > > If not, where in the code would be the
> relevant
> >> > > places
> >> > > > > to
> >> > > > > > > add
> >> > > > > > > > >> the
> >> > > > > > > > >> > > > ability
> >> > > > > > > > >> > > > > > to do this, and would the project be
> interested
> >> > in a
> >> > > > > pull
> >> > > > > > > > >> request?
> >> > > > > > > > >> > > > > >
> >> > > > > > > > >> > > > > >
> >> > > > > > > > >> > > > > > Thanks,
> >> > > > > > > > >> > > > > > Cameron
> >> > > > > > > > >> > > > > >
> >> > > > > > > > >> > > > >
> >> > > > > > > > >> > > >
> >> > > > > > > > >> > >
> >> > > > > > > > >> >
> >> > > > > > > > >>
> >> > > > > > > > >
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
> >
>

Re: Proxy Client - Batch Ordering / Commit

Posted by Cameron Hatfield <ki...@gmail.com>.
Also, would it be possible for me to get wiki access so I will be able to
update it / etc?

-Cameron

On Wed, Nov 16, 2016 at 11:59 AM, Cameron Hatfield <ki...@gmail.com> wrote:

> "A couple of questions" is what I originally wrote, and then the following
> happened. Sorry about the large swath of them, making sure my understanding
> of the code base, as well as the DL/Bookkeeper/ZK ecosystem interaction,
> makes sense.
>
> ==General:
> What is an exclusive session? What is it providing over a regular session?
>
>
> ==Proxy side:
> Should a new streamop be added for the fencing operation, or does it make
> sense to piggyback on an existing one (such as write)?
>
> ====getLastDLSN:
> What should be the return result for:
> A new stream
> A new session, after successful fencing
> A new session, after a change in ownership / first starting up
>
> What is the main use case for getLastDLSN(<stream>, false)? Is this to
> guarantee that the recovery process has happened in case of ownership
> failure (I don't have a good understanding of what causes the recovery
> process to happen, especially from the reader side)? Or is it to handle the
> lost-ack problem? Since all the rest of the read related things go through
> the read client, I'm not sure if I see the use case, but it seems like
> there would be a large potential for confusion on which to use. What about
> just a fenceSession op, that always fences, returning the DLSN of the
> fence, and leave the normal getLastDLSN for the regular read client.
>
> ====Fencing:
> When a fence session occurs, what call needs to be made to make sure any
> outstanding writes are flushed and committed (so that we guarantee the
> client will be able to read anything that was in the write queue)?
> Is there a guaranteed ordering for things written in the future queue for
> AsyncLogWriter (I'm not quite confident that I was able to accurately
> follow the logic, as there are many parts of the code that write, have
> queues, heartbeat, etc)?
>
> ====SessionID:
> What is the default sessionid / transactionid for a new stream? I assume
> this would just be the first control record
>
> ======Should all streams have a sessionid by default, regardless if it is
> never used by a client (aka, everytime ownership changes, a new control
> record is generated, and a sessionid is stored)?
> Main edge case that would have to be handled is if a client writes with an
> old sessionid, but the owner has changed and has yet to create a sessionid.
> This should be handled by the "non-matching sessionid" rule, since the
> invalid sessionid wouldn't match the passed sessionid, which should cause
> the client to get a new sessionid.
>
> ======Where in the code does it make sense to own the session, the stream
> interfaces / classes? Should they pass that information down to the ops, or
> do the sessionid check within?
> My first thought would be Stream owns the sessionid, passes it into the
> ops (as either a nullable value, or an invalid default value), which then
> do the sessionid check if they care. The main issue is updating the
> sessionid is a bit backwards, as either every op has the ability to update
> it through some type of return value / direct stream access / etc, or there
> is a special case in the stream for the fence operation / any other
> operation that can update the session.
>
> ======For "the owner of the log stream will first advance the transaction
> id generator to claim a new transaction id and write a control record to
> the log stream. ":
> Should "DistributedLogConstants.CONTROL_RECORD_CONTENT" be the type of
> control record written?
> Should the "writeControlRecord" on the BKAsyncLogWriter be exposed in the
> AsyncLogWriter interface be exposed?  Or even in the one within the segment
> writer? Or should the code be duplicated / pulled out into a helper / etc?
> (Not a big Java person, so any suggestions on the "Java Way", or at least
> the DL way, to do it would be appreciated)
>
> ======Transaction ID:
> The BKLogSegmentWriter ignores the transaction ids from control records
> when it records the "LastTXId." Would that be an issue here for anything?
> It looks like it may do that because it assumes you're calling its local
> function for writing a control record, which uses the lastTxId.
>
>
> ==Thrift Interface:
> ====Should the write response be split out for different calls?
> It seems odd to have a single struct with many optional items that are
> filled depending on the call made for every rpc call. This is mostly a
> curiosity question, since I assume it comes from the general practices from
> using thrift for a while. Would it at least make sense for the
> getLastDLSN/fence endpoint to have a new struct?
>
> ====Any particular error code that makes sense for session fenced? If we
> want to be close to the HTTP errors, looks like 412 (PRECONDITION FAILED)
> might make the most sense, if a bit generic.
>
> 412 def:
> "The precondition given in one or more of the request-header fields
> evaluated to false when it was tested on the server. This response code
> allows the client to place preconditions on the current resource
> metainformation (header field data) and thus prevent the requested method
> from being applied to a resource other than the one intended."
>
> 412 excerpt from the If-match doc:
> "This behavior is most useful when the client wants to prevent an updating
> method, such as PUT, from modifying a resource that has changed since the
> client last retrieved it."
>
> ====Should we return the sessionid to the client in the "fencesession"
> calls?
> Seems like it may be useful when you fence, especially if you have some
> type of custom sequencer where it would make sense for search, or for
> debugging.
> Main minus is that it would be easy for users to create an implicit
> requirement that the sessionid is forever a valid transactionid, which may
> not always be the case long term for the project.
>
>
> ==Client:
> ====What is the proposed process for the client retrieving the new
> sessionid?
> A full reconnect? No special case code, but intrusive on the client side,
> and possibly expensive garbage/processing wise. (though this type of
> failure should hopefully be rare enough to not be a problem)
> A call to reset the sessionid? Less intrusive, all the issues you get with
> mutable object methods that need to be called in a certain order, edge
> cases such as outstanding/buffered requests to the old stream, etc.
> The call could also return the new sessionid, making it a good call for
> storing or debugging the value.
>
> ====Session Fenced failure:
> Will this put the client into a failure state, stopping all future writes
> until fixed?
> Is it even possible to get this error when ownership changes? The
> connection to the new owner should get a new sessionid on connect, so I
> would expect not.
>
> Cheers,
> Cameron
>
> On Tue, Nov 15, 2016 at 2:01 AM, Xi Liu <xi...@gmail.com> wrote:
>
>> Thank you, Cameron. Look forward to your comments.
>>
>> - Xi
>>
>> On Sun, Nov 13, 2016 at 1:21 PM, Cameron Hatfield <ki...@gmail.com>
>> wrote:
>>
>> > Sorry, I've been on vacation for the past week, and heads down for a
>> > release that is using DL at the end of Nov. I'll take a look at this
>> over
>> > the next week, and add any relevant comments. After we are finished with
>> > dev for this release, I am hoping to tackle this next.
>> >
>> > -Cameron
>> >
>> > On Fri, Nov 11, 2016 at 12:07 PM, Sijie Guo <si...@apache.org> wrote:
>> >
>> > > Xi,
>> > >
>> > > Thank you so much for your proposal. I took a look. It looks fine to
>> me.
>> > > Cameron, do you have any comments?
>> > >
>> > > Look forward to your pull requests.
>> > >
>> > > - Sijie
>> > >
>> > >
>> > > On Wed, Nov 9, 2016 at 2:34 AM, Xi Liu <xi...@gmail.com> wrote:
>> > >
>> > > > Cameron,
>> > > >
>> > > > Have you started any work for this? I just updated the proposal
>> page -
>> > > > https://cwiki.apache.org/confluence/display/DL/DP-2+-+
>> > > Epoch+Write+Support
>> > > > Maybe we can work together with this.
>> > > >
>> > > > Sijie, Leigh,
>> > > >
>> > > > can you guys help review this to make sure our proposal is in the
>> right
>> > > > direction?
>> > > >
>> > > > - Xi
>> > > >
>> > > > On Tue, Nov 1, 2016 at 3:05 AM, Sijie Guo <si...@apache.org> wrote:
>> > > >
>> > > > > I created https://issues.apache.org/jira/browse/DL-63 for
>> tracking
>> > the
>> > > > > proposed idea here.
>> > > > >
>> > > > >
>> > > > >
>> > > > > On Wed, Oct 26, 2016 at 4:53 PM, Sijie Guo
>> > <sijieg@twitter.com.invalid
>> > > >
>> > > > > wrote:
>> > > > >
>> > > > > > On Tue, Oct 25, 2016 at 11:30 AM, Cameron Hatfield <
>> > kinguy@gmail.com
>> > > >
>> > > > > > wrote:
>> > > > > >
>> > > > > > > Yes, we are reading the HBase WAL (from their replication
>> plugin
>> > > > > > support),
>> > > > > > > and writing that into DL.
>> > > > > > >
>> > > > > >
>> > > > > > Gotcha.
>> > > > > >
>> > > > > >
>> > > > > > >
>> > > > > > > From the sounds of it, yes, it would. Only thing I would say
>> is
>> > > make
>> > > > > the
>> > > > > > > epoch requirement optional, so that if I client doesn't care
>> > about
>> > > > > dupes
>> > > > > > > they don't have to deal with the process of getting a new
>> epoch.
>> > > > > > >
>> > > > > >
>> > > > > > Yup. This should be optional. I can start a wiki page on how we
>> > want
>> > > to
>> > > > > > implement this. Are you interested in contributing to this?
>> > > > > >
>> > > > > >
>> > > > > > >
>> > > > > > > -Cameron
>> > > > > > >
>> > > > > > > On Wed, Oct 19, 2016 at 7:43 PM, Sijie Guo
>> > > > <sijieg@twitter.com.invalid
>> > > > > >
>> > > > > > > wrote:
>> > > > > > >
>> > > > > > > > On Wed, Oct 19, 2016 at 7:17 PM, Sijie Guo <
>> sijieg@twitter.com
>> > >
>> > > > > wrote:
>> > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > On Monday, October 17, 2016, Cameron Hatfield <
>> > > kinguy@gmail.com>
>> > > > > > > wrote:
>> > > > > > > > >
>> > > > > > > > >> Answer inline:
>> > > > > > > > >>
>> > > > > > > > >> On Mon, Oct 17, 2016 at 11:46 AM, Sijie Guo <
>> > sijie@apache.org
>> > > >
>> > > > > > wrote:
>> > > > > > > > >>
>> > > > > > > > >> > Cameron,
>> > > > > > > > >> >
>> > > > > > > > >> > Thank you for your summary. I liked the discussion
>> here. I
>> > > > also
>> > > > > > > liked
>> > > > > > > > >> the
>> > > > > > > > >> > summary of your requirement - 'single-writer-per-key,
>> > > > > > > > >> > multiple-writers-per-log'. If I understand correctly,
>> the
>> > > core
>> > > > > > > concern
>> > > > > > > > >> here
>> > > > > > > > >> > is almost 'exact-once' write (or a way to explicit tell
>> > if a
>> > > > > write
>> > > > > > > can
>> > > > > > > > >> be
>> > > > > > > > >> > retried or not).
>> > > > > > > > >> >
>> > > > > > > > >> > Comments inline.
>> > > > > > > > >> >
>> > > > > > > > >> > On Fri, Oct 14, 2016 at 11:17 AM, Cameron Hatfield <
>> > > > > > > kinguy@gmail.com>
>> > > > > > > > >> > wrote:
>> > > > > > > > >> >
>> > > > > > > > >> > > > Ah- yes good point (to be clear we're not using the
>> > > proxy
>> > > > > this
>> > > > > > > way
>> > > > > > > > >> > > today).
>> > > > > > > > >> > >
>> > > > > > > > >> > > > > Due to the source of the
>> > > > > > > > >> > > > > data (HBase Replication), we cannot guarantee
>> that a
>> > > > > single
>> > > > > > > > >> partition
>> > > > > > > > >> > > will
>> > > > > > > > >> > > > > be owned for writes by the same client.
>> > > > > > > > >> > >
>> > > > > > > > >> > > > Do you mean you *need* to support multiple writers
>> > > issuing
>> > > > > > > > >> interleaved
>> > > > > > > > >> > > > writes or is it just that they might sometimes
>> > > interleave
>> > > > > > writes
>> > > > > > > > and
>> > > > > > > > >> > you
>> > > > > > > > >> > > >don't care?
>> > > > > > > > >> > > How HBase partitions the keys being written wouldn't
>> > have
>> > > a
>> > > > > > > one->one
>> > > > > > > > >> > > mapping with the partitions we would have in HBase.
>> Even
>> > > if
>> > > > we
>> > > > > > did
>> > > > > > > > >> have
>> > > > > > > > >> > > that alignment when the cluster first started, HBase
>> > will
>> > > > > > > rebalance
>> > > > > > > > >> what
>> > > > > > > > >> > > servers own what partitions, as well as split and
>> merge
>> > > > > > partitions
>> > > > > > > > >> that
>> > > > > > > > >> > > already exist, causing eventual drift from one log
>> per
>> > > > > > partition.
>> > > > > > > > >> > > Because we want ordering guarantees per key (row in
>> > > hbase),
>> > > > we
>> > > > > > > > >> partition
>> > > > > > > > >> > > the logs by the key. Since multiple writers are
>> possible
>> > > per
>> > > > > > range
>> > > > > > > > of
>> > > > > > > > >> > keys
>> > > > > > > > >> > > (due to the aforementioned rebalancing / splitting /
>> etc
>> > > of
>> > > > > > > hbase),
>> > > > > > > > we
>> > > > > > > > >> > > cannot use the core library due to requiring a single
>> > > writer
>> > > > > for
>> > > > > > > > >> > ordering.
>> > > > > > > > >> > >
>> > > > > > > > >> > > But, for a single log, we don't really care about
>> > ordering
>> > > > > aside
>> > > > > > > > from
>> > > > > > > > >> at
>> > > > > > > > >> > > the per-key level. So all we really need to be able
>> to
>> > > > handle
>> > > > > is
>> > > > > > > > >> > preventing
>> > > > > > > > >> > > duplicates when a failure occurs, and ordering
>> > consistency
>> > > > > > across
>> > > > > > > > >> > requests
>> > > > > > > > >> > > from a single client.
>> > > > > > > > >> > >
>> > > > > > > > >> > > So our general requirements are:
>> > > > > > > > >> > > Write A, Write B
>> > > > > > > > >> > > Timeline: A -> B
>> > > > > > > > >> > > Request B is only made after A has successfully
>> returned
>> > > > > > (possibly
>> > > > > > > > >> after
>> > > > > > > > >> > > retries)
>> > > > > > > > >> > >
>> > > > > > > > >> > > 1) If the write succeeds, it will be durably exposed
>> to
>> > > > > clients
>> > > > > > > > within
>> > > > > > > > >> > some
>> > > > > > > > >> > > bounded time frame
>> > > > > > > > >> > >
>> > > > > > > > >> >
>> > > > > > > > >> > Guaranteed.
>> > > > > > > > >> >
>> > > > > > > > >>
>> > > > > > > > >> > > 2) If A succeeds and B succeeds, the ordering for the
>> > log
>> > > > will
>> > > > > > be
>> > > > > > > A
>> > > > > > > > >> and
>> > > > > > > > >> > > then B
>> > > > > > > > >> > >
>> > > > > > > > >> >
>> > > > > > > > >> > If I understand correctly here, B is only sent after A
>> is
>> > > > > > returned,
>> > > > > > > > >> right?
>> > > > > > > > >> > If that's the case, It is guaranteed.
>> > > > > > > > >>
>> > > > > > > > >>
>> > > > > > > > >>
>> > > > > > > > >> >
>> > > > > > > > >> >
>> > > > > > > > >> >
>> > > > > > > > >> > > 3) If A fails due to an error that can be relied on
>> to
>> > > *not*
>> > > > > be
>> > > > > > a
>> > > > > > > > lost
>> > > > > > > > >> > ack
>> > > > > > > > >> > > problem, it will never be exposed to the client, so
>> it
>> > may
>> > > > > > > > (depending
>> > > > > > > > >> on
>> > > > > > > > >> > > the error) be retried immediately
>> > > > > > > > >> > >
>> > > > > > > > >> >
>> > > > > > > > >> > If it is not a lost-ack problem, the entry will be
>> > exposed.
>> > > it
>> > > > > is
>> > > > > > > > >> > guaranteed.
>> > > > > > > > >>
>> > > > > > > > >> Let me try rephrasing the questions, to make sure I'm
>> > > > > understanding
>> > > > > > > > >> correctly:
>> > > > > > > > >> If A fails, with an error such as "Unable to create
>> > connection
>> > > > to
>> > > > > > > > >> bookkeeper server", that would be the type of error we
>> would
>> > > > > expect
>> > > > > > to
>> > > > > > > > be
>> > > > > > > > >> able to retry immediately, as that result means no action
>> > was
>> > > > > taken
>> > > > > > on
>> > > > > > > > any
>> > > > > > > > >> log / etc, so no entry could have been created. This is
>> > > > different
>> > > > > > > then a
>> > > > > > > > >> "Connection Timeout" exception, as we just might not have
>> > > > gotten a
>> > > > > > > > >> response
>> > > > > > > > >> in time.
>> > > > > > > > >>
>> > > > > > > > >>
>> > > > > > > > > Gotcha.
>> > > > > > > > >
>> > > > > > > > > The response code returned from proxy can tell if a
>> failure
>> > can
>> > > > be
>> > > > > > > > retried
>> > > > > > > > > safely or not. (We might need to make them well
>> documented)
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > >>
>> > > > > > > > >> >
>> > > > > > > > >> >
>> > > > > > > > >> > > 4) If A fails due to an error that could be a lost
>> ack
>> > > > problem
>> > > > > > > > >> (network
>> > > > > > > > >> > > connectivity / etc), within a bounded time frame it
>> > should
>> > > > be
>> > > > > > > > >> possible to
>> > > > > > > > >> > > find out if the write succeed or failed. Either by
>> > reading
>> > > > > from
>> > > > > > > some
>> > > > > > > > >> > > checkpoint of the log for the changes that should
>> have
>> > > been
>> > > > > made
>> > > > > > > or
>> > > > > > > > >> some
>> > > > > > > > >> > > other possible server-side support.
>> > > > > > > > >> > >
>> > > > > > > > >> >
>> > > > > > > > >> > If I understand this correctly, it is a duplication
>> issue,
>> > > > > right?
>> > > > > > > > >> >
>> > > > > > > > >> > Can a de-duplication solution work here? Either DL or
>> your
>> > > > > client
>> > > > > > > does
>> > > > > > > > >> the
>> > > > > > > > >> > de-duplication?
>> > > > > > > > >> >
>> > > > > > > > >>
>> > > > > > > > >> The requirements I'm mentioning are the ones needed for
>> > > > > client-side
>> > > > > > > > >> dedupping. Since if I can guarantee writes being exposed
>> > > within
>> > > > > some
>> > > > > > > > time
>> > > > > > > > >> frame, and I can never get into an inconsistently ordered
>> > > state
>> > > > > when
>> > > > > > > > >> successes happen, when an error occurs, I can always wait
>> > for
>> > > > max
>> > > > > > time
>> > > > > > > > >> frame, read the latest writes, and then dedup locally
>> > against
>> > > > the
>> > > > > > > > request
>> > > > > > > > >> I
>> > > > > > > > >> just made.
>> > > > > > > > >>
>> > > > > > > > >> The main thing about that timeframe is that its basically
>> > the
>> > > > > > addition
>> > > > > > > > of
>> > > > > > > > >> every timeout, all the way down in the system, combined
>> with
>> > > > > > whatever
>> > > > > > > > >> flushing / caching / etc times are at the bookkeeper /
>> > client
>> > > > > level
>> > > > > > > for
>> > > > > > > > >> when values are exposed
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > Gotcha.
>> > > > > > > > >
>> > > > > > > > >>
>> > > > > > > > >>
>> > > > > > > > >> >
>> > > > > > > > >> > Is there any ways to identify your write?
>> > > > > > > > >> >
>> > > > > > > > >> > I can think of a case as follow - I want to know what
>> is
>> > > your
>> > > > > > > expected
>> > > > > > > > >> > behavior from the log.
>> > > > > > > > >> >
>> > > > > > > > >> > a)
>> > > > > > > > >> >
>> > > > > > > > >> > If a hbase region server A writes a change of key K to
>> the
>> > > > log,
>> > > > > > the
>> > > > > > > > >> change
>> > > > > > > > >> > is successfully made to the log;
>> > > > > > > > >> > but server A is down before receiving the change.
>> > > > > > > > >> > region server B took over the region that contains K,
>> what
>> > > > will
>> > > > > B
>> > > > > > > do?
>> > > > > > > > >> >
>> > > > > > > > >>
>> > > > > > > > >> HBase writes in large chunks (WAL Logs), which its
>> > replication
>> > > > > > system
>> > > > > > > > then
>> > > > > > > > >> handles by replaying in the case of failure. If I'm in a
>> > > middle
>> > > > > of a
>> > > > > > > > log,
>> > > > > > > > >> and the whole region goes down and gets rescheduled
>> > > elsewhere, I
>> > > > > > will
>> > > > > > > > >> start
>> > > > > > > > >> back up from the beginning of the log I was in the middle
>> > of.
>> > > > > Using
>> > > > > > > > >> checkpointing + deduping, we should be able to find out
>> > where
>> > > we
>> > > > > > left
>> > > > > > > > off
>> > > > > > > > >> in the log.
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > >> >
>> > > > > > > > >> >
>> > > > > > > > >> > b) same as a). but server A was just network
>> partitioned.
>> > > will
>> > > > > > both
>> > > > > > > A
>> > > > > > > > >> and B
>> > > > > > > > >> > write the change of key K?
>> > > > > > > > >> >
>> > > > > > > > >>
>> > > > > > > > >> HBase gives us some guarantees around network partitions
>> > > > > > (Consistency
>> > > > > > > > over
>> > > > > > > > >> availability for HBase). HBase is a single-master
>> failover
>> > > > > recovery
>> > > > > > > type
>> > > > > > > > >> of
>> > > > > > > > >> system, with zookeeper-based guarantees for single owners
>> > > > > (writers)
>> > > > > > > of a
>> > > > > > > > >> range of data.
>> > > > > > > > >>
>> > > > > > > > >> >
>> > > > > > > > >> >
>> > > > > > > > >> >
>> > > > > > > > >> > > 5) If A is turned into multiple batches (one large
>> > request
>> > > > > gets
>> > > > > > > > split
>> > > > > > > > >> > into
>> > > > > > > > >> > > multiple smaller ones to the bookkeeper backend, due
>> to
>> > > log
>> > > > > > > rolling
>> > > > > > > > /
>> > > > > > > > >> > size
>> > > > > > > > >> > > / etc):
>> > > > > > > > >> > >   a) The ordering of entries within batches have
>> > ordering
>> > > > > > > > consistence
>> > > > > > > > >> > with
>> > > > > > > > >> > > the original request, when exposed in the log (though
>> > they
>> > > > may
>> > > > > > be
>> > > > > > > > >> > > interleaved with other requests)
>> > > > > > > > >> > >   b) The ordering across batches have ordering
>> > consistence
>> > > > > with
>> > > > > > > the
>> > > > > > > > >> > > original request, when exposed in the log (though
>> they
>> > may
>> > > > be
>> > > > > > > > >> interleaved
>> > > > > > > > >> > > with other requests)
>> > > > > > > > >> > >   c) If a batch fails, and cannot be retried / is
>> > > > > unsuccessfully
>> > > > > > > > >> retried,
>> > > > > > > > >> > > all batches after the failed batch should not be
>> exposed
>> > > in
>> > > > > the
>> > > > > > > log.
>> > > > > > > > >> > Note:
>> > > > > > > > >> > > The batches before and including the failed batch,
>> that
>> > > > ended
>> > > > > up
>> > > > > > > > >> > > succeeding, can show up in the log, again within some
>> > > > bounded
>> > > > > > time
>> > > > > > > > >> range
>> > > > > > > > >> > > for reads by a client.
>> > > > > > > > >> > >
>> > > > > > > > >> >
>> > > > > > > > >> > There is a method 'writeBulk' in DistributedLogClient
>> can
>> > > > > achieve
>> > > > > > > this
>> > > > > > > > >> > guarantee.
>> > > > > > > > >> >
>> > > > > > > > >> > However, I am not very sure about how will you turn A
>> into
>> > > > > > batches.
>> > > > > > > If
>> > > > > > > > >> you
>> > > > > > > > >> > are dividing A into batches,
>> > > > > > > > >> > you can simply control the application write sequence
>> to
>> > > > achieve
>> > > > > > the
>> > > > > > > > >> > guarantee here.
>> > > > > > > > >> >
>> > > > > > > > >> > Can you explain more about this?
>> > > > > > > > >> >
>> > > > > > > > >>
>> > > > > > > > >> In this case, by batches I mean what the proxy does with
>> the
>> > > > > single
>> > > > > > > > >> request
>> > > > > > > > >> that I send it. If the proxy decides it needs to turn my
>> > > single
>> > > > > > > request
>> > > > > > > > >> into multiple batches of requests, due to log rolling,
>> size
>> > > > > > > limitations,
>> > > > > > > > >> etc, those would be the guarantees I need to be able to
>> > > > > reduplicate
>> > > > > > on
>> > > > > > > > the
>> > > > > > > > >> client side.
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > A single record written by #write and A record set (set of
>> > > > records)
>> > > > > > > > > written by #writeRecordSet are atomic - they will not be
>> > broken
>> > > > > down
>> > > > > > > into
>> > > > > > > > > entries (batches). With the correct response code, you
>> would
>> > be
>> > > > > able
>> > > > > > to
>> > > > > > > > > tell if it is a lost-ack failure or not. However there is
>> a
>> > > size
>> > > > > > > > limitation
>> > > > > > > > > for this - it can't not go beyond 1MB for current
>> > > implementation.
>> > > > > > > > >
>> > > > > > > > > What is your expected record size?
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > >>
>> > > > > > > > >> >
>> > > > > > > > >> >
>> > > > > > > > >> > >
>> > > > > > > > >> > > Since we can guarantee per-key ordering on the client
>> > > side,
>> > > > we
>> > > > > > > > >> guarantee
>> > > > > > > > >> > > that there is a single writer per-key, just not per
>> log.
>> > > > > > > > >> >
>> > > > > > > > >> >
>> > > > > > > > >> > Do you need fencing guarantee in the case of network
>> > > partition
>> > > > > > > causing
>> > > > > > > > >> > two-writers?
>> > > > > > > > >> >
>> > > > > > > > >> >
>> > > > > > > > >> > > So if there was a
>> > > > > > > > >> > > way to guarantee a single write request as being
>> written
>> > > or
>> > > > > not,
>> > > > > > > > >> within a
>> > > > > > > > >> > > certain time frame (since failures should be rare
>> > anyways,
>> > > > > this
>> > > > > > is
>> > > > > > > > >> fine
>> > > > > > > > >> > if
>> > > > > > > > >> > > it is expensive), we can then have the client
>> guarantee
>> > > the
>> > > > > > > ordering
>> > > > > > > > >> it
>> > > > > > > > >> > > needs.
>> > > > > > > > >> > >
>> > > > > > > > >> >
>> > > > > > > > >> > This sounds an 'exact-once' write (regarding retries)
>> > > > > requirement
>> > > > > > to
>> > > > > > > > me,
>> > > > > > > > >> > right?
>> > > > > > > > >> >
>> > > > > > > > >> Yes. I'm curious of how this issue is handled by
>> Manhattan,
>> > > > since
>> > > > > > you
>> > > > > > > > can
>> > > > > > > > >> imagine a data store that ends up getting multiple writes
>> > for
>> > > > the
>> > > > > > same
>> > > > > > > > put
>> > > > > > > > >> / get / etc, would be harder to use, and we are basically
>> > > trying
>> > > > > to
>> > > > > > > > create
>> > > > > > > > >> a log like that for HBase.
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > Are you guys replacing HBase WAL?
>> > > > > > > > >
>> > > > > > > > > In Manhattan case, the request will be first written to DL
>> > > > streams
>> > > > > by
>> > > > > > > > > Manhattan coordinator. The Manhattan replica then will
>> read
>> > > from
>> > > > > the
>> > > > > > DL
>> > > > > > > > > streams and apply the change. In the lost-ack case, the MH
>> > > > > > coordinator
>> > > > > > > > will
>> > > > > > > > > just fail the request to client.
>> > > > > > > > >
>> > > > > > > > > My feeling here is your usage for HBase is a bit different
>> > from
>> > > > how
>> > > > > > we
>> > > > > > > > use
>> > > > > > > > > DL in Manhattan. It sounds like you read from a source
>> (HBase
>> > > > WAL)
>> > > > > > and
>> > > > > > > > > write to DL. But I might be wrong.
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > >>
>> > > > > > > > >> >
>> > > > > > > > >> >
>> > > > > > > > >> > >
>> > > > > > > > >> > >
>> > > > > > > > >> > > > Cameron:
>> > > > > > > > >> > > > Another thing we've discussed but haven't really
>> > thought
>> > > > > > > through -
>> > > > > > > > >> > > > We might be able to support some kind of epoch
>> write
>> > > > > request,
>> > > > > > > > where
>> > > > > > > > >> the
>> > > > > > > > >> > > > epoch is guaranteed to have changed if the writer
>> has
>> > > > > changed
>> > > > > > or
>> > > > > > > > the
>> > > > > > > > >> > > ledger
>> > > > > > > > >> > > > was ever fenced off. Writes include an epoch and
>> are
>> > > > > rejected
>> > > > > > if
>> > > > > > > > the
>> > > > > > > > >> > > epoch
>> > > > > > > > >> > > > has changed.
>> > > > > > > > >> > > > With a mechanism like this, fencing the ledger off
>> > > after a
>> > > > > > > failure
>> > > > > > > > >> > would
>> > > > > > > > >> > > > ensure any pending writes had either been written
>> or
>> > > would
>> > > > > be
>> > > > > > > > >> rejected.
>> > > > > > > > >> > >
>> > > > > > > > >> > > The issue would be how I guarantee the write I wrote
>> to
>> > > the
>> > > > > > server
>> > > > > > > > was
>> > > > > > > > >> > > written. Since a network issue could happen on the
>> send
>> > of
>> > > > the
>> > > > > > > > >> request,
>> > > > > > > > >> > or
>> > > > > > > > >> > > on the receive of the success response, an epoch
>> > wouldn't
>> > > > tell
>> > > > > > me
>> > > > > > > > if I
>> > > > > > > > >> > can
>> > > > > > > > >> > > successfully retry, as it could be successfully
>> written
>> > > but
>> > > > > AWS
>> > > > > > > > >> dropped
>> > > > > > > > >> > the
>> > > > > > > > >> > > connection for the success response. Since the epoch
>> > would
>> > > > be
>> > > > > > the
>> > > > > > > > same
>> > > > > > > > >> > > (same ledger), I could write duplicates.
>> > > > > > > > >> > >
>> > > > > > > > >> > >
>> > > > > > > > >> > > > We are currently proposing adding a transaction
>> > semantic
>> > > > to
>> > > > > dl
>> > > > > > > to
>> > > > > > > > >> get
>> > > > > > > > >> > rid
>> > > > > > > > >> > > > of the size limitation and the unaware-ness in the
>> > proxy
>> > > > > > client.
>> > > > > > > > >> Here
>> > > > > > > > >> > is
>> > > > > > > > >> > > > our idea -
>> > > > > > > > >> > > > http://mail-archives.apache.or
>> g/mod_mbox/incubator-
>> > > > > > > distributedlog
>> > > > > > > > >> > > -dev/201609.mbox/%3cCAAC6BxP5Y
>> yEHwG0ZCF5soh42X=xuYwYm
>> > > > > > > > >> > > <http://mail-archives.apache.org/mod_mbox/incubator-
>> > > > > > > > >> > distributedlog%0A-dev/201609.mbox/%
>> > > 3cCAAC6BxP5YyEHwG0ZCF5soh
>> > > > > > > > 42X=xuYwYm>
>> > > > > > > > >> > > L4nXsYBYiofzxpVk6g@mail.gmail.com%3e
>> > > > > > > > >> > >
>> > > > > > > > >> > > > I am not sure if your idea is similar as ours. but
>> > we'd
>> > > > like
>> > > > > > to
>> > > > > > > > >> > > collaborate
>> > > > > > > > >> > > > with the community if anyone has the similar idea.
>> > > > > > > > >> > >
>> > > > > > > > >> > > Our use case would be covered by transaction support,
>> > but
>> > > > I'm
>> > > > > > > unsure
>> > > > > > > > >> if
>> > > > > > > > >> > we
>> > > > > > > > >> > > would need something that heavy weight for the
>> > guarantees
>> > > we
>> > > > > > need.
>> > > > > > > > >> > >
>> > > > > > > > >> >
>> > > > > > > > >> > >
>> > > > > > > > >> > > Basically, the high level requirement here is
>> "Support
>> > > > > > consistent
>> > > > > > > > >> write
>> > > > > > > > >> > > ordering for single-writer-per-key,
>> > multi-writer-per-log".
>> > > > My
>> > > > > > > hunch
>> > > > > > > > is
>> > > > > > > > >> > > that, with some added guarantees to the proxy (if it
>> > isn't
>> > > > > > already
>> > > > > > > > >> > > supported), and some custom client code on our side
>> for
>> > > > > removing
>> > > > > > > the
>> > > > > > > > >> > > entries that actually succeed to write to
>> DistributedLog
>> > > > from
>> > > > > > the
>> > > > > > > > >> request
>> > > > > > > > >> > > that failed, it should be a relatively easy thing to
>> > > > support.
>> > > > > > > > >> > >
>> > > > > > > > >> >
>> > > > > > > > >> > Yup. I think it should not be very difficult to
>> support.
>> > > There
>> > > > > > might
>> > > > > > > > be
>> > > > > > > > >> > some changes in the server side.
>> > > > > > > > >> > Let's figure out what will the changes be. Are you guys
>> > > > > interested
>> > > > > > > in
>> > > > > > > > >> > contributing?
>> > > > > > > > >> >
>> > > > > > > > >> > Yes, we would be.
>> > > > > > > > >>
>> > > > > > > > >> As a note, the one thing that we see as an issue with the
>> > > client
>> > > > > > side
>> > > > > > > > >> dedupping is how to bound the range of data that needs
>> to be
>> > > > > looked
>> > > > > > at
>> > > > > > > > for
>> > > > > > > > >> deduplication. As you can imagine, it is pretty easy to
>> > bound
>> > > > the
>> > > > > > > bottom
>> > > > > > > > >> of
>> > > > > > > > >> the range, as that it just regular checkpointing of the
>> DSLN
>> > > > that
>> > > > > is
>> > > > > > > > >> returned. I'm still not sure if there is any nice way to
>> > time
>> > > > > bound
>> > > > > > > the
>> > > > > > > > >> top
>> > > > > > > > >> end of the range, especially since the proxy owns
>> sequence
>> > > > numbers
>> > > > > > > > (which
>> > > > > > > > >> makes sense). I am curious if there is more that can be
>> done
>> > > if
>> > > > > > > > >> deduplication is on the server side. However the main
>> minus
>> > I
>> > > > see
>> > > > > of
>> > > > > > > > >> server
>> > > > > > > > >> side deduplication is that instead of running contingent
>> on
>> > > > there
>> > > > > > > being
>> > > > > > > > a
>> > > > > > > > >> failed client request, instead it would have to run every
>> > > time a
>> > > > > > write
>> > > > > > > > >> happens.
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > For a reliable dedup, we probably need
>> fence-then-getLastDLSN
>> > > > > > > operation -
>> > > > > > > > > so it would guarantee that any non-completed requests
>> issued
>> > > > > > (lost-ack
>> > > > > > > > > requests) before this fence-then-getLastDLSN operation
>> will
>> > be
>> > > > > failed
>> > > > > > > and
>> > > > > > > > > they will never land at the log.
>> > > > > > > > >
>> > > > > > > > > the pseudo code would look like below -
>> > > > > > > > >
>> > > > > > > > > write(request) onFailure { t =>
>> > > > > > > > >
>> > > > > > > > > if (t is timeout exception) {
>> > > > > > > > >
>> > > > > > > > > DLSN lastDLSN = fenceThenGetLastDLSN()
>> > > > > > > > > DLSN lastCheckpointedDLSN = ...;
>> > > > > > > > > // find if the request lands between [lastDLSN,
>> > > > > > lastCheckpointedDLSN].
>> > > > > > > > > // if it exists, the write succeed; otherwise retry.
>> > > > > > > > >
>> > > > > > > > > }
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > > }
>> > > > > > > > >
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > Just realized the idea is same as what Leigh raised in the
>> > > previous
>> > > > > > email
>> > > > > > > > about 'epoch write'. Let me explain more about this idea
>> > (Leigh,
>> > > > feel
>> > > > > > > free
>> > > > > > > > to jump in to fill up your idea).
>> > > > > > > >
>> > > > > > > > - when a log stream is owned,  the proxy use the last
>> > transaction
>> > > > id
>> > > > > as
>> > > > > > > the
>> > > > > > > > epoch
>> > > > > > > > - when a client connects (handshake with the proxy), it will
>> > get
>> > > > the
>> > > > > > > epoch
>> > > > > > > > for the stream.
>> > > > > > > > - the writes issued by this client will carry the epoch to
>> the
>> > > > proxy.
>> > > > > > > > - add a new rpc - fenceThenGetLastDLSN - it would force the
>> > proxy
>> > > > to
>> > > > > > bump
>> > > > > > > > the epoch.
>> > > > > > > > - if fenceThenGetLastDLSN happened, all the outstanding
>> writes
>> > > with
>> > > > > old
>> > > > > > > > epoch will be rejected with exceptions (e.g. EpochFenced).
>> > > > > > > > - The DLSN returned from fenceThenGetLastDLSN can be used as
>> > the
>> > > > > bound
>> > > > > > > for
>> > > > > > > > deduplications on failures.
>> > > > > > > >
>> > > > > > > > Cameron, does this sound a solution to your use case?
>> > > > > > > >
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > >>
>> > > > > > > > >> Maybe something that could fit a similar need that Kafka
>> > does
>> > > > (the
>> > > > > > > last
>> > > > > > > > >> store value for a particular key in a log), such that on
>> a
>> > per
>> > > > key
>> > > > > > > basis
>> > > > > > > > >> there could be a sequence number that support
>> deduplication?
>> > > > Cost
>> > > > > > > seems
>> > > > > > > > >> like it would be high however, and I'm not even sure if
>> > > > bookkeeper
>> > > > > > > > >> supports
>> > > > > > > > >> it.
>> > > > > > > > >
>> > > > > > > > >
>> > > > > > > > >> Cheers,
>> > > > > > > > >> Cameron
>> > > > > > > > >>
>> > > > > > > > >> >
>> > > > > > > > >> > >
>> > > > > > > > >> > > Thanks,
>> > > > > > > > >> > > Cameron
>> > > > > > > > >> > >
>> > > > > > > > >> > >
>> > > > > > > > >> > > On Sat, Oct 8, 2016 at 7:35 AM, Leigh Stewart
>> > > > > > > > >> > <lstewart@twitter.com.invalid
>> > > > > > > > >> > > >
>> > > > > > > > >> > > wrote:
>> > > > > > > > >> > >
>> > > > > > > > >> > > > Cameron:
>> > > > > > > > >> > > > Another thing we've discussed but haven't really
>> > thought
>> > > > > > > through -
>> > > > > > > > >> > > > We might be able to support some kind of epoch
>> write
>> > > > > request,
>> > > > > > > > where
>> > > > > > > > >> the
>> > > > > > > > >> > > > epoch is guaranteed to have changed if the writer
>> has
>> > > > > changed
>> > > > > > or
>> > > > > > > > the
>> > > > > > > > >> > > ledger
>> > > > > > > > >> > > > was ever fenced off. Writes include an epoch and
>> are
>> > > > > rejected
>> > > > > > if
>> > > > > > > > the
>> > > > > > > > >> > > epoch
>> > > > > > > > >> > > > has changed.
>> > > > > > > > >> > > > With a mechanism like this, fencing the ledger off
>> > > after a
>> > > > > > > failure
>> > > > > > > > >> > would
>> > > > > > > > >> > > > ensure any pending writes had either been written
>> or
>> > > would
>> > > > > be
>> > > > > > > > >> rejected.
>> > > > > > > > >> > > >
>> > > > > > > > >> > > >
>> > > > > > > > >> > > > On Sat, Oct 8, 2016 at 7:10 AM, Sijie Guo <
>> > > > sijie@apache.org
>> > > > > >
>> > > > > > > > wrote:
>> > > > > > > > >> > > >
>> > > > > > > > >> > > > > Cameron,
>> > > > > > > > >> > > > >
>> > > > > > > > >> > > > > I think both Leigh and Xi had made a few good
>> points
>> > > > about
>> > > > > > > your
>> > > > > > > > >> > > question.
>> > > > > > > > >> > > > >
>> > > > > > > > >> > > > > To add one more point to your question - "but I
>> am
>> > not
>> > > > > > > > >> > > > > 100% of how all of the futures in the code handle
>> > > > > failures.
>> > > > > > > > >> > > > > If not, where in the code would be the relevant
>> > places
>> > > > to
>> > > > > > add
>> > > > > > > > the
>> > > > > > > > >> > > ability
>> > > > > > > > >> > > > > to do this, and would the project be interested
>> in a
>> > > > pull
>> > > > > > > > >> request?"
>> > > > > > > > >> > > > >
>> > > > > > > > >> > > > > The current proxy and client logic doesn't do
>> > > perfectly
>> > > > on
>> > > > > > > > >> handling
>> > > > > > > > >> > > > > failures (duplicates) - the strategy now is the
>> > client
>> > > > > will
>> > > > > > > > retry
>> > > > > > > > >> as
>> > > > > > > > >> > > best
>> > > > > > > > >> > > > > at it can before throwing exceptions to users.
>> The
>> > > code
>> > > > > you
>> > > > > > > are
>> > > > > > > > >> > looking
>> > > > > > > > >> > > > for
>> > > > > > > > >> > > > > - it is on BKLogSegmentWriter for the proxy
>> handling
>> > > > > writes
>> > > > > > > and
>> > > > > > > > >> it is
>> > > > > > > > >> > > on
>> > > > > > > > >> > > > > DistributedLogClientImpl for the proxy client
>> > handling
>> > > > > > > responses
>> > > > > > > > >> from
>> > > > > > > > >> > > > > proxies. Does this help you?
>> > > > > > > > >> > > > >
>> > > > > > > > >> > > > > And also, you are welcome to contribute the pull
>> > > > requests.
>> > > > > > > > >> > > > >
>> > > > > > > > >> > > > > - Sijie
>> > > > > > > > >> > > > >
>> > > > > > > > >> > > > >
>> > > > > > > > >> > > > >
>> > > > > > > > >> > > > > On Tue, Oct 4, 2016 at 3:39 PM, Cameron Hatfield
>> <
>> > > > > > > > >> kinguy@gmail.com>
>> > > > > > > > >> > > > wrote:
>> > > > > > > > >> > > > >
>> > > > > > > > >> > > > > > I have a question about the Proxy Client.
>> > Basically,
>> > > > for
>> > > > > > our
>> > > > > > > > use
>> > > > > > > > >> > > cases,
>> > > > > > > > >> > > > > we
>> > > > > > > > >> > > > > > want to guarantee ordering at the key level,
>> > > > > irrespective
>> > > > > > of
>> > > > > > > > the
>> > > > > > > > >> > > > ordering
>> > > > > > > > >> > > > > > of the partition it may be assigned to as a
>> whole.
>> > > Due
>> > > > > to
>> > > > > > > the
>> > > > > > > > >> > source
>> > > > > > > > >> > > of
>> > > > > > > > >> > > > > the
>> > > > > > > > >> > > > > > data (HBase Replication), we cannot guarantee
>> > that a
>> > > > > > single
>> > > > > > > > >> > partition
>> > > > > > > > >> > > > > will
>> > > > > > > > >> > > > > > be owned for writes by the same client. This
>> means
>> > > the
>> > > > > > proxy
>> > > > > > > > >> client
>> > > > > > > > >> > > > works
>> > > > > > > > >> > > > > > well (since we don't care which proxy owns the
>> > > > partition
>> > > > > > we
>> > > > > > > > are
>> > > > > > > > >> > > writing
>> > > > > > > > >> > > > > > to).
>> > > > > > > > >> > > > > >
>> > > > > > > > >> > > > > >
>> > > > > > > > >> > > > > > However, the guarantees we need when writing a
>> > batch
>> > > > > > > consists
>> > > > > > > > >> of:
>> > > > > > > > >> > > > > > Definition of a Batch: The set of records sent
>> to
>> > > the
>> > > > > > > > writeBatch
>> > > > > > > > >> > > > endpoint
>> > > > > > > > >> > > > > > on the proxy
>> > > > > > > > >> > > > > >
>> > > > > > > > >> > > > > > 1. Batch success: If the client receives a
>> success
>> > > > from
>> > > > > > the
>> > > > > > > > >> proxy,
>> > > > > > > > >> > > then
>> > > > > > > > >> > > > > > that batch is successfully written
>> > > > > > > > >> > > > > >
>> > > > > > > > >> > > > > > 2. Inter-Batch ordering : Once a batch has been
>> > > > written
>> > > > > > > > >> > successfully
>> > > > > > > > >> > > by
>> > > > > > > > >> > > > > the
>> > > > > > > > >> > > > > > client, when another batch is written, it will
>> be
>> > > > > > guaranteed
>> > > > > > > > to
>> > > > > > > > >> be
>> > > > > > > > >> > > > > ordered
>> > > > > > > > >> > > > > > after the last batch (if it is the same
>> stream).
>> > > > > > > > >> > > > > >
>> > > > > > > > >> > > > > > 3. Intra-Batch ordering: Within a batch of
>> writes,
>> > > the
>> > > > > > > records
>> > > > > > > > >> will
>> > > > > > > > >> > > be
>> > > > > > > > >> > > > > > committed in order
>> > > > > > > > >> > > > > >
>> > > > > > > > >> > > > > > 4. Intra-Batch failure ordering: If an
>> individual
>> > > > record
>> > > > > > > fails
>> > > > > > > > >> to
>> > > > > > > > >> > > write
>> > > > > > > > >> > > > > > within a batch, all records after that record
>> will
>> > > not
>> > > > > be
>> > > > > > > > >> written.
>> > > > > > > > >> > > > > >
>> > > > > > > > >> > > > > > 5. Batch Commit: Guarantee that if a batch
>> > returns a
>> > > > > > > success,
>> > > > > > > > it
>> > > > > > > > >> > will
>> > > > > > > > >> > > > be
>> > > > > > > > >> > > > > > written
>> > > > > > > > >> > > > > >
>> > > > > > > > >> > > > > > 6. Read-after-write: Once a batch is committed,
>> > > > within a
>> > > > > > > > limited
>> > > > > > > > >> > > > > time-frame
>> > > > > > > > >> > > > > > it will be able to be read. This is required in
>> > the
>> > > > case
>> > > > > > of
>> > > > > > > > >> > failure,
>> > > > > > > > >> > > so
>> > > > > > > > >> > > > > > that the client can see what actually got
>> > > committed. I
>> > > > > > > believe
>> > > > > > > > >> the
>> > > > > > > > >> > > > > > time-frame part could be removed if the client
>> can
>> > > > send
>> > > > > in
>> > > > > > > the
>> > > > > > > > >> same
>> > > > > > > > >> > > > > > sequence number that was written previously,
>> since
>> > > it
>> > > > > > would
>> > > > > > > > then
>> > > > > > > > >> > fail
>> > > > > > > > >> > > > and
>> > > > > > > > >> > > > > > we would know that a read needs to occur.
>> > > > > > > > >> > > > > >
>> > > > > > > > >> > > > > >
>> > > > > > > > >> > > > > > So, my basic question is if this is currently
>> > > possible
>> > > > > in
>> > > > > > > the
>> > > > > > > > >> > proxy?
>> > > > > > > > >> > > I
>> > > > > > > > >> > > > > > don't believe it gives these guarantees as it
>> > stands
>> > > > > > today,
>> > > > > > > > but
>> > > > > > > > >> I
>> > > > > > > > >> > am
>> > > > > > > > >> > > > not
>> > > > > > > > >> > > > > > 100% of how all of the futures in the code
>> handle
>> > > > > > failures.
>> > > > > > > > >> > > > > > If not, where in the code would be the relevant
>> > > places
>> > > > > to
>> > > > > > > add
>> > > > > > > > >> the
>> > > > > > > > >> > > > ability
>> > > > > > > > >> > > > > > to do this, and would the project be interested
>> > in a
>> > > > > pull
>> > > > > > > > >> request?
>> > > > > > > > >> > > > > >
>> > > > > > > > >> > > > > >
>> > > > > > > > >> > > > > > Thanks,
>> > > > > > > > >> > > > > > Cameron
>> > > > > > > > >> > > > > >
>> > > > > > > > >> > > > >
>> > > > > > > > >> > > >
>> > > > > > > > >> > >
>> > > > > > > > >> >
>> > > > > > > > >>
>> > > > > > > > >
>> > > > > > > >
>> > > > > > >
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>
>

Re: Proxy Client - Batch Ordering / Commit

Posted by Cameron Hatfield <ki...@gmail.com>.
"A couple of questions" is what I originally wrote, and then the following
happened. Sorry about the large swath of them; I'm making sure my understanding
of the code base, as well as the DL/BookKeeper/ZK ecosystem interaction,
makes sense.

==General:
What is an exclusive session? What is it providing over a regular session?


==Proxy side:
Should a new streamop be added for the fencing operation, or does it make
sense to piggyback on an existing one (such as write)?
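
For illustration, here is a rough sketch of the behaviour I'd expect from a
dedicated fence op (all class and method names below are hypothetical, just
to pin down semantics; this is not actual DL code):

  import java.util.concurrent.CompletableFuture;
  import java.util.concurrent.atomic.AtomicLong;
  import java.util.function.Supplier;

  // Hypothetical sketch of a dedicated fence op on the proxy, rather than
  // piggybacking on the write op.
  final class FenceSessionOp {
      private final AtomicLong sessionId;                         // per-stream session held by the owning proxy
      private final Supplier<CompletableFuture<String>> lastDlsn; // assumed way to resolve the last DLSN

      FenceSessionOp(AtomicLong sessionId,
                     Supplier<CompletableFuture<String>> lastDlsn) {
          this.sessionId = sessionId;
          this.lastDlsn = lastDlsn;
      }

      // 1. Bump the session so any outstanding write still carrying the old
      //    sessionid is rejected.
      // 2. Only then resolve the last DLSN, so the value returned is a safe
      //    upper bound for client-side dedup.
      CompletableFuture<String> execute() {
          long newSession = sessionId.incrementAndGet();
          return lastDlsn.get()
                  .thenApply(dlsn -> "session=" + newSession + ", lastDLSN=" + dlsn);
      }
  }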

====getLastDLSN:
What should be the return result for:
A new stream
A new session, after successful fencing
A new session, after a change in ownership / first starting up

What is the main use case for getLastDLSN(<stream>, false)? Is this to
guarantee that the recovery process has happened in case of ownership
failure (I don't have a good understanding of what causes the recovery
process to happen, especially from the reader side)? Or is it to handle the
lost-ack problem? Since all the rest of the read related things go through
the read client, I'm not sure if I see the use case, but it seems like
there would be a large potential for confusion about which one to use. What about
just a fenceSession op that always fences, returning the DLSN of the
fence, and leaving the normal getLastDLSN for the regular read client?
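
To make that distinction concrete, the client-facing shape I'm imagining is
roughly the following (hypothetical signatures, not the current proxy client
API; DLSNs are rendered as Strings just for the sketch):

  import java.util.concurrent.CompletableFuture;

  // Hypothetical write-client addition, for discussion only.
  interface SessionAwareWriteClient {
      // Always fences: bumps the session on the owning proxy and returns
      // the DLSN at the fence point, which bounds the range the client has
      // to dedup after a failure.
      CompletableFuture<String> fenceSession(String stream);
  }
  // getLastDLSN(<stream>, false) would then stay with the regular read
  // client instead of gaining a second, confusingly similar entry point.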

====Fencing:
When a fence session occurs, what call needs to be made to make sure any
outstanding writes are flushed and committed (so that we guarantee the
client will be able to read anything that was in the write queue)?
Is there a guaranteed ordering for things written in the future queue for
AsyncLogWriter (I'm not quite confident that I was able to accurately
follow the logic, as there are many parts of the code that write, have
queues, heartbeat, etc)?

====SessionID:
What is the default sessionid / transactionid for a new stream? I assume
this would just be the first control record.

======Should all streams have a sessionid by default, regardless of whether it is
never used by a client (aka, every time ownership changes, a new control
record is generated and a sessionid is stored)?
The main edge case that would have to be handled is if a client writes with an
old sessionid, but the owner has changed and has yet to create a sessionid.
This should be handled by the "non-matching sessionid" rule, since the
invalid sessionid wouldn't match the passed sessionid, which should cause
the client to get a new sessionid.
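
In code, the rule I have in mind is something like this minimal sketch
(UNASSIGNED and the method name are made up for illustration):

  // Hypothetical sketch of the "non-matching sessionid" rule on the proxy.
  final class SessionCheck {
      // Sentinel for a stream whose owner has not established a session yet.
      static final long UNASSIGNED = -1L;

      // requestSessionId: what the client sent with its write
      //                   (UNASSIGNED if it opted out of session fencing).
      // currentSessionId: what the owning proxy currently holds for the stream.
      static boolean accept(long requestSessionId, long currentSessionId) {
          if (requestSessionId == UNASSIGNED) {
              return true;  // client doesn't care about fencing
          }
          // A new owner that has not written its control record yet holds
          // UNASSIGNED, which never equals a real sessionid, so a stale
          // writer is rejected and forced to fetch a fresh sessionid.
          return requestSessionId == currentSessionId;
      }
  }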

======Where in the code does it make sense to own the session, the stream
interfaces / classes? Should they pass that information down to the ops, or
do the sessionid check within?
My first thought would be that Stream owns the sessionid and passes it into
the ops (as either a nullable value or an invalid default value), which then
do the sessionid check if they care. The main issue is that updating the
sessionid is a bit backwards: either every op has the ability to update it
through some type of return value / direct stream access / etc., or there is
a special case in the stream for the fence operation / any other operation
that can update the session.
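
A tiny sketch of that first thought, including the backwards update path
(hypothetical names again; the real Stream / op classes will differ):

  import java.util.concurrent.atomic.AtomicLong;

  // Hypothetical: the Stream owns the sessionid and hands it to each op.
  final class StreamSession {
      // -1 stands in for "no session established yet" in this sketch.
      private final AtomicLong sessionId = new AtomicLong(-1L);

      // Passed into every op when it is submitted.
      long current() {
          return sessionId.get();
      }

      // The awkward part: an op that changes the session (the fence op) has
      // to push the new value back, either through a return value like this
      // or by being special-cased with direct access to the Stream.
      void update(long newSessionId) {
          sessionId.set(newSessionId);
      }
  }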

======For "the owner of the log stream will first advance the transaction
id generator to claim a new transaction id and write a control record to
the log stream. ":
Should "DistributedLogConstants.CONTROL_RECORD_CONTENT" be the type of
control record written?
Should the "writeControlRecord" on the BKAsyncLogWriter be exposed in the
AsyncLogWriter interface be exposed?  Or even in the one within the segment
writer? Or should the code be duplicated / pulled out into a helper / etc?
(Not a big Java person, so any suggestions on the "Java Way", or at least
the DL way, to do it would be appreciated)
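
For reference, what I understand the ownership-change path to require, as a
sketch (the Sequencer / ControlRecordWriter interfaces are stand-ins for
whatever the real sequencer and BKAsyncLogWriter-level call end up being; I
have not verified any signatures):

  import java.util.concurrent.CompletableFuture;

  // Hypothetical sketch: claim a fresh transaction id and persist it as a
  // control record; the session only becomes live once that record is durable.
  interface Sequencer {
      long nextId();  // advance the transaction id generator
  }
  interface ControlRecordWriter {
      CompletableFuture<String> writeControlRecord(long txnId);  // DLSN as String
  }
  final class SessionBootstrap {
      static CompletableFuture<Long> establish(Sequencer sequencer,
                                               ControlRecordWriter writer) {
          long newSessionId = sequencer.nextId();
          return writer.writeControlRecord(newSessionId)
                       .thenApply(dlsn -> newSessionId);
      }
  }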

======Transaction ID:
The BKLogSegmentWriter ignores the transaction ids from control records
when it records the "LastTXId." Would that be an issue here for anything?
It looks like it may do that because it assumes you're calling its local
function for writing a control record, which uses the lastTxId.


==Thrift Interface:
====Should the write response be split out for different calls?
It seems odd to have a single struct with many optional items that are
filled depending on the call made for every rpc call. This is mostly a
curiosity question, since I assume it comes from general practice after
using Thrift for a while. Would it at least make sense for the
getLastDLSN/fence endpoint to have a new struct?

====Any particular error code that makes sense for session fenced? If we
want to be close to the HTTP errors, it looks like 412 (PRECONDITION FAILED)
might make the most sense, if a bit generic.

412 def:
"The precondition given in one or more of the request-header fields
evaluated to false when it was tested on the server. This response code
allows the client to place preconditions on the current resource
metainformation (header field data) and thus prevent the requested method
from being applied to a resource other than the one intended."

412 excerpt from the If-match doc:
"This behavior is most useful when the client wants to prevent an updating
method, such as PUT, from modifying a resource that has changed since the
client last retrieved it."

====Should we return the sessionid to the client in the "fencesession"
calls?
It seems like it may be useful when you fence, especially if you have some
type of custom sequencer where it would make sense for search, or for
debugging.
The main minus is that it would be easy for users to start relying on the
sessionid always being a valid transactionid, which may not always be the
case long term for the project.
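
For both this point and the earlier write-response question, here is a
sketch of what a dedicated fence/getLastDLSN response might carry. This is
illustrative Java only; the real change would live in the Thrift IDL, and
the field names are made up:

class FenceSessionResponseSketch {
    final int statusCode;     // e.g. success, or the session-fenced / 412-style code
    final String lastDLSN;    // DLSN at the fence point
    final Long sessionId;     // the new session, if we decide to expose it (nullable)

    FenceSessionResponseSketch(int statusCode, String lastDLSN, Long sessionId) {
        this.statusCode = statusCode;
        this.lastDLSN = lastDLSN;
        this.sessionId = sessionId;
    }
}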


==Client:
====What is the proposed process for the client retrieving the new
sessionid?
A full reconnect? No special-case code, but intrusive on the client side,
and possibly expensive garbage/processing-wise (though this type of
failure should hopefully be rare enough not to be a problem).
A call to reset the sessionid? Less intrusive, but with all the issues you
get with mutable object methods that need to be called in a certain order,
edge cases such as outstanding/buffered requests to the old stream, etc.
The call could also return the new sessionid, making it a good call for
storing or debugging the value.
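
A hypothetical sketch of that second option, just to make it concrete: a
resetSession() call on the proxy client that re-handshakes for the stream
and returns the new sessionid. Neither method exists today, and the
exception type is a stand-in for a session-fenced error:

class ClientResetSketch {
    interface ProxyClientSketch {
        long resetSession(String stream);                        // re-handshake, return new sessionid
        void write(String stream, long sessionId, byte[] record);
    }

    static long writeOnce(ProxyClientSketch client, String stream,
                          long sessionId, byte[] record) {
        try {
            client.write(stream, sessionId, record);
            return sessionId;
        } catch (IllegalStateException sessionFenced) {
            long newSession = client.resetSession(stream);       // also handy to store / log
            // the caller must de-dup against the log before retrying with newSession
            return newSession;
        }
    }
}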

====Session Fenced failure:
Will this put the client into a failure state, stopping all future writes
until fixed?
Is it even possible to get this error when ownership changes? The
connection to the new owner should get a new sessionid on connect, so I
would expect not.

Cheers,
Cameron

On Tue, Nov 15, 2016 at 2:01 AM, Xi Liu <xi...@gmail.com> wrote:

> Thank you, Cameron. Look forward to your comments.
>
> - Xi
>
> On Sun, Nov 13, 2016 at 1:21 PM, Cameron Hatfield <ki...@gmail.com>
> wrote:
>
> > Sorry, I've been on vacation for the past week, and heads down for a
> > release that is using DL at the end of Nov. I'll take a look at this over
> > the next week, and add any relevant comments. After we are finished with
> > dev for this release, I am hoping to tackle this next.
> >
> > -Cameron
> >
> > On Fri, Nov 11, 2016 at 12:07 PM, Sijie Guo <si...@apache.org> wrote:
> >
> > > Xi,
> > >
> > > Thank you so much for your proposal. I took a look. It looks fine to
> me.
> > > Cameron, do you have any comments?
> > >
> > > Look forward to your pull requests.
> > >
> > > - Sijie
> > >
> > >
> > > On Wed, Nov 9, 2016 at 2:34 AM, Xi Liu <xi...@gmail.com> wrote:
> > >
> > > > Cameron,
> > > >
> > > > Have you started any work for this? I just updated the proposal page
> -
> > > > https://cwiki.apache.org/confluence/display/DL/DP-2+-+
> > > Epoch+Write+Support
> > > > Maybe we can work together with this.
> > > >
> > > > Sijie, Leigh,
> > > >
> > > > can you guys help review this to make sure our proposal is in the
> right
> > > > direction?
> > > >
> > > > - Xi
> > > >
> > > > On Tue, Nov 1, 2016 at 3:05 AM, Sijie Guo <si...@apache.org> wrote:
> > > >
> > > > > I created https://issues.apache.org/jira/browse/DL-63 for tracking
> > the
> > > > > proposed idea here.
> > > > >
> > > > >
> > > > >
> > > > > On Wed, Oct 26, 2016 at 4:53 PM, Sijie Guo
> > <sijieg@twitter.com.invalid
> > > >
> > > > > wrote:
> > > > >
> > > > > > On Tue, Oct 25, 2016 at 11:30 AM, Cameron Hatfield <
> > kinguy@gmail.com
> > > >
> > > > > > wrote:
> > > > > >
> > > > > > > Yes, we are reading the HBase WAL (from their replication
> plugin
> > > > > > support),
> > > > > > > and writing that into DL.
> > > > > > >
> > > > > >
> > > > > > Gotcha.
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > From the sounds of it, yes, it would. Only thing I would say is
> > > make
> > > > > the
> > > > > > > epoch requirement optional, so that if I client doesn't care
> > about
> > > > > dupes
> > > > > > > they don't have to deal with the process of getting a new
> epoch.
> > > > > > >
> > > > > >
> > > > > > Yup. This should be optional. I can start a wiki page on how we
> > want
> > > to
> > > > > > implement this. Are you interested in contributing to this?
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > -Cameron
> > > > > > >
> > > > > > > On Wed, Oct 19, 2016 at 7:43 PM, Sijie Guo
> > > > <sijieg@twitter.com.invalid
> > > > > >
> > > > > > > wrote:
> > > > > > >
> > > > > > > > On Wed, Oct 19, 2016 at 7:17 PM, Sijie Guo <
> sijieg@twitter.com
> > >
> > > > > wrote:
> > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Monday, October 17, 2016, Cameron Hatfield <
> > > kinguy@gmail.com>
> > > > > > > wrote:
> > > > > > > > >
> > > > > > > > >> Answer inline:
> > > > > > > > >>
> > > > > > > > >> On Mon, Oct 17, 2016 at 11:46 AM, Sijie Guo <
> > sijie@apache.org
> > > >
> > > > > > wrote:
> > > > > > > > >>
> > > > > > > > >> > Cameron,
> > > > > > > > >> >
> > > > > > > > >> > Thank you for your summary. I liked the discussion
> here. I
> > > > also
> > > > > > > liked
> > > > > > > > >> the
> > > > > > > > >> > summary of your requirement - 'single-writer-per-key,
> > > > > > > > >> > multiple-writers-per-log'. If I understand correctly,
> the
> > > core
> > > > > > > concern
> > > > > > > > >> here
> > > > > > > > >> > is almost 'exact-once' write (or a way to explicit tell
> > if a
> > > > > write
> > > > > > > can
> > > > > > > > >> be
> > > > > > > > >> > retried or not).
> > > > > > > > >> >
> > > > > > > > >> > Comments inline.
> > > > > > > > >> >
> > > > > > > > >> > On Fri, Oct 14, 2016 at 11:17 AM, Cameron Hatfield <
> > > > > > > kinguy@gmail.com>
> > > > > > > > >> > wrote:
> > > > > > > > >> >
> > > > > > > > >> > > > Ah- yes good point (to be clear we're not using the
> > > proxy
> > > > > this
> > > > > > > way
> > > > > > > > >> > > today).
> > > > > > > > >> > >
> > > > > > > > >> > > > > Due to the source of the
> > > > > > > > >> > > > > data (HBase Replication), we cannot guarantee
> that a
> > > > > single
> > > > > > > > >> partition
> > > > > > > > >> > > will
> > > > > > > > >> > > > > be owned for writes by the same client.
> > > > > > > > >> > >
> > > > > > > > >> > > > Do you mean you *need* to support multiple writers
> > > issuing
> > > > > > > > >> interleaved
> > > > > > > > >> > > > writes or is it just that they might sometimes
> > > interleave
> > > > > > writes
> > > > > > > > and
> > > > > > > > >> > you
> > > > > > > > >> > > >don't care?
> > > > > > > > >> > > How HBase partitions the keys being written wouldn't
> > have
> > > a
> > > > > > > one->one
> > > > > > > > >> > > mapping with the partitions we would have in HBase.
> Even
> > > if
> > > > we
> > > > > > did
> > > > > > > > >> have
> > > > > > > > >> > > that alignment when the cluster first started, HBase
> > will
> > > > > > > rebalance
> > > > > > > > >> what
> > > > > > > > >> > > servers own what partitions, as well as split and
> merge
> > > > > > partitions
> > > > > > > > >> that
> > > > > > > > >> > > already exist, causing eventual drift from one log per
> > > > > > partition.
> > > > > > > > >> > > Because we want ordering guarantees per key (row in
> > > hbase),
> > > > we
> > > > > > > > >> partition
> > > > > > > > >> > > the logs by the key. Since multiple writers are
> possible
> > > per
> > > > > > range
> > > > > > > > of
> > > > > > > > >> > keys
> > > > > > > > >> > > (due to the aforementioned rebalancing / splitting /
> etc
> > > of
> > > > > > > hbase),
> > > > > > > > we
> > > > > > > > >> > > cannot use the core library due to requiring a single
> > > writer
> > > > > for
> > > > > > > > >> > ordering.
> > > > > > > > >> > >
> > > > > > > > >> > > But, for a single log, we don't really care about
> > ordering
> > > > > aside
> > > > > > > > from
> > > > > > > > >> at
> > > > > > > > >> > > the per-key level. So all we really need to be able to
> > > > handle
> > > > > is
> > > > > > > > >> > preventing
> > > > > > > > >> > > duplicates when a failure occurs, and ordering
> > consistency
> > > > > > across
> > > > > > > > >> > requests
> > > > > > > > >> > > from a single client.
> > > > > > > > >> > >
> > > > > > > > >> > > So our general requirements are:
> > > > > > > > >> > > Write A, Write B
> > > > > > > > >> > > Timeline: A -> B
> > > > > > > > >> > > Request B is only made after A has successfully
> returned
> > > > > > (possibly
> > > > > > > > >> after
> > > > > > > > >> > > retries)
> > > > > > > > >> > >
> > > > > > > > >> > > 1) If the write succeeds, it will be durably exposed
> to
> > > > > clients
> > > > > > > > within
> > > > > > > > >> > some
> > > > > > > > >> > > bounded time frame
> > > > > > > > >> > >
> > > > > > > > >> >
> > > > > > > > >> > Guaranteed.
> > > > > > > > >> >
> > > > > > > > >>
> > > > > > > > >> > > 2) If A succeeds and B succeeds, the ordering for the
> > log
> > > > will
> > > > > > be
> > > > > > > A
> > > > > > > > >> and
> > > > > > > > >> > > then B
> > > > > > > > >> > >
> > > > > > > > >> >
> > > > > > > > >> > If I understand correctly here, B is only sent after A
> is
> > > > > > returned,
> > > > > > > > >> right?
> > > > > > > > >> > If that's the case, It is guaranteed.
> > > > > > > > >>
> > > > > > > > >>
> > > > > > > > >>
> > > > > > > > >> >
> > > > > > > > >> >
> > > > > > > > >> >
> > > > > > > > >> > > 3) If A fails due to an error that can be relied on to
> > > *not*
> > > > > be
> > > > > > a
> > > > > > > > lost
> > > > > > > > >> > ack
> > > > > > > > >> > > problem, it will never be exposed to the client, so it
> > may
> > > > > > > > (depending
> > > > > > > > >> on
> > > > > > > > >> > > the error) be retried immediately
> > > > > > > > >> > >
> > > > > > > > >> >
> > > > > > > > >> > If it is not a lost-ack problem, the entry will be
> > exposed.
> > > it
> > > > > is
> > > > > > > > >> > guaranteed.
> > > > > > > > >>
> > > > > > > > >> Let me try rephrasing the questions, to make sure I'm
> > > > > understanding
> > > > > > > > >> correctly:
> > > > > > > > >> If A fails, with an error such as "Unable to create
> > connection
> > > > to
> > > > > > > > >> bookkeeper server", that would be the type of error we
> would
> > > > > expect
> > > > > > to
> > > > > > > > be
> > > > > > > > >> able to retry immediately, as that result means no action
> > was
> > > > > taken
> > > > > > on
> > > > > > > > any
> > > > > > > > >> log / etc, so no entry could have been created. This is
> > > > different
> > > > > > > then a
> > > > > > > > >> "Connection Timeout" exception, as we just might not have
> > > > gotten a
> > > > > > > > >> response
> > > > > > > > >> in time.
> > > > > > > > >>
> > > > > > > > >>
> > > > > > > > > Gotcha.
> > > > > > > > >
> > > > > > > > > The response code returned from proxy can tell if a failure
> > can
> > > > be
> > > > > > > > retried
> > > > > > > > > safely or not. (We might need to make them well documented)
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >>
> > > > > > > > >> >
> > > > > > > > >> >
> > > > > > > > >> > > 4) If A fails due to an error that could be a lost ack
> > > > problem
> > > > > > > > >> (network
> > > > > > > > >> > > connectivity / etc), within a bounded time frame it
> > should
> > > > be
> > > > > > > > >> possible to
> > > > > > > > >> > > find out if the write succeed or failed. Either by
> > reading
> > > > > from
> > > > > > > some
> > > > > > > > >> > > checkpoint of the log for the changes that should have
> > > been
> > > > > made
> > > > > > > or
> > > > > > > > >> some
> > > > > > > > >> > > other possible server-side support.
> > > > > > > > >> > >
> > > > > > > > >> >
> > > > > > > > >> > If I understand this correctly, it is a duplication
> issue,
> > > > > right?
> > > > > > > > >> >
> > > > > > > > >> > Can a de-duplication solution work here? Either DL or
> your
> > > > > client
> > > > > > > does
> > > > > > > > >> the
> > > > > > > > >> > de-duplication?
> > > > > > > > >> >
> > > > > > > > >>
> > > > > > > > >> The requirements I'm mentioning are the ones needed for
> > > > > client-side
> > > > > > > > >> dedupping. Since if I can guarantee writes being exposed
> > > within
> > > > > some
> > > > > > > > time
> > > > > > > > >> frame, and I can never get into an inconsistently ordered
> > > state
> > > > > when
> > > > > > > > >> successes happen, when an error occurs, I can always wait
> > for
> > > > max
> > > > > > time
> > > > > > > > >> frame, read the latest writes, and then dedup locally
> > against
> > > > the
> > > > > > > > request
> > > > > > > > >> I
> > > > > > > > >> just made.
> > > > > > > > >>
> > > > > > > > >> The main thing about that timeframe is that its basically
> > the
> > > > > > addition
> > > > > > > > of
> > > > > > > > >> every timeout, all the way down in the system, combined
> with
> > > > > > whatever
> > > > > > > > >> flushing / caching / etc times are at the bookkeeper /
> > client
> > > > > level
> > > > > > > for
> > > > > > > > >> when values are exposed
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Gotcha.
> > > > > > > > >
> > > > > > > > >>
> > > > > > > > >>
> > > > > > > > >> >
> > > > > > > > >> > Is there any ways to identify your write?
> > > > > > > > >> >
> > > > > > > > >> > I can think of a case as follow - I want to know what is
> > > your
> > > > > > > expected
> > > > > > > > >> > behavior from the log.
> > > > > > > > >> >
> > > > > > > > >> > a)
> > > > > > > > >> >
> > > > > > > > >> > If a hbase region server A writes a change of key K to
> the
> > > > log,
> > > > > > the
> > > > > > > > >> change
> > > > > > > > >> > is successfully made to the log;
> > > > > > > > >> > but server A is down before receiving the change.
> > > > > > > > >> > region server B took over the region that contains K,
> what
> > > > will
> > > > > B
> > > > > > > do?
> > > > > > > > >> >
> > > > > > > > >>
> > > > > > > > >> HBase writes in large chunks (WAL Logs), which its
> > replication
> > > > > > system
> > > > > > > > then
> > > > > > > > >> handles by replaying in the case of failure. If I'm in a
> > > middle
> > > > > of a
> > > > > > > > log,
> > > > > > > > >> and the whole region goes down and gets rescheduled
> > > elsewhere, I
> > > > > > will
> > > > > > > > >> start
> > > > > > > > >> back up from the beginning of the log I was in the middle
> > of.
> > > > > Using
> > > > > > > > >> checkpointing + deduping, we should be able to find out
> > where
> > > we
> > > > > > left
> > > > > > > > off
> > > > > > > > >> in the log.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >> >
> > > > > > > > >> >
> > > > > > > > >> > b) same as a). but server A was just network
> partitioned.
> > > will
> > > > > > both
> > > > > > > A
> > > > > > > > >> and B
> > > > > > > > >> > write the change of key K?
> > > > > > > > >> >
> > > > > > > > >>
> > > > > > > > >> HBase gives us some guarantees around network partitions
> > > > > > (Consistency
> > > > > > > > over
> > > > > > > > >> availability for HBase). HBase is a single-master failover
> > > > > recovery
> > > > > > > type
> > > > > > > > >> of
> > > > > > > > >> system, with zookeeper-based guarantees for single owners
> > > > > (writers)
> > > > > > > of a
> > > > > > > > >> range of data.
> > > > > > > > >>
> > > > > > > > >> >
> > > > > > > > >> >
> > > > > > > > >> >
> > > > > > > > >> > > 5) If A is turned into multiple batches (one large
> > request
> > > > > gets
> > > > > > > > split
> > > > > > > > >> > into
> > > > > > > > >> > > multiple smaller ones to the bookkeeper backend, due
> to
> > > log
> > > > > > > rolling
> > > > > > > > /
> > > > > > > > >> > size
> > > > > > > > >> > > / etc):
> > > > > > > > >> > >   a) The ordering of entries within batches have
> > ordering
> > > > > > > > consistence
> > > > > > > > >> > with
> > > > > > > > >> > > the original request, when exposed in the log (though
> > they
> > > > may
> > > > > > be
> > > > > > > > >> > > interleaved with other requests)
> > > > > > > > >> > >   b) The ordering across batches have ordering
> > consistence
> > > > > with
> > > > > > > the
> > > > > > > > >> > > original request, when exposed in the log (though they
> > may
> > > > be
> > > > > > > > >> interleaved
> > > > > > > > >> > > with other requests)
> > > > > > > > >> > >   c) If a batch fails, and cannot be retried / is
> > > > > unsuccessfully
> > > > > > > > >> retried,
> > > > > > > > >> > > all batches after the failed batch should not be
> exposed
> > > in
> > > > > the
> > > > > > > log.
> > > > > > > > >> > Note:
> > > > > > > > >> > > The batches before and including the failed batch,
> that
> > > > ended
> > > > > up
> > > > > > > > >> > > succeeding, can show up in the log, again within some
> > > > bounded
> > > > > > time
> > > > > > > > >> range
> > > > > > > > >> > > for reads by a client.
> > > > > > > > >> > >
> > > > > > > > >> >
> > > > > > > > >> > There is a method 'writeBulk' in DistributedLogClient
> can
> > > > > achieve
> > > > > > > this
> > > > > > > > >> > guarantee.
> > > > > > > > >> >
> > > > > > > > >> > However, I am not very sure about how will you turn A
> into
> > > > > > batches.
> > > > > > > If
> > > > > > > > >> you
> > > > > > > > >> > are dividing A into batches,
> > > > > > > > >> > you can simply control the application write sequence to
> > > > achieve
> > > > > > the
> > > > > > > > >> > guarantee here.
> > > > > > > > >> >
> > > > > > > > >> > Can you explain more about this?
> > > > > > > > >> >
> > > > > > > > >>
> > > > > > > > >> In this case, by batches I mean what the proxy does with
> the
> > > > > single
> > > > > > > > >> request
> > > > > > > > >> that I send it. If the proxy decides it needs to turn my
> > > single
> > > > > > > request
> > > > > > > > >> into multiple batches of requests, due to log rolling,
> size
> > > > > > > limitations,
> > > > > > > > >> etc, those would be the guarantees I need to be able to
> > > > > reduplicate
> > > > > > on
> > > > > > > > the
> > > > > > > > >> client side.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > A single record written by #write and A record set (set of
> > > > records)
> > > > > > > > > written by #writeRecordSet are atomic - they will not be
> > broken
> > > > > down
> > > > > > > into
> > > > > > > > > entries (batches). With the correct response code, you
> would
> > be
> > > > > able
> > > > > > to
> > > > > > > > > tell if it is a lost-ack failure or not. However there is a
> > > size
> > > > > > > > limitation
> > > > > > > > > for this - it can't not go beyond 1MB for current
> > > implementation.
> > > > > > > > >
> > > > > > > > > What is your expected record size?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >>
> > > > > > > > >> >
> > > > > > > > >> >
> > > > > > > > >> > >
> > > > > > > > >> > > Since we can guarantee per-key ordering on the client
> > > side,
> > > > we
> > > > > > > > >> guarantee
> > > > > > > > >> > > that there is a single writer per-key, just not per
> log.
> > > > > > > > >> >
> > > > > > > > >> >
> > > > > > > > >> > Do you need fencing guarantee in the case of network
> > > partition
> > > > > > > causing
> > > > > > > > >> > two-writers?
> > > > > > > > >> >
> > > > > > > > >> >
> > > > > > > > >> > > So if there was a
> > > > > > > > >> > > way to guarantee a single write request as being
> written
> > > or
> > > > > not,
> > > > > > > > >> within a
> > > > > > > > >> > > certain time frame (since failures should be rare
> > anyways,
> > > > > this
> > > > > > is
> > > > > > > > >> fine
> > > > > > > > >> > if
> > > > > > > > >> > > it is expensive), we can then have the client
> guarantee
> > > the
> > > > > > > ordering
> > > > > > > > >> it
> > > > > > > > >> > > needs.
> > > > > > > > >> > >
> > > > > > > > >> >
> > > > > > > > >> > This sounds an 'exact-once' write (regarding retries)
> > > > > requirement
> > > > > > to
> > > > > > > > me,
> > > > > > > > >> > right?
> > > > > > > > >> >
> > > > > > > > >> Yes. I'm curious of how this issue is handled by
> Manhattan,
> > > > since
> > > > > > you
> > > > > > > > can
> > > > > > > > >> imagine a data store that ends up getting multiple writes
> > for
> > > > the
> > > > > > same
> > > > > > > > put
> > > > > > > > >> / get / etc, would be harder to use, and we are basically
> > > trying
> > > > > to
> > > > > > > > create
> > > > > > > > >> a log like that for HBase.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Are you guys replacing HBase WAL?
> > > > > > > > >
> > > > > > > > > In Manhattan case, the request will be first written to DL
> > > > streams
> > > > > by
> > > > > > > > > Manhattan coordinator. The Manhattan replica then will read
> > > from
> > > > > the
> > > > > > DL
> > > > > > > > > streams and apply the change. In the lost-ack case, the MH
> > > > > > coordinator
> > > > > > > > will
> > > > > > > > > just fail the request to client.
> > > > > > > > >
> > > > > > > > > My feeling here is your usage for HBase is a bit different
> > from
> > > > how
> > > > > > we
> > > > > > > > use
> > > > > > > > > DL in Manhattan. It sounds like you read from a source
> (HBase
> > > > WAL)
> > > > > > and
> > > > > > > > > write to DL. But I might be wrong.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >>
> > > > > > > > >> >
> > > > > > > > >> >
> > > > > > > > >> > >
> > > > > > > > >> > >
> > > > > > > > >> > > > Cameron:
> > > > > > > > >> > > > Another thing we've discussed but haven't really
> > thought
> > > > > > > through -
> > > > > > > > >> > > > We might be able to support some kind of epoch write
> > > > > request,
> > > > > > > > where
> > > > > > > > >> the
> > > > > > > > >> > > > epoch is guaranteed to have changed if the writer
> has
> > > > > changed
> > > > > > or
> > > > > > > > the
> > > > > > > > >> > > ledger
> > > > > > > > >> > > > was ever fenced off. Writes include an epoch and are
> > > > > rejected
> > > > > > if
> > > > > > > > the
> > > > > > > > >> > > epoch
> > > > > > > > >> > > > has changed.
> > > > > > > > >> > > > With a mechanism like this, fencing the ledger off
> > > after a
> > > > > > > failure
> > > > > > > > >> > would
> > > > > > > > >> > > > ensure any pending writes had either been written or
> > > would
> > > > > be
> > > > > > > > >> rejected.
> > > > > > > > >> > >
> > > > > > > > >> > > The issue would be how I guarantee the write I wrote
> to
> > > the
> > > > > > server
> > > > > > > > was
> > > > > > > > >> > > written. Since a network issue could happen on the
> send
> > of
> > > > the
> > > > > > > > >> request,
> > > > > > > > >> > or
> > > > > > > > >> > > on the receive of the success response, an epoch
> > wouldn't
> > > > tell
> > > > > > me
> > > > > > > > if I
> > > > > > > > >> > can
> > > > > > > > >> > > successfully retry, as it could be successfully
> written
> > > but
> > > > > AWS
> > > > > > > > >> dropped
> > > > > > > > >> > the
> > > > > > > > >> > > connection for the success response. Since the epoch
> > would
> > > > be
> > > > > > the
> > > > > > > > same
> > > > > > > > >> > > (same ledger), I could write duplicates.
> > > > > > > > >> > >
> > > > > > > > >> > >
> > > > > > > > >> > > > We are currently proposing adding a transaction
> > semantic
> > > > to
> > > > > dl
> > > > > > > to
> > > > > > > > >> get
> > > > > > > > >> > rid
> > > > > > > > >> > > > of the size limitation and the unaware-ness in the
> > proxy
> > > > > > client.
> > > > > > > > >> Here
> > > > > > > > >> > is
> > > > > > > > >> > > > our idea -
> > > > > > > > >> > > > http://mail-archives.apache.org/mod_mbox/incubator-
> > > > > > > distributedlog
> > > > > > > > >> > > -dev/201609.mbox/%3cCAAC6BxP5YyEHwG0ZCF5soh42X=xuYwYm
> > > > > > > > >> > > <http://mail-archives.apache.org/mod_mbox/incubator-
> > > > > > > > >> > distributedlog%0A-dev/201609.mbox/%
> > > 3cCAAC6BxP5YyEHwG0ZCF5soh
> > > > > > > > 42X=xuYwYm>
> > > > > > > > >> > > L4nXsYBYiofzxpVk6g@mail.gmail.com%3e
> > > > > > > > >> > >
> > > > > > > > >> > > > I am not sure if your idea is similar as ours. but
> > we'd
> > > > like
> > > > > > to
> > > > > > > > >> > > collaborate
> > > > > > > > >> > > > with the community if anyone has the similar idea.
> > > > > > > > >> > >
> > > > > > > > >> > > Our use case would be covered by transaction support,
> > but
> > > > I'm
> > > > > > > unsure
> > > > > > > > >> if
> > > > > > > > >> > we
> > > > > > > > >> > > would need something that heavy weight for the
> > guarantees
> > > we
> > > > > > need.
> > > > > > > > >> > >
> > > > > > > > >> >
> > > > > > > > >> > >
> > > > > > > > >> > > Basically, the high level requirement here is "Support
> > > > > > consistent
> > > > > > > > >> write
> > > > > > > > >> > > ordering for single-writer-per-key,
> > multi-writer-per-log".
> > > > My
> > > > > > > hunch
> > > > > > > > is
> > > > > > > > >> > > that, with some added guarantees to the proxy (if it
> > isn't
> > > > > > already
> > > > > > > > >> > > supported), and some custom client code on our side
> for
> > > > > removing
> > > > > > > the
> > > > > > > > >> > > entries that actually succeed to write to
> DistributedLog
> > > > from
> > > > > > the
> > > > > > > > >> request
> > > > > > > > >> > > that failed, it should be a relatively easy thing to
> > > > support.
> > > > > > > > >> > >
> > > > > > > > >> >
> > > > > > > > >> > Yup. I think it should not be very difficult to support.
> > > There
> > > > > > might
> > > > > > > > be
> > > > > > > > >> > some changes in the server side.
> > > > > > > > >> > Let's figure out what will the changes be. Are you guys
> > > > > interested
> > > > > > > in
> > > > > > > > >> > contributing?
> > > > > > > > >> >
> > > > > > > > >> > Yes, we would be.
> > > > > > > > >>
> > > > > > > > >> As a note, the one thing that we see as an issue with the
> > > client
> > > > > > side
> > > > > > > > >> dedupping is how to bound the range of data that needs to
> be
> > > > > looked
> > > > > > at
> > > > > > > > for
> > > > > > > > >> deduplication. As you can imagine, it is pretty easy to
> > bound
> > > > the
> > > > > > > bottom
> > > > > > > > >> of
> > > > > > > > >> the range, as that it just regular checkpointing of the
> DSLN
> > > > that
> > > > > is
> > > > > > > > >> returned. I'm still not sure if there is any nice way to
> > time
> > > > > bound
> > > > > > > the
> > > > > > > > >> top
> > > > > > > > >> end of the range, especially since the proxy owns sequence
> > > > numbers
> > > > > > > > (which
> > > > > > > > >> makes sense). I am curious if there is more that can be
> done
> > > if
> > > > > > > > >> deduplication is on the server side. However the main
> minus
> > I
> > > > see
> > > > > of
> > > > > > > > >> server
> > > > > > > > >> side deduplication is that instead of running contingent
> on
> > > > there
> > > > > > > being
> > > > > > > > a
> > > > > > > > >> failed client request, instead it would have to run every
> > > time a
> > > > > > write
> > > > > > > > >> happens.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > For a reliable dedup, we probably need
> fence-then-getLastDLSN
> > > > > > > operation -
> > > > > > > > > so it would guarantee that any non-completed requests
> issued
> > > > > > (lost-ack
> > > > > > > > > requests) before this fence-then-getLastDLSN operation will
> > be
> > > > > failed
> > > > > > > and
> > > > > > > > > they will never land at the log.
> > > > > > > > >
> > > > > > > > > the pseudo code would look like below -
> > > > > > > > >
> > > > > > > > > write(request) onFailure { t =>
> > > > > > > > >
> > > > > > > > > if (t is timeout exception) {
> > > > > > > > >
> > > > > > > > > DLSN lastDLSN = fenceThenGetLastDLSN()
> > > > > > > > > DLSN lastCheckpointedDLSN = ...;
> > > > > > > > > // find if the request lands between [lastDLSN,
> > > > > > lastCheckpointedDLSN].
> > > > > > > > > // if it exists, the write succeed; otherwise retry.
> > > > > > > > >
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > }
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > Just realized the idea is same as what Leigh raised in the
> > > previous
> > > > > > email
> > > > > > > > about 'epoch write'. Let me explain more about this idea
> > (Leigh,
> > > > feel
> > > > > > > free
> > > > > > > > to jump in to fill up your idea).
> > > > > > > >
> > > > > > > > - when a log stream is owned,  the proxy use the last
> > transaction
> > > > id
> > > > > as
> > > > > > > the
> > > > > > > > epoch
> > > > > > > > - when a client connects (handshake with the proxy), it will
> > get
> > > > the
> > > > > > > epoch
> > > > > > > > for the stream.
> > > > > > > > - the writes issued by this client will carry the epoch to
> the
> > > > proxy.
> > > > > > > > - add a new rpc - fenceThenGetLastDLSN - it would force the
> > proxy
> > > > to
> > > > > > bump
> > > > > > > > the epoch.
> > > > > > > > - if fenceThenGetLastDLSN happened, all the outstanding
> writes
> > > with
> > > > > old
> > > > > > > > epoch will be rejected with exceptions (e.g. EpochFenced).
> > > > > > > > - The DLSN returned from fenceThenGetLastDLSN can be used as
> > the
> > > > > bound
> > > > > > > for
> > > > > > > > deduplications on failures.
> > > > > > > >
> > > > > > > > Cameron, does this sound a solution to your use case?
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >>
> > > > > > > > >> Maybe something that could fit a similar need that Kafka
> > does
> > > > (the
> > > > > > > last
> > > > > > > > >> store value for a particular key in a log), such that on a
> > per
> > > > key
> > > > > > > basis
> > > > > > > > >> there could be a sequence number that support
> deduplication?
> > > > Cost
> > > > > > > seems
> > > > > > > > >> like it would be high however, and I'm not even sure if
> > > > bookkeeper
> > > > > > > > >> supports
> > > > > > > > >> it.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >> Cheers,
> > > > > > > > >> Cameron
> > > > > > > > >>
> > > > > > > > >> >
> > > > > > > > >> > >
> > > > > > > > >> > > Thanks,
> > > > > > > > >> > > Cameron
> > > > > > > > >> > >
> > > > > > > > >> > >
> > > > > > > > >> > > On Sat, Oct 8, 2016 at 7:35 AM, Leigh Stewart
> > > > > > > > >> > <lstewart@twitter.com.invalid
> > > > > > > > >> > > >
> > > > > > > > >> > > wrote:
> > > > > > > > >> > >
> > > > > > > > >> > > > Cameron:
> > > > > > > > >> > > > Another thing we've discussed but haven't really
> > thought
> > > > > > > through -
> > > > > > > > >> > > > We might be able to support some kind of epoch write
> > > > > request,
> > > > > > > > where
> > > > > > > > >> the
> > > > > > > > >> > > > epoch is guaranteed to have changed if the writer
> has
> > > > > changed
> > > > > > or
> > > > > > > > the
> > > > > > > > >> > > ledger
> > > > > > > > >> > > > was ever fenced off. Writes include an epoch and are
> > > > > rejected
> > > > > > if
> > > > > > > > the
> > > > > > > > >> > > epoch
> > > > > > > > >> > > > has changed.
> > > > > > > > >> > > > With a mechanism like this, fencing the ledger off
> > > after a
> > > > > > > failure
> > > > > > > > >> > would
> > > > > > > > >> > > > ensure any pending writes had either been written or
> > > would
> > > > > be
> > > > > > > > >> rejected.
> > > > > > > > >> > > >
> > > > > > > > >> > > >
> > > > > > > > >> > > > On Sat, Oct 8, 2016 at 7:10 AM, Sijie Guo <
> > > > sijie@apache.org
> > > > > >
> > > > > > > > wrote:
> > > > > > > > >> > > >
> > > > > > > > >> > > > > Cameron,
> > > > > > > > >> > > > >
> > > > > > > > >> > > > > I think both Leigh and Xi had made a few good
> points
> > > > about
> > > > > > > your
> > > > > > > > >> > > question.
> > > > > > > > >> > > > >
> > > > > > > > >> > > > > To add one more point to your question - "but I am
> > not
> > > > > > > > >> > > > > 100% of how all of the futures in the code handle
> > > > > failures.
> > > > > > > > >> > > > > If not, where in the code would be the relevant
> > places
> > > > to
> > > > > > add
> > > > > > > > the
> > > > > > > > >> > > ability
> > > > > > > > >> > > > > to do this, and would the project be interested
> in a
> > > > pull
> > > > > > > > >> request?"
> > > > > > > > >> > > > >
> > > > > > > > >> > > > > The current proxy and client logic doesn't do
> > > perfectly
> > > > on
> > > > > > > > >> handling
> > > > > > > > >> > > > > failures (duplicates) - the strategy now is the
> > client
> > > > > will
> > > > > > > > retry
> > > > > > > > >> as
> > > > > > > > >> > > best
> > > > > > > > >> > > > > at it can before throwing exceptions to users. The
> > > code
> > > > > you
> > > > > > > are
> > > > > > > > >> > looking
> > > > > > > > >> > > > for
> > > > > > > > >> > > > > - it is on BKLogSegmentWriter for the proxy
> handling
> > > > > writes
> > > > > > > and
> > > > > > > > >> it is
> > > > > > > > >> > > on
> > > > > > > > >> > > > > DistributedLogClientImpl for the proxy client
> > handling
> > > > > > > responses
> > > > > > > > >> from
> > > > > > > > >> > > > > proxies. Does this help you?
> > > > > > > > >> > > > >
> > > > > > > > >> > > > > And also, you are welcome to contribute the pull
> > > > requests.
> > > > > > > > >> > > > >
> > > > > > > > >> > > > > - Sijie
> > > > > > > > >> > > > >
> > > > > > > > >> > > > >
> > > > > > > > >> > > > >
> > > > > > > > >> > > > > On Tue, Oct 4, 2016 at 3:39 PM, Cameron Hatfield <
> > > > > > > > >> kinguy@gmail.com>
> > > > > > > > >> > > > wrote:
> > > > > > > > >> > > > >
> > > > > > > > >> > > > > > I have a question about the Proxy Client.
> > Basically,
> > > > for
> > > > > > our
> > > > > > > > use
> > > > > > > > >> > > cases,
> > > > > > > > >> > > > > we
> > > > > > > > >> > > > > > want to guarantee ordering at the key level,
> > > > > irrespective
> > > > > > of
> > > > > > > > the
> > > > > > > > >> > > > ordering
> > > > > > > > >> > > > > > of the partition it may be assigned to as a
> whole.
> > > Due
> > > > > to
> > > > > > > the
> > > > > > > > >> > source
> > > > > > > > >> > > of
> > > > > > > > >> > > > > the
> > > > > > > > >> > > > > > data (HBase Replication), we cannot guarantee
> > that a
> > > > > > single
> > > > > > > > >> > partition
> > > > > > > > >> > > > > will
> > > > > > > > >> > > > > > be owned for writes by the same client. This
> means
> > > the
> > > > > > proxy
> > > > > > > > >> client
> > > > > > > > >> > > > works
> > > > > > > > >> > > > > > well (since we don't care which proxy owns the
> > > > partition
> > > > > > we
> > > > > > > > are
> > > > > > > > >> > > writing
> > > > > > > > >> > > > > > to).
> > > > > > > > >> > > > > >
> > > > > > > > >> > > > > >
> > > > > > > > >> > > > > > However, the guarantees we need when writing a
> > batch
> > > > > > > consists
> > > > > > > > >> of:
> > > > > > > > >> > > > > > Definition of a Batch: The set of records sent
> to
> > > the
> > > > > > > > writeBatch
> > > > > > > > >> > > > endpoint
> > > > > > > > >> > > > > > on the proxy
> > > > > > > > >> > > > > >
> > > > > > > > >> > > > > > 1. Batch success: If the client receives a
> success
> > > > from
> > > > > > the
> > > > > > > > >> proxy,
> > > > > > > > >> > > then
> > > > > > > > >> > > > > > that batch is successfully written
> > > > > > > > >> > > > > >
> > > > > > > > >> > > > > > 2. Inter-Batch ordering : Once a batch has been
> > > > written
> > > > > > > > >> > successfully
> > > > > > > > >> > > by
> > > > > > > > >> > > > > the
> > > > > > > > >> > > > > > client, when another batch is written, it will
> be
> > > > > > guaranteed
> > > > > > > > to
> > > > > > > > >> be
> > > > > > > > >> > > > > ordered
> > > > > > > > >> > > > > > after the last batch (if it is the same stream).
> > > > > > > > >> > > > > >
> > > > > > > > >> > > > > > 3. Intra-Batch ordering: Within a batch of
> writes,
> > > the
> > > > > > > records
> > > > > > > > >> will
> > > > > > > > >> > > be
> > > > > > > > >> > > > > > committed in order
> > > > > > > > >> > > > > >
> > > > > > > > >> > > > > > 4. Intra-Batch failure ordering: If an
> individual
> > > > record
> > > > > > > fails
> > > > > > > > >> to
> > > > > > > > >> > > write
> > > > > > > > >> > > > > > within a batch, all records after that record
> will
> > > not
> > > > > be
> > > > > > > > >> written.
> > > > > > > > >> > > > > >
> > > > > > > > >> > > > > > 5. Batch Commit: Guarantee that if a batch
> > returns a
> > > > > > > success,
> > > > > > > > it
> > > > > > > > >> > will
> > > > > > > > >> > > > be
> > > > > > > > >> > > > > > written
> > > > > > > > >> > > > > >
> > > > > > > > >> > > > > > 6. Read-after-write: Once a batch is committed,
> > > > within a
> > > > > > > > limited
> > > > > > > > >> > > > > time-frame
> > > > > > > > >> > > > > > it will be able to be read. This is required in
> > the
> > > > case
> > > > > > of
> > > > > > > > >> > failure,
> > > > > > > > >> > > so
> > > > > > > > >> > > > > > that the client can see what actually got
> > > committed. I
> > > > > > > believe
> > > > > > > > >> the
> > > > > > > > >> > > > > > time-frame part could be removed if the client
> can
> > > > send
> > > > > in
> > > > > > > the
> > > > > > > > >> same
> > > > > > > > >> > > > > > sequence number that was written previously,
> since
> > > it
> > > > > > would
> > > > > > > > then
> > > > > > > > >> > fail
> > > > > > > > >> > > > and
> > > > > > > > >> > > > > > we would know that a read needs to occur.
> > > > > > > > >> > > > > >
> > > > > > > > >> > > > > >
> > > > > > > > >> > > > > > So, my basic question is if this is currently
> > > possible
> > > > > in
> > > > > > > the
> > > > > > > > >> > proxy?
> > > > > > > > >> > > I
> > > > > > > > >> > > > > > don't believe it gives these guarantees as it
> > stands
> > > > > > today,
> > > > > > > > but
> > > > > > > > >> I
> > > > > > > > >> > am
> > > > > > > > >> > > > not
> > > > > > > > >> > > > > > 100% of how all of the futures in the code
> handle
> > > > > > failures.
> > > > > > > > >> > > > > > If not, where in the code would be the relevant
> > > places
> > > > > to
> > > > > > > add
> > > > > > > > >> the
> > > > > > > > >> > > > ability
> > > > > > > > >> > > > > > to do this, and would the project be interested
> > in a
> > > > > pull
> > > > > > > > >> request?
> > > > > > > > >> > > > > >
> > > > > > > > >> > > > > >
> > > > > > > > >> > > > > > Thanks,
> > > > > > > > >> > > > > > Cameron
> > > > > > > > >> > > > > >
> > > > > > > > >> > > > >
> > > > > > > > >> > > >
> > > > > > > > >> > >
> > > > > > > > >> >
> > > > > > > > >>
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Proxy Client - Batch Ordering / Commit

Posted by Xi Liu <xi...@gmail.com>.
Thank you, Cameron. Look forward to your comments.

- Xi

On Sun, Nov 13, 2016 at 1:21 PM, Cameron Hatfield <ki...@gmail.com> wrote:

> Sorry, I've been on vacation for the past week, and heads down for a
> release that is using DL at the end of Nov. I'll take a look at this over
> the next week, and add any relevant comments. After we are finished with
> dev for this release, I am hoping to tackle this next.
>
> -Cameron
>
> On Fri, Nov 11, 2016 at 12:07 PM, Sijie Guo <si...@apache.org> wrote:
>
> > Xi,
> >
> > Thank you so much for your proposal. I took a look. It looks fine to me.
> > Cameron, do you have any comments?
> >
> > Look forward to your pull requests.
> >
> > - Sijie
> >
> >
> > On Wed, Nov 9, 2016 at 2:34 AM, Xi Liu <xi...@gmail.com> wrote:
> >
> > > Cameron,
> > >
> > > Have you started any work for this? I just updated the proposal page -
> > > https://cwiki.apache.org/confluence/display/DL/DP-2+-+
> > Epoch+Write+Support
> > > Maybe we can work together with this.
> > >
> > > Sijie, Leigh,
> > >
> > > can you guys help review this to make sure our proposal is in the right
> > > direction?
> > >
> > > - Xi
> > >
> > > On Tue, Nov 1, 2016 at 3:05 AM, Sijie Guo <si...@apache.org> wrote:
> > >
> > > > I created https://issues.apache.org/jira/browse/DL-63 for tracking
> the
> > > > proposed idea here.
> > > >
> > > >
> > > >
> > > > On Wed, Oct 26, 2016 at 4:53 PM, Sijie Guo
> <sijieg@twitter.com.invalid
> > >
> > > > wrote:
> > > >
> > > > > On Tue, Oct 25, 2016 at 11:30 AM, Cameron Hatfield <
> kinguy@gmail.com
> > >
> > > > > wrote:
> > > > >
> > > > > > Yes, we are reading the HBase WAL (from their replication plugin
> > > > > support),
> > > > > > and writing that into DL.
> > > > > >
> > > > >
> > > > > Gotcha.
> > > > >
> > > > >
> > > > > >
> > > > > > From the sounds of it, yes, it would. Only thing I would say is
> > make
> > > > the
> > > > > > epoch requirement optional, so that if I client doesn't care
> about
> > > > dupes
> > > > > > they don't have to deal with the process of getting a new epoch.
> > > > > >
> > > > >
> > > > > Yup. This should be optional. I can start a wiki page on how we
> want
> > to
> > > > > implement this. Are you interested in contributing to this?
> > > > >
> > > > >
> > > > > >
> > > > > > -Cameron
> > > > > >
> > > > > > On Wed, Oct 19, 2016 at 7:43 PM, Sijie Guo
> > > <sijieg@twitter.com.invalid
> > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > On Wed, Oct 19, 2016 at 7:17 PM, Sijie Guo <sijieg@twitter.com
> >
> > > > wrote:
> > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Monday, October 17, 2016, Cameron Hatfield <
> > kinguy@gmail.com>
> > > > > > wrote:
> > > > > > > >
> > > > > > > >> Answer inline:
> > > > > > > >>
> > > > > > > >> On Mon, Oct 17, 2016 at 11:46 AM, Sijie Guo <
> sijie@apache.org
> > >
> > > > > wrote:
> > > > > > > >>
> > > > > > > >> > Cameron,
> > > > > > > >> >
> > > > > > > >> > Thank you for your summary. I liked the discussion here. I
> > > also
> > > > > > liked
> > > > > > > >> the
> > > > > > > >> > summary of your requirement - 'single-writer-per-key,
> > > > > > > >> > multiple-writers-per-log'. If I understand correctly, the
> > core
> > > > > > concern
> > > > > > > >> here
> > > > > > > >> > is almost 'exact-once' write (or a way to explicit tell
> if a
> > > > write
> > > > > > can
> > > > > > > >> be
> > > > > > > >> > retried or not).
> > > > > > > >> >
> > > > > > > >> > Comments inline.
> > > > > > > >> >
> > > > > > > >> > On Fri, Oct 14, 2016 at 11:17 AM, Cameron Hatfield <
> > > > > > kinguy@gmail.com>
> > > > > > > >> > wrote:
> > > > > > > >> >
> > > > > > > >> > > > Ah- yes good point (to be clear we're not using the
> > proxy
> > > > this
> > > > > > way
> > > > > > > >> > > today).
> > > > > > > >> > >
> > > > > > > >> > > > > Due to the source of the
> > > > > > > >> > > > > data (HBase Replication), we cannot guarantee that a
> > > > single
> > > > > > > >> partition
> > > > > > > >> > > will
> > > > > > > >> > > > > be owned for writes by the same client.
> > > > > > > >> > >
> > > > > > > >> > > > Do you mean you *need* to support multiple writers
> > issuing
> > > > > > > >> interleaved
> > > > > > > >> > > > writes or is it just that they might sometimes
> > interleave
> > > > > writes
> > > > > > > and
> > > > > > > >> > you
> > > > > > > >> > > >don't care?
> > > > > > > >> > > How HBase partitions the keys being written wouldn't
> have
> > a
> > > > > > one->one
> > > > > > > >> > > mapping with the partitions we would have in HBase. Even
> > if
> > > we
> > > > > did
> > > > > > > >> have
> > > > > > > >> > > that alignment when the cluster first started, HBase
> will
> > > > > > rebalance
> > > > > > > >> what
> > > > > > > >> > > servers own what partitions, as well as split and merge
> > > > > partitions
> > > > > > > >> that
> > > > > > > >> > > already exist, causing eventual drift from one log per
> > > > > partition.
> > > > > > > >> > > Because we want ordering guarantees per key (row in
> > hbase),
> > > we
> > > > > > > >> partition
> > > > > > > >> > > the logs by the key. Since multiple writers are possible
> > per
> > > > > range
> > > > > > > of
> > > > > > > >> > keys
> > > > > > > >> > > (due to the aforementioned rebalancing / splitting / etc
> > of
> > > > > > hbase),
> > > > > > > we
> > > > > > > >> > > cannot use the core library due to requiring a single
> > writer
> > > > for
> > > > > > > >> > ordering.
> > > > > > > >> > >
> > > > > > > >> > > But, for a single log, we don't really care about
> ordering
> > > > aside
> > > > > > > from
> > > > > > > >> at
> > > > > > > >> > > the per-key level. So all we really need to be able to
> > > handle
> > > > is
> > > > > > > >> > preventing
> > > > > > > >> > > duplicates when a failure occurs, and ordering
> consistency
> > > > > across
> > > > > > > >> > requests
> > > > > > > >> > > from a single client.
> > > > > > > >> > >
> > > > > > > >> > > So our general requirements are:
> > > > > > > >> > > Write A, Write B
> > > > > > > >> > > Timeline: A -> B
> > > > > > > >> > > Request B is only made after A has successfully returned
> > > > > (possibly
> > > > > > > >> after
> > > > > > > >> > > retries)
> > > > > > > >> > >
> > > > > > > >> > > 1) If the write succeeds, it will be durably exposed to
> > > > clients
> > > > > > > within
> > > > > > > >> > some
> > > > > > > >> > > bounded time frame
> > > > > > > >> > >
> > > > > > > >> >
> > > > > > > >> > Guaranteed.
> > > > > > > >> >
> > > > > > > >>
> > > > > > > >> > > 2) If A succeeds and B succeeds, the ordering for the
> log
> > > will
> > > > > be
> > > > > > A
> > > > > > > >> and
> > > > > > > >> > > then B
> > > > > > > >> > >
> > > > > > > >> >
> > > > > > > >> > If I understand correctly here, B is only sent after A is
> > > > > returned,
> > > > > > > >> right?
> > > > > > > >> > If that's the case, It is guaranteed.
> > > > > > > >>
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> >
> > > > > > > >> >
> > > > > > > >> >
> > > > > > > >> > > 3) If A fails due to an error that can be relied on to
> > *not*
> > > > be
> > > > > a
> > > > > > > lost
> > > > > > > >> > ack
> > > > > > > >> > > problem, it will never be exposed to the client, so it
> may
> > > > > > > (depending
> > > > > > > >> on
> > > > > > > >> > > the error) be retried immediately
> > > > > > > >> > >
> > > > > > > >> >
> > > > > > > >> > If it is not a lost-ack problem, the entry will be
> exposed.
> > it
> > > > is
> > > > > > > >> > guaranteed.
> > > > > > > >>
> > > > > > > >> Let me try rephrasing the questions, to make sure I'm
> > > > understanding
> > > > > > > >> correctly:
> > > > > > > >> If A fails, with an error such as "Unable to create
> connection
> > > to
> > > > > > > >> bookkeeper server", that would be the type of error we would
> > > > expect
> > > > > to
> > > > > > > be
> > > > > > > >> able to retry immediately, as that result means no action
> was
> > > > taken
> > > > > on
> > > > > > > any
> > > > > > > >> log / etc, so no entry could have been created. This is
> > > different
> > > > > > then a
> > > > > > > >> "Connection Timeout" exception, as we just might not have
> > > gotten a
> > > > > > > >> response
> > > > > > > >> in time.
> > > > > > > >>
> > > > > > > >>
> > > > > > > > Gotcha.
> > > > > > > >
> > > > > > > > The response code returned from proxy can tell if a failure
> can
> > > be
> > > > > > > retried
> > > > > > > > safely or not. (We might need to make them well documented)
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >>
> > > > > > > >> >
> > > > > > > >> >
> > > > > > > >> > > 4) If A fails due to an error that could be a lost ack
> > > problem
> > > > > > > >> (network
> > > > > > > >> > > connectivity / etc), within a bounded time frame it
> should
> > > be
> > > > > > > >> possible to
> > > > > > > >> > > find out if the write succeed or failed. Either by
> reading
> > > > from
> > > > > > some
> > > > > > > >> > > checkpoint of the log for the changes that should have
> > been
> > > > made
> > > > > > or
> > > > > > > >> some
> > > > > > > >> > > other possible server-side support.
> > > > > > > >> > >
> > > > > > > >> >
> > > > > > > >> > If I understand this correctly, it is a duplication issue,
> > > > right?
> > > > > > > >> >
> > > > > > > >> > Can a de-duplication solution work here? Either DL or your
> > > > client
> > > > > > does
> > > > > > > >> the
> > > > > > > >> > de-duplication?
> > > > > > > >> >
> > > > > > > >>
> > > > > > > >> The requirements I'm mentioning are the ones needed for
> > > > client-side
> > > > > > > >> dedupping. Since if I can guarantee writes being exposed
> > within
> > > > some
> > > > > > > time
> > > > > > > >> frame, and I can never get into an inconsistently ordered
> > state
> > > > when
> > > > > > > >> successes happen, when an error occurs, I can always wait
> for
> > > max
> > > > > time
> > > > > > > >> frame, read the latest writes, and then dedup locally
> against
> > > the
> > > > > > > request
> > > > > > > >> I
> > > > > > > >> just made.
> > > > > > > >>
> > > > > > > >> The main thing about that timeframe is that its basically
> the
> > > > > addition
> > > > > > > of
> > > > > > > >> every timeout, all the way down in the system, combined with
> > > > > whatever
> > > > > > > >> flushing / caching / etc times are at the bookkeeper /
> client
> > > > level
> > > > > > for
> > > > > > > >> when values are exposed
> > > > > > > >
> > > > > > > >
> > > > > > > > Gotcha.
> > > > > > > >
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> >
> > > > > > > >> > Is there any ways to identify your write?
> > > > > > > >> >
> > > > > > > >> > I can think of a case as follow - I want to know what is
> > your
> > > > > > expected
> > > > > > > >> > behavior from the log.
> > > > > > > >> >
> > > > > > > >> > a)
> > > > > > > >> >
> > > > > > > >> > If a hbase region server A writes a change of key K to the
> > > log,
> > > > > the
> > > > > > > >> change
> > > > > > > >> > is successfully made to the log;
> > > > > > > >> > but server A is down before receiving the change.
> > > > > > > >> > region server B took over the region that contains K, what
> > > will
> > > > B
> > > > > > do?
> > > > > > > >> >
> > > > > > > >>
> > > > > > > >> HBase writes in large chunks (WAL Logs), which its
> replication
> > > > > system
> > > > > > > then
> > > > > > > >> handles by replaying in the case of failure. If I'm in a
> > middle
> > > > of a
> > > > > > > log,
> > > > > > > >> and the whole region goes down and gets rescheduled
> > elsewhere, I
> > > > > will
> > > > > > > >> start
> > > > > > > >> back up from the beginning of the log I was in the middle
> of.
> > > > Using
> > > > > > > >> checkpointing + deduping, we should be able to find out
> where
> > we
> > > > > left
> > > > > > > off
> > > > > > > >> in the log.
> > > > > > > >
> > > > > > > >
> > > > > > > >> >
> > > > > > > >> >
> > > > > > > >> > b) same as a). but server A was just network partitioned.
> > will
> > > > > both
> > > > > > A
> > > > > > > >> and B
> > > > > > > >> > write the change of key K?
> > > > > > > >> >
> > > > > > > >>
> > > > > > > >> HBase gives us some guarantees around network partitions
> > > > > (Consistency
> > > > > > > over
> > > > > > > >> availability for HBase). HBase is a single-master failover
> > > > recovery
> > > > > > type
> > > > > > > >> of
> > > > > > > >> system, with zookeeper-based guarantees for single owners
> > > > (writers)
> > > > > > of a
> > > > > > > >> range of data.
> > > > > > > >>
> > > > > > > >> >
> > > > > > > >> >
> > > > > > > >> >
> > > > > > > >> > > 5) If A is turned into multiple batches (one large
> request
> > > > gets
> > > > > > > split
> > > > > > > >> > into
> > > > > > > >> > > multiple smaller ones to the bookkeeper backend, due to
> > log
> > > > > > rolling
> > > > > > > /
> > > > > > > >> > size
> > > > > > > >> > > / etc):
> > > > > > > >> > >   a) The ordering of entries within batches have
> ordering
> > > > > > > consistence
> > > > > > > >> > with
> > > > > > > >> > > the original request, when exposed in the log (though
> they
> > > may
> > > > > be
> > > > > > > >> > > interleaved with other requests)
> > > > > > > >> > >   b) The ordering across batches have ordering
> consistence
> > > > with
> > > > > > the
> > > > > > > >> > > original request, when exposed in the log (though they
> may
> > > be
> > > > > > > >> interleaved
> > > > > > > >> > > with other requests)
> > > > > > > >> > >   c) If a batch fails, and cannot be retried / is
> > > > unsuccessfully
> > > > > > > >> retried,
> > > > > > > >> > > all batches after the failed batch should not be exposed
> > in
> > > > the
> > > > > > log.
> > > > > > > >> > Note:
> > > > > > > >> > > The batches before and including the failed batch, that
> > > ended
> > > > up
> > > > > > > >> > > succeeding, can show up in the log, again within some
> > > bounded
> > > > > time
> > > > > > > >> range
> > > > > > > >> > > for reads by a client.
> > > > > > > >> > >
> > > > > > > >> >
> > > > > > > >> > There is a method 'writeBulk' in DistributedLogClient can
> > > > achieve
> > > > > > this
> > > > > > > >> > guarantee.
> > > > > > > >> >
> > > > > > > >> > However, I am not very sure about how will you turn A into
> > > > > batches.
> > > > > > If
> > > > > > > >> you
> > > > > > > >> > are dividing A into batches,
> > > > > > > >> > you can simply control the application write sequence to
> > > achieve
> > > > > the
> > > > > > > >> > guarantee here.
> > > > > > > >> >
> > > > > > > >> > Can you explain more about this?
> > > > > > > >> >
> > > > > > > >>
> > > > > > > >> In this case, by batches I mean what the proxy does with the
> > > > single
> > > > > > > >> request
> > > > > > > >> that I send it. If the proxy decides it needs to turn my
> > single
> > > > > > request
> > > > > > > >> into multiple batches of requests, due to log rolling, size
> > > > > > limitations,
> > > > > > > >> etc, those would be the guarantees I need to be able to
> > > > reduplicate
> > > > > on
> > > > > > > the
> > > > > > > >> client side.
> > > > > > > >
> > > > > > > >
> > > > > > > > A single record written by #write and A record set (set of
> > > records)
> > > > > > > > written by #writeRecordSet are atomic - they will not be
> broken
> > > > down
> > > > > > into
> > > > > > > > entries (batches). With the correct response code, you would
> be
> > > > able
> > > > > to
> > > > > > > > tell if it is a lost-ack failure or not. However there is a
> > size
> > > > > > > limitation
> > > > > > > > for this - it can't not go beyond 1MB for current
> > implementation.
> > > > > > > >
> > > > > > > > What is your expected record size?
> > > > > > > >
> > > > > > > >
> > > > > > > >>
> > > > > > > >> >
> > > > > > > >> >
> > > > > > > >> > >
> > > > > > > >> > > Since we can guarantee per-key ordering on the client
> > side,
> > > we
> > > > > > > >> guarantee
> > > > > > > >> > > that there is a single writer per-key, just not per log.
> > > > > > > >> >
> > > > > > > >> >
> > > > > > > >> > Do you need fencing guarantee in the case of network
> > partition
> > > > > > causing
> > > > > > > >> > two-writers?
> > > > > > > >> >
> > > > > > > >> >
> > > > > > > >> > > So if there was a
> > > > > > > >> > > way to guarantee a single write request as being written
> > or
> > > > not,
> > > > > > > >> within a
> > > > > > > >> > > certain time frame (since failures should be rare
> anyways,
> > > > this
> > > > > is
> > > > > > > >> fine
> > > > > > > >> > if
> > > > > > > >> > > it is expensive), we can then have the client guarantee
> > the
> > > > > > ordering
> > > > > > > >> it
> > > > > > > >> > > needs.
> > > > > > > >> > >
> > > > > > > >> >
> > > > > > > >> > This sounds an 'exact-once' write (regarding retries)
> > > > requirement
> > > > > to
> > > > > > > me,
> > > > > > > >> > right?
> > > > > > > >> >
> > > > > > > >> Yes. I'm curious of how this issue is handled by Manhattan,
> > > since
> > > > > you
> > > > > > > can
> > > > > > > >> imagine a data store that ends up getting multiple writes
> for
> > > the
> > > > > same
> > > > > > > put
> > > > > > > >> / get / etc, would be harder to use, and we are basically
> > trying
> > > > to
> > > > > > > create
> > > > > > > >> a log like that for HBase.
> > > > > > > >
> > > > > > > >
> > > > > > > > Are you guys replacing HBase WAL?
> > > > > > > >
> > > > > > > > In Manhattan case, the request will be first written to DL
> > > streams
> > > > by
> > > > > > > > Manhattan coordinator. The Manhattan replica then will read
> > from
> > > > the
> > > > > DL
> > > > > > > > streams and apply the change. In the lost-ack case, the MH
> > > > > coordinator
> > > > > > > will
> > > > > > > > just fail the request to client.
> > > > > > > >
> > > > > > > > My feeling here is your usage for HBase is a bit different
> from
> > > how
> > > > > we
> > > > > > > use
> > > > > > > > DL in Manhattan. It sounds like you read from a source (HBase
> > > WAL)
> > > > > and
> > > > > > > > write to DL. But I might be wrong.
> > > > > > > >
> > > > > > > >
> > > > > > > >>
> > > > > > > >> >
> > > > > > > >> >
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> > > > Cameron:
> > > > > > > >> > > > Another thing we've discussed but haven't really
> thought
> > > > > > through -
> > > > > > > >> > > > We might be able to support some kind of epoch write
> > > > request,
> > > > > > > where
> > > > > > > >> the
> > > > > > > >> > > > epoch is guaranteed to have changed if the writer has
> > > > changed
> > > > > or
> > > > > > > the
> > > > > > > >> > > ledger
> > > > > > > >> > > > was ever fenced off. Writes include an epoch and are
> > > > rejected
> > > > > if
> > > > > > > the
> > > > > > > >> > > epoch
> > > > > > > >> > > > has changed.
> > > > > > > >> > > > With a mechanism like this, fencing the ledger off
> > after a
> > > > > > failure
> > > > > > > >> > would
> > > > > > > >> > > > ensure any pending writes had either been written or
> > would
> > > > be
> > > > > > > >> rejected.
> > > > > > > >> > >
> > > > > > > >> > > The issue would be how I guarantee the write I wrote to
> > the
> > > > > server
> > > > > > > was
> > > > > > > >> > > written. Since a network issue could happen on the send
> of
> > > the
> > > > > > > >> request,
> > > > > > > >> > or
> > > > > > > >> > > on the receive of the success response, an epoch
> wouldn't
> > > tell
> > > > > me
> > > > > > > if I
> > > > > > > >> > can
> > > > > > > >> > > successfully retry, as it could be successfully written
> > but
> > > > AWS
> > > > > > > >> dropped
> > > > > > > >> > the
> > > > > > > >> > > connection for the success response. Since the epoch
> would
> > > be
> > > > > the
> > > > > > > same
> > > > > > > >> > > (same ledger), I could write duplicates.
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> > > > We are currently proposing adding a transaction
> semantic
> > > to
> > > > dl
> > > > > > to
> > > > > > > >> get
> > > > > > > >> > rid
> > > > > > > >> > > > of the size limitation and the unaware-ness in the
> proxy
> > > > > client.
> > > > > > > >> Here
> > > > > > > >> > is
> > > > > > > >> > > > our idea -
> > > > > > > >> > > > http://mail-archives.apache.org/mod_mbox/incubator-
> > > > > > distributedlog
> > > > > > > >> > > -dev/201609.mbox/%3cCAAC6BxP5YyEHwG0ZCF5soh42X=xuYwYm
> > > > > > > >> > > <http://mail-archives.apache.org/mod_mbox/incubator-
> > > > > > > >> > distributedlog%0A-dev/201609.mbox/%
> > 3cCAAC6BxP5YyEHwG0ZCF5soh
> > > > > > > 42X=xuYwYm>
> > > > > > > >> > > L4nXsYBYiofzxpVk6g@mail.gmail.com%3e
> > > > > > > >> > >
> > > > > > > >> > > > I am not sure if your idea is similar to ours, but
> we'd
> > > like
> > > > > to
> > > > > > > >> > > collaborate
> > > > > > > >> > > > with the community if anyone has the similar idea.
> > > > > > > >> > >
> > > > > > > >> > > Our use case would be covered by transaction support,
> but
> > > I'm
> > > > > > unsure
> > > > > > > >> if
> > > > > > > >> > we
> > > > > > > >> > > would need something that heavy weight for the
> guarantees
> > we
> > > > > need.
> > > > > > > >> > >
> > > > > > > >> >
> > > > > > > >> > >
> > > > > > > >> > > Basically, the high level requirement here is "Support
> > > > > consistent
> > > > > > > >> write
> > > > > > > >> > > ordering for single-writer-per-key,
> multi-writer-per-log".
> > > My
> > > > > > hunch
> > > > > > > is
> > > > > > > >> > > that, with some added guarantees to the proxy (if it
> isn't
> > > > > already
> > > > > > > >> > > supported), and some custom client code on our side for
> > > > removing
> > > > > > the
> > > > > > > >> > > entries that actually succeed to write to DistributedLog
> > > from
> > > > > the
> > > > > > > >> request
> > > > > > > >> > > that failed, it should be a relatively easy thing to
> > > support.
> > > > > > > >> > >
> > > > > > > >> >
> > > > > > > >> > Yup. I think it should not be very difficult to support.
> > There
> > > > > might
> > > > > > > be
> > > > > > > >> > some changes in the server side.
> > > > > > > >> > Let's figure out what will the changes be. Are you guys
> > > > interested
> > > > > > in
> > > > > > > >> > contributing?
> > > > > > > >> >
> > > > > > > >> > Yes, we would be.
> > > > > > > >>
> > > > > > > >> As a note, the one thing that we see as an issue with the
> > client
> > > > > side
> > > > > > > >> dedupping is how to bound the range of data that needs to be
> > > > looked
> > > > > at
> > > > > > > for
> > > > > > > >> deduplication. As you can imagine, it is pretty easy to
> bound
> > > the
> > > > > > bottom
> > > > > > > >> of
> > > > > > > >> the range, as it is just regular checkpointing of the DLSN
> > > that
> > > > is
> > > > > > > >> returned. I'm still not sure if there is any nice way to
> time
> > > > bound
> > > > > > the
> > > > > > > >> top
> > > > > > > >> end of the range, especially since the proxy owns sequence
> > > numbers
> > > > > > > (which
> > > > > > > >> makes sense). I am curious if there is more that can be done
> > if
> > > > > > > >> deduplication is on the server side. However the main minus
> I
> > > see
> > > > of
> > > > > > > >> server
> > > > > > > >> side deduplication is that instead of running contingent on
> > > there
> > > > > > being
> > > > > > > a
> > > > > > > >> failed client request, instead it would have to run every
> > time a
> > > > > write
> > > > > > > >> happens.
> > > > > > > >
> > > > > > > >
> > > > > > > > For a reliable dedup, we probably need fence-then-getLastDLSN
> > > > > > operation -
> > > > > > > > so it would guarantee that any non-completed requests issued
> > > > > (lost-ack
> > > > > > > > requests) before this fence-then-getLastDLSN operation will
> be
> > > > failed
> > > > > > and
> > > > > > > > they will never land at the log.
> > > > > > > >
> > > > > > > > the pseudo code would look like below -
> > > > > > > >
> > > > > > > > write(request) onFailure { t =>
> > > > > > > >
> > > > > > > > if (t is timeout exception) {
> > > > > > > >
> > > > > > > > DLSN lastDLSN = fenceThenGetLastDLSN()
> > > > > > > > DLSN lastCheckpointedDLSN = ...;
> > > > > > > > // find if the request lands between [lastDLSN,
> > > > > lastCheckpointedDLSN].
> > > > > > > > // if it exists, the write succeed; otherwise retry.
> > > > > > > >
> > > > > > > > }
> > > > > > > >
> > > > > > > >
> > > > > > > > }
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Just realized the idea is the same as what Leigh raised in the
> > previous
> > > > > email
> > > > > > > about 'epoch write'. Let me explain more about this idea
> (Leigh,
> > > feel
> > > > > > free
> > > > > > > to jump in to fill up your idea).
> > > > > > >
> > > > > > > - when a log stream is owned, the proxy uses the last
> transaction
> > > id
> > > > as
> > > > > > the
> > > > > > > epoch
> > > > > > > - when a client connects (handshake with the proxy), it will
> get
> > > the
> > > > > > epoch
> > > > > > > for the stream.
> > > > > > > - the writes issued by this client will carry the epoch to the
> > > proxy.
> > > > > > > - add a new rpc - fenceThenGetLastDLSN - it would force the
> proxy
> > > to
> > > > > bump
> > > > > > > the epoch.
> > > > > > > - if fenceThenGetLastDLSN happened, all the outstanding writes
> > with
> > > > old
> > > > > > > epoch will be rejected with exceptions (e.g. EpochFenced).
> > > > > > > - The DLSN returned from fenceThenGetLastDLSN can be used as
> the
> > > > bound
> > > > > > for
> > > > > > > deduplications on failures.
> > > > > > >
> > > > > > > Cameron, does this sound like a solution to your use case?
> > > > > > >
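To make the proposed flow concrete, a rough client-side sketch follows. The epoch-carrying write, fenceThenGetLastDLSN and EpochFencedException come from the proposal above and do not exist in the current API; handshakeEpoch, saveCheckpoint, loadCheckpoint, scanForRecord and retry are placeholders for application code:

    // Proposal-only sketch - none of the epoch calls below exist yet.
    long epoch = handshakeEpoch(stream);                  // epoch obtained at handshake (proposed)
    try {
        DLSN dlsn = Await.result(client.write(stream, epoch, record)); // epoch-carrying write (proposed)
        saveCheckpoint(dlsn);                             // lower bound for any later dedup scan
    } catch (EpochFencedException fenced) {
        // The epoch was bumped, so this write was rejected and can safely be
        // re-driven once a fresh epoch has been handshaked.
        retry(record, handshakeEpoch(stream));
    } catch (Exception lostAck) {
        // Ambiguous outcome (e.g. timeout). Fence first so no in-flight write
        // with the old epoch can land later, then dedup by scanning the log.
        DLSN lastDLSN = Await.result(client.fenceThenGetLastDLSN(stream)); // proposed RPC
        DLSN lastCheckpointed = loadCheckpoint();
        boolean landed = scanForRecord(lastCheckpointed, lastDLSN, record); // scan (lastCheckpointed, lastDLSN]
        if (!landed) {
            retry(record, handshakeEpoch(stream));
        }
    }
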
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >>
> > > > > > > >> Maybe something that could fit a similar need that Kafka
> does
> > > (the
> > > > > > last
> > > > > > > >> store value for a particular key in a log), such that on a
> per
> > > key
> > > > > > basis
> > > > > > > >> there could be a sequence number that supports deduplication?
> > > Cost
> > > > > > seems
> > > > > > > >> like it would be high however, and I'm not even sure if
> > > bookkeeper
> > > > > > > >> supports
> > > > > > > >> it.
> > > > > > > >
> > > > > > > >
> > > > > > > >> Cheers,
> > > > > > > >> Cameron
> > > > > > > >>
> > > > > > > >> >
> > > > > > > >> > >
> > > > > > > >> > > Thanks,
> > > > > > > >> > > Cameron
> > > > > > > >> > >
> > > > > > > >> > >
> > > > > > > >> > > On Sat, Oct 8, 2016 at 7:35 AM, Leigh Stewart
> > > > > > > >> > <lstewart@twitter.com.invalid
> > > > > > > >> > > >
> > > > > > > >> > > wrote:
> > > > > > > >> > >
> > > > > > > >> > > > Cameron:
> > > > > > > >> > > > Another thing we've discussed but haven't really
> thought
> > > > > > through -
> > > > > > > >> > > > We might be able to support some kind of epoch write
> > > > request,
> > > > > > > where
> > > > > > > >> the
> > > > > > > >> > > > epoch is guaranteed to have changed if the writer has
> > > > changed
> > > > > or
> > > > > > > the
> > > > > > > >> > > ledger
> > > > > > > >> > > > was ever fenced off. Writes include an epoch and are
> > > > rejected
> > > > > if
> > > > > > > the
> > > > > > > >> > > epoch
> > > > > > > >> > > > has changed.
> > > > > > > >> > > > With a mechanism like this, fencing the ledger off
> > after a
> > > > > > failure
> > > > > > > >> > would
> > > > > > > >> > > > ensure any pending writes had either been written or
> > would
> > > > be
> > > > > > > >> rejected.
> > > > > > > >> > > >
> > > > > > > >> > > >
> > > > > > > >> > > > On Sat, Oct 8, 2016 at 7:10 AM, Sijie Guo <
> > > sijie@apache.org
> > > > >
> > > > > > > wrote:
> > > > > > > >> > > >
> > > > > > > >> > > > > Cameron,
> > > > > > > >> > > > >
> > > > > > > >> > > > > I think both Leigh and Xi had made a few good points
> > > about
> > > > > > your
> > > > > > > >> > > question.
> > > > > > > >> > > > >
> > > > > > > >> > > > > To add one more point to your question - "but I am
> not
> > > > > > > >> > > > > 100% of how all of the futures in the code handle
> > > > failures.
> > > > > > > >> > > > > If not, where in the code would be the relevant
> places
> > > to
> > > > > add
> > > > > > > the
> > > > > > > >> > > ability
> > > > > > > >> > > > > to do this, and would the project be interested in a
> > > pull
> > > > > > > >> request?"
> > > > > > > >> > > > >
> > > > > > > >> > > > > The current proxy and client logic doesn't do
> > perfectly
> > > on
> > > > > > > >> handling
> > > > > > > >> > > > > failures (duplicates) - the strategy now is the
> client
> > > > will
> > > > > > > retry
> > > > > > > >> as
> > > > > > > >> > > best
> > > > > > > >> > > > > at it can before throwing exceptions to users. The
> > code
> > > > you
> > > > > > are
> > > > > > > >> > looking
> > > > > > > >> > > > for
> > > > > > > >> > > > > - it is on BKLogSegmentWriter for the proxy handling
> > > > writes
> > > > > > and
> > > > > > > >> it is
> > > > > > > >> > > on
> > > > > > > >> > > > > DistributedLogClientImpl for the proxy client
> handling
> > > > > > responses
> > > > > > > >> from
> > > > > > > >> > > > > proxies. Does this help you?
> > > > > > > >> > > > >
> > > > > > > >> > > > > And also, you are welcome to contribute the pull
> > > requests.
> > > > > > > >> > > > >
> > > > > > > >> > > > > - Sijie
> > > > > > > >> > > > >
> > > > > > > >> > > > >
> > > > > > > >> > > > >
> > > > > > > >> > > > > On Tue, Oct 4, 2016 at 3:39 PM, Cameron Hatfield <
> > > > > > > >> kinguy@gmail.com>
> > > > > > > >> > > > wrote:
> > > > > > > >> > > > >
> > > > > > > >> > > > > > I have a question about the Proxy Client.
> Basically,
> > > for
> > > > > our
> > > > > > > use
> > > > > > > >> > > cases,
> > > > > > > >> > > > > we
> > > > > > > >> > > > > > want to guarantee ordering at the key level,
> > > > irrespective
> > > > > of
> > > > > > > the
> > > > > > > >> > > > ordering
> > > > > > > >> > > > > > of the partition it may be assigned to as a whole.
> > Due
> > > > to
> > > > > > the
> > > > > > > >> > source
> > > > > > > >> > > of
> > > > > > > >> > > > > the
> > > > > > > >> > > > > > data (HBase Replication), we cannot guarantee
> that a
> > > > > single
> > > > > > > >> > partition
> > > > > > > >> > > > > will
> > > > > > > >> > > > > > be owned for writes by the same client. This means
> > the
> > > > > proxy
> > > > > > > >> client
> > > > > > > >> > > > works
> > > > > > > >> > > > > > well (since we don't care which proxy owns the
> > > partition
> > > > > we
> > > > > > > are
> > > > > > > >> > > writing
> > > > > > > >> > > > > > to).
> > > > > > > >> > > > > >
> > > > > > > >> > > > > >
> > > > > > > >> > > > > > However, the guarantees we need when writing a
> batch
> > > > > > consists
> > > > > > > >> of:
> > > > > > > >> > > > > > Definition of a Batch: The set of records sent to
> > the
> > > > > > > writeBatch
> > > > > > > >> > > > endpoint
> > > > > > > >> > > > > > on the proxy
> > > > > > > >> > > > > >
> > > > > > > >> > > > > > 1. Batch success: If the client receives a success
> > > from
> > > > > the
> > > > > > > >> proxy,
> > > > > > > >> > > then
> > > > > > > >> > > > > > that batch is successfully written
> > > > > > > >> > > > > >
> > > > > > > >> > > > > > 2. Inter-Batch ordering : Once a batch has been
> > > written
> > > > > > > >> > successfully
> > > > > > > >> > > by
> > > > > > > >> > > > > the
> > > > > > > >> > > > > > client, when another batch is written, it will be
> > > > > guaranteed
> > > > > > > to
> > > > > > > >> be
> > > > > > > >> > > > > ordered
> > > > > > > >> > > > > > after the last batch (if it is the same stream).
> > > > > > > >> > > > > >
> > > > > > > >> > > > > > 3. Intra-Batch ordering: Within a batch of writes,
> > the
> > > > > > records
> > > > > > > >> will
> > > > > > > >> > > be
> > > > > > > >> > > > > > committed in order
> > > > > > > >> > > > > >
> > > > > > > >> > > > > > 4. Intra-Batch failure ordering: If an individual
> > > record
> > > > > > fails
> > > > > > > >> to
> > > > > > > >> > > write
> > > > > > > >> > > > > > within a batch, all records after that record will
> > not
> > > > be
> > > > > > > >> written.
> > > > > > > >> > > > > >
> > > > > > > >> > > > > > 5. Batch Commit: Guarantee that if a batch
> returns a
> > > > > > success,
> > > > > > > it
> > > > > > > >> > will
> > > > > > > >> > > > be
> > > > > > > >> > > > > > written
> > > > > > > >> > > > > >
> > > > > > > >> > > > > > 6. Read-after-write: Once a batch is committed,
> > > within a
> > > > > > > limited
> > > > > > > >> > > > > time-frame
> > > > > > > >> > > > > > it will be able to be read. This is required in
> the
> > > case
> > > > > of
> > > > > > > >> > failure,
> > > > > > > >> > > so
> > > > > > > >> > > > > > that the client can see what actually got
> > committed. I
> > > > > > believe
> > > > > > > >> the
> > > > > > > >> > > > > > time-frame part could be removed if the client can
> > > send
> > > > in
> > > > > > the
> > > > > > > >> same
> > > > > > > >> > > > > > sequence number that was written previously, since
> > it
> > > > > would
> > > > > > > then
> > > > > > > >> > fail
> > > > > > > >> > > > and
> > > > > > > >> > > > > > we would know that a read needs to occur.
> > > > > > > >> > > > > >
> > > > > > > >> > > > > >
> > > > > > > >> > > > > > So, my basic question is if this is currently
> > possible
> > > > in
> > > > > > the
> > > > > > > >> > proxy?
> > > > > > > >> > > I
> > > > > > > >> > > > > > don't believe it gives these guarantees as it
> stands
> > > > > today,
> > > > > > > but
> > > > > > > >> I
> > > > > > > >> > am
> > > > > > > >> > > > not
> > > > > > > >> > > > > > 100% of how all of the futures in the code handle
> > > > > failures.
> > > > > > > >> > > > > > If not, where in the code would be the relevant
> > places
> > > > to
> > > > > > add
> > > > > > > >> the
> > > > > > > >> > > > ability
> > > > > > > >> > > > > > to do this, and would the project be interested
> in a
> > > > pull
> > > > > > > >> request?
> > > > > > > >> > > > > >
> > > > > > > >> > > > > >
> > > > > > > >> > > > > > Thanks,
> > > > > > > >> > > > > > Cameron
> > > > > > > >> > > > > >
> > > > > > > >> > > > >
> > > > > > > >> > > >
> > > > > > > >> > >
> > > > > > > >> >
> > > > > > > >>
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Proxy Client - Batch Ordering / Commit

Posted by Cameron Hatfield <ki...@gmail.com>.
Sorry, I've been on vacation for the past week, and heads down for a
release that is using DL at the end of Nov. I'll take a look at this over
the next week, and add any relevant comments. After we are finished with
dev for this release, I am hoping to tackle this next.

-Cameron

On Fri, Nov 11, 2016 at 12:07 PM, Sijie Guo <si...@apache.org> wrote:

> Xi,
>
> Thank you so much for your proposal. I took a look. It looks fine to me.
> Cameron, do you have any comments?
>
> Look forward to your pull requests.
>
> - Sijie
>
>
> On Wed, Nov 9, 2016 at 2:34 AM, Xi Liu <xi...@gmail.com> wrote:
>
> > Cameron,
> >
> > Have you started any work for this? I just updated the proposal page -
> > https://cwiki.apache.org/confluence/display/DL/DP-2+-+
> Epoch+Write+Support
> > Maybe we can work together with this.
> >
> > Sijie, Leigh,
> >
> > can you guys help review this to make sure our proposal is in the right
> > direction?
> >
> > - Xi
> >
> > On Tue, Nov 1, 2016 at 3:05 AM, Sijie Guo <si...@apache.org> wrote:
> >
> > > I created https://issues.apache.org/jira/browse/DL-63 for tracking the
> > > proposed idea here.
> > >
> > >
> > >
> > > On Wed, Oct 26, 2016 at 4:53 PM, Sijie Guo <sijieg@twitter.com.invalid
> >
> > > wrote:
> > >
> > > > On Tue, Oct 25, 2016 at 11:30 AM, Cameron Hatfield <kinguy@gmail.com
> >
> > > > wrote:
> > > >
> > > > > Yes, we are reading the HBase WAL (from their replication plugin
> > > > support),
> > > > > and writing that into DL.
> > > > >
> > > >
> > > > Gotcha.
> > > >
> > > >
> > > > >
> > > > > From the sounds of it, yes, it would. Only thing I would say is
> make
> > > the
> > > > > epoch requirement optional, so that if a client doesn't care about
> > > dupes
> > > > > they don't have to deal with the process of getting a new epoch.
> > > > >
> > > >
> > > > Yup. This should be optional. I can start a wiki page on how we want
> to
> > > > implement this. Are you interested in contributing to this?
> > > >
> > > >
> > > > >
> > > > > -Cameron
> > > > >
> > > > > On Wed, Oct 19, 2016 at 7:43 PM, Sijie Guo
> > <sijieg@twitter.com.invalid
> > > >
> > > > > wrote:
> > > > >
> > > > > > On Wed, Oct 19, 2016 at 7:17 PM, Sijie Guo <si...@twitter.com>
> > > wrote:
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Monday, October 17, 2016, Cameron Hatfield <
> kinguy@gmail.com>
> > > > > wrote:
> > > > > > >
> > > > > > >> Answer inline:
> > > > > > >>
> > > > > > >> On Mon, Oct 17, 2016 at 11:46 AM, Sijie Guo <sijie@apache.org
> >
> > > > wrote:
> > > > > > >>
> > > > > > >> > Cameron,
> > > > > > >> >
> > > > > > >> > Thank you for your summary. I liked the discussion here. I
> > also
> > > > > liked
> > > > > > >> the
> > > > > > >> > summary of your requirement - 'single-writer-per-key,
> > > > > > >> > multiple-writers-per-log'. If I understand correctly, the
> core
> > > > > concern
> > > > > > >> here
> > > > > > >> > is almost 'exact-once' write (or a way to explicit tell if a
> > > write
> > > > > can
> > > > > > >> be
> > > > > > >> > retried or not).
> > > > > > >> >
> > > > > > >> > Comments inline.
> > > > > > >> >
> > > > > > >> > On Fri, Oct 14, 2016 at 11:17 AM, Cameron Hatfield <
> > > > > kinguy@gmail.com>
> > > > > > >> > wrote:
> > > > > > >> >
> > > > > > >> > > > Ah- yes good point (to be clear we're not using the
> proxy
> > > this
> > > > > way
> > > > > > >> > > today).
> > > > > > >> > >
> > > > > > >> > > > > Due to the source of the
> > > > > > >> > > > > data (HBase Replication), we cannot guarantee that a
> > > single
> > > > > > >> partition
> > > > > > >> > > will
> > > > > > >> > > > > be owned for writes by the same client.
> > > > > > >> > >
> > > > > > >> > > > Do you mean you *need* to support multiple writers
> issuing
> > > > > > >> interleaved
> > > > > > >> > > > writes or is it just that they might sometimes
> interleave
> > > > writes
> > > > > > and
> > > > > > >> > you
> > > > > > >> > > >don't care?
> > > > > > >> > > How HBase partitions the keys being written wouldn't have
> a
> > > > > one->one
> > > > > > >> > > mapping with the partitions we would have in HBase. Even
> if
> > we
> > > > did
> > > > > > >> have
> > > > > > >> > > that alignment when the cluster first started, HBase will
> > > > > rebalance
> > > > > > >> what
> > > > > > >> > > servers own what partitions, as well as split and merge
> > > > partitions
> > > > > > >> that
> > > > > > >> > > already exist, causing eventual drift from one log per
> > > > partition.
> > > > > > >> > > Because we want ordering guarantees per key (row in
> hbase),
> > we
> > > > > > >> partition
> > > > > > >> > > the logs by the key. Since multiple writers are possible
> per
> > > > range
> > > > > > of
> > > > > > >> > keys
> > > > > > >> > > (due to the aforementioned rebalancing / splitting / etc
> of
> > > > > hbase),
> > > > > > we
> > > > > > >> > > cannot use the core library due to requiring a single
> writer
> > > for
> > > > > > >> > ordering.
> > > > > > >> > >
> > > > > > >> > > But, for a single log, we don't really care about ordering
> > > aside
> > > > > > from
> > > > > > >> at
> > > > > > >> > > the per-key level. So all we really need to be able to
> > handle
> > > is
> > > > > > >> > preventing
> > > > > > >> > > duplicates when a failure occurs, and ordering consistency
> > > > across
> > > > > > >> > requests
> > > > > > >> > > from a single client.
> > > > > > >> > >
> > > > > > >> > > So our general requirements are:
> > > > > > >> > > Write A, Write B
> > > > > > >> > > Timeline: A -> B
> > > > > > >> > > Request B is only made after A has successfully returned
> > > > (possibly
> > > > > > >> after
> > > > > > >> > > retries)
> > > > > > >> > >
> > > > > > >> > > 1) If the write succeeds, it will be durably exposed to
> > > clients
> > > > > > within
> > > > > > >> > some
> > > > > > >> > > bounded time frame
> > > > > > >> > >
> > > > > > >> >
> > > > > > >> > Guaranteed.
> > > > > > >> >
> > > > > > >>
> > > > > > >> > > 2) If A succeeds and B succeeds, the ordering for the log
> > will
> > > > be
> > > > > A
> > > > > > >> and
> > > > > > >> > > then B
> > > > > > >> > >
> > > > > > >> >
> > > > > > >> > If I understand correctly here, B is only sent after A is
> > > > returned,
> > > > > > >> right?
> > > > > > >> > If that's the case, It is guaranteed.
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >> >
> > > > > > >> >
> > > > > > >> >
> > > > > > >> > > 3) If A fails due to an error that can be relied on to
> *not*
> > > be
> > > > a
> > > > > > lost
> > > > > > >> > ack
> > > > > > >> > > problem, it will never be exposed to the client, so it may
> > > > > > (depending
> > > > > > >> on
> > > > > > >> > > the error) be retried immediately
> > > > > > >> > >
> > > > > > >> >
> > > > > > >> > If it is not a lost-ack problem, the entry will be exposed.
> it
> > > is
> > > > > > >> > guaranteed.
> > > > > > >>
> > > > > > >> Let me try rephrasing the questions, to make sure I'm
> > > understanding
> > > > > > >> correctly:
> > > > > > >> If A fails, with an error such as "Unable to create connection
> > to
> > > > > > >> bookkeeper server", that would be the type of error we would
> > > expect
> > > > to
> > > > > > be
> > > > > > >> able to retry immediately, as that result means no action was
> > > taken
> > > > on
> > > > > > any
> > > > > > >> log / etc, so no entry could have been created. This is
> > different
> > > > > then a
> > > > > > >> "Connection Timeout" exception, as we just might not have
> > gotten a
> > > > > > >> response
> > > > > > >> in time.
> > > > > > >>
> > > > > > >>
> > > > > > > Gotcha.
> > > > > > >
> > > > > > > The response code returned from proxy can tell if a failure can
> > be
> > > > > > retried
> > > > > > > safely or not. (We might need to make them well documented)
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >>
> > > > > > >> >
> > > > > > >> >
> > > > > > >> > > 4) If A fails due to an error that could be a lost ack
> > problem
> > > > > > >> (network
> > > > > > >> > > connectivity / etc), within a bounded time frame it should
> > be
> > > > > > >> possible to
> > > > > > >> > > find out if the write succeed or failed. Either by reading
> > > from
> > > > > some
> > > > > > >> > > checkpoint of the log for the changes that should have
> been
> > > made
> > > > > or
> > > > > > >> some
> > > > > > >> > > other possible server-side support.
> > > > > > >> > >
> > > > > > >> >
> > > > > > >> > If I understand this correctly, it is a duplication issue,
> > > right?
> > > > > > >> >
> > > > > > >> > Can a de-duplication solution work here? Either DL or your
> > > client
> > > > > does
> > > > > > >> the
> > > > > > >> > de-duplication?
> > > > > > >> >
> > > > > > >>
> > > > > > >> The requirements I'm mentioning are the ones needed for
> > > client-side
> > > > > > >> dedupping. Since if I can guarantee writes being exposed
> within
> > > some
> > > > > > time
> > > > > > >> frame, and I can never get into an inconsistently ordered
> state
> > > when
> > > > > > >> successes happen, when an error occurs, I can always wait for
> > max
> > > > time
> > > > > > >> frame, read the latest writes, and then dedup locally against
> > the
> > > > > > request
> > > > > > >> I
> > > > > > >> just made.
> > > > > > >>
> > > > > > >> The main thing about that timeframe is that its basically the
> > > > addition
> > > > > > of
> > > > > > >> every timeout, all the way down in the system, combined with
> > > > whatever
> > > > > > >> flushing / caching / etc times are at the bookkeeper / client
> > > level
> > > > > for
> > > > > > >> when values are exposed
> > > > > > >
> > > > > > >
> > > > > > > Gotcha.
> > > > > > >
> > > > > > >>
> > > > > > >>
> > > > > > >> >
> > > > > > >> > Is there any ways to identify your write?
> > > > > > >> >
> > > > > > >> > I can think of a case as follow - I want to know what is
> your
> > > > > expected
> > > > > > >> > behavior from the log.
> > > > > > >> >
> > > > > > >> > a)
> > > > > > >> >
> > > > > > >> > If a hbase region server A writes a change of key K to the
> > log,
> > > > the
> > > > > > >> change
> > > > > > >> > is successfully made to the log;
> > > > > > >> > but server A is down before receiving the change.
> > > > > > >> > region server B took over the region that contains K, what
> > will
> > > B
> > > > > do?
> > > > > > >> >
> > > > > > >>
> > > > > > >> HBase writes in large chunks (WAL Logs), which its replication
> > > > system
> > > > > > then
> > > > > > >> handles by replaying in the case of failure. If I'm in a
> middle
> > > of a
> > > > > > log,
> > > > > > >> and the whole region goes down and gets rescheduled
> elsewhere, I
> > > > will
> > > > > > >> start
> > > > > > >> back up from the beginning of the log I was in the middle of.
> > > Using
> > > > > > >> checkpointing + deduping, we should be able to find out where
> we
> > > > left
> > > > > > off
> > > > > > >> in the log.
> > > > > > >
> > > > > > >
> > > > > > >> >
> > > > > > >> >
> > > > > > >> > b) same as a). but server A was just network partitioned.
> will
> > > > both
> > > > > A
> > > > > > >> and B
> > > > > > >> > write the change of key K?
> > > > > > >> >
> > > > > > >>
> > > > > > >> HBase gives us some guarantees around network partitions
> > > > (Consistency
> > > > > > over
> > > > > > >> availability for HBase). HBase is a single-master failover
> > > recovery
> > > > > type
> > > > > > >> of
> > > > > > >> system, with zookeeper-based guarantees for single owners
> > > (writers)
> > > > > of a
> > > > > > >> range of data.
> > > > > > >>
> > > > > > >> >
> > > > > > >> >
> > > > > > >> >
> > > > > > >> > > 5) If A is turned into multiple batches (one large request
> > > gets
> > > > > > split
> > > > > > >> > into
> > > > > > >> > > multiple smaller ones to the bookkeeper backend, due to
> log
> > > > > rolling
> > > > > > /
> > > > > > >> > size
> > > > > > >> > > / etc):
> > > > > > >> > >   a) The ordering of entries within batches have ordering
> > > > > > consistence
> > > > > > >> > with
> > > > > > >> > > the original request, when exposed in the log (though they
> > may
> > > > be
> > > > > > >> > > interleaved with other requests)
> > > > > > >> > >   b) The ordering across batches have ordering consistence
> > > with
> > > > > the
> > > > > > >> > > original request, when exposed in the log (though they may
> > be
> > > > > > >> interleaved
> > > > > > >> > > with other requests)
> > > > > > >> > >   c) If a batch fails, and cannot be retried / is
> > > unsuccessfully
> > > > > > >> retried,
> > > > > > >> > > all batches after the failed batch should not be exposed
> in
> > > the
> > > > > log.
> > > > > > >> > Note:
> > > > > > >> > > The batches before and including the failed batch, that
> > ended
> > > up
> > > > > > >> > > succeeding, can show up in the log, again within some
> > bounded
> > > > time
> > > > > > >> range
> > > > > > >> > > for reads by a client.
> > > > > > >> > >
> > > > > > >> >
> > > > > > >> > There is a method 'writeBulk' in DistributedLogClient that can
> > > achieve
> > > > > this
> > > > > > >> > guarantee.
> > > > > > >> >
> > > > > > >> > However, I am not very sure about how will you turn A into
> > > > batches.
> > > > > If
> > > > > > >> you
> > > > > > >> > are dividing A into batches,
> > > > > > >> > you can simply control the application write sequence to
> > achieve
> > > > the
> > > > > > >> > guarantee here.
> > > > > > >> >
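For reference, a small sketch of leaning on writeBulk for the per-key ordering discussed here, assuming the proxy client's writeBulk(stream, List<ByteBuffer>) signature; streamForKey, encodeEdits and checkpoint are placeholders for application code:

    // All edits for a given row key go to the same stream, in order; the next
    // batch for that stream is only sent once this one is fully acknowledged.
    List<ByteBuffer> edits = encodeEdits(editsForKeyRange);
    List<Future<DLSN>> acks = client.writeBulk(streamForKey(rowKey), edits);
    for (Future<DLSN> ack : acks) {
        DLSN dlsn = Await.result(ack);  // per the guarantee above, records after a
                                        // failed record in the batch are not written
        checkpoint(dlsn);               // lower bound for dedup if a later write times out
    }
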
> > > > > > >> > Can you explain more about this?
> > > > > > >> >
> > > > > > >>
> > > > > > >> In this case, by batches I mean what the proxy does with the
> > > single
> > > > > > >> request
> > > > > > >> that I send it. If the proxy decides it needs to turn my
> single
> > > > > request
> > > > > > >> into multiple batches of requests, due to log rolling, size
> > > > > limitations,
> > > > > > >> etc, those would be the guarantees I need to be able to
> > > deduplicate
> > > > on
> > > > > > the
> > > > > > >> client side.
> > > > > > >
> > > > > > >
> > > > > > > A single record written by #write and A record set (set of
> > records)
> > > > > > > written by #writeRecordSet are atomic - they will not be broken
> > > down
> > > > > into
> > > > > > > entries (batches). With the correct response code, you would be
> > > able
> > > > to
> > > > > > > tell if it is a lost-ack failure or not. However there is a
> size
> > > > > > limitation
> > > > > > > for this - it cannot go beyond 1MB for the current
> implementation.
> > > > > > >
> > > > > > > What is your expected record size?
> > > > > > >
> > > > > > >
> > > > > > >>
> > > > > > >> >
> > > > > > >> >
> > > > > > >> > >
> > > > > > >> > > Since we can guarantee per-key ordering on the client
> side,
> > we
> > > > > > >> guarantee
> > > > > > >> > > that there is a single writer per-key, just not per log.
> > > > > > >> >
> > > > > > >> >
> > > > > > >> > Do you need fencing guarantee in the case of network
> partition
> > > > > causing
> > > > > > >> > two-writers?
> > > > > > >> >
> > > > > > >> >
> > > > > > >> > > So if there was a
> > > > > > >> > > way to guarantee a single write request as being written
> or
> > > not,
> > > > > > >> within a
> > > > > > >> > > certain time frame (since failures should be rare anyways,
> > > this
> > > > is
> > > > > > >> fine
> > > > > > >> > if
> > > > > > >> > > it is expensive), we can then have the client guarantee
> the
> > > > > ordering
> > > > > > >> it
> > > > > > >> > > needs.
> > > > > > >> > >
> > > > > > >> >
> > > > > >> > This sounds like an 'exact-once' write (regarding retries)
> > > requirement
> > > > to
> > > > > > me,
> > > > > > >> > right?
> > > > > > >> >
> > > > > >> Yes. I'm curious how this issue is handled by Manhattan,
> > since
> > > > you
> > > > > > can
> > > > > > >> imagine a data store that ends up getting multiple writes for
> > the
> > > > same
> > > > > > put
> > > > > > >> / get / etc, would be harder to use, and we are basically
> trying
> > > to
> > > > > > create
> > > > > > >> a log like that for HBase.
> > > > > > >
> > > > > > >
> > > > > > > Are you guys replacing HBase WAL?
> > > > > > >
> > > > > > > In Manhattan case, the request will be first written to DL
> > streams
> > > by
> > > > > > > Manhattan coordinator. The Manhattan replica then will read
> from
> > > the
> > > > DL
> > > > > > > streams and apply the change. In the lost-ack case, the MH
> > > > coordinator
> > > > > > will
> > > > > > > just fail the request to client.
> > > > > > >
> > > > > > > My feeling here is your usage for HBase is a bit different from
> > how
> > > > we
> > > > > > use
> > > > > > > DL in Manhattan. It sounds like you read from a source (HBase
> > WAL)
> > > > and
> > > > > > > write to DL. But I might be wrong.
> > > > > > >
> > > > > > >
> > > > > > >>
> > > > > > >> >
> > > > > > >> >
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > > > Cameron:
> > > > > > >> > > > Another thing we've discussed but haven't really thought
> > > > > through -
> > > > > > >> > > > We might be able to support some kind of epoch write
> > > request,
> > > > > > where
> > > > > > >> the
> > > > > > >> > > > epoch is guaranteed to have changed if the writer has
> > > changed
> > > > or
> > > > > > the
> > > > > > >> > > ledger
> > > > > > >> > > > was ever fenced off. Writes include an epoch and are
> > > rejected
> > > > if
> > > > > > the
> > > > > > >> > > epoch
> > > > > > >> > > > has changed.
> > > > > > >> > > > With a mechanism like this, fencing the ledger off
> after a
> > > > > failure
> > > > > > >> > would
> > > > > > >> > > > ensure any pending writes had either been written or
> would
> > > be
> > > > > > >> rejected.
> > > > > > >> > >
> > > > > > >> > > The issue would be how I guarantee the write I wrote to
> the
> > > > server
> > > > > > was
> > > > > > >> > > written. Since a network issue could happen on the send of
> > the
> > > > > > >> request,
> > > > > > >> > or
> > > > > > >> > > on the receive of the success response, an epoch wouldn't
> > tell
> > > > me
> > > > > > if I
> > > > > > >> > can
> > > > > > >> > > successfully retry, as it could be successfully written
> but
> > > AWS
> > > > > > >> dropped
> > > > > > >> > the
> > > > > > >> > > connection for the success response. Since the epoch would
> > be
> > > > the
> > > > > > same
> > > > > > >> > > (same ledger), I could write duplicates.
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > > > We are currently proposing adding a transaction semantic
> > to
> > > dl
> > > > > to
> > > > > > >> get
> > > > > > >> > rid
> > > > > > >> > > > of the size limitation and the unaware-ness in the proxy
> > > > client.
> > > > > > >> Here
> > > > > > >> > is
> > > > > > >> > > > our idea -
> > > > > > >> > > > http://mail-archives.apache.org/mod_mbox/incubator-
> > > > > distributedlog
> > > > > > >> > > -dev/201609.mbox/%3cCAAC6BxP5YyEHwG0ZCF5soh42X=xuYwYm
> > > > > > >> > > <http://mail-archives.apache.org/mod_mbox/incubator-
> > > > > > >> > distributedlog%0A-dev/201609.mbox/%
> 3cCAAC6BxP5YyEHwG0ZCF5soh
> > > > > > 42X=xuYwYm>
> > > > > > >> > > L4nXsYBYiofzxpVk6g@mail.gmail.com%3e
> > > > > > >> > >
> > > > > >> > > > I am not sure if your idea is similar to ours, but we'd
> > like
> > > > to
> > > > > > >> > > collaborate
> > > > > > >> > > > with the community if anyone has the similar idea.
> > > > > > >> > >
> > > > > > >> > > Our use case would be covered by transaction support, but
> > I'm
> > > > > unsure
> > > > > > >> if
> > > > > > >> > we
> > > > > > >> > > would need something that heavy weight for the guarantees
> we
> > > > need.
> > > > > > >> > >
> > > > > > >> >
> > > > > > >> > >
> > > > > > >> > > Basically, the high level requirement here is "Support
> > > > consistent
> > > > > > >> write
> > > > > > >> > > ordering for single-writer-per-key, multi-writer-per-log".
> > My
> > > > > hunch
> > > > > > is
> > > > > > >> > > that, with some added guarantees to the proxy (if it isn't
> > > > already
> > > > > > >> > > supported), and some custom client code on our side for
> > > removing
> > > > > the
> > > > > > >> > > entries that actually succeed to write to DistributedLog
> > from
> > > > the
> > > > > > >> request
> > > > > > >> > > that failed, it should be a relatively easy thing to
> > support.
> > > > > > >> > >
> > > > > > >> >
> > > > > > >> > Yup. I think it should not be very difficult to support.
> There
> > > > might
> > > > > > be
> > > > > > >> > some changes in the server side.
> > > > > > >> > Let's figure out what will the changes be. Are you guys
> > > interested
> > > > > in
> > > > > > >> > contributing?
> > > > > > >> >
> > > > > > >> > Yes, we would be.
> > > > > > >>
> > > > > > >> As a note, the one thing that we see as an issue with the
> client
> > > > side
> > > > > > >> dedupping is how to bound the range of data that needs to be
> > > looked
> > > > at
> > > > > > for
> > > > > > >> deduplication. As you can imagine, it is pretty easy to bound
> > the
> > > > > bottom
> > > > > > >> of
> > > > > >> the range, as it is just regular checkpointing of the DLSN
> > that
> > > is
> > > > > > >> returned. I'm still not sure if there is any nice way to time
> > > bound
> > > > > the
> > > > > > >> top
> > > > > > >> end of the range, especially since the proxy owns sequence
> > numbers
> > > > > > (which
> > > > > > >> makes sense). I am curious if there is more that can be done
> if
> > > > > > >> deduplication is on the server side. However the main minus I
> > see
> > > of
> > > > > > >> server
> > > > > > >> side deduplication is that instead of running contingent on
> > there
> > > > > being
> > > > > > a
> > > > > > >> failed client request, instead it would have to run every
> time a
> > > > write
> > > > > > >> happens.
> > > > > > >
> > > > > > >
> > > > > > > For a reliable dedup, we probably need fence-then-getLastDLSN
> > > > > operation -
> > > > > > > so it would guarantee that any non-completed requests issued
> > > > (lost-ack
> > > > > > > requests) before this fence-then-getLastDLSN operation will be
> > > failed
> > > > > and
> > > > > > > they will never land at the log.
> > > > > > >
> > > > > > > the pseudo code would look like below -
> > > > > > >
> > > > > > > write(request) onFailure { t =>
> > > > > > >
> > > > > > > if (t is timeout exception) {
> > > > > > >
> > > > > > > DLSN lastDLSN = fenceThenGetLastDLSN()
> > > > > > > DLSN lastCheckpointedDLSN = ...;
> > > > > > > // find if the request lands between [lastDLSN,
> > > > lastCheckpointedDLSN].
> > > > > > > // if it exists, the write succeed; otherwise retry.
> > > > > > >
> > > > > > > }
> > > > > > >
> > > > > > >
> > > > > > > }
> > > > > > >
> > > > > >
> > > > > >
> > > > > > Just realized the idea is the same as what Leigh raised in the
> previous
> > > > email
> > > > > > about 'epoch write'. Let me explain more about this idea (Leigh,
> > feel
> > > > > free
> > > > > > to jump in to fill up your idea).
> > > > > >
> > > > > > - when a log stream is owned, the proxy uses the last transaction
> > id
> > > as
> > > > > the
> > > > > > epoch
> > > > > > - when a client connects (handshake with the proxy), it will get
> > the
> > > > > epoch
> > > > > > for the stream.
> > > > > > - the writes issued by this client will carry the epoch to the
> > proxy.
> > > > > > - add a new rpc - fenceThenGetLastDLSN - it would force the proxy
> > to
> > > > bump
> > > > > > the epoch.
> > > > > > - if fenceThenGetLastDLSN happened, all the outstanding writes
> with
> > > old
> > > > > > epoch will be rejected with exceptions (e.g. EpochFenced).
> > > > > > - The DLSN returned from fenceThenGetLastDLSN can be used as the
> > > bound
> > > > > for
> > > > > > deduplications on failures.
> > > > > >
> > > > > > Cameron, does this sound like a solution to your use case?
> > > > > >
> > > > > >
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > >>
> > > > > > >> Maybe something that could fit a similar need that Kafka does
> > (the
> > > > > last
> > > > > > >> store value for a particular key in a log), such that on a per
> > key
> > > > > basis
> > > > > >> there could be a sequence number that supports deduplication?
> > Cost
> > > > > seems
> > > > > > >> like it would be high however, and I'm not even sure if
> > bookkeeper
> > > > > > >> supports
> > > > > > >> it.
> > > > > > >
> > > > > > >
> > > > > > >> Cheers,
> > > > > > >> Cameron
> > > > > > >>
> > > > > > >> >
> > > > > > >> > >
> > > > > > >> > > Thanks,
> > > > > > >> > > Cameron
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > > On Sat, Oct 8, 2016 at 7:35 AM, Leigh Stewart
> > > > > > >> > <lstewart@twitter.com.invalid
> > > > > > >> > > >
> > > > > > >> > > wrote:
> > > > > > >> > >
> > > > > > >> > > > Cameron:
> > > > > > >> > > > Another thing we've discussed but haven't really thought
> > > > > through -
> > > > > > >> > > > We might be able to support some kind of epoch write
> > > request,
> > > > > > where
> > > > > > >> the
> > > > > > >> > > > epoch is guaranteed to have changed if the writer has
> > > changed
> > > > or
> > > > > > the
> > > > > > >> > > ledger
> > > > > > >> > > > was ever fenced off. Writes include an epoch and are
> > > rejected
> > > > if
> > > > > > the
> > > > > > >> > > epoch
> > > > > > >> > > > has changed.
> > > > > > >> > > > With a mechanism like this, fencing the ledger off
> after a
> > > > > failure
> > > > > > >> > would
> > > > > > >> > > > ensure any pending writes had either been written or
> would
> > > be
> > > > > > >> rejected.
> > > > > > >> > > >
> > > > > > >> > > >
> > > > > > >> > > > On Sat, Oct 8, 2016 at 7:10 AM, Sijie Guo <
> > sijie@apache.org
> > > >
> > > > > > wrote:
> > > > > > >> > > >
> > > > > > >> > > > > Cameron,
> > > > > > >> > > > >
> > > > > > >> > > > > I think both Leigh and Xi had made a few good points
> > about
> > > > > your
> > > > > > >> > > question.
> > > > > > >> > > > >
> > > > > > >> > > > > To add one more point to your question - "but I am not
> > > > > > >> > > > > 100% of how all of the futures in the code handle
> > > failures.
> > > > > > >> > > > > If not, where in the code would be the relevant places
> > to
> > > > add
> > > > > > the
> > > > > > >> > > ability
> > > > > > >> > > > > to do this, and would the project be interested in a
> > pull
> > > > > > >> request?"
> > > > > > >> > > > >
> > > > > > >> > > > > The current proxy and client logic doesn't do
> perfectly
> > on
> > > > > > >> handling
> > > > > > >> > > > > failures (duplicates) - the strategy now is the client
> > > will
> > > > > > retry
> > > > > > >> as
> > > > > > >> > > best
> > > > > > >> > > > > at it can before throwing exceptions to users. The
> code
> > > you
> > > > > are
> > > > > > >> > looking
> > > > > > >> > > > for
> > > > > > >> > > > > - it is on BKLogSegmentWriter for the proxy handling
> > > writes
> > > > > and
> > > > > > >> it is
> > > > > > >> > > on
> > > > > > >> > > > > DistributedLogClientImpl for the proxy client handling
> > > > > responses
> > > > > > >> from
> > > > > > >> > > > > proxies. Does this help you?
> > > > > > >> > > > >
> > > > > > >> > > > > And also, you are welcome to contribute the pull
> > requests.
> > > > > > >> > > > >
> > > > > > >> > > > > - Sijie
> > > > > > >> > > > >
> > > > > > >> > > > >
> > > > > > >> > > > >
> > > > > > >> > > > > On Tue, Oct 4, 2016 at 3:39 PM, Cameron Hatfield <
> > > > > > >> kinguy@gmail.com>
> > > > > > >> > > > wrote:
> > > > > > >> > > > >
> > > > > > >> > > > > > I have a question about the Proxy Client. Basically,
> > for
> > > > our
> > > > > > use
> > > > > > >> > > cases,
> > > > > > >> > > > > we
> > > > > > >> > > > > > want to guarantee ordering at the key level,
> > > irrespective
> > > > of
> > > > > > the
> > > > > > >> > > > ordering
> > > > > > >> > > > > > of the partition it may be assigned to as a whole.
> Due
> > > to
> > > > > the
> > > > > > >> > source
> > > > > > >> > > of
> > > > > > >> > > > > the
> > > > > > >> > > > > > data (HBase Replication), we cannot guarantee that a
> > > > single
> > > > > > >> > partition
> > > > > > >> > > > > will
> > > > > > >> > > > > > be owned for writes by the same client. This means
> the
> > > > proxy
> > > > > > >> client
> > > > > > >> > > > works
> > > > > > >> > > > > > well (since we don't care which proxy owns the
> > partition
> > > > we
> > > > > > are
> > > > > > >> > > writing
> > > > > > >> > > > > > to).
> > > > > > >> > > > > >
> > > > > > >> > > > > >
> > > > > > >> > > > > > However, the guarantees we need when writing a batch
> > > > > consists
> > > > > > >> of:
> > > > > > >> > > > > > Definition of a Batch: The set of records sent to
> the
> > > > > > writeBatch
> > > > > > >> > > > endpoint
> > > > > > >> > > > > > on the proxy
> > > > > > >> > > > > >
> > > > > > >> > > > > > 1. Batch success: If the client receives a success
> > from
> > > > the
> > > > > > >> proxy,
> > > > > > >> > > then
> > > > > > >> > > > > > that batch is successfully written
> > > > > > >> > > > > >
> > > > > > >> > > > > > 2. Inter-Batch ordering : Once a batch has been
> > written
> > > > > > >> > successfully
> > > > > > >> > > by
> > > > > > >> > > > > the
> > > > > > >> > > > > > client, when another batch is written, it will be
> > > > guaranteed
> > > > > > to
> > > > > > >> be
> > > > > > >> > > > > ordered
> > > > > > >> > > > > > after the last batch (if it is the same stream).
> > > > > > >> > > > > >
> > > > > > >> > > > > > 3. Intra-Batch ordering: Within a batch of writes,
> the
> > > > > records
> > > > > > >> will
> > > > > > >> > > be
> > > > > > >> > > > > > committed in order
> > > > > > >> > > > > >
> > > > > > >> > > > > > 4. Intra-Batch failure ordering: If an individual
> > record
> > > > > fails
> > > > > > >> to
> > > > > > >> > > write
> > > > > > >> > > > > > within a batch, all records after that record will
> not
> > > be
> > > > > > >> written.
> > > > > > >> > > > > >
> > > > > > >> > > > > > 5. Batch Commit: Guarantee that if a batch returns a
> > > > > success,
> > > > > > it
> > > > > > >> > will
> > > > > > >> > > > be
> > > > > > >> > > > > > written
> > > > > > >> > > > > >
> > > > > > >> > > > > > 6. Read-after-write: Once a batch is committed,
> > within a
> > > > > > limited
> > > > > > >> > > > > time-frame
> > > > > > >> > > > > > it will be able to be read. This is required in the
> > case
> > > > of
> > > > > > >> > failure,
> > > > > > >> > > so
> > > > > > >> > > > > > that the client can see what actually got
> committed. I
> > > > > believe
> > > > > > >> the
> > > > > > >> > > > > > time-frame part could be removed if the client can
> > send
> > > in
> > > > > the
> > > > > > >> same
> > > > > > >> > > > > > sequence number that was written previously, since
> it
> > > > would
> > > > > > then
> > > > > > >> > fail
> > > > > > >> > > > and
> > > > > > >> > > > > > we would know that a read needs to occur.
> > > > > > >> > > > > >
> > > > > > >> > > > > >
> > > > > > >> > > > > > So, my basic question is if this is currently
> possible
> > > in
> > > > > the
> > > > > > >> > proxy?
> > > > > > >> > > I
> > > > > > >> > > > > > don't believe it gives these guarantees as it stands
> > > > today,
> > > > > > but
> > > > > > >> I
> > > > > > >> > am
> > > > > > >> > > > not
> > > > > > >> > > > > > 100% of how all of the futures in the code handle
> > > > failures.
> > > > > > >> > > > > > If not, where in the code would be the relevant
> places
> > > to
> > > > > add
> > > > > > >> the
> > > > > > >> > > > ability
> > > > > > >> > > > > > to do this, and would the project be interested in a
> > > pull
> > > > > > >> request?
> > > > > > >> > > > > >
> > > > > > >> > > > > >
> > > > > > >> > > > > > Thanks,
> > > > > > >> > > > > > Cameron
> > > > > > >> > > > > >
> > > > > > >> > > > >
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> >
> > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Proxy Client - Batch Ordering / Commit

Posted by Sijie Guo <si...@apache.org>.
Xi,

Thank you so much for your proposal. I took a look. It looks fine to me.
Cameron, do you have any comments?

Look forward to your pull requests.

- Sijie


On Wed, Nov 9, 2016 at 2:34 AM, Xi Liu <xi...@gmail.com> wrote:

> Cameron,
>
> Have you started any work for this? I just updated the proposal page -
> https://cwiki.apache.org/confluence/display/DL/DP-2+-+Epoch+Write+Support
> Maybe we can work together with this.
>
> Sijie, Leigh,
>
> can you guys help review this to make sure our proposal is in the right
> direction?
>
> - Xi
>
> On Tue, Nov 1, 2016 at 3:05 AM, Sijie Guo <si...@apache.org> wrote:
>
> > I created https://issues.apache.org/jira/browse/DL-63 for tracking the
> > proposed idea here.
> >
> >
> >
> > On Wed, Oct 26, 2016 at 4:53 PM, Sijie Guo <si...@twitter.com.invalid>
> > wrote:
> >
> > > On Tue, Oct 25, 2016 at 11:30 AM, Cameron Hatfield <ki...@gmail.com>
> > > wrote:
> > >
> > > > Yes, we are reading the HBase WAL (from their replication plugin
> > > support),
> > > > and writing that into DL.
> > > >
> > >
> > > Gotcha.
> > >
> > >
> > > >
> > > > From the sounds of it, yes, it would. Only thing I would say is make
> > the
> > > > epoch requirement optional, so that if a client doesn't care about
> > dupes
> > > > they don't have to deal with the process of getting a new epoch.
> > > >
> > >
> > > Yup. This should be optional. I can start a wiki page on how we want to
> > > implement this. Are you interested in contributing to this?
> > >
> > >
> > > >
> > > > -Cameron
> > > >
> > > > On Wed, Oct 19, 2016 at 7:43 PM, Sijie Guo
> <sijieg@twitter.com.invalid
> > >
> > > > wrote:
> > > >
> > > > > On Wed, Oct 19, 2016 at 7:17 PM, Sijie Guo <si...@twitter.com>
> > wrote:
> > > > >
> > > > > >
> > > > > >
> > > > > > On Monday, October 17, 2016, Cameron Hatfield <ki...@gmail.com>
> > > > wrote:
> > > > > >
> > > > > >> Answer inline:
> > > > > >>
> > > > > >> On Mon, Oct 17, 2016 at 11:46 AM, Sijie Guo <si...@apache.org>
> > > wrote:
> > > > > >>
> > > > > >> > Cameron,
> > > > > >> >
> > > > > >> > Thank you for your summary. I liked the discussion here. I
> also
> > > > liked
> > > > > >> the
> > > > > >> > summary of your requirement - 'single-writer-per-key,
> > > > > >> > multiple-writers-per-log'. If I understand correctly, the core
> > > > concern
> > > > > >> here
> > > > > >> > is almost 'exact-once' write (or a way to explicit tell if a
> > write
> > > > can
> > > > > >> be
> > > > > >> > retried or not).
> > > > > >> >
> > > > > >> > Comments inline.
> > > > > >> >
> > > > > >> > On Fri, Oct 14, 2016 at 11:17 AM, Cameron Hatfield <
> > > > kinguy@gmail.com>
> > > > > >> > wrote:
> > > > > >> >
> > > > > >> > > > Ah- yes good point (to be clear we're not using the proxy
> > this
> > > > way
> > > > > >> > > today).
> > > > > >> > >
> > > > > >> > > > > Due to the source of the
> > > > > >> > > > > data (HBase Replication), we cannot guarantee that a
> > single
> > > > > >> partition
> > > > > >> > > will
> > > > > >> > > > > be owned for writes by the same client.
> > > > > >> > >
> > > > > >> > > > Do you mean you *need* to support multiple writers issuing
> > > > > >> interleaved
> > > > > >> > > > writes or is it just that they might sometimes interleave
> > > writes
> > > > > and
> > > > > >> > you
> > > > > >> > > >don't care?
> > > > > >> > > How HBase partitions the keys being written wouldn't have a
> > > > one->one
> > > > > >> > > mapping with the partitions we would have in HBase. Even if
> we
> > > did
> > > > > >> have
> > > > > >> > > that alignment when the cluster first started, HBase will
> > > > rebalance
> > > > > >> what
> > > > > >> > > servers own what partitions, as well as split and merge
> > > partitions
> > > > > >> that
> > > > > >> > > already exist, causing eventual drift from one log per
> > > partition.
> > > > > >> > > Because we want ordering guarantees per key (row in hbase),
> we
> > > > > >> partition
> > > > > >> > > the logs by the key. Since multiple writers are possible per
> > > range
> > > > > of
> > > > > >> > keys
> > > > > >> > > (due to the aforementioned rebalancing / splitting / etc of
> > > > hbase),
> > > > > we
> > > > > >> > > cannot use the core library due to requiring a single writer
> > for
> > > > > >> > ordering.
> > > > > >> > >
> > > > > >> > > But, for a single log, we don't really care about ordering
> > aside
> > > > > from
> > > > > >> at
> > > > > >> > > the per-key level. So all we really need to be able to
> handle
> > is
> > > > > >> > preventing
> > > > > >> > > duplicates when a failure occurs, and ordering consistency
> > > across
> > > > > >> > requests
> > > > > >> > > from a single client.
> > > > > >> > >
> > > > > >> > > So our general requirements are:
> > > > > >> > > Write A, Write B
> > > > > >> > > Timeline: A -> B
> > > > > >> > > Request B is only made after A has successfully returned
> > > (possibly
> > > > > >> after
> > > > > >> > > retries)
> > > > > >> > >
> > > > > >> > > 1) If the write succeeds, it will be durably exposed to
> > clients
> > > > > within
> > > > > >> > some
> > > > > >> > > bounded time frame
> > > > > >> > >
> > > > > >> >
> > > > > >> > Guaranteed.
> > > > > >> >
> > > > > >>
> > > > > >> > > 2) If A succeeds and B succeeds, the ordering for the log
> will
> > > be
> > > > A
> > > > > >> and
> > > > > >> > > then B
> > > > > >> > >
> > > > > >> >
> > > > > >> > If I understand correctly here, B is only sent after A is
> > > returned,
> > > > > >> right?
> > > > > >> > If that's the case, It is guaranteed.
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >> >
> > > > > >> >
> > > > > >> >
> > > > > >> > > 3) If A fails due to an error that can be relied on to *not*
> > be
> > > a
> > > > > lost
> > > > > >> > ack
> > > > > >> > > problem, it will never be exposed to the client, so it may
> > > > > (depending
> > > > > >> on
> > > > > >> > > the error) be retried immediately
> > > > > >> > >
> > > > > >> >
> > > > > >> > If it is not a lost-ack problem, the entry will be exposed. it
> > is
> > > > > >> > guaranteed.
> > > > > >>
> > > > > >> Let me try rephrasing the questions, to make sure I'm
> > understanding
> > > > > >> correctly:
> > > > > >> If A fails, with an error such as "Unable to create connection
> to
> > > > > >> bookkeeper server", that would be the type of error we would
> > expect
> > > to
> > > > > be
> > > > > >> able to retry immediately, as that result means no action was
> > taken
> > > on
> > > > > any
> > > > > >> log / etc, so no entry could have been created. This is
> different
> > > > then a
> > > > > >> "Connection Timeout" exception, as we just might not have
> gotten a
> > > > > >> response
> > > > > >> in time.
> > > > > >>
> > > > > >>
> > > > > > Gotcha.
> > > > > >
> > > > > > The response code returned from proxy can tell if a failure can
> be
> > > > > retried
> > > > > > safely or not. (We might need to make them well documented)
> > > > > >
> > > > > >
> > > > > >
> > > > > >>
> > > > > >> >
> > > > > >> >
> > > > > >> > > 4) If A fails due to an error that could be a lost ack
> problem
> > > > > >> (network
> > > > > >> > > connectivity / etc), within a bounded time frame it should
> be
> > > > > >> possible to
> > > > > >> > > find out if the write succeed or failed. Either by reading
> > from
> > > > some
> > > > > >> > > checkpoint of the log for the changes that should have been
> > made
> > > > or
> > > > > >> some
> > > > > >> > > other possible server-side support.
> > > > > >> > >
> > > > > >> >
> > > > > >> > If I understand this correctly, it is a duplication issue,
> > right?
> > > > > >> >
> > > > > >> > Can a de-duplication solution work here? Either DL or your
> > client
> > > > does
> > > > > >> the
> > > > > >> > de-duplication?
> > > > > >> >
> > > > > >>
> > > > > >> The requirements I'm mentioning are the ones needed for
> > client-side
> > > > > >> dedupping. Since if I can guarantee writes being exposed within
> > some
> > > > > time
> > > > > >> frame, and I can never get into an inconsistently ordered state
> > when
> > > > > >> successes happen, when an error occurs, I can always wait for
> max
> > > time
> > > > > >> frame, read the latest writes, and then dedup locally against
> the
> > > > > request
> > > > > >> I
> > > > > >> just made.
> > > > > >>
> > > > > >> The main thing about that timeframe is that its basically the
> > > addition
> > > > > of
> > > > > >> every timeout, all the way down in the system, combined with
> > > whatever
> > > > > >> flushing / caching / etc times are at the bookkeeper / client
> > level
> > > > for
> > > > > >> when values are exposed
> > > > > >
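
A bare-bones sketch of that read-back flow - wait out the maximum exposure
window, read everything after the last checkpoint, and only retry if the
record never showed up. Every name here is hypothetical, not a DL client API:

    // Application-side dedup after a lost-ack failure.
    abstract class ReadBackDeduper {
        // ids of all records appended after the given checkpoint (application-defined)
        abstract Iterable<String> readSince(long checkpointTxId);

        boolean alreadyWritten(String recordId, long checkpointTxId, long maxExposureMs)
                throws InterruptedException {
            Thread.sleep(maxExposureMs);      // sum of every timeout plus flush/cache delay
            for (String writtenId : readSince(checkpointTxId)) {
                if (writtenId.equals(recordId)) {
                    return true;              // the "failed" write actually landed: skip it
                }
            }
            return false;                     // never exposed: safe to retry
        }
    }
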
> > > > > >
> > > > > > Gotcha.
> > > > > >
> > > > > >>
> > > > > >>
> > > > > >> >
> > > > > >> > Is there any ways to identify your write?
> > > > > >> >
> > > > > >> > I can think of a case as follow - I want to know what is your
> > > > expected
> > > > > >> > behavior from the log.
> > > > > >> >
> > > > > >> > a)
> > > > > >> >
> > > > > >> > If a hbase region server A writes a change of key K to the
> log,
> > > the
> > > > > >> change
> > > > > >> > is successfully made to the log;
> > > > > >> > but server A is down before receiving the change.
> > > > > >> > region server B took over the region that contains K, what
> will
> > B
> > > > do?
> > > > > >> >
> > > > > >>
> > > > > >> HBase writes in large chunks (WAL Logs), which its replication
> > > system
> > > > > then
> > > > > >> handles by replaying in the case of failure. If I'm in a middle
> > of a
> > > > > log,
> > > > > >> and the whole region goes down and gets rescheduled elsewhere, I
> > > will
> > > > > >> start
> > > > > >> back up from the beginning of the log I was in the middle of.
> > Using
> > > > > >> checkpointing + deduping, we should be able to find out where we
> > > left
> > > > > off
> > > > > >> in the log.
> > > > > >
> > > > > >
> > > > > >> >
> > > > > >> >
> > > > > >> > b) same as a). but server A was just network partitioned. will
> > > both
> > > > A
> > > > > >> and B
> > > > > >> > write the change of key K?
> > > > > >> >
> > > > > >>
> > > > > >> HBase gives us some guarantees around network partitions
> > > (Consistency
> > > > > over
> > > > > >> availability for HBase). HBase is a single-master failover
> > recovery
> > > > type
> > > > > >> of
> > > > > >> system, with zookeeper-based guarantees for single owners
> > (writers)
> > > > of a
> > > > > >> range of data.
> > > > > >>
> > > > > >> >
> > > > > >> >
> > > > > >> >
> > > > > >> > > 5) If A is turned into multiple batches (one large request
> > gets
> > > > > split
> > > > > >> > into
> > > > > >> > > multiple smaller ones to the bookkeeper backend, due to log
> > > > rolling
> > > > > /
> > > > > >> > size
> > > > > >> > > / etc):
> > > > > >> > >   a) The ordering of entries within batches has ordering
> > > > > consistency
> > > > > >> > with
> > > > > >> > > the original request, when exposed in the log (though they
> may
> > > be
> > > > > >> > > interleaved with other requests)
> > > > > >> > >   b) The ordering across batches has ordering consistency
> > with
> > > > the
> > > > > >> > > original request, when exposed in the log (though they may
> be
> > > > > >> interleaved
> > > > > >> > > with other requests)
> > > > > >> > >   c) If a batch fails, and cannot be retried / is
> > unsuccessfully
> > > > > >> retried,
> > > > > >> > > all batches after the failed batch should not be exposed in
> > the
> > > > log.
> > > > > >> > Note:
> > > > > >> > > The batches before and including the failed batch, that
> ended
> > up
> > > > > >> > > succeeding, can show up in the log, again within some
> bounded
> > > time
> > > > > >> range
> > > > > >> > > for reads by a client.
> > > > > >> > >
> > > > > >> >
> > > > > >> > There is a method 'writeBulk' in DistributedLogClient that can
> > achieve
> > > > this
> > > > > >> > guarantee.
> > > > > >> >
> > > > > >> > However, I am not very sure about how you will turn A into
> > > batches.
> > > > If
> > > > > >> you
> > > > > >> > are dividing A into batches,
> > > > > >> > you can simply control the application write sequence to
> achieve
> > > the
> > > > > >> > guarantee here.
> > > > > >> >
> > > > > >> > Can you explain more about this?
> > > > > >> >
> > > > > >>
> > > > > >> In this case, by batches I mean what the proxy does with the
> > single
> > > > > >> request
> > > > > >> that I send it. If the proxy decides it needs to turn my single
> > > > request
> > > > > >> into multiple batches of requests, due to log rolling, size
> > > > limitations,
> > > > > >> etc, those would be the guarantees I need to be able to
> > reduplicate
> > > on
> > > > > the
> > > > > >> client side.
> > > > > >
> > > > > >
> > > > > > A single record written by #write and A record set (set of
> records)
> > > > > > written by #writeRecordSet are atomic - they will not be broken
> > down
> > > > into
> > > > > > entries (batches). With the correct response code, you would be
> > able
> > > to
> > > > > > tell if it is a lost-ack failure or not. However there is a size
> > > > > limitation
> > > > > > for this - it cannot go beyond 1MB in the current implementation.
> > > > > >
> > > > > > What is your expected record size?
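
For illustration, one way a writer could keep each record set under that limit
before handing it to a writeRecordSet-style call - each chunk then stays
atomic on its own, which is exactly why the cross-chunk ordering discussed
earlier matters. The constant and the chunking rule are assumptions:

    import java.util.ArrayList;
    import java.util.List;

    // Splits an application batch into chunks that each stay under ~1MB.
    class RecordSetChunker {
        static final int MAX_RECORD_SET_BYTES = 1024 * 1024;

        static List<List<byte[]>> chunk(List<byte[]> records) {
            List<List<byte[]>> chunks = new ArrayList<>();
            List<byte[]> current = new ArrayList<>();
            int currentBytes = 0;
            for (byte[] record : records) {
                if (!current.isEmpty() && currentBytes + record.length > MAX_RECORD_SET_BYTES) {
                    chunks.add(current);      // close this record set, start the next one
                    current = new ArrayList<>();
                    currentBytes = 0;
                }
                current.add(record);
                currentBytes += record.length;
            }
            if (!current.isEmpty()) {
                chunks.add(current);
            }
            return chunks;
        }
    }
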
> > > > > >
> > > > > >
> > > > > >>
> > > > > >> >
> > > > > >> >
> > > > > >> > >
> > > > > >> > > Since we can guarantee per-key ordering on the client side,
> we
> > > > > >> guarantee
> > > > > >> > > that there is a single writer per-key, just not per log.
> > > > > >> >
> > > > > >> >
> > > > > >> > Do you need fencing guarantee in the case of network partition
> > > > causing
> > > > > >> > two-writers?
> > > > > >> >
> > > > > >> >
> > > > > >> > > So if there was a
> > > > > >> > > way to guarantee a single write request as being written or
> > not,
> > > > > >> within a
> > > > > >> > > certain time frame (since failures should be rare anyways,
> > this
> > > is
> > > > > >> fine
> > > > > >> > if
> > > > > >> > > it is expensive), we can then have the client guarantee the
> > > > ordering
> > > > > >> it
> > > > > >> > > needs.
> > > > > >> > >
> > > > > >> >
> > > > > >> > This sounds like an 'exact-once' write (regarding retries)
> > requirement
> > > to
> > > > > me,
> > > > > >> > right?
> > > > > >> >
> > > > > >> Yes. I'm curious of how this issue is handled by Manhattan,
> since
> > > you
> > > > > can
> > > > > >> imagine a data store that ends up getting multiple writes for
> the
> > > same
> > > > > put
> > > > > >> / get / etc, would be harder to use, and we are basically trying
> > to
> > > > > create
> > > > > >> a log like that for HBase.
> > > > > >
> > > > > >
> > > > > > Are you guys replacing HBase WAL?
> > > > > >
> > > > > > In Manhattan case, the request will be first written to DL
> streams
> > by
> > > > > > Manhattan coordinator. The Manhattan replica then will read from
> > the
> > > DL
> > > > > > streams and apply the change. In the lost-ack case, the MH
> > > coordinator
> > > > > will
> > > > > > just fail the request to client.
> > > > > >
> > > > > > My feeling here is your usage for HBase is a bit different from
> how
> > > we
> > > > > use
> > > > > > DL in Manhattan. It sounds like you read from a source (HBase
> WAL)
> > > and
> > > > > > write to DL. But I might be wrong.
> > > > > >
> > > > > >
> > > > > >>
> > > > > >> >
> > > > > >> >
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > > Cameron:
> > > > > >> > > > Another thing we've discussed but haven't really thought
> > > > through -
> > > > > >> > > > We might be able to support some kind of epoch write
> > request,
> > > > > where
> > > > > >> the
> > > > > >> > > > epoch is guaranteed to have changed if the writer has
> > changed
> > > or
> > > > > the
> > > > > >> > > ledger
> > > > > >> > > > was ever fenced off. Writes include an epoch and are
> > rejected
> > > if
> > > > > the
> > > > > >> > > epoch
> > > > > >> > > > has changed.
> > > > > >> > > > With a mechanism like this, fencing the ledger off after a
> > > > failure
> > > > > >> > would
> > > > > >> > > > ensure any pending writes had either been written or would
> > be
> > > > > >> rejected.
> > > > > >> > >
> > > > > >> > > The issue would be how I guarantee the write I wrote to the
> > > server
> > > > > was
> > > > > >> > > written. Since a network issue could happen on the send of
> the
> > > > > >> request,
> > > > > >> > or
> > > > > >> > > on the receive of the success response, an epoch wouldn't
> tell
> > > me
> > > > > if I
> > > > > >> > can
> > > > > >> > > successfully retry, as it could be successfully written but
> > AWS
> > > > > >> dropped
> > > > > >> > the
> > > > > >> > > connection for the success response. Since the epoch would
> be
> > > the
> > > > > same
> > > > > >> > > (same ledger), I could write duplicates.
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > > We are currently proposing adding a transaction semantic
> to
> > dl
> > > > to
> > > > > >> get
> > > > > >> > rid
> > > > > >> > > > of the size limitation and the unaware-ness in the proxy
> > > client.
> > > > > >> Here
> > > > > >> > is
> > > > > >> > > > our idea -
> > > > > >> > > > http://mail-archives.apache.org/mod_mbox/incubator-
> > > > distributedlog
> > > > > >> > > -dev/201609.mbox/%3cCAAC6BxP5YyEHwG0ZCF5soh42X=xuYwYm
> > > > > >> > > <http://mail-archives.apache.org/mod_mbox/incubator-
> > > > > >> > distributedlog%0A-dev/201609.mbox/%3cCAAC6BxP5YyEHwG0ZCF5soh
> > > > > 42X=xuYwYm>
> > > > > >> > > L4nXsYBYiofzxpVk6g@mail.gmail.com%3e
> > > > > >> > >
> > > > > >> > > > I am not sure if your idea is similar as ours. but we'd
> like
> > > to
> > > > > >> > > collaborate
> > > > > >> > > > with the community if anyone has the similar idea.
> > > > > >> > >
> > > > > >> > > Our use case would be covered by transaction support, but
> I'm
> > > > unsure
> > > > > >> if
> > > > > >> > we
> > > > > >> > > would need something that heavy weight for the guarantees we
> > > need.
> > > > > >> > >
> > > > > >> >
> > > > > >> > >
> > > > > >> > > Basically, the high level requirement here is "Support
> > > consistent
> > > > > >> write
> > > > > >> > > ordering for single-writer-per-key, multi-writer-per-log".
> My
> > > > hunch
> > > > > is
> > > > > >> > > that, with some added guarantees to the proxy (if it isn't
> > > already
> > > > > >> > > supported), and some custom client code on our side for
> > removing
> > > > the
> > > > > >> > > entries that actually succeed to write to DistributedLog
> from
> > > the
> > > > > >> request
> > > > > >> > > that failed, it should be a relatively easy thing to
> support.
> > > > > >> > >
> > > > > >> >
> > > > > >> > Yup. I think it should not be very difficult to support. There
> > > might
> > > > > be
> > > > > >> > some changes in the server side.
> > > > > >> > Let's figure out what will the changes be. Are you guys
> > interested
> > > > in
> > > > > >> > contributing?
> > > > > >> >
> > > > > >> > Yes, we would be.
> > > > > >>
> > > > > >> As a note, the one thing that we see as an issue with the client
> > > side
> > > > > >> dedupping is how to bound the range of data that needs to be
> > looked
> > > at
> > > > > for
> > > > > >> deduplication. As you can imagine, it is pretty easy to bound
> the
> > > > bottom
> > > > > >> of
> > > > > >> the range, as that it just regular checkpointing of the DSLN
> that
> > is
> > > > > >> returned. I'm still not sure if there is any nice way to time
> > bound
> > > > the
> > > > > >> top
> > > > > >> end of the range, especially since the proxy owns sequence
> numbers
> > > > > (which
> > > > > >> makes sense). I am curious if there is more that can be done if
> > > > > >> deduplication is on the server side. However the main minus I
> see
> > of
> > > > > >> server
> > > > > >> side deduplication is that instead of running contingent on
> there
> > > > being
> > > > > a
> > > > > >> failed client request, it would have to run every time a
> > > write
> > > > > >> happens.
> > > > > >
> > > > > >
> > > > > > For a reliable dedup, we probably need a fence-then-getLastDLSN
> > > > operation -
> > > > > > so it would guarantee that any non-completed requests issued
> > > (lost-ack
> > > > > > requests) before this fence-then-getLastDLSN operation will be
> > failed
> > > > and
> > > > > > they will never land at the log.
> > > > > >
> > > > > > the pseudo code would look like below -
> > > > > >
> > > > > > write(request) onFailure { t =>
> > > > > >
> > > > > > if (t is timeout exception) {
> > > > > >
> > > > > > DLSN lastDLSN = fenceThenGetLastDLSN()
> > > > > > DLSN lastCheckpointedDLSN = ...;
> > > > > > // find if the request lands between [lastCheckpointedDLSN,
> > > lastDLSN].
> > > > > > // if it exists, the write succeeded; otherwise retry.
> > > > > >
> > > > > > }
> > > > > >
> > > > > >
> > > > > > }
> > > > > >
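
A slightly expanded, still hypothetical version of that pseudo code. Only
fenceThenGetLastDLSN is the operation being proposed; checkpoint(), scanRange()
and matches() are placeholders for application-side bookkeeping and reading:

    import com.twitter.distributedlog.DLSN;
    import com.twitter.distributedlog.LogRecordWithDLSN;

    abstract class FencedDedup {
        abstract DLSN fenceThenGetLastDLSN();  // proposed RPC: fence, then return the tail DLSN
        abstract DLSN checkpoint();            // last DLSN the application knows it wrote
        abstract Iterable<LogRecordWithDLSN> scanRange(DLSN from, DLSN to);
        abstract boolean matches(LogRecordWithDLSN record, byte[] request);

        boolean landed(byte[] request) {
            DLSN upper = fenceThenGetLastDLSN();  // nothing issued earlier can still land after this
            DLSN lower = checkpoint();
            for (LogRecordWithDLSN record : scanRange(lower, upper)) {
                if (matches(record, request)) {
                    return true;               // the timed-out write actually made it into the log
                }
            }
            return false;                      // definitely not written: retrying is safe
        }
    }
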
> > > > >
> > > > >
> > > > > Just realized the idea is the same as what Leigh raised in the previous
> > > email
> > > > > about 'epoch write'. Let me explain more about this idea (Leigh,
> feel
> > > > free
> > > > > to jump in to fill up your idea).
> > > > >
> > > > > - when a log stream is owned,  the proxy use the last transaction
> id
> > as
> > > > the
> > > > > epoch
> > > > > - when a client connects (handshake with the proxy), it will get
> the
> > > > epoch
> > > > > for the stream.
> > > > > - the writes issued by this client will carry the epoch to the
> proxy.
> > > > > - add a new rpc - fenceThenGetLastDLSN - it would force the proxy
> to
> > > bump
> > > > > the epoch.
> > > > > - if fenceThenGetLastDLSN happened, all the outstanding writes with
> > old
> > > > > epoch will be rejected with exceptions (e.g. EpochFenced).
> > > > > - The DLSN returned from fenceThenGetLastDLSN can be used as the
> > bound
> > > > for
> > > > > deduplications on failures.
> > > > >
> > > > > Cameron, does this sound like a solution to your use case?
> > > > >
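
A rough sketch of what the client side of that protocol could look like. The
epoch-carrying write(), fenceThenGetLastDLSN() and the EpochFenced rejection
are the proposed additions described above, not existing APIs; the dedup step
is left abstract:

    import java.util.concurrent.TimeoutException;
    import com.twitter.distributedlog.DLSN;

    abstract class EpochAwareWriter {
        long epoch;                                          // learned during handshake

        abstract DLSN write(String stream, byte[] data, long epoch)
                throws TimeoutException, EpochFencedException;
        abstract DLSN fenceThenGetLastDLSN(String stream);   // bumps the epoch on the proxy
        abstract DLSN dedupThenMaybeRetry(String stream, byte[] data, DLSN upperBound)
                throws Exception;

        DLSN writeOnce(String stream, byte[] data) throws Exception {
            try {
                return write(stream, data, epoch);           // rejected with EpochFenced if the epoch moved
            } catch (TimeoutException lostAck) {
                // after this call no outstanding old-epoch write can still land,
                // so the returned DLSN is a hard upper bound for the dedup scan
                DLSN upperBound = fenceThenGetLastDLSN(stream);
                return dedupThenMaybeRetry(stream, data, upperBound);
            }
        }

        static class EpochFencedException extends Exception {}
    }
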
> > > > >
> > > > >
> > > > > >
> > > > > >
> > > > > >>
> > > > > >> Maybe something that could fit a similar need that Kafka does
> (the
> > > > last
> > > > > >> store value for a particular key in a log), such that on a per
> key
> > > > basis
> > > > > >> there could be a sequence number that support deduplication?
> Cost
> > > > seems
> > > > > >> like it would be high however, and I'm not even sure if
> bookkeeper
> > > > > >> supports
> > > > > >> it.
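
As a toy illustration of that per-key idea, purely on the application side
(nothing below exists in DL or BookKeeper today):

    import java.util.HashMap;
    import java.util.Map;

    // Keeps the highest sequence number applied per key and drops anything
    // at or below it; a reader applying HBase edits could dedup this way.
    class PerKeySequenceDeduper {
        private final Map<String, Long> lastSeqByKey = new HashMap<>();

        // returns true if the record is new and should be applied
        boolean accept(String key, long sequence) {
            Long last = lastSeqByKey.get(key);
            if (last != null && sequence <= last) {
                return false;                 // duplicate or stale
            }
            lastSeqByKey.put(key, sequence);
            return true;
        }
    }
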
> > > > > >
> > > > > >
> > > > > >> Cheers,
> > > > > >> Cameron
> > > > > >>
> > > > > >> >
> > > > > >> > >
> > > > > >> > > Thanks,
> > > > > >> > > Cameron
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > On Sat, Oct 8, 2016 at 7:35 AM, Leigh Stewart
> > > > > >> > <lstewart@twitter.com.invalid
> > > > > >> > > >
> > > > > >> > > wrote:
> > > > > >> > >
> > > > > >> > > > Cameron:
> > > > > >> > > > Another thing we've discussed but haven't really thought
> > > > through -
> > > > > >> > > > We might be able to support some kind of epoch write
> > request,
> > > > > where
> > > > > >> the
> > > > > >> > > > epoch is guaranteed to have changed if the writer has
> > changed
> > > or
> > > > > the
> > > > > >> > > ledger
> > > > > >> > > > was ever fenced off. Writes include an epoch and are
> > rejected
> > > if
> > > > > the
> > > > > >> > > epoch
> > > > > >> > > > has changed.
> > > > > >> > > > With a mechanism like this, fencing the ledger off after a
> > > > failure
> > > > > >> > would
> > > > > >> > > > ensure any pending writes had either been written or would
> > be
> > > > > >> rejected.
> > > > > >> > > >
> > > > > >> > > >
> > > > > >> > > > On Sat, Oct 8, 2016 at 7:10 AM, Sijie Guo <
> sijie@apache.org
> > >
> > > > > wrote:
> > > > > >> > > >
> > > > > >> > > > > Cameron,
> > > > > >> > > > >
> > > > > >> > > > > I think both Leigh and Xi had made a few good points
> about
> > > > your
> > > > > >> > > question.
> > > > > >> > > > >
> > > > > >> > > > > To add one more point to your question - "but I am not
> > > > > >> > > > > 100% of how all of the futures in the code handle
> > failures.
> > > > > >> > > > > If not, where in the code would be the relevant places
> to
> > > add
> > > > > the
> > > > > >> > > ability
> > > > > >> > > > > to do this, and would the project be interested in a
> pull
> > > > > >> request?"
> > > > > >> > > > >
> > > > > >> > > > > The current proxy and client logic doesn't do perfectly
> on
> > > > > >> handling
> > > > > >> > > > > failures (duplicates) - the strategy now is the client
> > will
> > > > > retry
> > > > > >> as
> > > > > >> > > best
> > > > > >> > > > > at it can before throwing exceptions to users. The code
> > you
> > > > are
> > > > > >> > looking
> > > > > >> > > > for
> > > > > >> > > > > - it is on BKLogSegmentWriter for the proxy handling
> > writes
> > > > and
> > > > > >> it is
> > > > > >> > > on
> > > > > >> > > > > DistributedLogClientImpl for the proxy client handling
> > > > responses
> > > > > >> from
> > > > > >> > > > > proxies. Does this help you?
> > > > > >> > > > >
> > > > > >> > > > > And also, you are welcome to contribute the pull
> requests.
> > > > > >> > > > >
> > > > > >> > > > > - Sijie
> > > > > >> > > > >
> > > > > >> > > > >
> > > > > >> > > > >
> > > > > >> > > > > On Tue, Oct 4, 2016 at 3:39 PM, Cameron Hatfield <
> > > > > >> kinguy@gmail.com>
> > > > > >> > > > wrote:
> > > > > >> > > > >
> > > > > >> > > > > > I have a question about the Proxy Client. Basically,
> for
> > > our
> > > > > use
> > > > > >> > > cases,
> > > > > >> > > > > we
> > > > > >> > > > > > want to guarantee ordering at the key level,
> > irrespective
> > > of
> > > > > the
> > > > > >> > > > ordering
> > > > > >> > > > > > of the partition it may be assigned to as a whole. Due
> > to
> > > > the
> > > > > >> > source
> > > > > >> > > of
> > > > > >> > > > > the
> > > > > >> > > > > > data (HBase Replication), we cannot guarantee that a
> > > single
> > > > > >> > partition
> > > > > >> > > > > will
> > > > > >> > > > > > be owned for writes by the same client. This means the
> > > proxy
> > > > > >> client
> > > > > >> > > > works
> > > > > >> > > > > > well (since we don't care which proxy owns the
> partition
> > > we
> > > > > are
> > > > > >> > > writing
> > > > > >> > > > > > to).
> > > > > >> > > > > >
> > > > > >> > > > > >
> > > > > >> > > > > > However, the guarantees we need when writing a batch
> > > > consists
> > > > > >> of:
> > > > > >> > > > > > Definition of a Batch: The set of records sent to the
> > > > > writeBatch
> > > > > >> > > > endpoint
> > > > > >> > > > > > on the proxy
> > > > > >> > > > > >
> > > > > >> > > > > > 1. Batch success: If the client receives a success
> from
> > > the
> > > > > >> proxy,
> > > > > >> > > then
> > > > > >> > > > > > that batch is successfully written
> > > > > >> > > > > >
> > > > > >> > > > > > 2. Inter-Batch ordering : Once a batch has been
> written
> > > > > >> > successfully
> > > > > >> > > by
> > > > > >> > > > > the
> > > > > >> > > > > > client, when another batch is written, it will be
> > > guaranteed
> > > > > to
> > > > > >> be
> > > > > >> > > > > ordered
> > > > > >> > > > > > after the last batch (if it is the same stream).
> > > > > >> > > > > >
> > > > > >> > > > > > 3. Intra-Batch ordering: Within a batch of writes, the
> > > > records
> > > > > >> will
> > > > > >> > > be
> > > > > >> > > > > > committed in order
> > > > > >> > > > > >
> > > > > >> > > > > > 4. Intra-Batch failure ordering: If an individual
> record
> > > > fails
> > > > > >> to
> > > > > >> > > write
> > > > > >> > > > > > within a batch, all records after that record will not
> > be
> > > > > >> written.
> > > > > >> > > > > >
> > > > > >> > > > > > 5. Batch Commit: Guarantee that if a batch returns a
> > > > success,
> > > > > it
> > > > > >> > will
> > > > > >> > > > be
> > > > > >> > > > > > written
> > > > > >> > > > > >
> > > > > >> > > > > > 6. Read-after-write: Once a batch is committed,
> within a
> > > > > limited
> > > > > >> > > > > time-frame
> > > > > >> > > > > > it will be able to be read. This is required in the
> case
> > > of
> > > > > >> > failure,
> > > > > >> > > so
> > > > > >> > > > > > that the client can see what actually got committed. I
> > > > believe
> > > > > >> the
> > > > > >> > > > > > time-frame part could be removed if the client can
> send
> > in
> > > > the
> > > > > >> same
> > > > > >> > > > > > sequence number that was written previously, since it
> > > would
> > > > > then
> > > > > >> > fail
> > > > > >> > > > and
> > > > > >> > > > > > we would know that a read needs to occur.
> > > > > >> > > > > >
> > > > > >> > > > > >
> > > > > >> > > > > > So, my basic question is if this is currently possible
> > in
> > > > the
> > > > > >> > proxy?
> > > > > >> > > I
> > > > > >> > > > > > don't believe it gives these guarantees as it stands
> > > today,
> > > > > but
> > > > > >> I
> > > > > >> > am
> > > > > >> > > > not
> > > > > >> > > > > > 100% of how all of the futures in the code handle
> > > failures.
> > > > > >> > > > > > If not, where in the code would be the relevant places
> > to
> > > > add
> > > > > >> the
> > > > > >> > > > ability
> > > > > >> > > > > > to do this, and would the project be interested in a
> > pull
> > > > > >> request?
> > > > > >> > > > > >
> > > > > >> > > > > >
> > > > > >> > > > > > Thanks,
> > > > > >> > > > > > Cameron
> > > > > >> > > > > >
> > > > > >> > > > >
> > > > > >> > > >
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Proxy Client - Batch Ordering / Commit

Posted by Xi Liu <xi...@gmail.com>.
Cameron,

Have you started any work on this? I just updated the proposal page -
https://cwiki.apache.org/confluence/display/DL/DP-2+-+Epoch+Write+Support
Maybe we can work together on this.

Sijie, Leigh,

Can you guys help review this to make sure our proposal is going in the right
direction?

- Xi
