Posted to dev@zookeeper.apache.org by Alexander Shraer <sh...@gmail.com> on 2013/03/02 04:06:57 UTC

sync semantics

This is an old thread (below), but it doesn't seem like any conclusion
was reached on what we want to do to address the issue.

Reminder of the problem: sync only gets you strong semantics if there
is no leader change. If there is a leader change,
then these semantics are guaranteed only if we make some timing
assumptions, not made elsewhere in ZooKeeper. It would be much
better not to make timing assumptions for such safety/consistency
properties, only for liveness.

The problem happens when your leader is no longer the leader but
doesn't know it yet. He responds to a sync, but that doesn't mean
your follower sees all committed state. Some other server may have
already become the leader and committed some updates, which the sync
won't flush to your follower, which is still connected to the old
leader.

To prevent this we should broadcast syncs like updates, or piggyback
them on other ops, or perhaps create a new type of sync that is
broadcast.

As Ben pointed out, this problem is also mentioned in Section 4.4 of
the ZooKeeper paper (but the proposed solution there is insufficient
to solve the issue, as discussed below).

Alex
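
For readers following along, the "sync + read" pattern under
discussion looks roughly like this with the Java client (a minimal
sketch; error handling omitted, and note that sync() only has an
asynchronous form):

    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.ZooKeeper;

    // "sync + read": flush pending updates from the leader to our
    // follower, then read locally. As described above, if the server
    // answering the sync has silently been deposed as leader, this
    // read can still miss writes committed by the new leader.
    byte[] syncedRead(ZooKeeper zk, String path) throws Exception {
        CountDownLatch synced = new CountDownLatch(1);
        zk.sync(path, (rc, p, ctx) -> synced.countDown(), null);
        synced.await();
        return zk.getData(path, false, null);
    }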


On Fri, Sep 28, 2012 at 4:45 PM, John Carrino <jo...@gmail.com> wrote:
> Ben, after thinking about this more. I don't think this solution gets the
> property that I need.  Just because there are outstanding proposals that are
> committed later doesn't imply we are still the leader.  It only means that
> when the new leader does recovery it will also see these proposals as
> committed.
>
> Let's say we have a 5 node cluster and L1 has one pending request out. F2-F5
> are followers. We get back an ack from F2.  Now F5 and L1 are partitioned
> off from the network along with client C1.
>
> Recovery happens on F2-F4 and F2 becomes L2.  During recovery this proposal
> is accepted because F2 had acked it.  Now L2 does a bunch of stuff including
> deleting your ephemeral node.
>
> Now a sync comes in from C1 through F5. Now L1 finally gets that ack from F5
> and goes ahead and commits it and responds to the outstanding sync request
> to C1.
>
> We can see with this ordering there isn't a happens after relationship
> between the sync request and knowing about all commits that occurred before
> the sync request.
>
> Yes, I realize that this ordering is unlikely to happen in practice, but I
> hate trusting time for anything.
>
> -jc
>
> On Fri, Sep 28, 2012 at 7:31 AM, John Carrino <jo...@gmail.com>
> wrote:
>>
>> This seems like a good compromise.  We still have to eat the latency
>> of a write, but we easily achieve smart batching in this case, so
>> many outstanding syncs can all be serviced by the same lastPending
>> request.
>>
>> -jc
>>
>>
>> On Thu, Sep 27, 2012 at 11:17 PM, Benjamin Reed <be...@gmail.com>
>> wrote:
>>>
>>> there is a very easy solution to this. we only rely on clocks in the
>>> case that there are no pending transactions. (if there are pending
>>> transactions, the sync will only return if in fact the leader is still
>>> the leader, otherwise the transaction that the sync is waiting on will
>>> never commit and the sync will never return.)
>>>
>>> so, if there aren't any transactions, just submit one. make it a bogus
>>> one: create / for example. then queue the sync behind it.
>>>
>>> ben
>>>
>>> ps - we bring up this issue and the solution and the rationale for
>>> the current implementation in section 4.4 of the zookeeper usenix
>>> paper.
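
A client-side approximation of Ben's suggestion, as a hedged sketch
(the real fix he describes would live on the server): push one real
transaction through zab, then queue the sync behind it. John's reply
further up explains why even this can fail under a deposed leader.

    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.ZooKeeper;

    // The setData on "/" is the "bogus" transaction: it goes through
    // the full agreement pipeline, so it only commits if the leader
    // still has a quorum; the sync is then queued behind it.
    void syncBehindBogusWrite(ZooKeeper zk) throws Exception {
        zk.setData("/", zk.getData("/", false, null), -1);
        CountDownLatch synced = new CountDownLatch(1);
        zk.sync("/", (rc, p, ctx) -> synced.countDown(), null);
        synced.await();
    }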
>>>
>>> On Thu, Sep 27, 2012 at 9:57 AM, John Carrino <jo...@gmail.com>
>>> wrote:
>>> > So I think it's time to explain what I'm writing just so everyone
>>> > has more situational awareness.  It's just a timestamp server,
>>> > nothing fancy.
>>> >
>>> > Looks like this:
>>> >
>>> > public interface TimestampService {
>>> >     /**
>>> >      * This will get a fresh timestamp that is guaranteed to be
>>> >      * newer than any other timestamp handed out before this method
>>> >      * was called.
>>> >      */
>>> >     long getFreshTimestamp();
>>> > }
>>> >
>>> > The only requirement is that the timestamp handed back is greater
>>> > than every other timestamp that was returned before getFreshTs was
>>> > called.  There is no ordering requirement for concurrent requests.
>>> >
>>> > My impl is to reserve blocks of timestamps that are safe to hand out
>>> > (1M at a time) using compare and swap in ZK.
>>> > lastPossibleUsed = read(HighWater)
>>> > safeToHandout = compareAndSwap(lastPossibleUsed, lastPossibleUsed+1M)
>>> >
>>> > Now my leader can hand back timestamps up to safeToHandout, but
>>> > before it hands one out it must ensure it is still the leader (no
>>> > one else has handed back something higher).
>>> > I can use ensureQuorum(), exists(myEphemNode) to make sure this is
>>> > the case.
>>> > Now I have a service that is guaranteed to be correct, but doesn't
>>> > require disk hits in the steady state, which brings down my latency
>>> > (if you get close to running out, you can compareAndSwap for more
>>> > timestamps).
>>> >
>>> > If many requests come in at the same time I can use smart batching
>>> > to verify happens after for all at once.  We can also add more
>>> > layers if we need more bandwidth to scale up, at the cost of adding
>>> > latency.  Basically our latency will be O(lg(requestRate)) if we
>>> > keep adding layers as each previous layer becomes saturated.
>>> >
>>> > I hope this explanation helps. I am busy for the next 4 hours, but
>>> > if you need more clarification I can respond then.
>>> >
>>> > -jc
>>> >
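
For concreteness, the compare-and-swap John sketches above maps onto
ZooKeeper's versioned setData; a minimal sketch (the /highWater path
and the encoding are hypothetical):

    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    // Reserve the next block of 1M timestamps. The versioned setData
    // is the compare-and-swap: it throws BadVersionException if some
    // other process reserved a block concurrently, so callers retry.
    long reserveBlock(ZooKeeper zk) throws Exception {
        Stat stat = new Stat();
        byte[] raw = zk.getData("/highWater", false, stat);
        long lastPossibleUsed = Long.parseLong(new String(raw, "UTF-8"));
        long safeToHandout = lastPossibleUsed + 1_000_000L;
        zk.setData("/highWater",
                   Long.toString(safeToHandout).getBytes("UTF-8"),
                   stat.getVersion());
        return safeToHandout;
    }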
>>> >
>>> > On Thu, Sep 27, 2012 at 9:26 AM, John Carrino <jo...@gmail.com>
>>> > wrote:
>>> >>
>>> >> First, thanks everyone for talking this through with me.
>>> >>
>>> >> Flavio, for your example, this is actually ok.  There is a happens
>>> >> after relationship between the client making the request and my
>>> >> leader C1 still being the leader.  My service only needs to
>>> >> guarantee that what it hands back is at least as new as anything
>>> >> that existed when the client made the request.  If C2 were to
>>> >> answer requests while C1 is stalling that is ok, because these
>>> >> would be considered concurrent requests, and the stuff returned by
>>> >> C2 may be newer but that doesn't violate any guarantees.
>>> >>
>>> >> If some client were to get back something from C2 and then (happens
>>> >> after relationship) someone tried to read from C1, it needs to
>>> >> fail.
>>> >>
>>> >> To address your concern of adding too much bandwidth we can get this
>>> >> easily by doing what Martin Thompson calls smart batching
>>> >> (http://mechanical-sympathy.blogspot.com/2011/10/smart-batching.html).
>>> >>
>>> >> 1. ensureQuorum request comes in to L1
>>> >> 2. send ENSURE to all followers
>>> >> 3. 10 more ensureQuorum requests come in
>>> >> 4. get back ENSURE from quorum
>>> >> 5. we can now service all 10 pending ensureQuorum requests with
>>> >> another round trip ENSURE.
>>> >>
>>> >> We don't need to send an ENSURE for every ensureQuorum request, we
>>> >> just need it to be happens after from when the request arrived.
>>> >>
>>> >> I am fine with the Ephemeral node being removed after some time
>>> >> expires, but only by the leader.  If the leader's clock is broken
>>> >> and the client owning the Ephemeral node drops off, then we don't
>>> >> have liveness (because this node may not get cleaned up in a timely
>>> >> fashion).  However, we still preserve correctness.
>>> >>
>>> >> -jc
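
The smart-batching pattern John describes can be sketched as follows
(ensureQuorum() and the ENSURE round trip are proposals from this
thread, not existing ZooKeeper APIs):

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Queue;
    import java.util.concurrent.CompletableFuture;

    // Requests that arrive while an ENSURE round trip is in flight
    // are queued; one follow-up round trip then serves all of them.
    // Each request only completes via a round trip that started after
    // it arrived, which gives the happens-after property above.
    class SmartBatchingEnsurer {
        private final Queue<CompletableFuture<Void>> waiting = new ArrayDeque<>();
        private List<CompletableFuture<Void>> inFlight = null;

        synchronized CompletableFuture<Void> ensureQuorum() {
            CompletableFuture<Void> f = new CompletableFuture<>();
            waiting.add(f);
            maybeStartRoundTrip();
            return f;
        }

        // Called when a quorum of followers has acked the ENSURE.
        synchronized void onQuorumAcked() {
            inFlight.forEach(f -> f.complete(null));
            inFlight = null;
            maybeStartRoundTrip();  // serve everything queued meanwhile
        }

        private void maybeStartRoundTrip() {
            if (inFlight == null && !waiting.isEmpty()) {
                inFlight = new ArrayList<>(waiting);
                waiting.clear();
                sendEnsureToFollowers();  // hypothetical broadcast
            }
        }

        private void sendEnsureToFollowers() { /* leader-side send */ }
    }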
>>> >>
>>> >>
>>> >> On Thu, Sep 27, 2012 at 9:02 AM, Flavio Junqueira <fp...@yahoo-inc.com>
>>> >> wrote:
>>> >>>
>>> >>> Say that we implement what you're suggesting. Could you check if
>>> >>> this scenario can happen:
>>> >>>
>>> >>> 1- Client C1 is the current leader and it issued a super boosted
>>> >>> read to make sure it is still the leader;
>>> >>> 2- We process the super boosted read, having it go through the zab
>>> >>> pipeline;
>>> >>> 3- When we send the response to C1 we slow down the whole deal:
>>> >>> the response to C1 gets delayed and we stall C1;
>>> >>> 4- In the meanwhile, C1's session expires on the server side and
>>> >>> its ephemeral leadership node is removed;
>>> >>> 5- A new client C2 is elected and starts exercising leadership;
>>> >>> 6- Now C1 comes back to normal and receives the response of the
>>> >>> super boosted read saying that it is still the leader.
>>> >>>
>>> >>> If my interpretation is correct, the only way to prevent this
>>> >>> scenario from happening is if the session expires on the client
>>> >>> side before it receives the response of the read. It doesn't look
>>> >>> like we can do it if process clocks can be arbitrarily delayed.
>>> >>>
>>> >>> Note that one issue is that the behavior of ephemerals is highly
>>> >>> dependent upon timers, so I don't think we can avoid making some
>>> >>> timing assumptions altogether. The question is if we are better
>>> >>> off with a mechanism relying upon acknowledgements. My sense is
>>> >>> that application-level fencing is preferable (if not necessary)
>>> >>> for applications like the ones JC is mentioning or BookKeeper.
>>> >>>
>>> >>> I'm not concerned about writes to disk, which I agree we don't
>>> >>> need for sync. I'm more concerned about having it go through the
>>> >>> whole pipeline, which will induce more traffic to zab and increase
>>> >>> latency for an application that uses it heavily.
>>> >>>
>>> >>> -Flavio
>>> >>>
>>> >>> On Sep 27, 2012, at 5:27 PM, Alexander Shraer wrote:
>>> >>>
>>> >>> > another idea is to add this functionality to MultiOp - have
>>> >>> > read-only transactions be replicated but not logged, or logged
>>> >>> > asynchronously. I'm not sure how it works right now if I do a
>>> >>> > read-only MultiOp transaction - does it replicate the
>>> >>> > transaction or answer it locally on the leader?
>>> >>> >
>>> >>> > Alex
>>> >>> >
>>> >>> > On Thu, Sep 27, 2012 at 8:07 AM, Alexander Shraer
>>> >>> > <sh...@gmail.com>
>>> >>> > wrote:
>>> >>> >> Thanks for the explanation.
>>> >>> >>
>>> >>> >> I guess one could always invoke a write operation instead of
>>> >>> >> sync to get the more strict semantics, but as John suggests, it
>>> >>> >> might be a good idea to add a new type of operation that
>>> >>> >> requires followers to ack but doesn't require them to log to
>>> >>> >> disk - this seems sufficient in our case.
>>> >>> >>
>>> >>> >> Alex
>>> >>> >>
>>> >>> >> On Thu, Sep 27, 2012 at 3:56 AM, Flavio Junqueira
>>> >>> >> <fp...@yahoo-inc.com>
>>> >>> >> wrote:
>>> >>> >>> In theory, the scenario you're describing could happen, but I
>>> >>> >>> would argue that it is unlikely given that: 1) a leader pings
>>> >>> >>> followers twice a tick to make sure that it has a quorum of
>>> >>> >>> supporters (lead()); 2) followers give up on a leader upon
>>> >>> >>> catching an exception (followLeader()). One could calibrate
>>> >>> >>> tickTime to make the probability of having this scenario low.
>>> >>> >>>
>>> >>> >>> Let me also revisit the motivation for the way we designed
>>> >>> >>> sync. ZooKeeper has been designed to serve reads efficiently,
>>> >>> >>> and making sync go through the pipeline would slow down reads.
>>> >>> >>> Although optional, we thought it would be a good idea to make
>>> >>> >>> it as efficient as possible to comply with the original
>>> >>> >>> expectations for the service. We consequently came up with this
>>> >>> >>> cheap way of making sure that a read sees all pending updates.
>>> >>> >>> It is correct that there are some corner cases that it doesn't
>>> >>> >>> cover. One is the case you mentioned. Another is having the
>>> >>> >>> sync finish before the client submits the read and having a
>>> >>> >>> write commit in between. We rely upon the way we implement
>>> >>> >>> timeouts and some minimum degree of synchrony for the clients
>>> >>> >>> when submitting operations to guarantee that the scheme works.
>>> >>> >>>
>>> >>> >>> We thought about the option of having the sync operation go
>>> >>> >>> through the pipeline, and in fact it would have been easier to
>>> >>> >>> implement it just as a regular write, but we opted not to
>>> >>> >>> because we felt it was sufficient for the use cases we had and
>>> >>> >>> more efficient, as I already argued.
>>> >>> >>>
>>> >>> >>> Hope it helps to clarify.
>>> >>> >>>
>>> >>> >>> -Flavio
>>> >>> >>>
>>> >>> >>> On Sep 27, 2012, at 9:38 AM, Alexander Shraer wrote:
>>> >>> >>>
>>> >>> >>>> thanks for the explanation! but how do you avoid having the
>>> >>> >>>> scenario raised by John? Let's say you're a client connected
>>> >>> >>>> to F, and F is connected to L. Let's also say that L's
>>> >>> >>>> pipeline is now empty, and both F and L are partitioned from 3
>>> >>> >>>> other servers in the system that have already elected a new
>>> >>> >>>> leader L'. Now I go to L' and write something. L still thinks
>>> >>> >>>> it's the leader because the detection that followers left it
>>> >>> >>>> is obviously timeout dependent. So when F sends your sync to L
>>> >>> >>>> and L returns it to F, you actually miss my write!
>>> >>> >>>>
>>> >>> >>>> Alex
>>> >>> >>>>
>>> >>> >>>> On Thu, Sep 27, 2012 at 12:32 AM, Flavio Junqueira
>>> >>> >>>> <fp...@yahoo-inc.com> wrote:
>>> >>> >>>>> Hi Alex, Because of the following:
>>> >>> >>>>>
>>> >>> >>>>> 1- A follower F processes operations from a client in FIFO
>>> >>> >>>>> order, and say that a client submits as you say sync + read;
>>> >>> >>>>> 2- A sync will be processed by the leader and returned to the
>>> >>> >>>>> follower. It will be queued after all pending updates that
>>> >>> >>>>> the follower hasn't processed;
>>> >>> >>>>> 3- The follower will process all pending updates before
>>> >>> >>>>> processing the response of the sync;
>>> >>> >>>>> 4- Once the follower processes the sync, it picks the read
>>> >>> >>>>> operation to process. It reads the local state of the
>>> >>> >>>>> follower and returns to the client.
>>> >>> >>>>>
>>> >>> >>>>> When we process the read in Step 4, we have applied all
>>> >>> >>>>> pending updates the leader had for the follower by the time
>>> >>> >>>>> the read request started.
>>> >>> >>>>>
>>> >>> >>>>> This implementation is a bit of a hack because it doesn't
>>> >>> >>>>> follow the same code path as the other operations that go to
>>> >>> >>>>> the leader, but it avoids some unnecessary steps, which is
>>> >>> >>>>> important for fast reads. In the sync case, the other
>>> >>> >>>>> followers don't really need to know about it (there is
>>> >>> >>>>> nothing to be updated) and the leader simply inserts it in
>>> >>> >>>>> the sequence of updates of F, ordering it.
>>> >>> >>>>>
>>> >>> >>>>> -Flavio
>>> >>> >>>>>
>>> >>> >>>>> On Sep 27, 2012, at 9:12 AM, Alexander Shraer wrote:
>>> >>> >>>>>
>>> >>> >>>>>> Hi Flavio,
>>> >>> >>>>>>
>>> >>> >>>>>>> Starting a read operation concurrently with a sync
>>> >>> >>>>>>> implies that the result of the read will not miss an update
>>> >>> >>>>>>> committed before the read started.
>>> >>> >>>>>>
>>> >>> >>>>>> I thought that the intention of sync was to give something
>>> >>> >>>>>> like linearizable reads, so if you invoke a sync and then a
>>> >>> >>>>>> read, your read is guaranteed to (at least) see any write
>>> >>> >>>>>> which completed before the sync began. Is this the
>>> >>> >>>>>> intention? If so, how is this achieved without running
>>> >>> >>>>>> agreement on the sync op?
>>> >>> >>>>>>
>>> >>> >>>>>> Thanks,
>>> >>> >>>>>> Alex
>>> >>> >>>>>>
>>> >>> >>>>>> On Thu, Sep 27, 2012 at 12:05 AM, Flavio Junqueira
>>> >>> >>>>>> <fp...@yahoo-inc.com> wrote:
>>> >>> >>>>>>> sync simply flushes the channel between the leader and the
>>> >>> >>>>>>> follower that forwarded the sync operation, so it doesn't
>>> >>> >>>>>>> go through the full zab pipeline. Flushing means that all
>>> >>> >>>>>>> pending updates from the leader to the follower are
>>> >>> >>>>>>> received by the time sync completes. Starting a read
>>> >>> >>>>>>> operation concurrently with a sync implies that the result
>>> >>> >>>>>>> of the read will not miss an update committed before the
>>> >>> >>>>>>> read started.
>>> >>> >>>>>>>
>>> >>> >>>>>>> -Flavio
>>> >>> >>>>>>>
>>> >>> >>>>>>> On Sep 27, 2012, at 3:43 AM, Alexander Shraer wrote:
>>> >>> >>>>>>>
>>> >>> >>>>>>>> It's strange that sync doesn't run through agreement; I
>>> >>> >>>>>>>> was always assuming that it does... Exactly for the reason
>>> >>> >>>>>>>> you say - you may trust your leader, but I may have a
>>> >>> >>>>>>>> different leader, and your leader may not detect it yet
>>> >>> >>>>>>>> and still think it's the leader.
>>> >>> >>>>>>>>
>>> >>> >>>>>>>> This seems like a bug to me.
>>> >>> >>>>>>>>
>>> >>> >>>>>>>> Similarly to Paxos, Zookeeper's safety guarantees don't
>>> >>> >>>>>>>> (or shouldn't) depend on timing assumptions.
>>> >>> >>>>>>>> Only progress guarantees depend on time.
>>> >>> >>>>>>>>
>>> >>> >>>>>>>> Alex
>>> >>> >>>>>>>>
>>> >>> >>>>>>>>
>>> >>> >>>>>>>> On Wed, Sep 26, 2012 at 4:41 PM, John Carrino
>>> >>> >>>>>>>> <jo...@gmail.com> wrote:
>>> >>> >>>>>>>>> I have some pretty strong requirements in terms of
>>> >>> >>>>>>>>> consistency, where reading from followers that may be
>>> >>> >>>>>>>>> behind in terms of updates isn't ok for my use case.
>>> >>> >>>>>>>>>
>>> >>> >>>>>>>>> One error case that worries me is if a follower and
>>> >>> >>>>>>>>> leader are partitioned off from the network.  A new
>>> >>> >>>>>>>>> leader is elected, but the follower and old leader don't
>>> >>> >>>>>>>>> know about it.
>>> >>> >>>>>>>>>
>>> >>> >>>>>>>>> Normally I think sync was made for this purpose, but I
>>> >>> >>>>>>>>> looked at the sync code, and if there aren't any
>>> >>> >>>>>>>>> outstanding proposals the leader sends the sync right
>>> >>> >>>>>>>>> back to the client without first verifying that it still
>>> >>> >>>>>>>>> has quorum, so this won't work for my use case.
>>> >>> >>>>>>>>>
>>> >>> >>>>>>>>> At the core of the issue, all I really need is a call
>>> >>> >>>>>>>>> that will make its way to the leader and will ping its
>>> >>> >>>>>>>>> followers, ensure it still has a quorum and return
>>> >>> >>>>>>>>> success.
>>> >>> >>>>>>>>>
>>> >>> >>>>>>>>> Basically a getCurrentLeaderEpoch() method that will be
>>> >>> >>>>>>>>> forwarded to the leader; the leader will ensure it still
>>> >>> >>>>>>>>> has quorum and return its epoch.  I can use this
>>> >>> >>>>>>>>> primitive to implement all the other properties I want to
>>> >>> >>>>>>>>> verify (assuming that my client will never connect to an
>>> >>> >>>>>>>>> older epoch after this call returns). Also the nice thing
>>> >>> >>>>>>>>> about this method is that it will not have to hit disk,
>>> >>> >>>>>>>>> and the latency should just be a round trip to the
>>> >>> >>>>>>>>> followers.
>>> >>> >>>>>>>>>
>>> >>> >>>>>>>>> Most of the guarantees offered by zookeeper are time
>>> >>> >>>>>>>>> based and rely on clocks and expiring timers, but I'm
>>> >>> >>>>>>>>> hoping to offer some guarantees in spite of busted
>>> >>> >>>>>>>>> clocks, horrible GC perf, VM suspends, and any other way
>>> >>> >>>>>>>>> time is broken.
>>> >>> >>>>>>>>>
>>> >>> >>>>>>>>> Also if people are interested I can go into more detail
>>> >>> >>>>>>>>> about what I am trying to write.
>>> >>> >>>>>>>>>
>>> >>> >>>>>>>>> -jc
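
Sketching how the primitive John asks for would be used for fencing
(getCurrentLeaderEpoch()/ensureQuorum() are proposed names from this
thread, not existing ZooKeeper calls, and the ephemeral path is
hypothetical):

    import org.apache.zookeeper.ZooKeeper;

    // Hand out a timestamp only after a quorum round trip plus an
    // ephemeral-node check confirm we are still the leader; if the
    // check fails, the timestamp must not be used.
    long guardedTimestamp(ZooKeeper zk, long candidate) throws Exception {
        ensureQuorum();  // proposed: leader round-trips its followers
        if (zk.exists("/election/myEphemNode", false) == null) {
            throw new IllegalStateException("no longer the leader");
        }
        return candidate;
    }

    void ensureQuorum() { /* proposed primitive, see above */ }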
>>> >>> >>>>>>>
>>> >>> >>>>>
>>> >>> >>>
>>> >>>
>>> >>
>>> >
>>
>>
>

Re: sync semantics

Posted by Patrick Hunt <ph...@apache.org>.
Hi Alex, Hongchao, I agree with the assertion that we should favor
correctness here.

Patrick

On Mon, Mar 9, 2015 at 9:10 PM, Alexander Shraer <sh...@gmail.com> wrote:
> Bumping this thread as it was brought up again by Hongchao in the
> context of failing tests.
> I agree with him that it's much better to make sync a quorum
> operation; however, this may come at the expense of slightly slower
> strong reads. I think this is acceptable, since if someone does
> "sync + read" he cares more about semantics than latency.

Re: sync semantics

Posted by Alexander Shraer <sh...@gmail.com>.
Bumping this thread as it was brought up again by Hongchao in the
context of failing tests.
I agree with him that it's much better to make sync a quorum
operation; however, this may come at the expense of slightly slower
strong reads. I think this is acceptable, since if someone does "sync
+ read" he cares more about semantics than latency.

On Fri, Mar 1, 2013 at 7:06 PM, Alexander Shraer <sh...@gmail.com> wrote:

> This is an old thread (below), but it doesn't seem like any conclusion
> was reached on what we want to do to address the issue.
>
> Reminder of the problem: sync only gets you strong semantics if there
> is no leader change. If there is a leader change,
> then these semantics are guaranteed only if we make some timing
> assumptions, not made elsewhere in ZooKeeper. It would be much
> better not to make timing assumptions for such safety/consistency
> properties, only for liveness.
>
> The problem happens when your leader is no longer the leader but
> doesn't know it yet. He responds to a sync, but that doesn't mean
> you follower sees all committed state. Some other server may have
> already become the leader and committed some updates, which the sync
> won't flush to
> your follower, which is still connected to the old leader.
>
> To prevent this we should broadcast the sync like updates, or
> piggyback them on other ops, or perhaps create a new type of sync that
> is broadcasted.
>
> As Ben pointed out, this problem is also mentioned in Section 4.4 of
> the ZooKeeper peper (but the proposed solution there is insufficient
> to solve the issue, as
> discussed below).
>
> Alex
>
>
> On Fri, Sep 28, 2012 at 4:45 PM, John Carrino <jo...@gmail.com>
> wrote:
> > Ben, after thinking about this more. I don't think this solution gets the
> > property that I need.  Just because there are outstanding proposals that
> are
> > committed later doesn't imply we are still the leader.  It only means
> that
> > when the new leader does recovery it will also see these proposals as
> > committed.
> >
> > Let's say we have a 5 node cluster and L1 has one pending request out.
> F2-F5
> > are followers. We get back an ack from F2.  Now F5 and L1 are partitioned
> > off from the network along with client C1.
> >
> > Recovery happens on F2-F4 and F2 becomes L2.  During recovery this
> proposal
> > is accepted because F2 had acked it.  Now L2 does a bunch of stuff
> including
> > deleting your ephemeral node.
> >
> > Now a sync comes in from C1 through F5. Now L1 finally gets that ack
> from F5
> > and goes ahead and commits it and responds to the outstanding sync
> request
> > to C1.
> >
> > We can see with this ordering there isn't a happens after relationship
> > between the sync request and knowing about all commits that occurred
> before
> > the sync request.
> >
> > Yes, I realize that this ordering is unlikely to happen in practice, but
> I
> > hate trusting time for anything.
> >
> > -jc
> >
> > On Fri, Sep 28, 2012 at 7:31 AM, John Carrino <jo...@gmail.com>
> > wrote:
> >>
> >> This seems like a good compromise.  We still have to eat the latency of
> a
> >> write, but we easily achieve smart batching in this case so many
> outstanding
> >> sync can all be serviced by the same lastPending request.
> >>
> >> -jc
> >>
> >>
> >> On Thu, Sep 27, 2012 at 11:17 PM, Benjamin Reed <be...@gmail.com>
> >> wrote:
> >>>
> >>> there is a very easy solution to this. we only rely on clocks in the
> >>> case that there are no pending transactions. (if there are pending
> >>> transactions, the sync will only return if in fact the leader is still
> >>> the leader, otherwise the transaction that the sync is waiting on will
> >>> never commit and the sync will never return.)
> >>>
> >>> so, if there aren't any transactions, just submit one. make it a bogus
> >>> one: create / for example. then queue the sync behind it.
> >>>
> >>> ben
> >>>
> >>> ps - we bring up this issue and the solution and the rational for the
> >>> current implementation in section 4.4 of the zookeeper usenix paper.
> >>>
> >>> On Thu, Sep 27, 2012 at 9:57 AM, John Carrino <jo...@gmail.com>
> >>> wrote:
> >>> > So I think it's time to explain what I'm writing just so everyone has
> >>> > more
> >>> > situation awareness.  Its just a timestamp server, nothing fancy.
> >>> >
> >>> > Looks like this:
> >>> >
> >>> > public interface TimestampService {
> >>> >     /**
> >>> >      * This will get a fresh timestamp that is guarenteed to be newer
> >>> > than
> >>> > any other timestamp
> >>> >      * handed out before this method was called.
> >>> >      */
> >>> >     long getFreshTimestamp();
> >>> > }
> >>> >
> >>> > The only requirement is that the timestamp handed back is greater
> than
> >>> > every
> >>> > other timestamp that was returned before getFreshTs was called.
> There
> >>> > is no
> >>> > ordering requirement for concurrent requests.
> >>> >
> >>> > My impl is to reserve blocks of timestamps that are safe to hand out
> >>> > (1M at
> >>> > a time) using compare and swap in ZK.
> >>> > lastPossibleUsed = read(HighWater)
> >>> > safeToHandout = compareAndSwap(lastPossibleUsed, lastPossibleUsed+1M)
> >>> >
> >>> > Now my leader can hand back timestamps up to safeToHandout, but
> before
> >>> > it
> >>> > hands one out it must ensure it is still the leader (no one else has
> >>> > handed
> >>> > back something higher).
> >>> > I can use ensureQuorum(), exists(myEphemNode) to make sure this is
> the
> >>> > case.
> >>> > Now I have a service that is guarenteed to be correct, but doesn't
> >>> > require
> >>> > disk hits in the steady state which brings down my latency (if you
> get
> >>> > close
> >>> > to running out, you can compareAndSwap for more timestamps).
> >>> >
> >>> > If many requests come in at the same time I can use smart batching to
> >>> > verify
> >>> > happens after for all at once.  We can also add more layers if we
> need
> >>> > more
> >>> > bandwidth to scale up at the cost of adding latency.  Basically our
> >>> > latency
> >>> > will be O(lg(requestRate)) if we keep adding layers as each previous
> >>> > layer
> >>> > becomes saturated.
> >>> >
> >>> > I hope this explanation helps. I am busy for the next 4 hours, but if
> >>> > you
> >>> > need more clarification I can respond to them at that time.
> >>> >
> >>> > -jc
> >>> >
> >>> >
> >>> > On Thu, Sep 27, 2012 at 9:26 AM, John Carrino <
> john.carrino@gmail.com>
> >>> > wrote:
> >>> >>
> >>> >> First, thanks everyone for talking this through with me.
> >>> >>
> >>> >> Flavio, for your example, this is actually ok.  There is a happens
> >>> >> after
> >>> >> relationship between the client making the request and my leader C1
> >>> >> still
> >>> >> being the leader.  My service only needs to guarantee that what it
> >>> >> hands
> >>> >> back is at least as new as anything that existed when the client
> made
> >>> >> the
> >>> >> request.  If C2 were to answer requests while C1 is stalling that is
> >>> >> ok
> >>> >> because these would be considered concurrent requests and the stuff
> >>> >> returned
> >>> >> by C2 may be newer but that doesn't violate any guarentees.
> >>> >>
> >>> >> If some client were to get back something from C2 and then (happens
> >>> >> after
> >>> >> relationship) someone tried to read from C1, it needs to fail.
> >>> >>
> >>> >> To address your concern of adding too much bandwidth we can get this
> >>> >> easily by doing what Martin Thompson calls smart batching
> >>> >> (
> http://mechanical-sympathy.blogspot.com/2011/10/smart-batching.html).
> >>> >>
> >>> >> 1. ensureQuorum request comes in to L1
> >>> >> 2. send ENSURE to all followers
> >>> >> 3. 10 more ensureQuorum requests come in
> >>> >> 4. get back ENSURE from quorum
> >>> >> 5. we can now service all 10 pending ensureQuorum requests with
> >>> >> another
> >>> >> round trip ENSURE.
> >>> >>
> >>> >> We don't need to send an ENSURE for every ensureQuorum request, we
> >>> >> just
> >>> >> need it to be happens after from when the request arrived.
> >>> >>
> >>> >> I am fine with the Ephemeral node being removed after some time
> >>> >> expires,
> >>> >> but only by the leader.  If the leaders clock is broken and the
> client
> >>> >> owning the Ephemeral node drops off, then we don't have liveness
> >>> >> (because
> >>> >> this node may not get cleaned up in a timely fashion).  However, we
> >>> >> still
> >>> >> preserve corectness.
> >>> >>
> >>> >> -jc
> >>> >>
> >>> >>
> >>> >> On Thu, Sep 27, 2012 at 9:02 AM, Flavio Junqueira <
> fpj@yahoo-inc.com>
> >>> >> wrote:
> >>> >>>
> >>> >>> Say that we implement what you're suggesting. Could you check if
> this
> >>> >>> scenario can happen:
> >>> >>>
> >>> >>> 1- Client C1 is the current leader and it super boosted read to
> make
> >>> >>> sure
> >>> >>> it is still the leader;
> >>> >>> 2- We process the super boosted read having it through the zab
> >>> >>> pipeline;
> >>> >>> 3- When we send the response to C1 we slow down the whole deal: the
> >>> >>> response to C1 gets delayed and we stall C1;
> >>> >>> 4- In the meanwhile, C1's session expires on the server side and
> its
> >>> >>> ephemeral leadership node is removed;
> >>> >>> 5- A new client C2 is elected and starts exercising leadership;
> >>> >>> 6- Now C1 comes back to normal and receives the response of the
> super
> >>> >>> boosted read saying that it is still the leader.
> >>> >>>
> >>> >>> If my interpretation is not incorrect, the only way to prevent this
> >>> >>> scenario from happening is if the session expires on the client
> side
> >>> >>> before
> >>> >>> it receives the response of the read. It doesn't look like we can
> do
> >>> >>> it if
> >>> >>> process clocks can be arbitrarily delayed.
> >>> >>>
> >>> >>> Note that one issue is that the behavior of ephemerals is highly
> >>> >>> dependent upon timers, so I don't think we can avoid making some
> >>> >>> timing
> >>> >>> assumptions altogether. The question is if we are better off with a
> >>> >>> mechanism relying upon acknowledgements. My sense is that
> >>> >>> application-level
> >>> >>> fencing is preferable (if not necessary) for applications like the
> >>> >>> ones JC
> >>> >>> is mentioning or BookKeeper.
> >>> >>>
> >>> >>> I'm not concerned about writes to disk, which I agree we don't need
> >>> >>> for
> >>> >>> sync. I'm more concerned about having it going through the whole
> >>> >>> pipeline,
> >>> >>> which will induce more traffic to zab and increase latency for an
> >>> >>> application that uses it heavily.
> >>> >>>
> >>> >>> -Flavio
> >>> >>>
> >>> >>> On Sep 27, 2012, at 5:27 PM, Alexander Shraer wrote:
> >>> >>>
> >>> >>> > another idea is to add this functionality to MultiOp - have read
> >>> >>> > only
> >>> >>> > transactions be replicated but not logged or logged
> asynchronously.
> >>> >>> > I'm not sure how it works right now if I do a read-only MultiOp
> >>> >>> > transaction - does it replicate the transaction or answer it
> >>> >>> > locally
> >>> >>> > on the leader ?
> >>> >>> >
> >>> >>> > Alex
> >>> >>> >
> >>> >>> > On Thu, Sep 27, 2012 at 8:07 AM, Alexander Shraer
> >>> >>> > <sh...@gmail.com>
> >>> >>> > wrote:
> >>> >>> >> Thanks for the explanation.
> >>> >>> >>
> >>> >>> >> I guess one could always invoke a write operation instead of
> sync
> >>> >>> >> to
> >>> >>> >> get the more strict semantics, but as John suggests, it might
> be a
> >>> >>> >> good idea to add a new type of operation that requires followers
> >>> >>> >> to
> >>> >>> >> ack but doesn't require them to log to disk - this seems
> >>> >>> >> sufficient in
> >>> >>> >> our case.
> >>> >>> >>
> >>> >>> >> Alex
> >>> >>> >>
> >>> >>> >> On Thu, Sep 27, 2012 at 3:56 AM, Flavio Junqueira
> >>> >>> >> <fp...@yahoo-inc.com>
> >>> >>> >> wrote:
> >>> >>> >>> In theory, the scenario you're describing could happen, but I
> >>> >>> >>> would argue that it is unlikely given that: 1) a leader pings
> >>> >>> >>> followers twice a tick to make sure that it has a quorum of
> >>> >>> >>> supporters (lead()); 2) followers give up on a leader upon
> >>> >>> >>> catching an exception (followLeader()). One could calibrate
> >>> >>> >>> tickTime to make the probability of having this scenario low.
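> >>> >>> >>>
> >>> >>> >>> For instance, in zoo.cfg (values are only illustrative, not a
> >>> >>> >>> recommendation):
> >>> >>> >>>
> >>> >>> >>>   # The leader pings followers every tickTime/2 ms; a smaller
> >>> >>> >>>   # tickTime shrinks the window in which a deposed leader can
> >>> >>> >>>   # still believe it has a quorum.
> >>> >>> >>>   tickTime=1000
> >>> >>> >>>   initLimit=10
> >>> >>> >>>   syncLimit=5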
> >>> >>> >>>
> >>> >>> >>> Let me also revisit the motivation for the way we designed sync.
> >>> >>> >>> ZooKeeper has been designed to serve reads efficiently, and
> >>> >>> >>> making sync go through the pipeline would slow down reads.
> >>> >>> >>> Although optional, we thought it would be a good idea to make it
> >>> >>> >>> as efficient as possible to comply with the original expectations
> >>> >>> >>> for the service. We consequently came up with this cheap way of
> >>> >>> >>> making sure that a read sees all pending updates. It is correct
> >>> >>> >>> that there are some corner cases that it doesn't cover. One is
> >>> >>> >>> the case you mentioned. Another is having the sync finish before
> >>> >>> >>> the client submits the read and having a write commit in between.
> >>> >>> >>> We rely upon the way we implement timeouts and some minimum
> >>> >>> >>> degree of synchrony for the clients when submitting operations to
> >>> >>> >>> guarantee that the scheme works.
> >>> >>> >>>
> >>> >>> >>> We thought about the option of having the sync operation go
> >>> >>> >>> through the pipeline, and in fact it would have been easier to
> >>> >>> >>> implement it just as a regular write, but we opted not to
> >>> >>> >>> because we felt it was sufficient for the use cases we had and
> >>> >>> >>> more efficient, as I already argued.
> >>> >>> >>>
> >>> >>> >>> Hope it helps to clarify.
> >>> >>> >>>
> >>> >>> >>> -Flavio
> >>> >>> >>>
> >>> >>> >>> On Sep 27, 2012, at 9:38 AM, Alexander Shraer wrote:
> >>> >>> >>>
> >>> >>> >>>> thanks for the explanation! but how do you avoid having the
> >>> >>> >>>> scenario raised by John? Let's say you're a client connected to
> >>> >>> >>>> F, and F is connected to L. Let's also say that L's pipeline is
> >>> >>> >>>> now empty, and both F and L are partitioned from 3 other
> >>> >>> >>>> servers in the system that have already elected a new leader
> >>> >>> >>>> L'. Now I go to L' and write something. L still thinks it's the
> >>> >>> >>>> leader because the detection that followers left it is
> >>> >>> >>>> obviously timeout dependent. So when F sends your sync to L and
> >>> >>> >>>> L returns it to F, you actually miss my write!
> >>> >>> >>>>
> >>> >>> >>>> Alex
> >>> >>> >>>>
> >>> >>> >>>> On Thu, Sep 27, 2012 at 12:32 AM, Flavio Junqueira
> >>> >>> >>>> <fp...@yahoo-inc.com> wrote:
> >>> >>> >>>>> Hi Alex, Because of the following:
> >>> >>> >>>>>
> >>> >>> >>>>> 1- A follower F processes operations from a client in FIFO
> >>> >>> >>>>> order, and say that a client submits, as you say, sync + read;
> >>> >>> >>>>> 2- A sync will be processed by the leader and returned to the
> >>> >>> >>>>> follower. It will be queued after all pending updates that the
> >>> >>> >>>>> follower hasn't processed;
> >>> >>> >>>>> 3- The follower will process all pending updates before
> >>> >>> >>>>> processing the response of the sync;
> >>> >>> >>>>> 4- Once the follower processes the sync, it picks the read
> >>> >>> >>>>> operation to process. It reads the local state of the follower
> >>> >>> >>>>> and returns to the client.
> >>> >>> >>>>>
> >>> >>> >>>>> When we process the read in Step 4, we have applied all
> >>> >>> >>>>> pending updates the leader had for the follower by the time
> >>> >>> >>>>> the read request started.
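> >>> >>> >>>>>
> >>> >>> >>>>> In client terms, the pattern looks like this (a sketch; error
> >>> >>> >>>>> handling of the callback's return code is omitted):
> >>> >>> >>>>>
> >>> >>> >>>>>   import org.apache.zookeeper.AsyncCallback.VoidCallback;
> >>> >>> >>>>>   import org.apache.zookeeper.ZooKeeper;
> >>> >>> >>>>>
> >>> >>> >>>>>   // FIFO ordering of a session's requests means the getData
> >>> >>> >>>>>   // below is processed only after the sync response, i.e.
> >>> >>> >>>>>   // after all updates pending at the leader were applied.
> >>> >>> >>>>>   static byte[] syncThenRead(ZooKeeper zk, String path)
> >>> >>> >>>>>           throws Exception {
> >>> >>> >>>>>       zk.sync(path, new VoidCallback() {
> >>> >>> >>>>>           public void processResult(int rc, String p, Object c) {
> >>> >>> >>>>>               // leader->follower channel flushed
> >>> >>> >>>>>           }
> >>> >>> >>>>>       }, null);
> >>> >>> >>>>>       return zk.getData(path, false, null);
> >>> >>> >>>>>   }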
> >>> >>> >>>>>
> >>> >>> >>>>> This implementation is a bit of a hack because it doesn't
> >>> >>> >>>>> follow the same code path as the other operations that go to
> >>> >>> >>>>> the leader, but it avoids some unnecessary steps, which is
> >>> >>> >>>>> important for fast reads. In the sync case, the other
> >>> >>> >>>>> followers don't really need to know about it (there is nothing
> >>> >>> >>>>> to be updated), and the leader simply inserts it in the
> >>> >>> >>>>> sequence of updates of F, ordering it.
> >>> >>> >>>>>
> >>> >>> >>>>> -Flavio
> >>> >>> >>>>>
> >>> >>> >>>>> On Sep 27, 2012, at 9:12 AM, Alexander Shraer wrote:
> >>> >>> >>>>>
> >>> >>> >>>>>> Hi Flavio,
> >>> >>> >>>>>>
> >>> >>> >>>>>>> Starting a read operation concurrently with a sync implies
> >>> >>> >>>>>>> that the result of the read will not miss an update
> >>> >>> >>>>>>> committed before the read started.
> >>> >>> >>>>>>
> >>> >>> >>>>>> I thought that the intention of sync was to give something
> >>> >>> >>>>>> like linearizable reads, so if you invoke a sync and then a
> >>> >>> >>>>>> read, your read is guaranteed to (at least) see any write
> >>> >>> >>>>>> which completed before the sync began. Is this the intention?
> >>> >>> >>>>>> If so, how is this achieved without running agreement on the
> >>> >>> >>>>>> sync op?
> >>> >>> >>>>>>
> >>> >>> >>>>>> Thanks,
> >>> >>> >>>>>> Alex
> >>> >>> >>>>>>
> >>> >>> >>>>>> On Thu, Sep 27, 2012 at 12:05 AM, Flavio Junqueira
> >>> >>> >>>>>> <fp...@yahoo-inc.com> wrote:
> >>> >>> >>>>>>> sync simply flushes the channel between the leader and the
> >>> >>> >>>>>>> follower that forwarded the sync operation, so it doesn't go
> >>> >>> >>>>>>> through the full zab pipeline. Flushing means that all
> >>> >>> >>>>>>> pending updates from the leader to the follower are received
> >>> >>> >>>>>>> by the time sync completes. Starting a read operation
> >>> >>> >>>>>>> concurrently with a sync implies that the result of the read
> >>> >>> >>>>>>> will not miss an update committed before the read started.
> >>> >>> >>>>>>>
> >>> >>> >>>>>>> -Flavio
> >>> >>> >>>>>>>
> >>> >>> >>>>>>> On Sep 27, 2012, at 3:43 AM, Alexander Shraer wrote:
> >>> >>> >>>>>>>
> >>> >>> >>>>>>>> It's strange that sync doesn't run through agreement; I was
> >>> >>> >>>>>>>> always assuming that it does... Exactly for the reason you
> >>> >>> >>>>>>>> say - you may trust your leader, but I may have a different
> >>> >>> >>>>>>>> leader, and your leader may not have detected it yet and
> >>> >>> >>>>>>>> still think it's the leader.
> >>> >>> >>>>>>>>
> >>> >>> >>>>>>>> This seems like a bug to me.
> >>> >>> >>>>>>>>
> >>> >>> >>>>>>>> Similarly to Paxos, ZooKeeper's safety guarantees don't (or
> >>> >>> >>>>>>>> shouldn't) depend on timing assumptions.
> >>> >>> >>>>>>>> Only progress guarantees depend on time.
> >>> >>> >>>>>>>>
> >>> >>> >>>>>>>> Alex
> >>> >>> >>>>>>>>
> >>> >>> >>>>>>>>
> >>> >>> >>>>>>>> On Wed, Sep 26, 2012 at 4:41 PM, John Carrino
> >>> >>> >>>>>>>> <jo...@gmail.com> wrote:
> >>> >>> >>>>>>>>> I have some pretty strong requirements in terms of
> >>> >>> >>>>>>>>> consistency, where reading from followers that may be
> >>> >>> >>>>>>>>> behind in terms of updates isn't ok for my use case.
> >>> >>> >>>>>>>>>
> >>> >>> >>>>>>>>> One error case that worries me is if a follower and leader
> >>> >>> >>>>>>>>> are partitioned off from the network.  A new leader is
> >>> >>> >>>>>>>>> elected, but the follower and old leader don't know about
> >>> >>> >>>>>>>>> it.
> >>> >>> >>>>>>>>>
> >>> >>> >>>>>>>>> Normally I think sync was made for this purpose, but I
> >>> >>> >>>>>>>>> looked at the sync code, and if there aren't any
> >>> >>> >>>>>>>>> outstanding proposals the leader sends the sync right back
> >>> >>> >>>>>>>>> to the client without first verifying that it still has
> >>> >>> >>>>>>>>> quorum, so this won't work for my use case.
> >>> >>> >>>>>>>>>
> >>> >>> >>>>>>>>> At the core of the issue, all I really need is a call that
> >>> >>> >>>>>>>>> will make its way to the leader and will ping its
> >>> >>> >>>>>>>>> followers, ensure it still has a quorum, and return
> >>> >>> >>>>>>>>> success.
> >>> >>> >>>>>>>>>
> >>> >>> >>>>>>>>> Basically a getCurrentLeaderEpoch() method that will be
> >>> >>> >>>>>>>>> forwarded to the leader; the leader will ensure it still
> >>> >>> >>>>>>>>> has quorum and return its epoch.  I can use this primitive
> >>> >>> >>>>>>>>> to implement all the other properties I want to verify
> >>> >>> >>>>>>>>> (assuming that my client will never connect to an older
> >>> >>> >>>>>>>>> epoch after this call returns). Also the nice thing about
> >>> >>> >>>>>>>>> this method is that it will not have to hit disk, and the
> >>> >>> >>>>>>>>> latency should just be a round trip to the followers.
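> >>> >>> >>>>>>>>>
> >>> >>> >>>>>>>>> In case it helps, a rough sketch of what I have in mind
> >>> >>> >>>>>>>>> (purely hypothetical - Follower, ping and the rest are
> >>> >>> >>>>>>>>> invented names, not ZooKeeper internals):
> >>> >>> >>>>>>>>>
> >>> >>> >>>>>>>>>   import java.util.List;
> >>> >>> >>>>>>>>>   import java.util.concurrent.CountDownLatch;
> >>> >>> >>>>>>>>>   import java.util.concurrent.TimeUnit;
> >>> >>> >>>>>>>>>
> >>> >>> >>>>>>>>>   // Stand-in for the leader's per-follower connection.
> >>> >>> >>>>>>>>>   interface Follower { void ping(Runnable onAck); }
> >>> >>> >>>>>>>>>
> >>> >>> >>>>>>>>>   class LeaderEpochCheck {
> >>> >>> >>>>>>>>>       final List<Follower> followers; // leader excluded
> >>> >>> >>>>>>>>>       final int quorumSize;           // e.g. 3 of 5
> >>> >>> >>>>>>>>>       final long currentEpoch;
> >>> >>> >>>>>>>>>
> >>> >>> >>>>>>>>>       LeaderEpochCheck(List<Follower> fs, int q, long e) {
> >>> >>> >>>>>>>>>           followers = fs; quorumSize = q; currentEpoch = e;
> >>> >>> >>>>>>>>>       }
> >>> >>> >>>>>>>>>
> >>> >>> >>>>>>>>>       // Return the epoch only after a quorum acks a ping
> >>> >>> >>>>>>>>>       // sent *after* this request arrived: no disk write,
> >>> >>> >>>>>>>>>       // latency is one round trip to the followers.
> >>> >>> >>>>>>>>>       long getCurrentLeaderEpoch(long timeoutMs)
> >>> >>> >>>>>>>>>               throws InterruptedException {
> >>> >>> >>>>>>>>>           CountDownLatch acks =
> >>> >>> >>>>>>>>>               new CountDownLatch(quorumSize - 1);
> >>> >>> >>>>>>>>>           for (Follower f : followers) f.ping(acks::countDown);
> >>> >>> >>>>>>>>>           if (!acks.await(timeoutMs, TimeUnit.MILLISECONDS))
> >>> >>> >>>>>>>>>               throw new IllegalStateException("no quorum");
> >>> >>> >>>>>>>>>           return currentEpoch;
> >>> >>> >>>>>>>>>       }
> >>> >>> >>>>>>>>>   }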
> >>> >>> >>>>>>>>>
> >>> >>> >>>>>>>>> Most of the guarantees offered by ZooKeeper are time-based
> >>> >>> >>>>>>>>> and rely on clocks and expiring timers, but I'm hoping to
> >>> >>> >>>>>>>>> offer some guarantees in spite of busted clocks, horrible
> >>> >>> >>>>>>>>> GC perf, VM suspends, and any other way time is broken.
> >>> >>> >>>>>>>>>
> >>> >>> >>>>>>>>> Also, if people are interested, I can go into more detail
> >>> >>> >>>>>>>>> about what I am trying to write.
> >>> >>> >>>>>>>>>
> >>> >>> >>>>>>>>> -jc
> >>> >>> >>>>>>>
> >>> >>> >>>>>
> >>> >>> >>>
> >>> >>>
> >>> >>
> >>> >
> >>
> >>
> >
>