Posted to dev@zookeeper.apache.org by John Carrino <jo...@gmail.com> on 2012/09/27 01:41:04 UTC

Ensure Leader hasn't changed

I have some pretty strong consistency requirements: reading from followers
that may be behind on updates isn't ok for my use case.

One error case that worries me is if a follower and leader are partitioned
off from the network.  A new leader is elected, but the follower and old
leader don't know about it.

Normally I think sync was made for this purpose, but I looked at the sync
code, and if there aren't any outstanding proposals the leader sends the
sync right back to the client without first verifying that it still has
quorum, so this won't work for my use case.

At the core of the issue, all I really need is a call that will make its
way to the leader, which will ping its followers, ensure it still has a
quorum, and return success.

Basically a getCurrentLeaderEpoch() method that will be forwarded to the
leader; the leader will ensure it still has quorum and return its epoch.  I
can use this primitive to implement all the other properties I want to
verify (assuming that my client will never connect to an older epoch after
this call returns). Also the nice thing about this method is that it will
not have to hit disk, and the latency should just be a round trip to the
followers.

Most of the guarantees offered by ZooKeeper are time-based and rely on
clocks and expiring timers, but I'm hoping to offer some guarantees in
spite of busted clocks, horrible GC perf, VM suspends, and any other way
time is broken.

Also if people are interested I can go into more detail about what I am
trying to write.

-jc

Re: Ensure Leader hasn't changed

Posted by Flavio Junqueira <fp...@yahoo-inc.com>.
Hi John, I think the problem you're referring to can be illustrated roughly by the following scenario:

1- Client C1 is your master (let's call it master not to confuse with ZK leader) and it is performing work happily;
2- Client C2 becomes the master and C1 hasn't received any notification of master change yet;
3- Client C1 receives a new request and processes it, which is wrong since it is not the master any longer.

If this is roughly the scenario you'd like to avoid, then you need some form of fencing, which is application-specific. In BookKeeper, for example, we have a simple mechanism to prevent a client writing to a ledger from completing a write successfully once a reader client has decided to close the ledger. This mechanism consists of informing bookies (our storage servers) that they shouldn't process new write requests for the ledger, which prevents the writer from getting a quorum.
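
To give a rough idea of the shape of that check (an illustrative sketch
only, not BookKeeper's actual code), fencing on the storage side amounts to
something like:

    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    // Sketch: once a ledger is fenced, a storage server refuses further
    // writes to it, so the old writer can no longer assemble a quorum.
    class LedgerStore {
        private final Set<Long> fenced = ConcurrentHashMap.newKeySet();

        void fence(long ledgerId) {   // invoked when a reader closes the ledger
            fenced.add(ledgerId);
        }

        void addEntry(long ledgerId, byte[] entry) {
            if (fenced.contains(ledgerId)) {
                throw new IllegalStateException("ledger " + ledgerId + " is fenced");
            }
            // ... append the entry locally ...
        }
    }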

-Flavio 



Re: Ensure Leader hasn't changed

Posted by Ted Dunning <te...@gmail.com>.
Seems like you can have a simpler mechanism.

With ephemeral nodes (I don't think that sequential ephemeral is strictly
needed) nobody is going to become leader while that ephemeral file exists.
Thus, if the master has not received a disconnect notification, it will be
at least several tens of seconds before somebody else can become leader.
If you know that the master won't skip around in time, then you should be
safe.  This guarantee might be violated if the master is on a VM that is
paused at the critical moment, but I think your other implementation has
that same problem.
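
A minimal sketch of that check against the standard Java client (the
connect string and session timeout are illustrative):

    import java.util.concurrent.atomic.AtomicBoolean;
    import org.apache.zookeeper.Watcher.Event.KeeperState;
    import org.apache.zookeeper.ZooKeeper;

    class CautiousLeader {
        // Flips to false the moment the session sees anything other than
        // SyncConnected; the master should stop acting as leader right away.
        final AtomicBoolean maybeLeader = new AtomicBoolean(true);

        ZooKeeper connect() throws Exception {
            return new ZooKeeper("zk1:2181", 30000, event -> {
                if (event.getState() != KeeperState.SyncConnected) {
                    maybeLeader.set(false);
                }
            });
        }
    }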


Re: Ensure Leader hasn't changed

Posted by John Carrino <jo...@gmail.com>.
Basically I am doing leader election using ZK with sequential ephemeral
nodes.  I want a guaranteed way to ensure that my ephemeral node still
exists (no other leader has done work).  Let's call my elected leader L1.
L1 may serve a request if it is the leader at the time the request was
made.  L1 may lose leadership during the request or after it has
responded.  I only need a happens-after relationship between the server
getting the request and checking getCurrentLeaderEpoch() and doing a read
to ensure L1 still has the lowest seq/ephemeral node (no disk writes
needed).
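
For reference, the shape of the election check I mean, sketched against
the standard Java client (the /election path is illustrative):

    import java.util.Collections;
    import java.util.List;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    class Election {
        // Join the election; the ephemeral node disappears with our session.
        static String join(ZooKeeper zk) throws Exception {
            return zk.create("/election/n-", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        }

        // We lead iff our node still has the lowest sequence number.
        static boolean isLeader(ZooKeeper zk, String myNode) throws Exception {
            List<String> children = zk.getChildren("/election", false);
            Collections.sort(children);
            return myNode.endsWith(children.get(0));
        }
    }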

Normally when people think of locking they assume that the lock must be
held throughout the entire request.  In this way distributed locking is
"hard" or it might even be impossible (I haven't really looked into it
formally).  I don't need the lock for the duration of the request to ensure
correctness in my system.  I only require that the leader still held its
lock some time "after" the request was initiated.

I most likely could use ZK as is and won't hit any bugs from this, but I am
kinda OCD when it comes to building this type of infrastructure.

I think of this more as a feature request than a bug because from reading
up it seems like ZK gives you reliable, totally ordered, causal message
delivery.  Doing a write request, then a sync, then a check that my node
still exists would have the property I desire.  However I don't want to
take the perf hit of doing a write.

Thanks!

-jc


On Wed, Sep 26, 2012 at 6:43 PM, Alexander Shraer <sh...@gmail.com> wrote:

> It's strange that sync doesn't run through agreement, I was always
> assuming that it does... Exactly for the reason you say -
> you may trust your leader, but I may have a different leader and your
> leader may not detect it yet and still think it's the leader.
>
> This seems like a bug to me.
>
> Similarly to Paxos, ZooKeeper's safety guarantees don't (or shouldn't)
> depend on timing assumptions.
> Only progress guarantees depend on time.
>
> Alex

Re: Ensure Leader hasn't changed

Posted by Alexander Shraer <sh...@gmail.com>.
I think that the guarantee we still get in your scenario is that when
the "boosted read" completes, the client knows that it got every write
that completed before the boosted read (or the "strong sync" which
preceded it, if we use that idea) was invoked. Of course, as you say, it
doesn't guarantee that it's still the leader, but if the read returns
and says "you're still the leader" then you can process every event
you got before you invoked the read (if that's what your application is
doing) without worrying that someone else can also process these
events.  If I understand correctly this is what John is saying in his
first email.
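
(For concreteness, the sync + read being discussed is the ordinary client
pattern below; a minimal sketch against the standard Java client:)

    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.ZooKeeper;

    class SyncedRead {
        // The client's requests are processed in FIFO order, so the read
        // runs only after the sync has flushed pending updates from the
        // leader to the follower we are connected to.
        static byte[] read(ZooKeeper zk, String path) throws Exception {
            CountDownLatch flushed = new CountDownLatch(1);
            zk.sync(path, (rc, p, ctx) -> flushed.countDown(), null);
            flushed.await();
            return zk.getData(path, false, null);
        }
    }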

It's true that even if we run an agreement without any data it will
slow down the application somewhat, but it's the application's choice
not to use this operation - currently an application that wants strong
semantics already queues its sync behind a list of writes on the
leader - I'm not sure that actually sending this sync out to followers
would add much to the latency of the sync. It will only add
leader-to-followers round-trip latency if the pipeline was already empty.

Alex

On Thu, Sep 27, 2012 at 9:02 AM, Flavio Junqueira <fp...@yahoo-inc.com> wrote:
> Say that we implement what you're suggesting. Could you check if this scenario can happen:
>
> 1- Client C1 is the current leader and it issues a super boosted read to make sure it is still the leader;
> 2- We process the super boosted read, having it go through the zab pipeline;
> 3- When we send the response to C1 we slow down the whole deal: the response to C1 gets delayed and we stall C1;
> 4- In the meanwhile, C1's session expires on the server side and its ephemeral leadership node is removed;
> 5- A new client C2 is elected and starts exercising leadership;
> 6- Now C1 comes back to normal and receives the response of the super boosted read saying that it is still the leader.
>
> If my interpretation is not incorrect, the only way to prevent this scenario from happening is if the session expires on the client side before it receives the response of the read. It doesn't look like we can do it if process clocks can be arbitrarily delayed.
>
> Note that one issue is that the behavior of ephemerals is highly dependent upon timers, so I don't think we can avoid making some timing assumptions altogether. The question is if we are better off with a mechanism relying upon acknowledgements. My sense is that application-level fencing is preferable (if not necessary) for applications like the ones JC is mentioning or BookKeeper.
>
> I'm not concerned about writes to disk, which I agree we don't need for sync. I'm more concerned about having it go through the whole pipeline, which will induce more traffic to zab and increase latency for an application that uses it heavily.
>
> -Flavio
>
> On Sep 27, 2012, at 5:27 PM, Alexander Shraer wrote:
>
>> another idea is to add this functionality to MultiOp - have read only
>> transactions be replicated but not logged or logged asynchronously.
>> I'm not sure how it works right now if I do a read-only MultiOp
>> transaction - does it replicate the transaction or answer it locally
>> on the leader?
>>
>> Alex
>>
>> On Thu, Sep 27, 2012 at 8:07 AM, Alexander Shraer <sh...@gmail.com> wrote:
>>> Thanks for the explanation.
>>>
>>> I guess one could always invoke a write operation instead of sync to
>>> get the more strict semantics, but as John suggests, it might be a
>>> good idea to add a new type of operation that requires followers to
>>> ack but doesn't require them to log to disk - this seems sufficient in
>>> our case.
>>>
>>> Alex
>>>
>>> On Thu, Sep 27, 2012 at 3:56 AM, Flavio Junqueira <fp...@yahoo-inc.com> wrote:
>>>> In theory, the scenario you're describing could happen, but I would argue that it is unlikely given that: 1) a leader pings followers twice a tick to make sure that it has a quorum of supporters (lead()); 2) followers give up on a leader upon catching an exception (followLeader()). One could calibrate tickTime to make the probability of having this scenario low.
>>>>
>>>> Let me also revisit the motivation for the way we designed sync. ZooKeeper has been designed to serve reads efficiently and making sync go through the pipeline would slow down reads. Although optional, we thought it would be a good idea to make it as efficient as possible to comply with the original expectations for the service. We consequently came up with this cheap way of making sure that a read sees all pending updates. It is correct that there are some corner cases that it doesn't cover. One is the case you mentioned. Another is having the sync finish before the client submits the read and having a write commit in between. We rely upon the way we implement timeouts and some minimum degree of synchrony for the clients when submitting operations to guarantee that the scheme works.
>>>>
>>>> We thought about the option of having the sync operation going through the pipeline, and in fact it would have been easier to implement it just as a regular write, but we opted not to because we felt it was sufficient for the use cases we had and more efficient as I already argued.
>>>>
>>>> Hope it helps to clarify.
>>>>
>>>> -Flavio
>>>>
>>>> On Sep 27, 2012, at 9:38 AM, Alexander Shraer wrote:
>>>>
>>>>> Thanks for the explanation! But how do you avoid the scenario
>>>>> raised by John?
>>>>> Let's say you're a client connected to F, and F is connected to L. Let's
>>>>> also say that L's pipeline is now empty, and both F and L are partitioned
>>>>> from 3 other servers in the system that have already elected a new leader
>>>>> L'. Now I go to L' and write something. L still thinks it's the leader,
>>>>> because the detection that followers left it is obviously timeout
>>>>> dependent. So when F sends your sync to L and L returns it to F, you
>>>>> actually miss my write!
>>>>>
>>>>> Alex
>>>>>
>>>>> On Thu, Sep 27, 2012 at 12:32 AM, Flavio Junqueira <fp...@yahoo-inc.com> wrote:
>>>>>> Hi Alex, Because of the following:
>>>>>>
>>>>>> 1- A follower F processes operations from a client in FIFO order, and say that a client submits as you say sync + read;
>>>>>> 2- A sync will be processed by the leader and returned to the follower. It will be queued after all pending updates that the follower hasn't processed;
>>>>>> 3- The follower will process all pending updates before processing the response of the sync;
>>>>>> 4- Once the follower processes the sync, it picks the read operation to process. It reads the local state of the follower and returns to the client.
>>>>>>
>>>>>> When we process the read in Step 4, we have applied all pending updates the leader had for the follower by the time the read request started.
>>>>>>
>>>>>> This implementation is a bit of a hack because it doesn't follow the same code path as the other operations that go to the leader, but it avoids some unnecessary steps, which is important for fast reads. In the sync case, the other followers don't really need to know about it (there is nothing to be updated) and the leader simply inserts it in the sequence of updates of F, ordering it.
>>>>>>
>>>>>> -Flavio
>>>>>>
>>>>>> On Sep 27, 2012, at 9:12 AM, Alexander Shraer wrote:
>>>>>>
>>>>>>> Hi Flavio,
>>>>>>>
>>>>>>>> Starting a read operation concurrently with a sync implies that the result of the read will not miss an update committed before the read started.
>>>>>>>
>>>>>>> I thought that the intention of sync was to give something like
>>>>>>> linearizable reads, so if you invoke a sync and then a read, your read
>>>>>>> is guaranteed to (at least) see any write which completed before the
>>>>>>> sync began. Is this the intention? If so, how is this achieved
>>>>>>> without running agreement on the sync op?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Alex
>>>>>>>
>>>>>>> On Thu, Sep 27, 2012 at 12:05 AM, Flavio Junqueira <fp...@yahoo-inc.com> wrote:
>>>>>>>> sync simply flushes the channel between the leader and the follower that forwarded the sync operation, so it doesn't go through the full zab pipeline. Flushing means that all pending updates from the leader to the follower are received by the time sync completes. Starting a read operation concurrently with a sync implies that the result of the read will not miss an update committed before the read started.
>>>>>>>>
>>>>>>>> -Flavio

Re: Ensure Leader hasn't changed

Posted by John Carrino <jo...@gmail.com>.
Ben, after thinking about this more, I don't think this solution gets the
property that I need.  Just because there are outstanding proposals that
are committed later doesn't imply we are still the leader.  It only means
that when the new leader does recovery it will also see these proposals as
committed.

Let's say we have a 5 node cluster and L1 has one pending request out.
F2-F5 are followers. We get back an ack from F2.  Now F5 and L1
are partitioned off from the network along with client C1.

Recovery happens on F2-F4 and F2 becomes L2.  During recovery this proposal
is accepted because F2 had acked it.  Now L2 does a bunch of stuff
including deleting your ephemeral node.

Now a sync comes in from C1 through F5. Now L1 finally gets that ack from
F5 and goes ahead and commits it and responds to the outstanding sync
request to C1.

We can see that with this ordering there isn't a happens-after relationship
between the sync request and knowing about all commits that occurred before
the sync request.

Yes, I realize that this ordering is unlikely to happen in practice, but I
hate trusting time for anything.

-jc


Re: Ensure Leader hasn't changed

Posted by John Carrino <jo...@gmail.com>.
This seems like a good compromise.  We still have to eat the latency of a
write, but we easily achieve smart batching in this case, so many
outstanding syncs can all be serviced by the same lastPending request.

-jc
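
A minimal sketch of the workaround Ben describes below, against the
standard Java client (the scratch path is illustrative; any innocuous
write works):

    import java.util.concurrent.CountDownLatch;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    class StrongSync {
        static void sync(ZooKeeper zk) throws Exception {
            // Submit a throwaway write asynchronously so a proposal is pending...
            zk.create("/scratch/s-", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE,
                    CreateMode.EPHEMERAL_SEQUENTIAL,
                    (rc, path, ctx, name) -> { /* result ignored */ }, null);
            // ...then queue the sync behind it: it cannot complete unless the
            // leader was still able to commit, i.e. it still had a quorum.
            CountDownLatch done = new CountDownLatch(1);
            zk.sync("/", (rc, path, ctx) -> done.countDown(), null);
            done.await();
        }
    }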


On Thu, Sep 27, 2012 at 11:17 PM, Benjamin Reed <be...@gmail.com> wrote:

> there is a very easy solution to this. we only rely on clocks in the
> case that there are no pending transactions. (if there are pending
> transactions, the sync will only return if in fact the leader is still
> the leader, otherwise the transaction that the sync is waiting on will
> never commit and the sync will never return.)
>
> so, if there aren't any transactions, just submit one. make it a bogus
> one: create / for example. then queue the sync behind it.
>
> ben
>
> ps - we bring up this issue, the solution, and the rationale for the
> current implementation in section 4.4 of the zookeeper usenix paper.
>
> On Thu, Sep 27, 2012 at 9:57 AM, John Carrino <jo...@gmail.com>
> wrote:
> > So I think it's time to explain what I'm writing just so everyone has
> > more situational awareness.  It's just a timestamp server, nothing fancy.
> >
> > Looks like this:
> >
> > public interface TimestampService {
> >     /**
> >      * This will get a fresh timestamp that is guaranteed to be newer
> >      * than any other timestamp handed out before this method was called.
> >      */
> >     long getFreshTimestamp();
> > }
> >
> > The only requirement is that the timestamp handed back is greater than
> > every other timestamp that was returned before getFreshTs was called.
> > There is no ordering requirement for concurrent requests.
> >
> > My impl is to reserve blocks of timestamps that are safe to hand out
> > (1M at a time) using compare and swap in ZK.
> > lastPossibleUsed = read(HighWater)
> > safeToHandout = compareAndSwap(lastPossibleUsed, lastPossibleUsed+1M)
> >
> > Now my leader can hand back timestamps up to safeToHandout, but before
> > it hands one out it must ensure it is still the leader (no one else has
> > handed back something higher).
> > I can use ensureQuorum(), exists(myEphemNode) to make sure this is the
> > case.
> > Now I have a service that is guaranteed to be correct, but doesn't
> > require disk hits in the steady state, which brings down my latency (if
> > you get close to running out, you can compareAndSwap for more
> > timestamps).
> >
> > If many requests come in at the same time I can use smart batching to
> > verify happens-after for all at once.  We can also add more layers if
> > we need more bandwidth to scale up, at the cost of adding latency.
> > Basically our latency will be O(lg(requestRate)) if we keep adding
> > layers as each previous layer becomes saturated.
> >
> > I hope this explanation helps. I am busy for the next 4 hours, but if
> > you need more clarification I can respond at that time.
> >
> > -jc
> >
> >
> > On Thu, Sep 27, 2012 at 9:26 AM, John Carrino <jo...@gmail.com>
> > wrote:
> >>
> >> First, thanks everyone for talking this through with me.
> >>
> >> Flavio, for your example, this is actually ok.  There is a happens-after
> >> relationship between the client making the request and my leader C1
> >> still being the leader.  My service only needs to guarantee that what
> >> it hands back is at least as new as anything that existed when the
> >> client made the request.  If C2 were to answer requests while C1 is
> >> stalling, that is ok because these would be considered concurrent
> >> requests, and the stuff returned by C2 may be newer but that doesn't
> >> violate any guarantees.
> >>
> >> If some client were to get back something from C2 and then (happens-after
> >> relationship) someone tried to read from C1, it needs to fail.
> >>
> >> To address your concern of adding too much bandwidth we can get this
> >> easily by doing what Martin Thompson calls smart batching
> >> (http://mechanical-sympathy.blogspot.com/2011/10/smart-batching.html).
> >>
> >> 1. ensureQuorum request comes in to L1
> >> 2. send ENSURE to all followers
> >> 3. 10 more ensureQuorum requests come in
> >> 4. get back ENSURE from quorum
> >> 5. we can now service all 10 pending ensureQuorum requests with another
> >>    round trip ENSURE.
> >>
> >> We don't need to send an ENSURE for every ensureQuorum request; we just
> >> need it to be happens-after from when the request arrived.
> >>
> >> I am fine with the Ephemeral node being removed after some time expires,
> >> but only by the leader.  If the leader's clock is broken and the client
> >> owning the Ephemeral node drops off, then we don't have liveness
> >> (because this node may not get cleaned up in a timely fashion).
> >> However, we still preserve correctness.
> >>
> >> -jc
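
For completeness, the compare-and-swap step in the TimestampService design
quoted above can be built on ZooKeeper's versioned setData (a sketch; the
/highwater path and the encoding are illustrative):

    import java.nio.ByteBuffer;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    class HighWater {
        // Reserve the next 1M timestamps: read the high-water mark, then
        // write it back conditioned on the version we read.  A BadVersion
        // failure means a concurrent reservation won; the caller retries.
        static long reserveBlock(ZooKeeper zk) throws Exception {
            Stat stat = new Stat();
            long last = ByteBuffer.wrap(zk.getData("/highwater", false, stat)).getLong();
            long next = last + 1_000_000L;
            zk.setData("/highwater", ByteBuffer.allocate(8).putLong(next).array(),
                    stat.getVersion());
            return next;  // safe to hand out timestamps in (last, next]
        }
    }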
> >>
> >>
> >> On Thu, Sep 27, 2012 at 9:02 AM, Flavio Junqueira <fp...@yahoo-inc.com>
> >> wrote:
> >>>
> >>> Say that we implement what you're suggesting. Could you check if this
> >>> scenario can happen:
> >>>
> >>> 1- Client C1 is the current leader and it super boosted read to make
> sure
> >>> it is still the leader;
> >>> 2- We process the super boosted read having it through the zab
> pipeline;
> >>> 3- When we send the response to C1 we slow down the whole deal: the
> >>> response to C1 gets delayed and we stall C1;
> >>> 4- In the meanwhile, C1's session expires on the server side and its
> >>> ephemeral leadership node is removed;
> >>> 5- A new client C2 is elected and starts exercising leadership;
> >>> 6- Now C1 comes back to normal and receives the response of the super
> >>> boosted read saying that it is still the leader.
> >>>
> >>> If my interpretation is not incorrect, the only way to prevent this
> >>> scenario from happening is if the session expires on the client side
> before
> >>> it receives the response of the read. It doesn't look like we can do
> it if
> >>> process clocks can be arbitrarily delayed.
> >>>
> >>> Note that one issue is that the behavior of ephemerals is highly
> >>> dependent upon timers, so I don't think we can avoid making some timing
> >>> assumptions altogether. The question is if we are better off with a
> >>> mechanism relying upon acknowledgements. My sense is that
> application-level
> >>> fencing is preferable (if not necessary) for applications like the
> ones JC
> >>> is mentioning or BookKeeper.
> >>>
> >>> I'm not concerned about writes to disk, which I agree we don't need for
> >>> sync. I'm more concerned about having it going through the whole
> pipeline,
> >>> which will induce more traffic to zab and increase latency for an
> >>> application that uses it heavily.
> >>>
> >>> -Flavio
> >>>
> >>> On Sep 27, 2012, at 5:27 PM, Alexander Shraer wrote:
> >>>
> >>> > another idea is to add this functionality to MultiOp - have read only
> >>> > transactions be replicated but not logged or logged asynchronously.
> >>> > I'm not sure how it works right now if I do a read-only MultiOp
> >>> > transaction - does it replicate the transaction or answer it locally
> >>> > on the leader ?
> >>> >
> >>> > Alex
> >>> >
> >>> > On Thu, Sep 27, 2012 at 8:07 AM, Alexander Shraer <shralex@gmail.com
> >
> >>> > wrote:
> >>> >> Thanks for the explanation.
> >>> >>
> >>> >> I guess one could always invoke a write operation instead of sync to
> >>> >> get the more strict semantics, but as John suggests, it might be a
> >>> >> good idea to add a new type of operation that requires followers to
> >>> >> ack but doesn't require them to log to disk - this seems sufficient
> in
> >>> >> our case.
> >>> >>
> >>> >> Alex
> >>> >>
> >>> >> On Thu, Sep 27, 2012 at 3:56 AM, Flavio Junqueira <
> fpj@yahoo-inc.com>
> >>> >> wrote:
> >>> >>> In theory, the scenario you're describing could happen, but I would
> >>> >>> argue that it is unlikely given that: 1) a leader pings followers
> twice a
> >>> >>> tick to make sure that it has a quorum of supporters (lead()); 2)
> followers
> >>> >>> give up on a leader upon catching an exception (followLeader()).
> One could
> >>> >>> calibrate tickTime to make the probability of having this scenario
> low.
> >>> >>>
> >>> >>> Let me also revisit the motivation for the way we designed sync.
> >>> >>> ZooKeeper has been designed to serve reads efficiently and making
> sync go
> >>> >>> through the pipeline would slow down reads. Although optional, we
> thought it
> >>> >>> would be a good idea to make it as efficient as possible to comply
> with the
> >>> >>> original expectations for the service. We consequently came up
> with this
> >>> >>> cheap way of making sure that a read sees all pending updates. It
> is correct
> >>> >>> that there are some corner cases that it doesn't cover. One is the
> case you
> >>> >>> mentioned. Another is having the sync finishing before the client
> submits
> >>> >>> the read and having a write committing in between. We rely upon
> the way we
> >>> >>> implement timeouts and some minimum degree of synchrony for the
> clients when
> >>> >>> submitting operations to guarantee that the scheme work.
> >>> >>>
> >>> >>> We thought about the option of having the sync operation going
> >>> >>> through the pipeline, and in fact it would have been easier to
> implement it
> >>> >>> just as a regular write, but we opted not to because we felt it was
> >>> >>> sufficient for the use cases we had and more efficient as I
> already argued.
> >>> >>>
> >>> >>> Hope it helps to clarify.
> >>> >>>
> >>> >>> -Flavio
> >>> >>>

Re: Ensure Leader hasn't changed

Posted by Benjamin Reed <be...@gmail.com>.
there is a very easy solution to this. we only rely on clocks in the
case that there are no pending transactions. (if there are pending
transactions, the sync will only return if in fact the leader is still
the leader, otherwise the transaction that the sync is waiting on will
never commit and the sync will never return.)

so, if there aren't any transactions, just submit one. make it a bogus
one: create / for example. then queue the sync behind it.

ben

ps - we bring up this issue, the solution, and the rationale for the
current implementation in section 4.4 of the zookeeper usenix paper.
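
a minimal sketch of this workaround, assuming the standard zookeeper java
client; the class name and scratch znode are made up, and a setData on an
existing scratch node stands in for the bogus create:

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.ZooKeeper;

public final class StrongSync {
    /**
     * Queue a bogus write ahead of sync() so the sync can only complete
     * if the leader still commands a quorum; if the leader lost its
     * quorum, the write never commits and the sync never returns.
     */
    public static void strongSync(ZooKeeper zk, String scratchPath)
            throws InterruptedException {
        final CountDownLatch done = new CountDownLatch(1);
        // Submit the bogus transaction asynchronously (-1 = any version)
        // so the sync below is queued behind it while it is still pending.
        zk.setData(scratchPath, new byte[0], -1,
                (rc, path, ctx, stat) -> { }, null);
        // sync() is ordered behind the write, so it completes only after
        // the write has gone through the full agreement pipeline.
        zk.sync(scratchPath, (rc, path, ctx) -> done.countDown(), null);
        done.await(); // rc error handling elided for brevity
    }
}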


Re: Ensure Leader hasn't changed

Posted by John Carrino <jo...@gmail.com>.
So I'm building a transactional db using the timestamps. The property I
need is that when you get back a timestamp, it is greater than all others
returned before that request was made. If two leaders are returning
timestamps from different batches, then C2 may have handed out some, and in
that case C1 shouldn't be allowed to hand back any from its batch. So it is
uniqueness plus this slightly stronger ordering property.

If getFreshTs doesn't ensure this, then I can't get the transaction
properties I am shooting for.
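
A rough illustration of that contract, reusing the TimestampService
interface from earlier in the thread (the class and method names here are
made up):

class FreshnessContract {
    /** The contract: a later call returns a strictly larger timestamp. */
    static void check(TimestampService service) {
        long a = service.getFreshTimestamp(); // completes first
        long b = service.getFreshTimestamp(); // starts after 'a' returned
        assert b > a; // must exceed everything handed out before it began
        // Overlapping (concurrent) calls have no ordering requirement.
    }
}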

-sent from my phone
On Sep 28, 2012 8:01 AM, "Flavio Junqueira" <fp...@yahoo-inc.com> wrote:

> I was thinking that you can do a write per timestamp batch, and not per
> individual timestamp. In the worst case, a former leader won't use some
> timestamps, and I would think it is ok, but it depends on your application.
>
> Also, even if two clients believe they are leaders simultaneously and
> serve timestamps, the property that you seem to care about the most is
> uniqueness: a timestamp is not served twice. Order of timestamps would be
> preserved per leader, but in the case of overlapping leaders you could end
> up serving timestamps that do not follow an increasing order.
>
> -Flavio
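
For reference, a minimal sketch of a write-per-batch reservation along
those lines, using a conditional setData (ZooKeeper's versioned
compare-and-swap); the class name, znode path handling, and 8-byte
encoding are illustrative:

import java.nio.ByteBuffer;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public final class TimestampBatches {
    private static final long BATCH = 1_000_000L; // 1M per reservation

    /** Reserve the next block; returns the new safeToHandout bound. */
    public static long reserveBatch(ZooKeeper zk, String highWaterPath)
            throws KeeperException, InterruptedException {
        while (true) {
            Stat stat = new Stat();
            long lastPossibleUsed = ByteBuffer
                    .wrap(zk.getData(highWaterPath, false, stat)).getLong();
            byte[] next = ByteBuffer.allocate(8)
                    .putLong(lastPossibleUsed + BATCH).array();
            try {
                // Conditional write: succeeds only if nobody reserved a
                // batch since our read (same znode version), else throws.
                zk.setData(highWaterPath, next, stat.getVersion());
                return lastPossibleUsed + BATCH;
            } catch (KeeperException.BadVersionException lostRace) {
                // Someone else moved the high-water mark; re-read, retry.
            }
        }
    }
}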

Re: Ensure Leader hasn't changed

Posted by John Carrino <jo...@gmail.com>.
Yes, we are basically trying to implement the TS oracle talked about in the
Percolator paper. If we limit it to a single host, that clearly has the
downside of being a single point of failure, but it is very easy to
implement and is what we have now. I think we can do better, and I want to
lean on ZK for this because it seems like a good fit.

In terms of throughput, you can always batch up more requests together if
you need to. Basically, in situations like this you can trade off latency
for more throughput. One way to do this would be to have 2nd-level
timestamp servers that delegate to the leader TS server. These 2nd-layer
servers can smart batch up to 1K getFreshTs requests and send them all as
one command to the leader TS server, which can then do a single
AtomicLong.addAndGet(1K) and service all 1K of them in one request.
 That's what I was referring to before where the total latency will be
O(lg(throughput)). You can extend this to 3 layers if you need to, but
that is most likely not going to be needed. Getting millions/s will be
pretty easy if we have 2 layers.

Even with a single layer using smart batching, I think we can get 1M
timestamps/s handed out. Assuming no writes, we can get a ZK strongSync
plus a checkExists done every 1ms or so, and each of those rounds can serve
about 1K pending getFreshTs requests. That gets us to the 1M mark pretty
easily without the availability sacrifice of running a single TS node.
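
A hedged sketch of that single-layer serve loop; ensureQuorumAndLeadership
is a stub for the hypothetical strongSync() plus exists(myEphemNode) check
discussed in this thread, not an existing ZooKeeper call, and the other
names are made up:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

public final class BatchedTimestampServer {
    private final BlockingQueue<CompletableFuture<Long>> pending =
            new LinkedBlockingQueue<>();
    private final AtomicLong next = new AtomicLong(); // within the batch

    public CompletableFuture<Long> getFreshTimestamp() {
        CompletableFuture<Long> f = new CompletableFuture<>();
        pending.add(f);
        return f;
    }

    /** One round: one leadership check services every queued request. */
    void serveOneRound() throws Exception {
        List<CompletableFuture<Long>> batch = new ArrayList<>();
        batch.add(pending.take()); // block for at least one request
        pending.drainTo(batch);    // smart batching: grab all the rest

        // Happens-after fencing: quorum round trip plus ephemeral-node
        // check, both performed after every request in 'batch' arrived.
        ensureQuorumAndLeadership();

        long top = next.addAndGet(batch.size()); // one bump serves all
        long ts = top - batch.size();
        for (CompletableFuture<Long> f : batch) {
            f.complete(++ts); // hands out top-n+1 .. top
        }
    }

    private void ensureQuorumAndLeadership() throws Exception {
        // Stub: strongSync() + exists(myEphemNode) against ZooKeeper.
        // Reserving a new block when 'next' nears safeToHandout is elided.
    }
}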

This all assumes that there are enough transactions to go really wide. If
there is only one thread doing transactions, throughput is limited by the
latency of the TS server, which we'd like to bring down to around 1-2ms if
possible.

-jc



On Fri, Sep 28, 2012 at 8:59 AM, Flavio Junqueira <fp...@yahoo-inc.com> wrote:

> I suppose you're aware of the Percolator approach to timestamps. According
> to their description, they have a single server serving timestamps, which
> makes performance very good in the absence of crashes of the timestamp
> oracle. They mention restarts of the oracle, but they don't say how they
> implement it; it is possibly done manually.
>
> Using master election to deal with oracle crashes requires that all
> masters, primary and backups, agree on the sequence of timestamps handed to
> clients. This observation essentially implies that you need go through zab
> one way or the other to have agreement on the sequence of timestamps
> handed. Even if we bypass disk writes but have it going though zab, I don't
> think it will be able to get the same performance the timestamp oracle of
> Percolator is getting (they claim 2 million/s). How much throughput
> performance is sufficient for you if I may ask? Order of magnitude would be
> sufficient.
>
> -Flavio
>
> On Sep 28, 2012, at 5:17 PM, John Carrino wrote:
>
> > Yes. This property needs to be global because otherwise another
> transaction
> > can start and commit in the past when it shouldn't be allowed to.
> >
> > -sent from my phone
> > On Sep 28, 2012 8:13 AM, "Flavio Junqueira" <fp...@yahoo-inc.com> wrote:
> >
> >> I checked your message again and the contract is that
> getFreshTimestamp()
> >> always gets a fresh timestamp (not older than any returned by that
> method).
> >> One aspect of the contract that is not clear to me: does this freshness
> >> property hold across clients or only for a single client? If it holds
> only
> >> per client, then you can guarantee it with master epochs. No user of
> your
> >> application will connect to the master of epoch y < x once it connects
> to
> >> the master of epoch x. The epochs can be the sequence number of a
> >> sequential znode. How does it sound?
> >>
> >> -Flavio
> >>
> >> On Sep 28, 2012, at 5:00 PM, Flavio Junqueira wrote:
> >>
> >>> I was thinking that you can do a write per timestamp batch, and not per
> >> individual timestamp. In the worst case, a former leader won't use some
> >> timestamps, and I would think it is ok, but it depends on your
> application.
> >>>
> >>> Also, even if two clients believe they are leaders simultaneously and
> >> serve timestamps, the property that you seem to care about the most is
> >> uniqueness: a timestamp is not served twice. Order of timestamps would
> be
> >> preserved per leader, but in the case of overlapping leaders you could
> end
> >> up serving timestamps that do not follow an increasing order.
> >>>
> >>> -Flavio
> >>>
> >>> On Sep 28, 2012, at 4:37 PM, John Carrino wrote:
> >>>
> >>>> CompareAndSwap is an atomic check and update. Basically only update
> the
> >>>> value if it is the same as the expected value.
> >>>>
> >>>> I think with your approach you'd have to do a write for every single
> >>>> timestamp you wanted to hand out.  The latency hit on this would too
> >> much.
> >>>>
> >>>> My approach is different in that a timestamp server reserves a bunch
> of
> >>>> timestamps up front and proceeds to hand them out as long as it is the
> >>>> leader.  Leader check can be done without hitting disk hopefully.
> >>>>
> >>>> Thanks!
> >>>>
> >>>> -jc
> >>>>
> >>>>
> >>>> On Fri, Sep 28, 2012 at 7:19 AM, Flavio Junqueira <fp...@yahoo-inc.com>
> >> wrote:
> >>>>
> >>>>> I don't know what your compareAndSwap method does, but I was
> wondering
> >> if
> >>>>> your client process can use conditional writes to a znode to make
> sure
> >> that
> >>>>> it was the last one to update the state of timestamp batches. You can
> >> treat
> >>>>> the master election problem separately and it does not have to be as
> >> strict
> >>>>> as you have been thinking you need. Thats is, it wouldn't hurt if a
> >> client
> >>>>> still thinks it is leading even if it is not because no two clients
> >> will be
> >>>>> able to update the state of timestamp blocks without noticing that
> >> another
> >>>>> client is also updating it.
> >>>>>
> >>>>> -Flavio
> >>>>>
> >>>>> On Sep 27, 2012, at 6:57 PM, John Carrino wrote:
> >>>>>
> >>>>>> So I think it's time to explain what I'm writing just so everyone
> has
> >>>>>> more situation awareness.  Its just a timestamp server, nothing
> fancy.
> >>>>>>
> >>>>>> Looks like this:
> >>>>>>
> >>>>>> public interface TimestampService {
> >>>>>> /**
> >>>>>> * This will get a fresh timestamp that is guarenteed to be newer
> >> than
> >>>>>> any other timestamp
> >>>>>> * handed out before this method was called.
> >>>>>> */
> >>>>>> long getFreshTimestamp();
> >>>>>> }
> >>>>>>
> >>>>>> The only requirement is that the timestamp handed back is greater
> than
> >>>>>> every other timestamp that was returned before getFreshTs was
> called.
> >>>>>> There is no ordering requirement for concurrent requests.
> >>>>>>
> >>>>>> My impl is to reserve blocks of timestamps that are safe to hand out
> >> (1M
> >>>>> at
> >>>>>> a time) using compare and swap in ZK.
> >>>>>> lastPossibleUsed = read(HighWater)
> >>>>>> safeToHandout = compareAndSwap(lastPossibleUsed,
> lastPossibleUsed+1M)
> >>>>>>
> >>>>>> Now my leader can hand back timestamps up to safeToHandout, but
> >> before it
> >>>>>> hands one out it must ensure it is still the leader (no one else has
> >>>>> handed
> >>>>>> back something higher).
> >>>>>> I can use ensureQuorum(), exists(myEphemNode) to make sure this is
> the
> >>>>>> case.  Now I have a service that is guarenteed to be correct, but
> >> doesn't
> >>>>>> require disk hits in the steady state which brings down my latency
> (if
> >>>>> you
> >>>>>> get close to running out, you can compareAndSwap for more
> timestamps).
> >>>>>>
> >>>>>> If many requests come in at the same time I can use smart batching
> to
> >>>>>> verify happens after for all at once.  We can also add more layers
> if
> >> we
> >>>>>> need more bandwidth to scale up at the cost of adding latency.
> >> Basically
> >>>>>> our latency will be O(lg(requestRate)) if we keep adding layers as
> >> each
> >>>>>> previous layer becomes saturated.
> >>>>>>
> >>>>>> I hope this explanation helps. I am busy for the next 4 hours, but
> if
> >> you
> >>>>>> need more clarification I can respond to them at that time.
> >>>>>>
> >>>>>> -jc
> >>>>>>
> >>>>>>
> >>>>>> On Thu, Sep 27, 2012 at 9:26 AM, John Carrino <
> john.carrino@gmail.com
> >>>>>> wrote:
> >>>>>>
> >>>>>>> First, thanks everyone for talking this through with me.
> >>>>>>>
> >>>>>>> Flavio, for your example, this is actually ok.  There is a happens
> >> after
> >>>>>>> relationship between the client making the request and my leader C1
> >>>>> still
> >>>>>>> being the leader.  My service only needs to guarantee that what it
> >> hands
> >>>>>>> back is at least as new as anything that existed when the client
> made
> >>>>> the
> >>>>>>> request.  If C2 were to answer requests while C1 is stalling that
> is
> >> ok
> >>>>>>> because these would be considered concurrent requests and the stuff
> >>>>>>> returned by C2 may be newer but that doesn't violate any
> guarentees.
> >>>>>>>
> >>>>>>> If some client were to get back something from C2 and then (happens
> >>>>> after
> >>>>>>> relationship) someone tried to read from C1, it needs to fail.
> >>>>>>>
> >>>>>>> To address your concern of adding too much bandwidth we can get
> this
> >>>>>>> easily by doing what Martin Thompson calls smart batching (
> >>>>>>>
> http://mechanical-sympathy.blogspot.com/2011/10/smart-batching.html
> >> ).
> >>>>>>>
> >>>>>>> 1. ensureQuorum request comes in to L1
> >>>>>>> 2. send ENSURE to all followers
> >>>>>>> 3. 10 more ensureQuorum requests come in
> >>>>>>> 4. get back ENSURE from quorum
> >>>>>>> 5. we can now service all 10 pending ensureQuorum requests with
> >> another
> >>>>>>> round trip ENSURE.
> >>>>>>>
> >>>>>>> We don't need to send an ENSURE for every ensureQuorum request, we
> >> just
> >>>>>>> need it to be happens after from when the request arrived.
> >>>>>>>
> >>>>>>> I am fine with the Ephemeral node being removed after some time
> >> expires,
> >>>>>>> but only by the leader.  If the leaders clock is broken and the
> >> client
> >>>>>>> owning the Ephemeral node drops off, then we don't have liveness
> >>>>> (because
> >>>>>>> this node may not get cleaned up in a timely fashion).  However, we
> >>>>> still
> >>>>>>> preserve corectness.
> >>>>>>>
> >>>>>>> -jc
> >>>>>>>
> >>>>>>>
> >>>>>>> On Thu, Sep 27, 2012 at 9:02 AM, Flavio Junqueira <
> fpj@yahoo-inc.com
> >>>>>> wrote:
> >>>>>>>
> >>>>>>>> Say that we implement what you're suggesting. Could you check if
> >> this
> >>>>>>>> scenario can happen:
> >>>>>>>>
> >>>>>>>> 1- Client C1 is the current leader and it super boosted read to
> make
> >>>>> sure
> >>>>>>>> it is still the leader;
> >>>>>>>> 2- We process the super boosted read having it through the zab
> >>>>> pipeline;
> >>>>>>>> 3- When we send the response to C1 we slow down the whole deal:
> the
> >>>>>>>> response to C1 gets delayed and we stall C1;
> >>>>>>>> 4- In the meanwhile, C1's session expires on the server side and
> its
> >>>>>>>> ephemeral leadership node is removed;
> >>>>>>>> 5- A new client C2 is elected and starts exercising leadership;
> >>>>>>>> 6- Now C1 comes back to normal and receives the response of the
> >> super
> >>>>>>>> boosted read saying that it is still the leader.
> >>>>>>>>
> >>>>>>>> If my interpretation is not incorrect, the only way to prevent
> this
> >>>>>>>> scenario from happening is if the session expires on the client
> side
> >>>>> before
> >>>>>>>> it receives the response of the read. It doesn't look like we can
> do
> >>>>> it if
> >>>>>>>> process clocks can be arbitrarily delayed.
> >>>>>>>>
> >>>>>>>> Note that one issue is that the behavior of ephemerals is highly
> >>>>>>>> dependent upon timers, so I don't think we can avoid making some
> >> timing
> >>>>>>>> assumptions altogether. The question is if we are better off with
> a
> >>>>>>>> mechanism relying upon acknowledgements. My sense is that
> >>>>> application-level
> >>>>>>>> fencing is preferable (if not necessary) for applications like the
> >>>>> ones JC
> >>>>>>>> is mentioning or BookKeeper.
> >>>>>>>>
> >>>>>>>> I'm not concerned about writes to disk, which I agree we don't
> need
> >> for
> >>>>>>>> sync. I'm more concerned about having it going through the whole
> >>>>> pipeline,
> >>>>>>>> which will induce more traffic to zab and increase latency for an
> >>>>>>>> application that uses it heavily.
> >>>>>>>>
> >>>>>>>> -Flavio
> >>>>>>>>
> >>>>>>>> On Sep 27, 2012, at 5:27 PM, Alexander Shraer wrote:
> >>>>>>>>
> >>>>>>>>> another idea is to add this functionality to MultiOp - have read
> >> only
> >>>>>>>>> transactions be replicated but not logged or logged
> asynchronously.
> >>>>>>>>> I'm not sure how it works right now if I do a read-only MultiOp
> >>>>>>>>> transaction - does it replicate the transaction or answer it
> >> locally
> >>>>>>>>> on the leader ?
> >>>>>>>>>
> >>>>>>>>> Alex
> >>>>>>>>>
> >>>>>>>>> On Thu, Sep 27, 2012 at 8:07 AM, Alexander Shraer <
> >> shralex@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>>> Thanks for the explanation.
> >>>>>>>>>>
> >>>>>>>>>> I guess one could always invoke a write operation instead of
> sync
> >> to
> >>>>>>>>>> get the more strict semantics, but as John suggests, it might
> be a
> >>>>>>>>>> good idea to add a new type of operation that requires followers
> >> to
> >>>>>>>>>> ack but doesn't require them to log to disk - this seems
> >> sufficient
> >>>>> in
> >>>>>>>>>> our case.
> >>>>>>>>>>
> >>>>>>>>>> Alex
> >>>>>>>>>>
> >>>>>>>>>> On Thu, Sep 27, 2012 at 3:56 AM, Flavio Junqueira <
> >> fpj@yahoo-inc.com
> >>>>>>
> >>>>>>>> wrote:
> >>>>>>>>>>> In theory, the scenario you're describing could happen, but I
> >> would
> >>>>>>>> argue that it is unlikely given that: 1) a leader pings followers
> >>>>> twice a
> >>>>>>>> tick to make sure that it has a quorum of supporters (lead()); 2)
> >>>>> followers
> >>>>>>>> give up on a leader upon catching an exception (followLeader()).
> One
> >>>>> could
> >>>>>>>> calibrate tickTime to make the probability of having this scenario
> >> low.
> >>>>>>>>>>>
> >>>>>>>>>>> Let me also revisit the motivation for the way we designed
> sync.
> >>>>>>>> ZooKeeper has been designed to serve reads efficiently and making
> >> sync
> >>>>> go
> >>>>>>>> through the pipeline would slow down reads. Although optional, we
> >>>>> thought
> >>>>>>>> it would be a good idea to make it as efficient as possible to
> >> comply
> >>>>> with
> >>>>>>>> the original expectations for the service. We consequently came up
> >> with
> >>>>>>>> this cheap way of making sure that a read sees all pending
> updates.
> >> It
> >>>>> is
> >>>>>>>> correct that there are some corner cases that it doesn't cover.
> One
> >> is
> >>>>> the
> >>>>>>>> case you mentioned. Another is having the sync finishing before
> the
> >>>>> client
> >>>>>>>> submits the read and having a write committing in between. We rely
> >>>>> upon the
> >>>>>>>> way we implement timeouts and some minimum degree of synchrony for
> >> the
> >>>>>>>> clients when submitting operations to guarantee that the scheme
> >> work.
> >>>>>>>>>>>
> >>>>>>>>>>> We thought about the option of having the sync operation going
> >>>>>>>> through the pipeline, and in fact it would have been easier to
> >>>>> implement it
> >>>>>>>> just as a regular write, but we opted not to because we felt it
> was
> >>>>>>>> sufficient for the use cases we had and more efficient as I
> already
> >>>>> argued.
> >>>>>>>>>>>
> >>>>>>>>>>> Hope it helps to clarify.
> >>>>>>>>>>>
> >>>>>>>>>>> -Flavio
> >>>>>>>>>>>
> >>>>>>>>>>> On Sep 27, 2012, at 9:38 AM, Alexander Shraer wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> thanks for the explanation! but how do you avoid having the
> >>>>> scenario
> >>>>>>>>>>>> raised by John ?
> >>>>>>>>>>>> lets say you're a client connected to F, and F is connected to
> >> L.
> >>>>>>>> Lets
> >>>>>>>>>>>> also say that L's pipeline
> >>>>>>>>>>>> is now empty, and both F and L are partitioned from 3 other
> >> servers
> >>>>>>>> in
> >>>>>>>>>>>> the system that have already
> >>>>>>>>>>>> elected a new leader L'. Now I go to L' and write something. L
> >>>>> still
> >>>>>>>>>>>> thinks its the leader because the
> >>>>>>>>>>>> detection that followers left it is obviously timeout
> >> dependent. So
> >>>>>>>>>>>> when F sends your sync to L and L returns
> >>>>>>>>>>>> it to F, you actually miss my write!
> >>>>>>>>>>>>
> >>>>>>>>>>>> Alex
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Thu, Sep 27, 2012 at 12:32 AM, Flavio Junqueira <
> >>>>>>>> fpj@yahoo-inc.com> wrote:
> >>>>>>>>>>>>> Hi Alex, Because of the following:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 1- A follower F processes operations from a client in FIFO
> >> order,
> >>>>>>>> and say that a client submits as you say sync + read;
> >>>>>>>>>>>>> 2- A sync will be processed by the leader and returned to the
> >>>>>>>> follower. It will be queued after all pending updates that the
> >> follower
> >>>>>>>> hasn't processed;
> >>>>>>>>>>>>> 3- The follower will process all pending updates before
> >> processing
> >>>>>>>> the response of the sync;
> >>>>>>>>>>>>> 4- Once the follower processes the sync, it picks the read
> >>>>>>>> operation to process. It reads the local state of the follower and
> >>>>> returns
> >>>>>>>> to the client.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> When we process the read in Step 4, we have applied all
> pending
> >>>>>>>> updates the leader had for the follower by the time the read
> request
> >>>>>>>> started.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> This implementation is a bit of a hack because it doesn't
> >> follow
> >>>>>>>> the same code path as the other operations that go to the leader,
> >> but
> >>>>> it
> >>>>>>>> avoids some unnecessary steps, which is important for fast reads.
> In
> >>>>> the
> >>>>>>>> sync case, the other followers don't really need to know about it
> >>>>> (there is
> >>>>>>>> nothing to be updated) and the leader simply inserts it in the
> >>>>> sequence of
> >>>>>>>> updates of F, ordering it.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> -Flavio
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Sep 27, 2012, at 9:12 AM, Alexander Shraer wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hi Flavio,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Starting a read operation concurrently with a sync implies
> >> that
> >>>>>>>> the result of the read will not miss an update committed before
> the
> >>>>> read
> >>>>>>>> started.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I thought that the intention of sync was to give something
> >> like
> >>>>>>>>>>>>>> linearizable reads, so if you invoke a sync and then a read,
> >> your
> >>>>>>>> read
> >>>>>>>>>>>>>> is guaranteed to (at least) see any write which completed
> >> before
> >>>>>>>> the
> >>>>>>>>>>>>>> sync began. Is this the intention ? If so, how is this
> >> achieved
> >>>>>>>>>>>>>> without running agreement on the sync op ?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>> Alex
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Thu, Sep 27, 2012 at 12:05 AM, Flavio Junqueira <
> >>>>>>>> fpj@yahoo-inc.com> wrote:
> >>>>>>>>>>>>>>> sync simply flushes the channel between the leader and the
> >>>>>>>> follower that forwarded the sync operation, so it doesn't go
> through
> >>>>> the
> >>>>>>>> full zab pipeline. Flushing means that all pending updates from
> the
> >>>>> leader
> >>>>>>>> to the follower are received by the time sync completes. Starting
> a
> >>>>> read
> >>>>>>>> operation concurrently with a sync implies that the result of the
> >> read
> >>>>> will
> >>>>>>>> not miss an update committed before the read started.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> -Flavio
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Sep 27, 2012, at 3:43 AM, Alexander Shraer wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> It's strange that sync doesn't run through agreement, I was
> >>>>> always
> >>>>>>>>>>>>>>>> assuming that it is... Exactly for the reason you say -
> >>>>>>>>>>>>>>>> you may trust your leader, but I may have a different
> leader
> >>>>> and
> >>>>>>>> your
> >>>>>>>>>>>>>>>> leader may not detect it yet and still think it's the
> leader.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> This seems like a bug to me.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Similarly to Paxos, ZooKeeper's safety guarantees don't
> (or
> >>>>>>>> shouldn't)
> >>>>>>>>>>>>>>>> depend on timing assumptions.
> >>>>>>>>>>>>>>>> Only progress guarantees depend on time.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Alex
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> On Wed, Sep 26, 2012 at 4:41 PM, John Carrino <
> >>>>>>>> john.carrino@gmail.com> wrote:
> >>>>>>>>>>>>>>>>> I have some pretty strong requirements in terms of
> >> consistency
> >>>>>>>> where
> >>>>>>>>>>>>>>>>> reading from followers that may be behind in terms of
> >> updates
> >>>>>>>> isn't ok for
> >>>>>>>>>>>>>>>>> my use case.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> One error case that worries me is if a follower and
> leader
> >> are
> >>>>>>>> partitioned
> >>>>>>>>>>>>>>>>> off from the network.  A new leader is elected, but the
> >>>>>>>> follower and old
> >>>>>>>>>>>>>>>>> leader don't know about it.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Normally I think sync was made for this purpose, but I
> >> looked
> >>>>>>>> at the sync
> >>>>>>>>>>>>>>>>> code and if there aren't any outstanding proposals the
> >> leader
> >>>>>>>> sends the
> >>>>>>>>>>>>>>>>> sync right back to the client without first verifying
> that
> >> it
> >>>>>>>> still has
> >>>>>>>>>>>>>>>>> quorum, so this won't work for my use case.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> At the core of the issue all I really need is a call that
> >> will
> >>>>>>>> make its
> >>>>>>>>>>>>>>>>> way to the leader and will ping its followers, ensure it
> >>>>> still
> >>>>>>>> has a
> >>>>>>>>>>>>>>>>> quorum and return success.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Basically a getCurrentLeaderEpoch() method that will be
> >>>>>>>> forwarded to the
> >>>>>>>>>>>>>>>>> leader, leader will ensure it still has quorum and return
> >> its
> >>>>>>>> epoch.  I
> >>>>>>>>>>>>>>>>> can use this primitive to implement all the other
> >> properties I
> >>>>>>>> want to
> >>>>>>>>>>>>>>>>> verify (assuming that my client will never connect to an
> >> older
> >>>>>>>> epoch after
> >>>>>>>>>>>>>>>>> this call returns). Also the nice thing about this method
> >> is
> >>>>>>>> that it will
> >>>>>>>>>>>>>>>>> not have to hit disk and the latency should just be a
> round
> >>>>>>>> trip to the
> >>>>>>>>>>>>>>>>> followers.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Most of the guarantees offered by ZooKeeper are time
> based
> >> and
> >>>>>>>> rely on
> >>>>>>>>>>>>>>>>> clocks and expiring timers, but I'm hoping to offer some
> >>>>>>>> guarantees in
> >>>>>>>>>>>>>>>>> spite of busted clocks, horrible GC perf, VM suspends and
> >> any
> >>>>>>>> other way
> >>>>>>>>>>>>>>>>> time is broken.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Also if people are interested I can go into more detail
> >> about
> >>>>>>>> what I am
> >>>>>>>>>>>>>>>>> trying to write.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> -jc
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>
> >>>>>
> >>>
> >>
> >>
>
>
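
To make the sync-then-read pattern discussed above concrete, here is a minimal sketch using the standard ZooKeeper Java client (the path and error handling are illustrative only). As the thread points out, this flushes pending updates from the leader to the follower, but it does not prove that the leader still has quorum:

import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.AsyncCallback.VoidCallback;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class SyncThenRead {
    // Flush the follower's channel to the leader, then read: the read sees
    // every update the leader had queued for this follower when the sync
    // was submitted.
    static byte[] syncedRead(ZooKeeper zk, String path) throws Exception {
        final CountDownLatch flushed = new CountDownLatch(1);
        zk.sync(path, new VoidCallback() {
            public void processResult(int rc, String p, Object ctx) {
                flushed.countDown();
            }
        }, null);
        flushed.await();
        // Caveat from the discussion: a partitioned, stale leader can still
        // answer the sync, so this is not a quorum check.
        return zk.getData(path, false, new Stat());
    }
}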

Re: Ensure Leader hasn't changed

Posted by Flavio Junqueira <fp...@yahoo-inc.com>.
I suppose you're aware of the Percolator approach to timestamps. According to their description, they have a single server serving timestamps, which makes performance very good in the absence of crashes of the timestamp oracle. They mention restarts of the oracle, but they don't say how they implement them; possibly it is done manually.

Using master election to deal with oracle crashes requires that all masters, primary and backups, agree on the sequence of timestamps handed to clients. This observation essentially implies that you need to go through zab one way or the other to have agreement on the sequence of timestamps handed out. Even if we bypass disk writes but still have it go through zab, I don't think it will be able to match the performance the timestamp oracle of Percolator is getting (they claim 2 million/s). How much throughput is sufficient for you, if I may ask? An order of magnitude would be enough.
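
For concreteness, reserving a batch with a conditional write maps directly onto ZooKeeper's version-checked setData; a rough sketch in Java follows (the /highwater path and the 8-byte encoding are made up for illustration). One such write per batch of 1M timestamps means that even millions of timestamps per second translate into only a handful of zab writes per second:

import java.nio.ByteBuffer;

import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class TimestampBlocks {
    static final long BATCH = 1000000L;

    // Advance the high-water mark with a version-conditioned setData; the
    // write fails with BadVersionException if another master got in first.
    static long reserveBatch(ZooKeeper zk, String path) throws Exception {
        while (true) {
            Stat stat = new Stat();
            long last = ByteBuffer.wrap(zk.getData(path, false, stat)).getLong();
            byte[] next = ByteBuffer.allocate(8).putLong(last + BATCH).array();
            try {
                zk.setData(path, next, stat.getVersion());
                return last + BATCH;  // timestamps up to this value are reserved
            } catch (KeeperException.BadVersionException e) {
                // lost the race to another writer; reread and retry
            }
        }
    }
}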

-Flavio 

On Sep 28, 2012, at 5:17 PM, John Carrino wrote:

> Yes. This property needs to be global because otherwise another transaction
> can start and commit in the past when it shouldn't be allowed to.
> 
> -sent from my phone
> On Sep 28, 2012 8:13 AM, "Flavio Junqueira" <fp...@yahoo-inc.com> wrote:
> 
>> I checked your message again and the contract is that getFreshTimestamp()
>> always gets a fresh timestamp (not older than any returned by that method).
>> One aspect of the contract that is not clear to me: does this freshness
>> property hold across clients or only for a single client? If it holds only
>> per client, then you can guarantee it with master epochs. No user of your
>> application will connect to the master of epoch y < x once it connects to
>> the master of epoch x. The epochs can be the sequence number of a
>> sequential znode. How does it sound?
>> 
>> -Flavio
>> 
>> On Sep 28, 2012, at 5:00 PM, Flavio Junqueira wrote:
>> 
>>> I was thinking that you can do a write per timestamp batch, and not per
>> individual timestamp. In the worst case, a former leader won't use some
>> timestamps, and I would think it is ok, but it depends on your application.
>>> 
>>> Also, even if two clients believe they are leaders simultaneously and
>> serve timestamps, the property that you seem to care about the most is
>> uniqueness: a timestamp is not served twice. Order of timestamps would be
>> preserved per leader, but in the case of overlapping leaders you could end
>> up serving timestamps that do not follow an increasing order.
>>> 
>>> -Flavio
>>> 
>>> On Sep 28, 2012, at 4:37 PM, John Carrino wrote:
>>> 
>>>> CompareAndSwap is an atomic check and update. Basically only update the
>>>> value if it is the same as the expected value.
>>>> 
>>>> I think with your approach you'd have to do a write for every single
>>>> timestamp you wanted to hand out.  The latency hit on this would be too
>> much.
>>>> 
>>>> My approach is different in that a timestamp server reserves a bunch of
>>>> timestamps up front and proceeds to hand them out as long as it is the
>>>> leader.  The leader check can hopefully be done without hitting disk.
>>>> 
>>>> Thanks!
>>>> 
>>>> -jc
>>>> 
>>>> 
>>>> On Fri, Sep 28, 2012 at 7:19 AM, Flavio Junqueira <fp...@yahoo-inc.com>
>> wrote:
>>>> 
>>>>> I don't know what your compareAndSwap method does, but I was wondering
>> if
>>>>> your client process can use conditional writes to a znode to make sure
>> that
>>>>> it was the last one to update the state of timestamp batches. You can
>> treat
>>>>> the master election problem separately and it does not have to be as
>> strict
>>>>> as you have been thinking you need. That is, it wouldn't hurt if a
>> client
>>>>> still thinks it is leading even if it is not because no two clients
>> will be
>>>>> able to update the state of timestamp blocks without noticing that
>> another
>>>>> client is also updating it.
>>>>> 
>>>>> -Flavio
>>>>> 
>>>>> On Sep 27, 2012, at 6:57 PM, John Carrino wrote:
>>>>> 
>>>>>> So I think it's time to explain what I'm writing just so everyone has
>>>>>> more situation awareness.  It's just a timestamp server, nothing fancy.
>>>>>> 
>>>>>> Looks like this:
>>>>>> 
>>>>>> public interface TimestampService {
>>>>>> /**
>>>>>> * This will get a fresh timestamp that is guaranteed to be newer
>> than
>>>>>> any other timestamp
>>>>>> * handed out before this method was called.
>>>>>> */
>>>>>> long getFreshTimestamp();
>>>>>> }
>>>>>> 
>>>>>> The only requirement is that the timestamp handed back is greater than
>>>>>> every other timestamp that was returned before getFreshTs was called.
>>>>>> There is no ordering requirement for concurrent requests.
>>>>>> 
>>>>>> My impl is to reserve blocks of timestamps that are safe to hand out
>> (1M
>>>>> at
>>>>>> a time) using compare and swap in ZK.
>>>>>> lastPossibleUsed = read(HighWater)
>>>>>> safeToHandout = compareAndSwap(lastPossibleUsed, lastPossibleUsed+1M)
>>>>>> 
>>>>>> Now my leader can hand back timestamps up to safeToHandout, but
>> before it
>>>>>> hands one out it must ensure it is still the leader (no one else has
>>>>> handed
>>>>>> back something higher).
>>>>>> I can use ensureQuorum(), exists(myEphemNode) to make sure this is the
>>>>>> case.  Now I have a service that is guaranteed to be correct, but
>> doesn't
>>>>>> require disk hits in the steady state which brings down my latency (if
>>>>> you
>>>>>> get close to running out, you can compareAndSwap for more timestamps).
>>>>>> 
>>>>>> If many requests come in at the same time I can use smart batching to
>>>>>> verify happens after for all of them at once.  We can also add more layers if
>> we
>>>>>> need more bandwidth to scale up at the cost of adding latency.
>> Basically
>>>>>> our latency will be O(lg(requestRate)) if we keep adding layers as
>> each
>>>>>> previous layer becomes saturated.
>>>>>> 
>>>>>> I hope this explanation helps. I am busy for the next 4 hours, but if
>> you
>>>>>> need more clarification I can respond at that time.
>>>>>> 
>>>>>> -jc
>>>>>> 
>>>>>> 
>>>>>> On Thu, Sep 27, 2012 at 9:26 AM, John Carrino <john.carrino@gmail.com
>>>>>> wrote:
>>>>>> 
>>>>>>> First, thanks everyone for talking this through with me.
>>>>>>> 
>>>>>>> Flavio, for your example, this is actually ok.  There is a happens
>> after
>>>>>>> relationship between the client making the request and my leader C1
>>>>> still
>>>>>>> being the leader.  My service only needs to guarantee that what it
>> hands
>>>>>>> back is at least as new as anything that existed when the client made
>>>>> the
>>>>>>> request.  If C2 were to answer requests while C1 is stalling that is
>> ok
>>>>>>> because these would be considered concurrent requests and the stuff
>>>>>>> returned by C2 may be newer but that doesn't violate any guarantees.
>>>>>>> 
>>>>>>> If some client were to get back something from C2 and then (happens
>>>>> after
>>>>>>> relationship) someone tried to read from C1, it needs to fail.
>>>>>>> 
>>>>>>> To address your concern about adding too much traffic we can get this
>>>>>>> easily by doing what Martin Thompson calls smart batching (
>>>>>>> http://mechanical-sympathy.blogspot.com/2011/10/smart-batching.html
>> ).
>>>>>>> 
>>>>>>> 1. ensureQuorum request comes in to L1
>>>>>>> 2. send ENSURE to all followers
>>>>>>> 3. 10 more ensureQuorum requests come in
>>>>>>> 4. get back ENSURE from quorum
>>>>>>> 5. we can now service all 10 pending ensureQuorum requests with
>> another
>>>>>>> round trip ENSURE.
>>>>>>> 
>>>>>>> We don't need to send an ENSURE for every ensureQuorum request, we
>> just
>>>>>>> need it to be happens after from when the request arrived.
>>>>>>> 
>>>>>>> I am fine with the Ephemeral node being removed after some time
>> expires,
>>>>>>> but only by the leader.  If the leader's clock is broken and the
>> client
>>>>>>> owning the Ephemeral node drops off, then we don't have liveness
>>>>> (because
>>>>>>> this node may not get cleaned up in a timely fashion).  However, we
>>>>> still
>>>>>>> preserve correctness.
>>>>>>> 
>>>>>>> -jc
>>>>>>> 
>>>>>>> 
>>>>>>> On Thu, Sep 27, 2012 at 9:02 AM, Flavio Junqueira <fpj@yahoo-inc.com
>>>>>> wrote:
>>>>>>> 
>>>>>>>> Say that we implement what you're suggesting. Could you check if
>> this
>>>>>>>> scenario can happen:
>>>>>>>> 
>>>>>>>> 1- Client C1 is the current leader and it issues a super boosted read to make
>>>>> sure
>>>>>>>> it is still the leader;
>>>>>>>> 2- We process the super boosted read, having it go through the zab
>>>>> pipeline;
>>>>>>>> 3- When we send the response to C1 we slow down the whole deal: the
>>>>>>>> response to C1 gets delayed and we stall C1;
>>>>>>>> 4- In the meanwhile, C1's session expires on the server side and its
>>>>>>>> ephemeral leadership node is removed;
>>>>>>>> 5- A new client C2 is elected and starts exercising leadership;
>>>>>>>> 6- Now C1 comes back to normal and receives the response of the
>> super
>>>>>>>> boosted read saying that it is still the leader.
>>>>>>>> 
>>>>>>>> If my interpretation is correct, the only way to prevent this
>>>>>>>> scenario from happening is if the session expires on the client side
>>>>> before
>>>>>>>> it receives the response of the read. It doesn't look like we can do
>>>>> it if
>>>>>>>> process clocks can be arbitrarily delayed.
>>>>>>>> 
>>>>>>>> Note that one issue is that the behavior of ephemerals is highly
>>>>>>>> dependent upon timers, so I don't think we can avoid making some
>> timing
>>>>>>>> assumptions altogether. The question is if we are better off with a
>>>>>>>> mechanism relying upon acknowledgements. My sense is that
>>>>> application-level
>>>>>>>> fencing is preferable (if not necessary) for applications like the
>>>>> ones JC
>>>>>>>> is mentioning or BookKeeper.
>>>>>>>> 
>>>>>>>> I'm not concerned about writes to disk, which I agree we don't need
>> for
>>>>>>>> sync. I'm more concerned about having it going through the whole
>>>>> pipeline,
>>>>>>>> which will induce more traffic to zab and increase latency for an
>>>>>>>> application that uses it heavily.
>>>>>>>> 
>>>>>>>> -Flavio
>>>>>>>> 
>>>>>>>> On Sep 27, 2012, at 5:27 PM, Alexander Shraer wrote:
>>>>>>>> 
>>>>>>>>> another idea is to add this functionality to MultiOp - have read
>> only
>>>>>>>>> transactions be replicated but not logged or logged asynchronously.
>>>>>>>>> I'm not sure how it works right now if I do a read-only MultiOp
>>>>>>>>> transaction - does it replicate the transaction or answer it
>> locally
>>>>>>>>> on the leader ?
>>>>>>>>> 
>>>>>>>>> Alex
>>>>>>>>> 
>>>>>>>>> On Thu, Sep 27, 2012 at 8:07 AM, Alexander Shraer <
>> shralex@gmail.com>
>>>>>>>> wrote:
>>>>>>>>>> Thanks for the explanation.
>>>>>>>>>> 
>>>>>>>>>> I guess one could always invoke a write operation instead of sync
>> to
>>>>>>>>>> get the more strict semantics, but as John suggests, it might be a
>>>>>>>>>> good idea to add a new type of operation that requires followers
>> to
>>>>>>>>>> ack but doesn't require them to log to disk - this seems
>> sufficient
>>>>> in
>>>>>>>>>> our case.
>>>>>>>>>> 
>>>>>>>>>> Alex
>>>>>>>>>> 
>>>>>>>>>> On Thu, Sep 27, 2012 at 3:56 AM, Flavio Junqueira <
>> fpj@yahoo-inc.com
>>>>>> 
>>>>>>>> wrote:
>>>>>>>>>>> In theory, the scenario you're describing could happen, but I
>> would
>>>>>>>> argue that it is unlikely given that: 1) a leader pings followers
>>>>> twice a
>>>>>>>> tick to make sure that it has a quorum of supporters (lead()); 2)
>>>>> followers
>>>>>>>> give up on a leader upon catching an exception (followLeader()). One
>>>>> could
>>>>>>>> calibrate tickTime to make the probability of having this scenario
>> low.
>>>>>>>>>>> 
>>>>>>>>>>> Let me also revisit the motivation for the way we designed sync.
>>>>>>>> ZooKeeper has been designed to serve reads efficiently and making
>> sync
>>>>> go
>>>>>>>> through the pipeline would slow down reads. Although optional, we
>>>>> thought
>>>>>>>> it would be a good idea to make it as efficient as possible to
>> comply
>>>>> with
>>>>>>>> the original expectations for the service. We consequently came up
>> with
>>>>>>>> this cheap way of making sure that a read sees all pending updates.
>> It
>>>>> is
>>>>>>>> correct that there are some corner cases that it doesn't cover. One
>> is
>>>>> the
>>>>>>>> case you mentioned. Another is having the sync finishing before the
>>>>> client
>>>>>>>> submits the read and having a write committing in between. We rely
>>>>> upon the
>>>>>>>> way we implement timeouts and some minimum degree of synchrony for
>> the
>>>>>>>> clients when submitting operations to guarantee that the scheme
> works.
>>>>>>>>>>> 
>>>>>>>>>>> We thought about the option of having the sync operation going
>>>>>>>> through the pipeline, and in fact it would have been easier to
>>>>> implement it
>>>>>>>> just as a regular write, but we opted not to because we felt it was
>>>>>>>> sufficient for the use cases we had and more efficient as I already
>>>>> argued.
>>>>>>>>>>> 
>>>>>>>>>>> Hope it helps to clarify.
>>>>>>>>>>> 
>>>>>>>>>>> -Flavio
>>>>>>>>>>> 
>>>>>>>>>>> On Sep 27, 2012, at 9:38 AM, Alexander Shraer wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> thanks for the explanation! but how do you avoid having the
>>>>> scenario
>>>>>>>>>>>> raised by John ?
>>>>>>>>>> let's say you're a client connected to F, and F is connected to
>> L.
>>>>>> Let's
>>>>>>>>>>>> also say that L's pipeline
>>>>>>>>>>>> is now empty, and both F and L are partitioned from 3 other
>> servers
>>>>>>>> in
>>>>>>>>>>>> the system that have already
>>>>>>>>>>>> elected a new leader L'. Now I go to L' and write something. L
>>>>> still
>>>>>>>>>> thinks it's the leader because the
>>>>>>>>>>>> detection that followers left it is obviously timeout
>> dependent. So
>>>>>>>>>>>> when F sends your sync to L and L returns
>>>>>>>>>>>> it to F, you actually miss my write!
>>>>>>>>>>>> 
>>>>>>>>>>>> Alex
>>>>>>>>>>>> 
>>>>>>>>>>>> On Thu, Sep 27, 2012 at 12:32 AM, Flavio Junqueira <
>>>>>>>> fpj@yahoo-inc.com> wrote:
>>>>>>>>>>>>> Hi Alex, Because of the following:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 1- A follower F processes operations from a client in FIFO
>> order,
>>>>>>>> and say that a client submits as you say sync + read;
>>>>>>>>>>>>> 2- A sync will be processed by the leader and returned to the
>>>>>>>> follower. It will be queued after all pending updates that the
>> follower
>>>>>>>> hasn't processed;
>>>>>>>>>>>>> 3- The follower will process all pending updates before
>> processing
>>>>>>>> the response of the sync;
>>>>>>>>>>>>> 4- Once the follower processes the sync, it picks the read
>>>>>>>> operation to process. It reads the local state of the follower and
>>>>> returns
>>>>>>>> to the client.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> When we process the read in Step 4, we have applied all pending
>>>>>>>> updates the leader had for the follower by the time the read request
>>>>>>>> started.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> This implementation is a bit of a hack because it doesn't
>> follow
>>>>>>>> the same code path as the other operations that go to the leader,
>> but
>>>>> it
>>>>>>>> avoids some unnecessary steps, which is important for fast reads. In
>>>>> the
>>>>>>>> sync case, the other followers don't really need to know about it
>>>>> (there is
>>>>>>>> nothing to be updated) and the leader simply inserts it in the
>>>>> sequence of
>>>>>>>> updates of F, ordering it.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> -Flavio
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Sep 27, 2012, at 9:12 AM, Alexander Shraer wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Hi Flavio,
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Starting a read operation concurrently with a sync implies
>> that
>>>>>>>> the result of the read will not miss an update committed before the
>>>>> read
>>>>>>>> started.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I thought that the intention of sync was to give something
>> like
>>>>>>>>>>>>>> linearizable reads, so if you invoke a sync and then a read,
>> your
>>>>>>>> read
>>>>>>>>>>>>>> is guaranteed to (at least) see any write which completed
>> before
>>>>>>>> the
>>>>>>>>>>>>>> sync began. Is this the intention ? If so, how is this
>> achieved
>>>>>>>>>>>>>> without running agreement on the sync op ?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Alex
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Thu, Sep 27, 2012 at 12:05 AM, Flavio Junqueira <
>>>>>>>> fpj@yahoo-inc.com> wrote:
>>>>>>>>>>>>>>> sync simply flushes the channel between the leader and the
>>>>>>>> follower that forwarded the sync operation, so it doesn't go through
>>>>> the
>>>>>>>> full zab pipeline. Flushing means that all pending updates from the
>>>>> leader
>>>>>>>> to the follower are received by the time sync completes. Starting a
>>>>> read
>>>>>>>> operation concurrently with a sync implies that the result of the
>> read
>>>>> will
>>>>>>>> not miss an update committed before the read started.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> -Flavio
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> On Sep 27, 2012, at 3:43 AM, Alexander Shraer wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> It's strange that sync doesn't run through agreement, I was
>>>>> always
>>>>>>>>>>>>>>>> assuming that it is... Exactly for the reason you say -
>>>>>>>>>>>>>>>> you may trust your leader, but I may have a different leader
>>>>> and
>>>>>>>> your
>>>>>>>>>>>>>> leader may not detect it yet and still think it's the leader.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> This seems like a bug to me.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Similarly to Paxos, ZooKeeper's safety guarantees don't (or
>>>>>>>> shouldn't)
>>>>>>>>>>>>>> depend on timing assumptions.
>>>>>>>>>>>>>>>> Only progress guarantees depend on time.
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Alex
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Wed, Sep 26, 2012 at 4:41 PM, John Carrino <
>>>>>>>> john.carrino@gmail.com> wrote:
>>>>>>>>>>>>>>>>> I have some pretty strong requirements in terms of
>> consistency
>>>>>>>> where
>>>>>>>>>>>>>>>>> reading from followers that may be behind in terms of
>> updates
>>>>>>>> isn't ok for
>>>>>>>>>>>>>>>>> my use case.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> One error case that worries me is if a follower and leader
>> are
>>>>>>>> partitioned
>>>>>>>>>>>>>>>>> off from the network.  A new leader is elected, but the
>>>>>>>> follower and old
>>>>>>>>>>>>>>>>> leader don't know about it.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Normally I think sync was made for this purpose, but I
>> looked
>>>>>>>> at the sync
>>>>>>>>>>>>>>>>> code and if there aren't any outstanding proposals the
>> leader
>>>>>>>> sends the
>>>>>>>>>>>>>>>>> sync right back to the client without first verifying that
>> it
>>>>>>>> still has
>>>>>>>>>>>>>>>>> quorum, so this won't work for my use case.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> At the core of the issue all I really need is a call that
>> will
>>>>>> make its
>>>>>>>>>>>>>>> way to the leader and will ping its followers, ensure it
>>>>> still
>>>>>>>> has a
>>>>>>>>>>>>>>>>> quorum and return success.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Basically a getCurrentLeaderEpoch() method that will be
>>>>>>>> forwarded to the
>>>>>>>>>>>>>>>>> leader, leader will ensure it still has quorum and return
>> its
>>>>>>>> epoch.  I
>>>>>>>>>>>>>>>>> can use this primitive to implement all the other
>> properties I
>>>>>>>> want to
>>>>>>>>>>>>>>>>> verify (assuming that my client will never connect to an
>> older
>>>>>>>> epoch after
>>>>>>>>>>>>>>>>> this call returns). Also the nice thing about this method
>> is
>>>>>>>> that it will
>>>>>>>>>>>>>>>>> not have to hit disk and the latency should just be a round
>>>>>>>> trip to the
>>>>>>>>>>>>>>>>> followers.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Most of the guarantees offered by ZooKeeper are time based
>> and
>>>>>>>> rely on
>>>>>>>>>>>>>>>>> clocks and expiring timers, but I'm hoping to offer some
>>>>>>>> guarantees in
>>>>>>>>>>>>>>>>> spite of busted clocks, horrible GC perf, VM suspends and
>> any
>>>>>>>> other way
>>>>>>>>>>>>>>>>> time is broken.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Also if people are interested I can go into more detail
>> about
>>>>>>>> what I am
>>>>>>>>>>>>>>>>> trying to write.
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> -jc
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>> 
>>>>> 
>>> 
>> 
>> 


Re: Ensure Leader hasn't changed

Posted by John Carrino <jo...@gmail.com>.
Yes. This property needs to be global because otherwise another transaction
can start and commit in the past when it shouldn't be allowed to.
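
A hypothetical run makes that concrete. The AtomicLong below is just a single-process stand-in for the service, only to state the contract; the hard part is keeping the assert true when the two calls land on different masters across a failover:

import java.util.concurrent.atomic.AtomicLong;

public class FreshnessInvariant {
    // Single-process stand-in for the timestamp service. The contract must
    // hold globally, across clients and across masters.
    static final AtomicLong clock = new AtomicLong();

    static long getFreshTimestamp() {
        return clock.incrementAndGet();
    }

    public static void main(String[] args) {
        long tA = getFreshTimestamp(); // "client A" commits a transaction at tA
        long tB = getFreshTimestamp(); // "client B" starts after observing tA
        assert tB > tA;                // B must not start and commit in the past
    }
}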

-sent from my phone
On Sep 28, 2012 8:13 AM, "Flavio Junqueira" <fp...@yahoo-inc.com> wrote:

> I checked your message again and the contract is that getFreshTimestamp()
> always gets a fresh timestamp (not older than any returned by that method).
> One aspect of the contract that is not clear to me: does this freshness
> property hold across clients or only for a single client? If it holds only
> per client, then you can guarantee it with master epochs. No user of your
> application will connect to the master of epoch y < x once it connects to
> the master of epoch x. The epochs can be the sequence number of a
> sequential znode. How does it sound?
>
> -Flavio
>
> On Sep 28, 2012, at 5:00 PM, Flavio Junqueira wrote:
>
> > I was thinking that you can do a write per timestamp batch, and not per
> individual timestamp. In the worst case, a former leader won't use some
> timestamps, and I would think it is ok, but it depends on your application.
> >
> > Also, even if two clients believe they are leaders simultaneously and
> serve timestamps, the property that you seem to care about the most is
> uniqueness: a timestamp is not served twice. Order of timestamps would be
> preserved per leader, but in the case of overlapping leaders you could end
> up serving timestamps that do not follow an increasing order.
> >
> > -Flavio
> >
> > On Sep 28, 2012, at 4:37 PM, John Carrino wrote:
> >
> >> CompareAndSwap is an atomic check and update. Basically only update the
> >> value if it is the same as the expected value.
> >>
> >> I think with your approach you'd have to do a write for every single
> >> timestamp you wanted to hand out.  The latency hit on this would be too
> much.
> >>
> >> My approach is different in that a timestamp server reserves a bunch of
> >> timestamps up front and proceeds to hand them out as long as it is the
> >> leader.  The leader check can hopefully be done without hitting disk.
> >>
> >> Thanks!
> >>
> >> -jc
> >>
> >>
> >> On Fri, Sep 28, 2012 at 7:19 AM, Flavio Junqueira <fp...@yahoo-inc.com>
> wrote:
> >>
> >>> I don't know what your compareAndSwap method does, but I was wondering
> if
> >>> your client process can use conditional writes to a znode to make sure
> that
> >>> it was the last one to update the state of timestamp batches. You can
> treat
> >>> the master election problem separately and it does not have to be as
> strict
> >>> as you have been thinking you need. That is, it wouldn't hurt if a
> client
> >>> still thinks it is leading even if it is not because no two clients
> will be
> >>> able to update the state of timestamp blocks without noticing that
> another
> >>> client is also updating it.
> >>>
> >>> -Flavio
> >>>
> >>> On Sep 27, 2012, at 6:57 PM, John Carrino wrote:
> >>>
> >>>> So I think it's time to explain what I'm writing just so everyone has
> >>>> more situation awareness.  It's just a timestamp server, nothing fancy.
> >>>>
> >>>> Looks like this:
> >>>>
> >>>> public interface TimestampService {
> >>>>  /**
> >>>>   * This will get a fresh timestamp that is guaranteed to be newer
> than
> >>>> any other timestamp
> >>>>   * handed out before this method was called.
> >>>>   */
> >>>>  long getFreshTimestamp();
> >>>> }
> >>>>
> >>>> The only requirement is that the timestamp handed back is greater than
> >>>> every other timestamp that was returned before getFreshTs was called.
> >>>> There is no ordering requirement for concurrent requests.
> >>>>
> >>>> My impl is to reserve blocks of timestamps that are safe to hand out
> (1M
> >>> at
> >>>> a time) using compare and swap in ZK.
> >>>> lastPossibleUsed = read(HighWater)
> >>>> safeToHandout = compareAndSwap(lastPossibleUsed, lastPossibleUsed+1M)
> >>>>
> >>>> Now my leader can hand back timestamps up to safeToHandout, but
> before it
> >>>> hands one out it must ensure it is still the leader (no one else has
> >>> handed
> >>>> back something higher).
> >>>> I can use ensureQuorum(), exists(myEphemNode) to make sure this is the
> >>>> case.  Now I have a service that is guaranteed to be correct, but
> doesn't
> >>>> require disk hits in the steady state which brings down my latency (if
> >>> you
> >>>> get close to running out, you can compareAndSwap for more timestamps).
> >>>>
> >>>> If many requests come in at the same time I can use smart batching to
> >>>> verify happens after for all of them at once.  We can also add more layers if
> we
> >>>> need more bandwidth to scale up at the cost of adding latency.
>  Basically
> >>>> our latency will be O(lg(requestRate)) if we keep adding layers as
> each
> >>>> previous layer becomes saturated.
> >>>>
> >>>> I hope this explanation helps. I am busy for the next 4 hours, but if
> you
> >>>> need more clarification I can respond at that time.
> >>>>
> >>>> -jc
> >>>>
> >>>>
> >>>> On Thu, Sep 27, 2012 at 9:26 AM, John Carrino <john.carrino@gmail.com
> >>>> wrote:
> >>>>
> >>>>> First, thanks everyone for talking this through with me.
> >>>>>
> >>>>> Flavio, for your example, this is actually ok.  There is a happens
> after
> >>>>> relationship between the client making the request and my leader C1
> >>> still
> >>>>> being the leader.  My service only needs to guarantee that what it
> hands
> >>>>> back is at least as new as anything that existed when the client made
> >>> the
> >>>>> request.  If C2 were to answer requests while C1 is stalling that is
> ok
> >>>>> because these would be considered concurrent requests and the stuff
> >>>>> returned by C2 may be newer but that doesn't violate any guarantees.
> >>>>>
> >>>>> If some client were to get back something from C2 and then (happens
> >>> after
> >>>>> relationship) someone tried to read from C1, it needs to fail.
> >>>>>
> >>>>> To address your concern about adding too much traffic we can get this
> >>>>> easily by doing what Martin Thompson calls smart batching (
> >>>>> http://mechanical-sympathy.blogspot.com/2011/10/smart-batching.html
> ).
> >>>>>
> >>>>> 1. ensureQuorum request comes in to L1
> >>>>> 2. send ENSURE to all followers
> >>>>> 3. 10 more ensureQuorum requests come in
> >>>>> 4. get back ENSURE from quorum
> >>>>> 5. we can now service all 10 pending ensureQuorum requests with
> another
> >>>>> round trip ENSURE.
> >>>>>
> >>>>> We don't need to send an ENSURE for every ensureQuorum request, we
> just
> >>>>> need it to be happens after from when the request arrived.
> >>>>>
> >>>>> I am fine with the Ephemeral node being removed after some time
> expires,
> >>>>> but only by the leader.  If the leader's clock is broken and the
> client
> >>>>> owning the Ephemeral node drops off, then we don't have liveness
> >>> (because
> >>>>> this node may not get cleaned up in a timely fashion).  However, we
> >>> still
> >>>>> preserve correctness.
> >>>>>
> >>>>> -jc
> >>>>>
> >>>>>
> >>>>> On Thu, Sep 27, 2012 at 9:02 AM, Flavio Junqueira <fpj@yahoo-inc.com
> >>>> wrote:
> >>>>>
> >>>>>> Say that we implement what you're suggesting. Could you check if
> this
> >>>>>> scenario can happen:
> >>>>>>
> >>>>>> 1- Client C1 is the current leader and it issues a super boosted read to make
> >>> sure
> >>>>>> it is still the leader;
> >>>>>> 2- We process the super boosted read, having it go through the zab
> >>> pipeline;
> >>>>>> 3- When we send the response to C1 we slow down the whole deal: the
> >>>>>> response to C1 gets delayed and we stall C1;
> >>>>>> 4- In the meanwhile, C1's session expires on the server side and its
> >>>>>> ephemeral leadership node is removed;
> >>>>>> 5- A new client C2 is elected and starts exercising leadership;
> >>>>>> 6- Now C1 comes back to normal and receives the response of the
> super
> >>>>>> boosted read saying that it is still the leader.
> >>>>>>
> >>>>>> If my interpretation is correct, the only way to prevent this
> >>>>>> scenario from happening is if the session expires on the client side
> >>> before
> >>>>>> it receives the response of the read. It doesn't look like we can do
> >>> it if
> >>>>>> process clocks can be arbitrarily delayed.
> >>>>>>
> >>>>>> Note that one issue is that the behavior of ephemerals is highly
> >>>>>> dependent upon timers, so I don't think we can avoid making some
> timing
> >>>>>> assumptions altogether. The question is if we are better off with a
> >>>>>> mechanism relying upon acknowledgements. My sense is that
> >>> application-level
> >>>>>> fencing is preferable (if not necessary) for applications like the
> >>> ones JC
> >>>>>> is mentioning or BookKeeper.
> >>>>>>
> >>>>>> I'm not concerned about writes to disk, which I agree we don't need
> for
> >>>>>> sync. I'm more concerned about having it going through the whole
> >>> pipeline,
> >>>>>> which will induce more traffic to zab and increase latency for an
> >>>>>> application that uses it heavily.
> >>>>>>
> >>>>>> -Flavio
> >>>>>>
> >>>>>> On Sep 27, 2012, at 5:27 PM, Alexander Shraer wrote:
> >>>>>>
> >>>>>>> another idea is to add this functionality to MultiOp - have read
> only
> >>>>>>> transactions be replicated but not logged or logged asynchronously.
> >>>>>>> I'm not sure how it works right now if I do a read-only MultiOp
> >>>>>>> transaction - does it replicate the transaction or answer it
> locally
> >>>>>>> on the leader ?
> >>>>>>>
> >>>>>>> Alex
> >>>>>>>
> >>>>>>> On Thu, Sep 27, 2012 at 8:07 AM, Alexander Shraer <
> shralex@gmail.com>
> >>>>>> wrote:
> >>>>>>>> Thanks for the explanation.
> >>>>>>>>
> >>>>>>>> I guess one could always invoke a write operation instead of sync
> to
> >>>>>>>> get the more strict semantics, but as John suggests, it might be a
> >>>>>>>> good idea to add a new type of operation that requires followers
> to
> >>>>>>>> ack but doesn't require them to log to disk - this seems
> sufficient
> >>> in
> >>>>>>>> our case.
> >>>>>>>>
> >>>>>>>> Alex
> >>>>>>>>
> >>>>>>>> On Thu, Sep 27, 2012 at 3:56 AM, Flavio Junqueira <
> fpj@yahoo-inc.com
> >>>>
> >>>>>> wrote:
> >>>>>>>>> In theory, the scenario you're describing could happen, but I
> would
> >>>>>> argue that it is unlikely given that: 1) a leader pings followers
> >>> twice a
> >>>>>> tick to make sure that it has a quorum of supporters (lead()); 2)
> >>> followers
> >>>>>> give up on a leader upon catching an exception (followLeader()). One
> >>> could
> >>>>>> calibrate tickTime to make the probability of having this scenario
> low.
> >>>>>>>>>
> >>>>>>>>> Let me also revisit the motivation for the way we designed sync.
> >>>>>> ZooKeeper has been designed to serve reads efficiently and making
> sync
> >>> go
> >>>>>> through the pipeline would slow down reads. Although optional, we
> >>> thought
> >>>>>> it would be a good idea to make it as efficient as possible to
> comply
> >>> with
> >>>>>> the original expectations for the service. We consequently came up
> with
> >>>>>> this cheap way of making sure that a read sees all pending updates.
> It
> >>> is
> >>>>>> correct that there are some corner cases that it doesn't cover. One
> is
> >>> the
> >>>>>> case you mentioned. Another is having the sync finishing before the
> >>> client
> >>>>>> submits the read and having a write committing in between. We rely
> >>> upon the
> >>>>>> way we implement timeouts and some minimum degree of synchrony for
> the
> >>>>>> clients when submitting operations to guarantee that the scheme
> works.
> >>>>>>>>>
> >>>>>>>>> We thought about the option of having the sync operation going
> >>>>>> through the pipeline, and in fact it would have been easier to
> >>> implement it
> >>>>>> just as a regular write, but we opted not to because we felt it was
> >>>>>> sufficient for the use cases we had and more efficient as I already
> >>> argued.
> >>>>>>>>>
> >>>>>>>>> Hope it helps to clarify.
> >>>>>>>>>
> >>>>>>>>> -Flavio
> >>>>>>>>>
> >>>>>>>>> On Sep 27, 2012, at 9:38 AM, Alexander Shraer wrote:
> >>>>>>>>>
> >>>>>>>>>> thanks for the explanation! but how do you avoid having the
> >>> scenario
> >>>>>>>>>> raised by John ?
> >>>>>>>>>> let's say you're a client connected to F, and F is connected to
> L.
> >>>>>> Let's
> >>>>>>>>>> also say that L's pipeline
> >>>>>>>>>> is now empty, and both F and L are partitioned from 3 other
> servers
> >>>>>> in
> >>>>>>>>>> the system that have already
> >>>>>>>>>> elected a new leader L'. Now I go to L' and write something. L
> >>> still
> >>>>>>>>>> thinks it's the leader because the
> >>>>>>>>>> detection that followers left it is obviously timeout
> dependent. So
> >>>>>>>>>> when F sends your sync to L and L returns
> >>>>>>>>>> it to F, you actually miss my write!
> >>>>>>>>>>
> >>>>>>>>>> Alex
> >>>>>>>>>>
> >>>>>>>>>> On Thu, Sep 27, 2012 at 12:32 AM, Flavio Junqueira <
> >>>>>> fpj@yahoo-inc.com> wrote:
> >>>>>>>>>>> Hi Alex, Because of the following:
> >>>>>>>>>>>
> >>>>>>>>>>> 1- A follower F processes operations from a client in FIFO
> order,
> >>>>>> and say that a client submits as you say sync + read;
> >>>>>>>>>>> 2- A sync will be processed by the leader and returned to the
> >>>>>> follower. It will be queued after all pending updates that the
> follower
> >>>>>> hasn't processed;
> >>>>>>>>>>> 3- The follower will process all pending updates before
> processing
> >>>>>> the response of the sync;
> >>>>>>>>>>> 4- Once the follower processes the sync, it picks the read
> >>>>>> operation to process. It reads the local state of the follower and
> >>> returns
> >>>>>> to the client.
> >>>>>>>>>>>
> >>>>>>>>>>> When we process the read in Step 4, we have applied all pending
> >>>>>> updates the leader had for the follower by the time the read request
> >>>>>> started.
> >>>>>>>>>>>
> >>>>>>>>>>> This implementation is a bit of a hack because it doesn't
> follow
> >>>>>> the same code path as the other operations that go to the leader,
> but
> >>> it
> >>>>>> avoids some unnecessary steps, which is important for fast reads. In
> >>> the
> >>>>>> sync case, the other followers don't really need to know about it
> >>> (there is
> >>>>>> nothing to be updated) and the leader simply inserts it in the
> >>> sequence of
> >>>>>> updates of F, ordering it.
> >>>>>>>>>>>
> >>>>>>>>>>> -Flavio
> >>>>>>>>>>>
> >>>>>>>>>>> On Sep 27, 2012, at 9:12 AM, Alexander Shraer wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hi Flavio,
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Starting a read operation concurrently with a sync implies
> that
> >>>>>> the result of the read will not miss an update committed before the
> >>> read
> >>>>>> started.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I thought that the intention of sync was to give something
> like
> >>>>>>>>>>>> linearizable reads, so if you invoke a sync and then a read,
> your
> >>>>>> read
> >>>>>>>>>>>> is guaranteed to (at least) see any write which completed
> before
> >>>>>> the
> >>>>>>>>>>>> sync began. Is this the intention ? If so, how is this
> achieved
> >>>>>>>>>>>> without running agreement on the sync op ?
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks,
> >>>>>>>>>>>> Alex
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Thu, Sep 27, 2012 at 12:05 AM, Flavio Junqueira <
> >>>>>> fpj@yahoo-inc.com> wrote:
> >>>>>>>>>>>>> sync simply flushes the channel between the leader and the
> >>>>>> follower that forwarded the sync operation, so it doesn't go through
> >>> the
> >>>>>> full zab pipeline. Flushing means that all pending updates from the
> >>> leader
> >>>>>> to the follower are received by the time sync completes. Starting a
> >>> read
> >>>>>> operation concurrently with a sync implies that the result of the
> read
> >>> will
> >>>>>> not miss an update committed before the read started.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> -Flavio
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Sep 27, 2012, at 3:43 AM, Alexander Shraer wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> It's strange that sync doesn't run through agreement, I was
> >>> always
> >>>>>>>>>>>>>> assuming that it is... Exactly for the reason you say -
> >>>>>>>>>>>>>> you may trust your leader, but I may have a different leader
> >>> and
> >>>>>> your
> >>>>>>>>>>>>>> leader may not detect it yet and still think it's the leader.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> This seems like a bug to me.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Similarly to Paxos, ZooKeeper's safety guarantees don't (or
> >>>>>> shouldn't)
> >>>>>>>>>>>>>> depend on timing assumptions.
> >>>>>>>>>>>>>> Only progress guarantees depend on time.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Alex
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Wed, Sep 26, 2012 at 4:41 PM, John Carrino <
> >>>>>> john.carrino@gmail.com> wrote:
> >>>>>>>>>>>>>>> I have some pretty strong requirements in terms of
> consistency
> >>>>>> where
> >>>>>>>>>>>>>>> reading from followers that may be behind in terms of
> updates
> >>>>>> isn't ok for
> >>>>>>>>>>>>>>> my use case.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> One error case that worries me is if a follower and leader
> are
> >>>>>> partitioned
> >>>>>>>>>>>>>>> off from the network.  A new leader is elected, but the
> >>>>>> follower and old
> >>>>>>>>>>>>>>> leader don't know about it.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Normally I think sync was made for this purpose, but I
> looked
> >>>>>> at the sync
> >>>>>>>>>>>>>>> code and if there aren't any outstanding proposals the
> leader
> >>>>>> sends the
> >>>>>>>>>>>>>>> sync right back to the client without first verifying that
> it
> >>>>>> still has
> >>>>>>>>>>>>>>> quorum, so this won't work for my use case.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> At the core of the issue all I really need is a call that
> will
> >>>>>> make its
> >>>>>>>>>>>>>>> way to the leader and will ping its followers, ensure it
> >>> still
> >>>>>> has a
> >>>>>>>>>>>>>>> quorum and return success.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Basically a getCurrentLeaderEpoch() method that will be
> >>>>>> forwarded to the
> >>>>>>>>>>>>>>> leader, leader will ensure it still has quorum and return
> its
> >>>>>> epoch.  I
> >>>>>>>>>>>>>>> can use this primitive to implement all the other
> properties I
> >>>>>> want to
> >>>>>>>>>>>>>>> verify (assuming that my client will never connect to an
> older
> >>>>>> epoch after
> >>>>>>>>>>>>>>> this call returns). Also the nice thing about this method
> is
> >>>>>> that it will
> >>>>>>>>>>>>>>> not have to hit disk and the latency should just be a round
> >>>>>> trip to the
> >>>>>>>>>>>>>>> followers.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Most of the guarantees offered by ZooKeeper are time based
> and
> >>>>>> rely on
> >>>>>>>>>>>>>>> clocks and expiring timers, but I'm hoping to offer some
> >>>>>> guarantees in
> >>>>>>>>>>>>>>> spite of busted clocks, horrible GC perf, VM suspends and
> any
> >>>>>> other way
> >>>>>>>>>>>>>>> time is broken.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Also if people are interested I can go into more detail
> about
> >>>>>> what I am
> >>>>>>>>>>>>>>> trying to write.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> -jc
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>
> >>>
> >
>
>

Re: Ensure Leader hasn't changed

Posted by Flavio Junqueira <fp...@yahoo-inc.com>.
I checked your message again and the contract is that getFreshTimestamp() always gets a fresh timestamp (not older than any returned by that method). One aspect of the contract that is not clear to me: does this freshness property hold across clients or only for a single client? If it holds only per client, then you can guarantee it with master epochs. No user of your application will connect to the master of epoch y < x once it connects to the master of epoch x. The epochs can be the sequence number of a sequential znode. How does it sound?
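
A minimal sketch of that epoch check (znode naming and parsing are illustrative; ZooKeeper appends a 10-digit counter to sequential znode names, e.g. master-0000000042): the client remembers the highest epoch it has seen and refuses to go backwards. As noted, this enforces freshness per client, not across clients:

public class EpochFencingClient {
    private long highestEpochSeen = -1;

    // The sequence counter of the master's sequential znode is its epoch.
    static long epochOf(String znodeName) {
        return Long.parseLong(znodeName.substring(znodeName.length() - 10));
    }

    // Once connected to the master of epoch x, never accept an epoch y < x.
    synchronized boolean acceptMaster(String masterZnodeName) {
        long epoch = epochOf(masterZnodeName);
        if (epoch < highestEpochSeen) {
            return false; // stale master, drop the connection
        }
        highestEpochSeen = epoch;
        return true;
    }
}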

-Flavio  

On Sep 28, 2012, at 5:00 PM, Flavio Junqueira wrote:

> I was thinking that you can do a write per timestamp batch, and not per individual timestamp. In the worst case, a former leader won't use some timestamps, and I would think it is ok, but it depends on your application.
> 
> Also, even if two clients believe they are leaders simultaneously and serve timestamps, the property that you seem to care about the most is uniqueness: a timestamp is not served twice. Order of timestamps would be preserved per leader, but in the case of overlapping leaders you could end up serving timestamps that do not follow an increasing order.
> 
> -Flavio
> 
> On Sep 28, 2012, at 4:37 PM, John Carrino wrote:
> 
>> CompareAndSwap is an atomic check and update. Basically only update the
>> value if it is the same as the expected value.
>> 
>> I think with your approach you'd have to do a write for every single
>> timestamp you wanted to hand out.  The latency hit on this would be too much.
>> 
>> My approach is different in that a timestamp server reserves a bunch of
>> timestamps up front and proceeds to hand them out as long as it is the
>> leader.  The leader check can hopefully be done without hitting disk.
>> 
>> Thanks!
>> 
>> -jc
>> 
>> 
>> On Fri, Sep 28, 2012 at 7:19 AM, Flavio Junqueira <fp...@yahoo-inc.com> wrote:
>> 
>>> I don't know what your compareAndSwap method does, but I was wondering if
>>> your client process can use conditional writes to a znode to make sure that
>>> it was the last one to update the state of timestamp batches. You can treat
>>> the master election problem separately and it does not have to be as strict
>>> as you have been thinking you need. That is, it wouldn't hurt if a client
>>> still thinks it is leading even if it is not because no two clients will be
>>> able to update the state of timestamp blocks without noticing that another
>>> client is also updating it.
>>> 
>>> -Flavio
>>> 
>>> On Sep 27, 2012, at 6:57 PM, John Carrino wrote:
>>> 
>>>> So I think it's time to explain what I'm writing just so everyone has
>>>> more situation awareness.  It's just a timestamp server, nothing fancy.
>>>> 
>>>> Looks like this:
>>>> 
>>>> public interface TimestampService {
>>>>  /**
>>>>   * This will get a fresh timestamp that is guaranteed to be newer than
>>>> any other timestamp
>>>>   * handed out before this method was called.
>>>>   */
>>>>  long getFreshTimestamp();
>>>> }
>>>> 
>>>> The only requirement is that the timestamp handed back is greater than
>>>> every other timestamp that was returned before getFreshTs was called.
>>>> There is no ordering requirement for concurrent requests.
>>>> 
>>>> My impl is to reserve blocks of timestamps that are safe to hand out (1M
>>> at
>>>> a time) using compare and swap in ZK.
>>>> lastPossibleUsed = read(HighWater)
>>>> safeToHandout = compareAndSwap(lastPossibleUsed, lastPossibleUsed+1M)
>>>> 
>>>> Now my leader can hand back timestamps up to safeToHandout, but before it
>>>> hands one out it must ensure it is still the leader (no one else has
>>> handed
>>>> back something higher).
>>>> I can use ensureQuorum(), exists(myEphemNode) to make sure this is the
>>>> case.  Now I have a service that is guaranteed to be correct, but doesn't
>>>> require disk hits in the steady state which brings down my latency (if
>>> you
>>>> get close to running out, you can compareAndSwap for more timestamps).
>>>> 
>>>> If many requests come in at the same time I can use smart batching to
>>>> verify happens after for all of them at once.  We can also add more layers if we
>>>> need more bandwidth to scale up at the cost of adding latency.  Basically
>>>> our latency will be O(lg(requestRate)) if we keep adding layers as each
>>>> previous layer becomes saturated.
>>>> 
>>>> I hope this explanation helps. I am busy for the next 4 hours, but if you
>>>> need more clarification I can respond at that time.
>>>> 
>>>> -jc
>>>> 
>>>> 
>>>> On Thu, Sep 27, 2012 at 9:26 AM, John Carrino <john.carrino@gmail.com
>>>> wrote:
>>>> 
>>>>> First, thanks everyone for talking this through with me.
>>>>> 
>>>>> Flavio, for your example, this is actually ok.  There is a happens after
>>>>> relationship between the client making the request and my leader C1
>>> still
>>>>> being the leader.  My service only needs to guarantee that what it hands
>>>>> back is at least as new as anything that existed when the client made
>>> the
>>>>> request.  If C2 were to answer requests while C1 is stalling that is ok
>>>>> because these would be considered concurrent requests and the stuff
>>>>> returned by C2 may be newer but that doesn't violate any guarantees.
>>>>> 
>>>>> If some client were to get back something from C2 and then (happens
>>> after
>>>>> relationship) someone tried to read from C1, it needs to fail.
>>>>> 
>>>>> To address your concern about adding too much traffic we can get this
>>>>> easily by doing what Martin Thompson calls smart batching (
>>>>> http://mechanical-sympathy.blogspot.com/2011/10/smart-batching.html).
>>>>> 
>>>>> 1. ensureQuorum request comes in to L1
>>>>> 2. send ENSURE to all followers
>>>>> 3. 10 more ensureQuorum requests come in
>>>>> 4. get back ENSURE from quorum
>>>>> 5. we can now service all 10 pending ensureQuorum requests with another
>>>>> round trip ENSURE.
>>>>> 
>>>>> We don't need to send an ENSURE for every ensureQuorum request, we just
>>>>> need it to be happens after from when the request arrived.
>>>>> 
>>>>> I am fine with the Ephemeral node being removed after some time expires,
>>>>> but only by the leader.  If the leader's clock is broken and the client
>>>>> owning the Ephemeral node drops off, then we don't have liveness
>>> (because
>>>>> this node may not get cleaned up in a timely fashion).  However, we
>>> still
>>>>> preserve correctness.
>>>>> 
>>>>> -jc
>>>>> 
>>>>> 
>>>>> On Thu, Sep 27, 2012 at 9:02 AM, Flavio Junqueira <fpj@yahoo-inc.com
>>>> wrote:
>>>>> 
>>>>>> Say that we implement what you're suggesting. Could you check if this
>>>>>> scenario can happen:
>>>>>> 
>>>>>> 1- Client C1 is the current leader and it issues a super boosted read to make
>>> sure
>>>>>> it is still the leader;
>>>>>> 2- We process the super boosted read, having it go through the zab
>>> pipeline;
>>>>>> 3- When we send the response to C1 we slow down the whole deal: the
>>>>>> response to C1 gets delayed and we stall C1;
>>>>>> 4- In the meanwhile, C1's session expires on the server side and its
>>>>>> ephemeral leadership node is removed;
>>>>>> 5- A new client C2 is elected and starts exercising leadership;
>>>>>> 6- Now C1 comes back to normal and receives the response of the super
>>>>>> boosted read saying that it is still the leader.
>>>>>> 
>>>>>> If my interpretation is correct, the only way to prevent this
>>>>>> scenario from happening is if the session expires on the client side
>>> before
>>>>>> it receives the response of the read. It doesn't look like we can do
>>> it if
>>>>>> process clocks can be arbitrarily delayed.
>>>>>> 
>>>>>> Note that one issue is that the behavior of ephemerals is highly
>>>>>> dependent upon timers, so I don't think we can avoid making some timing
>>>>>> assumptions altogether. The question is if we are better off with a
>>>>>> mechanism relying upon acknowledgements. My sense is that
>>> application-level
>>>>>> fencing is preferable (if not necessary) for applications like the
>>> ones JC
>>>>>> is mentioning or BookKeeper.
>>>>>> 
>>>>>> I'm not concerned about writes to disk, which I agree we don't need for
>>>>>> sync. I'm more concerned about having it going through the whole
>>> pipeline,
>>>>>> which will induce more traffic to zab and increase latency for an
>>>>>> application that uses it heavily.
>>>>>> 
>>>>>> -Flavio
>>>>>> 
>>>>>> On Sep 27, 2012, at 5:27 PM, Alexander Shraer wrote:
>>>>>> 
>>>>>>> another idea is to add this functionality to MultiOp - have read only
>>>>>>> transactions be replicated but not logged or logged asynchronously.
>>>>>>> I'm not sure how it works right now if I do a read-only MultiOp
>>>>>>> transaction - does it replicate the transaction or answer it locally
>>>>>>> on the leader ?
>>>>>>> 
>>>>>>> Alex
>>>>>>> 
>>>>>>> On Thu, Sep 27, 2012 at 8:07 AM, Alexander Shraer <sh...@gmail.com>
>>>>>> wrote:
>>>>>>>> Thanks for the explanation.
>>>>>>>> 
>>>>>>>> I guess one could always invoke a write operation instead of sync to
>>>>>>>> get the more strict semantics, but as John suggests, it might be a
>>>>>>>> good idea to add a new type of operation that requires followers to
>>>>>>>> ack but doesn't require them to log to disk - this seems sufficient
>>> in
>>>>>>>> our case.
>>>>>>>> 
>>>>>>>> Alex
>>>>>>>> 
>>>>>>>> On Thu, Sep 27, 2012 at 3:56 AM, Flavio Junqueira <fpj@yahoo-inc.com
>>>> 
>>>>>> wrote:
>>>>>>>>> In theory, the scenario you're describing could happen, but I would
>>>>>> argue that it is unlikely given that: 1) a leader pings followers
>>> twice a
>>>>>> tick to make sure that it has a quorum of supporters (lead()); 2)
>>> followers
>>>>>> give up on a leader upon catching an exception (followLeader()). One
>>> could
>>>>>> calibrate tickTime to make the probability of having this scenario low.
>>>>>>>>> 
>>>>>>>>> Let me also revisit the motivation for the way we designed sync.
>>>>>> ZooKeeper has been designed to serve reads efficiently and making sync
>>> go
>>>>>> through the pipeline would slow down reads. Although optional, we
>>> thought
>>>>>> it would be a good idea to make it as efficient as possible to comply
>>> with
>>>>>> the original expectations for the service. We consequently came up with
>>>>>> this cheap way of making sure that a read sees all pending updates. It
>>> is
>>>>>> correct that there are some corner cases that it doesn't cover. One is
>>> the
>>>>>> case you mentioned. Another is having the sync finishing before the
>>> client
>>>>>> submits the read and having a write committing in between. We rely
>>> upon the
>>>>>> way we implement timeouts and some minimum degree of synchrony for the
>>>>>> clients when submitting operations to guarantee that the scheme works.
>>>>>>>>> 
>>>>>>>>> We thought about the option of having the sync operation going
>>>>>> through the pipeline, and in fact it would have been easier to
>>> implement it
>>>>>> just as a regular write, but we opted not to because we felt it was
>>>>>> sufficient for the use cases we had and more efficient as I already
>>> argued.
>>>>>>>>> 
>>>>>>>>> Hope it helps to clarify.
>>>>>>>>> 
>>>>>>>>> -Flavio
>>>>>>>>> 
>>>>>>>>> On Sep 27, 2012, at 9:38 AM, Alexander Shraer wrote:
>>>>>>>>> 
>>>>>>>>>> thanks for the explanation! but how do you avoid having the
>>> scenario
>>>>>>>>>> raised by John ?
>>>>>>>>>> let's say you're a client connected to F, and F is connected to L.
>>>>>> Let's
>>>>>>>>>> also say that L's pipeline
>>>>>>>>>> is now empty, and both F and L are partitioned from 3 other servers
>>>>>> in
>>>>>>>>>> the system that have already
>>>>>>>>>> elected a new leader L'. Now I go to L' and write something. L
>>> still
>>>>>>>>>> thinks it's the leader because the
>>>>>>>>>> detection that followers left it is obviously timeout dependent. So
>>>>>>>>>> when F sends your sync to L and L returns
>>>>>>>>>> it to F, you actually miss my write!
>>>>>>>>>> 
>>>>>>>>>> Alex
>>>>>>>>>> 
>>>>>>>>>> On Thu, Sep 27, 2012 at 12:32 AM, Flavio Junqueira <
>>>>>> fpj@yahoo-inc.com> wrote:
>>>>>>>>>>> Hi Alex, Because of the following:
>>>>>>>>>>> 
>>>>>>>>>>> 1- A follower F processes operations from a client in FIFO order,
>>>>>> and say that a client submits as you say sync + read;
>>>>>>>>>>> 2- A sync will be processed by the leader and returned to the
>>>>>> follower. It will be queued after all pending updates that the follower
>>>>>> hasn't processed;
>>>>>>>>>>> 3- The follower will process all pending updates before processing
>>>>>> the response of the sync;
>>>>>>>>>>> 4- Once the follower processes the sync, it picks the read
>>>>>> operation to process. It reads the local state of the follower and
>>> returns
>>>>>> to the client.
>>>>>>>>>>> 
>>>>>>>>>>> When we process the read in Step 4, we have applied all pending
>>>>>> updates the leader had for the follower by the time the read request
>>>>>> started.
>>>>>>>>>>> 
>>>>>>>>>>> This implementation is a bit of a hack because it doesn't follow
>>>>>> the same code path as the other operations that go to the leader, but
>>> it
>>>>>> avoids some unnecessary steps, which is important for fast reads. In
>>> the
>>>>>> sync case, the other followers don't really need to know about it
>>> (there is
>>>>>> nothing to be updated) and the leader simply inserts it in the
>>> sequence of
>>>>>> updates of F, ordering it.
>>>>>>>>>>> 
>>>>>>>>>>> -Flavio
>>>>>>>>>>> 
>>>>>>>>>>> On Sep 27, 2012, at 9:12 AM, Alexander Shraer wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Hi Flavio,
>>>>>>>>>>>> 
>>>>>>>>>>>>> Starting a read operation concurrently with a sync implies that
>>>>>> the result of the read will not miss an update committed before the
>>> read
>>>>>> started.
>>>>>>>>>>>> 
>>>>>>>>>>>> I thought that the intention of sync was to give something like
>>>>>>>>>>>> linearizable reads, so if you invoke a sync and then a read, your
>>>>>> read
>>>>>>>>>>>> is guaranteed to (at least) see any write which completed before
>>>>>> the
>>>>>>>>>>>> sync began. Is this the intention ? If so, how is this achieved
>>>>>>>>>>>> without running agreement on the sync op ?
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Alex
>>>>>>>>>>>> 
>>>>>>>>>>>> On Thu, Sep 27, 2012 at 12:05 AM, Flavio Junqueira <
>>>>>> fpj@yahoo-inc.com> wrote:
>>>>>>>>>>>>> sync simply flushes the channel between the leader and the
>>>>>> follower that forwarded the sync operation, so it doesn't go through
>>> the
>>>>>> full zab pipeline. Flushing means that all pending updates from the
>>> leader
>>>>>> to the follower are received by the time sync completes. Starting a
>>> read
>>>>>> operation concurrently with a sync implies that the result of the read
>>> will
>>>>>> not miss an update committed before the read started.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> -Flavio
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Sep 27, 2012, at 3:43 AM, Alexander Shraer wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Its strange that sync doesn't run through agreement, I was
>>> always
>>>>>>>>>>>>>> assuming that it is... Exactly for the reason you say -
>>>>>>>>>>>>>> you may trust your leader, but I may have a different leader
>>> and
>>>>>> your
>>>>>>>>>>>>>> leader may not detect it yet and still think its the leader.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> This seems like a bug to me.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Similarly to Paxos, Zookeeper's safety guarantees don't (or
>>>>>> shouldn't)
>>>>>>>>>>>>>> depend on timing assumption.
>>>>>>>>>>>>>> Only progress guarantees depend on time.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Alex
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Wed, Sep 26, 2012 at 4:41 PM, John Carrino <
>>>>>> john.carrino@gmail.com> wrote:
>>>>>>>>>>>>>>> I have some pretty strong requirements in terms of consistency
>>>>>> where
>>>>>>>>>>>>>>> reading from followers that may be behind in terms of updates
>>>>>> isn't ok for
>>>>>>>>>>>>>>> my use case.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> One error case that worries me is if a follower and leader are
>>>>>> partitioned
>>>>>>>>>>>>>>> off from the network.  A new leader is elected, but the
>>>>>> follower and old
>>>>>>>>>>>>>>> leader don't know about it.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Normally I think sync was made for this purpost, but I looked
>>>>>> at the sync
>>>>>>>>>>>>>>> code and if there aren't any outstanding proposals the leader
>>>>>> sends the
>>>>>>>>>>>>>>> sync right back to the client without first verifying that it
>>>>>> still has
>>>>>>>>>>>>>>> quorum, so this won't work for my use case.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> At the core of the issue all I really need is a call that will
>>>>>> make it's
>>>>>>>>>>>>>>> way to the leader and will ping it's followers, ensure it
>>> still
>>>>>> has a
>>>>>>>>>>>>>>> quorum and return success.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Basically a getCurrentLeaderEpoch() method that will be
>>>>>> forwarded to the
>>>>>>>>>>>>>>> leader, leader will ensure it still has quorum and return it's
>>>>>> epoch.  I
>>>>>>>>>>>>>>> can use this primitive to implement all the other properties I
>>>>>> want to
>>>>>>>>>>>>>>> verify (assuming that my client will never connect to an older
>>>>>> epoch after
>>>>>>>>>>>>>>> this call returns). Also the nice thing about this method is
>>>>>> that it will
>>>>>>>>>>>>>>> not have to hit disk and the latency should just be a round
>>>>>> trip to the
>>>>>>>>>>>>>>> followers.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Most of the guarentees offered by zookeeper are time based an
>>>>>> rely on
>>>>>>>>>>>>>>> clocks and expiring timers, but I'm hoping to offer some
>>>>>> guarantees in
>>>>>>>>>>>>>>> spite of busted clocks, horrible GC perf, VM suspends and any
>>>>>> other way
>>>>>>>>>>>>>>> time is broken.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> Also if people are interested I can go into more detail about
>>>>>> what I am
>>>>>>>>>>>>>>> trying to write.
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> -jc
>>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>> 
>>> 
> 


Re: Ensure Leader hasn't changed

Posted by Flavio Junqueira <fp...@yahoo-inc.com>.
I was thinking that you can do a write per timestamp batch, and not per individual timestamp. In the worst case, a former leader won't use some of its timestamps, which I would think is OK, but it depends on your application.

Also, even if two clients believe they are leaders simultaneously and serve timestamps, the property that you seem to care about the most is uniqueness: a timestamp is never served twice. Order of timestamps would be preserved per leader, but in the case of overlapping leaders you could end up serving timestamps that do not follow an increasing order. For example, if a new leader reserves the next block and serves timestamp 1,000,001, a stale former leader could still serve 999,999 afterwards: both are unique, but they are out of order.

-Flavio

On Sep 28, 2012, at 4:37 PM, John Carrino wrote:

> CompareAndSwap is an atomic check and update. Basically, it only updates
> the value if it is the same as the expected value.
> 
> I think with your approach you'd have to do a write for every single
> timestamp you wanted to hand out.  The latency hit on this would be too much.
> 
> My approach is different in that a timestamp server reserves a bunch of
> timestamps up front and proceeds to hand them out as long as it is the
> leader.  The leader check can hopefully be done without hitting disk.
> 
> Thanks!
> 
> -jc


Re: Ensure Leader hasn't changed

Posted by John Carrino <jo...@gmail.com>.
CompareAndSwap is an atomic check and update. Basically, it only updates
the value if it is the same as the expected value.

I think with your approach you'd have to do a write for every single
timestamp you wanted to hand out.  The latency hit on this would be too much.

My approach is different in that a timestamp server reserves a bunch of
timestamps up front and proceeds to hand them out as long as it is the
leader.  The leader check can hopefully be done without hitting disk.
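
To make that concrete, here's a rough sketch of the serving path.  All
names are made up for illustration: reserveBatch() does the compare and
swap against ZK, and ensureStillLeader() is the quorum check I've been
asking for.

private long next;   // last timestamp handed out
private long limit;  // upper bound of the block reserved via CAS in ZK

public synchronized long getFreshTimestamp() throws Exception {
    if (next >= limit) {
        // Out of reserved timestamps: CAS the high water mark up by 1M.
        limit = reserveBatch(1_000_000);
        next = limit - 1_000_000;  // == the old high water mark
    }
    ensureStillLeader();  // quorum round trip, no disk write
    return ++next;        // hands out values in (oldHighWater, newHighWater]
}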

Thanks!

-jc


On Fri, Sep 28, 2012 at 7:19 AM, Flavio Junqueira <fp...@yahoo-inc.com> wrote:

> I don't know what your compareAndSwap method does, but I was wondering if
> your client process can use conditional writes to a znode to make sure that
> it was the last one to update the state of timestamp batches. You can treat
> the master election problem separately and it does not have to be as strict
> as you have been thinking you need. That is, it wouldn't hurt if a client
> still thinks it is leading even if it is not because no two clients will be
> able to update the state of timestamp blocks without noticing that another
> client is also updating it.
>
> -Flavio

Re: Ensure Leader hasn't changed

Posted by Flavio Junqueira <fp...@yahoo-inc.com>.
I don't know what your compareAndSwap method does, but I was wondering if your client process can use conditional writes to a znode to make sure that it was the last one to update the state of timestamp batches. You can treat the master election problem separately and it does not have to be as strict as you have been thinking you need. That is, it wouldn't hurt if a client still thinks it is leading even if it is not, because no two clients will be able to update the state of timestamp blocks without noticing that another client is also updating it.
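
As a minimal sketch of what I mean, assuming the high water mark lives
in a znode (the path, encoding, and batch size here are illustrative,
not a prescription), a versioned setData gives you the conditional
write:

long reserveBatch(ZooKeeper zk, long batchSize) throws Exception {
    while (true) {
        Stat stat = new Stat();
        byte[] raw = zk.getData("/app/ts-highwater", false, stat);
        long highWater = Long.parseLong(new String(raw, "UTF-8"));
        try {
            // Succeeds only if nobody updated the znode since we read it.
            zk.setData("/app/ts-highwater",
                       Long.toString(highWater + batchSize).getBytes("UTF-8"),
                       stat.getVersion());
            return highWater + batchSize;  // new safe-to-hand-out bound
        } catch (KeeperException.BadVersionException e) {
            // Another client won the race; re-read and retry.
        }
    }
}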

-Flavio

On Sep 27, 2012, at 6:57 PM, John Carrino wrote:

> So I think it's time to explain what I'm writing just so everyone has
> more situational awareness.  It's just a timestamp server, nothing fancy.
> 
> Looks like this:
> 
> public interface TimestampService {
>    /**
>     * This will get a fresh timestamp that is guaranteed to be newer than
> any other timestamp
>     * handed out before this method was called.
>     */
>    long getFreshTimestamp();
> }
> 
> The only requirement is that the timestamp handed back is greater than
> every other timestamp that was returned before getFreshTs was called.
> There is no ordering requirement for concurrent requests.
> 
> My impl is to reserve blocks of timestamps that are safe to hand out (1M at
> a time) using compare and swap in ZK.
> lastPossibleUsed = read(HighWater)
> safeToHandout = compareAndSwap(lastPossibleUsed, lastPossibleUsed+1M)
> 
> Now my leader can hand back timestamps up to safeToHandout, but before it
> hands one out it must ensure it is still the leader (no one else has handed
> back something higher).
> I can use ensureQuorum(), exists(myEphemNode) to make sure this is the
> case.  Now I have a service that is guaranteed to be correct, but doesn't
> require disk hits in the steady state which brings down my latency (if you
> get close to running out, you can compareAndSwap for more timestamps).
> 
> If many requests come in at the same time I can use smart batching to
> verify happens-after for all of them at once.  We can also add more layers if we
> need more bandwidth to scale up at the cost of adding latency.  Basically
> our latency will be O(lg(requestRate)) if we keep adding layers as each
> previous layer becomes saturated.
> 
> I hope this explanation helps. I am busy for the next 4 hours, but if you
> need more clarification I can respond to them at that time.
> 
> -jc


Re: Ensure Leader hasn't changed

Posted by John Carrino <jo...@gmail.com>.
So I think it's time to explain what I'm writing just so everyone has
more situational awareness.  It's just a timestamp server, nothing fancy.

Looks like this:

public interface TimestampService {
    /**
     * This will get a fresh timestamp that is guaranteed to be newer than
any other timestamp
     * handed out before this method was called.
     */
    long getFreshTimestamp();
}

The only requirement is that the timestamp handed back is greater than
every other timestamp that was returned before getFreshTs was called.
 There is no ordering requirement for concurrent requests.

My impl is to reserve blocks of timestamps that are safe to hand out (1M at
a time) using compare and swap in ZK.
lastPossibleUsed = read(HighWater)
safeToHandout = compareAndSwap(lastPossibleUsed, lastPossibleUsed+1M)

Now my leader can hand back timestamps up to safeToHandout, but before it
hands one out it must ensure it is still the leader (no one else has handed
back something higher).
I can use ensureQuorum(), exists(myEphemNode) to make sure this is the
case.  Now I have a service that is guaranteed to be correct, but doesn't
require disk hits in the steady state which brings down my latency (if you
get close to running out, you can compareAndSwap for more timestamps).
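
Roughly, the check I want looks like this (ensureQuorum() is the new
primitive I'm proposing, not an existing ZooKeeper call; the node path
is whatever the election recipe created):

// Only correct if ensureQuorum() gives happens-after: the leader
// confirmed it still had a quorum after this request arrived.
boolean stillLeader(ZooKeeper zk, String myEphemNode) throws Exception {
    zk.ensureQuorum();  // hypothetical API: round trip to followers, no disk
    return zk.exists(myEphemNode, false) != null;  // ephemeral node still there?
}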

If many requests come in at the same time I can use smart batching to
verify happens-after for all of them at once.  We can also add more layers if we
need more bandwidth to scale up at the cost of adding latency.  Basically
our latency will be O(lg(requestRate)) if we keep adding layers as each
previous layer becomes saturated.
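
For example, a single dispatcher thread can do the batching (names are
invented; each Runnable completes one waiting request):

private final BlockingQueue<Runnable> pending = new LinkedBlockingQueue<Runnable>();

void batchingLoop() throws Exception {
    List<Runnable> batch = new ArrayList<Runnable>();
    while (true) {
        batch.clear();
        batch.add(pending.take());  // block until at least one request waits
        pending.drainTo(batch);     // grab everything that piled up meanwhile
        ensureQuorum();             // one round trip covers the whole batch
        for (Runnable complete : batch) {
            complete.run();         // happens-after holds for each request
        }
    }
}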

I hope this explanation helps. I am busy for the next 4 hours, but if you
need more clarification I can respond to them at that time.

-jc


On Thu, Sep 27, 2012 at 9:26 AM, John Carrino <jo...@gmail.com> wrote:

> First, thanks everyone for talking this through with me.
>
> Flavio, for your example, this is actually ok.  There is a happens after
> relationship between the client making the request and my leader C1 still
> being the leader.  My service only needs to guarantee that what it hands
> back is at least as new as anything that existed when the client made the
> request.  If C2 were to answer requests while C1 is stalling that is ok
> because these would be considered concurrent requests and the stuff
> returned by C2 may be newer but that doesn't violate any guarantees.
>
> If some client were to get back something from C2 and then (happens after
> relationship) someone tried to read from C1, it needs to fail.
>
> To address your concern of adding too much bandwidth we can get this
> easily by doing what Martin Thompson calls smart batching (
> http://mechanical-sympathy.blogspot.com/2011/10/smart-batching.html).
>
> 1. ensureQuorum request comes in to L1
> 2. send ENSURE to all followers
> 3. 10 more ensureQuorum requests come in
> 4. get back ENSURE from quorum
> 5. we can now service all 10 pending ensureQuorum requests with another
> round trip ENSURE.
>
> We don't need to send an ENSURE for every ensureQuorum request, we just
> need it to be happens after from when the request arrived.
>
> I am fine with the Ephemeral node being removed after some time expires,
> but only by the leader.  If the leader's clock is broken and the client
> owning the Ephemeral node drops off, then we don't have liveness (because
> this node may not get cleaned up in a timely fashion).  However, we still
> preserve correctness.
>
> -jc
>
>
> On Thu, Sep 27, 2012 at 9:02 AM, Flavio Junqueira <fp...@yahoo-inc.com>wrote:
>
>> Say that we implement what you're suggesting. Could you check if this
>> scenario can happen:
>>
>> 1- Client C1 is the current leader and it super boosted read to make sure
>> it is still the leader;
>> 2- We process the super boosted read having it through the zab pipeline;
>> 3- When we send the response to C1 we slow down the whole deal: the
>> response to C1 gets delayed and we stall C1;
>> 4- In the meanwhile, C1's session expires on the server side and its
>> ephemeral leadership node is removed;
>> 5- A new client C2 is elected and starts exercising leadership;
>> 6- Now C1 comes back to normal and receives the response of the super
>> boosted read saying that it is still the leader.
>>
>> If my interpretation is correct, the only way to prevent this scenario
>> from happening is if the session expires on the client side before it
>> receives the response of the read. It doesn't look like we can do that if
>> process clocks can be arbitrarily delayed.
>>
>> Note that one issue is that the behavior of ephemerals is highly
>> dependent upon timers, so I don't think we can avoid making some timing
>> assumptions altogether. The question is if we are better off with a
>> mechanism relying upon acknowledgements. My sense is that application-level
>> fencing is preferable (if not necessary) for applications like the ones JC
>> is mentioning or BookKeeper.
>>
>> I'm not concerned about writes to disk, which I agree we don't need for
>> sync. I'm more concerned about having it going through the whole pipeline,
>> which will induce more traffic to zab and increase latency for an
>> application that uses it heavily.
>>
>> -Flavio
>>
>> On Sep 27, 2012, at 5:27 PM, Alexander Shraer wrote:
>>
>> > another idea is to add this functionality to MultiOp - have read-only
>> > transactions be replicated but not logged, or logged asynchronously.
>> > I'm not sure how it works right now if I do a read-only MultiOp
>> > transaction - does it replicate the transaction or answer it locally
>> > on the leader?
>> >
>> > Alex
>> >
>> > On Thu, Sep 27, 2012 at 8:07 AM, Alexander Shraer <sh...@gmail.com> wrote:
>> >> Thanks for the explanation.
>> >>
>> >> I guess one could always invoke a write operation instead of sync to
>> >> get the more strict semantics, but as John suggests, it might be a
>> >> good idea to add a new type of operation that requires followers to
>> >> ack but doesn't require them to log to disk - this seems sufficient in
>> >> our case.
>> >>
>> >> Alex
>> >>
>> >> On Thu, Sep 27, 2012 at 3:56 AM, Flavio Junqueira <fp...@yahoo-inc.com> wrote:
>> >>> In theory, the scenario you're describing could happen, but I would
>> >>> argue that it is unlikely given that: 1) a leader pings followers
>> >>> twice a tick to make sure that it has a quorum of supporters (lead());
>> >>> 2) followers give up on a leader upon catching an exception
>> >>> (followLeader()). One could calibrate tickTime to make the probability
>> >>> of having this scenario low.
>> >>>
>> >>> Let me also revisit the motivation for the way we designed sync.
>> >>> ZooKeeper has been designed to serve reads efficiently, and making
>> >>> sync go through the pipeline would slow down reads. Although optional,
>> >>> we thought it would be a good idea to make it as efficient as possible
>> >>> to comply with the original expectations for the service. We
>> >>> consequently came up with this cheap way of making sure that a read
>> >>> sees all pending updates. It is correct that there are some corner
>> >>> cases that it doesn't cover. One is the case you mentioned. Another is
>> >>> having the sync finish before the client submits the read and having a
>> >>> write commit in between. We rely upon the way we implement timeouts
>> >>> and some minimum degree of synchrony for the clients when submitting
>> >>> operations to guarantee that the scheme works.
>> >>>
>> >>> We thought about the option of having the sync operation go through
>> >>> the pipeline, and in fact it would have been easier to implement it
>> >>> just as a regular write, but we opted not to because we felt it was
>> >>> sufficient for the use cases we had and more efficient as I already
>> >>> argued.
>> >>>
>> >>> Hope it helps to clarify.
>> >>>
>> >>> -Flavio
>> >>>
>> >>> On Sep 27, 2012, at 9:38 AM, Alexander Shraer wrote:
>> >>>
>> >>>> thanks for the explanation! but how do you avoid having the scenario
>> >>>> raised by John?
>> >>>> Let's say you're a client connected to F, and F is connected to L.
>> >>>> Let's also say that L's pipeline is now empty, and both F and L are
>> >>>> partitioned from 3 other servers in the system that have already
>> >>>> elected a new leader L'. Now I go to L' and write something. L still
>> >>>> thinks it's the leader because the detection that followers left it
>> >>>> is obviously timeout dependent. So when F sends your sync to L and L
>> >>>> returns it to F, you actually miss my write!
>> >>>>
>> >>>> Alex
>> >>>>
>> >>>> On Thu, Sep 27, 2012 at 12:32 AM, Flavio Junqueira <fpj@yahoo-inc.com> wrote:
>> >>>>> Hi Alex, Because of the following:
>> >>>>>
>> >>>>> 1- A follower F processes operations from a client in FIFO order,
>> >>>>> and say that a client submits, as you say, sync + read;
>> >>>>> 2- A sync will be processed by the leader and returned to the
>> >>>>> follower. It will be queued after all pending updates that the
>> >>>>> follower hasn't processed;
>> >>>>> 3- The follower will process all pending updates before processing
>> >>>>> the response of the sync;
>> >>>>> 4- Once the follower processes the sync, it picks the read operation
>> >>>>> to process. It reads the local state of the follower and returns to
>> >>>>> the client.
>> >>>>>
>> >>>>> When we process the read in Step 4, we have applied all pending
>> >>>>> updates the leader had for the follower by the time the read request
>> >>>>> started.
>> >>>>>
>> >>>>> This implementation is a bit of a hack because it doesn't follow the
>> >>>>> same code path as the other operations that go to the leader, but it
>> >>>>> avoids some unnecessary steps, which is important for fast reads. In
>> >>>>> the sync case, the other followers don't really need to know about
>> >>>>> it (there is nothing to be updated) and the leader simply inserts it
>> >>>>> in the sequence of updates of F, ordering it.
>> >>>>>
>> >>>>> -Flavio
>> >>>>>
>> >>>>> On Sep 27, 2012, at 9:12 AM, Alexander Shraer wrote:
>> >>>>>
>> >>>>>> Hi Flavio,
>> >>>>>>
>> >>>>>>> Starting a read operation concurrently with a sync implies that
>> >>>>>>> the result of the read will not miss an update committed before
>> >>>>>>> the read started.
>> >>>>>>
>> >>>>>> I thought that the intention of sync was to give something like
>> >>>>>> linearizable reads, so if you invoke a sync and then a read, your
>> >>>>>> read is guaranteed to (at least) see any write which completed
>> >>>>>> before the sync began. Is this the intention? If so, how is this
>> >>>>>> achieved without running agreement on the sync op?
>> >>>>>>
>> >>>>>> Thanks,
>> >>>>>> Alex
>> >>>>>>
>> >>>>>> On Thu, Sep 27, 2012 at 12:05 AM, Flavio Junqueira <fpj@yahoo-inc.com> wrote:
>> >>>>>>> sync simply flushes the channel between the leader and the
>> >>>>>>> follower that forwarded the sync operation, so it doesn't go
>> >>>>>>> through the full zab pipeline. Flushing means that all pending
>> >>>>>>> updates from the leader to the follower are received by the time
>> >>>>>>> sync completes. Starting a read operation concurrently with a sync
>> >>>>>>> implies that the result of the read will not miss an update
>> >>>>>>> committed before the read started.
>> >>>>>>>
>> >>>>>>> -Flavio
>> >>>>>>>
>> >>>>>>> On Sep 27, 2012, at 3:43 AM, Alexander Shraer wrote:
>> >>>>>>>
>> >>>>>>>> It's strange that sync doesn't run through agreement; I was
>> >>>>>>>> always assuming that it does... Exactly for the reason you say -
>> >>>>>>>> you may trust your leader, but I may have a different leader, and
>> >>>>>>>> your leader may not detect it yet and still think it's the leader.
>> >>>>>>>>
>> >>>>>>>> This seems like a bug to me.
>> >>>>>>>>
>> >>>>>>>> Similarly to Paxos, ZooKeeper's safety guarantees don't (or
>> >>>>>>>> shouldn't) depend on timing assumptions.
>> >>>>>>>> Only progress guarantees depend on time.
>> >>>>>>>>
>> >>>>>>>> Alex
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> On Wed, Sep 26, 2012 at 4:41 PM, John Carrino <john.carrino@gmail.com> wrote:
>> >>>>>>>>> I have some pretty strong requirements in terms of consistency
>> where
>> >>>>>>>>> reading from followers that may be behind in terms of updates
>> isn't ok for
>> >>>>>>>>> my use case.
>> >>>>>>>>>
>> >>>>>>>>> One error case that worries me is if a follower and leader are
>> partitioned
>> >>>>>>>>> off from the network.  A new leader is elected, but the
>> follower and old
>> >>>>>>>>> leader don't know about it.
>> >>>>>>>>>
>> >>>>>>>>> Normally I think sync was made for this purpost, but I looked
>> at the sync
>> >>>>>>>>> code and if there aren't any outstanding proposals the leader
>> sends the
>> >>>>>>>>> sync right back to the client without first verifying that it
>> still has
>> >>>>>>>>> quorum, so this won't work for my use case.
>> >>>>>>>>>
>> >>>>>>>>> At the core of the issue all I really need is a call that will
>> make it's
>> >>>>>>>>> way to the leader and will ping it's followers, ensure it still
>> has a
>> >>>>>>>>> quorum and return success.
>> >>>>>>>>>
>> >>>>>>>>> Basically a getCurrentLeaderEpoch() method that will be
>> forwarded to the
>> >>>>>>>>> leader, leader will ensure it still has quorum and return it's
>> epoch.  I
>> >>>>>>>>> can use this primitive to implement all the other properties I
>> want to
>> >>>>>>>>> verify (assuming that my client will never connect to an older
>> epoch after
>> >>>>>>>>> this call returns). Also the nice thing about this method is
>> that it will
>> >>>>>>>>> not have to hit disk and the latency should just be a round
>> trip to the
>> >>>>>>>>> followers.
>> >>>>>>>>>
>> >>>>>>>>> Most of the guarentees offered by zookeeper are time based an
>> rely on
>> >>>>>>>>> clocks and expiring timers, but I'm hoping to offer some
>> guarantees in
>> >>>>>>>>> spite of busted clocks, horrible GC perf, VM suspends and any
>> other way
>> >>>>>>>>> time is broken.
>> >>>>>>>>>
>> >>>>>>>>> Also if people are interested I can go into more detail about
>> what I am
>> >>>>>>>>> trying to write.
>> >>>>>>>>>
>> >>>>>>>>> -jc
>> >>>>>>>
>> >>>>>
>> >>>
>>
>>
>

Re: Ensure Leader hasn't changed

Posted by John Carrino <jo...@gmail.com>.
First, thanks everyone for talking this through with me.

Flavio, for your example, this is actually ok.  There is a happens after
relationship between the client making the request and my leader C1 still
being the leader.  My service only needs to guarantee that what it hands
back is at least as new as anything that existed when the client made the
request.  If C2 were to answer requests while C1 is stalling that is ok
because these would be considered concurrent requests and the stuff
returned by C2 may be newer but that doesn't violate any guarentees.

If some client were to get back something from C2 and then (happens after
relationship) someone tried to read from C1, it needs to fail.

To address your concern of adding too much bandwidth we can get this easily
by doing what Martin Thompson calls smart batching (
http://mechanical-sympathy.blogspot.com/2011/10/smart-batching.html).

1. ensureQuorum request comes in to L1
2. send ENSURE to all followers
3. 10 more ensureQuorum requests come in
4. get back ENSURE from quorum
5. we can now service all 10 pending ensureQuorum requests with another
round trip ENSURE.

We don't need to send an ENSURE for every ensureQuorum request, we just
need it to be happens after from when the request arrived.

I am fine with the Ephemeral node being removed after some time expires,
but only by the leader.  If the leaders clock is broken and the client
owning the Ephemeral node drops off, then we don't have liveness (because
this node may not get cleaned up in a timely fashion).  However, we still
preserve corectness.

-jc


On Thu, Sep 27, 2012 at 9:02 AM, Flavio Junqueira <fp...@yahoo-inc.com> wrote:

> Say that we implement what you're suggesting. Could you check if this
> scenario can happen:
>
> 1- Client C1 is the current leader and it super boosted read to make sure
> it is still the leader;
> 2- We process the super boosted read having it through the zab pipeline;
> 3- When we send the response to C1 we slow down the whole deal: the
> response to C1 gets delayed and we stall C1;
> 4- In the meanwhile, C1's session expires on the server side and its
> ephemeral leadership node is removed;
> 5- A new client C2 is elected and starts exercising leadership;
> 6- Now C1 comes back to normal and receives the response of the super
> boosted read saying that it is still the leader.
>
> If my interpretation is not incorrect, the only way to prevent this
> scenario from happening is if the session expires on the client side before
> it receives the response of the read. It doesn't look like we can do it if
> process clocks can be arbitrarily delayed.
>
> Note that one issue is that the behavior of ephemerals is highly dependent
> upon timers, so I don't think we can avoid making some timing assumptions
> altogether. The question is if we are better off with a mechanism relying
> upon acknowledgements. My sense is that application-level fencing is
> preferable (if not necessary) for applications like the ones JC is
> mentioning or BookKeeper.
>
> I'm not concerned about writes to disk, which I agree we don't need for
> sync. I'm more concerned about having it going through the whole pipeline,
> which will induce more traffic to zab and increase latency for an
> application that uses it heavily.
>
> -Flavio
>
> On Sep 27, 2012, at 5:27 PM, Alexander Shraer wrote:
>
> > another idea is to add this functionality to MultiOp - have read only
> > transactions be replicated but not logged or logged asynchronously.
> > I'm not sure how it works right now if I do a read-only MultiOp
> > transaction - does it replicate the transaction or answer it locally
> > on the leader ?
> >
> > Alex
> >
> > On Thu, Sep 27, 2012 at 8:07 AM, Alexander Shraer <sh...@gmail.com>
> wrote:
> >> Thanks for the explanation.
> >>
> >> I guess one could always invoke a write operation instead of sync to
> >> get the more strict semantics, but as John suggests, it might be a
> >> good idea to add a new type of operation that requires followers to
> >> ack but doesn't require them to log to disk - this seems sufficient in
> >> our case.
> >>
> >> Alex
> >>
> >> On Thu, Sep 27, 2012 at 3:56 AM, Flavio Junqueira <fp...@yahoo-inc.com>
> wrote:
> >>> In theory, the scenario you're describing could happen, but I would
> argue that it is unlikely given that: 1) a leader pings followers twice a
> tick to make sure that it has a quorum of supporters (lead()); 2) followers
> give up on a leader upon catching an exception (followLeader()). One could
> calibrate tickTime to make the probability of having this scenario low.
> >>>
> >>> Let me also revisit the motivation for the way we designed sync.
> ZooKeeper has been designed to serve reads efficiently and making sync go
> through the pipeline would slow down reads. Although optional, we thought
> it would be a good idea to make it as efficient as possible to comply with
> the original expectations for the service. We consequently came up with
> this cheap way of making sure that a read sees all pending updates. It is
> correct that there are some corner cases that it doesn't cover. One is the
> case you mentioned. Another is having the sync finishing before the client
> submits the read and having a write committing in between. We rely upon the
> way we implement timeouts and some minimum degree of synchrony for the
> clients when submitting operations to guarantee that the scheme work.
> >>>
> >>> We thought about the option of having the sync operation going through
> the pipeline, and in fact it would have been easier to implement it just as
> a regular write, but we opted not to because we felt it was sufficient for
> the use cases we had and more efficient as I already argued.
> >>>
> >>> Hope it helps to clarify.
> >>>
> >>> -Flavio
> >>>
> >>> On Sep 27, 2012, at 9:38 AM, Alexander Shraer wrote:
> >>>
> >>>> thanks for the explanation! but how do you avoid having the scenario
> >>>> raised by John ?
> >>>> lets say you're a client connected to F, and F is connected to L. Lets
> >>>> also say that L's pipeline
> >>>> is now empty, and both F and L are partitioned from 3 other servers in
> >>>> the system that have already
> >>>> elected a new leader L'. Now I go to L' and write something. L still
> >>>> thinks its the leader because the
> >>>> detection that followers left it is obviously timeout dependent. So
> >>>> when F sends your sync to L and L returns
> >>>> it to F, you actually miss my write!
> >>>>
> >>>> Alex
> >>>>
> >>>> On Thu, Sep 27, 2012 at 12:32 AM, Flavio Junqueira <fp...@yahoo-inc.com>
> wrote:
> >>>>> Hi Alex, Because of the following:
> >>>>>
> >>>>> 1- A follower F processes operations from a client in FIFO order,
> and say that a client submits as you say sync + read;
> >>>>> 2- A sync will be processed by the leader and returned to the
> follower. It will be queued after all pending updates that the follower
> hasn't processed;
> >>>>> 3- The follower will process all pending updates before processing
> the response of the sync;
> >>>>> 4- Once the follower processes the sync, it picks the read operation
> to process. It reads the local state of the follower and returns to the
> client.
> >>>>>
> >>>>> When we process the read in Step 4, we have applied all pending
> updates the leader had for the follower by the time the read request
> started.
> >>>>>
> >>>>> This implementation is a bit of a hack because it doesn't follow the
> same code path as the other operations that go to the leader, but it avoids
> some unnecessary steps, which is important for fast reads. In the sync
> case, the other followers don't really need to know about it (there is
> nothing to be updated) and the leader simply inserts it in the sequence of
> updates of F, ordering it.
> >>>>>
> >>>>> -Flavio
> >>>>>
> >>>>> On Sep 27, 2012, at 9:12 AM, Alexander Shraer wrote:
> >>>>>
> >>>>>> Hi Flavio,
> >>>>>>
> >>>>>>> Starting a read operation concurrently with a sync implies that
> the result of the read will not miss an update committed before the read
> started.
> >>>>>>
> >>>>>> I thought that the intention of sync was to give something like
> >>>>>> linearizable reads, so if you invoke a sync and then a read, your
> read
> >>>>>> is guaranteed to (at least) see any write which completed before the
> >>>>>> sync began. Is this the intention ? If so, how is this achieved
> >>>>>> without running agreement on the sync op ?
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Alex
> >>>>>>
> >>>>>> On Thu, Sep 27, 2012 at 12:05 AM, Flavio Junqueira <
> fpj@yahoo-inc.com> wrote:
> >>>>>>> sync simply flushes the channel between the leader and the
> follower that forwarded the sync operation, so it doesn't go through the
> full zab pipeline. Flushing means that all pending updates from the leader
> to the follower are received by the time sync completes. Starting a read
> operation concurrently with a sync implies that the result of the read will
> not miss an update committed before the read started.
> >>>>>>>
> >>>>>>> -Flavio
> >>>>>>>
> >>>>>>> On Sep 27, 2012, at 3:43 AM, Alexander Shraer wrote:
> >>>>>>>
> >>>>>>>> Its strange that sync doesn't run through agreement, I was always
> >>>>>>>> assuming that it is... Exactly for the reason you say -
> >>>>>>>> you may trust your leader, but I may have a different leader and
> your
> >>>>>>>> leader may not detect it yet and still think its the leader.
> >>>>>>>>
> >>>>>>>> This seems like a bug to me.
> >>>>>>>>
> >>>>>>>> Similarly to Paxos, Zookeeper's safety guarantees don't (or
> shouldn't)
> >>>>>>>> depend on timing assumption.
> >>>>>>>> Only progress guarantees depend on time.
> >>>>>>>>
> >>>>>>>> Alex
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Wed, Sep 26, 2012 at 4:41 PM, John Carrino <
> john.carrino@gmail.com> wrote:
> >>>>>>>>> I have some pretty strong requirements in terms of consistency
> where
> >>>>>>>>> reading from followers that may be behind in terms of updates
> isn't ok for
> >>>>>>>>> my use case.
> >>>>>>>>>
> >>>>>>>>> One error case that worries me is if a follower and leader are
> partitioned
> >>>>>>>>> off from the network.  A new leader is elected, but the follower
> and old
> >>>>>>>>> leader don't know about it.
> >>>>>>>>>
> >>>>>>>>> Normally I think sync was made for this purpost, but I looked at
> the sync
> >>>>>>>>> code and if there aren't any outstanding proposals the leader
> sends the
> >>>>>>>>> sync right back to the client without first verifying that it
> still has
> >>>>>>>>> quorum, so this won't work for my use case.
> >>>>>>>>>
> >>>>>>>>> At the core of the issue all I really need is a call that will
> make it's
> >>>>>>>>> way to the leader and will ping it's followers, ensure it still
> has a
> >>>>>>>>> quorum and return success.
> >>>>>>>>>
> >>>>>>>>> Basically a getCurrentLeaderEpoch() method that will be
> forwarded to the
> >>>>>>>>> leader, leader will ensure it still has quorum and return it's
> epoch.  I
> >>>>>>>>> can use this primitive to implement all the other properties I
> want to
> >>>>>>>>> verify (assuming that my client will never connect to an older
> epoch after
> >>>>>>>>> this call returns). Also the nice thing about this method is
> that it will
> >>>>>>>>> not have to hit disk and the latency should just be a round trip
> to the
> >>>>>>>>> followers.
> >>>>>>>>>
> >>>>>>>>> Most of the guarentees offered by zookeeper are time based an
> rely on
> >>>>>>>>> clocks and expiring timers, but I'm hoping to offer some
> guarantees in
> >>>>>>>>> spite of busted clocks, horrible GC perf, VM suspends and any
> other way
> >>>>>>>>> time is broken.
> >>>>>>>>>
> >>>>>>>>> Also if people are interested I can go into more detail about
> what I am
> >>>>>>>>> trying to write.
> >>>>>>>>>
> >>>>>>>>> -jc
> >>>>>>>
> >>>>>
> >>>
>
>

Re: Ensure Leader hasn't changed

Posted by Flavio Junqueira <fp...@yahoo-inc.com>.
Say that we implement what you're suggesting. Could you check if this scenario can happen:

1- Client C1 is the current leader and it super boosted read to make sure it is still the leader;
2- We process the super boosted read having it through the zab pipeline;
3- When we send the response to C1 we slow down the whole deal: the response to C1 gets delayed and we stall C1;
4- In the meanwhile, C1's session expires on the server side and its ephemeral leadership node is removed;
5- A new client C2 is elected and starts exercising leadership;
6- Now C1 comes back to normal and receives the response of the super boosted read saying that it is still the leader.

If my interpretation is not incorrect, the only way to prevent this scenario from happening is if the session expires on the client side before it receives the response of the read. It doesn't look like we can do it if process clocks can be arbitrarily delayed.

Note that one issue is that the behavior of ephemerals is highly dependent upon timers, so I don't think we can avoid making some timing assumptions altogether. The question is if we are better off with a mechanism relying upon acknowledgements. My sense is that application-level fencing is preferable (if not necessary) for applications like the ones JC is mentioning or BookKeeper.

I'm not concerned about writes to disk, which I agree we don't need for sync. I'm more concerned about having it going through the whole pipeline, which will induce more traffic to zab and increase latency for an application that uses it heavily.

-Flavio

On Sep 27, 2012, at 5:27 PM, Alexander Shraer wrote:

> another idea is to add this functionality to MultiOp - have read only
> transactions be replicated but not logged or logged asynchronously.
> I'm not sure how it works right now if I do a read-only MultiOp
> transaction - does it replicate the transaction or answer it locally
> on the leader ?
> 
> Alex
> 
> On Thu, Sep 27, 2012 at 8:07 AM, Alexander Shraer <sh...@gmail.com> wrote:
>> Thanks for the explanation.
>> 
>> I guess one could always invoke a write operation instead of sync to
>> get the more strict semantics, but as John suggests, it might be a
>> good idea to add a new type of operation that requires followers to
>> ack but doesn't require them to log to disk - this seems sufficient in
>> our case.
>> 
>> Alex
>> 
>> On Thu, Sep 27, 2012 at 3:56 AM, Flavio Junqueira <fp...@yahoo-inc.com> wrote:
>>> In theory, the scenario you're describing could happen, but I would argue that it is unlikely given that: 1) a leader pings followers twice a tick to make sure that it has a quorum of supporters (lead()); 2) followers give up on a leader upon catching an exception (followLeader()). One could calibrate tickTime to make the probability of having this scenario low.
>>> 
>>> Let me also revisit the motivation for the way we designed sync. ZooKeeper has been designed to serve reads efficiently and making sync go through the pipeline would slow down reads. Although optional, we thought it would be a good idea to make it as efficient as possible to comply with the original expectations for the service. We consequently came up with this cheap way of making sure that a read sees all pending updates. It is correct that there are some corner cases that it doesn't cover. One is the case you mentioned. Another is having the sync finishing before the client submits the read and having a write committing in between. We rely upon the way we implement timeouts and some minimum degree of synchrony for the clients when submitting operations to guarantee that the scheme work.
>>> 
>>> We thought about the option of having the sync operation going through the pipeline, and in fact it would have been easier to implement it just as a regular write, but we opted not to because we felt it was sufficient for the use cases we had and more efficient as I already argued.
>>> 
>>> Hope it helps to clarify.
>>> 
>>> -Flavio
>>> 
>>> On Sep 27, 2012, at 9:38 AM, Alexander Shraer wrote:
>>> 
>>>> thanks for the explanation! but how do you avoid having the scenario
>>>> raised by John ?
>>>> lets say you're a client connected to F, and F is connected to L. Lets
>>>> also say that L's pipeline
>>>> is now empty, and both F and L are partitioned from 3 other servers in
>>>> the system that have already
>>>> elected a new leader L'. Now I go to L' and write something. L still
>>>> thinks its the leader because the
>>>> detection that followers left it is obviously timeout dependent. So
>>>> when F sends your sync to L and L returns
>>>> it to F, you actually miss my write!
>>>> 
>>>> Alex
>>>> 
>>>> On Thu, Sep 27, 2012 at 12:32 AM, Flavio Junqueira <fp...@yahoo-inc.com> wrote:
>>>>> Hi Alex, Because of the following:
>>>>> 
>>>>> 1- A follower F processes operations from a client in FIFO order, and say that a client submits as you say sync + read;
>>>>> 2- A sync will be processed by the leader and returned to the follower. It will be queued after all pending updates that the follower hasn't processed;
>>>>> 3- The follower will process all pending updates before processing the response of the sync;
>>>>> 4- Once the follower processes the sync, it picks the read operation to process. It reads the local state of the follower and returns to the client.
>>>>> 
>>>>> When we process the read in Step 4, we have applied all pending updates the leader had for the follower by the time the read request started.
>>>>> 
>>>>> This implementation is a bit of a hack because it doesn't follow the same code path as the other operations that go to the leader, but it avoids some unnecessary steps, which is important for fast reads. In the sync case, the other followers don't really need to know about it (there is nothing to be updated) and the leader simply inserts it in the sequence of updates of F, ordering it.
>>>>> 
>>>>> -Flavio
>>>>> 
>>>>> On Sep 27, 2012, at 9:12 AM, Alexander Shraer wrote:
>>>>> 
>>>>>> Hi Flavio,
>>>>>> 
>>>>>>> Starting a read operation concurrently with a sync implies that the result of the read will not miss an update committed before the read started.
>>>>>> 
>>>>>> I thought that the intention of sync was to give something like
>>>>>> linearizable reads, so if you invoke a sync and then a read, your read
>>>>>> is guaranteed to (at least) see any write which completed before the
>>>>>> sync began. Is this the intention ? If so, how is this achieved
>>>>>> without running agreement on the sync op ?
>>>>>> 
>>>>>> Thanks,
>>>>>> Alex
>>>>>> 
>>>>>> On Thu, Sep 27, 2012 at 12:05 AM, Flavio Junqueira <fp...@yahoo-inc.com> wrote:
>>>>>>> sync simply flushes the channel between the leader and the follower that forwarded the sync operation, so it doesn't go through the full zab pipeline. Flushing means that all pending updates from the leader to the follower are received by the time sync completes. Starting a read operation concurrently with a sync implies that the result of the read will not miss an update committed before the read started.
>>>>>>> 
>>>>>>> -Flavio
>>>>>>> 
>>>>>>> On Sep 27, 2012, at 3:43 AM, Alexander Shraer wrote:
>>>>>>> 
>>>>>>>> Its strange that sync doesn't run through agreement, I was always
>>>>>>>> assuming that it is... Exactly for the reason you say -
>>>>>>>> you may trust your leader, but I may have a different leader and your
>>>>>>>> leader may not detect it yet and still think its the leader.
>>>>>>>> 
>>>>>>>> This seems like a bug to me.
>>>>>>>> 
>>>>>>>> Similarly to Paxos, Zookeeper's safety guarantees don't (or shouldn't)
>>>>>>>> depend on timing assumption.
>>>>>>>> Only progress guarantees depend on time.
>>>>>>>> 
>>>>>>>> Alex
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Wed, Sep 26, 2012 at 4:41 PM, John Carrino <jo...@gmail.com> wrote:
>>>>>>>>> I have some pretty strong requirements in terms of consistency where
>>>>>>>>> reading from followers that may be behind in terms of updates isn't ok for
>>>>>>>>> my use case.
>>>>>>>>> 
>>>>>>>>> One error case that worries me is if a follower and leader are partitioned
>>>>>>>>> off from the network.  A new leader is elected, but the follower and old
>>>>>>>>> leader don't know about it.
>>>>>>>>> 
>>>>>>>>> Normally I think sync was made for this purpost, but I looked at the sync
>>>>>>>>> code and if there aren't any outstanding proposals the leader sends the
>>>>>>>>> sync right back to the client without first verifying that it still has
>>>>>>>>> quorum, so this won't work for my use case.
>>>>>>>>> 
>>>>>>>>> At the core of the issue all I really need is a call that will make it's
>>>>>>>>> way to the leader and will ping it's followers, ensure it still has a
>>>>>>>>> quorum and return success.
>>>>>>>>> 
>>>>>>>>> Basically a getCurrentLeaderEpoch() method that will be forwarded to the
>>>>>>>>> leader, leader will ensure it still has quorum and return it's epoch.  I
>>>>>>>>> can use this primitive to implement all the other properties I want to
>>>>>>>>> verify (assuming that my client will never connect to an older epoch after
>>>>>>>>> this call returns). Also the nice thing about this method is that it will
>>>>>>>>> not have to hit disk and the latency should just be a round trip to the
>>>>>>>>> followers.
>>>>>>>>> 
>>>>>>>>> Most of the guarentees offered by zookeeper are time based an rely on
>>>>>>>>> clocks and expiring timers, but I'm hoping to offer some guarantees in
>>>>>>>>> spite of busted clocks, horrible GC perf, VM suspends and any other way
>>>>>>>>> time is broken.
>>>>>>>>> 
>>>>>>>>> Also if people are interested I can go into more detail about what I am
>>>>>>>>> trying to write.
>>>>>>>>> 
>>>>>>>>> -jc
>>>>>>> 
>>>>> 
>>> 


Re: Ensure Leader hasn't changed

Posted by Alexander Shraer <sh...@gmail.com>.
another idea is to add this functionality to MultiOp - have read only
transactions be replicated but not logged or logged asynchronously.
I'm not sure how it works right now if I do a read-only MultiOp
transaction - does it replicate the transaction or answer it locally
on the leader ?

Alex

On Thu, Sep 27, 2012 at 8:07 AM, Alexander Shraer <sh...@gmail.com> wrote:
> Thanks for the explanation.
>
> I guess one could always invoke a write operation instead of sync to
> get the more strict semantics, but as John suggests, it might be a
> good idea to add a new type of operation that requires followers to
> ack but doesn't require them to log to disk - this seems sufficient in
> our case.
>
> Alex
>
> On Thu, Sep 27, 2012 at 3:56 AM, Flavio Junqueira <fp...@yahoo-inc.com> wrote:
>> In theory, the scenario you're describing could happen, but I would argue that it is unlikely given that: 1) a leader pings followers twice a tick to make sure that it has a quorum of supporters (lead()); 2) followers give up on a leader upon catching an exception (followLeader()). One could calibrate tickTime to make the probability of having this scenario low.
>>
>> Let me also revisit the motivation for the way we designed sync. ZooKeeper has been designed to serve reads efficiently and making sync go through the pipeline would slow down reads. Although optional, we thought it would be a good idea to make it as efficient as possible to comply with the original expectations for the service. We consequently came up with this cheap way of making sure that a read sees all pending updates. It is correct that there are some corner cases that it doesn't cover. One is the case you mentioned. Another is having the sync finishing before the client submits the read and having a write committing in between. We rely upon the way we implement timeouts and some minimum degree of synchrony for the clients when submitting operations to guarantee that the scheme work.
>>
>> We thought about the option of having the sync operation going through the pipeline, and in fact it would have been easier to implement it just as a regular write, but we opted not to because we felt it was sufficient for the use cases we had and more efficient as I already argued.
>>
>> Hope it helps to clarify.
>>
>> -Flavio
>>
>> On Sep 27, 2012, at 9:38 AM, Alexander Shraer wrote:
>>
>>> thanks for the explanation! but how do you avoid having the scenario
>>> raised by John ?
>>> lets say you're a client connected to F, and F is connected to L. Lets
>>> also say that L's pipeline
>>> is now empty, and both F and L are partitioned from 3 other servers in
>>> the system that have already
>>> elected a new leader L'. Now I go to L' and write something. L still
>>> thinks its the leader because the
>>> detection that followers left it is obviously timeout dependent. So
>>> when F sends your sync to L and L returns
>>> it to F, you actually miss my write!
>>>
>>> Alex
>>>
>>> On Thu, Sep 27, 2012 at 12:32 AM, Flavio Junqueira <fp...@yahoo-inc.com> wrote:
>>>> Hi Alex, Because of the following:
>>>>
>>>> 1- A follower F processes operations from a client in FIFO order, and say that a client submits as you say sync + read;
>>>> 2- A sync will be processed by the leader and returned to the follower. It will be queued after all pending updates that the follower hasn't processed;
>>>> 3- The follower will process all pending updates before processing the response of the sync;
>>>> 4- Once the follower processes the sync, it picks the read operation to process. It reads the local state of the follower and returns to the client.
>>>>
>>>> When we process the read in Step 4, we have applied all pending updates the leader had for the follower by the time the read request started.
>>>>
>>>> This implementation is a bit of a hack because it doesn't follow the same code path as the other operations that go to the leader, but it avoids some unnecessary steps, which is important for fast reads. In the sync case, the other followers don't really need to know about it (there is nothing to be updated) and the leader simply inserts it in the sequence of updates of F, ordering it.
>>>>
>>>> -Flavio
>>>>
>>>> On Sep 27, 2012, at 9:12 AM, Alexander Shraer wrote:
>>>>
>>>>> Hi Flavio,
>>>>>
>>>>>> Starting a read operation concurrently with a sync implies that the result of the read will not miss an update committed before the read started.
>>>>>
>>>>> I thought that the intention of sync was to give something like
>>>>> linearizable reads, so if you invoke a sync and then a read, your read
>>>>> is guaranteed to (at least) see any write which completed before the
>>>>> sync began. Is this the intention ? If so, how is this achieved
>>>>> without running agreement on the sync op ?
>>>>>
>>>>> Thanks,
>>>>> Alex
>>>>>
>>>>> On Thu, Sep 27, 2012 at 12:05 AM, Flavio Junqueira <fp...@yahoo-inc.com> wrote:
>>>>>> sync simply flushes the channel between the leader and the follower that forwarded the sync operation, so it doesn't go through the full zab pipeline. Flushing means that all pending updates from the leader to the follower are received by the time sync completes. Starting a read operation concurrently with a sync implies that the result of the read will not miss an update committed before the read started.
>>>>>>
>>>>>> -Flavio
>>>>>>
>>>>>> On Sep 27, 2012, at 3:43 AM, Alexander Shraer wrote:
>>>>>>
>>>>>>> Its strange that sync doesn't run through agreement, I was always
>>>>>>> assuming that it is... Exactly for the reason you say -
>>>>>>> you may trust your leader, but I may have a different leader and your
>>>>>>> leader may not detect it yet and still think its the leader.
>>>>>>>
>>>>>>> This seems like a bug to me.
>>>>>>>
>>>>>>> Similarly to Paxos, Zookeeper's safety guarantees don't (or shouldn't)
>>>>>>> depend on timing assumption.
>>>>>>> Only progress guarantees depend on time.
>>>>>>>
>>>>>>> Alex
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Sep 26, 2012 at 4:41 PM, John Carrino <jo...@gmail.com> wrote:
>>>>>>>> I have some pretty strong requirements in terms of consistency where
>>>>>>>> reading from followers that may be behind in terms of updates isn't ok for
>>>>>>>> my use case.
>>>>>>>>
>>>>>>>> One error case that worries me is if a follower and leader are partitioned
>>>>>>>> off from the network.  A new leader is elected, but the follower and old
>>>>>>>> leader don't know about it.
>>>>>>>>
>>>>>>>> Normally I think sync was made for this purpost, but I looked at the sync
>>>>>>>> code and if there aren't any outstanding proposals the leader sends the
>>>>>>>> sync right back to the client without first verifying that it still has
>>>>>>>> quorum, so this won't work for my use case.
>>>>>>>>
>>>>>>>> At the core of the issue all I really need is a call that will make it's
>>>>>>>> way to the leader and will ping it's followers, ensure it still has a
>>>>>>>> quorum and return success.
>>>>>>>>
>>>>>>>> Basically a getCurrentLeaderEpoch() method that will be forwarded to the
>>>>>>>> leader, leader will ensure it still has quorum and return it's epoch.  I
>>>>>>>> can use this primitive to implement all the other properties I want to
>>>>>>>> verify (assuming that my client will never connect to an older epoch after
>>>>>>>> this call returns). Also the nice thing about this method is that it will
>>>>>>>> not have to hit disk and the latency should just be a round trip to the
>>>>>>>> followers.
>>>>>>>>
>>>>>>>> Most of the guarentees offered by zookeeper are time based an rely on
>>>>>>>> clocks and expiring timers, but I'm hoping to offer some guarantees in
>>>>>>>> spite of busted clocks, horrible GC perf, VM suspends and any other way
>>>>>>>> time is broken.
>>>>>>>>
>>>>>>>> Also if people are interested I can go into more detail about what I am
>>>>>>>> trying to write.
>>>>>>>>
>>>>>>>> -jc
>>>>>>
>>>>
>>

Re: Ensure Leader hasn't changed

Posted by Alexander Shraer <sh...@gmail.com>.
Thanks for the explanation.

I guess one could always invoke a write operation instead of sync to
get the more strict semantics, but as John suggests, it might be a
good idea to add a new type of operation that requires followers to
ack but doesn't require them to log to disk - this seems sufficient in
our case.

Alex

On Thu, Sep 27, 2012 at 3:56 AM, Flavio Junqueira <fp...@yahoo-inc.com> wrote:
> In theory, the scenario you're describing could happen, but I would argue that it is unlikely given that: 1) a leader pings followers twice a tick to make sure that it has a quorum of supporters (lead()); 2) followers give up on a leader upon catching an exception (followLeader()). One could calibrate tickTime to make the probability of having this scenario low.
>
> Let me also revisit the motivation for the way we designed sync. ZooKeeper has been designed to serve reads efficiently and making sync go through the pipeline would slow down reads. Although optional, we thought it would be a good idea to make it as efficient as possible to comply with the original expectations for the service. We consequently came up with this cheap way of making sure that a read sees all pending updates. It is correct that there are some corner cases that it doesn't cover. One is the case you mentioned. Another is having the sync finishing before the client submits the read and having a write committing in between. We rely upon the way we implement timeouts and some minimum degree of synchrony for the clients when submitting operations to guarantee that the scheme work.
>
> We thought about the option of having the sync operation going through the pipeline, and in fact it would have been easier to implement it just as a regular write, but we opted not to because we felt it was sufficient for the use cases we had and more efficient as I already argued.
>
> Hope it helps to clarify.
>
> -Flavio
>
> On Sep 27, 2012, at 9:38 AM, Alexander Shraer wrote:
>
>> thanks for the explanation! but how do you avoid having the scenario
>> raised by John ?
>> lets say you're a client connected to F, and F is connected to L. Lets
>> also say that L's pipeline
>> is now empty, and both F and L are partitioned from 3 other servers in
>> the system that have already
>> elected a new leader L'. Now I go to L' and write something. L still
>> thinks its the leader because the
>> detection that followers left it is obviously timeout dependent. So
>> when F sends your sync to L and L returns
>> it to F, you actually miss my write!
>>
>> Alex
>>
>> On Thu, Sep 27, 2012 at 12:32 AM, Flavio Junqueira <fp...@yahoo-inc.com> wrote:
>>> Hi Alex, Because of the following:
>>>
>>> 1- A follower F processes operations from a client in FIFO order, and say that a client submits as you say sync + read;
>>> 2- A sync will be processed by the leader and returned to the follower. It will be queued after all pending updates that the follower hasn't processed;
>>> 3- The follower will process all pending updates before processing the response of the sync;
>>> 4- Once the follower processes the sync, it picks the read operation to process. It reads the local state of the follower and returns to the client.
>>>
>>> When we process the read in Step 4, we have applied all pending updates the leader had for the follower by the time the read request started.
>>>
>>> This implementation is a bit of a hack because it doesn't follow the same code path as the other operations that go to the leader, but it avoids some unnecessary steps, which is important for fast reads. In the sync case, the other followers don't really need to know about it (there is nothing to be updated) and the leader simply inserts it in the sequence of updates of F, ordering it.
>>>
>>> -Flavio
>>>
>>> On Sep 27, 2012, at 9:12 AM, Alexander Shraer wrote:
>>>
>>>> Hi Flavio,
>>>>
>>>>> Starting a read operation concurrently with a sync implies that the result of the read will not miss an update committed before the read started.
>>>>
>>>> I thought that the intention of sync was to give something like
>>>> linearizable reads, so if you invoke a sync and then a read, your read
>>>> is guaranteed to (at least) see any write which completed before the
>>>> sync began. Is this the intention ? If so, how is this achieved
>>>> without running agreement on the sync op ?
>>>>
>>>> Thanks,
>>>> Alex
>>>>
>>>> On Thu, Sep 27, 2012 at 12:05 AM, Flavio Junqueira <fp...@yahoo-inc.com> wrote:
>>>>> sync simply flushes the channel between the leader and the follower that forwarded the sync operation, so it doesn't go through the full zab pipeline. Flushing means that all pending updates from the leader to the follower are received by the time sync completes. Starting a read operation concurrently with a sync implies that the result of the read will not miss an update committed before the read started.
>>>>>
>>>>> -Flavio
>>>>>
>>>>> On Sep 27, 2012, at 3:43 AM, Alexander Shraer wrote:
>>>>>
>>>>>> Its strange that sync doesn't run through agreement, I was always
>>>>>> assuming that it is... Exactly for the reason you say -
>>>>>> you may trust your leader, but I may have a different leader and your
>>>>>> leader may not detect it yet and still think its the leader.
>>>>>>
>>>>>> This seems like a bug to me.
>>>>>>
>>>>>> Similarly to Paxos, Zookeeper's safety guarantees don't (or shouldn't)
>>>>>> depend on timing assumption.
>>>>>> Only progress guarantees depend on time.
>>>>>>
>>>>>> Alex
>>>>>>
>>>>>>
>>>>>> On Wed, Sep 26, 2012 at 4:41 PM, John Carrino <jo...@gmail.com> wrote:
>>>>>>> I have some pretty strong requirements in terms of consistency where
>>>>>>> reading from followers that may be behind in terms of updates isn't ok for
>>>>>>> my use case.
>>>>>>>
>>>>>>> One error case that worries me is if a follower and leader are partitioned
>>>>>>> off from the network.  A new leader is elected, but the follower and old
>>>>>>> leader don't know about it.
>>>>>>>
>>>>>>> Normally I think sync was made for this purpost, but I looked at the sync
>>>>>>> code and if there aren't any outstanding proposals the leader sends the
>>>>>>> sync right back to the client without first verifying that it still has
>>>>>>> quorum, so this won't work for my use case.
>>>>>>>
>>>>>>> At the core of the issue all I really need is a call that will make it's
>>>>>>> way to the leader and will ping it's followers, ensure it still has a
>>>>>>> quorum and return success.
>>>>>>>
>>>>>>> Basically a getCurrentLeaderEpoch() method that will be forwarded to the
>>>>>>> leader, leader will ensure it still has quorum and return it's epoch.  I
>>>>>>> can use this primitive to implement all the other properties I want to
>>>>>>> verify (assuming that my client will never connect to an older epoch after
>>>>>>> this call returns). Also the nice thing about this method is that it will
>>>>>>> not have to hit disk and the latency should just be a round trip to the
>>>>>>> followers.
>>>>>>>
>>>>>>> Most of the guarentees offered by zookeeper are time based an rely on
>>>>>>> clocks and expiring timers, but I'm hoping to offer some guarantees in
>>>>>>> spite of busted clocks, horrible GC perf, VM suspends and any other way
>>>>>>> time is broken.
>>>>>>>
>>>>>>> Also if people are interested I can go into more detail about what I am
>>>>>>> trying to write.
>>>>>>>
>>>>>>> -jc
>>>>>
>>>
>

Re: Ensure Leader hasn't changed

Posted by Flavio Junqueira <fp...@yahoo-inc.com>.
In theory, the scenario you're describing could happen, but I would argue that it is unlikely given that: 1) a leader pings followers twice a tick to make sure that it has a quorum of supporters (lead()); 2) followers give up on a leader upon catching an exception (followLeader()). One could calibrate tickTime to make the probability of having this scenario low.

Let me also revisit the motivation for the way we designed sync. ZooKeeper has been designed to serve reads efficiently and making sync go through the pipeline would slow down reads. Although optional, we thought it would be a good idea to make it as efficient as possible to comply with the original expectations for the service. We consequently came up with this cheap way of making sure that a read sees all pending updates. It is correct that there are some corner cases that it doesn't cover. One is the case you mentioned. Another is having the sync finishing before the client submits the read and having a write committing in between. We rely upon the way we implement timeouts and some minimum degree of synchrony for the clients when submitting operations to guarantee that the scheme work. 

We thought about the option of having the sync operation going through the pipeline, and in fact it would have been easier to implement it just as a regular write, but we opted not to because we felt it was sufficient for the use cases we had and more efficient as I already argued.

Hope it helps to clarify.

-Flavio
 
On Sep 27, 2012, at 9:38 AM, Alexander Shraer wrote:

> thanks for the explanation! but how do you avoid having the scenario
> raised by John ?
> lets say you're a client connected to F, and F is connected to L. Lets
> also say that L's pipeline
> is now empty, and both F and L are partitioned from 3 other servers in
> the system that have already
> elected a new leader L'. Now I go to L' and write something. L still
> thinks its the leader because the
> detection that followers left it is obviously timeout dependent. So
> when F sends your sync to L and L returns
> it to F, you actually miss my write!
> 
> Alex
> 
> On Thu, Sep 27, 2012 at 12:32 AM, Flavio Junqueira <fp...@yahoo-inc.com> wrote:
>> Hi Alex, Because of the following:
>> 
>> 1- A follower F processes operations from a client in FIFO order, and say that a client submits as you say sync + read;
>> 2- A sync will be processed by the leader and returned to the follower. It will be queued after all pending updates that the follower hasn't processed;
>> 3- The follower will process all pending updates before processing the response of the sync;
>> 4- Once the follower processes the sync, it picks the read operation to process. It reads the local state of the follower and returns to the client.
>> 
>> When we process the read in Step 4, we have applied all pending updates the leader had for the follower by the time the read request started.
>> 
>> This implementation is a bit of a hack because it doesn't follow the same code path as the other operations that go to the leader, but it avoids some unnecessary steps, which is important for fast reads. In the sync case, the other followers don't really need to know about it (there is nothing to be updated) and the leader simply inserts it in the sequence of updates of F, ordering it.
>> 
>> -Flavio
>> 
>> On Sep 27, 2012, at 9:12 AM, Alexander Shraer wrote:
>> 
>>> Hi Flavio,
>>> 
>>>> Starting a read operation concurrently with a sync implies that the result of the read will not miss an update committed before the read started.
>>> 
>>> I thought that the intention of sync was to give something like
>>> linearizable reads, so if you invoke a sync and then a read, your read
>>> is guaranteed to (at least) see any write which completed before the
>>> sync began. Is this the intention ? If so, how is this achieved
>>> without running agreement on the sync op ?
>>> 
>>> Thanks,
>>> Alex
>>> 
>>> On Thu, Sep 27, 2012 at 12:05 AM, Flavio Junqueira <fp...@yahoo-inc.com> wrote:
>>>> sync simply flushes the channel between the leader and the follower that forwarded the sync operation, so it doesn't go through the full zab pipeline. Flushing means that all pending updates from the leader to the follower are received by the time sync completes. Starting a read operation concurrently with a sync implies that the result of the read will not miss an update committed before the read started.
>>>> 
>>>> -Flavio
>>>> 
>>>> On Sep 27, 2012, at 3:43 AM, Alexander Shraer wrote:
>>>> 
>>>>> Its strange that sync doesn't run through agreement, I was always
>>>>> assuming that it is... Exactly for the reason you say -
>>>>> you may trust your leader, but I may have a different leader and your
>>>>> leader may not detect it yet and still think its the leader.
>>>>> 
>>>>> This seems like a bug to me.
>>>>> 
>>>>> Similarly to Paxos, Zookeeper's safety guarantees don't (or shouldn't)
>>>>> depend on timing assumption.
>>>>> Only progress guarantees depend on time.
>>>>> 
>>>>> Alex
>>>>> 
>>>>> 
>>>>> On Wed, Sep 26, 2012 at 4:41 PM, John Carrino <jo...@gmail.com> wrote:
>>>>>> I have some pretty strong requirements in terms of consistency where
>>>>>> reading from followers that may be behind in terms of updates isn't ok for
>>>>>> my use case.
>>>>>> 
>>>>>> One error case that worries me is if a follower and leader are partitioned
>>>>>> off from the network.  A new leader is elected, but the follower and old
>>>>>> leader don't know about it.
>>>>>> 
>>>>>> Normally I think sync was made for this purpost, but I looked at the sync
>>>>>> code and if there aren't any outstanding proposals the leader sends the
>>>>>> sync right back to the client without first verifying that it still has
>>>>>> quorum, so this won't work for my use case.
>>>>>> 
>>>>>> At the core of the issue all I really need is a call that will make it's
>>>>>> way to the leader and will ping it's followers, ensure it still has a
>>>>>> quorum and return success.
>>>>>> 
>>>>>> Basically a getCurrentLeaderEpoch() method that will be forwarded to the
>>>>>> leader, leader will ensure it still has quorum and return it's epoch.  I
>>>>>> can use this primitive to implement all the other properties I want to
>>>>>> verify (assuming that my client will never connect to an older epoch after
>>>>>> this call returns). Also the nice thing about this method is that it will
>>>>>> not have to hit disk and the latency should just be a round trip to the
>>>>>> followers.
>>>>>> 
>>>>>> Most of the guarentees offered by zookeeper are time based an rely on
>>>>>> clocks and expiring timers, but I'm hoping to offer some guarantees in
>>>>>> spite of busted clocks, horrible GC perf, VM suspends and any other way
>>>>>> time is broken.
>>>>>> 
>>>>>> Also if people are interested I can go into more detail about what I am
>>>>>> trying to write.
>>>>>> 
>>>>>> -jc
>>>> 
>> 


Re: Ensure Leader hasn't changed

Posted by Alexander Shraer <sh...@gmail.com>.
thanks for the explanation! but how do you avoid having the scenario
raised by John ?
lets say you're a client connected to F, and F is connected to L. Lets
also say that L's pipeline
is now empty, and both F and L are partitioned from 3 other servers in
the system that have already
elected a new leader L'. Now I go to L' and write something. L still
thinks its the leader because the
detection that followers left it is obviously timeout dependent. So
when F sends your sync to L and L returns
it to F, you actually miss my write!

Alex

On Thu, Sep 27, 2012 at 12:32 AM, Flavio Junqueira <fp...@yahoo-inc.com> wrote:
> Hi Alex, Because of the following:
>
> 1- A follower F processes operations from a client in FIFO order, and say that a client submits as you say sync + read;
> 2- A sync will be processed by the leader and returned to the follower. It will be queued after all pending updates that the follower hasn't processed;
> 3- The follower will process all pending updates before processing the response of the sync;
> 4- Once the follower processes the sync, it picks the read operation to process. It reads the local state of the follower and returns to the client.
>
> When we process the read in Step 4, we have applied all pending updates the leader had for the follower by the time the read request started.
>
> This implementation is a bit of a hack because it doesn't follow the same code path as the other operations that go to the leader, but it avoids some unnecessary steps, which is important for fast reads. In the sync case, the other followers don't really need to know about it (there is nothing to be updated) and the leader simply inserts it in the sequence of updates of F, ordering it.
>
> -Flavio
>
> On Sep 27, 2012, at 9:12 AM, Alexander Shraer wrote:
>
>> Hi Flavio,
>>
>>> Starting a read operation concurrently with a sync implies that the result of the read will not miss an update committed before the read started.
>>
>> I thought that the intention of sync was to give something like
>> linearizable reads, so if you invoke a sync and then a read, your read
>> is guaranteed to (at least) see any write which completed before the
>> sync began. Is this the intention ? If so, how is this achieved
>> without running agreement on the sync op ?
>>
>> Thanks,
>> Alex
>>
>> On Thu, Sep 27, 2012 at 12:05 AM, Flavio Junqueira <fp...@yahoo-inc.com> wrote:
>>> sync simply flushes the channel between the leader and the follower that forwarded the sync operation, so it doesn't go through the full zab pipeline. Flushing means that all pending updates from the leader to the follower are received by the time sync completes. Starting a read operation concurrently with a sync implies that the result of the read will not miss an update committed before the read started.
>>>
>>> -Flavio
>>>
>>> On Sep 27, 2012, at 3:43 AM, Alexander Shraer wrote:
>>>
>>>> It's strange that sync doesn't run through agreement; I always
>>>> assumed it did... Exactly for the reason you say -
>>>> you may trust your leader, but I may have a different leader and your
>>>> leader may not detect it yet and still think it's the leader.
>>>>
>>>> This seems like a bug to me.
>>>>
>>>> Similarly to Paxos, ZooKeeper's safety guarantees don't (or shouldn't)
>>>> depend on timing assumptions.
>>>> Only progress guarantees depend on time.
>>>>
>>>> Alex
>>>>
>>>>
>>>> On Wed, Sep 26, 2012 at 4:41 PM, John Carrino <jo...@gmail.com> wrote:
>>>>> I have some pretty strong requirements in terms of consistency where
>>>>> reading from followers that may be behind in terms of updates isn't ok for
>>>>> my use case.
>>>>>
>>>>> One error case that worries me is if a follower and leader are partitioned
>>>>> off from the network.  A new leader is elected, but the follower and old
>>>>> leader don't know about it.
>>>>>
>>>>> Normally I think sync was made for this purpose, but I looked at the sync
>>>>> code and if there aren't any outstanding proposals the leader sends the
>>>>> sync right back to the client without first verifying that it still has
>>>>> quorum, so this won't work for my use case.
>>>>>
>>>>> At the core of the issue, all I really need is a call that will make its
>>>>> way to the leader and will ping its followers, ensure it still has a
>>>>> quorum and return success.
>>>>>
>>>>> Basically a getCurrentLeaderEpoch() method that will be forwarded to the
>>>>> leader; the leader will ensure it still has quorum and return its epoch.  I
>>>>> can use this primitive to implement all the other properties I want to
>>>>> verify (assuming that my client will never connect to an older epoch after
>>>>> this call returns). Also the nice thing about this method is that it will
>>>>> not have to hit disk and the latency should just be a round trip to the
>>>>> followers.
>>>>>
>>>>> Most of the guarantees offered by ZooKeeper are time-based and rely on
>>>>> clocks and expiring timers, but I'm hoping to offer some guarantees in
>>>>> spite of busted clocks, horrible GC perf, VM suspends and any other way
>>>>> time is broken.
>>>>>
>>>>> Also if people are interested I can go into more detail about what I am
>>>>> trying to write.
>>>>>
>>>>> -jc
>>>
>

Re: Ensure Leader hasn't changed

Posted by Flavio Junqueira <fp...@yahoo-inc.com>.
Hi Alex. Because of the following:

1- A follower F processes operations from a client in FIFO order, and say that a client submits, as you say, sync + read;
2- A sync will be processed by the leader and returned to the follower. It will be queued after all pending updates that the follower hasn't processed;
3- The follower will process all pending updates before processing the response of the sync;
4- Once the follower processes the sync, it picks the read operation to process. It reads the local state of the follower and returns to the client.

When we process the read in Step 4, we have applied all pending updates the leader had for the follower by the time the read request started. 

This implementation is a bit of a hack because it doesn't follow the same code path as the other operations that go to the leader, but it avoids some unnecessary steps, which is important for fast reads. In the sync case, the other followers don't really need to know about it (there is nothing to be updated) and the leader simply inserts it in the sequence of updates of F, ordering it.
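
As a toy model of steps 1-4 (an illustration only, not ZooKeeper's
actual classes): because the follower drains a single FIFO pipeline, a
read handed over after the sync response cannot run before the updates
that were queued ahead of it:

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Toy model: a follower applying its pipeline strictly in FIFO order.
    class FollowerPipelineModel {
        private final Deque<Runnable> pipeline = new ArrayDeque<Runnable>();
        long lastAppliedZxid = 0;

        // A pending update from the leader, queued ahead of any later sync.
        void commitFromLeader(final long zxid) {
            pipeline.add(new Runnable() {
                public void run() { lastAppliedZxid = zxid; }
            });
        }

        // Step 2: the sync response is queued behind the pending updates.
        void syncResponseFromLeader(Runnable readAfterSync) {
            pipeline.add(readAfterSync);
        }

        // Steps 3-4: drain in order; the read runs only after every update
        // queued before the sync response has been applied.
        void drain() {
            while (!pipeline.isEmpty()) {
                pipeline.poll().run();
            }
        }
    }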

-Flavio

On Sep 27, 2012, at 9:12 AM, Alexander Shraer wrote:

> Hi Flavio,
> 
>> Starting a read operation concurrently with a sync implies that the result of the read will not miss an update committed before the read started.
> 
> I thought that the intention of sync was to give something like
> linearizable reads, so if you invoke a sync and then a read, your read
> is guaranteed to (at least) see any write which completed before the
> sync began. Is this the intention? If so, how is this achieved
> without running agreement on the sync op?
> 
> Thanks,
> Alex
> 
> On Thu, Sep 27, 2012 at 12:05 AM, Flavio Junqueira <fp...@yahoo-inc.com> wrote:
>> sync simply flushes the channel between the leader and the follower that forwarded the sync operation, so it doesn't go through the full Zab pipeline. Flushing means that all pending updates from the leader to the follower are received by the time sync completes. Starting a read operation concurrently with a sync implies that the result of the read will not miss an update committed before the read started.
>> 
>> -Flavio
>> 
>> On Sep 27, 2012, at 3:43 AM, Alexander Shraer wrote:
>> 
>>> It's strange that sync doesn't run through agreement; I always
>>> assumed it did... Exactly for the reason you say -
>>> you may trust your leader, but I may have a different leader and your
>>> leader may not detect it yet and still think it's the leader.
>>> 
>>> This seems like a bug to me.
>>> 
>>> Similarly to Paxos, ZooKeeper's safety guarantees don't (or shouldn't)
>>> depend on timing assumptions.
>>> Only progress guarantees depend on time.
>>> 
>>> Alex
>>> 
>>> 
>>> On Wed, Sep 26, 2012 at 4:41 PM, John Carrino <jo...@gmail.com> wrote:
>>>> I have some pretty strong requirements in terms of consistency where
>>>> reading from followers that may be behind in terms of updates isn't ok for
>>>> my use case.
>>>> 
>>>> One error case that worries me is if a follower and leader are partitioned
>>>> off from the network.  A new leader is elected, but the follower and old
>>>> leader don't know about it.
>>>> 
>>>> Normally I think sync was made for this purpose, but I looked at the sync
>>>> code and if there aren't any outstanding proposals the leader sends the
>>>> sync right back to the client without first verifying that it still has
>>>> quorum, so this won't work for my use case.
>>>> 
>>>> At the core of the issue, all I really need is a call that will make its
>>>> way to the leader and will ping its followers, ensure it still has a
>>>> quorum and return success.
>>>> 
>>>> Basically a getCurrentLeaderEpoch() method that will be forwarded to the
>>>> leader; the leader will ensure it still has quorum and return its epoch.  I
>>>> can use this primitive to implement all the other properties I want to
>>>> verify (assuming that my client will never connect to an older epoch after
>>>> this call returns). Also the nice thing about this method is that it will
>>>> not have to hit disk and the latency should just be a round trip to the
>>>> followers.
>>>> 
>>>> Most of the guarantees offered by ZooKeeper are time-based and rely on
>>>> clocks and expiring timers, but I'm hoping to offer some guarantees in
>>>> spite of busted clocks, horrible GC perf, VM suspends and any other way
>>>> time is broken.
>>>> 
>>>> Also if people are interested I can go into more detail about what I am
>>>> trying to write.
>>>> 
>>>> -jc
>> 


Re: Ensure Leader hasn't changed

Posted by Alexander Shraer <sh...@gmail.com>.
Hi Flavio,

> Starting a read operation concurrently with a sync implies that the result of the read will not miss an update committed before the read started.

I thought that the intention of sync was to give something like
linearizable reads, so if you invoke a sync and then a read, your read
is guaranteed to (at least) see any write which completed before the
sync began. Is this the intention? If so, how is this achieved
without running agreement on the sync op?

Thanks,
Alex

On Thu, Sep 27, 2012 at 12:05 AM, Flavio Junqueira <fp...@yahoo-inc.com> wrote:
> sync simply flushes the channel between the leader and the follower that forwarded the sync operation, so it doesn't go through the full Zab pipeline. Flushing means that all pending updates from the leader to the follower are received by the time sync completes. Starting a read operation concurrently with a sync implies that the result of the read will not miss an update committed before the read started.
>
> -Flavio
>
> On Sep 27, 2012, at 3:43 AM, Alexander Shraer wrote:
>
>> It's strange that sync doesn't run through agreement; I always
>> assumed it did... Exactly for the reason you say -
>> you may trust your leader, but I may have a different leader and your
>> leader may not detect it yet and still think it's the leader.
>>
>> This seems like a bug to me.
>>
>> Similarly to Paxos, ZooKeeper's safety guarantees don't (or shouldn't)
>> depend on timing assumptions.
>> Only progress guarantees depend on time.
>>
>> Alex
>>
>>
>> On Wed, Sep 26, 2012 at 4:41 PM, John Carrino <jo...@gmail.com> wrote:
>>> I have some pretty strong requirements in terms of consistency where
>>> reading from followers that may be behind in terms of updates isn't ok for
>>> my use case.
>>>
>>> One error case that worries me is if a follower and leader are partitioned
>>> off from the network.  A new leader is elected, but the follower and old
>>> leader don't know about it.
>>>
>>> Normally I think sync was made for this purpose, but I looked at the sync
>>> code and if there aren't any outstanding proposals the leader sends the
>>> sync right back to the client without first verifying that it still has
>>> quorum, so this won't work for my use case.
>>>
>>> At the core of the issue, all I really need is a call that will make its
>>> way to the leader and will ping its followers, ensure it still has a
>>> quorum and return success.
>>>
>>> Basically a getCurrentLeaderEpoch() method that will be forwarded to the
>>> leader; the leader will ensure it still has quorum and return its epoch.  I
>>> can use this primitive to implement all the other properties I want to
>>> verify (assuming that my client will never connect to an older epoch after
>>> this call returns). Also the nice thing about this method is that it will
>>> not have to hit disk and the latency should just be a round trip to the
>>> followers.
>>>
>>> Most of the guarantees offered by ZooKeeper are time-based and rely on
>>> clocks and expiring timers, but I'm hoping to offer some guarantees in
>>> spite of busted clocks, horrible GC perf, VM suspends and any other way
>>> time is broken.
>>>
>>> Also if people are interested I can go into more detail about what I am
>>> trying to write.
>>>
>>> -jc
>

Re: Ensure Leader hasn't changed

Posted by Flavio Junqueira <fp...@yahoo-inc.com>.
sync simply flushes the channel between the leader and the follower that forwarded the sync operation, so it doesn't go through the full Zab pipeline. Flushing means that all pending updates from the leader to the follower are received by the time sync completes. Starting a read operation concurrently with a sync implies that the result of the read will not miss an update committed before the read started.

-Flavio

On Sep 27, 2012, at 3:43 AM, Alexander Shraer wrote:

> It's strange that sync doesn't run through agreement; I always
> assumed it did... Exactly for the reason you say -
> you may trust your leader, but I may have a different leader and your
> leader may not detect it yet and still think it's the leader.
> 
> This seems like a bug to me.
> 
> Similarly to Paxos, ZooKeeper's safety guarantees don't (or shouldn't)
> depend on timing assumptions.
> Only progress guarantees depend on time.
> 
> Alex
> 
> 
> On Wed, Sep 26, 2012 at 4:41 PM, John Carrino <jo...@gmail.com> wrote:
>> I have some pretty strong requirements in terms of consistency where
>> reading from followers that may be behind in terms of updates isn't ok for
>> my use case.
>> 
>> One error case that worries me is if a follower and leader are partitioned
>> off from the network.  A new leader is elected, but the follower and old
>> leader don't know about it.
>> 
>> Normally I think sync was made for this purpose, but I looked at the sync
>> code and if there aren't any outstanding proposals the leader sends the
>> sync right back to the client without first verifying that it still has
>> quorum, so this won't work for my use case.
>> 
>> At the core of the issue, all I really need is a call that will make its
>> way to the leader and will ping its followers, ensure it still has a
>> quorum and return success.
>> 
>> Basically a getCurrentLeaderEpoch() method that will be forwarded to the
>> leader; the leader will ensure it still has quorum and return its epoch.  I
>> can use this primitive to implement all the other properties I want to
>> verify (assuming that my client will never connect to an older epoch after
>> this call returns). Also the nice thing about this method is that it will
>> not have to hit disk and the latency should just be a round trip to the
>> followers.
>> 
>> Most of the guarantees offered by ZooKeeper are time-based and rely on
>> clocks and expiring timers, but I'm hoping to offer some guarantees in
>> spite of busted clocks, horrible GC perf, VM suspends and any other way
>> time is broken.
>> 
>> Also if people are interested I can go into more detail about what I am
>> trying to write.
>> 
>> -jc


Re: Ensure Leader hasn't changed

Posted by Alexander Shraer <sh...@gmail.com>.
It's strange that sync doesn't run through agreement; I always
assumed it did... Exactly for the reason you say -
you may trust your leader, but I may have a different leader and your
leader may not detect it yet and still think it's the leader.

This seems like a bug to me.

Similarly to Paxos, ZooKeeper's safety guarantees don't (or shouldn't)
depend on timing assumptions.
Only progress guarantees depend on time.

Alex


On Wed, Sep 26, 2012 at 4:41 PM, John Carrino <jo...@gmail.com> wrote:
> I have some pretty strong requirements in terms of consistency where
> reading from followers that may be behind in terms of updates isn't ok for
> my use case.
>
> One error case that worries me is if a follower and leader are partitioned
> off from the network.  A new leader is elected, but the follower and old
> leader don't know about it.
>
> Normally I think sync was made for this purpose, but I looked at the sync
> code and if there aren't any outstanding proposals the leader sends the
> sync right back to the client without first verifying that it still has
> quorum, so this won't work for my use case.
>
> At the core of the issue, all I really need is a call that will make its
> way to the leader and will ping its followers, ensure it still has a
> quorum and return success.
>
> Basically a getCurrentLeaderEpoch() method that will be forwarded to the
> leader; the leader will ensure it still has quorum and return its epoch.  I
> can use this primitive to implement all the other properties I want to
> verify (assuming that my client will never connect to an older epoch after
> this call returns). Also the nice thing about this method is that it will
> not have to hit disk and the latency should just be a round trip to the
> followers.
>
> Most of the guarantees offered by ZooKeeper are time-based and rely on
> clocks and expiring timers, but I'm hoping to offer some guarantees in
> spite of busted clocks, horrible GC perf, VM suspends and any other way
> time is broken.
>
> Also if people are interested I can go into more detail about what I am
> trying to write.
>
> -jc
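
To make the proposal concrete, a rough sketch of how such a primitive
might be used for fencing (getCurrentLeaderEpoch() and QuorumClient are
hypothetical; neither exists in ZooKeeper's API):

    // Hypothetical sketch of the proposed primitive, not a real API.
    interface QuorumClient {
        // Forwarded to the leader, which pings its followers and returns
        // its epoch only after confirming it still holds a quorum.
        long getCurrentLeaderEpoch() throws Exception;
    }

    class EpochFencedMaster {
        private final QuorumClient client;
        private final long epochAtElection; // epoch seen when elected master

        EpochFencedMaster(QuorumClient client, long epochAtElection) {
            this.client = client;
            this.epochAtElection = epochAtElection;
        }

        // Happens-after check: run this after a request arrives. If the
        // epoch is unchanged, no newer leader held quorum when the check
        // ran, which is the guarantee asked for in this thread.
        boolean stillMaster() throws Exception {
            return client.getCurrentLeaderEpoch() == epochAtElection;
        }
    }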