
Posted to user@zookeeper.apache.org by Qing Yan <qi...@gmail.com> on 2010/01/27 03:20:49 UTC

Q about ZK internal: how commit is being remembered

Hi,

I have a question about how ZooKeeper *remembers* a commit operation.

According to
http://hadoop.apache.org/zookeeper/docs/r3.2.2/zookeeperInternals.html#sc_summary

<quote>


The leader will issue a COMMIT to all followers as soon as a quorum of
followers have ACKed a message. Since messages are ACKed in order, COMMITs
will be sent by the leader as received by the followers in order.

COMMITs are processed in order. Followers deliver a proposal's message when
that proposal is committed.
</quote>
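
The rule in the quoted documentation (COMMIT once a quorum has ACKed, COMMITs
issued in order) can be sketched as a toy model. This is only an illustration
under stated assumptions, not ZooKeeper's actual code: the class LeaderSketch,
its method names, and the choice to count the leader's own log write toward
the quorum are all hypothetical.

```python
from collections import OrderedDict


class LeaderSketch:
    """Toy model of a leader tracking ACKs; names are hypothetical."""

    def __init__(self, ensemble_size):
        self.quorum = ensemble_size // 2 + 1
        self.pending = OrderedDict()  # zxid -> ids of servers that logged it
        self.committed = []           # zxids, in the order COMMIT was issued

    def propose(self, zxid):
        # Assumption: the leader logs the proposal itself and counts as one ACK.
        self.pending[zxid] = {"leader"}

    def ack(self, zxid, follower_id):
        if zxid not in self.pending:
            return  # late ACK for a proposal that is already committed
        self.pending[zxid].add(follower_id)
        # Only the oldest pending proposal may commit, so COMMITs stay in order.
        while self.pending:
            oldest, ackers = next(iter(self.pending.items()))
            if len(ackers) < self.quorum:
                break
            del self.pending[oldest]
            self.committed.append(oldest)  # a real leader would send COMMIT here


leader = LeaderSketch(ensemble_size=5)
leader.propose(1)
leader.propose(2)
leader.ack(2, "f1")
leader.ack(2, "f2")  # zxid 2 has a quorum, but zxid 1 is still waiting
leader.ack(1, "f1")
leader.ack(1, "f2")  # now zxid 1 commits, then zxid 2 right behind it
```

Note that nothing in the sketch waits for followers to confirm the COMMIT
itself; the only thing counted is how many servers have logged the proposal.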

My question is: will the leader wait for the COMMIT to be processed by a
quorum of followers before considering the COMMIT a success? From the
documentation it seems that the leader handles COMMIT asynchronously and
doesn't expect confirmation from followers. In the extreme case, what happens
if the leader issues a COMMIT to all followers and crashes immediately, before
the COMMIT message can go out over the network? How does the system remember
that the COMMIT ever happened?

Actually this is related to the leader election process:

<quote>
ZooKeeper messaging doesn't care about the exact method of electing a leader
as long as the following holds:

   - The leader has seen the highest zxid of all the followers.
   - A quorum of servers have committed to following the leader.

Of these two requirements only the first, the highest zxid among the
followers, needs to hold for correct operation.

</quote>
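
The first requirement in the quote can be illustrated with a small sketch. It
is a simplification made for this example, not the real FastLeaderElection
logic: the function name elect_leader, comparing only the last logged zxid,
and breaking ties by the larger server id are all assumptions.

```python
def elect_leader(last_zxid_by_server, reachable, ensemble_size):
    """Pick the reachable server with the highest last-logged zxid.

    Returns None when fewer than a quorum of servers can be contacted,
    in which case no leader is elected at all.
    """
    quorum = ensemble_size // 2 + 1
    voters = [s for s in reachable if s in last_zxid_by_server]
    if len(voters) < quorum:
        return None
    # Assumption for this sketch: ties are broken by the larger server id.
    return max(voters, key=lambda s: (last_zxid_by_server[s], s))


# Five servers; zxid 10 was logged by servers 1, 2 and 3 (a quorum).
last_zxid = {1: 10, 2: 10, 3: 10, 4: 9, 5: 9}
print(elect_leader(last_zxid, {3, 4, 5}, 5))  # a quorum that includes server 3
print(elect_leader(last_zxid, {4, 5}, 5))     # no quorum, so no leader
```

Because any two quorums of the same ensemble share at least one server, a
quorum of voters always contains someone who logged a quorum-logged zxid.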

Is there a liveness issue in trying to ensure "The leader has seen the highest
zxid of all the followers"? What if some of the followers (which happen to
hold the highest zxid) cannot be contacted (the FLP impossibility result?)
It would be more straightforward if COMMIT required confirmation from a
quorum of the followers. But I guess things get optimized thanks to Zab's
FIFO nature... I just want to hear some clarification about it.

Thanks a lot!

Re: Q about ZK internal: how commit is being remembered

Posted by Qian Ye <ye...@gmail.com>.

Thanks Mahadev, I see what you mean.


On Fri, Jan 29, 2010 at 10:06 AM, Mahadev Konar <ma...@yahoo-inc.com> wrote:

> Qian,
>
>  ZooKeeper guarantees that if a client sees a transaction response, then
> that transaction will persist, but the ones that a client does not see might
> be discarded or committed. So in case a quorum does not log the transaction,
> there might be a case wherein a ZooKeeper server which does not have the
> logged transaction becomes the leader (because the machines with the logged
> transaction are down). In that case the transaction is discarded. In the case
> where a machine which has the logged transaction becomes the leader, that
> transaction will be committed.
>
> Hope that clears up your doubt.
>
> mahadev
>
>
> On 1/28/10 6:02 PM, "Qian Ye" <ye...@gmail.com> wrote:
>
> > Thanks Henry and Ben. Actually I have read the paper Henry mentioned in
> > this mail, but I'm still not so clear on some of the details. Anyway, maybe
> > more study of the source code can help my understanding. Since Ben said
> > that "if less than a quorum of servers have accepted a transaction, we can
> > commit or discard", would this feature cause any unexpected problems? Can
> > you give some hints about this issue?
> >
> >
> >
> > On Fri, Jan 29, 2010 at 1:09 AM, Benjamin Reed <br...@yahoo-inc.com> wrote:
> >
> >> henry is correct. just to state another way, Zab guarantees that if a
> >> quorum of servers have accepted a transaction, the transaction will commit.
> >> this means that if less than a quorum of servers have accepted a
> >> transaction, we can commit or discard. the only constraint we have in
> >> choosing is ordering. we have to decide which partially accepted
> >> transactions are going to be committed and which discarded before we
> >> propose any new messages so that ordering is preserved.
> >>
> >> ben
> >>
> >>
> >> Henry Robinson wrote:
> >>
> >>> Hi -
> >>>
> >>> Note that a machine that has the highest received zxid will necessarily
> >>> have seen the most recent transaction that was logged by a quorum of
> >>> followers (the FIFO property of TCP again ensures that all previous
> >>> messages will have been seen). This is the property that ZAB needs to
> >>> preserve. The idea is to avoid missing a commit that went to a node that
> >>> has since failed.
> >>>
> >>> I was therefore slightly imprecise in my previous mail - it's possible for
> >>> only partially-proposed proposals to be committed if the leader that is
> >>> elected next has seen them. Only when another proposal is committed
> >>> instead must the original proposal be discarded.
> >>>
> >>> I highly recommend Ben Reed's and Flavio Junqueira's LADIS paper on the
> >>> subject, for those with portal.acm.org access:
> >>> http://portal.acm.org/citation.cfm?id=1529978
> >>>
> >>> Henry
> >>>
> >>> On 27 January 2010 21:52, Qian Ye <ye...@gmail.com> wrote:
> >>>
> >>>
> >>>
> >>>> Hi Henry:
> >>>>
> >>>> According to your explanation, "*ZAB makes the guarantee that a proposal
> >>>> which has been logged by a quorum of followers will eventually be
> >>>> committed*". However, the ZooKeeper source code, in
> >>>> FastLeaderElection.java, shows that in the election the candidates only
> >>>> provide their zxid in the votes, and the one with the max zxid wins the
> >>>> election. I mean, it seems that no check is made to make sure that the
> >>>> latest proposal has been logged by a quorum of servers.
> >>>>
> >>>> In this situation, ZooKeeper would deliver a proposal which the client
> >>>> knows as a failed one. Imagine this scenario: in a ZooKeeper cluster with
> >>>> 5 servers, the leader only receives 1 ack for proposal A, and after a
> >>>> timeout the client is told that the proposal failed. At this time, all
> >>>> servers restart due to a power failure. The server that has proposal A in
> >>>> its log would become the leader, even though the client was told that
> >>>> proposal A failed.
> >>>>
> >>>> Do I misunderstand this?
> >>>>
> >>>>
> >>>> On Wed, Jan 27, 2010 at 10:37 AM, Henry Robinson <he...@cloudera.com>
> >>>> wrote:
> >>>>
> >>>>
> >>>>
> >>>>> Qing -
> >>>>>
> >>>>> That part of the documentation is slightly confusing. The elected leader
> >>>>> must have the highest zxid that has been written to disk by a quorum of
> >>>>> followers. ZAB makes the guarantee that a proposal which has been logged
> >>>>> by a quorum of followers will eventually be committed. Conversely, any
> >>>>> proposals that *don't* get logged by a quorum before the leader sending
> >>>>> them dies will not be committed. One of the ZAB papers covers both these
> >>>>> situations - making sure proposals are committed or skipped at the right
> >>>>> moments.
> >>>>>
> >>>>> So you get the neat property that leader election can be live in exactly
> >>>>> the case where the ZK cluster is live. If a quorum of peers aren't
> >>>>> available to elect the leader, the resulting cluster won't be live
> >>>>> anyhow, so it's ok for leader election to fail.
> >>>>>
> >>>>> FLP impossibility isn't actually strictly relevant for ZAB, because FLP
> >>>>> requires that message reordering is possible (see all the stuff in that
> >>>>> paper about non-deterministically drawing messages from a potentially
> >>>>> deliverable set). TCP FIFO channels don't reorder, so provide the extra
> >>>>> signalling that ZAB requires.
> >>>>>
> >>>>> cheers,
> >>>>> Henry
> >>>>>
> >>>> --
> >>>> With Regards!
> >>>>
> >>>> Ye, Qian
> >>>> Made in Zhejiang University
> >>>>
> >>>>
> >>>>
> >>>
> >>>
> >>>
> >>>
> >>>
> >>
> >>
> >
>
>


-- 
With Regards!

Ye, Qian
Made in Zhejiang University

Re: Q about ZK internal: how commit is being remembered

Posted by Ted Dunning <te...@gmail.com>.

Just a quick plug for my company, but all ACM publications are available to
rent at www.deepdyve.com.  See, for instance,
http://www.deepdyve.com/search?query=A+simple+totally+ordered+broadcast+protocol

This rental service isn't the same as getting the PDF and you may prefer to
subscribe to the ACM to get the actual documents.  It is much cheaper,
however, and might fit the bill.

If you try it, send me email.  It is a new service and we need feedback.



On Thu, Jan 28, 2010 at 6:31 PM, Qing Yan <qi...@gmail.com> wrote:

> Hi Qian Ye,
>
>
> Could you forward me a copy of the paper?  I don't have ACM access... many
> thanks!
>
>
> btw, I was a ZJUer too..
>
> cheers,
>
> Qing
>
>
>



-- 
Ted Dunning, CTO
DeepDyve

Re: Q about ZK internal: how commit is being remembered

Posted by Qing Yan <qi...@gmail.com>.

Hi Qian Ye,


Could you forward me a copy of the paper?  I don't have ACM access... many
thanks!


btw, I was a ZJUer too..

cheers,

Qing



On Fri, Jan 29, 2010 at 10:02 AM, Qian Ye <ye...@gmail.com> wrote:

> Thanks Henry and Ben. Actually I have read the paper Henry mentioned in
> this mail, but I'm still not so clear on some of the details. Anyway, maybe
> more study of the source code can help my understanding. Since Ben said
> that "if less than a quorum of servers have accepted a transaction, we can
> commit or discard", would this feature cause any unexpected problems? Can
> you give some hints about this issue?
>
>
>

Re: Q about ZK internal: how commit is being remembered

Posted by Mahadev Konar <ma...@yahoo-inc.com>.

Qian,

  ZooKeeper guarantees that if a client sees a transaction response, then
that transaction will persist, but the ones that a client does not see might
be discarded or committed. So in case a quorum does not log the transaction,
there might be a case wherein a ZooKeeper server which does not have the
logged transaction becomes the leader (because the machines with the logged
transaction are down). In that case the transaction is discarded. In the case
where a machine which has the logged transaction becomes the leader, that
transaction will be committed.

Hope that clears up your doubt.

mahadev
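
The two outcomes described above can be played through in a small sketch. It
is a toy model under stated assumptions, not ZooKeeper's recovery code: the
function name recover, the per-server log lists, and the tie-break by larger
server id are all hypothetical.

```python
from itertools import combinations


def recover(logs, alive, ensemble_size):
    """Elect the live server with the highest logged zxid.

    logs maps server id -> list of zxids it has logged, in order.
    Returns (leader, surviving_history), or None without a quorum.
    """
    quorum = ensemble_size // 2 + 1
    voters = [s for s in alive if s in logs]
    if len(voters) < quorum:
        return None
    # Assumption for this sketch: ties are broken by the larger server id.
    leader = max(voters, key=lambda s: (logs[s][-1] if logs[s] else 0, s))
    return leader, list(logs[leader])


# zxid 11 was logged by only two of five servers, so less than a quorum.
partial = {1: [10, 11], 2: [10, 11], 3: [10], 4: [10], 5: [10]}
print(recover(partial, {3, 4, 5}, 5))  # servers 1 and 2 are down: 11 is dropped
print(recover(partial, {2, 3, 4}, 5))  # server 2 is up: 11 survives

# zxid 11 was logged by a quorum: it survives whichever three servers come up.
logged_by_quorum = {1: [10, 11], 2: [10, 11], 3: [10, 11], 4: [10], 5: [10]}
for alive in combinations(range(1, 6), 3):
    leader, history = recover(logged_by_quorum, alive, 5)
    assert 11 in history
```

The first two calls show the "discarded or committed" cases for a transaction
the client never saw a response for; the loop shows why a transaction logged
by a quorum cannot be lost as long as some quorum comes back.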


On 1/28/10 6:02 PM, "Qian Ye" <ye...@gmail.com> wrote:

> Thanks Henry and Ben. Actually I have read the paper Henry mentioned in this
> mail, but I'm still not so clear on some of the details. Anyway, maybe
> more study of the source code can help my understanding. Since Ben said
> that "if less than a quorum of servers have accepted a transaction, we can
> commit or discard", would this feature cause any unexpected problems? Can you
> give some hints about this issue?
> 
> 
> 
> On Fri, Jan 29, 2010 at 1:09 AM, Benjamin Reed <br...@yahoo-inc.com> wrote:
> 
>> henry is correct. just to state another way, Zab guarantees that if a
>> quorum of servers have accepted a transaction, the transaction will commit.
>> this means that if less than a quorum of servers have accepted a
>> transaction, we can commit or discard. the only constraint we have in
>> choosing is ordering. we have to decide which partially accepted
>> transactions are going to be committed and which discarded before we propose
>> any new messages so that ordering is preserved.
>> 
>> ben
>> 
>> 
>> Henry Robinson wrote:
>> 
>>> Hi -
>>> 
>>> Note that a machine that has the highest received zxid will necessarily
>>> have
>>> seen the most recent transaction that was logged by a quorum of followers
>>> (the FIFO property of TCP again ensures that all previous messages will
>>> have
>>> been seen). This is the property that ZAB needs to preserve. The idea is
>>> to
>>> avoid missing a commit that went to a node that has since failed.
>>> 
>>> I was therefore slightly imprecise in my previous mail - it's possible for
>>> only partially-proposed proposals to be committed if the leader that is
>>> elected next has seen them. Only when another proposal is committed
>>> instead
>>> must the original proposal be discarded.
>>> 
>>> I highly recommend Ben Reed's and Flavio Junqueira's LADIS paper on the
>>> subject, for those with portal.acm.org access:
>>> http://portal.acm.org/citation.cfm?id=1529978
>>> 
>>> Henry
>>> 

Re: Q about ZK internal: how commit is being remembered

Posted by Qian Ye <ye...@gmail.com>.

Thanks, Henry and Ben. I have actually read the paper Henry mentioned in this
thread, but I'm still not clear on some of the details. Maybe more study of
the source code will help me understand. Ben said that "if less than a quorum
of servers have accepted a transaction, we can commit or discard". Could this
behavior cause any unexpected problems? Can you give some hints about this
issue?



On Fri, Jan 29, 2010 at 1:09 AM, Benjamin Reed <br...@yahoo-inc.com> wrote:

> henry is correct. just to state another way, Zab guarantees that if a
> quorum of servers have accepted a transaction, the transaction will commit.
> this means that if less than a quorum of servers have accepted a
> transaction, we can commit or discard. the only constraint we have in
> choosing is ordering. we have to decide which partially accepted
> transactions are going to be committed and which discarded before we propose
> any new messages so that ordering is preserved.
>
> ben
>
>


-- 
With Regards!

Ye, Qian
Made in Zhejiang University

Re: Q about ZK internal: how commit is being remembered

Posted by Benjamin Reed <br...@yahoo-inc.com>.

henry is correct. just to state another way, Zab guarantees that if a 
quorum of servers have accepted a transaction, the transaction will 
commit. this means that if less than a quorum of servers have accepted a 
transaction, we can commit or discard. the only constraint we have in 
choosing is ordering. we have to decide which partially accepted 
transactions are going to be committed and which discarded before we 
propose any new messages so that ordering is preserved.
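
The rule above can be illustrated with a small sketch. This is not ZooKeeper code: the function names (quorum_size, recover), the use of plain integers as zxids, and the tie-break on server name are all assumptions made for the example. It only shows that a transaction accepted by fewer than a quorum may end up either kept or discarded, depending on which servers take part in the next election, and that the choice is made before anything new is proposed.

```python
def quorum_size(ensemble_size):
    """Smallest number of servers that forms a majority."""
    return ensemble_size // 2 + 1


def recover(reachable_logs, ensemble_size):
    """Pick a new leader and the history it will carry forward.

    reachable_logs maps a server name to the ascending list of zxids it has
    logged. Plain integers stand in for real zxids here.
    """
    if len(reachable_logs) < quorum_size(ensemble_size):
        raise RuntimeError("no quorum reachable, election cannot complete")
    # Assumption: highest last-logged zxid wins, server name breaks ties.
    leader = max(
        reachable_logs,
        key=lambda s: (reachable_logs[s][-1] if reachable_logs[s] else 0, s),
    )
    # Everything in the leader's log is kept; anything it never saw is dropped.
    return leader, list(reachable_logs[leader])


# Old leader A crashed after zxid 3 reached only A and B (2 of 5, no quorum).
logs = {"B": [1, 2, 3], "C": [1, 2], "D": [1, 2], "E": [1, 2]}

# If B takes part in the election, zxid 3 is kept and will be committed.
with_b = recover({s: logs[s] for s in "BCD"}, 5)
# If B is unreachable, zxid 3 is discarded; both outcomes are allowed.
without_b = recover({s: logs[s] for s in "CDE"}, 5)

print(with_b)     # ('B', [1, 2, 3])
print(without_b)  # ('E', [1, 2])
```

Either way, the decision about zxid 3 is settled by the returned history before the new leader proposes anything else, which is what keeps the ordering intact.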

ben

Henry Robinson wrote:
> Hi -
>
> Note that a machine that has the highest received zxid will necessarily have
> seen the most recent transaction that was logged by a quorum of followers
> (the FIFO property of TCP again ensures that all previous messages will have
> been seen). This is the property that ZAB needs to preserve. The idea is to
> avoid missing a commit that went to a node that has since failed.
>
> I was therefore slightly imprecise in my previous mail - it's possible for
> only partially-proposed proposals to be committed if the leader that is
> elected next has seen them. Only when another proposal is committed instead
> must the original proposal be discarded.
>
> I highly recommend Ben Reed's and Flavio Junqueira's LADIS paper on the
> subject, for those with portal.acm.org access:
> http://portal.acm.org/citation.cfm?id=1529978
>
> Henry
>

Re: Q about ZK internal: how commit is being remembered

Posted by Henry Robinson <he...@cloudera.com>.

Hi -

Note that a machine that has the highest received zxid will necessarily have
seen the most recent transaction that was logged by a quorum of followers
(the FIFO property of TCP again ensures that all previous messages will have
been seen). This is the property that ZAB needs to preserve. The idea is to
avoid missing a commit that went to a node that has since failed.
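
A minimal sketch of this property, under simplifying assumptions: plain integers stand in for zxids, the five server names and the number of proposals each one logged are invented for the example, and FIFO delivery is modelled by making every log a prefix of the leader's proposal stream.

```python
QUORUM = 3  # majority of a 5-server ensemble

# Proposals the old leader A sent, in order (plain ints stand in for zxids).
proposals = [1, 2, 3, 4, 5]

# How many proposals each server logged before A crashed. Because the
# channel is FIFO, each log is a prefix of the proposal stream.
received = {"A": 5, "B": 5, "C": 3, "D": 4, "E": 2}
logs = {s: proposals[:n] for s, n in received.items()}


def highest_quorum_logged(logs, quorum):
    """Largest zxid that at least `quorum` servers have in their log."""
    candidates = [
        z for z in proposals
        if sum(z in log for log in logs.values()) >= quorum
    ]
    return max(candidates)


# A is gone; the survivor with the highest last zxid is the natural leader.
survivors = {s: log for s, log in logs.items() if s != "A"}
best = max(survivors, key=lambda s: survivors[s][-1])

# Its log contains every other survivor's log, thanks to the prefix property.
covers_all = all(set(log) <= set(survivors[best]) for log in survivors.values())

print(highest_quorum_logged(logs, QUORUM))  # 4
print(best, survivors[best])                # B [1, 2, 3, 4, 5]
print(covers_all)                           # True
```

In this example zxid 4 is the latest transaction logged by a quorum, and the survivor with the highest zxid (B) has it, along with zxid 5, which had reached only two servers.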

I was therefore slightly imprecise in my previous mail - it's possible for
only partially-proposed proposals to be committed if the leader that is
elected next has seen them. Only when another proposal is committed instead
must the original proposal be discarded.

I highly recommend Ben Reed's and Flavio Junqueira's LADIS paper on the
subject, for those with portal.acm.org access:
http://portal.acm.org/citation.cfm?id=1529978

Henry

On 27 January 2010 21:52, Qian Ye <ye...@gmail.com> wrote:

> Hi Henry:
>
> According to your explanation, "*ZAB makes the guarantee that a proposal
> which has been logged by
> a quorum of followers will eventually be committed*" , however, the source
> code of Zookeeper, the FastLeaderElection.java file, shows that, in the
> election, the candidates only provide their zxid in the votes, the one with
> the max zxid would win the election. I mean, it seems that no check has been
> made to make sure whether the latest proposal has been logged by a quorum of
> servers.
>
> In this situation, the zookeeper would deliver a proposal, which is known as
> a failed one by the client. Imagine this scenario, a zookeeper cluster with
> 5 servers, Leader only receives 1 ack for proposal A, after a timeout, the
> client is told that the proposal failed. At this time, all servers restart
> due to a power failure. The server have the log of proposal A would be the
> leader, however, the client is told the proposal A failed.
>
> Do I misunderstand this?
>
>



-- 
Henry Robinson
Software Engineer
Cloudera
415-994-6679

Re: Q about ZK internal: how commit is being remembered

Posted by Qian Ye <ye...@gmail.com>.

Hi Henry:

According to your explanation, "*ZAB makes the guarantee that a proposal
which has been logged by a quorum of followers will eventually be
committed*". However, the ZooKeeper source code (FastLeaderElection.java)
shows that, during the election, candidates only put their zxid in their
votes, and the one with the max zxid wins. It seems that no check is made
to ensure that the latest proposal has been logged by a quorum of servers.

In this situation, ZooKeeper could deliver a proposal that the client
believes has failed. Imagine a ZooKeeper cluster with 5 servers. The leader
receives only 1 ack for proposal A, and after a timeout the client is told
that the proposal failed. Then all servers restart due to a power failure.
The server that has proposal A in its log would become the leader, even
though the client was told that proposal A failed.

Am I misunderstanding this?


On Wed, Jan 27, 2010 at 10:37 AM, Henry Robinson <he...@cloudera.com> wrote:

> Qing -
>
> That part of the documentation is slightly confusing. The elected leader
> must have the highest zxid that has been written to disk by a quorum of
> followers. ZAB makes the guarantee that a proposal which has been logged by
> a quorum of followers will eventually be committed. Conversely, any
> proposals that *don't* get logged by a quorum before the leader sending them
> dies will not be committed. One of the ZAB papers covers both these
> situations - making sure proposals are committed or skipped at the right
> moments.
>
> So you get the neat property that leader election can be live in exactly the
> case where the ZK cluster is live. If a quorum of peers aren't available to
> elect the leader, the resulting cluster won't be live anyhow, so it's ok for
> leader election to fail.
>
> FLP impossibility isn't actually strictly relevant for ZAB, because FLP
> requires that message reordering is possible (see all the stuff in that
> paper about non-deterministically drawing messages from a potentially
> deliverable set). TCP FIFO channels don't reorder, so provide the extra
> signalling that ZAB requires.
>
> cheers,
> Henry
>



-- 
With Regards!

Ye, Qian
Made in Zhejiang University

Re: Q about ZK internal: how commit is being remembered

Posted by Qing Yan <qi...@gmail.com>.

I haven't read the code, so I can't comment on how and when
FastLeaderElection works. (Maybe it is some kind of optimization that
can't handle catastrophic failures. Just a wild guess...)

BTW, I want to ask how the ZAB protocol works in the following situation:

Suppose a ZK cluster consists of 5 nodes (A, B, C, D, E). Leader A proposes
a new message:

case a) A and B log it successfully; quorum is not reached, so the commit
should fail.
case b) A, B and C log it successfully; quorum is reached, so the commit
should succeed.

Now A crashes and goes away. How does the new leader distinguish case a)
from case b), since the remaining servers that logged the message (B alone,
or B and C) do not form a quorum in either case?
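
The two cases can be worked through by brute force. The sketch below is only an illustration of the quorum arithmetic, not ZooKeeper code; the helper names (election_quorums, outcomes) are made up. It enumerates every group of survivors that is large enough to elect a new leader and counts how many of those groups contain at least one server that logged the proposal.

```python
from itertools import combinations

ENSEMBLE = "ABCDE"
QUORUM = len(ENSEMBLE) // 2 + 1  # 3


def election_quorums(survivors):
    """Every majority-sized (or larger) group the survivors can form."""
    return [
        set(group)
        for size in range(QUORUM, len(survivors) + 1)
        for group in combinations(survivors, size)
    ]


def outcomes(logged_by, crashed="A"):
    """Count electing groups that do / do not include a server holding the proposal."""
    survivors = [s for s in ENSEMBLE if s != crashed]
    groups = election_quorums(survivors)
    kept = sum(1 for g in groups if g & set(logged_by))
    return kept, len(groups) - kept


print(outcomes("AB"))   # case a) -> (4, 1)
print(outcomes("ABC"))  # case b) -> (5, 0)
```

In case b) every possible electing group contains B or C, so the proposal is always seen by the new leader. In case a) the group {C, D, E} can elect a leader without ever seeing the proposal, so it may be kept or dropped. Under this reading the new leader does not need to tell the two cases apart: it carries forward whatever the highest zxid in its electing group implies.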

Re: Q about ZK internal: how commit is being remembered

Posted by Henry Robinson <he...@cloudera.com>.

Qing -

That part of the documentation is slightly confusing. The elected leader
must have the highest zxid that has been written to disk by a quorum of
followers. ZAB makes the guarantee that a proposal which has been logged by
a quorum of followers will eventually be committed. Conversely, any
proposals that *don't* get logged by a quorum before the leader sending them
dies will not be committed. One of the ZAB papers covers both these
situations - making sure proposals are committed or skipped at the right
moments.

So you get the neat property that leader election can be live in exactly the
case where the ZK cluster is live. If a quorum of peers aren't available to
elect the leader, the resulting cluster won't be live anyhow, so it's ok for
leader election to fail.

FLP impossibility isn't actually strictly relevant for ZAB, because FLP
requires that message reordering is possible (see all the stuff in that
paper about non-deterministically drawing messages from a potentially
deliverable set). TCP FIFO channels don't reorder, so provide the extra
signalling that ZAB requires.

cheers,
Henry

2010/1/26 Qing Yan <qi...@gmail.com>

> Hi,
>
> I have question about how zookeeper *remembers* a commit operation.
>
> According to
>
> http://hadoop.apache.org/zookeeper/docs/r3.2.2/zookeeperInternals.html#sc_summary
>
> <quote>
>
>
> The leader will issue a COMMIT to all followers as soon as a quorum of
> followers have ACKed a message. Since messages are ACKed in order, COMMITs
> will be sent by the leader as received by the followers in order.
>
> COMMITs are processed in order. Followers deliver a proposals message when
> that proposal is committed.
> </quote>
>
> My question is will leader wait for COMMIT to be processed by quorum
> of followers before consider
> COMMIT to be success? From the documentation it seems that leader handles
> COMMIT asynchronously and
> don't expect confirmation from followers. In the extreme case, what happens
> if leader issue a COMMIT
> to all followers and crash immediately before the COMMIT message can go out
> of the network. How the system
> remembers the COMMIT ever happens?
>
> Actually this is related to the leader election process:
>
> <quote>
> ZooKeeper messaging doesn't care about the exact method of electing a
> leader
> has long as the following holds:
>
>   -
>
>   The leader has seen the highest zxid of all the followers.
>   -
>
>   A quorum of servers have committed to following the leader.
>
>  Of these two requirements only the first, the highest zxid amoung the
> followers needs to hold for correct operation.
>
> </quote>
>
> Is there a liveness issue try to find "The leader has seen the highest zxid
> of all the followers"? What if some of the followers (which happens to
> holding the highest zxid) cannot be contacted(FLP impossible result?)
>  It will be more striaghtforward if COMMIT requires confirmation from a
> quorum of the followers. But I guess things get
> optimized according to Zab's FIFO nature...just want to hear some
> clarification about it.
>
> Thanks alot!
>

Re: Q about ZK internal: how commit is being remembered

Posted by Mahadev Konar <ma...@yahoo-inc.com>.

Qing,
  This is how it happens -

  The leader will issue a COMMIT only when a quorum of servers have logged the
transaction to disk. Issuing the COMMIT just means that all the servers
can make the transaction visible to clients. The COMMIT message is never
logged.

On leader crash, another machine with the max zxid will be elected. Since a
quorum has logged the above transaction to disk, the new leader will have
that transaction on disk and will let the other members of the quorum know
of the transaction in case they haven't logged it. This way the transaction
is always remembered if a client has seen that transaction go through the
zookeeper service.
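
As a rough sketch of the leader-side bookkeeping described above (the class and method names are hypothetical, plain integers stand in for zxids, and counting the leader's own log write toward the quorum is an assumption of this example):

```python
class Leader:
    """Toy model of a leader that issues COMMITs in proposal order."""

    def __init__(self, ensemble_size):
        self.quorum = ensemble_size // 2 + 1
        self.pending = []       # zxids proposed but not yet committed, in order
        self.acks = {}          # zxid -> set of servers that logged it
        self.commits_sent = []  # zxids for which COMMIT was issued, in order

    def propose(self, zxid):
        self.pending.append(zxid)
        # Assumption: the leader's own log write counts toward the quorum.
        self.acks[zxid] = {"leader"}

    def on_ack(self, zxid, follower):
        self.acks[zxid].add(follower)
        # Issue COMMITs strictly in order: only the oldest pending proposal
        # may commit, and only once a quorum has logged it.
        while self.pending and len(self.acks[self.pending[0]]) >= self.quorum:
            self.commits_sent.append(self.pending.pop(0))


leader = Leader(ensemble_size=5)
leader.propose(1)
leader.propose(2)
leader.on_ack(2, "B")
leader.on_ack(2, "C")   # zxid 2 has a quorum, but zxid 1 is still pending
leader.on_ack(1, "B")
leader.on_ack(1, "C")   # zxid 1 reaches a quorum; 1 and then 2 are committed

print(leader.commits_sent)  # [1, 2]
```

Note that nothing here records the COMMIT itself. What survives a leader crash is the set of logged proposals on a quorum of servers, which is what the next leader works from.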

You can read more about this in one of the internals presentations at:
http://wiki.apache.org/hadoop/ZooKeeper/ZooKeeperPresentations

Thanks
mahadev


On 1/26/10 6:20 PM, "Qing Yan" <qi...@gmail.com> wrote:

> Hi,
> 
> I have a question about how zookeeper *remembers* a commit operation.
> 
> According to
> http://hadoop.apache.org/zookeeper/docs/r3.2.2/zookeeperInternals.html#sc_summary
> 
> <quote>
> 
> 
> The leader will issue a COMMIT to all followers as soon as a quorum of
> followers have ACKed a message. Since messages are ACKed in order, COMMITs
> will be sent by the leader as received by the followers in order.
> 
> COMMITs are processed in order. Followers deliver a proposal's message when
> that proposal is committed.
> </quote>
> 
> My question is: will the leader wait for the COMMIT to be processed by a
> quorum of followers before considering the COMMIT a success? From the
> documentation it seems that the leader handles COMMIT asynchronously and
> does not expect confirmation from followers. In the extreme case, what
> happens if the leader issues a COMMIT to all followers and crashes
> immediately, before the COMMIT message can go out on the network? How does
> the system remember that the COMMIT ever happened?
> 
> Actually this is related to the leader election process:
> 
> <quote>
> ZooKeeper messaging doesn't care about the exact method of electing a leader
> as long as the following holds:
> 
>    - The leader has seen the highest zxid of all the followers.
>    - A quorum of servers have committed to following the leader.
> 
> Of these two requirements, only the first (the highest zxid among the
> followers) needs to hold for correct operation.
> 
> </quote>
> 
> Is there a liveness issue in finding a leader that "has seen the highest
> zxid of all the followers"? What if some of the followers (which happen to
> hold the highest zxid) cannot be contacted (the FLP impossibility result?)
> It would be more straightforward if COMMIT required confirmation from a
> quorum of the followers. But I guess things get optimized according to
> Zab's FIFO nature... I just want to hear some clarification about it.
> 
> Thanks a lot!