Posted to dev@zookeeper.apache.org by Flavio Junqueira <fp...@yahoo.com> on 2013/07/17 12:30:16 UTC

Recovery time (was: Maximum size of a snapshot)

Moving the discussion to dev but keeping user on CC.

Let's step back. The reason we started the latest discussion in this thread is that Kishore is concerned about recovery time. There are a number of improvements we have been looking at for the next release. Let me go over my current understanding of the main points that add to the recovery time:

1- Before we even start leader election, each server loads state from disk to determine its last zxid. The last zxid is used in the election.
2- Once the leader is elected, it loads state from disk and takes a snapshot. Loading the database again is unnecessary (ZOOKEEPER-1642) and the snapshot adds latency. In fact, it is not even correct to have it there (ZOOKEEPER-1558).
3- A follower takes a snapshot before acknowledging the NEWLEADER message, so the leader has to wait until a quorum of followers finish their snapshots.

The proposal I've heard here is to touch (1). For now, I'd rather keep (1) as is and focus on fixing (2). We might be able to do something about (3) and I'm actually not sure if there has been a discussion about it or not.
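
As a rough illustration of how points (1) to (3) add up, here is a minimal sketch in Python. It assumes the phases run strictly one after another, that the leader counts toward its own quorum, and that followers take their snapshots in parallel. The function names and all timings are hypothetical, not measurements.

```python
from typing import List


def quorum_size(ensemble_size: int) -> int:
    """Majority quorum for an ensemble of the given size."""
    return ensemble_size // 2 + 1


def estimate_recovery_seconds(
    load_before_election: float,     # point (1): load state to find the last zxid
    election: float,                 # the election itself
    leader_reload: float,            # point (2): leader loads the database again
    leader_snapshot: float,          # point (2): leader takes a snapshot
    follower_snapshots: List[float]  # point (3): one snapshot time per follower
) -> float:
    """Add up the sequential phases of recovery (a simplification)."""
    ensemble = len(follower_snapshots) + 1  # followers plus the leader
    # Assumption: the leader counts itself, so it needs quorum - 1 follower acks.
    acks_needed = quorum_size(ensemble) - 1
    if acks_needed > 0:
        # Followers snapshot in parallel, so the leader waits for the
        # acks_needed-th fastest follower.
        follower_wait = sorted(follower_snapshots)[acks_needed - 1]
    else:
        follower_wait = 0.0
    return (load_before_election + election + leader_reload
            + leader_snapshot + follower_wait)


# Hypothetical three-server ensemble (one leader, two followers).
current = estimate_recovery_seconds(30.0, 1.0, 30.0, 25.0, [20.0, 35.0])
without_point_2 = estimate_recovery_seconds(30.0, 1.0, 0.0, 0.0, [20.0, 35.0])
print(current, without_point_2)
```

With these made-up numbers, dropping the leader's reload and snapshot in (2) takes the estimate from 106 s to 51 s, while (1) and (3) are left untouched.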

-Flavio

On Jul 17, 2013, at 5:40 AM, Thawan Kooburat <th...@fb.com> wrote:

> Client will get session expire event only when a server explicitly tells
> the client. So any established sessions will remain in a disconnected
> state during the period
> 
> So my comment about the need for longer session timeout might be
> incorrect. While the quorum is down during leader election, session won't
> expire during this period. When the quorum comes back, the client have to
> reconnect within session timeout in order to resume the session.  However,
> client won't be able to issue any read/write request or create a new
> session while the quorum is down.
> 
> However, some application may need a stronger consistency guarantee. They
> will have a special logic to abort the client if it was disconnected for
> an extended period. This is because the client won't be able to tell if
> the quorum is down or there is a network partition between the client and
> the quorum. 
> 
> 
> -- 
> Thawan Kooburat
> 
> 
> 
> 
> 
> On 7/16/13 6:46 PM, "kishore g" <g....@gmail.com> wrote:
> 
>> Thanks Thawan. Another question to follow up, so lets say client c1 is
>> connected to leader and leader fails. Now c1 is trying to connect to
>> another zk server but all servers are busy loading snapshot and can take a
>> minute or two. According to Flavio zk servers dont accept any request
>> while
>> synchronization, but most clients dont keep that high connection timeout.
>> So does this mean clients will timeout on connection?. Is my understanding
>> correct or zk servers will accept connection requests but reject
>> read/write
>> requests.
>> 
>> thanks,
>> Kishore G
>> 
>> 
>> On Tue, Jul 16, 2013 at 3:45 PM, Thawan Kooburat <th...@fb.com> wrote:
>> 
>>> There is a plan to work on this optimization ZOOKEEPER-1674.
>>> 
>>> 
>>> --
>>> Thawan Kooburat
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On 7/16/13 1:37 PM, "kishore g" <g....@gmail.com> wrote:
>>> 
>>>> All servers in the quorum reading the snapshot from disk as part of the
>>>> synchronization phase. From Thawan's email it looks like when ever
>>> there
>>>> is
>>>> a leader election, all zk servers read the snapshot from disk. I am not
>>>> sure why all servers should reload the snapshot from disk as this
>>>> increases
>>>> unavailability time.
>>>> 
>>>> 
>>>> On Tue, Jul 16, 2013 at 12:35 PM, Flavio Junqueira
>>>> <fp...@yahoo.com>wrote:
>>>> 
>>>>> The synchronization phase is part of the protocol and we use it to
>>>>> guarantee that we expose a consistent view of the state. During the
>>>>> synchronization phase, servers do not accept requests.
>>>>> 
>>>>> Which behavior are you proposing we change, Kishore?
>>>>> 
>>>>> -Flavio
>>>>> 
>>>>> On Jul 16, 2013, at 7:04 PM, kishore g <g....@gmail.com> wrote:
>>>>> 
>>>>>> Thanks for clarification Flavio. Does this mean during the leader
>>>>> election,
>>>>>> both reads and writes are not supported?. Do we start a separate
>>>>>> thread/jira of changing this behavior?.
>>>>>> 
>>>>>> thanks,
>>>>>> Kishore G
>>>>>> 
>>>>>> 
>>>>>> On Tue, Jul 16, 2013 at 9:16 AM, Flavio Junqueira
>>>>> <fpjunqueira@yahoo.com
>>>>>> wrote:
>>>>>> 
>>>>>>> The disk state should be the authoritative state of a server, so
>>> if I
>>>>>>> remember correctly, we load the database as a way of validating
>>> the
>>>>> disk
>>>>>>> state. I don't claim that this is strictly necessary, but if we
>>> are
>>>>> to
>>>>>>> change it, then I would need to think this through.
>>>>>>> 
>>>>>>> About leader election, if a leader loses support from a quorum of
>>>>>>> followers,
>>>>>>> then it will drop leadership. Any event that causes a follower to
>>>>> stop
>>>>>>> receiving messages from the leader or the follower to disconnect
>>> from
>>>>> the
>>>>>>> leader will make it stop supporting the current leader.
>>>>>>> 
>>>>>>> -Flavio
>>>>>>> 
>>>>>>> -----Original Message-----
>>>>>>> From: Sergey Maslyakov [mailto:evolvah@gmail.com]
>>>>>>> Sent: 16 July 2013 16:16
>>>>>>> To: user@zookeeper.apache.org
>>>>>>> Subject: Re: Maximum size of a snapshot
>>>>>>> 
>>>>>>> And another extension on top of Kishore's question: do the
>>>>> reelections
>>>>>>> happen if the previously elected leader remains in the cluster? In
>>>>> other
>>>>>>> words, what events can trigger re-election and the corresponding
>>>>> temporary
>>>>>>> degradation of the service provided by Zookeeper?
>>>>>>> 
>>>>>>> 
>>>>>>> Thank you,
>>>>>>> /Sergey
>>>>>>> 
>>>>>>> 
>>>>>>> On Tue, Jul 16, 2013 at 2:21 AM, kishore g <g....@gmail.com>
>>>>> wrote:
>>>>>>> 
>>>>>>>> Regarding #2. Is that really true that during leader election
>>> every
>>>>>>>> machine reloads snapshot data from disk? Any reason why this is
>>>>> needed
>>>>>>>> unless it really needs to truncate or undo conflicting
>>> transactions
>>>>>>> already applied?
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Mon, Jul 15, 2013 at 9:50 PM, Thawan Kooburat <th...@fb.com>
>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Max snapshot size:
>>>>>>>>> 
>>>>>>>>> Here is my take on these issue,  others feel free to add or
>>>>> correct.
>>>>>>>>> 
>>>>>>>>> 1. Depends on how much RAM your machine has.  Snapshot is
>>> should be
>>>>>>>>> less than the available RAM since everything is loaded into
>>> memory.
>>>>>>>>> 2. Depends on what is the availability guarantee that the client
>>>>> needs.
>>>>>>>>> If there is leader election, every machine need to reload the
>>> data
>>>>>>>>> from disk. So the quorum will be down for at least the same as
>>>>>>>>> snapshot
>>>>>>>> loading
>>>>>>>>> time. The session timeout on the client side should be at least
>>>>>>>>> longer than expected downtime during leader election.
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> Thawan Kooburat
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On 7/15/13 8:46 PM, "Sergey Maslyakov" <ev...@gmail.com>
>>> wrote:
>>>>>>>>> 
>>>>>>>>>> I have a couple of sizing questions to the users and
>>> developers.
>>>>>>>>>> Hope,
>>>>>>>> you
>>>>>>>>>> don't mind answering those.
>>>>>>>>>> 
>>>>>>>>>> What is the guideline for the maximum reasonable size of a
>>>>> DataTree
>>>>>>>> that a
>>>>>>>>>> single ZK server can manage? If ZK server writes out a
>>> snapshot of
>>>>>>>>>> about 1GB in size, is it pushed beyond the limits or is it
>>> still
>>>>>>> manageable?
>>>>>>>> If
>>>>>>>>>> so, where is the critical threshold when ZK is really being
>>>>> abused?
>>>>>>>>>> 
>>>>>>>>>> Similarly, how can I estimate the propagation delay of a change
>>>>>>>>>> across
>>>>>>>> an
>>>>>>>>>> ensemble of three ZK servers?
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Thank you,
>>>>>>>>>> /Sergey
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>> 
>>>>> 
>>> 
>>> 
>

Re: Recovery time (was: Maximum size of a snapshot)

Posted by kishore g <g....@gmail.com>.

On (1), loading state from disk to find the last zxid: does this mean it
loads the snapshot, or simply reads the tail of the transaction log?
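
For context on why the last zxid matters in point (1): a zxid is a 64-bit value whose high 32 bits hold the epoch and whose low 32 bits hold a counter, and the election favours the server with the most recent one. The sketch below is a simplified illustration of that comparison, assuming a vote is just (last zxid, server id). The function names are hypothetical, and the real election exchanges more state than this.

```python
from typing import Dict, Tuple


def make_zxid(epoch: int, counter: int) -> int:
    """Pack an epoch and a counter into one 64-bit zxid."""
    return (epoch << 32) | (counter & 0xFFFFFFFF)


def split_zxid(zxid: int) -> Tuple[int, int]:
    """Return (epoch, counter) for a zxid."""
    return zxid >> 32, zxid & 0xFFFFFFFF


def pick_leader(last_zxids: Dict[int, int]) -> int:
    """Pick the server with the highest last zxid.

    Simplification: ties are broken by the higher server id.
    Because the epoch sits in the high bits, comparing zxids as
    integers orders them by epoch first and counter second.
    """
    return max(last_zxids, key=lambda sid: (last_zxids[sid], sid))


# Hypothetical ensemble: server id -> last zxid found on disk.
servers = {
    1: make_zxid(4, 900),  # older epoch, many transactions
    2: make_zxid(5, 3),    # newer epoch wins over a larger counter
    3: make_zxid(5, 3),    # same zxid as server 2, higher id
}
print(pick_leader(servers), split_zxid(servers[2]))
```

Whichever way a server obtains its last zxid, this comparison only needs that single number per server.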





On Wed, Jul 17, 2013 at 6:43 AM, Flavio Junqueira <fp...@yahoo.com> wrote:

> I need to also mention ZOOKEEPER-1549 in the context of point (2) below.
> That's a blocker for 3.5.0.
>
> -Flavio
>
> On Jul 17, 2013, at 12:30 PM, Flavio Junqueira <fp...@yahoo.com>
> wrote:
>
> > Moving the discussion to dev but keeping user on CC.
> >
> > Let's step back. The reason why we started the latest discussion in this
> > thread was because Kishore is concerned about recovery time. There are a
> > number of improvements we have been looking at for the next release, let me
> > go over my current understanding of the main points that add to the
> > recovery time:
> >
> > 1- Before we even start leader election, each server loads state from
> > disk to determine its last zxid. The last zxid is used in the election;
> > 2- Once the leader is elected, it loads state from disk and take a
> > snapshot. Loading the database again is unecessary (ZOOKEEPER-1642) and the
> > snapshot adds latency. In fact, it is not even correct to have it there
> > (ZOOKEEPER-1558).
> > 3- A follower takes a snapshot before acknowledging the NEWLEADER
> > message, so the leader has to wait until a quorum of followers finishes
> > their snapshot.
> >
> > The proposal I've heard here is to touch (1). For now, I'd rather keep
> > (1) as is and focus on fixing (2). We might be able to do something about
> > (3) and I'm actually not sure if there has been a discussion about it or
> > not.
> >
> > -Flavio
> >
> > On Jul 17, 2013, at 5:40 AM, Thawan Kooburat <th...@fb.com> wrote:
> >
> >> Client will get session expire event only when a server explicitly tells
> >> the client. So any established sessions will remain in a disconnected
> >> state during the period
> >>
> >> So my comment about the need for longer session timeout might be
> >> incorrect. While the quorum is down during leader election, session won't
> >> expire during this period. When the quorum comes back, the client have to
> >> reconnect within session timeout in order to resume the session.  However,
> >> client won't be able to issue any read/write request or create a new
> >> session while the quorum is down.
> >>
> >> However, some application may need a stronger consistency guarantee. They
> >> will have a special logic to abort the client if it was disconnected for
> >> an extended period. This is because the client won't be able to tell if
> >> the quorum is down or there is a network partition between the client and
> >> the quorum.
> >>
> >>
> >> --
> >> Thawan Kooburat
> >>
> >>
> >>
> >>
> >>
> >> On 7/16/13 6:46 PM, "kishore g" <g....@gmail.com> wrote:
> >>
> >>> Thanks Thawan. Another question to follow up, so lets say client c1 is
> >>> connected to leader and leader fails. Now c1 is trying to connect to
> >>> another zk server but all servers are busy loading snapshot and can
> take a
> >>> minute or two. According to Flavio zk servers dont accept any request
> >>> while
> >>> synchronization, but most clients dont keep that high connection
> timeout.
> >>> So does this mean clients will timeout on connection?. Is my
> understanding
> >>> correct or zk servers will accept connection requests but reject
> >>> read/write
> >>> requests.
> >>>
> >>> thanks,
> >>> Kishore G
> >>>
> >>>
> >>> On Tue, Jul 16, 2013 at 3:45 PM, Thawan Kooburat <th...@fb.com>
> wrote:
> >>>
> >>>> There is a plan to work on this optimization ZOOKEEPER-1674.
> >>>>
> >>>>
> >>>> --
> >>>> Thawan Kooburat
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On 7/16/13 1:37 PM, "kishore g" <g....@gmail.com> wrote:
> >>>>
> >>>>> All servers in the quorum reading the snapshot from disk as part of
> the
> >>>>> synchronization phase. From Thawan's email it looks like when ever
> >>>> there
> >>>>> is
> >>>>> a leader election, all zk servers read the snapshot from disk. I am
> not
> >>>>> sure why all servers should reload the snapshot from disk as this
> >>>>> increases
> >>>>> unavailability time.
> >>>>>
> >>>>>
> >>>>> On Tue, Jul 16, 2013 at 12:35 PM, Flavio Junqueira
> >>>>> <fp...@yahoo.com>wrote:
> >>>>>
> >>>>>> The synchronization phase is part of the protocol and we use it to
> >>>>>> guarantee that we expose a consistent view of the state. During the
> >>>>>> synchronization phase, servers do not accept requests.
> >>>>>>
> >>>>>> Which behavior are you proposing we change, Kishore?
> >>>>>>
> >>>>>> -Flavio
> >>>>>>
> >>>>>> On Jul 16, 2013, at 7:04 PM, kishore g <g....@gmail.com> wrote:
> >>>>>>
> >>>>>>> Thanks for clarification Flavio. Does this mean during the leader
> >>>>>> election,
> >>>>>>> both reads and writes are not supported?. Do we start a separate
> >>>>>>> thread/jira of changing this behavior?.
> >>>>>>>
> >>>>>>> thanks,
> >>>>>>> Kishore G
> >>>>>>>
> >>>>>>>
> >>>>>>> On Tue, Jul 16, 2013 at 9:16 AM, Flavio Junqueira
> >>>>>> <fpjunqueira@yahoo.com
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> The disk state should be the authoritative state of a server, so
> >>>> if I
> >>>>>>>> remember correctly, we load the database as a way of validating
> >>>> the
> >>>>>> disk
> >>>>>>>> state. I don't claim that this is strictly necessary, but if we
> >>>> are
> >>>>>> to
> >>>>>>>> change it, then I would need to think this through.
> >>>>>>>>
> >>>>>>>> About leader election, if a leader loses support from a quorum of
> >>>>>>>> followers,
> >>>>>>>> then it will drop leadership. Any event that causes a follower to
> >>>>>> stop
> >>>>>>>> receiving messages from the leader or the follower to disconnect
> >>>> from
> >>>>>> the
> >>>>>>>> leader will make it stop supporting the current leader.
> >>>>>>>>
> >>>>>>>> -Flavio
> >>>>>>>>
> >>>>>>>> -----Original Message-----
> >>>>>>>> From: Sergey Maslyakov [mailto:evolvah@gmail.com]
> >>>>>>>> Sent: 16 July 2013 16:16
> >>>>>>>> To: user@zookeeper.apache.org
> >>>>>>>> Subject: Re: Maximum size of a snapshot
> >>>>>>>>
> >>>>>>>> And another extension on top of Kishore's question: do the
> >>>>>> reelections
> >>>>>>>> happen if the previously elected leader remains in the cluster? In
> >>>>>> other
> >>>>>>>> words, what events can trigger re-election and the corresponding
> >>>>>> temporary
> >>>>>>>> degradation of the service provided by Zookeeper?
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Thank you,
> >>>>>>>> /Sergey
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Tue, Jul 16, 2013 at 2:21 AM, kishore g <g....@gmail.com>
> >>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Regarding #2. Is that really true that during leader election
> >>>> every
> >>>>>>>>> machine reloads snapshot data from disk? Any reason why this is
> >>>>>> needed
> >>>>>>>>> unless it really needs to truncate or undo conflicting
> >>>> transactions
> >>>>>>>> already applied?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Mon, Jul 15, 2013 at 9:50 PM, Thawan Kooburat <th...@fb.com>
> >>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Max snapshot size:
> >>>>>>>>>>
> >>>>>>>>>> Here is my take on these issue,  others feel free to add or
> >>>>>> correct.
> >>>>>>>>>>
> >>>>>>>>>> 1. Depends on how much RAM your machine has.  Snapshot is
> >>>> should be
> >>>>>>>>>> less than the available RAM since everything is loaded into
> >>>> memory.
> >>>>>>>>>> 2. Depends on what is the availability guarantee that the client
> >>>>>> needs.
> >>>>>>>>>> If there is leader election, every machine need to reload the
> >>>> data
> >>>>>>>>>> from disk. So the quorum will be down for at least the same as
> >>>>>>>>>> snapshot
> >>>>>>>>> loading
> >>>>>>>>>> time. The session timeout on the client side should be at least
> >>>>>>>>>> longer than expected downtime during leader election.
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>> Thawan Kooburat
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On 7/15/13 8:46 PM, "Sergey Maslyakov" <ev...@gmail.com>
> >>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> I have a couple of sizing questions to the users and
> >>>> developers.
> >>>>>>>>>>> Hope,
> >>>>>>>>> you
> >>>>>>>>>>> don't mind answering those.
> >>>>>>>>>>>
> >>>>>>>>>>> What is the guideline for the maximum reasonable size of a
> >>>>>> DataTree
> >>>>>>>>> that a
> >>>>>>>>>>> single ZK server can manage? If ZK server writes out a
> >>>> snapshot of
> >>>>>>>>>>> about 1GB in size, is it pushed beyond the limits or is it
> >>>> still
> >>>>>>>> manageable?
> >>>>>>>>> If
> >>>>>>>>>>> so, where is the critical threshold when ZK is really being
> >>>>>> abused?
> >>>>>>>>>>>
> >>>>>>>>>>> Similarly, how can I estimate the propagation delay of a change
> >>>>>>>>>>> across
> >>>>>>>>> an
> >>>>>>>>>>> ensemble of three ZK servers?
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Thank you,
> >>>>>>>>>>> /Sergey
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>>>
> >>>>
> >>>>
> >>
> >
>
>

Re: Recovery time (was: Maximum size of a snapshot)

Posted by Flavio Junqueira <fp...@yahoo.com>.

I need to also mention ZOOKEEPER-1549 in the context of point (2) below. That's a blocker for 3.5.0. 

-Flavio

On Jul 17, 2013, at 12:30 PM, Flavio Junqueira <fp...@yahoo.com> wrote:

> Moving the discussion to dev but keeping user on CC.
> 
> Let's step back. The reason why we started the latest discussion in this thread was because Kishore is concerned about recovery time. There are a number of improvements we have been looking at for the next release, let me go over my current understanding of the main points that add to the recovery time:
> 
> 1- Before we even start leader election, each server loads state from disk to determine its last zxid. The last zxid is used in the election;
> 2- Once the leader is elected, it loads state from disk and take a snapshot. Loading the database again is unecessary (ZOOKEEPER-1642) and the snapshot adds latency. In fact, it is not even correct to have it there (ZOOKEEPER-1558).
> 3- A follower takes a snapshot before acknowledging the NEWLEADER message, so the leader has to wait until a quorum of followers finishes their snapshot.
> 
> The proposal I've heard here is to touch (1). For now, I'd rather keep (1) as is and focus on fixing (2). We might be able to do something about (3) and I'm actually not sure if there has been a discussion about it or not.
> 
> -Flavio
> 
> On Jul 17, 2013, at 5:40 AM, Thawan Kooburat <th...@fb.com> wrote:
> 
>> Client will get session expire event only when a server explicitly tells
>> the client. So any established sessions will remain in a disconnected
>> state during the period
>> 
>> So my comment about the need for longer session timeout might be
>> incorrect. While the quorum is down during leader election, session won't
>> expire during this period. When the quorum comes back, the client have to
>> reconnect within session timeout in order to resume the session.  However,
>> client won't be able to issue any read/write request or create a new
>> session while the quorum is down.
>> 
>> However, some application may need a stronger consistency guarantee. They
>> will have a special logic to abort the client if it was disconnected for
>> an extended period. This is because the client won't be able to tell if
>> the quorum is down or there is a network partition between the client and
>> the quorum. 
>> 
>> 
>> -- 
>> Thawan Kooburat
>> 
>> 
>> 
>> 
>> 
>> On 7/16/13 6:46 PM, "kishore g" <g....@gmail.com> wrote:
>> 
>>> Thanks Thawan. Another question to follow up, so lets say client c1 is
>>> connected to leader and leader fails. Now c1 is trying to connect to
>>> another zk server but all servers are busy loading snapshot and can take a
>>> minute or two. According to Flavio zk servers dont accept any request
>>> while
>>> synchronization, but most clients dont keep that high connection timeout.
>>> So does this mean clients will timeout on connection?. Is my understanding
>>> correct or zk servers will accept connection requests but reject
>>> read/write
>>> requests.
>>> 
>>> thanks,
>>> Kishore G
>>> 
>>> 
>>> On Tue, Jul 16, 2013 at 3:45 PM, Thawan Kooburat <th...@fb.com> wrote:
>>> 
>>>> There is a plan to work on this optimization ZOOKEEPER-1674.
>>>> 
>>>> 
>>>> --
>>>> Thawan Kooburat
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On 7/16/13 1:37 PM, "kishore g" <g....@gmail.com> wrote:
>>>> 
>>>>> All servers in the quorum reading the snapshot from disk as part of the
>>>>> synchronization phase. From Thawan's email it looks like when ever
>>>> there
>>>>> is
>>>>> a leader election, all zk servers read the snapshot from disk. I am not
>>>>> sure why all servers should reload the snapshot from disk as this
>>>>> increases
>>>>> unavailability time.
>>>>> 
>>>>> 
>>>>> On Tue, Jul 16, 2013 at 12:35 PM, Flavio Junqueira
>>>>> <fp...@yahoo.com>wrote:
>>>>> 
>>>>>> The synchronization phase is part of the protocol and we use it to
>>>>>> guarantee that we expose a consistent view of the state. During the
>>>>>> synchronization phase, servers do not accept requests.
>>>>>> 
>>>>>> Which behavior are you proposing we change, Kishore?
>>>>>> 
>>>>>> -Flavio
>>>>>> 
>>>>>> On Jul 16, 2013, at 7:04 PM, kishore g <g....@gmail.com> wrote:
>>>>>> 
>>>>>>> Thanks for clarification Flavio. Does this mean during the leader
>>>>>> election,
>>>>>>> both reads and writes are not supported?. Do we start a separate
>>>>>>> thread/jira of changing this behavior?.
>>>>>>> 
>>>>>>> thanks,
>>>>>>> Kishore G
>>>>>>> 
>>>>>>> 
>>>>>>> On Tue, Jul 16, 2013 at 9:16 AM, Flavio Junqueira
>>>>>> <fpjunqueira@yahoo.com
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> The disk state should be the authoritative state of a server, so
>>>> if I
>>>>>>>> remember correctly, we load the database as a way of validating
>>>> the
>>>>>> disk
>>>>>>>> state. I don't claim that this is strictly necessary, but if we
>>>> are
>>>>>> to
>>>>>>>> change it, then I would need to think this through.
>>>>>>>> 
>>>>>>>> About leader election, if a leader loses support from a quorum of
>>>>>>>> followers,
>>>>>>>> then it will drop leadership. Any event that causes a follower to
>>>>>> stop
>>>>>>>> receiving messages from the leader or the follower to disconnect
>>>> from
>>>>>> the
>>>>>>>> leader will make it stop supporting the current leader.
>>>>>>>> 
>>>>>>>> -Flavio
>>>>>>>> 
>>>>>>>> -----Original Message-----
>>>>>>>> From: Sergey Maslyakov [mailto:evolvah@gmail.com]
>>>>>>>> Sent: 16 July 2013 16:16
>>>>>>>> To: user@zookeeper.apache.org
>>>>>>>> Subject: Re: Maximum size of a snapshot
>>>>>>>> 
>>>>>>>> And another extension on top of Kishore's question: do the
>>>>>> reelections
>>>>>>>> happen if the previously elected leader remains in the cluster? In
>>>>>> other
>>>>>>>> words, what events can trigger re-election and the corresponding
>>>>>> temporary
>>>>>>>> degradation of the service provided by Zookeeper?
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Thank you,
>>>>>>>> /Sergey
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Tue, Jul 16, 2013 at 2:21 AM, kishore g <g....@gmail.com> wrote:
>>>>>>>> 
>>>>>>>>> Regarding #2: is it really true that during leader election every
>>>>>>>>> machine reloads snapshot data from disk? Is there any reason why this
>>>>>>>>> is needed, unless a server really needs to truncate or undo
>>>>>>>>> conflicting transactions it has already applied?
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Mon, Jul 15, 2013 at 9:50 PM, Thawan Kooburat <th...@fb.com> wrote:
>>>>>>>>> 
>>>>>>>>>> Max snapshot size:
>>>>>>>>>> 
>>>>>>>>>> Here is my take on these issues; others feel free to add or correct.
>>>>>>>>>> 
>>>>>>>>>> 1. Depends on how much RAM your machine has. The snapshot should be
>>>>>>>>>> smaller than the available RAM, since everything is loaded into
>>>>>>>>>> memory.
>>>>>>>>>> 2. Depends on the availability guarantee that the client needs.
>>>>>>>>>> If there is a leader election, every machine needs to reload the
>>>>>>>>>> data from disk, so the quorum will be down for at least as long as
>>>>>>>>>> the snapshot loading time. The session timeout on the client side
>>>>>>>>>> should be longer than the expected downtime during leader election.
>>>>>>>>>> 
>>>>>>>>>> --
>>>>>>>>>> Thawan Kooburat
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On 7/15/13 8:46 PM, "Sergey Maslyakov" <ev...@gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>>> I have a couple of sizing questions for the users and developers.
>>>>>>>>>>> I hope you don't mind answering them.
>>>>>>>>>>> 
>>>>>>>>>>> What is the guideline for the maximum reasonable size of a DataTree
>>>>>>>>>>> that a single ZK server can manage? If a ZK server writes out a
>>>>>>>>>>> snapshot of about 1GB in size, is it pushed beyond the limits or is
>>>>>>>>>>> it still manageable? If so, where is the critical threshold when ZK
>>>>>>>>>>> is really being abused?
>>>>>>>>>>> 
>>>>>>>>>>> Similarly, how can I estimate the propagation delay of a change
>>>>>>>>>>> across an ensemble of three ZK servers?
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Thank you,
>>>>>>>>>>> /Sergey

Re: Recovery time (was: Maximum size of a snapshot)

Posted by Flavio Junqueira <fp...@yahoo.com>.

I also need to mention ZOOKEEPER-1549 in the context of point (2) below. That's a blocker for 3.5.0.

-Flavio

On Jul 17, 2013, at 12:30 PM, Flavio Junqueira <fp...@yahoo.com> wrote:

> Moving the discussion to dev but keeping user on CC.
> 
> Let's step back. We started the latest discussion in this thread because Kishore is concerned about recovery time. There are a number of improvements we have been looking at for the next release; let me go over my current understanding of the main points that add to the recovery time:
> 
> 1- Before we even start leader election, each server loads state from disk to determine its last zxid. The last zxid is used in the election;
> 2- Once the leader is elected, it loads state from disk and takes a snapshot. Loading the database again is unnecessary (ZOOKEEPER-1642) and the snapshot adds latency. In fact, it is not even correct to have it there (ZOOKEEPER-1558).
> 3- A follower takes a snapshot before acknowledging the NEWLEADER message, so the leader has to wait until a quorum of followers finish their snapshots.
> 
> The proposal I've heard here is to touch (1). For now, I'd rather keep (1) as is and focus on fixing (2). We might be able to do something about (3) and I'm actually not sure if there has been a discussion about it or not.
> 
> -Flavio
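
As a rough illustration of how the three points quoted above add up, here is a minimal Python sketch. It is not ZooKeeper code: the RecoveryEstimate class, its field names, and the sample timings are all hypothetical. It also makes simplifying assumptions that the thread does not state: the phases run strictly one after another, the election exchange itself takes negligible time, and the leader counts toward the quorum, so in an ensemble of n servers only floor(n/2) follower acknowledgements are needed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RecoveryEstimate:
    """Hypothetical per-phase timings (seconds) for a single recovery."""

    load_before_election: float      # (1) load state from disk to find the last zxid
    leader_reload: float             # (2) leader loads the database again (ZOOKEEPER-1642)
    leader_snapshot: float           # (2) leader takes a snapshot (ZOOKEEPER-1558)
    follower_snapshots: List[float]  # (3) per-follower snapshot time before the NEWLEADER ack

    def quorum_wait(self, ensemble_size: int) -> float:
        """Time until enough followers have acknowledged NEWLEADER.

        Assumption: the leader counts toward the quorum, so it needs
        ensemble_size // 2 follower acks and waits for the slowest of
        the fastest such followers.
        """
        needed = ensemble_size // 2
        if needed == 0:
            return 0.0
        if len(self.follower_snapshots) < needed:
            raise ValueError("not enough follower timings to form a quorum")
        return sorted(self.follower_snapshots)[needed - 1]

    def total(self, ensemble_size: int) -> float:
        """Sum of the phases, assuming they do not overlap."""
        return (
            self.load_before_election
            + self.leader_reload
            + self.leader_snapshot
            + self.quorum_wait(ensemble_size)
        )


# Made-up numbers for a 5-server ensemble (1 leader, 4 followers).
current = RecoveryEstimate(30.0, 30.0, 20.0, [18.0, 22.0, 25.0, 40.0])

# Same ensemble with the leader's extra reload and snapshot in (2) removed.
after_fixing_2 = RecoveryEstimate(30.0, 0.0, 0.0, [18.0, 22.0, 25.0, 40.0])

print(current.total(5), after_fixing_2.total(5))
```

With these made-up numbers the estimate drops from 102 to 52 seconds once the work in (2) is removed, while (1) and (3) are left untouched. The real gain depends on measured load and snapshot times for a given data size and disk.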