Posted to dev@zookeeper.apache.org by Alexander Shraer <sh...@yahoo-inc.com> on 2011/03/10 08:19:25 UTC

dynamic membership (reconfiguration) in Zookeeper

Hi,



I'm working on adding support for dynamic membership changes in ZooKeeper clusters (ZOOKEEPER-107). I've added a proposed algorithm and some discussion here:



https://cwiki.apache.org/confluence/display/ZOOKEEPER/ClusterMembership



Any comments and suggestions are very welcome.



Best Regards,

Alex

Re: dynamic membership (reconfiguration) in Zookeeper

Posted by Mahadev Konar <ma...@apache.org>.
Vishal/Alex,
 Would it be good to have these comments/design on the JIRA? It's
probably better to keep design discussions/comments on the JIRA.

thanks
mahadev


Re: dynamic membership (reconfiguration) in Zookeeper

Posted by Vishal Kher <vi...@gmail.com>.
Hi Alex,

Great! Thanks for the design. I have a few suggestions/questions below.

A. Bootstrapping the cluster
    The algorithm assumes that a leader exists prior to doing a reconfig.
So it looks like the reconfig() API is not intended to be used for
bootstrapping the cluster. How about we define an API for initializing the
cluster?

Instead of relying on the current method of setting the initial
configuration in the config file, we should probably also define a join(M)
(or init(M)) API. When a peer receives this request, it will try to
connect to the specified peers. During bootstrap, peers will connect to
each other (as they do now) and elect a leader.
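
To make this concrete, here is a rough sketch of what such a bootstrap API
could look like. This is only an illustration; the interface name, method
signature, and behavior are assumptions from this proposal, not anything
that exists in ZooKeeper today:

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.util.Set;

    // Hypothetical bootstrap API -- all names here are illustrative only.
    public interface ClusterBootstrap {
        // Initialize/join an ensemble made up of the given peers.  The peer
        // connects to each address in initialMembers and then takes part in
        // leader election, as peers already do at startup today.
        void join(Set<InetSocketAddress> initialMembers) throws IOException;
    }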

B. Storing membership in a znode and updating clients (a tangential topic
to this design).
    Earlier, ZOOKEEPER-107 proposed using a URI-based approach for clients
to fetch the server list. I am not opposed to the URI-based approach;
however, it shouldn't be our only approach. The URI-based approach requires
extra resources (e.g., a fault-tolerant web service or shared storage for a
file). In certain environments it may not be feasible to have such a
resource. Instead, can we use a mechanism similar to ZooKeeper watches for
this? Store the membership information in a znode and let the ZooKeeper
server inform the clients of the changed configuration.
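
As an illustration of the znode-plus-watch idea, here is a minimal
client-side sketch using the existing ZooKeeper Java API. The /config path
and the assumption that its data holds the server list are inventions of
this sketch, not something the design specifies:

    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    // Client-side sketch: watch a (hypothetical) /config znode and re-read
    // the server list whenever the ensemble changes it.
    public class ConfigWatcher implements Watcher {
        private final ZooKeeper zk;

        public ConfigWatcher(ZooKeeper zk) {
            this.zk = zk;
        }

        public byte[] readConfig() throws KeeperException, InterruptedException {
            // Passing "this" re-registers the watch for the next change.
            return zk.getData("/config", this, null);
        }

        public void process(WatchedEvent event) {
            if (event.getType() == Event.EventType.NodeDataChanged) {
                try {
                    byte[] serverList = readConfig();
                    // ... update the client's connection string from serverList ...
                } catch (Exception e) {
                    // back off and retry as appropriate
                }
            }
        }
    }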

C. Locating the most recent config on servers
    The URI-based approach will be more helpful to servers. For example, if
A was down when M={A, B, C} was changed to M'={A, D, E}, then when A comes
back online it won't be able to locate the most recent config. In this
case, A can query the URI. A second approach is to have the leader
periodically send the membership (at least to nodes that are down).

D. "Send a message to all members(M’) instructing them to connect to
leader(M)."
    leader(M) can potentially change after sending this message. Should
this be "Send a messages to all members(M') to connect to members(M)? See
point G. below.

Also, in step 7, why do you send leader(M') along with
<activate-config>message>?

E. "3. Submit a reconfig(M’) operation to leader(M)"
    What if leader(M) fails after receiving the request, but before
informing any other peer? How will the administrative client know whether to
retry or not?  Should s retry if leader fails and should the client API
retry
if s fails?
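
One possible way for the administrative client to decide is to compare the
configuration version before and after the failed attempt. This is a sketch
only; the AdminClient interface, reconfig(), and getConfigVersion() are
hypothetical names from this discussion, not existing ZooKeeper API:

    import java.io.IOException;
    import java.util.Set;

    // Hypothetical admin-side interface; nothing here exists in ZooKeeper yet.
    interface AdminClient {
        long getConfigVersion() throws IOException;          // version(M)
        void reconfig(Set<String> newMembers) throws IOException;
    }

    class ReconfigRetry {
        // Retry only while we can tell that the previous attempt did not
        // already commit the new configuration.
        static boolean reconfigWithRetry(AdminClient admin, Set<String> mPrime)
                throws IOException, InterruptedException {
            long versionBefore = admin.getConfigVersion();
            for (int attempt = 0; attempt < 3; attempt++) {
                try {
                    admin.reconfig(mPrime);
                    return true;
                } catch (IOException e) {
                    if (admin.getConfigVersion() != versionBefore) {
                        return true;   // the failed attempt actually took effect
                    }
                    Thread.sleep(1000);   // a new leader may be elected; back off
                }
            }
            return false;
        }
    }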

F. Regarding 3.a and 3.b.

The algorithm mentions:

"3. Otherwise:
   a. Designate the most up-to-date server in connected quorum of M' to be
leader(M)
   b. Stop processing new operations, return failure for any further ops
received"

Why do we need to change the leader if leader(M) is not in M'? How about we
let the current leader perform the reconfig, and at the end of phase 3
(while processing retire) it gives up leadership. This will cause a new
leader election, and one of the peers in M' will become the leader.
Similarly, after the second phase, members(M') will stop following
leader(M) if leader(M) is not in M'. I think this will be simpler.

G. Online vs. offline

    In your "Online vs offline" section, you mentioned that the offline
strategy is preferred. I think we can do the reconfiguration online.
I pictured M' as modified OBSERVERS until the reconfiguration is complete.

a. Let the new members(M') join the cluster as OBSERVERS. Based on the
current implementation, M' will essentially sync with the leader after
becoming OBSERVERS, and it will be easier to leverage the RequestProcessor
pipeline for reconfig.

b. Once a majority of M' join the leader, the leader executes phase 1 as you
described.

c. After phase 1, the leader changes the transaction logic a bit. Every
transaction after this point (including reconfig-start) will be sent to
both M and M'. The leader then waits for quorum(M) and quorum(M') to ack
the transaction, so the members of M' are not pure OBSERVERS as we think
of them today. However, only a member of M can become a leader until
reconfig succeeds. Also, M' - (M ∩ M') cannot serve client requests until
reconfiguration is complete. By doing a transaction on both M and M' and
waiting for the quorum of each set to ack, we keep transferring the state
to both configurations (see the sketch after item d. below).

d. After receiving acks for phase 2, the leader sends a
<switch-to-new-config> transaction to M and M' (instead of sending retire
just to M).

After receiving this message, M' will close connections with (and reject
connections from) members not in M'. Members that are supposed to leave
the cluster will shut down their QuorumPeer. If leader(M) is not in M',
then a new leader(M') will be elected.
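
A minimal sketch of the dual-quorum commit rule from item c. (assuming
servers are identified by numeric ids; this is only an illustration, not
existing code):

    import java.util.Set;

    // A proposal commits only when a strict majority of M *and* a strict
    // majority of M' have acked it.
    class DualQuorumCheck {
        static boolean committed(Set<Long> acks, Set<Long> m, Set<Long> mPrime) {
            return isQuorum(acks, m) && isQuorum(acks, mPrime);
        }

        static boolean isQuorum(Set<Long> acks, Set<Long> members) {
            int count = 0;
            for (long id : members) {
                if (acks.contains(id)) {
                    count++;
                }
            }
            return count > members.size() / 2;   // strict majority
        }
    }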

Let me know what you think about this.

H. What happens if a majority of M' keeps failing during reconfig?

M={A, B}, M'={A, C, D, E}. What if D and E fail?

Failure of a majority of M' will permanently stall the reconfig. While
this is unlikely, I think ZooKeeper should handle it automatically. After
a few retries, we should abort the reconfig. Otherwise, we could disrupt a
running cluster and never be able to recover without manual intervention.
If the majority fails after phase 1, this would mean sending an
<abort-reconfig, version(M), M'> message to M.

Of course, one can argue: what if a majority of M' fails after phase 3? So
I'm not sure whether this is overkill, but I feel we should handle this
case.

I. "Old reconfiguration request"

a. We should use ZAB
b. A side note - I think ZOOKEEPER-335 needs to be resolved for reconfig to
work. This bug causes logs to diverge if ZK leader fails before sending
PROPOSALs to followers (see
http://www.mail-archive.com/zookeeper-user@hadoop.apache.org/msg00403.html).

Because of this bug we could run into the following scenario:
- A peer B that was leader when reconfig(M') was issued will have reconfig
M'
  in its transaction log.
- A peer C that became leader after B's failure, can have reconfig(M'') in
its
  log.
- Now, if B and C fail (say both reboot), then the outcome of reconfig will
  depend on which node takes leadership. If B becomes a leader, then out
come
is M'. If C becomes a leader, then outcome is M''

J. Policies / sanity checks
- Should we allow removal of a peer in a 3-node configuration? How about
in a 2-node configuration?
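
For example (just a sketch of one possible policy, not something the
design prescribes), we could reject a reconfig that leaves the ensemble
unable to tolerate even a single failure:

    // Possible sanity check before accepting reconfig(M'): a strict-majority
    // ensemble needs at least 3 members to survive the loss of one server.
    class ReconfigPolicy {
        static boolean toleratesOneFailure(int newEnsembleSize) {
            int majority = newEnsembleSize / 2 + 1;
            return newEnsembleSize - majority >= 1;   // false for 1 or 2 members
        }
    }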

Can you please add section numbers to the design paper? It will be easier to
refer to the text by numbers.

Thanks again for working on this. We are about to release the first
version of our product using ZooKeeper, which uses static configuration.
Our next version depends heavily on dynamic membership. We have resources
allocated at work that can dedicate time to implementing this feature for
our next release, and we are interested in contributing to it. We will be
happy to chip in from our side to help with the implementation.

Regards,
-Vishal
