Posted to dev@zookeeper.apache.org by "Vishal K (JIRA)" <ji...@apache.org> on 2011/03/25 23:13:06 UTC

[jira] [Commented] (ZOOKEEPER-107) Allow dynamic changes to server cluster membership

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011450#comment-13011450 ] 

Vishal K commented on ZOOKEEPER-107:
------------------------------------

Hi Alex,

Thanks for the updated design. Some comments below. Some of these might be
repetition of what we discussed earlier. I am adding them here so that we
have it on the jira.

{quote}
D. The purpose of connecting is only to begin state transfer, so only
connecting to the leader matters; there is no need for members(M) to detect
failure of new servers that are trying to connect. If the leader fails, then
the new leader can be responsible for establishing connections to members(M')
as part of completing the reconfiguration (see the first step in Section 4.2).
{quote}

What will M' do if leader(M) fails? We could run into a scenario where
members(M') don't know about leader(M) and leader(M) may not know about M'. So
it might be easier to ask M' to connect to M. Essentially, we need a way to
figure out who the current leader is. We could potentially use the virtual IP
based approach that I mentioned earlier: let the leader hold a "cluster IP",
and anyone (including ZK clients) who wants cluster/leader information can
send a query to that IP.
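
Just to make the cluster-IP idea concrete, here is a rough sketch of what such
a query could look like. Nothing like this exists in ZooKeeper today; the port,
wire format, and class name are all made up for illustration:

{code:java}
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

// Hypothetical client for the "cluster IP" idea: whichever server currently
// holds the virtual IP answers a simple "who is the leader?" query.
public class ClusterInfoClient {
    private final String clusterIp;
    private final int port;   // illustrative port, not a real ZooKeeper port

    public ClusterInfoClient(String clusterIp, int port) {
        this.clusterIp = clusterIp;
        this.port = port;
    }

    /** Returns "host:port" of the current leader, as reported by whichever
     *  server currently holds the cluster IP. */
    public String queryLeader() throws Exception {
        try (Socket s = new Socket(clusterIp, port);
             PrintWriter out = new PrintWriter(s.getOutputStream(), true);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(s.getInputStream()))) {
            out.println("whois leader");   // made-up wire format
            return in.readLine();          // e.g. "zk3.example.com:2888"
        }
    }
}
{code}

The point is simply that one well-known address always answers "who is the
leader right now", regardless of whether M or M' is currently active.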

{quote}
the old leader can appoint one of members(M') to be the new leader (the new leader
should be chosen s.t. it is up-to-date with M). This way leader election in M'
is only required if the designated leader fails.
{quote}

I am not in favor of this approach. May I suggest that, from an implementation
perspective, we leave this for the future work/improvements section? It
certainly is a good optimization, but I don't see it giving us significant
benefits. IMHO, it seems a bit counterintuitive for node x to designate node y
as the leader in a distributed protocol that is designed to elect leaders. I
find it easier to think "elect a leader if you need one". I am also concerned
that this is going to make the implementation more complex (to test as well)
and prone to strange corner cases.

{quote}
E. This is very related to ZOOKEEPER-22 (how does a client know whether its
operation has been executed if the leader failed) ?
{quote}

Agreed. However, the client library attempts to reconnect to the servers.
Also, the application can verify whether the transaction was applied the next
time it reconnects. We may have to do something similar as well.
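
For example, something along these lines is already possible with the client
API today (the znode path and data below are just placeholders); whether we
need more than this for reconfig is the open question:

{code:java}
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class CreateWithVerify {
    /**
     * If create() fails with ConnectionLoss, the op may or may not have been
     * applied on the server side. Once the client library has reconnected,
     * the application can check whether the znode exists before retrying.
     */
    public static void createWithVerify(ZooKeeper zk, String path, byte[] data)
            throws KeeperException, InterruptedException {
        try {
            zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        } catch (KeeperException.ConnectionLossException e) {
            // The connection dropped (e.g. the leader failed); after the
            // client reconnects, verify whether the first attempt succeeded.
            if (zk.exists(path, false) == null) {
                zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            }
        }
    }
}
{code}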

{quote}
reconfiguration attempt and completed it or no further leader will (if we fix
ZOOKEEPER-335)
{quote}

Why should we not fix ZOOKEEPER-335? Won't log divergence cause an
unpredictable reconfiguration outcome? For example, the log of A has M' while
the log of B has M''; depending on who wins the election, the configuration
will end up being either M' or M''.

{quote}
I think a reconfig operation should be sent to the leader only after a
majority of M' has connected to it.
{quote}

Sure, though (as you pointed out) this is not strictly necessary for the
correctness of the protocol.

{quote}
Here are some issues that I see with your proposal:

1. Suppose that leader(M) fails while sending phase-3 messages and all phase-3 messages arrive to members(M) but none arrive to members(M'). Now the system is stuck: M' - (M n M') cannot serve client requests whereas members(M) are no longer in the system.
This is why I think before garbage-collecting M we must make sure that M' is operational by itself.
{quote}

Could this be fixed by sending the message to M' first, and then sending it to
M only after receiving acks from a majority of M'?
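
In other words, something along these lines; every type and method name here
is invented, it is only meant to show the reversed ordering:

{code:java}
import java.util.Set;

/**
 * Illustration only: reverse the ordering of the final (phase-3) message so
 * that M' is told first and M is released only after a majority of M' acks.
 * Every type and method here is a placeholder; this is not ZooKeeper code.
 */
abstract class ReversedPhase3Sketch {
    abstract void sendPhase3(Set<String> servers);
    abstract void awaitAcksFromQuorum(Set<String> servers);

    void finishReconfig(Set<String> membersMPrime, Set<String> membersM) {
        sendPhase3(membersMPrime);           // 1. tell the new ensemble first
        awaitAcksFromQuorum(membersMPrime);  // 2. wait for a majority of M'
        sendPhase3(membersM);                // 3. only then retire M
    }
}
{code}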

{quote}
2. In my proposal members(M') can accept operations as soon as they get the phase-2 message, which is the purpose of phase 2.
I don't see why phase-2 is needed in your proposal. I suggest to stick with 3 phases s.t. phase-2 message allows M' to be independent.

3. Because of the Global Primary Order property of ZAB, we can't allow a commit message to arrive from leader(M) to members(M') after they are allowed to process
messages independently in M'. Your solution handles this because members(M') close connections to leader(M) as soon as they're allowed to process messages independently, but then it's not clear how the outstanding ops get committed. In order to commit these operations nevertheless, I suggest that this be the task of leader(M').

Here's what I suggest:

    * after leader(M) sends phase-1 message to members(M) it starts sending every received new op to both members(M) and members(M') (treating them as followers). Commit messages must not be sent to members(M') by leader(M) and sending such messages to members(M) does not hurt but is not necessary. In any case, clients are not acked regarding these ops.

    * As soon as enough acks are received for the phase-1 message from members(M), leader(M) sends a phase-2 message to members(M') and from this moment doesn't accept new ops (unless leader(M) is also in M' and therefore acts also as leader(M')).

    * When members(M') receive the phase-2 message from leader(M) they send an ack both to leader(M) and to leader(M') - in practice a member can send an ack to leader(M) first, then disconnect from it, then connect to leader(M') and send an ack to it, but this doesn't matter for correctness. When leader(M') receives this ack from a quorum of members(M') (as well as the phase-2 message from leader(M)) it can commit all preceding ops in M'.
{quote}

There are subtle differences between the protocol that I suggested and your
proposal. I am not fully convinced that the changes are necessary; I might be
missing something here. I will get in touch with you offline for further
clarification.
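
Just so we have a common reference point, this is how I read the hand-off you
describe above. It is a schematic sketch only; every type and method is a
placeholder, not ZooKeeper code:

{code:java}
import java.util.Set;

/** Schematic reading of the proposed hand-off, for reference only. */
abstract class ReconfigHandoffSketch {
    abstract void replicateOp(Object op, Set<String> servers); // send the op, no COMMIT
    abstract void awaitPhase1AcksFromQuorumOfM();
    abstract void sendPhase2(Set<String> membersMPrime);
    abstract void stopAcceptingNewOps();

    // leader(M), after sending the phase-1 message to members(M): replicate
    // every new op to both ensembles, never send COMMITs to members(M'),
    // and do not ack clients for these ops.
    void onNewOp(Object op, Set<String> membersM, Set<String> membersMPrime) {
        replicateOp(op, membersM);
        replicateOp(op, membersMPrime);
    }

    // leader(M): once a quorum of members(M) has acked phase 1, send the
    // phase-2 message to members(M') and stop accepting new ops.
    void completePhase1(Set<String> membersMPrime) {
        awaitPhase1AcksFromQuorumOfM();
        sendPhase2(membersMPrime);
        stopAcceptingNewOps();
    }

    // leader(M'): once a quorum of members(M') has acked phase 2 (and the
    // phase-2 message itself has arrived), commit all preceding ops in M'.
    abstract void commitPrecedingOpsInMPrime();
}
{code}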

There are a few problems in the revised protocol posted on the wiki:

1. leader(M) does not wait for confirmation of ongoing transactions from a
majority of M' (during phase 1). How do you guarantee that, once M' starts
leader election, all the transactions that are committed in M are also
committed in M'? A majority of M' might be lagging behind, and one of them
might end up becoming leader(M').
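
The kind of check I have in mind before M' is allowed to run leader election
is roughly the following (plain Java for illustration, not ZooKeeper code;
zxids are just longs here):

{code:java}
import java.util.Map;

/**
 * Before M' runs leader election, a quorum of M' should have acknowledged
 * everything that has been committed in M.
 */
class QuorumCaughtUpCheck {
    /** @param ackedZxid highest zxid acknowledged by each member of M'
     *  @param lastCommittedZxid highest zxid committed in M */
    static boolean quorumOfMPrimeCaughtUp(Map<String, Long> ackedZxid,
                                          long lastCommittedZxid) {
        long caughtUp = ackedZxid.values().stream()
                .filter(z -> z >= lastCommittedZxid)
                .count();
        return caughtUp > ackedZxid.size() / 2;   // strict majority of M'
    }
}
{code}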

2. Why is Step 8 ("Stop processing new operations, return failure for any
further ops received") necessary? The protocol that I suggested does not reject
any transactions from the clients. In the most common case of reconfiguration,
only a small subset (very likely just one) of the peers will be added or
removed, so the ZK client library can reconnect to one of the other ZK servers
using the same code that it uses today. As a result, the application will not
even notice that a reconfiguration has happened (other than potentially
receiving notifications about the new configuration).
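
For example, an application whose connect string lists several servers (the
host names below are examples) just sees a Disconnected/SyncConnected pair
when "its" server goes away; the failover happens entirely inside the client
library:

{code:java}
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class FailoverExample {
    public static void main(String[] args) throws Exception {
        // The client library picks another server from the connect string on
        // its own; the application only observes session state changes
        // (typically Disconnected followed by SyncConnected).
        ZooKeeper zk = new ZooKeeper(
                "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181",
                30000,
                new Watcher() {
                    public void process(WatchedEvent event) {
                        System.out.println("Session state: " + event.getState());
                    }
                });
        // ... use zk as usual; no reconfiguration-specific handling needed here.
    }
}
{code}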

{quote}
H. I agree - it's overkill. We can make sure that a quorum of M' is connected
when the reconfiguration starts.
{quote}

What should we tell the administrator to do if a majority of M' fails during
reconfiguration? During normal operation, if a majority of nodes fail, the
admin has the option of copying the DB from one of the live nodes to the rest
of the nodes and getting the cluster going immediately. There is a risk of
losing some transactions, but there is also a good chance that one of the live
nodes has a reasonably up-to-date copy of the data tree. However, if a majority
of M' fails during reconfiguration, the cluster is unrecoverable even if a
majority of M is online. Are we going to assume that the admin needs to take a
backup before doing the reconfig?

Thanks.
-Vishal

> Allow dynamic changes to server cluster membership
> --------------------------------------------------
>
>                 Key: ZOOKEEPER-107
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-107
>             Project: ZooKeeper
>          Issue Type: Improvement
>          Components: server
>            Reporter: Patrick Hunt
>            Assignee: Henry Robinson
>         Attachments: SimpleAddition.rtf
>
>
> Currently cluster membership is statically defined, adding/removing hosts to/from the server cluster dynamically needs to be supported.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira