Posted to dev@zookeeper.apache.org by "Thawan Kooburat (JIRA)" <ji...@apache.org> on 2012/06/01 01:55:23 UTC

[jira] [Commented] (ZOOKEEPER-1465) Cluster availability following new leader election takes a long time with large datasets - is correlated to dataset size

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13287042#comment-13287042 ] 

Thawan Kooburat commented on ZOOKEEPER-1465:
--------------------------------------------

Here is my current understanding of the problem. Essentially, the leader uses 3 lists to sync with a follower: committedLog, toBeApplied and outstandingProposals.

What I believe the existing logic intends to do is: if the follower missed the committedLog (either committedLog is empty or peerLastZxid is out of its range), then we send a snapshot followed by the transactions in toBeApplied and outstandingProposals.

However, the problem we see here is that the follower did not miss the committedLog, yet the logic below fails to set up the DIFF packet correctly because peerLastZxid == maxCommittedLog (the last element in committedLog).
{noformat}
                        for (Proposal propose: proposals) {
                            // skip the proposals the peer already has
                            if (propose.packet.getZxid() <= peerLastZxid) {
                                prevProposalZxid = propose.packet.getZxid();
                                continue;
                            } else {
                                // If we are sending the first packet, figure out whether to trunc
                                // in case the follower has some proposals that the leader doesn't
                                if (firstPacket) {
                                    firstPacket = false;
                                    // Does the peer have some proposals that the leader hasn't seen yet
                                    if (prevProposalZxid < peerLastZxid) {
                                        // send a trunc message before sending the diff
                                        packetToSend = Leader.TRUNC;
                                        LOG.info("Sending TRUNC");
                                        zxidToSend = prevProposalZxid;
                                        updates = zxidToSend;
                                    }
                                    else {
                                        // Just send the diff
                                        packetToSend = Leader.DIFF;
                                        LOG.info("Sending diff");
                                        zxidToSend = maxCommittedLog;
                                    }

                                }
                                queuePacket(propose.packet);
                                QuorumPacket qcommit = new QuorumPacket(Leader.COMMIT, propose.packet.getZxid(),
                                        null, null);
                                queuePacket(qcommit);
                            }
                        }
{noformat} 
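
To make the failure mode concrete, here is a minimal self-contained model of the loop above (my own sketch, not code from the repository; the SNAP/DIFF constants and the SNAP default are restated locally, mirroring how packetToSend is, as far as I can tell, initialized to Leader.SNAP before this loop in LearnerHandler). When peerLastZxid == maxCommittedLog, every proposal hits the continue branch, firstPacket is never consumed, and the leader falls through to sending a full snapshot:
{noformat}
public class DiffFallthroughSketch {
    static final int SNAP = 0, DIFF = 1;   // stand-ins for Leader.SNAP / Leader.DIFF

    public static void main(String[] args) {
        long[] committedLogZxids = {0x100, 0x101, 0x102}; // proposal zxids in the leader's committedLog
        long maxCommittedLog = 0x102;
        long peerLastZxid = maxCommittedLog;              // follower already has everything

        int packetToSend = SNAP;                          // default before the loop
        boolean firstPacket = true;

        for (long zxid : committedLogZxids) {
            if (zxid <= peerLastZxid) {
                continue;                                 // every proposal is skipped...
            }
            // ...so this branch, the only place packetToSend becomes DIFF, never runs
            if (firstPacket) {
                firstPacket = false;
                packetToSend = DIFF;
            }
        }

        // Prints SNAP: an up-to-date follower still gets a full snapshot
        System.out.println(packetToSend == SNAP ? "SNAP" : "DIFF");
    }
}
{noformat}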

I believe that if we fix the problem at its root, we can remove the code below completely, since the decision to send a DIFF packet should be based only on whether the follower missed the committedLog. The startForwarding() method should handle in-flight transactions correctly.
{noformat}
                if (peerLastZxid == leaderLastZxid) {
                    LOG.debug("Leader and follower are in sync, sending empty diff. zxid=0x{}",
                            Long.toHexString(leaderLastZxid));
                    // We are in sync so we'll do an empty diff
                    packetToSend = Leader.DIFF;
                    zxidToSend = leaderLastZxid;
                }
{noformat}

The proposed fix minimizes code changes, but should we instead fix the problem at its root?
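
If we did fix it at the root, the sync decision could look roughly like the fragment below. This is only a sketch to illustrate the idea, not a tested patch: minCommittedLog/maxCommittedLog are assumed to be the bounds of the leader's committedLog, and the snapshot path and the queuing of proposals are left as they are today.
{noformat}
                if (peerLastZxid >= minCommittedLog && peerLastZxid <= maxCommittedLog) {
                    // Follower did not miss the committedLog: always send a DIFF.
                    // peerLastZxid == maxCommittedLog simply degenerates into an
                    // empty diff, so the special empty-diff case above goes away.
                    packetToSend = Leader.DIFF;
                    zxidToSend = maxCommittedLog;
                    // queue only the proposals/commits with zxid > peerLastZxid
                } else {
                    // Follower missed the committedLog (or it is empty): fall back
                    // to a snapshot; startForwarding() takes care of the in-flight
                    // toBeApplied / outstandingProposals transactions either way.
                    packetToSend = Leader.SNAP;
                }
{noformat}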
                
> Cluster availability following new leader election takes a long time with large datasets - is correlated to dataset size
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1465
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1465
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: leaderElection
>    Affects Versions: 3.4.3
>            Reporter: Alex Gvozdenovic
>            Assignee: Camille Fournier
>            Priority: Critical
>             Fix For: 3.4.4
>
>         Attachments: ZOOKEEPER-1465.patch
>
>
> When a new leader of the cluster is elected, it takes a long time for the cluster to become available if the dataset is large.
> Test Data
> ----------
> 650 MB snapshot size
> 20k nodes of varied size 
> 3 member cluster 
> On 3.4.x branch (http://svn.apache.org/repos/asf/zookeeper/branches/branch-3.4?r=1244779)
> ------------------------------------------------------------------------------------------
> Takes 3-4 minutes to bring up a cluster from cold 
> Takes 40-50 secs to recover from a leader failure 
> Takes 10 secs for a new follower to join the cluster 
> Using the 3.3.5 release on the same hardware with the same dataset
> -----------------------------------------------------------------
> Takes 10-20 secs to bring up a cluster from cold 
> Takes 10 secs to recover from a leader failure 
> Takes 10 secs for a new follower to join the cluster 
> I can see from the logs in 3.4.x that once a new leader is elected, it pushes a new snapshot to each of the followers, who need to save it before they ack the leader, who can then mark the cluster as available. 
> The kit being used is a low-spec VM, so the absolute times are not relevant per se; what matters is that a snapshot is always sent even though there is no difference between the persisted state on each peer.
> No data is being added to the cluster while the peers are being restarted.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira