Posted to issues@mesos.apache.org by "Shuai Lin (JIRA)" <ji...@apache.org> on 2015/12/05 16:27:11 UTC

[jira] [Commented] (MESOS-1806) Substituting etcd for Zookeeper

    [ https://issues.apache.org/jira/browse/MESOS-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15043346#comment-15043346 ] 

Shuai Lin commented on MESOS-1806:
----------------------------------

There are two situations to handle:

* Etcd servers won't accept requests from clients during the leader election phase. So when there is a leader re-election among the etcd servers, the request from the current master to renew the timestamp of the {{v2/keys/mesos}} node fails, and the current code immediately retries with the next server, which refuses the request as well. The master then exits because every server has failed its request. The same happens with slaves: the detector fails after requests to all the etcd servers are refused. To solve this, we can add logic to wait for a while before trying the next server.

* If the current master somehow fails to update the {{v2/keys/mesos}} node in time, that node expires, the detector detects this, and the master commits suicide due to loss of leadership. This is correct behavior, but the current TTL is rather small: only 5 seconds, and the current master is set to update the node at 80% of the TTL, i.e. at 4 seconds. That leaves only 1 second of margin, so the chance of hitting this problem is not that low, e.g. if there is a transient network problem. This can be mitigated by increasing the TTL to 10 seconds, or by letting the current master try to update the node at 60% of the TTL.
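
To make the two proposals concrete, here is a minimal Python sketch. It is not the Mesos code; the names {{renew_with_wait}}, {{refresh_slack}}, {{send}}, {{passes}} and {{wait_seconds}} are all hypothetical and used only for illustration. The first function pauses between failed attempts instead of moving to the next server immediately, and the second computes how much time is left between the scheduled refresh and the TTL expiry.

```python
import time
from typing import Callable, Sequence


def renew_with_wait(
    servers: Sequence[str],
    send: Callable[[str], bool],
    passes: int = 3,
    wait_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try each server in turn, pausing after every refused request.

    `send` is a hypothetical callback that performs the renew request
    against one server and returns True on success. The server list is
    walked up to `passes` times before giving up.
    """
    attempts = [server for _ in range(passes) for server in servers]
    for index, server in enumerate(attempts):
        if send(server):
            return True
        # Wait before the next attempt so a short etcd leader election
        # has a chance to finish; no wait after the final attempt.
        if index < len(attempts) - 1:
            sleep(wait_seconds)
    return False


def refresh_slack(ttl_seconds: float, refresh_percent: int) -> float:
    """Seconds between the scheduled refresh and the expiry of the node."""
    return ttl_seconds * (100 - refresh_percent) / 100
```

With the current settings, {{refresh_slack(5, 80)}} gives 1 second of margin. Both proposed changes, {{refresh_slack(10, 80)}} and {{refresh_slack(5, 60)}}, give 2 seconds. Note that the total time spent waiting in {{renew_with_wait}} should stay below that margin, otherwise the node expires while the master is still retrying.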

[~cmaloney] [~benjaminhindman] What do you think?

> Substituting etcd for Zookeeper
> -------------------------------
>
>                 Key: MESOS-1806
>                 URL: https://issues.apache.org/jira/browse/MESOS-1806
>             Project: Mesos
>          Issue Type: Task
>          Components: leader election
>            Reporter: Ed Ropple
>            Assignee: Shuai Lin
>            Priority: Minor
>
> <adam_mesos>	 eropple: Could you also file a new JIRA for Mesos to drop ZK in favor of etcd or ReplicatedLog? Would love to get some momentum going on that one.
> --
> Consider it filed. =)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)