Posted to issues@mesos.apache.org by "Andrei Budnik (JIRA)" <ji...@apache.org> on 2018/07/06 18:28:00 UTC

[jira] [Created] (MESOS-9054) Scheduler driver hangs on syncing its state with ZooKeeper during Master detection

Andrei Budnik created MESOS-9054:
------------------------------------

             Summary: Scheduler driver hangs on syncing its state with ZooKeeper during Master detection
                 Key: MESOS-9054
                 URL: https://issues.apache.org/jira/browse/MESOS-9054
             Project: Mesos
          Issue Type: Bug
          Components: scheduler driver
            Reporter: Andrei Budnik


A framework (namely, Marathon) uses the scheduler driver (V0 API) to connect to the Mesos master, but never receives the `registered()` callback. The hanging framework prints the following messages:
{code:java}
2018-06-26 05:30:23: I0626 05:30:23.899340 14465 sched.cpp:232] Version: 1.4.0
2018-06-26 05:30:24: I0626 05:30:24.022102 14523 group.cpp:341] Group process (zookeeper-group(1)@10.136.5.234:15101) connected to ZooKeeper
2018-06-26 05:30:24: I0626 05:30:24.022148 14523 group.cpp:831] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
2018-06-26 05:30:24: I0626 05:30:24.022166 14523 group.cpp:419] Trying to create path '/mesos' in ZooKeeper{code}
When the framework calls `scheduler_driver->start()`, it creates and spawns `ZooKeeperMasterDetectorProcess`, which creates a detector of type `zookeeper::Group`. After the detector connects to ZK, it [calls|https://github.com/apache/mesos/blob/77921200a69564f966917bfc8e07a3d1e3ada196/include/mesos/zookeeper/watcher.hpp#L49-L52] `zookeeper::Group::connected()`. Then, `Group` tries to `[sync()|https://github.com/apache/mesos/blob/77921200a69564f966917bfc8e07a3d1e3ada196/src/zookeeper/group.cpp#L374]`, which calls `[create()|https://github.com/apache/mesos/blob/77921200a69564f966917bfc8e07a3d1e3ada196/src/zookeeper/group.cpp#L851]`. At this point we call `[zk->create()|https://github.com/apache/mesos/blob/77921200a69564f966917bfc8e07a3d1e3ada196/src/zookeeper/group.cpp#L421]`, which is a *synchronous* call, see [`dispatch(...).get()`|https://github.com/apache/mesos/blob/77921200a69564f966917bfc8e07a3d1e3ada196/src/zookeeper/zookeeper.cpp#L602-L610].

Since the ZK library or ZK itself might hang, the scheduler driver can get stuck in this state, and the framework will then never receive any callbacks.
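
To illustrate the failure mode, here is a minimal sketch using only the C++ standard library. It is not Mesos or libprocess code: `std::promise`/`std::future` stand in for the dispatched ZooKeeper call, and `getWithTimeout` is a hypothetical helper. An unbounded `get()` on a future that is never satisfied blocks the caller forever, whereas a bounded wait lets the caller notice the hang and react.

```cpp
#include <chrono>
#include <future>

// Hypothetical helper (not part of Mesos or libprocess): waits for `result`
// for at most `timeout`. Returns true and stores the value in `*value` if the
// result became ready in time; returns false if the wait timed out.
bool getWithTimeout(
    std::future<int>& result,
    std::chrono::milliseconds timeout,
    int* value)
{
  if (result.wait_for(timeout) == std::future_status::ready) {
    *value = result.get(); // Does not block: the result is already ready.
    return true;
  }

  // The call has not completed. An unconditional `result.get()` here would
  // block indefinitely, which mirrors the hang described above.
  return false;
}

// Simulates a backend that never answers: the promise is never satisfied,
// so only the bounded wait allows this function to return.
bool hungCallTimesOut()
{
  std::promise<int> neverSatisfied; // Assumed stand-in for a hung ZK call.
  std::future<int> result = neverSatisfied.get_future();

  int code = -1;
  return !getWithTimeout(result, std::chrono::milliseconds(50), &code);
}
```

Whether a timeout, a retry, or an asynchronous continuation is the right fix for the driver is a separate design question; the sketch only shows why an unbounded blocking wait hides the hang from the framework.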



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)