Posted to issues@mesos.apache.org by "Travis Hegner (JIRA)" <ji...@apache.org> on 2016/02/02 21:30:39 UTC

[jira] [Created] (MESOS-4581) mesos-docker-executor has a race condition causing docker tasks to be stuck in staging when trying to launch

Travis Hegner created MESOS-4581:
------------------------------------

             Summary: mesos-docker-executor has a race condition causing docker tasks to be stuck in staging when trying to launch
                 Key: MESOS-4581
                 URL: https://issues.apache.org/jira/browse/MESOS-4581
             Project: Mesos
          Issue Type: Bug
          Components: containerization, docker
    Affects Versions: 0.26.0, 0.27.0
         Environment: Ubuntu 14.04, Docker 1.9.1, Marathon 0.15.0
            Reporter: Travis Hegner


We are still working to understand the root cause of this issue, but here is what we know so far:

Symptoms:

Launching Docker containers from Marathon in Mesos results in the Marathon app being stuck in a "staged" status, and the Mesos task being stuck in a "staging" status, until a timeout, at which point the task is launched on another host with roughly a 50/50 chance of either working or getting stuck in staging again.

We have a lot of custom containers, custom networking configs, and custom docker run parameters, but we can't seem to narrow this down to any one particular aspect of our environment. This happens randomly per Marathon app while it's attempting to start or restart an instance, whether or not the app's config has changed. I can't seem to find anyone else having a similar issue, which leads me to believe that it is a combination of aspects of our environment that triggers this race condition.

Deeper analysis:
The mesos-docker-executor fires the "docker run ..." command in a future. It simultaneously (for all intents and purposes) fires a "docker inspect" against the container it is trying to start at that moment. When we see this bug, the container starts normally, but the docker inspect command hangs forever. It never retries, and never times out.
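
To make the ordering concrete, here is a rough standalone C++ sketch of what the executor effectively does (this is not the executor code itself; the container name "race-demo" and the "alpine" image are made up for illustration): fire an attached "docker run" asynchronously and issue a "docker inspect" against the same container name almost immediately.

#include <cstdlib>
#include <future>
#include <iostream>
#include <string>

int main() {
  const std::string name = "race-demo";  // hypothetical container name

  // Fire the equivalent of the executor's "docker run ..." future. The
  // attached run blocks until the container's "sleep 30" exits.
  std::future<int> run = std::async(std::launch::async, [&]() {
    return std::system(("docker run --name " + name + " alpine sleep 30").c_str());
  });

  // Almost simultaneously, inspect the container that is still being created.
  // Depending on how far "docker run" has progressed (create/attach/start),
  // this call fails, succeeds, or -- in the behaviour we are seeing -- hangs.
  int inspected = std::system(("docker inspect " + name + " > /dev/null").c_str());
  std::cout << "first inspect status: " << inspected << std::endl;

  run.wait();
  return 0;
}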

When the task launches successfully, the docker inspect command fails once with a non-zero exit code, retries 500ms later, and succeeds, flagging the task as "RUNNING" in both Mesos and Marathon simultaneously.
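
For comparison, the successful path behaves like the following retry loop. This is a plain C++/std::thread approximation, not the libprocess code the executor actually uses; the 500ms interval stands in for DOCKER_INSPECT_DELAY and "race-demo" is again a made-up container name.

#include <chrono>
#include <cstdlib>
#include <iostream>
#include <string>
#include <thread>

// Keep retrying "docker inspect" at a fixed interval until it succeeds.
int inspectWithRetry(const std::string& name, std::chrono::milliseconds interval) {
  while (true) {
    int status = std::system(("docker inspect " + name + " > /dev/null").c_str());
    if (status == 0) {
      // Container is visible; the executor would mark the task RUNNING here.
      return status;
    }
    // Mirrors the 500ms retry observed in the good case.
    std::this_thread::sleep_for(interval);
  }
}

int main() {
  std::cout << inspectWithRetry("race-demo", std::chrono::milliseconds(500)) << std::endl;
  return 0;
}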

If you watch the Docker daemon log, you'll notice that a "docker run" from the command line actually triggers three Docker API calls in succession: "create", "attach", and "start", in that order. It has been fairly consistent that when the bug is triggered, the log shows the "create" from the run command, then the GET for the inspect command, and only then the "attach" and "start". When the launch works successfully, we see the GET first (failing, of course, because the container doesn't exist yet), followed by the "create", "attach", and "start".
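
You can reproduce the decomposed sequence by hand; the sketch below (container name "decompose-demo" is made up, and the "attach" step is left out since it only streams stdio) shows that inspect succeeds as soon as "create" has returned, even before "start".

#include <cstdlib>
#include <iostream>
#include <string>

int main() {
  const std::string name = "decompose-demo";  // hypothetical container name

  // "docker run" is roughly "create" + "attach" + "start"; attach is omitted
  // here because it only streams stdio and needs no container state.
  int created = std::system(("docker create --name " + name + " alpine sleep 30").c_str());
  std::cout << "create status: " << created << std::endl;

  // Once "create" has returned, the container exists, so inspect succeeds
  // even though the container has not been started yet.
  int inspected = std::system(("docker inspect " + name + " > /dev/null").c_str());
  std::cout << "inspect after create: " << (inspected == 0 ? "ok" : "failed") << std::endl;

  int started = std::system(("docker start " + name).c_str());
  std::cout << "start status: " << started << std::endl;
  return 0;
}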

Rudimentary Solution:
We have written a very basic patch which uses ".after()" to delay that initial inspect call on the container until at least one DOCKER_INSPECT_DELAY (500ms) after the docker run command is issued. This has eliminated the bug as far as we can tell.
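
As a rough illustration of the workaround (this standalone sketch uses a plain sleep rather than the libprocess ".after()" call the actual patch uses; the container name and image are again made up):

#include <chrono>
#include <cstdlib>
#include <future>
#include <iostream>
#include <string>
#include <thread>

int main() {
  const std::string name = "delay-demo";  // hypothetical container name
  const auto kInspectDelay = std::chrono::milliseconds(500);  // stands in for DOCKER_INSPECT_DELAY

  // Kick off the attached "docker run ..." asynchronously, as the executor does.
  std::future<int> run = std::async(std::launch::async, [&]() {
    return std::system(("docker run --name " + name + " alpine sleep 30").c_str());
  });

  // The workaround: wait one inspect interval before the *first* inspect, so
  // the daemon has in practice finished "create"/"attach" before the GET
  // for the inspect arrives.
  std::this_thread::sleep_for(kInspectDelay);

  // Then poll as usual until the container is visible.
  while (std::system(("docker inspect " + name + " > /dev/null").c_str()) != 0) {
    std::this_thread::sleep_for(kInspectDelay);
  }
  std::cout << "container visible; the executor would transition the task to RUNNING" << std::endl;

  run.wait();  // the attached run returns when "sleep 30" exits
  return 0;
}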

I am not sure if this one-time initial delay is the most appropriate fix, or if it would be better to add a timeout to the inspect call in the mesos-docker-executor, which would destroy the current inspect thread and start a new one. The timeout/retry may be appropriate whether or not the initial delay exists.
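
A sketch of what that alternative could look like, again standalone rather than the executor's actual libprocess code. Note that abandoning an attempt does not kill the underlying hung "docker inspect" process, which a real fix inside the executor would also have to deal with.

#include <atomic>
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <memory>
#include <string>
#include <thread>

// Run one "docker inspect" attempt on a detached thread and wait up to
// `timeout` for it to finish. If it does not finish in time, give up on this
// attempt and let the caller start a new one.
bool inspectOnce(const std::string& name, std::chrono::milliseconds timeout) {
  auto result = std::make_shared<std::atomic<int>>(-1);  // -1 = still running

  std::thread([name, result]() {
    int status = std::system(("docker inspect " + name + " > /dev/null").c_str());
    result->store(status == 0 ? 1 : 0);
  }).detach();

  auto deadline = std::chrono::steady_clock::now() + timeout;
  while (std::chrono::steady_clock::now() < deadline) {
    int state = result->load();
    if (state != -1) {
      return state == 1;
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
  }
  return false;  // Timed out: treat as a failed attempt and retry.
}

int main() {
  // Hypothetical container name; retry every 500ms, each attempt bounded to 2s.
  while (!inspectOnce("race-demo", std::chrono::seconds(2))) {
    std::this_thread::sleep_for(std::chrono::milliseconds(500));
  }
  std::cout << "container is visible to docker inspect" << std::endl;
  return 0;
}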

In Summary:
It appears that mesos-docker-executor does not have a race condition itself, but it seems to be triggering one in Docker. Since we haven't found this issue reported anywhere else with any substance, we understand that it is likely related to our environment. Our custom network driver for Docker does some cluster-wide coordination, and may introduce just enough delay between the "create" and "attach" calls to cause us to witness this bug on about 50-60% of attempted container starts.

The inspectDelay patch that I've written for this issue is located in my inspectDelay branch at:
https://github.com/travishegner/mesos/tree/inspectDelay

I am happy to supply this patch as a pull request, or put it through the review board, if the maintainers feel it is an appropriate fix, or at least a reasonable stop-gap measure until a better fix can be written.


