You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@mesos.apache.org by "Liqiang Lin (JIRA)" <ji...@apache.org> on 2016/04/12 11:31:25 UTC

[jira] [Created] (MESOS-5188) docker executor thinks task is failed when docker container was stopped

Liqiang Lin created MESOS-5188:
----------------------------------

             Summary: docker executor thinks task is failed when docker container was stopped
                 Key: MESOS-5188
                 URL: https://issues.apache.org/jira/browse/MESOS-5188
             Project: Mesos
          Issue Type: Bug
          Components: docker
    Affects Versions: 0.28.0
            Reporter: Liqiang Lin
             Fix For: 0.29.0


Test cases:
1. Launch a task with Swarm (on Mesos).

{code}
# docker -H 192.168.56.110:54375 run -d --cpu-shares 1 ubuntu sleep 300
{code}

2. Then stop the docker container.

{code}
# docker -H 192.168.56.110:54375 ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
b4813ba3ed4d        ubuntu              "sleep 300"         9 seconds ago       Up 8 seconds                            mesos1/mesos-2cd5576e-6260-4262-a62c-b0dc45c86c45-S1.1595e79b-aef2-44b6-a313-ad4ff8626958

# docker -H 192.168.56.110:54375 stop b4813ba3ed4d
b4813ba3ed4d
{code}

3. Found the task is failed. See Mesos slave log,

{code}
I0407 09:10:57.606552 32307 slave.cpp:1508] Got assigned task 99ee7dc74861 for framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000
I0407 09:10:57.608230 32307 slave.cpp:1627] Launching task 99ee7dc74861 for framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000
I0407 09:10:57.609979 32307 paths.cpp:528] Trying to chown '/var/lib/mesos/slaves/2cd5576e-6260-4262-a62c-b0dc45c86c45-S0/frameworks/5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000/executors/99ee7dc74861/runs/250a169f-7aba-474d-a4f5-cd24ecf0e7d9' to user 'root'
I0407 09:10:57.615881 32307 slave.cpp:5586] Launching executor 99ee7dc74861 of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000 with resources cpus(*):0.1; mem(*):32 in work directory '/var/lib/mesos/slaves/2cd5576e-6260-4262-a62c-b0dc45c86c45-S0/frameworks/5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000/executors/99ee7dc74861/runs/250a169f-7aba-474d-a4f5-cd24ecf0e7d9'
I0407 09:12:18.458449 32307 slave.cpp:1845] Queuing task '99ee7dc74861' for executor '99ee7dc74861' of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000
I0407 09:12:18.459092 32307 slave.cpp:3711] No pings from master received within 75secs
I0407 09:12:18.460212 32307 slave.cpp:4593] Current disk usage 56.53%. Max allowed age: 2.342613645432778days
I0407 09:12:18.463484 32307 slave.cpp:928] Re-detecting master
I0407 09:12:18.463969 32307 slave.cpp:975] Detecting new master
I0407 09:12:18.464501 32307 slave.cpp:939] New master detected at master@192.168.56.110:5050
I0407 09:12:18.464848 32307 slave.cpp:964] No credentials provided. Attempting to register without authentication
I0407 09:12:18.465237 32307 slave.cpp:975] Detecting new master
I0407 09:12:18.463611 32312 status_update_manager.cpp:174] Pausing sending status updates
I0407 09:12:18.465744 32312 status_update_manager.cpp:174] Pausing sending status updates
I0407 09:12:18.472323 32313 docker.cpp:1011] Starting container '250a169f-7aba-474d-a4f5-cd24ecf0e7d9' for task '99ee7dc74861' (and executor '99ee7dc74861') of framework '5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000'
I0407 09:12:18.588739 32313 slave.cpp:1218] Re-registered with master master@192.168.56.110:5050
I0407 09:12:18.588927 32313 slave.cpp:1254] Forwarding total oversubscribed resources
I0407 09:12:18.589320 32313 slave.cpp:2395] Updating framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000 pid to scheduler(1)@192.168.56.110:53375
I0407 09:12:18.592079 32308 status_update_manager.cpp:181] Resuming sending status updates
I0407 09:12:18.592842 32313 slave.cpp:2534] Updated checkpointed resources from  to
I0407 09:12:18.592793 32308 status_update_manager.cpp:181] Resuming sending status updates
I0407 09:12:20.582041 32307 slave.cpp:2836] Got registration for executor '99ee7dc74861' of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000 from executor(1)@192.168.56.110:40725
I0407 09:12:20.584446 32307 docker.cpp:1308] Ignoring updating container '250a169f-7aba-474d-a4f5-cd24ecf0e7d9' with resources passed to update is identical to existing resources
I0407 09:12:20.585093 32307 slave.cpp:2010] Sending queued task '99ee7dc74861' to executor '99ee7dc74861' of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000 at executor(1)@192.168.56.110:40725
I0407 09:12:21.307077 32312 slave.cpp:3195] Handling status update TASK_RUNNING (UUID: a7098650-cbf6-4445-8216-b5f658d2f5f4) for task 99ee7dc74861 of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000 from executor(1)@192.168.56.110:40725
I0407 09:12:21.308820 32308 status_update_manager.cpp:320] Received status update TASK_RUNNING (UUID: a7098650-cbf6-4445-8216-b5f658d2f5f4) for task 99ee7dc74861 of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000
I0407 09:12:21.310058 32308 slave.cpp:3593] Forwarding the update TASK_RUNNING (UUID: a7098650-cbf6-4445-8216-b5f658d2f5f4) for task 99ee7dc74861 of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000 to master@192.168.56.110:5050
I0407 09:12:21.310744 32308 slave.cpp:3503] Sending acknowledgement for status update TASK_RUNNING (UUID: a7098650-cbf6-4445-8216-b5f658d2f5f4) for task 99ee7dc74861 of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000 to executor(1)@192.168.56.110:40725
I0407 09:12:21.315765 32307 status_update_manager.cpp:392] Received status update acknowledgement (UUID: a7098650-cbf6-4445-8216-b5f658d2f5f4) for task 99ee7dc74861 of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000

I0407 09:16:13.943831 32306 slave.cpp:3195] Handling status update TASK_FAILED (UUID: 552e07b9-f7b8-4442-8e5c-8e17b948f6dc) for task 99ee7dc74861 of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000 from executor(1)@192.168.56.110:40725
I0407 09:16:14.063880 32307 status_update_manager.cpp:320] Received status update TASK_FAILED (UUID: 552e07b9-f7b8-4442-8e5c-8e17b948f6dc) for task 99ee7dc74861 of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000
I0407 09:16:14.064471 32312 slave.cpp:3593] Forwarding the update TASK_FAILED (UUID: 552e07b9-f7b8-4442-8e5c-8e17b948f6dc) for task 99ee7dc74861 of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000 to master@192.168.56.110:5050
I0407 09:16:14.065014 32312 slave.cpp:3503] Sending acknowledgement for status update TASK_FAILED (UUID: 552e07b9-f7b8-4442-8e5c-8e17b948f6dc) for task 99ee7dc74861 of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000 to executor(1)@192.168.56.110:40725
I0407 09:16:14.071398 32307 status_update_manager.cpp:392] Received status update acknowledgement (UUID: 552e07b9-f7b8-4442-8e5c-8e17b948f6dc) for task 99ee7dc74861 of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000
I0407 09:16:14.950019 32311 slave.cpp:3721] executor(1)@192.168.56.110:40725 exited
I0407 09:16:14.993948 32308 docker.cpp:1997] Executor for container '250a169f-7aba-474d-a4f5-cd24ecf0e7d9' has exited
I0407 09:16:14.994104 32308 docker.cpp:1717] Destroying container '250a169f-7aba-474d-a4f5-cd24ecf0e7d9'
I0407 09:16:14.994875 32308 docker.cpp:1845] Running docker stop on container '250a169f-7aba-474d-a4f5-cd24ecf0e7d9'
I0407 09:16:14.996547 32308 slave.cpp:4097] Executor '99ee7dc74861' of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000 exited with status 0
I0407 09:16:14.996695 32308 slave.cpp:4201] Cleaning up executor '99ee7dc74861' of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000 at executor(1)@192.168.56.110:40725
I0407 09:16:14.997028 32312 gc.cpp:55] Scheduling '/var/lib/mesos/slaves/2cd5576e-6260-4262-a62c-b0dc45c86c45-S0/frameworks/5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000/executors/99ee7dc74861/runs/250a169f-7aba-474d-a4f5-cd24ecf0e7d9' for gc 6.99998846192days in the future
I0407 09:16:14.997289 32308 slave.cpp:4289] Cleaning up framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000
I0407 09:16:14.997691 32307 status_update_manager.cpp:282] Closing status update streams for framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000
I0407 09:16:14.997697 32312 gc.cpp:55] Scheduling '/var/lib/mesos/slaves/2cd5576e-6260-4262-a62c-b0dc45c86c45-S0/frameworks/5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000/executors/99ee7dc74861' for gc 6.99998845958222days in the future
I0407 09:16:14.998423 32312 gc.cpp:55] Scheduling '/var/lib/mesos/slaves/2cd5576e-6260-4262-a62c-b0dc45c86c45-S0/frameworks/5b84aad8-dd60-40b3-84c2-93be6b7aa81c-0000' for gc 6.99998845141333days in the future
I0407 09:16:18.470573 32308 slave.cpp:4593] Current disk usage 56.56%. Max allowed age: 2.341048334320162days
{code}

Why docker executor thinks the task is failed when stop the docker container? The docker container still exists. Should it be relay on state that docker container is removed?

Docker executor code
{code}
  void reaped(
      ExecutorDriver* _driver,
      const Future<Nothing>& run)
  {
    // Wait for docker->stop to finish, and best effort wait for the
    // inspect future to complete with a timeout.
    stop.onAny(defer(self(), [=](const Future<Nothing>&) {
      inspect
        .after(DOCKER_INSPECT_TIMEOUT, [=](const Future<Nothing>&) {
          inspect.discard();
          return inspect;
        })
        .onAny(defer(self(), [=](const Future<Nothing>&) {
          CHECK_SOME(driver);
          TaskState state;
          string message;
          if (!stop.isReady()) {
            state = TASK_FAILED;
            message = "Unable to stop docker container, error: " +
                      (stop.isFailed() ? stop.failure() : "future discarded");
          } else if (killed) {
            state = TASK_KILLED;
          } else if (!run.isReady()) {
            state = TASK_FAILED;
            message = "Docker container run error: " +
                      (run.isFailed() ?
                       run.failure() : "future discarded");
{code}

Another problem is, Mesos thinks the task is failed, and do some clean up for executor. But the orphan docker container is not removed. I still can start it manually, Mesos can not trace the resource usage of this orphan container.

{code}
# docker -H 192.168.56.110:54375 start b4813ba3ed4d
b4813ba3ed4d
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)