Posted to issues@mesos.apache.org by "Vinod Kone (JIRA)" <ji...@apache.org> on 2016/05/14 20:45:12 UTC

[jira] [Commented] (MESOS-5380) Killing a queued task can cause the corresponding command executor to never terminate.

    [ https://issues.apache.org/jira/browse/MESOS-5380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15283667#comment-15283667 ] 

Vinod Kone commented on MESOS-5380:
-----------------------------------

Committed phase 1. Phase 2 will handle the agent restart case.

commit 326f4062e16619dcbb7e5c2c98062fbe2957fcf9
Author: Vinod Kone <vi...@gmail.com>
Date:   Fri May 13 16:06:49 2016 -0700

    Fixed agent to properly handle killTask of unregistered executor.
    
    The agent now shuts down the executor during registration if it does
    not have any queued tasks (e.g., the framework sent a killTask before
    registration).

    Note that if the executor doesn't register at all, it will be cleaned
    up anyway once the registration timeout elapses.
    
    Also, note that this doesn't handle the case where the agent restarts
    after processing the killTask() but before cleaning up the executor.
    
    Review: https://reviews.apache.org/r/47381
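
For context, the phase 1 check amounts to something like the sketch below. This is not the actual Mesos source; the struct and function names are hypothetical stand-ins for the agent's executor bookkeeping.

{code}
#include <iostream>
#include <string>
#include <vector>

// Hypothetical stand-in for the agent's per-executor state; the real
// Mesos struct is far richer.
struct Executor
{
  enum State { REGISTERING, RUNNING, TERMINATING } state = REGISTERING;
  std::string id;
  std::vector<std::string> queuedTasks;
  std::vector<std::string> launchedTasks;
};

// Sketch of the fix: on registration, shut the executor down if a
// killTask() already drained everything it was supposed to run.
void onExecutorRegistered(Executor& executor)
{
  if (executor.queuedTasks.empty() && executor.launchedTasks.empty()) {
    std::cout << "Shutting down executor '" << executor.id
              << "': no tasks left at registration time" << std::endl;
    executor.state = Executor::TERMINATING;
    // The real agent destroys the container here
    // (containerizer->destroy(containerId)).
    return;
  }

  executor.state = Executor::RUNNING;
  // ... otherwise the queued tasks are delivered as usual ...
}

int main()
{
  Executor executor;
  executor.id = "test-executor";

  // Simulate a killTask arriving before registration: the task is
  // queued and then removed while the executor is still REGISTERING.
  executor.queuedTasks.push_back("task-1");
  executor.queuedTasks.pop_back();

  onExecutorRegistered(executor);  // prints the shutdown message
  return 0;
}
{code}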


> Killing a queued task can cause the corresponding command executor to never terminate.
> --------------------------------------------------------------------------------------
>
>                 Key: MESOS-5380
>                 URL: https://issues.apache.org/jira/browse/MESOS-5380
>             Project: Mesos
>          Issue Type: Bug
>          Components: slave
>    Affects Versions: 0.28.0, 0.28.1
>            Reporter: Jie Yu
>            Assignee: Vinod Kone
>            Priority: Blocker
>              Labels: mesosphere
>             Fix For: 0.29.0, 0.28.2
>
>
> We observed this in our testing environment. Sequence of events:
> 1) A command task is queued since the executor has not registered yet.
> 2) The framework issues a killTask.
> 3) Since the executor is in the REGISTERING state, the agent calls `statusUpdate(TASK_KILLED, UPID())`.
> 4) `statusUpdate` now calls `containerizer->status()` before calling `executor->terminateTask(status.task_id(), status)`, which removes the queued task. (Introduced in this patch: https://reviews.apache.org/r/43258.)
> 5) Since the above is asynchronous, it's possible that the task is still in `queuedTasks` when we check whether we need to kill the unregistered executor in `killTask`:
> {code}
>       // TODO(jieyu): Here, we kill the executor if it no longer has
>       // any task to run and has not yet registered. This is a
>       // workaround for those single task executors that do not have a
>       // proper self terminating logic when they haven't received the
>       // task within a timeout.
>       if (executor->queuedTasks.empty()) {
>         CHECK(executor->launchedTasks.empty())
>             << " Unregistered executor '" << executor->id
>             << "' has launched tasks";
>         LOG(WARNING) << "Killing the unregistered executor " << *executor
>                      << " because it has no tasks";
>         executor->state = Executor::TERMINATING;
>         containerizer->destroy(executor->containerId);
>       }
> {code}
> 6) Consequently, the executor will never be terminated by Mesos. The sketch below makes this ordering concrete.
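>
> To make the ordering in steps 3-5 concrete, here is a minimal standalone sketch. It is not Mesos code: a toy deferred-callback queue stands in for libprocess dispatch, and all names are illustrative.
> {code}
> #include <functional>
> #include <iostream>
> #include <queue>
> #include <string>
> #include <vector>
>
> int main()
> {
>   // Toy single-threaded deferred-work queue standing in for
>   // libprocess dispatch (illustrative only).
>   std::queue<std::function<void()>> pending;
>
>   std::vector<std::string> queuedTasks = {"task-1"};
>   bool executorDestroyed = false;
>
>   // Steps 3/4: statusUpdate(TASK_KILLED) first asks the containerizer
>   // for its status; only the *continuation* removes the queued task,
>   // so the removal is deferred.
>   pending.push([&] {
>     queuedTasks.clear();  // terminateTask() effectively runs here
>     std::cout << "queued task removed (async continuation)" << std::endl;
>   });
>
>   // Step 5: killTask's empty-queue check runs before the continuation
>   // above, so queuedTasks is still non-empty and the destroy is skipped.
>   if (queuedTasks.empty()) {
>     executorDestroyed = true;  // would be containerizer->destroy(...)
>   } else {
>     std::cout << "executor kept: still has a queued task" << std::endl;
>   }
>
>   // Drain the deferred work; the kill decision is not revisited.
>   while (!pending.empty()) {
>     pending.front()();
>     pending.pop();
>   }
>
>   // Step 6: the executor is never destroyed.
>   std::cout << std::boolalpha << "executor destroyed? "
>             << executorDestroyed << std::endl;  // prints: false
>   return 0;
> }
> {code}
> Running the sketch prints "executor kept: still has a queued task" before the continuation fires, which mirrors why the agent never reaches `containerizer->destroy()` for the executor.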
> Attaching the relevant agent log:
> {noformat}
> May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:13.640527  1342 slave.cpp:1361] Got assigned task mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6 for framework a3ad8418-cb77-4705-b353-4b514ceca52c-0000
> May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:13.641034  1342 slave.cpp:1480] Launching task mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6 for framework a3ad8418-cb77-4705-b353-4b514ceca52c-0000
> May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:13.641440  1342 paths.cpp:528] Trying to chown '/var/lib/mesos/slave/slaves/a3ad8418-cb77-4705-b353-4b514ceca52c-S0/frameworks/a3ad8418-cb77-4705-b353-4b514ceca52c-0000/executors/mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6/runs/24762d43-2134-475e-b724-caa72110497a' to user 'root'
> May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:13.644664  1342 slave.cpp:5389] Launching executor mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6 of framework a3ad8418-cb77-4705-b353-4b514ceca52c-0000 with resources cpus(*):0.1; mem(*):32 in work directory '/var/lib/mesos/slave/slaves/a3ad8418-cb77-4705-b353-4b514ceca52c-S0/frameworks/a3ad8418-cb77-4705-b353-4b514ceca52c-0000/executors/mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6/runs/24762d43-2134-475e-b724-caa72110497a'
> May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:13.645195  1342 slave.cpp:1698] Queuing task 'mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6' for executor 'mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6' of framework a3ad8418-cb77-4705-b353-4b514ceca52c-0000
> May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:13.645491  1338 containerizer.cpp:671] Starting container '24762d43-2134-475e-b724-caa72110497a' for executor 'mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6' of framework 'a3ad8418-cb77-4705-b353-4b514ceca52c-0000'
> May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:13.647897  1345 cpushare.cpp:389] Updated 'cpu.shares' to 1126 (cpus 1.1) for container 24762d43-2134-475e-b724-caa72110497a
> May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:13.648619  1345 cpushare.cpp:411] Updated 'cpu.cfs_period_us' to 100ms and 'cpu.cfs_quota_us' to 110ms (cpus 1.1) for container 24762d43-2134-475e-b724-caa72110497a
> May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:13.650180  1341 mem.cpp:602] Started listening for OOM events for container 24762d43-2134-475e-b724-caa72110497a
> May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:13.650718  1341 mem.cpp:722] Started listening on low memory pressure events for container 24762d43-2134-475e-b724-caa72110497a
> May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:13.651147  1341 mem.cpp:722] Started listening on medium memory pressure events for container 24762d43-2134-475e-b724-caa72110497a
> May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:13.651599  1341 mem.cpp:722] Started listening on critical memory pressure events for container 24762d43-2134-475e-b724-caa72110497a
> May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:13.652015  1341 mem.cpp:353] Updated 'memory.soft_limit_in_bytes' to 160MB for container 24762d43-2134-475e-b724-caa72110497a
> May 13 15:36:13 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:13.652719  1341 mem.cpp:388] Updated 'memory.limit_in_bytes' to 160MB for container 24762d43-2134-475e-b724-caa72110497a
> May 13 15:36:25 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:25.508930  1342 slave.cpp:1891] Asked to kill task mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6 of framework a3ad8418-cb77-4705-b353-4b514ceca52c-0000
> May 13 15:36:25 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:25.509063  1342 slave.cpp:3048] Handling status update TASK_KILLED (UUID: f9d15955-6c9a-4a73-98c3-97c0128510ba) for task mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6 of framework a3ad8418-cb77-4705-b353-4b514ceca52c-0000 from @0.0.0.0:0
> May 13 15:36:25 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:25.509702  1340 disk.cpp:169] Updating the disk resources for container 24762d43-2134-475e-b724-caa72110497a to cpus(*):0.1; mem(*):32
> May 13 15:36:25 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:25.510298  1343 mem.cpp:353] Updated 'memory.soft_limit_in_bytes' to 32MB for container 24762d43-2134-475e-b724-caa72110497a
> May 13 15:36:25 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:25.510349  1341 cpushare.cpp:389] Updated 'cpu.shares' to 102 (cpus 0.1) for container 24762d43-2134-475e-b724-caa72110497a
> May 13 15:36:25 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:25.511102  1343 mem.cpp:388] Updated 'memory.limit_in_bytes' to 32MB for container 24762d43-2134-475e-b724-caa72110497a
> May 13 15:36:25 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:25.511495  1341 cpushare.cpp:411] Updated 'cpu.cfs_period_us' to 100ms and 'cpu.cfs_quota_us' to 10ms (cpus 0.1) for container 24762d43-2134-475e-b724-caa72110497a
> May 13 15:36:25 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:25.511715  1341 status_update_manager.cpp:320] Received status update TASK_KILLED (UUID: f9d15955-6c9a-4a73-98c3-97c0128510ba) for task mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6 of framework a3ad8418-cb77-4705-b353-4b514ceca52c-0000
> May 13 15:36:25 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:25.512032  1341 status_update_manager.cpp:824] Checkpointing UPDATE for status update TASK_KILLED (UUID: f9d15955-6c9a-4a73-98c3-97c0128510ba) for task mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6 of framework a3ad8418-cb77-4705-b353-4b514ceca52c-0000
> May 13 15:36:25 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:25.513849  1343 slave.cpp:3446] Forwarding the update TASK_KILLED (UUID: f9d15955-6c9a-4a73-98c3-97c0128510ba) for task mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6 of framework a3ad8418-cb77-4705-b353-4b514ceca52c-0000 to master@10.0.5.79:5050
> May 13 15:36:25 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:25.528929  1344 status_update_manager.cpp:392] Received status update acknowledgement (UUID: f9d15955-6c9a-4a73-98c3-97c0128510ba) for task mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6 of framework a3ad8418-cb77-4705-b353-4b514ceca52c-0000
> May 13 15:36:25 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:25.529002  1344 status_update_manager.cpp:824] Checkpointing ACK for status update TASK_KILLED (UUID: f9d15955-6c9a-4a73-98c3-97c0128510ba) for task mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6 of framework a3ad8418-cb77-4705-b353-4b514ceca52c-0000
> May 13 15:36:28 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:28.199105  1345 isolator.cpp:469] Mounting docker volume mount point '//var/lib/rexray/volumes/jdef-test-125/data' to '/var/lib/mesos/slave/slaves/a3ad8418-cb77-4705-b353-4b514ceca52c-S0/frameworks/a3ad8418-cb77-4705-b353-4b514ceca52c-0000/executors/mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6/runs/24762d43-2134-475e-b724-caa72110497a/data' for container 24762d43-2134-475e-b724-caa72110497a
> May 13 15:36:28 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:28.207062  1338 containerizer.cpp:1184] Checkpointing executor's forked pid 5810 to '/var/lib/mesos/slave/meta/slaves/a3ad8418-cb77-4705-b353-4b514ceca52c-S0/frameworks/a3ad8418-cb77-4705-b353-4b514ceca52c-0000/executors/mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6/runs/24762d43-2134-475e-b724-caa72110497a/pids/forked.pid'
> May 13 15:36:28 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:28.832330  1338 slave.cpp:2689] Got registration for executor 'mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6' of framework a3ad8418-cb77-4705-b353-4b514ceca52c-0000 from executor(1)@10.0.2.74:46154
> May 13 15:36:28 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:28.833149  1345 disk.cpp:169] Updating the disk resources for container 24762d43-2134-475e-b724-caa72110497a to cpus(*):0.1; mem(*):32
> May 13 15:36:28 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:28.833804  1342 mem.cpp:353] Updated 'memory.soft_limit_in_bytes' to 32MB for container 24762d43-2134-475e-b724-caa72110497a
> May 13 15:36:28 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:28.833871  1340 cpushare.cpp:389] Updated 'cpu.shares' to 102 (cpus 0.1) for container 24762d43-2134-475e-b724-caa72110497a
> May 13 15:36:28 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[1304]: I0513 15:36:28.835160  1340 cpushare.cpp:411] Updated 'cpu.cfs_period_us' to 100ms and 'cpu.cfs_quota_us' to 10ms (cpus 0.1) for container 24762d43-2134-475e-b724-caa72110497a
> May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: 5804 'mesos-logrotate-logger --help=false --log_filename=/var/lib/mesos/slave/slaves/a3ad8418-cb77-4705-b353-4b514ceca52c-S0/frameworks/a3ad8418-cb77-4705-b353-4b514ceca52c-0000/executors/mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6/runs/24762d43-2134-475e-b724-caa72110497a/stdout --logrotate_options=rotate 9 --logrotate_path=logrotate --max_size=2MB '
> May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: 5809 'mesos-logrotate-logger --help=false --log_filename=/var/lib/mesos/slave/slaves/a3ad8418-cb77-4705-b353-4b514ceca52c-S0/frameworks/a3ad8418-cb77-4705-b353-4b514ceca52c-0000/executors/mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6/runs/24762d43-2134-475e-b724-caa72110497a/stderr --logrotate_options=rotate 9 --logrotate_path=logrotate --max_size=2MB '
> May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: 5804 'mesos-logrotate-logger --help=false --log_filename=/var/lib/mesos/slave/slaves/a3ad8418-cb77-4705-b353-4b514ceca52c-S0/frameworks/a3ad8418-cb77-4705-b353-4b514ceca52c-0000/executors/mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6/runs/24762d43-2134-475e-b724-caa72110497a/stdout --logrotate_options=rotate 9 --logrotate_path=logrotate --max_size=2MB '
> May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: 5809 'mesos-logrotate-logger --help=false --log_filename=/var/lib/mesos/slave/slaves/a3ad8418-cb77-4705-b353-4b514ceca52c-S0/frameworks/a3ad8418-cb77-4705-b353-4b514ceca52c-0000/executors/mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6/runs/24762d43-2134-475e-b724-caa72110497a/stderr --logrotate_options=rotate 9 --logrotate_path=logrotate --max_size=2MB '
> May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: I0513 16:58:30.374567 30993 slave.cpp:5498] Recovering executor 'mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6' of framework a3ad8418-cb77-4705-b353-4b514ceca52c-0000
> May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: I0513 16:58:30.420411 30990 status_update_manager.cpp:208] Recovering executor 'mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6' of framework a3ad8418-cb77-4705-b353-4b514ceca52c-0000
> May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: I0513 16:58:30.513164 30994 containerizer.cpp:467] Recovering container '24762d43-2134-475e-b724-caa72110497a' for executor 'mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6' of framework a3ad8418-cb77-4705-b353-4b514ceca52c-0000
> May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: I0513 16:58:30.533478 30988 mem.cpp:602] Started listening for OOM events for container 24762d43-2134-475e-b724-caa72110497a
> May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: I0513 16:58:30.534553 30988 mem.cpp:722] Started listening on low memory pressure events for container 24762d43-2134-475e-b724-caa72110497a
> May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: I0513 16:58:30.535269 30988 mem.cpp:722] Started listening on medium memory pressure events for container 24762d43-2134-475e-b724-caa72110497a
> May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: I0513 16:58:30.536198 30988 mem.cpp:722] Started listening on critical memory pressure events for container 24762d43-2134-475e-b724-caa72110497a
> May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: I0513 16:58:30.579385 30988 docker.cpp:859] Skipping recovery of executor 'mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6' of framework 'a3ad8418-cb77-4705-b353-4b514ceca52c-0000' because it was not launched from docker containerizer
> May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: I0513 16:58:30.587158 30989 slave.cpp:4527] Sending reconnect request to executor 'mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6' of framework a3ad8418-cb77-4705-b353-4b514ceca52c-0000 at executor(1)@10.0.2.74:46154
> May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: I0513 16:58:30.588287 30990 slave.cpp:2838] Re-registering executor 'mesosvol.6ccd993c-1920-11e6-a722-9648cb19afd6' of framework a3ad8418-cb77-4705-b353-4b514ceca52c-0000
> May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: I0513 16:58:30.589736 30988 disk.cpp:169] Updating the disk resources for container 24762d43-2134-475e-b724-caa72110497a to cpus(*):0.1; mem(*):32
> May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: I0513 16:58:30.590117 30990 cpushare.cpp:389] Updated 'cpu.shares' to 102 (cpus 0.1) for container 24762d43-2134-475e-b724-caa72110497a
> May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: I0513 16:58:30.591284 30990 cpushare.cpp:411] Updated 'cpu.cfs_period_us' to 100ms and 'cpu.cfs_quota_us' to 10ms (cpus 0.1) for container 24762d43-2134-475e-b724-caa72110497a
> May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: I0513 16:58:30.595403 30992 mem.cpp:353] Updated 'memory.soft_limit_in_bytes' to 32MB for container 24762d43-2134-475e-b724-caa72110497a
> May 13 16:58:30 ip-10-0-2-74.us-west-2.compute.internal mesos-slave[30985]: I0513 16:58:30.596102 30992 mem.cpp:388] Updated 'memory.limit_in_bytes' to 32MB for container 24762d43-2134-475e-b724-caa72110497a
> {noformat}


