Posted to dev@mesos.apache.org by "Vinod Kone (JIRA)" <ji...@apache.org> on 2013/03/25 04:25:15 UTC

[jira] [Resolved] (MESOS-215) In slave, a framework won't be shutdown if no executor in it.

     [ https://issues.apache.org/jira/browse/MESOS-215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kone resolved MESOS-215.
------------------------------

    Resolution: Fixed
    
> In slave, a framework won't be shutdown if no executor in it.
> -------------------------------------------------------------
>
>                 Key: MESOS-215
>                 URL: https://issues.apache.org/jira/browse/MESOS-215
>             Project: Mesos
>          Issue Type: Bug
>          Components: slave
>    Affects Versions: 0.9.0
>         Environment: All platforms.
>            Reporter: Jie Yu
>            Assignee: Vinod Kone
>            Priority: Minor
>
> In the slave, a framework is not shut down if it has no executors. In some cases, this can cause the slave to keep resending status updates to the master, for example when the user's scheduler terminates before the corresponding status update acknowledgement is sent.
> void Slave::shutdownFramework(const FrameworkID& frameworkId)
> {
>   LOG(INFO) << "Asked to shut down framework " << frameworkId;
>   Framework* framework = getFramework(frameworkId);
>   if (framework != NULL) {
>     LOG(INFO) << "Shutting down framework " << framework->id;
>     // Shut down all executors of this framework.
>     foreachvalue (Executor* executor, framework->executors) {
>       shutdownExecutor(framework, executor);
>     }
>   }
> }
> If the framework has no executors (e.g. they were killed due to an unexpected process exit), shutdownExecutor will not be executed. As a result, the framework will not be removed from the slave. If the slave then does not receive an acknowledgement for a status update (e.g. because the user's scheduler terminated before it was sent), the slave will keep resending the status update message to the master.
> Here is the output from my test:
> ======= start of master =======
> jyu@jyu-vm-ubuntu:~/workspace/mesos/build$ sudo src/mesos-master --port=5432
> I0621 17:18:07.984211 31857 logging.cpp:86] Logging to STDERR
> I0621 17:18:07.990334 31857 main.cpp:104] Build: 2012-05-31 09:05:54 by jyu
> I0621 17:18:07.990653 31857 main.cpp:105] Starting Mesos master
> I0621 17:18:07.991225 31872 master.cpp:262] Master started on 127.0.1.1:5432
> I0621 17:18:07.991291 31872 master.cpp:277] Master ID: 201206211718-16842879-5432-31857
> I0621 17:18:07.993168 31872 master.cpp:493] Elected as master!
> I0621 17:18:08.011967 31874 webui_utils.cpp:49] Loading webui script at '/home/jyu/workspace/mesos/install/share/mesos/webui/master/webui.py'
> Bottle server starting up (using WSGIRefServer())...
> Listening on http://0.0.0.0:8080/
> Use Ctrl-C to quit.
> I0621 17:18:09.480581 31871 master.cpp:858] Attempting to register slave on jyu-vm-ubuntu at slave(1)@127.0.1.1:46234
> I0621 17:18:09.480648 31871 master.cpp:1075] Master now considering a slave at jyu-vm-ubuntu:46234 as active
> I0621 17:18:09.480692 31871 master.cpp:1611] Adding slave 201206211718-16842879-5432-31857-0 at jyu-vm-ubuntu with cpus=1; mem=96
> I0621 17:18:09.481227 31871 simple_allocator.cpp:69] Added slave 201206211718-16842879-5432-31857-0 with cpus=1; mem=96
> I0621 17:18:10.850127 31871 master.cpp:536] Registering framework 201206211718-16842879-5432-31857-0000 at scheduler(1)@127.0.1.1:39471
> I0621 17:18:10.850338 31871 simple_allocator.cpp:46] Added framework 201206211718-16842879-5432-31857-0000
> I0621 17:18:10.850414 31871 master.cpp:1166] Sending 1 offers to framework 201206211718-16842879-5432-31857-0000
> I0621 17:18:10.851227 31871 master.cpp:704] Received reply for offer 201206211718-16842879-5432-31857-0
> I0621 17:18:10.851323 31871 master.cpp:1473] Launching task 1 with resources mem=32 on slave 201206211718-16842879-5432-31857-0 (jyu-vm-ubuntu)
> I0621 17:18:34.898843 31871 master.cpp:956] Status update from slave(1)@127.0.1.1:46234: task 1 of framework 201206211718-16842879-5432-31857-0000 is now in state TASK_LOST
> I0621 17:18:34.899086 31871 master.cpp:1055] Executor default of framework 201206211718-16842879-5432-31857-0000 on slave 201206211718-16842879-5432-31857-0 (jyu-vm-ubuntu) exited with status 0
> I0621 17:18:34.902322 31871 master.cpp:435] Framework 201206211718-16842879-5432-31857-0000 disconnected
> I0621 17:18:34.902359 31871 master.cpp:444] Giving framework 201206211718-16842879-5432-31857-0000 0 seconds to failover
> I0621 17:18:34.902570 31871 master.cpp:1125] Framework failover timeout, removing framework 201206211718-16842879-5432-31857-0000
> I0621 17:18:34.902668 31871 simple_allocator.cpp:59] Removed framework 201206211718-16842879-5432-31857-0000
> I0621 17:18:44.899116 31871 master.cpp:956] Status update from slave(1)@127.0.1.1:46234: task 1 of framework 201206211718-16842879-5432-31857-0000 is now in state TASK_LOST
> W0621 17:18:44.899209 31871 master.cpp:994] Status update from slave(1)@127.0.1.1:46234 (jyu-vm-ubuntu): error, couldn't lookup framework 201206211718-16842879-5432-31857-0000
> I0621 17:18:54.901684 31871 master.cpp:956] Status update from slave(1)@127.0.1.1:46234: task 1 of framework 201206211718-16842879-5432-31857-0000 is now in state TASK_LOST
> W0621 17:18:54.901762 31871 master.cpp:994] Status update from slave(1)@127.0.1.1:46234 (jyu-vm-ubuntu): error, couldn't lookup framework 201206211718-16842879-5432-31857-0000
> I0621 17:19:04.904207 31871 master.cpp:956] Status update from slave(1)@127.0.1.1:46234: task 1 of framework 201206211718-16842879-5432-31857-0000 is now in state TASK_LOST
> W0621 17:19:04.904311 31871 master.cpp:994] Status update from slave(1)@127.0.1.1:46234 (jyu-vm-ubuntu): error, couldn't lookup framework 201206211718-16842879-5432-31857-0000
> I0621 17:19:14.908376 31871 master.cpp:956] Status update from slave(1)@127.0.1.1:46234: task 1 of framework 201206211718-16842879-5432-31857-0000 is now in state TASK_LOST
> W0621 17:19:14.908475 31871 master.cpp:994] Status update from slave(1)@127.0.1.1:46234 (jyu-vm-ubuntu): error, couldn't lookup framework 201206211718-16842879-5432-31857-0000
> I0621 17:19:24.910850 31871 master.cpp:956] Status update from slave(1)@127.0.1.1:46234: task 1 of framework 201206211718-16842879-5432-31857-0000 is now in state TASK_LOST
> W0621 17:19:24.910948 31871 master.cpp:994] Status update from slave(1)@127.0.1.1:46234 (jyu-vm-ubuntu): error, couldn't lookup framework 201206211718-16842879-5432-31857-0000
> I0621 17:19:34.914938 31871 master.cpp:956] Status update from slave(1)@127.0.1.1:46234: task 1 of framework 201206211718-16842879-5432-31857-0000 is now in state TASK_LOST
> W0621 17:19:34.915150 31871 master.cpp:994] Status update from slave(1)@127.0.1.1:46234 (jyu-vm-ubuntu): error, couldn't lookup framework 201206211718-16842879-5432-31857-0000
> I0621 17:19:44.917757 31871 master.cpp:956] Status update from slave(1)@127.0.1.1:46234: task 1 of framework 201206211718-16842879-5432-31857-0000 is now in state TASK_LOST
> W0621 17:19:44.917917 31871 master.cpp:994] Status update from slave(1)@127.0.1.1:46234 (jyu-vm-ubuntu): error, couldn't lookup framework 201206211718-16842879-5432-31857-0000
> I0621 17:19:54.921114 31871 master.cpp:956] Status update from slave(1)@127.0.1.1:46234: task 1 of framework 201206211718-16842879-5432-31857-0000 is now in state TASK_LOST
> W0621 17:19:54.921288 31871 master.cpp:994] Status update from slave(1)@127.0.1.1:46234 (jyu-vm-ubuntu): error, couldn't lookup framework 201206211718-16842879-5432-31857-0000
> ======= end of master =======
> ======= start of slave =======
> jyu@jyu-vm-ubuntu:~/workspace/mesos/build$ sudo src/mesos-slave --master=localhost:5432 --resources="cpus:1;mem:96" --isolation=cgroups
> I0621 17:18:09.466815 31877 logging.cpp:86] Logging to STDERR
> I0621 17:18:09.473896 31877 main.cpp:111] Creating "cgroups" isolation module
> I0621 17:18:09.474149 31877 main.cpp:119] Build: 2012-05-31 09:05:54 by jyu
> I0621 17:18:09.474203 31877 main.cpp:120] Starting Mesos slave
> I0621 17:18:09.476152 31877 slave.cpp:209] Slave started on 1)@127.0.1.1:46234
> I0621 17:18:09.476459 31877 slave.cpp:210] Slave resources: cpus=1; mem=96
> I0621 17:18:09.477195 31877 slave.cpp:376] New master detected at master@127.0.0.1:5432
> I0621 17:18:09.481650 31891 slave.cpp:396] Registered with master; given slave ID 201206211718-16842879-5432-31857-0
> I0621 17:18:09.496419 31894 webui_utils.cpp:49] Loading webui script at '/home/jyu/workspace/mesos/install/share/mesos/webui/slave/webui.py'
> Bottle server starting up (using WSGIRefServer())...
> Listening on http://0.0.0.0:8081/
> Use Ctrl-C to quit.
> I0621 17:18:10.851681 31891 slave.cpp:457] Got assigned task 1 for framework 201206211718-16842879-5432-31857-0000
> I0621 17:18:10.851780 31891 slave.cpp:1559] Generating a unique work directory for executor 'default' of framework 201206211718-16842879-5432-31857-0000
> I0621 17:18:10.852123 31891 slave.cpp:522] Using '/tmp/mesos/slaves/201206211718-16842879-5432-31857-0/frameworks/201206211718-16842879-5432-31857-0000/executors/default/runs/0' as work directory for executor 'default' of framework 201206211718-16842879-5432-31857-0000
> I0621 17:18:10.852707 31891 cgroups_isolation_module.cpp:149] Launching default (/home/jyu/workspace/mesos/build/src/.libs/balloon-executor) in /tmp/mesos/slaves/201206211718-16842879-5432-31857-0/frameworks/201206211718-16842879-5432-31857-0000/executors/default/runs/0 with resources mem=64' for framework 201206211718-16842879-5432-31857-0000
> I0621 17:18:10.853230 31891 cgroups_isolation_module.cpp:323] Changing cgroup controls in /cgroups/mesos_cgroup_executor_default_framework_201206211718-16842879-5432-31857-0000 to mem=64
> I0621 17:18:10.853401 31891 cgroups_isolation_module.cpp:339] Write cpu.shares = 10
> I0621 17:18:10.853543 31891 cgroups_isolation_module.cpp:353] Write memory.limit_in_bytes = 67108864
> I0621 17:18:10.853701 31891 cgroups_isolation_module.cpp:371] Start listen on OOM events
> I0621 17:18:10.854008 31891 cgroups_isolation_module.cpp:187] Forked executor at = 31913
> I0621 17:18:10.897469 31891 slave.cpp:789] Got registration for executor 'default' of framework 201206211718-16842879-5432-31857-0000
> I0621 17:18:10.897694 31892 cgroups_isolation_module.cpp:323] Changing cgroup controls in /cgroups/mesos_cgroup_executor_default_framework_201206211718-16842879-5432-31857-0000 to mem=96
> I0621 17:18:10.897886 31891 slave.cpp:847] Flushing queued tasks for framework 201206211718-16842879-5432-31857-0000
> I0621 17:18:10.898143 31892 cgroups_isolation_module.cpp:339] Write cpu.shares = 10
> I0621 17:18:10.898398 31892 cgroups_isolation_module.cpp:353] Write memory.limit_in_bytes = 100663296
> I0621 17:18:34.892823 31892 cgroups_isolation_module.cpp:389] OOM notifier is triggered
> I0621 17:18:34.892909 31892 cgroups_isolation_module.cpp:434] OOM detected in executor default of framework 201206211718-16842879-5432-31857-0000
> I0621 17:18:34.892930 31892 cgroups_isolation_module.cpp:229] Killing executor default for framework 201206211718-16842879-5432-31857-0000
> I0621 17:18:34.894033 31892 slave.cpp:1383] Executor 'default' of framework 201206211718-16842879-5432-31857-0000 has exited with status 0
> I0621 17:18:34.894765 31892 slave.cpp:989] Status update: task 1 of framework 201206211718-16842879-5432-31857-0000 is now in state TASK_LOST
> I0621 17:18:34.895120 31892 slave.cpp:1507] Scheduling executor directory /tmp/mesos/slaves/201206211718-16842879-5432-31857-0/frameworks/201206211718-16842879-5432-31857-0000/executors/default/runs/0 for deletion
> I0621 17:18:34.902997 31891 slave.cpp:625] Asked to shut down framework 201206211718-16842879-5432-31857-0000
> I0621 17:18:34.903031 31891 slave.cpp:629] Shutting down framework 201206211718-16842879-5432-31857-0000
> I0621 17:18:44.897243 31891 slave.cpp:1083] Resending status update for task 1 of framework 201206211718-16842879-5432-31857-0000
> I0621 17:18:54.900529 31891 slave.cpp:1083] Resending status update for task 1 of framework 201206211718-16842879-5432-31857-0000
> I0621 17:19:04.903036 31892 slave.cpp:1083] Resending status update for task 1 of framework 201206211718-16842879-5432-31857-0000
> I0621 17:19:14.905769 31892 slave.cpp:1083] Resending status update for task 1 of framework 201206211718-16842879-5432-31857-0000
> I0621 17:19:24.909895 31891 slave.cpp:1083] Resending status update for task 1 of framework 201206211718-16842879-5432-31857-0000
> I0621 17:19:34.912976 31891 slave.cpp:1083] Resending status update for task 1 of framework 201206211718-16842879-5432-31857-0000
> I0621 17:19:44.916512 31891 slave.cpp:1083] Resending status update for task 1 of framework 201206211718-16842879-5432-31857-0000
> I0621 17:19:54.920135 31891 slave.cpp:1083] Resending status update for task 1 of framework 201206211718-16842879-5432-31857-0000
> I0621 17:20:04.922044 31892 slave.cpp:1083] Resending status update for task 1 of framework 201206211718-16842879-5432-31857-0000
> ======= end of slave =======
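
The behaviour described in the report can be illustrated with a small standalone model. This is only a sketch under stated assumptions, not the Mesos code or the fix that was actually committed: SlaveModel, Framework, pendingUpdates, shutdownFrameworkReported, shutdownFrameworkHypothetical and resendPendingUpdates are all hypothetical names, and the model assumes that a framework is dropped only when its last executor is removed and that pending status updates are dropped together with the framework.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Simplified stand-in for the slave's per-framework state (NOT the Mesos type).
struct Framework {
  std::string id;
  std::vector<std::string> executors;       // IDs of live executors
  std::vector<std::string> pendingUpdates;  // unacknowledged status updates
};

// Hypothetical model of the slave's bookkeeping.
class SlaveModel {
public:
  void addFramework(const Framework& framework) {
    assert(!framework.id.empty());
    frameworks[framework.id] = framework;
  }

  // Mirrors the reported logic: only the executors are visited, so a
  // framework with zero executors is never removed.
  void shutdownFrameworkReported(const std::string& frameworkId) {
    auto it = frameworks.find(frameworkId);
    if (it == frameworks.end()) return;
    // Copy the IDs because removeExecutor() may erase the framework.
    const std::vector<std::string> executors = it->second.executors;
    for (const std::string& executorId : executors) {
      removeExecutor(frameworkId, executorId);
    }
  }

  // One possible remedy (an assumption, not the committed fix): remove the
  // framework directly when there is nothing left to shut down.
  void shutdownFrameworkHypothetical(const std::string& frameworkId) {
    auto it = frameworks.find(frameworkId);
    if (it == frameworks.end()) return;
    if (it->second.executors.empty()) {
      frameworks.erase(it);  // also drops its pending status updates
      return;
    }
    shutdownFrameworkReported(frameworkId);
  }

  // Number of status updates one retry pass would resend.
  std::size_t resendPendingUpdates() const {
    std::size_t count = 0;
    for (const auto& entry : frameworks) {
      count += entry.second.pendingUpdates.size();
    }
    return count;
  }

  bool hasFramework(const std::string& frameworkId) const {
    return frameworks.count(frameworkId) > 0;
  }

private:
  // Removes one executor; the framework goes away with its last executor.
  void removeExecutor(const std::string& frameworkId,
                      const std::string& executorId) {
    auto it = frameworks.find(frameworkId);
    if (it == frameworks.end()) return;
    std::vector<std::string>& executors = it->second.executors;
    executors.erase(
        std::remove(executors.begin(), executors.end(), executorId),
        executors.end());
    if (executors.empty()) frameworks.erase(it);
  }

  std::map<std::string, Framework> frameworks;
};
```

In this model, calling shutdownFrameworkReported on a framework that has no executors but one unacknowledged update leaves the framework in place, so every retry pass still has one update to resend, which matches the repeated "Resending status update" lines in the slave log above. The hypothetical variant erases the framework and the retry pass has nothing left to send.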

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira