Posted to dev@mesos.apache.org by Vinod Kone <vi...@gmail.com> on 2012/11/06 02:58:59 UTC

Review Request: Send TASK_FAILED updates when an executor is destroyed by the isolation module

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/7887/
-----------------------------------------------------------

Review request for mesos, Benjamin Hindman and Ben Mahler.


Description
-------

See summary


Diffs
-----

  src/common/protobuf_utils.hpp 77b300d7c1a02a836100d3365e205889c48ae99a 
  src/examples/balloon_framework.cpp e9b60de0c7d3a96381aff37340e0f5ac499850dd 
  src/slave/cgroups_isolation_module.hpp dd4703a1ca584d2347efac95bcdfae9a84544d4a 
  src/slave/cgroups_isolation_module.cpp 3d10ee568b8f194543707374f34f21bd3a927958 
  src/slave/lxc_isolation_module.cpp 36d86e08f7b511371a9a2053ddf43477063a79f1 
  src/slave/process_based_isolation_module.cpp b0b6a81c93acc68d1f4acbdda5ab2f9f96b5fb5a 
  src/slave/slave.hpp be0d7cc239e51636bb07e12c3046e0751a958787 
  src/slave/slave.cpp 2bd2dbce538a6108dd9fe607829cfbdab33e0778 
  src/tests/fault_tolerance_tests.cpp a01d1aef012b636f2ced64d4d2ffabfb6ce42644 
  src/tests/gc_tests.cpp b61b2de621e227f327ce546b62f8dfc528f3894e 
  src/tests/master_tests.cpp d9cd09c5650234351f570f0a035f4b61cd2d00f5 

Diff: https://reviews.apache.org/r/7887/diff/


Testing
-------

make check (CentOS)

[vinod@smfd-aki-27-sr1:~/mesos/build] $ sudo GLOG_v=1 ./bin/mesos-tests.sh  --gtest_filter="*CgroupsIsolationTest*" --verbose
...
...
I1106 01:53:54.852120 61941 cgroups_isolation_module.cpp:617] OOM notifier is triggered for executor default of framework 201211060153-2081170186-5432-61885-0000 with tag bf7fc2e7-a9c4-4240-8300-18acb99490dc
I1106 01:53:54.852165 61941 cgroups_isolation_module.cpp:662] OOM detected for executor default of framework 201211060153-2081170186-5432-61885-0000 with tag bf7fc2e7-a9c4-4240-8300-18acb99490dc
I1106 01:53:54.852854 61941 cgroups_isolation_module.cpp:689] MEMORY LIMIT: 100663296 bytes
MEMORY USAGE: 100663296 bytes
MEMORY STATISTICS: 
cache 245760
rss 100417536
mapped_file 24576
pgpgin 7320
pgpgout 6250
inactive_anon 0
active_anon 1826816
inactive_file 192512
active_file 53248
unevictable 98590720
hierarchical_memory_limit 100663296
total_cache 245760
total_rss 100417536
total_mapped_file 24576
total_pgpgin 7320
total_pgpgout 6250
total_inactive_anon 0
total_active_anon 1826816
total_inactive_file 192512
total_active_file 53248
total_unevictable 98590720
I1106 01:53:54.852898 61941 cgroups_isolation_module.cpp:408] Killing executor default of framework 201211060153-2081170186-5432-61885-0000
I1106 01:53:54.855185 61937 cgroups.cpp:1116] Attempting to freeze cgroup 'mesos/framework_201211060153-2081170186-5432-61885-0000_executor_default_tag_bf7fc2e7-a9c4-4240-8300-18acb99490dc'
I1106 01:53:55.536480 61907 hierarchical_allocator_process.hpp:608] No resources available to allocate!
I1106 01:53:55.536576 61907 hierarchical_allocator_process.hpp:543] Performed allocation for 1 slaves in 130.08us
I1106 01:53:56.537866 61903 hierarchical_allocator_process.hpp:608] No resources available to allocate!
I1106 01:53:56.537951 61903 hierarchical_allocator_process.hpp:543] Performed allocation for 1 slaves in 103.18us
I1106 01:53:57.538408 61912 hierarchical_allocator_process.hpp:608] No resources available to allocate!
I1106 01:53:57.538483 61912 hierarchical_allocator_process.hpp:543] Performed allocation for 1 slaves in 93.44us
I1106 01:53:58.539499 61908 hierarchical_allocator_process.hpp:608] No resources available to allocate!
I1106 01:53:58.539593 61908 hierarchical_allocator_process.hpp:543] Performed allocation for 1 slaves in 113.75us
W1106 01:53:59.532685 61903 master.cpp:79] No whitelist given. Advertising offers for all slaves
I1106 01:53:59.540832 61912 hierarchical_allocator_process.hpp:608] No resources available to allocate!
I1106 01:53:59.540907 61912 hierarchical_allocator_process.hpp:543] Performed allocation for 1 slaves in 91.56us
W1106 01:54:00.020642 61941 cgroups.cpp:1201] Unable to freeze cgroup 'mesos/framework_201211060153-2081170186-5432-61885-0000_executor_default_tag_bf7fc2e7-a9c4-4240-8300-18acb99490dc' within 51 attempts
I1106 01:54:00.022102 61937 cgroups.cpp:1131] Attempting to thaw cgroup 'mesos/framework_201211060153-2081170186-5432-61885-0000_executor_default_tag_bf7fc2e7-a9c4-4240-8300-18acb99490dc'
I1106 01:54:00.022274 61937 cgroups.cpp:1237] Successfully thawed cgroup 'mesos/framework_201211060153-2081170186-5432-61885-0000_executor_default_tag_bf7fc2e7-a9c4-4240-8300-18acb99490dc'
I1106 01:54:00.030532 61948 process.cpp:872] Socket closed while receiving
I1106 01:54:00.129642 61936 cgroups_isolation_module.cpp:705] Successfully destroyed the cgroup mesos/framework_201211060153-2081170186-5432-61885-0000_executor_default_tag_bf7fc2e7-a9c4-4240-8300-18acb99490dc
I1106 01:54:00.539801 61944 cgroups_isolation_module.cpp:468] Telling slave of terminated executor default of framework 201211060153-2081170186-5432-61885-0000
I1106 01:54:00.539939 61934 slave.cpp:1008] Executor 'default' of framework 201211060153-2081170186-5432-61885-0000 has terminated with signal Killed
I1106 01:54:00.541018 61934 slave.cpp:833] Status update: task 1 of framework 201211060153-2081170186-5432-61885-0000 is now in state TASK_FAILED
I1106 01:54:00.541290 61944 cgroups_isolation_module.cpp:441] Asked to update resources for an unknown/terminated executor
I1106 01:54:00.541384 61904 hierarchical_allocator_process.hpp:608] No resources available to allocate!
I1106 01:54:00.541460 61904 hierarchical_allocator_process.hpp:543] Performed allocation for 1 slaves in 87.63us
I1106 01:54:00.541471 61936 gc.cpp:97] Scheduling /tmp/mesos/slaves/201211060153-2081170186-5432-61885-0/frameworks/201211060153-2081170186-5432-61885-0000/executors/default/runs/c842b51d-d962-4b20-a80a-bfe484f6dc95 for removal
I1106 01:54:00.541610 61907 master.cpp:1024] Status update from slave(1)@10.35.12.124:36146: task 1 of framework 201211060153-2081170186-5432-61885-0000 is now in state TASK_FAILED
I1106 01:54:00.541759 61907 master.hpp:288] Removing task with resources mem=32 on slave 201211060153-2081170186-5432-61885-0
I1106 01:54:00.541872 61907 master.cpp:1125] Executor default of framework 201211060153-2081170186-5432-61885-0000 on slave 201211060153-2081170186-5432-61885-0 (smfd-aki-27-sr1.devel.twitter.com) exited with status 9
I1106 01:54:00.541872 61912 hierarchical_allocator_process.hpp:491] Recovered mem=32 on slave 201211060153-2081170186-5432-61885-0 from framework 201211060153-2081170186-5432-61885-0000
I1106 01:54:00.541967 61912 hierarchical_allocator_process.hpp:491] Recovered mem=64 on slave 201211060153-2081170186-5432-61885-0 from framework 201211060153-2081170186-5432-61885-0000
I1106 01:54:00.542150 61984 sched.cpp:326] Status update: task 1 of framework 201211060153-2081170186-5432-61885-0000 is now in state TASK_FAILED
Task in state TASK_FAILED
Reason: MEMORY LIMIT: 100663296 bytes
MEMORY USAGE: 100663296 bytes
MEMORY STATISTICS: 
cache 245760
rss 100417536
mapped_file 24576
pgpgin 7320
pgpgout 6250
inactive_anon 0
active_anon 1826816
inactive_file 192512
active_file 53248
unevictable 98590720
hierarchical_memory_limit 100663296
total_cache 245760
total_rss 100417536
total_mapped_file 24576
total_pgpgin 7320
total_pgpgout 6250
total_inactive_anon 0
total_active_anon 1826816
total_inactive_file 192512
total_active_file 53248
total_unevictable 98590720


Thanks,

Vinod Kone


Re: Review Request: Send TASK_FAILED updates when an executor is destroyed by the isolation module

Posted by Ben Mahler <be...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/7887/#review13203
-----------------------------------------------------------

Ship it!



src/slave/cgroups_isolation_module.hpp
<https://reviews.apache.org/r/7887/#comment28405>

    It's a bit odd to have:
    
    killed // whether killExecutor() called
    destroyed // whether destroyed by module
    
    Maybe rename to something more indicative?
    bool killAttempted; // Have we tried to kill it via killExecutor()?
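A minimal sketch of what the suggested renaming could look like. The struct name ExecutorFlags, the field destroyedByModule and the helper describeTermination() are hypothetical and only illustrate the idea; killAttempted is the name proposed in the comment above. This is not the actual struct from cgroups_isolation_module.hpp.

```cpp
#include <cassert>
#include <string>

// Hypothetical per-executor bookkeeping; not the actual Mesos struct.
struct ExecutorFlags
{
  // Have we tried to kill it via killExecutor()? (name from the review)
  bool killAttempted = false;

  // Did the isolation module itself destroy the executor, e.g. after an OOM?
  // (illustrative name, assumed for this sketch)
  bool destroyedByModule = false;
};

// Summarizes why an executor went away, based only on the two flags.
// Destruction by the module takes precedence over a requested kill.
inline std::string describeTermination(const ExecutorFlags& flags)
{
  if (flags.destroyedByModule) {
    return "destroyed by isolation module";
  }
  if (flags.killAttempted) {
    return "kill requested via killExecutor()";
  }
  return "terminated on its own";
}
```

With names like these, each flag says who initiated the action, so a reader does not have to tell "killed" and "destroyed" apart from the comments alone.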



src/slave/cgroups_isolation_module.cpp
<https://reviews.apache.org/r/7887/#comment28407>

    This message comes out a bit rough in the log:
    
    I1106 01:53:54.852854 61941 cgroups_isolation_module.cpp:689] MEMORY LIMIT: 100663296 bytes
    MEMORY USAGE: 100663296 bytes
    MEMORY STATISTICS: 
    cache 245760
    rss 100417536
    mapped_file 24576
    pgpgin 7320
    pgpgout 6250
    inactive_anon 0
    active_anon 1826816
    inactive_file 192512
    active_file 53248
    unevictable 98590720
    hierarchical_memory_limit 100663296
    total_cache 245760
    total_rss 100417536
    total_mapped_file 24576
    total_pgpgin 7320
    total_pgpgout 6250
    total_inactive_anon 0
    total_active_anon 1826816
    total_inactive_file 192512
    total_active_file 53248
    total_unevictable 98590720
    
    vs. having the OOM message and the data in a single, indented log message:
    
    I1106 01:53:54.852854 61941 cgroups_isolation_module.cpp:689] OOM detected for executor default of framework 201211060153-2081170186-5432-61885-0000 with tag bf7fc2e7-a9c4-4240-8300-18acb99490dc
      MEMORY LIMIT: 100663296 bytes
      MEMORY USAGE: 100663296 bytes
      MEMORY STATISTICS: 
        cache 245760
        rss 100417536
        mapped_file 24576
        pgpgin 7320
        pgpgout 6250
        inactive_anon 0
        active_anon 1826816
        inactive_file 192512
        active_file 53248
        unevictable 98590720
        hierarchical_memory_limit 100663296
        total_cache 245760
        total_rss 100417536
        total_mapped_file 24576
        total_pgpgin 7320
        total_pgpgout 6250
        total_inactive_anon 0
        total_active_anon 1826816
        total_inactive_file 192512
        total_active_file 53248
        total_unevictable 98590720
    
    Also, for the reason, can you prepend the fact that an OOM happened?
    
    like:
    I1106 01:54:00.542150 61984 sched.cpp:326] Status update: task 1 of framework 201211060153-2081170186-5432-61885-0000 is now in state TASK_FAILED
    Task in state TASK_FAILED
    Reason: OOM Detected // <-- Here
    MEMORY LIMIT: 100663296 bytes
    MEMORY USAGE: 100663296 bytes
    MEMORY STATISTICS: 
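
One way to get the single, indented message suggested above is to build the whole text first and log it once. The sketch below assumes two hypothetical helpers, indent() and formatOom(); neither is taken from the patch, and the statistics are passed in as a newline-separated string purely for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>

// Prefixes every line of 'text' with 'spaces' spaces (hypothetical helper).
inline std::string indent(const std::string& text, std::size_t spaces)
{
  std::istringstream in(text);
  std::ostringstream out;
  const std::string pad(spaces, ' ');
  std::string line;
  while (std::getline(in, line)) {
    out << pad << line << "\n";
  }
  return out.str();
}

// Builds one multi-line message: the OOM headline first, then the limit,
// the usage and the raw statistics, each indented under the headline.
inline std::string formatOom(
    const std::string& executor,
    unsigned long long limitBytes,
    unsigned long long usageBytes,
    const std::string& stats)
{
  std::ostringstream out;
  out << "OOM detected for " << executor << "\n"
      << "  MEMORY LIMIT: " << limitBytes << " bytes\n"
      << "  MEMORY USAGE: " << usageBytes << " bytes\n"
      << "  MEMORY STATISTICS:\n"
      << indent(stats, 4);
  return out.str();
}
```

The resulting string could be handed to a single LOG(INFO) call, and the same text could be reused as the reason in the status update so that it starts with the fact that an OOM happened.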
    



src/slave/slave.cpp
<https://reviews.apache.org/r/7887/#comment28406>

    Just curious, why the check for command executor?
    
    More specifically, why is a terminated non-destroyed command executor failed instead of lost?
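
For readers without the diff at hand, the behavior being asked about can be sketched roughly as follows. This is only a reading of the question above, not the code in slave.cpp: the enum TerminalState and the function stateForTerminatedExecutor() are hypothetical stand-ins, and the mapping to TASK_LOST in the remaining case is an assumption.

```cpp
#include <cassert>

// Local stand-in for the two task states discussed in this comment.
enum class TerminalState { TASK_FAILED, TASK_LOST };

// Picks the state to report for tasks of a terminated executor.
//   destroyed:         the isolation module destroyed the executor
//   isCommandExecutor: the executor is the built-in command executor
inline TerminalState stateForTerminatedExecutor(
    bool destroyed,
    bool isCommandExecutor)
{
  // Per the review summary, a destroyed executor yields TASK_FAILED.
  // Per the question above, a command executor does too, even when
  // it was not destroyed by the module.
  if (destroyed || isCommandExecutor) {
    return TerminalState::TASK_FAILED;
  }
  // Assumed fallback for every other terminated executor.
  return TerminalState::TASK_LOST;
}
```

The question is about the second half of the condition: whether a command executor that terminated without being destroyed should fall through to the lost case instead.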


- Ben Mahler




Re: Review Request: Send TASK_FAILED updates when an executor is destroyed by the isolation module

Posted by Benjamin Hindman <be...@berkeley.edu>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/7887/#review13231
-----------------------------------------------------------

Ship it!



- Benjamin Hindman




Re: Review Request: Send TASK_FAILED updates when an executor is destroyed by the isolation module

Posted by Vinod Kone <vi...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/7887/
-----------------------------------------------------------

(Updated Nov. 6, 2012, 8:33 p.m.)


Review request for mesos, Benjamin Hindman and Ben Mahler.


Changes
-------

s/terminated/killed.

Added some comments.

Formatting.


Description
-------

See summary


Diffs (updated)
-----

  src/common/protobuf_utils.hpp 77b300d7c1a02a836100d3365e205889c48ae99a 
  src/examples/balloon_framework.cpp e9b60de0c7d3a96381aff37340e0f5ac499850dd 
  src/slave/cgroups_isolation_module.hpp dd4703a1ca584d2347efac95bcdfae9a84544d4a 
  src/slave/cgroups_isolation_module.cpp 3d10ee568b8f194543707374f34f21bd3a927958 
  src/slave/lxc_isolation_module.cpp 36d86e08f7b511371a9a2053ddf43477063a79f1 
  src/slave/process_based_isolation_module.cpp b0b6a81c93acc68d1f4acbdda5ab2f9f96b5fb5a 
  src/slave/slave.hpp be0d7cc239e51636bb07e12c3046e0751a958787 
  src/slave/slave.cpp 2bd2dbce538a6108dd9fe607829cfbdab33e0778 
  src/tests/fault_tolerance_tests.cpp a01d1aef012b636f2ced64d4d2ffabfb6ce42644 
  src/tests/gc_tests.cpp b61b2de621e227f327ce546b62f8dfc528f3894e 
  src/tests/master_tests.cpp d9cd09c5650234351f570f0a035f4b61cd2d00f5 

Diff: https://reviews.apache.org/r/7887/diff/


Testing
-------

make check (CentOS)



Thanks,

Vinod Kone