Posted to dev@mesos.apache.org by Brenden Matthews <br...@airbedandbreakfast.com> on 2013/05/30 19:43:32 UTC

Slaves deactivating

Hey guys,

I'm having a frequent problem right now in master.  Slaves keep
deactivating and I'm unsure why.  Here's the master log:

W0530 17:35:32.300029 21798 master.cpp:1199] Removing slave
201305300057-1471680778-5050-21299-144 at
slave(1)@10.148.178.186:5051 because it has been deactivated
I0530 17:35:32.300742 21800 hierarchical_allocator_process.hpp:423] Removed
slave 201305300057-1471680778-5050-21299-144
I0530 17:35:32.302295 21798 master.hpp:295] Removing task Task_Tracker_475
with resources cpus=23.25; mem=51150; disk=126976; ports=[31001-31001,
31999-31999] on slave 201305300057-1471680778-5050-21299-144
I0530 17:35:32.304235 21798 master.hpp:295] Removing task
ct:join_search_request_yesterday:1369931394916:2 with resources cpus=1;
mem=1; disk=1 on slave 201305300057-1471680778-5050-21299-144
I0530 17:35:32.306157 21798 master.hpp:295] Removing task Task_Tracker_451
with resources cpus=2.25; mem=4950; disk=12288; ports=[31000-31000,
32000-32000] on slave 201305300057-1471680778-5050-21299-144


And here's the slave log:

I0530 17:29:31.326658 24787 slave.cpp:2498] Current usage 0.66%. Max
allowed age: 6.253510014347754days
I0530 17:30:31.328030 24787 slave.cpp:2498] Current usage 0.67%. Max
allowed age: 6.253270522087396days
I0530 17:31:31.329982 24798 slave.cpp:2498] Current usage 1.44%. Max
allowed age: 6.199047245757789days
I0530 17:32:31.333297 24810 slave.cpp:2498] Current usage 1.55%. Max
allowed age: 6.191838835555544days
I0530 17:34:12.236701 24790 slave.cpp:2498] Current usage 1.59%. Max
allowed age: 6.188834669188438days
I0530 17:37:10.797281 24790 slave.cpp:492] Slave asked to shut down by
master@10.17.184.87:5050
I0530 17:37:10.938212 24790 slave.cpp:1114] Asked to shut down framework
201305290115-1471680778-5050-30247-0001 by master@10.17.184.87:5050
I0530 17:37:10.954860 24790 slave.cpp:1139] Shutting down framework
201305290115-1471680778-5050-30247-0001
I0530 17:37:10.955477 24790 slave.cpp:2315] Shutting down executor
'executor_Task_Tracker_475' of framework
201305290115-1471680778-5050-30247-0001
I0530 17:37:10.956305 24790 slave.cpp:2315] Shutting down executor
'executor_Task_Tracker_451' of framework
201305290115-1471680778-5050-30247-0001
I0530 17:37:10.956914 24790 slave.cpp:1114] Asked to shut down framework
chronos by master@10.17.184.87:5050
I0530 17:37:10.992754 24790 slave.cpp:1139] Shutting down framework chronos
I0530 17:37:11.028842 24790 slave.cpp:2315] Shutting down executor
'ct:join_search_request_yesterday:1369931394916:2' of framework chronos
I0530 17:37:14.847417 24809 cgroups_isolator.cpp:804] Executor
ct:join_search_request_yesterday:1369931394916:2 of framework chronos
terminated with status 0
I0530 17:37:15.957542 24792 slave.cpp:2384] Killing executor
'executor_Task_Tracker_475' of framework
201305290115-1471680778-5050-30247-0001
I0530 17:37:16.048948 24809 cgroups_isolator.cpp:620] Killing executor
ct:join_search_request_yesterday:1369931394916:2 of framework chronos
I0530 17:37:16.085058 24792 slave.cpp:2384] Killing executor
'executor_Task_Tracker_451' of framework
201305290115-1471680778-5050-30247-0001
I0530 17:37:16.086105 24792 slave.cpp:2384] Killing executor
'ct:join_search_request_yesterday:1369931394916:2' of framework chronos
W0530 17:37:16.089593 24790 monitor.cpp:167] Failed to collect resource
usage for executor 'ct:join_search_request_yesterday:1369931394916:2' of
framework 'chronos': Unknown or killed executor
I0530 17:37:16.089761 24809 cgroups_isolator.cpp:620] Killing executor
executor_Task_Tracker_475 of framework
201305290115-1471680778-5050-30247-0001
I0530 17:37:16.093415 24810 cgroups.cpp:1175] Trying to freeze cgroup
/cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330
I0530 17:37:16.161749 24810 cgroups.cpp:1205] Watching cgroup
/cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
frozen
I0530 17:37:16.163240 24809 cgroups_isolator.cpp:1023] OOM notifier is
triggered for executor ct:join_search_request_yesterday:1369931394916:2 of
framework chronos with uuid 199c6567-1296-427f-a74e-13742b95a330
I0530 17:37:16.163627 24809 cgroups_isolator.cpp:1028] Discarded OOM
notifier for executor ct:join_search_request_yesterday:1369931394916:2 of
framework chronos with uuid 199c6567-1296-427f-a74e-13742b95a330
I0530 17:37:16.164384 24809 cgroups_isolator.cpp:620] Killing executor
executor_Task_Tracker_451 of framework
201305290115-1471680778-5050-30247-0001
E0530 17:37:16.166890 24809 cgroups_isolator.cpp:616] Asked to kill an
unknown/killed executor!
I0530 17:37:16.241206 24809 cgroups_isolator.cpp:1023] OOM notifier is
triggered for executor executor_Task_Tracker_475 of framework
201305290115-1471680778-5050-30247-0001 with uuid
e90a8ce9-812d-4757-833e-62c55ada5cda
I0530 17:37:16.264039 24786 cgroups.cpp:1205] Watching cgroup
/cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
frozen
I0530 17:37:16.290067 24809 cgroups_isolator.cpp:1028] Discarded OOM
notifier for executor executor_Task_Tracker_475 of framework
201305290115-1471680778-5050-30247-0001 with uuid
e90a8ce9-812d-4757-833e-62c55ada5cda
I0530 17:37:16.313125 24809 cgroups_isolator.cpp:1023] OOM notifier is
triggered for executor executor_Task_Tracker_451 of framework
201305290115-1471680778-5050-30247-0001 with uuid
10c5125d-cca4-42b7-a11a-a7fda594b005
I0530 17:37:16.321915 24809 cgroups_isolator.cpp:1028] Discarded OOM
notifier for executor executor_Task_Tracker_451 of framework
201305290115-1471680778-5050-30247-0001 with uuid
10c5125d-cca4-42b7-a11a-a7fda594b005
F0530 17:37:16.322549 24809 cgroups_isolator.cpp:1165] Failed to destroy
cgroup
mesos/framework_201305290115-1471680778-5050-30247-0001_executor_executor_Task_Tracker_475_tag_e90a8ce9-812d-4757-833e-62c55ada5cda:
Failed to kill tasks in nested cgroups: Collect failed: Failed to send
Killed to process 721: No such process
*** Check failure stack trace: ***
I0530 17:37:16.405956 24802 cgroups.cpp:1205] Watching cgroup
/cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
frozen
I0530 17:37:18.258474 24787 cgroups.cpp:1205] Watching cgroup
/cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
frozen
I0530 17:37:18.394127 24781 cgroups.cpp:1205] Watching cgroup
/cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
frozen
I0530 17:37:18.530743 24785 cgroups.cpp:1205] Watching cgroup
/cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
frozen
I0530 17:37:18.670518 24788 cgroups.cpp:1205] Watching cgroup
/cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
frozen
I0530 17:37:18.806540 24803 cgroups.cpp:1205] Watching cgroup
/cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
frozen
I0530 17:37:18.942659 24796 cgroups.cpp:1205] Watching cgroup
/cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
frozen
I0530 17:37:19.078434 24807 cgroups.cpp:1205] Watching cgroup
/cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
frozen
I0530 17:37:19.214282 24789 cgroups.cpp:1205] Watching cgroup
/cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
frozen
I0530 17:37:19.350401 24793 cgroups.cpp:1205] Watching cgroup
/cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
frozen
I0530 17:37:19.486559 24800 cgroups.cpp:1205] Watching cgroup
/cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
frozen
I0530 17:37:19.622144 24799 cgroups.cpp:1205] Watching cgroup
/cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
frozen
I0530 17:37:19.758436 24806 cgroups.cpp:1205] Watching cgroup
/cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
frozen
I0530 17:37:19.894443 24792 cgroups.cpp:1205] Watching cgroup
/cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
frozen
I0530 17:37:19.996562 24797 cgroups.cpp:1205] Watching cgroup
/cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
frozen
I0530 17:37:20.134127 24783 cgroups.cpp:1205] Watching cgroup
/cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
frozen
I0530 17:37:20.270037 24804 cgroups.cpp:1205] Watching cgroup
/cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
frozen
I0530 17:37:20.406051 24805 cgroups.cpp:1205] Watching cgroup
/cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
frozen
I0530 17:37:20.542073 24808 cgroups.cpp:1205] Watching cgroup
/cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
frozen
I0530 17:37:20.678064 24794 cgroups.cpp:1205] Watching cgroup
/cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
frozen
I0530 17:37:20.814045 24798 cgroups.cpp:1205] Watching cgroup
/cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
frozen
I0530 17:37:20.950011 24810 cgroups.cpp:1205] Watching cgroup
/cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
frozen
I0530 17:37:21.052018 24780 cgroups.cpp:1205] Watching cgroup
/cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
frozen
I0530 17:37:21.186023 24786 cgroups.cpp:1205] Watching cgroup
/cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
frozen
I0530 17:37:21.322306 24781 cgroups.cpp:1205] Watching cgroup
/cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
frozen
    @     0x7f57595b5c1d  google::LogMessage::Fail()
I0530 17:37:21.424214 24787 cgroups.cpp:1205] Watching cgroup
/cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
frozen
    @     0x7f57595b83af  google::LogMessage::SendToLog()
I0530 17:37:21.526203 24790 cgroups.cpp:1205] Watching cgroup
/cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
frozen
I0530 17:37:21.662224 24785 cgroups.cpp:1205] Watching cgroup
/cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
frozen
I0530 17:37:21.764526 24788 cgroups.cpp:1205] Watching cgroup
/cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
frozen
I0530 17:37:21.866744 24802 cgroups.cpp:1205] Watching cgroup
/cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
frozen
I0530 17:37:21.969965 24780 cgroups.cpp:1205] Watching cgroup
/cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
frozen
I0530 17:37:22.072257 24791 cgroups.cpp:1205] Watching cgroup
/cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
frozen
I0530 17:37:22.210059 24811 cgroups.cpp:1205] Watching cgroup
/cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
frozen
I0530 17:37:22.346051 24782 cgroups.cpp:1205] Watching cgroup
/cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
frozen
    @     0x7f57595b581b  google::LogMessage::Flush()
I0530 17:37:22.482100 24795 cgroups.cpp:1205] Watching cgroup
/cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
frozen
I0530 17:37:22.618103 24784 cgroups.cpp:1205] Watching cgroup
/cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
frozen
I0530 17:37:22.754070 24786 cgroups.cpp:1205] Watching cgroup
/cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
frozen
I0530 17:37:22.856056 24781 cgroups.cpp:1205] Watching cgroup
/cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
frozen
    @     0x7f57595b8c3d  google::LogMessageFatal::~LogMessageFatal()
I0530 17:37:22.958155 24790 cgroups.cpp:1205] Watching cgroup
/cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
frozen
I0530 17:37:23.094172 24785 cgroups.cpp:1205] Watching cgroup
/cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
frozen
    @     0x7f575933a42b
 mesos::internal::slave::CgroupsIsolator::_killExecutor()
I0530 17:37:23.196523 24801 cgroups.cpp:1205] Watching cgroup
/cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
frozen
I0530 17:37:23.334296 24787 cgroups.cpp:1205] Watching cgroup
/cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
frozen
I0530 17:37:23.436355 24788 cgroups.cpp:1205] Watching cgroup
/cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
frozen
I0530 17:37:23.538429 24796 cgroups.cpp:1205] Watching cgroup
/cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
frozen
I0530 17:37:23.640461 24789 cgroups.cpp:1205] Watching cgroup
/cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
frozen
I0530 17:37:23.742709 24800 cgroups.cpp:1205] Watching cgroup
/cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
frozen
I0530 17:37:23.844823 24803 cgroups.cpp:1205] Watching cgroup
/cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
frozen
I0530 17:37:23.946893 24799 cgroups.cpp:1205] Watching cgroup
/cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
frozen
I0530 17:37:24.050462 24807 cgroups.cpp:1205] Watching cgroup
/cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
frozen
    @     0x7f5759355ff9  std::tr1::_Mem_fn<>::operator()()
I0530 17:37:24.152722 24806 cgroups.cpp:1205] Watching cgroup
/cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
frozen
W0530 17:37:24.154515 24806 cgroups.cpp:1263] Unable to freeze
/cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330
within 51 attempts
I0530 17:37:24.159581 24794 cgroups.cpp:1190] Trying to thaw cgroup
/cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330
I0530 17:37:24.160049 24794 cgroups.cpp:1300] Successfully thawed
/cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330
    @     0x7f5759354a60
 _ZNSt3tr15_BindIFNS_7_Mem_fnIMN5mesos8internal5slave15CgroupsIsolatorEFvPNS5_10CgroupInfoERKN7process6FutureIbEEEEENS_12_PlaceholderILi1EEES7_SA_EE6__callIIRPS5_EILi0ELi1ELi2EEEENS_9result_ofIFSF_NSN_IFNS_3_MuISH_Lb0ELb1EEESH_NS_5tupleIIDpT_EEEEE4typeENSN_IFNSO_IS7_Lb0ELb0EEES7_ST_EE4typeENSN_IFNSO_ISA_Lb0ELb0EEESA_ST_EE4typeEEE4typeERKST_NS_12_Index_tupleIIXspT0_EEEE
    @     0x7f57593514d0
 _ZNSt3tr15_BindIFNS_7_Mem_fnIMN5mesos8internal5slave15CgroupsIsolatorEFvPNS5_10CgroupInfoERKN7process6FutureIbEEEEENS_12_PlaceholderILi1EEES7_SA_EEclIIPS5_EEENS_9result_ofIFSF_NSM_IFNS_3_MuISH_Lb0ELb1EEESH_NS_5tupleIIDpT_EEEEE4typeENSM_IFNSN_IS7_Lb0ELb0EEES7_SS_EE4typeENSM_IFNSN_ISA_Lb0ELb0EEESA_SS_EE4typeEEE4typeEDpRSQ_
    @     0x7f575934cd38  std::tr1::_Function_handler<>::_M_invoke()
    @     0x7f575934ced9  std::tr1::function<>::operator()()
    @     0x7f5759347f31  process::internal::vdispatcher<>()
    @     0x7f5759354b7d
 _ZNSt3tr15_BindIFPFvPN7process11ProcessBaseENS_10shared_ptrINS_8functionIFvPN5mesos8internal5slave15CgroupsIsolatorEEEEEEENS_12_PlaceholderILi1EEESD_EE6__callIIRS3_EILi0ELi1EEEENS_9result_ofIFSF_NSM_IFNS_3_MuISH_Lb0ELb1EEESH_NS_5tupleIIDpT_EEEEE4typeENSM_IFNSN_ISD_Lb0ELb0EEESD_SS_EE4typeEEE4typeERKSS_NS_12_Index_tupleIIXspT0_EEEE
    @     0x7f5759351770
 _ZNSt3tr15_BindIFPFvPN7process11ProcessBaseENS_10shared_ptrINS_8functionIFvPN5mesos8internal5slave15CgroupsIsolatorEEEEEEENS_12_PlaceholderILi1EEESD_EEclIIS3_EEENS_9result_ofIFSF_NSL_IFNS_3_MuISH_Lb0ELb1EEESH_NS_5tupleIIDpT_EEEEE4typeENSL_IFNSM_ISD_Lb0ELb0EEESD_SR_EE4typeEEE4typeEDpRSP_
    @     0x7f575934cfc4  std::tr1::_Function_handler<>::_M_invoke()
    @     0x7f57594a1a7b  std::tr1::function<>::operator()()
    @     0x7f575948a55f  process::ProcessBase::visit()
    @     0x7f5759490992  process::DispatchEvent::visit()
    @     0x7f57590761d4  process::ProcessBase::serve()
    @     0x7f5759487cd1  process::ProcessManager::resume()
    @     0x7f575947ee2e  process::schedule()
    @     0x7f5757bc6e9a  start_thread
    @     0x7f57578f3ccd  (unknown)


Can you provide me some hints as to what's happening here?  This is
currently a major blocker for me!

Thanks,

Brenden
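
One way to check whether a slave went quiet before the master removed it is to look for long silences between consecutive entries in its log. The following is a minimal Python sketch, not part of Mesos: it assumes the glog line prefix seen in the logs above (severity letter, MMDD, time with microseconds, thread id, file:line), and the names `parse_timestamps`, `find_gaps`, and the 90-second threshold are illustrative choices only. The year is not present in glog lines, so it is passed in as an assumption.

```python
import re
from datetime import datetime, timedelta

# Assumed glog prefix: severity, MMDD, HH:MM:SS.micros, thread id, file:line]
GLOG_LINE = re.compile(
    r"^(?P<sev>[IWEF])(?P<month>\d{2})(?P<day>\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6}) +(?P<tid>\d+) (?P<src>[^\]]+)\]"
)


def parse_timestamps(lines, year=2013):
    """Yield (datetime, source) for each line that starts with a glog prefix."""
    for line in lines:
        m = GLOG_LINE.match(line)
        if m is None:
            continue  # wrapped continuation line or stack frame
        stamp = datetime.strptime(
            f"{year}{m['month']}{m['day']} {m['time']}", "%Y%m%d %H:%M:%S.%f"
        )
        yield stamp, m["src"]


def find_gaps(lines, threshold=timedelta(seconds=90)):
    """Return (previous, current, gap) for entries further apart than threshold."""
    gaps = []
    previous = None
    for stamp, _src in parse_timestamps(lines):
        if previous is not None and stamp - previous > threshold:
            gaps.append((previous, stamp, stamp - previous))
        previous = stamp
    return gaps


if __name__ == "__main__":
    sample = [
        "I0530 17:32:31.333297 24810 slave.cpp:2498] Current usage 1.55%. Max",
        "allowed age: 6.191838835555544days",
        "I0530 17:34:12.236701 24790 slave.cpp:2498] Current usage 1.59%. Max",
        "I0530 17:37:10.797281 24790 slave.cpp:492] Slave asked to shut down by",
    ]
    for before, after, gap in find_gaps(sample):
        print(f"{before.time()} -> {after.time()}  silent for {gap}")
```

Run against the slave log above, a 90-second threshold flags two silent stretches: 17:32:31 to 17:34:12 and 17:34:12 to 17:37:10. The disk usage line otherwise appears once a minute, so both stretches suggest the slave was already lagging before the shutdown request arrived.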

Re: Slaves deactivating

Posted by Brenden Matthews <br...@airbedandbreakfast.com>.
Hi Ben,

I can't find any good examples of it right now, but from what I recall
there wasn't anything interesting in the log prior.  If it happens again
I'll let you know.
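
For readers following the health-check discussion quoted below, the removal rule can be modelled in a few lines. This is a hedged Python sketch, not the Mesos implementation: the thread only states that SLAVE_PING_TIMEOUT and MAX_SLAVE_PING_TIMEOUTS together amount to 75s, so the split into 15 seconds times 5 misses is an assumption to verify against the source of the version in use, and the class name `PingTracker` and the reset-on-reply behaviour are illustrative.

```python
from dataclasses import dataclass

# Assumed split of the 75s quoted in the thread; verify against your Mesos version.
SLAVE_PING_TIMEOUT_SECS = 15.0
MAX_SLAVE_PING_TIMEOUTS = 5


@dataclass
class PingTracker:
    """Hypothetical model of a master counting a slave's missed health checks."""

    ping_timeout: float = SLAVE_PING_TIMEOUT_SECS
    max_timeouts: int = MAX_SLAVE_PING_TIMEOUTS
    missed: int = 0

    @property
    def shutdown_after_secs(self) -> float:
        """Longest unresponsive period tolerated before removal."""
        return self.ping_timeout * self.max_timeouts

    def record(self, replied: bool) -> bool:
        """Record one ping interval; return True once the slave should be removed."""
        if replied:
            self.missed = 0  # assumption: any reply resets the count
            return False
        self.missed += 1
        return self.missed >= self.max_timeouts


if __name__ == "__main__":
    tracker = PingTracker()
    print(f"removed after {tracker.shutdown_after_secs:.0f}s of silence")
    # A slave that replies once after four misses is not removed in this model.
    outcome = [tracker.record(r) for r in (False, False, False, False, True, False)]
    print(outcome)
```

Under this model, a slave that is merely slow survives as long as it answers at least one ping in every window of five, which is why raising either value is a blunt workaround rather than a fix for an unresponsive slave.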


On Wed, Jun 5, 2013 at 10:46 AM, Benjamin Mahler <be...@gmail.com> wrote:

> Hey Brenden, can you provide more of the slave log if you still have it?
> It's likely something was causing the slave to hang, so it would be useful
> to see what happened prior to the first log line you posted. We're
> investigating an issue at Twitter where slaves can hang for anywhere from
> 30 minutes to a few hours, likely related to the cgroups freezer.
>
>
> On Thu, May 30, 2013 at 11:50 AM, Brenden Matthews <
> brenden.matthews@airbedandbreakfast.com> wrote:
>
> > I agree with you that slaves which fail health checks should be removed.
> > I suspect this is just a matter of tuning, and perhaps an issue related
> > to EC2.  I'll try increasing the value and see if that helps for now.
> >
> > I also found a misconfigured filesystem, so perhaps Mesos is not the
> > culprit here :)
> >
> > On Thu, May 30, 2013 at 11:30 AM, Vinod Kone <vi...@gmail.com> wrote:
> >
> > > Hmm. I'm not sure I agree. If a slave is not responding to health
> > > checks, that seems bad to me. A framework would be better off if the
> > > slave is shut down so that it can launch its tasks elsewhere in the
> > > cluster.
> > >
> > > The current parameters (SLAVE_PING_TIMEOUT, MAX_SLAVE_PING_TIMEOUTS)
> > > are such that a slave not responding to health checks for 75s is shut
> > > down by the master. That seems reasonable to me? If you want that to be
> > > tunable, however, we can expose them via master flags.
> > >
> > > Having said that, the underlying problem that we need to diagnose/fix
> > > is to ensure the slave is responsive.
> > >
> > >
> > > On Thu, May 30, 2013 at 11:01 AM, Brenden Matthews <
> > > brenden.matthews@airbedandbreakfast.com> wrote:
> > >
> > > > The slave is running a Hadoop task and is probably under heavy load.
> > > > I think it's normal for it to occasionally respond slowly to health
> > > > checks, and Mesos shouldn't be trying to kill it because of this.  I'm
> > > > not too concerned about the kill failing; I'm more concerned with the
> > > > fact that the process is being erroneously killed in the first place.
> > > >
> > > >
> > > > On Thu, May 30, 2013 at 10:55 AM, Vinod Kone <vi...@gmail.com> wrote:
> > > >
> > > > > Sounds like the slave is not responding to health checks by the
> > > > > master. Does this happen right after you start the slave or after a
> > > > > while? Are you able to get a system load graph during this time?
> > > > >
> > > > > Also, the check failure while cleaning cgroups is clearly a bug
> > > > > (likely related to MESOS-461
> > > > > <https://issues.apache.org/jira/browse/MESOS-461>).
> > > > >
> > > > >
> > > > > On Thu, May 30, 2013 at 10:43 AM, Brenden Matthews <
> > > > > brenden.matthews@airbedandbreakfast.com> wrote:
> > > > >
> > > > > > I0530 17:37:19.758436 24806 cgroups.cpp:1205] Watching cgroup
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:19.894443 24792 cgroups.cpp:1205] Watching cgroup
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:19.996562 24797 cgroups.cpp:1205] Watching cgroup
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:20.134127 24783 cgroups.cpp:1205] Watching cgroup
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:20.270037 24804 cgroups.cpp:1205] Watching cgroup
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:20.406051 24805 cgroups.cpp:1205] Watching cgroup
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:20.542073 24808 cgroups.cpp:1205] Watching cgroup
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:20.678064 24794 cgroups.cpp:1205] Watching cgroup
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:20.814045 24798 cgroups.cpp:1205] Watching cgroup
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:20.950011 24810 cgroups.cpp:1205] Watching cgroup
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:21.052018 24780 cgroups.cpp:1205] Watching cgroup
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:21.186023 24786 cgroups.cpp:1205] Watching cgroup
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:21.322306 24781 cgroups.cpp:1205] Watching cgroup
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > >     @     0x7f57595b5c1d  google::LogMessage::Fail()
> > > > > > I0530 17:37:21.424214 24787 cgroups.cpp:1205] Watching cgroup
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > >     @     0x7f57595b83af  google::LogMessage::SendToLog()
> > > > > > I0530 17:37:21.526203 24790 cgroups.cpp:1205] Watching cgroup
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:21.662224 24785 cgroups.cpp:1205] Watching cgroup
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:21.764526 24788 cgroups.cpp:1205] Watching cgroup
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:21.866744 24802 cgroups.cpp:1205] Watching cgroup
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:21.969965 24780 cgroups.cpp:1205] Watching cgroup
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:22.072257 24791 cgroups.cpp:1205] Watching cgroup
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:22.210059 24811 cgroups.cpp:1205] Watching cgroup
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:22.346051 24782 cgroups.cpp:1205] Watching cgroup
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > >     @     0x7f57595b581b  google::LogMessage::Flush()
> > > > > > I0530 17:37:22.482100 24795 cgroups.cpp:1205] Watching cgroup
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:22.618103 24784 cgroups.cpp:1205] Watching cgroup
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:22.754070 24786 cgroups.cpp:1205] Watching cgroup
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:22.856056 24781 cgroups.cpp:1205] Watching cgroup
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > >     @     0x7f57595b8c3d  google::LogMessageFatal::~LogMessageFatal()
> > > > > > I0530 17:37:22.958155 24790 cgroups.cpp:1205] Watching cgroup
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:23.094172 24785 cgroups.cpp:1205] Watching cgroup
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > >     @     0x7f575933a42b  mesos::internal::slave::CgroupsIsolator::_killExecutor()
> > > > > > I0530 17:37:23.196523 24801 cgroups.cpp:1205] Watching cgroup
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:23.334296 24787 cgroups.cpp:1205] Watching cgroup
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:23.436355 24788 cgroups.cpp:1205] Watching cgroup
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:23.538429 24796 cgroups.cpp:1205] Watching cgroup
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:23.640461 24789 cgroups.cpp:1205] Watching cgroup
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:23.742709 24800 cgroups.cpp:1205] Watching cgroup
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:23.844823 24803 cgroups.cpp:1205] Watching cgroup
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:23.946893 24799 cgroups.cpp:1205] Watching cgroup
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:24.050462 24807 cgroups.cpp:1205] Watching cgroup
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > >     @     0x7f5759355ff9  std::tr1::_Mem_fn<>::operator()()
> > > > > > I0530 17:37:24.152722 24806 cgroups.cpp:1205] Watching cgroup
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > > W0530 17:37:24.154515 24806 cgroups.cpp:1263] Unable to freeze
> > > > > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330
> > > > > > > within 51 attempts
> > > > > > > I0530 17:37:24.159581 24794 cgroups.cpp:1190] Trying to thaw cgroup
> > > > > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330
> > > > > > > I0530 17:37:24.160049 24794 cgroups.cpp:1300] Successfully thawed
> > > > > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330
> > > > > > >     @     0x7f5759354a60  _ZNSt3tr15_BindIFNS_7_Mem_fnIMN5mesos8internal5slave15CgroupsIsolatorEFvPNS5_10CgroupInfoERKN7process6FutureIbEEEEENS_12_PlaceholderILi1EEES7_SA_EE6__callIIRPS5_EILi0ELi1ELi2EEEENS_9result_ofIFSF_NSN_IFNS_3_MuISH_Lb0ELb1EEESH_NS_5tupleIIDpT_EEEEE4typeENSN_IFNSO_IS7_Lb0ELb0EEES7_ST_EE4typeENSN_IFNSO_ISA_Lb0ELb0EEESA_ST_EE4typeEEE4typeERKST_NS_12_Index_tupleIIXspT0_EEEE
> > > > > > >     @     0x7f57593514d0  _ZNSt3tr15_BindIFNS_7_Mem_fnIMN5mesos8internal5slave15CgroupsIsolatorEFvPNS5_10CgroupInfoERKN7process6FutureIbEEEEENS_12_PlaceholderILi1EEES7_SA_EEclIIPS5_EEENS_9result_ofIFSF_NSM_IFNS_3_MuISH_Lb0ELb1EEESH_NS_5tupleIIDpT_EEEEE4typeENSM_IFNSN_IS7_Lb0ELb0EEES7_SS_EE4typeENSM_IFNSN_ISA_Lb0ELb0EEESA_SS_EE4typeEEE4typeEDpRSQ_
> > > > > > >     @     0x7f575934cd38  std::tr1::_Function_handler<>::_M_invoke()
> > > > > > >     @     0x7f575934ced9  std::tr1::function<>::operator()()
> > > > > > >     @     0x7f5759347f31  process::internal::vdispatcher<>()
> > > > > > >     @     0x7f5759354b7d  _ZNSt3tr15_BindIFPFvPN7process11ProcessBaseENS_10shared_ptrINS_8functionIFvPN5mesos8internal5slave15CgroupsIsolatorEEEEEEENS_12_PlaceholderILi1EEESD_EE6__callIIRS3_EILi0ELi1EEEENS_9result_ofIFSF_NSM_IFNS_3_MuISH_Lb0ELb1EEESH_NS_5tupleIIDpT_EEEEE4typeENSM_IFNSN_ISD_Lb0ELb0EEESD_SS_EE4typeEEE4typeERKSS_NS_12_Index_tupleIIXspT0_EEEE
> > > > > > >     @     0x7f5759351770  _ZNSt3tr15_BindIFPFvPN7process11ProcessBaseENS_10shared_ptrINS_8functionIFvPN5mesos8internal5slave15CgroupsIsolatorEEEEEEENS_12_PlaceholderILi1EEESD_EEclIIS3_EEENS_9result_ofIFSF_NSL_IFNS_3_MuISH_Lb0ELb1EEESH_NS_5tupleIIDpT_EEEEE4typeENSL_IFNSM_ISD_Lb0ELb0EEESD_SR_EE4typeEEE4typeEDpRSP_
> > > > > > >     @     0x7f575934cfc4  std::tr1::_Function_handler<>::_M_invoke()
> > > > > >     @     0x7f57594a1a7b  std::tr1::function<>::operator()()
> > > > > >     @     0x7f575948a55f  process::ProcessBase::visit()
> > > > > >     @     0x7f5759490992  process::DispatchEvent::visit()
> > > > > >     @     0x7f57590761d4  process::ProcessBase::serve()
> > > > > >     @     0x7f5759487cd1  process::ProcessManager::resume()
> > > > > >     @     0x7f575947ee2e  process::schedule()
> > > > > >     @     0x7f5757bc6e9a  start_thread
> > > > > >     @     0x7f57578f3ccd  (unknown)
> > > > > >
> > > > > >
> > > > > > > Can you provide me some hints as to what's happening here?  This is
> > > > > > > currently a major blocker for me!
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Brenden
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: Slaves deactivating

Posted by Benjamin Mahler <be...@gmail.com>.
Hey Brenden, can you provide more of the slave log if you still have it?
It's likely something was causing the slave to hang, so it would be useful
to see what happened prior to the first log line you posted. We're
investigating an issue at Twitter where slaves can hang for anywhere from 30
minutes to a few hours, likely related to the cgroups freezer.
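
If the slave gets into that state again, it may help to look at the cgroup directly. The quoted log shows the slave watching the chronos executor's cgroup "until frozen" and giving up after 51 attempts. Below is a rough Python sketch for inspecting such a cgroup by hand. It assumes a cgroup v1 hierarchy with the freezer subsystem attached (so the directory has `freezer.state` and `tasks` files) and a standard Linux /proc; the function names are made up for illustration and are not part of Mesos. A task that stays in state "D" (uninterruptible sleep) is one possible reason a cgroup sits in FREEZING, and a PID with no /proc entry matches the "No such process" error in the log.

```python
import os


def freezer_state(cgroup_dir):
    """Return the freezer state of a cgroup: THAWED, FREEZING or FROZEN."""
    with open(os.path.join(cgroup_dir, "freezer.state")) as f:
        return f.read().strip()


def task_states(cgroup_dir, proc_root="/proc"):
    """Map each PID listed in the cgroup's `tasks` file to its process state.

    The state is the single letter from /proc/<pid>/stat (R, S, D, Z, ...).
    A PID that has already exited maps to None.
    """
    with open(os.path.join(cgroup_dir, "tasks")) as f:
        pids = [int(line) for line in f if line.strip()]

    states = {}
    for pid in pids:
        try:
            with open(os.path.join(proc_root, str(pid), "stat")) as f:
                stat = f.read()
        except OSError:
            # The process went away between listing and inspection.
            states[pid] = None
            continue
        # Format is "pid (comm) state ..."; comm may itself contain
        # spaces or ")" so split after the last closing parenthesis.
        states[pid] = stat[stat.rindex(")") + 1:].split()[0]
    return states
```

For example, `freezer_state(path)` followed by `task_states(path)`, with `path` set to the executor's cgroup directory under /cgroup/mesos, shows whether the cgroup is stuck in FREEZING and which tasks are still alive in it.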


On Thu, May 30, 2013 at 11:50 AM, Brenden Matthews <
brenden.matthews@airbedandbreakfast.com> wrote:

> I agree with you that slaves which fail health checks should be removed.  I
> suspect this is just a matter of tuning, and perhaps an issue related to
> EC2.  I'll try increasing the value and see if that helps for now.
>
> I also found a misconfigured filesystem, so perhaps Mesos is not the culprit
> here :)
>
> On Thu, May 30, 2013 at 11:30 AM, Vinod Kone <vi...@gmail.com> wrote:
>
> > Hmm. I'm not sure I agree. If a slave is not responding to health checks,
> > that seems bad to me. A framework would be better off if the slave is
> > shut down so that it can launch its tasks elsewhere in the cluster.
> >
> > The current parameters (SLAVE_PING_TIMEOUT, MAX_SLAVE_PING_TIMEOUTS) are
> > such that a slave not responding to health checks for 75s is shut down by
> > the master. That seems reasonable to me? If you want that to be tunable,
> > however, we can expose them via master flags.
> >
> > Having said that, the underlying problem that we need to diagnose/fix is to
> > ensure the slave is responsive.
> >
> >
> > On Thu, May 30, 2013 at 11:01 AM, Brenden Matthews <
> > brenden.matthews@airbedandbreakfast.com> wrote:
> >
> > > The slave is running a Hadoop task and is probably under heavy load.  I
> > > think it's normal for it to occasionally respond slowly to health checks,
> > > and Mesos shouldn't be trying to kill it because of this.  I'm not too
> > > concerned about the kill failing; I'm more concerned with the fact that the
> > > process is being erroneously killed in the first place.
> > >
> > >
> > > On Thu, May 30, 2013 at 10:55 AM, Vinod Kone <vi...@gmail.com> wrote:
> > >
> > > > Sounds like the slave is not responding to health checks by the master.
> > > > Does this happen right after you start the slave or after a while? Are you
> > > > able to get a system load graph during this time?
> > > >
> > > > Also, the check failure while cleaning cgroups is clearly a bug (likely
> > > > related to MESOS-461 <https://issues.apache.org/jira/browse/MESOS-461>).
> > > >
> > > >
> > > > On Thu, May 30, 2013 at 10:43 AM, Brenden Matthews <
> > > > brenden.matthews@airbedandbreakfast.com> wrote:
> > > >
> > > > > Hey guys,
> > > > >
> > > > > I'm having a frequent problem right now in master.  Slaves keep
> > > > > deactivating and I'm unsure why.  Here's the master log:
> > > > >
> > > > > > W0530 17:35:32.300029 21798 master.cpp:1199] Removing slave
> > > > > > 201305300057-1471680778-5050-21299-144 at
> > > > > > slave(1)@10.148.178.186:5051 because it has been deactivated
> > > > > > I0530 17:35:32.300742 21800 hierarchical_allocator_process.hpp:423] Removed
> > > > > > slave 201305300057-1471680778-5050-21299-144
> > > > > > I0530 17:35:32.302295 21798 master.hpp:295] Removing task Task_Tracker_475
> > > > > > with resources cpus=23.25; mem=51150; disk=126976; ports=[31001-31001,
> > > > > > 31999-31999] on slave 201305300057-1471680778-5050-21299-144
> > > > > > I0530 17:35:32.304235 21798 master.hpp:295] Removing task
> > > > > > ct:join_search_request_yesterday:1369931394916:2 with resources cpus=1;
> > > > > > mem=1; disk=1 on slave 201305300057-1471680778-5050-21299-144
> > > > > > I0530 17:35:32.306157 21798 master.hpp:295] Removing task Task_Tracker_451
> > > > > > with resources cpus=2.25; mem=4950; disk=12288; ports=[31000-31000,
> > > > > > 32000-32000] on slave 201305300057-1471680778-5050-21299-144
> > > > >
> > > > >
> > > > > And here's the slave log:
> > > > >
> > > > > > I0530 17:29:31.326658 24787 slave.cpp:2498] Current usage 0.66%. Max
> > > > > > allowed age: 6.253510014347754days
> > > > > > I0530 17:30:31.328030 24787 slave.cpp:2498] Current usage 0.67%. Max
> > > > > > allowed age: 6.253270522087396days
> > > > > > I0530 17:31:31.329982 24798 slave.cpp:2498] Current usage 1.44%. Max
> > > > > > allowed age: 6.199047245757789days
> > > > > > I0530 17:32:31.333297 24810 slave.cpp:2498] Current usage 1.55%. Max
> > > > > > allowed age: 6.191838835555544days
> > > > > > I0530 17:34:12.236701 24790 slave.cpp:2498] Current usage 1.59%. Max
> > > > > > allowed age: 6.188834669188438days
> > > > > > I0530 17:37:10.797281 24790 slave.cpp:492] Slave asked to shut down by
> > > > > > master@10.17.184.87:5050
> > > > > > I0530 17:37:10.938212 24790 slave.cpp:1114] Asked to shut down framework
> > > > > > 201305290115-1471680778-5050-30247-0001 by master@10.17.184.87:5050
> > > > > I0530 17:37:10.954860 24790 slave.cpp:1139] Shutting down framework
> > > > > 201305290115-1471680778-5050-30247-0001
> > > > > I0530 17:37:10.955477 24790 slave.cpp:2315] Shutting down executor
> > > > > 'executor_Task_Tracker_475' of framework
> > > > > 201305290115-1471680778-5050-30247-0001
> > > > > I0530 17:37:10.956305 24790 slave.cpp:2315] Shutting down executor
> > > > > 'executor_Task_Tracker_451' of framework
> > > > > 201305290115-1471680778-5050-30247-0001
> > > > > I0530 17:37:10.956914 24790 slave.cpp:1114] Asked to shut down
> > > framework
> > > > > chronos by master@10.17.184.87:5050
> > > > > I0530 17:37:10.992754 24790 slave.cpp:1139] Shutting down framework
> > > > chronos
> > > > > I0530 17:37:11.028842 24790 slave.cpp:2315] Shutting down executor
> > > > > 'ct:join_search_request_yesterday:1369931394916:2' of framework
> > chronos
> > > > > I0530 17:37:14.847417 24809 cgroups_isolator.cpp:804] Executor
> > > > > ct:join_search_request_yesterday:1369931394916:2 of framework
> chronos
> > > > > terminated with status 0
> > > > > I0530 17:37:15.957542 24792 slave.cpp:2384] Killing executor
> > > > > 'executor_Task_Tracker_475' of framework
> > > > > 201305290115-1471680778-5050-30247-0001
> > > > > I0530 17:37:16.048948 24809 cgroups_isolator.cpp:620] Killing
> > executor
> > > > > ct:join_search_request_yesterday:1369931394916:2 of framework
> chronos
> > > > > I0530 17:37:16.085058 24792 slave.cpp:2384] Killing executor
> > > > > 'executor_Task_Tracker_451' of framework
> > > > > 201305290115-1471680778-5050-30247-0001
> > > > > I0530 17:37:16.086105 24792 slave.cpp:2384] Killing executor
> > > > > 'ct:join_search_request_yesterday:1369931394916:2' of framework
> > chronos
> > > > > W0530 17:37:16.089593 24790 monitor.cpp:167] Failed to collect
> > resource
> > > > > usage for executor
> 'ct:join_search_request_yesterday:1369931394916:2'
> > > of
> > > > > framework 'chronos': Unknown or killed executor
> > > > > I0530 17:37:16.089761 24809 cgroups_isolator.cpp:620] Killing
> > executor
> > > > > executor_Task_Tracker_475 of framework
> > > > > 201305290115-1471680778-5050-30247-0001
> > > > > I0530 17:37:16.093415 24810 cgroups.cpp:1175] Trying to freeze
> cgroup
> > > > >
> > > > >
> > > >
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330
> > > > > I0530 17:37:16.161749 24810 cgroups.cpp:1205] Watching cgroup
> > > > >
> > > > >
> > > >
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > frozen
> > > > > I0530 17:37:16.163240 24809 cgroups_isolator.cpp:1023] OOM notifier
> > is
> > > > > triggered for executor
> > ct:join_search_request_yesterday:1369931394916:2
> > > > of
> > > > > framework chronos with uuid 199c6567-1296-427f-a74e-13742b95a330
> > > > > I0530 17:37:16.163627 24809 cgroups_isolator.cpp:1028] Discarded
> OOM
> > > > > notifier for executor
> > ct:join_search_request_yesterday:1369931394916:2
> > > of
> > > > > framework chronos with uuid 199c6567-1296-427f-a74e-13742b95a330
> > > > > I0530 17:37:16.164384 24809 cgroups_isolator.cpp:620] Killing
> > executor
> > > > > executor_Task_Tracker_451 of framework
> > > > > 201305290115-1471680778-5050-30247-0001
> > > > > E0530 17:37:16.166890 24809 cgroups_isolator.cpp:616] Asked to kill
> > an
> > > > > unknown/killed executor!
> > > > > I0530 17:37:16.241206 24809 cgroups_isolator.cpp:1023] OOM notifier
> > is
> > > > > triggered for executor executor_Task_Tracker_475 of framework
> > > > > 201305290115-1471680778-5050-30247-0001 with uuid
> > > > > e90a8ce9-812d-4757-833e-62c55ada5cda
> > > > > I0530 17:37:16.264039 24786 cgroups.cpp:1205] Watching cgroup
> > > > >
> > > > >
> > > >
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > frozen
> > > > > I0530 17:37:16.290067 24809 cgroups_isolator.cpp:1028] Discarded
> OOM
> > > > > notifier for executor executor_Task_Tracker_475 of framework
> > > > > 201305290115-1471680778-5050-30247-0001 with uuid
> > > > > e90a8ce9-812d-4757-833e-62c55ada5cda
> > > > > I0530 17:37:16.313125 24809 cgroups_isolator.cpp:1023] OOM notifier
> > is
> > > > > triggered for executor executor_Task_Tracker_451 of framework
> > > > > 201305290115-1471680778-5050-30247-0001 with uuid
> > > > > 10c5125d-cca4-42b7-a11a-a7fda594b005
> > > > > I0530 17:37:16.321915 24809 cgroups_isolator.cpp:1028] Discarded
> OOM
> > > > > notifier for executor executor_Task_Tracker_451 of framework
> > > > > 201305290115-1471680778-5050-30247-0001 with uuid
> > > > > 10c5125d-cca4-42b7-a11a-a7fda594b005
> > > > > > F0530 17:37:16.322549 24809 cgroups_isolator.cpp:1165] Failed to destroy cgroup
> > > > > > mesos/framework_201305290115-1471680778-5050-30247-0001_executor_executor_Task_Tracker_475_tag_e90a8ce9-812d-4757-833e-62c55ada5cda:
> > > > > > Failed to kill tasks in nested cgroups: Collect failed: Failed to send
> > > > > > Killed to process 721: No such process
> > > > > *** Check failure stack trace: ***
> > > > > I0530 17:37:16.405956 24802 cgroups.cpp:1205] Watching cgroup
> > > > >
> > > > >
> > > >
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > frozen
> > > > > I0530 17:37:18.258474 24787 cgroups.cpp:1205] Watching cgroup
> > > > >
> > > > >
> > > >
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > frozen
> > > > > I0530 17:37:18.394127 24781 cgroups.cpp:1205] Watching cgroup
> > > > >
> > > > >
> > > >
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > frozen
> > > > > I0530 17:37:18.530743 24785 cgroups.cpp:1205] Watching cgroup
> > > > >
> > > > >
> > > >
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > frozen
> > > > > I0530 17:37:18.670518 24788 cgroups.cpp:1205] Watching cgroup
> > > > >
> > > > >
> > > >
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > frozen
> > > > > I0530 17:37:18.806540 24803 cgroups.cpp:1205] Watching cgroup
> > > > >
> > > > >
> > > >
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > frozen
> > > > > I0530 17:37:18.942659 24796 cgroups.cpp:1205] Watching cgroup
> > > > >
> > > > >
> > > >
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > frozen
> > > > > I0530 17:37:19.078434 24807 cgroups.cpp:1205] Watching cgroup
> > > > >
> > > > >
> > > >
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > frozen
> > > > > I0530 17:37:19.214282 24789 cgroups.cpp:1205] Watching cgroup
> > > > >
> > > > >
> > > >
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > frozen
> > > > > I0530 17:37:19.350401 24793 cgroups.cpp:1205] Watching cgroup
> > > > >
> > > > >
> > > >
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > frozen
> > > > > I0530 17:37:19.486559 24800 cgroups.cpp:1205] Watching cgroup
> > > > >
> > > > >
> > > >
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > frozen
> > > > > I0530 17:37:19.622144 24799 cgroups.cpp:1205] Watching cgroup
> > > > >
> > > > >
> > > >
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > frozen
> > > > > > I0530 17:37:19.758436 24806 cgroups.cpp:1205] Watching cgroup
> > > > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:19.894443 24792 cgroups.cpp:1205] Watching cgroup
> > > > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:19.996562 24797 cgroups.cpp:1205] Watching cgroup
> > > > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:20.134127 24783 cgroups.cpp:1205] Watching cgroup
> > > > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:20.270037 24804 cgroups.cpp:1205] Watching cgroup
> > > > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:20.406051 24805 cgroups.cpp:1205] Watching cgroup
> > > > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:20.542073 24808 cgroups.cpp:1205] Watching cgroup
> > > > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:20.678064 24794 cgroups.cpp:1205] Watching cgroup
> > > > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:20.814045 24798 cgroups.cpp:1205] Watching cgroup
> > > > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:20.950011 24810 cgroups.cpp:1205] Watching cgroup
> > > > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:21.052018 24780 cgroups.cpp:1205] Watching cgroup
> > > > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:21.186023 24786 cgroups.cpp:1205] Watching cgroup
> > > > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:21.322306 24781 cgroups.cpp:1205] Watching cgroup
> > > > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > >     @     0x7f57595b5c1d  google::LogMessage::Fail()
> > > > > > I0530 17:37:21.424214 24787 cgroups.cpp:1205] Watching cgroup
> > > > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > >     @     0x7f57595b83af  google::LogMessage::SendToLog()
> > > > > > I0530 17:37:21.526203 24790 cgroups.cpp:1205] Watching cgroup
> > > > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:21.662224 24785 cgroups.cpp:1205] Watching cgroup
> > > > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:21.764526 24788 cgroups.cpp:1205] Watching cgroup
> > > > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:21.866744 24802 cgroups.cpp:1205] Watching cgroup
> > > > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:21.969965 24780 cgroups.cpp:1205] Watching cgroup
> > > > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:22.072257 24791 cgroups.cpp:1205] Watching cgroup
> > > > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:22.210059 24811 cgroups.cpp:1205] Watching cgroup
> > > > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:22.346051 24782 cgroups.cpp:1205] Watching cgroup
> > > > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > >     @     0x7f57595b581b  google::LogMessage::Flush()
> > > > > > I0530 17:37:22.482100 24795 cgroups.cpp:1205] Watching cgroup
> > > > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:22.618103 24784 cgroups.cpp:1205] Watching cgroup
> > > > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:22.754070 24786 cgroups.cpp:1205] Watching cgroup
> > > > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:22.856056 24781 cgroups.cpp:1205] Watching cgroup
> > > > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > >     @     0x7f57595b8c3d
> > > > > >  google::LogMessageFatal::~LogMessageFatal()
> > > > > > I0530 17:37:22.958155 24790 cgroups.cpp:1205] Watching cgroup
> > > > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:23.094172 24785 cgroups.cpp:1205] Watching cgroup
> > > > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > >     @     0x7f575933a42b
> > > > > >  mesos::internal::slave::CgroupsIsolator::_killExecutor()
> > > > > > I0530 17:37:23.196523 24801 cgroups.cpp:1205] Watching cgroup
> > > > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:23.334296 24787 cgroups.cpp:1205] Watching cgroup
> > > > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:23.436355 24788 cgroups.cpp:1205] Watching cgroup
> > > > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:23.538429 24796 cgroups.cpp:1205] Watching cgroup
> > > > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:23.640461 24789 cgroups.cpp:1205] Watching cgroup
> > > > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:23.742709 24800 cgroups.cpp:1205] Watching cgroup
> > > > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:23.844823 24803 cgroups.cpp:1205] Watching cgroup
> > > > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:23.946893 24799 cgroups.cpp:1205] Watching cgroup
> > > > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > I0530 17:37:24.050462 24807 cgroups.cpp:1205] Watching cgroup
> > > > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > >     @     0x7f5759355ff9  std::tr1::_Mem_fn<>::operator()()
> > > > > > I0530 17:37:24.152722 24806 cgroups.cpp:1205] Watching cgroup
> > > > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > > > frozen
> > > > > > W0530 17:37:24.154515 24806 cgroups.cpp:1263] Unable to freeze
> > > > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330
> > > > > > within 51 attempts
> > > > > > I0530 17:37:24.159581 24794 cgroups.cpp:1190] Trying to thaw cgroup
> > > > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330
> > > > > > I0530 17:37:24.160049 24794 cgroups.cpp:1300] Successfully thawed
> > > > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330
> > > > > >     @     0x7f5759354a60
> > > > > >  _ZNSt3tr15_BindIFNS_7_Mem_fnIMN5mesos8internal5slave15CgroupsIsolatorEFvPNS5_10CgroupInfoERKN7process6FutureIbEEEEENS_12_PlaceholderILi1EEES7_SA_EE6__callIIRPS5_EILi0ELi1ELi2EEEENS_9result_ofIFSF_NSN_IFNS_3_MuISH_Lb0ELb1EEESH_NS_5tupleIIDpT_EEEEE4typeENSN_IFNSO_IS7_Lb0ELb0EEES7_ST_EE4typeENSN_IFNSO_ISA_Lb0ELb0EEESA_ST_EE4typeEEE4typeERKST_NS_12_Index_tupleIIXspT0_EEEE
> > > > > >     @     0x7f57593514d0
> > > > > >  _ZNSt3tr15_BindIFNS_7_Mem_fnIMN5mesos8internal5slave15CgroupsIsolatorEFvPNS5_10CgroupInfoERKN7process6FutureIbEEEEENS_12_PlaceholderILi1EEES7_SA_EEclIIPS5_EEENS_9result_ofIFSF_NSM_IFNS_3_MuISH_Lb0ELb1EEESH_NS_5tupleIIDpT_EEEEE4typeENSM_IFNSN_IS7_Lb0ELb0EEES7_SS_EE4typeENSM_IFNSN_ISA_Lb0ELb0EEESA_SS_EE4typeEEE4typeEDpRSQ_
> > > > > >     @     0x7f575934cd38
> > > > > >  std::tr1::_Function_handler<>::_M_invoke()
> > > > >     @     0x7f575934ced9  std::tr1::function<>::operator()()
> > > > >     @     0x7f5759347f31  process::internal::vdispatcher<>()
> > > > > >     @     0x7f5759354b7d
> > > > > >  _ZNSt3tr15_BindIFPFvPN7process11ProcessBaseENS_10shared_ptrINS_8functionIFvPN5mesos8internal5slave15CgroupsIsolatorEEEEEEENS_12_PlaceholderILi1EEESD_EE6__callIIRS3_EILi0ELi1EEEENS_9result_ofIFSF_NSM_IFNS_3_MuISH_Lb0ELb1EEESH_NS_5tupleIIDpT_EEEEE4typeENSM_IFNSN_ISD_Lb0ELb0EEESD_SS_EE4typeEEE4typeERKSS_NS_12_Index_tupleIIXspT0_EEEE
> > > > > >     @     0x7f5759351770
> > > > > >  _ZNSt3tr15_BindIFPFvPN7process11ProcessBaseENS_10shared_ptrINS_8functionIFvPN5mesos8internal5slave15CgroupsIsolatorEEEEEEENS_12_PlaceholderILi1EEESD_EEclIIS3_EEENS_9result_ofIFSF_NSL_IFNS_3_MuISH_Lb0ELb1EEESH_NS_5tupleIIDpT_EEEEE4typeENSL_IFNSM_ISD_Lb0ELb0EEESD_SR_EE4typeEEE4typeEDpRSP_
> > > > > >     @     0x7f575934cfc4
> > > > > >  std::tr1::_Function_handler<>::_M_invoke()
> > > > >     @     0x7f57594a1a7b  std::tr1::function<>::operator()()
> > > > >     @     0x7f575948a55f  process::ProcessBase::visit()
> > > > >     @     0x7f5759490992  process::DispatchEvent::visit()
> > > > >     @     0x7f57590761d4  process::ProcessBase::serve()
> > > > >     @     0x7f5759487cd1  process::ProcessManager::resume()
> > > > >     @     0x7f575947ee2e  process::schedule()
> > > > >     @     0x7f5757bc6e9a  start_thread
> > > > >     @     0x7f57578f3ccd  (unknown)
> > > > >
> > > > >
> > > > > Can you provide me some hints as to what's happening here?  This is
> > > > > currently a major blocker for me!
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Brenden
> > > > >
> > > >
> > >
> >
>

Re: Slaves deactivating

Posted by Brenden Matthews <br...@airbedandbreakfast.com>.
I agree with you that slaves which fail health checks should be removed.  I
suspect this is just a matter of tuning, and perhaps an issue related to
EC2.  I'll try increasing the timeout value and see if that helps for now.

I also found a misconfigured filesystem, so perhaps Mesos is not the culprit
here :)

On Thu, May 30, 2013 at 11:30 AM, Vinod Kone <vi...@gmail.com> wrote:

> Hmm. I'm not sure I agree. If a slave is not responding to health checks,
> that seems bad to me. A framework would be better off if the slave is
> shut down so that it can launch its tasks elsewhere in the cluster.
>
> The current parameters (SLAVE_PING_TIMEOUT, MAX_SLAVE_PING_TIMEOUTS) are
> such that a slave not responding to health checks for 75s is shut down by
> the master. That seems reasonable to me? If you want that to be tunable,
> however, we can expose them via master flags.
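
As a rough illustration of the arithmetic behind that 75s figure, here is a
minimal sketch. It assumes a 15s ping timeout and 5 tolerated missed pings
(values that multiply to the 75s quoted above but are not stated anywhere in
this thread), and it assumes that a reply resets the miss counter. The
function names are made up for illustration and are not Mesos APIs.

```python
# Assumed values: the thread only gives the 75s product, not the factors.
SLAVE_PING_TIMEOUT_SECS = 15   # assumption: seconds the master waits per ping
MAX_SLAVE_PING_TIMEOUTS = 5    # assumption: consecutive misses tolerated


def removal_deadline_secs(ping_timeout_secs=SLAVE_PING_TIMEOUT_SECS,
                          max_timeouts=MAX_SLAVE_PING_TIMEOUTS):
    """Seconds of continuous silence before the master gives up on a slave."""
    return ping_timeout_secs * max_timeouts


def first_removal_index(pongs, max_timeouts=MAX_SLAVE_PING_TIMEOUTS):
    """Return the index of the ping at which the slave would be removed.

    `pongs` holds one boolean per ping: True if the slave answered in time.
    A reply resets the miss counter (hypothetical model of the health check,
    not the actual Mesos code). Returns None if the slave is never removed.
    """
    misses = 0
    for i, answered in enumerate(pongs):
        misses = 0 if answered else misses + 1
        if misses >= max_timeouts:
            return i
    return None


if __name__ == "__main__":
    print(removal_deadline_secs())
    # A slave that answers once in the middle survives the first gap.
    print(first_removal_index([False, False, True] + [False] * 5))
```

Under this model, raising either factor gives a heavily loaded slave more
slack, at the cost of detecting a genuinely dead slave more slowly.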
>
> Having said that, the underlying problem that we need to diagnose/fix is
> why the slave is not responsive.
>
>
> On Thu, May 30, 2013 at 11:01 AM, Brenden Matthews <
> brenden.matthews@airbedandbreakfast.com> wrote:
>
> > The slave is running a Hadoop task and is probably under heavy load.  I
> > think it's normal for it to occasionally respond slowly to health checks,
> > and Mesos shouldn't be trying to kill it because of this.  I'm not too
> > concerned about the kill failing; I'm more concerned with the fact that the
> > process is being erroneously killed in the first place.
> >
> >
> > On Thu, May 30, 2013 at 10:55 AM, Vinod Kone <vi...@gmail.com> wrote:
> >
> > > Sounds like the slave is not responding to health checks from the master.
> > > Does this happen right after you start the slave or after a while? Are you
> > > able to get a system load graph during this time?
> > >
> > > Also, the check failure while cleaning cgroups is clearly a bug (likely
> > > related to MESOS-461 <https://issues.apache.org/jira/browse/MESOS-461>).
> > >
> > >
> > > On Thu, May 30, 2013 at 10:43 AM, Brenden Matthews <
> > > brenden.matthews@airbedandbreakfast.com> wrote:
> > >
> > > > Hey guys,
> > > >
> > > > I'm having a frequent problem right now in master.  Slaves keep
> > > > deactivating and I'm unsure why.  Here's the master log:
> > > >
> > > > W0530 17:35:32.300029 21798 master.cpp:1199] Removing slave
> > > > 201305300057-1471680778-5050-21299-144 at
> > > > slave(1)@10.148.178.186:5051because it has been deactivated
> > > > I0530 17:35:32.300742 21800 hierarchical_allocator_process.hpp:423] Removed
> > > > slave 201305300057-1471680778-5050-21299-144
> > > > I0530 17:35:32.302295 21798 master.hpp:295] Removing task Task_Tracker_475
> > > > with resources cpus=23.25; mem=51150; disk=126976; ports=[31001-31001,
> > > > 31999-31999] on slave 201305300057-1471680778-5050-21299-144
> > > > I0530 17:35:32.304235 21798 master.hpp:295] Removing task
> > > > ct:join_search_request_yesterday:1369931394916:2 with resources cpus=1;
> > > > mem=1; disk=1 on slave 201305300057-1471680778-5050-21299-144
> > > > I0530 17:35:32.306157 21798 master.hpp:295] Removing task Task_Tracker_451
> > > > with resources cpus=2.25; mem=4950; disk=12288; ports=[31000-31000,
> > > > 32000-32000] on slave 201305300057-1471680778-5050-21299-144
> > > >
> > > >
> > > > And here's the slave log:
> > > >
> > > > I0530 17:29:31.326658 24787 slave.cpp:2498] Current usage 0.66%. Max
> > > > allowed age: 6.253510014347754days
> > > > I0530 17:30:31.328030 24787 slave.cpp:2498] Current usage 0.67%. Max
> > > > allowed age: 6.253270522087396days
> > > > I0530 17:31:31.329982 24798 slave.cpp:2498] Current usage 1.44%. Max
> > > > allowed age: 6.199047245757789days
> > > > I0530 17:32:31.333297 24810 slave.cpp:2498] Current usage 1.55%. Max
> > > > allowed age: 6.191838835555544days
> > > > I0530 17:34:12.236701 24790 slave.cpp:2498] Current usage 1.59%. Max
> > > > allowed age: 6.188834669188438days
> > > > I0530 17:37:10.797281 24790 slave.cpp:492] Slave asked to shut down by
> > > > master@10.17.184.87:5050
> > > > I0530 17:37:10.938212 24790 slave.cpp:1114] Asked to shut down framework
> > > > 201305290115-1471680778-5050-30247-0001 by master@10.17.184.87:5050
> > > > I0530 17:37:10.954860 24790 slave.cpp:1139] Shutting down framework
> > > > 201305290115-1471680778-5050-30247-0001
> > > > I0530 17:37:10.955477 24790 slave.cpp:2315] Shutting down executor
> > > > 'executor_Task_Tracker_475' of framework
> > > > 201305290115-1471680778-5050-30247-0001
> > > > I0530 17:37:10.956305 24790 slave.cpp:2315] Shutting down executor
> > > > 'executor_Task_Tracker_451' of framework
> > > > 201305290115-1471680778-5050-30247-0001
> > > > I0530 17:37:10.956914 24790 slave.cpp:1114] Asked to shut down framework
> > > > chronos by master@10.17.184.87:5050
> > > > I0530 17:37:10.992754 24790 slave.cpp:1139] Shutting down framework chronos
> > > > I0530 17:37:11.028842 24790 slave.cpp:2315] Shutting down executor
> > > > 'ct:join_search_request_yesterday:1369931394916:2' of framework chronos
> > > > I0530 17:37:14.847417 24809 cgroups_isolator.cpp:804] Executor
> > > > ct:join_search_request_yesterday:1369931394916:2 of framework chronos
> > > > terminated with status 0
> > > > I0530 17:37:15.957542 24792 slave.cpp:2384] Killing executor
> > > > 'executor_Task_Tracker_475' of framework
> > > > 201305290115-1471680778-5050-30247-0001
> > > > I0530 17:37:16.048948 24809 cgroups_isolator.cpp:620] Killing executor
> > > > ct:join_search_request_yesterday:1369931394916:2 of framework chronos
> > > > I0530 17:37:16.085058 24792 slave.cpp:2384] Killing executor
> > > > 'executor_Task_Tracker_451' of framework
> > > > 201305290115-1471680778-5050-30247-0001
> > > > I0530 17:37:16.086105 24792 slave.cpp:2384] Killing executor
> > > > 'ct:join_search_request_yesterday:1369931394916:2' of framework chronos
> > > > W0530 17:37:16.089593 24790 monitor.cpp:167] Failed to collect resource
> > > > usage for executor 'ct:join_search_request_yesterday:1369931394916:2' of
> > > > framework 'chronos': Unknown or killed executor
> > > > I0530 17:37:16.089761 24809 cgroups_isolator.cpp:620] Killing executor
> > > > executor_Task_Tracker_475 of framework
> > > > 201305290115-1471680778-5050-30247-0001
> > > > I0530 17:37:16.093415 24810 cgroups.cpp:1175] Trying to freeze cgroup
> > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330
> > > > I0530 17:37:16.161749 24810 cgroups.cpp:1205] Watching cgroup
> > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > frozen
> > > > I0530 17:37:16.163240 24809 cgroups_isolator.cpp:1023] OOM notifier is
> > > > triggered for executor ct:join_search_request_yesterday:1369931394916:2 of
> > > > framework chronos with uuid 199c6567-1296-427f-a74e-13742b95a330
> > > > I0530 17:37:16.163627 24809 cgroups_isolator.cpp:1028] Discarded OOM
> > > > notifier for executor ct:join_search_request_yesterday:1369931394916:2 of
> > > > framework chronos with uuid 199c6567-1296-427f-a74e-13742b95a330
> > > > I0530 17:37:16.164384 24809 cgroups_isolator.cpp:620] Killing executor
> > > > executor_Task_Tracker_451 of framework
> > > > 201305290115-1471680778-5050-30247-0001
> > > > E0530 17:37:16.166890 24809 cgroups_isolator.cpp:616] Asked to kill an
> > > > unknown/killed executor!
> > > > I0530 17:37:16.241206 24809 cgroups_isolator.cpp:1023] OOM notifier is
> > > > triggered for executor executor_Task_Tracker_475 of framework
> > > > 201305290115-1471680778-5050-30247-0001 with uuid
> > > > e90a8ce9-812d-4757-833e-62c55ada5cda
> > > > I0530 17:37:16.264039 24786 cgroups.cpp:1205] Watching cgroup
> > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > frozen
> > > > I0530 17:37:16.290067 24809 cgroups_isolator.cpp:1028] Discarded OOM
> > > > notifier for executor executor_Task_Tracker_475 of framework
> > > > 201305290115-1471680778-5050-30247-0001 with uuid
> > > > e90a8ce9-812d-4757-833e-62c55ada5cda
> > > > I0530 17:37:16.313125 24809 cgroups_isolator.cpp:1023] OOM notifier is
> > > > triggered for executor executor_Task_Tracker_451 of framework
> > > > 201305290115-1471680778-5050-30247-0001 with uuid
> > > > 10c5125d-cca4-42b7-a11a-a7fda594b005
> > > > I0530 17:37:16.321915 24809 cgroups_isolator.cpp:1028] Discarded OOM
> > > > notifier for executor executor_Task_Tracker_451 of framework
> > > > 201305290115-1471680778-5050-30247-0001 with uuid
> > > > 10c5125d-cca4-42b7-a11a-a7fda594b005
> > > > F0530 17:37:16.322549 24809 cgroups_isolator.cpp:1165] Failed to destroy
> > > > cgroup
> > > > mesos/framework_201305290115-1471680778-5050-30247-0001_executor_executor_Task_Tracker_475_tag_e90a8ce9-812d-4757-833e-62c55ada5cda:
> > > > Failed to kill tasks in nested cgroups: Collect failed: Failed to send
> > > > Killed to process 721: No such process
> > > > *** Check failure stack trace: ***
> > > > I0530 17:37:16.405956 24802 cgroups.cpp:1205] Watching cgroup
> > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > frozen
> > > > I0530 17:37:18.258474 24787 cgroups.cpp:1205] Watching cgroup
> > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > frozen
> > > > I0530 17:37:18.394127 24781 cgroups.cpp:1205] Watching cgroup
> > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > frozen
> > > > I0530 17:37:18.530743 24785 cgroups.cpp:1205] Watching cgroup
> > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > frozen
> > > > I0530 17:37:18.670518 24788 cgroups.cpp:1205] Watching cgroup
> > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > frozen
> > > > I0530 17:37:18.806540 24803 cgroups.cpp:1205] Watching cgroup
> > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > frozen
> > > > I0530 17:37:18.942659 24796 cgroups.cpp:1205] Watching cgroup
> > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > frozen
> > > > I0530 17:37:19.078434 24807 cgroups.cpp:1205] Watching cgroup
> > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > frozen
> > > > I0530 17:37:19.214282 24789 cgroups.cpp:1205] Watching cgroup
> > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > frozen
> > > > I0530 17:37:19.350401 24793 cgroups.cpp:1205] Watching cgroup
> > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > frozen
> > > > I0530 17:37:19.486559 24800 cgroups.cpp:1205] Watching cgroup
> > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > frozen
> > > > I0530 17:37:19.622144 24799 cgroups.cpp:1205] Watching cgroup
> > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > frozen
> > > > I0530 17:37:19.758436 24806 cgroups.cpp:1205] Watching cgroup
> > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > frozen
> > > > I0530 17:37:19.894443 24792 cgroups.cpp:1205] Watching cgroup
> > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > frozen
> > > > I0530 17:37:19.996562 24797 cgroups.cpp:1205] Watching cgroup
> > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > frozen
> > > > I0530 17:37:20.134127 24783 cgroups.cpp:1205] Watching cgroup
> > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > frozen
> > > > I0530 17:37:20.270037 24804 cgroups.cpp:1205] Watching cgroup
> > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > frozen
> > > > I0530 17:37:20.406051 24805 cgroups.cpp:1205] Watching cgroup
> > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > frozen
> > > > I0530 17:37:20.542073 24808 cgroups.cpp:1205] Watching cgroup
> > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > frozen
> > > > I0530 17:37:20.678064 24794 cgroups.cpp:1205] Watching cgroup
> > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > frozen
> > > > I0530 17:37:20.814045 24798 cgroups.cpp:1205] Watching cgroup
> > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > frozen
> > > > I0530 17:37:20.950011 24810 cgroups.cpp:1205] Watching cgroup
> > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > frozen
> > > > I0530 17:37:21.052018 24780 cgroups.cpp:1205] Watching cgroup
> > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > frozen
> > > > I0530 17:37:21.186023 24786 cgroups.cpp:1205] Watching cgroup
> > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > frozen
> > > > I0530 17:37:21.322306 24781 cgroups.cpp:1205] Watching cgroup
> > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > frozen
> > > >     @     0x7f57595b5c1d  google::LogMessage::Fail()
> > > > I0530 17:37:21.424214 24787 cgroups.cpp:1205] Watching cgroup
> > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > frozen
> > > >     @     0x7f57595b83af  google::LogMessage::SendToLog()
> > > > I0530 17:37:21.526203 24790 cgroups.cpp:1205] Watching cgroup
> > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > frozen
> > > > I0530 17:37:21.662224 24785 cgroups.cpp:1205] Watching cgroup
> > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > frozen
> > > > I0530 17:37:21.764526 24788 cgroups.cpp:1205] Watching cgroup
> > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > frozen
> > > > I0530 17:37:21.866744 24802 cgroups.cpp:1205] Watching cgroup
> > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > frozen
> > > > I0530 17:37:21.969965 24780 cgroups.cpp:1205] Watching cgroup
> > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > frozen
> > > > I0530 17:37:22.072257 24791 cgroups.cpp:1205] Watching cgroup
> > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > frozen
> > > > I0530 17:37:22.210059 24811 cgroups.cpp:1205] Watching cgroup
> > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > frozen
> > > > I0530 17:37:22.346051 24782 cgroups.cpp:1205] Watching cgroup
> > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > frozen
> > > >     @     0x7f57595b581b  google::LogMessage::Flush()
> > > > I0530 17:37:22.482100 24795 cgroups.cpp:1205] Watching cgroup
> > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > frozen
> > > > I0530 17:37:22.618103 24784 cgroups.cpp:1205] Watching cgroup
> > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > frozen
> > > > I0530 17:37:22.754070 24786 cgroups.cpp:1205] Watching cgroup
> > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > frozen
> > > > I0530 17:37:22.856056 24781 cgroups.cpp:1205] Watching cgroup
> > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > frozen
> > > >     @     0x7f57595b8c3d  google::LogMessageFatal::~LogMessageFatal()
> > > > I0530 17:37:22.958155 24790 cgroups.cpp:1205] Watching cgroup
> > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > frozen
> > > > I0530 17:37:23.094172 24785 cgroups.cpp:1205] Watching cgroup
> > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > frozen
> > > >     @     0x7f575933a42b
> > > >  mesos::internal::slave::CgroupsIsolator::_killExecutor()
> > > > I0530 17:37:23.196523 24801 cgroups.cpp:1205] Watching cgroup
> > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > frozen
> > > > I0530 17:37:23.334296 24787 cgroups.cpp:1205] Watching cgroup
> > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > frozen
> > > > I0530 17:37:23.436355 24788 cgroups.cpp:1205] Watching cgroup
> > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > frozen
> > > > I0530 17:37:23.538429 24796 cgroups.cpp:1205] Watching cgroup
> > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > frozen
> > > > I0530 17:37:23.640461 24789 cgroups.cpp:1205] Watching cgroup
> > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > frozen
> > > > I0530 17:37:23.742709 24800 cgroups.cpp:1205] Watching cgroup
> > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > frozen
> > > > I0530 17:37:23.844823 24803 cgroups.cpp:1205] Watching cgroup
> > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > frozen
> > > > I0530 17:37:23.946893 24799 cgroups.cpp:1205] Watching cgroup
> > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > frozen
> > > > I0530 17:37:24.050462 24807 cgroups.cpp:1205] Watching cgroup
> > > > /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > frozen
> > > >     @     0x7f5759355ff9  std::tr1::_Mem_fn<>::operator()()
> > > > I0530 17:37:24.152722 24806 cgroups.cpp:1205] Watching cgroup
> > > >
> > > >
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> > > > frozen
> > > > W0530 17:37:24.154515 24806 cgroups.cpp:1263] Unable to freeze
> > > >
> > > >
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330
> > > > within 51 attempts
> > > > I0530 17:37:24.159581 24794 cgroups.cpp:1190] Trying to thaw cgroup
> > > >
> > > >
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330
> > > > I0530 17:37:24.160049 24794 cgroups.cpp:1300] Successfully thawed
> > > >
> > > >
> > >
> >
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330
> > > >     @     0x7f5759354a60
> > > >
> > > >
> > >
> >
>  _ZNSt3tr15_BindIFNS_7_Mem_fnIMN5mesos8internal5slave15CgroupsIsolatorEFvPNS5_10CgroupInfoERKN7process6FutureIbEEEEENS_12_PlaceholderILi1EEES7_SA_EE6__callIIRPS5_EILi0ELi1ELi2EEEENS_9result_ofIFSF_NSN_IFNS_3_MuISH_Lb0ELb1EEESH_NS_5tupleIIDpT_EEEEE4typeENSN_IFNSO_IS7_Lb0ELb0EEES7_ST_EE4typeENSN_IFNSO_ISA_Lb0ELb0EEESA_ST_EE4typeEEE4typeERKST_NS_12_Index_tupleIIXspT0_EEEE
> > > >     @     0x7f57593514d0
> > > >
> > > >
> > >
> >
>  _ZNSt3tr15_BindIFNS_7_Mem_fnIMN5mesos8internal5slave15CgroupsIsolatorEFvPNS5_10CgroupInfoERKN7process6FutureIbEEEEENS_12_PlaceholderILi1EEES7_SA_EEclIIPS5_EEENS_9result_ofIFSF_NSM_IFNS_3_MuISH_Lb0ELb1EEESH_NS_5tupleIIDpT_EEEEE4typeENSM_IFNSN_IS7_Lb0ELb0EEES7_SS_EE4typeENSM_IFNSN_ISA_Lb0ELb0EEESA_SS_EE4typeEEE4typeEDpRSQ_
> > > >     @     0x7f575934cd38  std::tr1::_Function_handler<>::_M_invoke()
> > > >     @     0x7f575934ced9  std::tr1::function<>::operator()()
> > > >     @     0x7f5759347f31  process::internal::vdispatcher<>()
> > > >     @     0x7f5759354b7d
> > > >
> > > >
> > >
> >
>  _ZNSt3tr15_BindIFPFvPN7process11ProcessBaseENS_10shared_ptrINS_8functionIFvPN5mesos8internal5slave15CgroupsIsolatorEEEEEEENS_12_PlaceholderILi1EEESD_EE6__callIIRS3_EILi0ELi1EEEENS_9result_ofIFSF_NSM_IFNS_3_MuISH_Lb0ELb1EEESH_NS_5tupleIIDpT_EEEEE4typeENSM_IFNSN_ISD_Lb0ELb0EEESD_SS_EE4typeEEE4typeERKSS_NS_12_Index_tupleIIXspT0_EEEE
> > > >     @     0x7f5759351770
> > > >
> > > >
> > >
> >
>  _ZNSt3tr15_BindIFPFvPN7process11ProcessBaseENS_10shared_ptrINS_8functionIFvPN5mesos8internal5slave15CgroupsIsolatorEEEEEEENS_12_PlaceholderILi1EEESD_EEclIIS3_EEENS_9result_ofIFSF_NSL_IFNS_3_MuISH_Lb0ELb1EEESH_NS_5tupleIIDpT_EEEEE4typeENSL_IFNSM_ISD_Lb0ELb0EEESD_SR_EE4typeEEE4typeEDpRSP_
> > > >     @     0x7f575934cfc4  std::tr1::_Function_handler<>::_M_invoke()
> > > >     @     0x7f57594a1a7b  std::tr1::function<>::operator()()
> > > >     @     0x7f575948a55f  process::ProcessBase::visit()
> > > >     @     0x7f5759490992  process::DispatchEvent::visit()
> > > >     @     0x7f57590761d4  process::ProcessBase::serve()
> > > >     @     0x7f5759487cd1  process::ProcessManager::resume()
> > > >     @     0x7f575947ee2e  process::schedule()
> > > >     @     0x7f5757bc6e9a  start_thread
> > > >     @     0x7f57578f3ccd  (unknown)
> > > >
> > > >
> > > > Can you provide me some hints as to what's happening here?  This is
> > > > currently a major blocker for me!
> > > >
> > > > Thanks,
> > > >
> > > > Brenden
> > > >
> > >
> >
>

Re: Slaves deactivating

Posted by Vinod Kone <vi...@gmail.com>.
Hmm. I'm not sure I agree. If a slave is not responding to health checks,
that seems bad to me. A framework would be better off if the slave is
shut down, so that it can launch its tasks elsewhere in the cluster.

The current parameters (SLAVE_PING_TIMEOUT, MAX_SLAVE_PING_TIMEOUTS) are
such that a slave that does not respond to health checks for 75s is shut
down by the master. That seems reasonable to me. If you want that to be
tunable, however, we can expose them via master flags.
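
As a rough illustration of how those two parameters combine, here is a
minimal Python sketch. It assumes SLAVE_PING_TIMEOUT is 15 seconds and
MAX_SLAVE_PING_TIMEOUTS is 5 (only their 75s product is stated above), and
it assumes that any successful pong resets the miss counter. The
SlaveObserver class and its methods are hypothetical names for
illustration, not the master's actual code.

```python
from dataclasses import dataclass

# Assumed values: only the 75 s product is stated in this thread.
SLAVE_PING_TIMEOUT_SECS = 15.0
MAX_SLAVE_PING_TIMEOUTS = 5


@dataclass
class SlaveObserver:
    """Hypothetical model of a master-side health checker for one slave."""

    ping_timeout: float = SLAVE_PING_TIMEOUT_SECS
    max_timeouts: int = MAX_SLAVE_PING_TIMEOUTS
    missed: int = 0
    deactivated: bool = False

    def on_ping_result(self, ponged: bool) -> None:
        """Record one ping interval; a pong resets the miss counter (assumed)."""
        if self.deactivated:
            return
        if ponged:
            self.missed = 0
            return
        self.missed += 1
        if self.missed >= self.max_timeouts:
            self.deactivated = True

    def worst_case_silence(self) -> float:
        """Seconds of continuous silence before the slave is shut down."""
        return self.ping_timeout * self.max_timeouts


observer = SlaveObserver()

# Four misses followed by one pong: the slave survives.
for ponged in [False, False, False, False, True]:
    observer.on_ping_result(ponged)
survived = not observer.deactivated

# Five consecutive misses: the slave is deactivated.
for _ in range(5):
    observer.on_ping_result(False)

print(observer.worst_case_silence(), survived, observer.deactivated)
```

Under these assumptions a slave that is merely slow survives as long as it
answers at least one ping in every window of five, so only sustained
silence leads to a shutdown.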

Having said that, the underlying problem we need to diagnose and fix is
why the slave is unresponsive.
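
One way to start that diagnosis is to look for stalls in the slave log
itself. The sketch below is a hedged example, not an existing Mesos tool:
it parses the glog timestamp prefix and reports consecutive entries that
are further apart than a threshold. The function names, the 75s default
threshold, and the year (glog lines omit it) are all assumptions. The
sample lines are taken from the slave log quoted in this thread.

```python
import re
from datetime import datetime, timedelta

# glog prefix: severity letter, MMDD, HH:MM:SS.micros, thread id, file:line]
GLOG_PREFIX = re.compile(
    r"^[IWEF](\d{4}) (\d{2}:\d{2}:\d{2}\.\d{6})\s+\d+ \S+:\d+\]"
)


def parse_timestamp(line, year=2013):
    """Return the datetime of a glog line, or None for continuation lines."""
    match = GLOG_PREFIX.match(line)
    if match is None:
        return None
    stamp = f"{year}{match.group(1)} {match.group(2)}"
    return datetime.strptime(stamp, "%Y%m%d %H:%M:%S.%f")


def find_gaps(lines, threshold=timedelta(seconds=75)):
    """Yield (start, end) pairs where consecutive entries are far apart."""
    previous = None
    for line in lines:
        current = parse_timestamp(line)
        if current is None:
            continue  # wrapped continuation of the previous entry
        if previous is not None and current - previous > threshold:
            yield previous, current
        previous = current


sample = [
    "I0530 17:31:31.329982 24798 slave.cpp:2498] Current usage 1.44%. Max",
    "allowed age: 6.199047245757789days",
    "I0530 17:32:31.333297 24810 slave.cpp:2498] Current usage 1.55%. Max",
    "I0530 17:34:12.236701 24790 slave.cpp:2498] Current usage 1.59%. Max",
    "I0530 17:37:10.797281 24790 slave.cpp:492] Slave asked to shut down by",
]

gaps = list(find_gaps(sample))
for start, end in gaps:
    print(f"{start.time()} -> {end.time()}: {(end - start).total_seconds():.1f}s")
```

A quiet slave logs only about once a minute here, so the threshold has to
stay above 60s to avoid false positives. Gaps found this way only show
when the slave stopped logging, and would still need to be correlated with
a system load graph to explain why.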


On Thu, May 30, 2013 at 11:01 AM, Brenden Matthews <
brenden.matthews@airbedandbreakfast.com> wrote:

> The slave is running a Hadoop task and is probably under heavy load.  I
> think it's normal for it to occasionally respond slowly to health checks,
> and Mesos shouldn't be trying to kill it because of this.  I'm not too
> concerned about the kill failing, I'm more concerned with the fact that the
> process is being erroneously killed in the first place.

Re: Slaves deactivating

Posted by Brenden Matthews <br...@airbedandbreakfast.com>.
The slave is running a Hadoop task and is probably under heavy load.  I
think it's normal for it to occasionally respond slowly to health checks,
and Mesos shouldn't be trying to kill it because of this.  I'm not too
concerned about the kill failing; I'm more concerned that the process is
being erroneously killed in the first place.


On Thu, May 30, 2013 at 10:55 AM, Vinod Kone <vi...@gmail.com> wrote:

> Sounds like the slave is not responding to health checks by the master.
> Does this happen right after you start the slave or after a while? Are you
> able to get system load graph during this time?
>
> Also, the check failure while cleaning cgroups is clearly a bug (likely
> related to MESOS-461 <https://issues.apache.org/jira/browse/MESOS-461>).

Re: Slaves deactivating

Posted by Vinod Kone <vi...@gmail.com>.
It sounds like the slave is not responding to the master's health checks.
Does this happen right after you start the slave, or only after it has been
running for a while? Can you get a system load graph for this period?
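
If a load graph is not available, the slave log itself can hint at stalls:
the slave.cpp:2498 disk usage lines in the quoted log arrive about once a
minute until 17:32:31, and the next two log entries then come roughly 101
and 179 seconds apart. A minimal sketch for spotting such gaps follows,
assuming the standard glog line prefix. The find_gaps helper, the 90-second
threshold, and the hard-coded year (glog omits it) are illustrative choices,
not part of Mesos.

```python
import re
from datetime import datetime, timedelta

# glog prefix: severity letter, MMDD, HH:MM:SS.micros, thread id, file:line]
GLOG_LINE = re.compile(r"^[IWEF](\d{4}) (\d{2}:\d{2}:\d{2}\.\d{6}) +\d+ [^\]]+\]")


def find_gaps(lines, threshold=timedelta(seconds=90), year=2013):
    """Yield (previous, current) timestamps of consecutive log lines
    that are further apart than `threshold`."""
    previous = None
    for line in lines:
        match = GLOG_LINE.match(line)
        if match is None:
            # Wrapped continuation lines and stack frames carry no timestamp.
            continue
        stamp = f"{year}{match.group(1)} {match.group(2)}"
        current = datetime.strptime(stamp, "%Y%m%d %H:%M:%S.%f")
        if previous is not None and current - previous > threshold:
            yield previous, current
        previous = current


# First lines of the slave log quoted in this thread.
sample = [
    "I0530 17:29:31.326658 24787 slave.cpp:2498] Current usage 0.66%. Max",
    "allowed age: 6.253510014347754days",
    "I0530 17:30:31.328030 24787 slave.cpp:2498] Current usage 0.67%. Max",
    "I0530 17:31:31.329982 24798 slave.cpp:2498] Current usage 1.44%. Max",
    "I0530 17:32:31.333297 24810 slave.cpp:2498] Current usage 1.55%. Max",
    "I0530 17:34:12.236701 24790 slave.cpp:2498] Current usage 1.59%. Max",
    "I0530 17:37:10.797281 24790 slave.cpp:492] Slave asked to shut down by",
]

gaps = list(find_gaps(sample))
for before, after in gaps:
    print(before.time(), "->", after.time(), (after - before).total_seconds(), "s")
```

Gaps like these do not prove anything on their own, but they would be
consistent with the slave process being starved or blocked around the time
the master stopped hearing from it.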

Also, the check failure while cleaning cgroups is clearly a bug (likely
related to MESOS-461 <https://issues.apache.org/jira/browse/MESOS-461>).
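
The fatal line in the quoted log ("Failed to send Killed to process 721: No
such process") looks like a race in which the process exited before the
signal was delivered. As a hedged illustration only, and not the actual
Mesos code or fix, a cleanup routine can treat "no such process" as success
rather than as an error. The kill_all helper below is hypothetical and
assumes a POSIX system.

```python
import os
import signal


def kill_all(pids):
    """Send SIGKILL to each pid; return the pids that had already exited."""
    already_gone = []
    for pid in pids:
        try:
            os.kill(pid, signal.SIGKILL)
        except ProcessLookupError:
            # ESRCH: the process is already gone, which is the outcome
            # the caller wanted, so record it and carry on.
            already_gone.append(pid)
    return already_gone
```

Any other failure (for example a permission error) still propagates, so
only the benign "already exited" case is swallowed.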


On Thu, May 30, 2013 at 10:43 AM, Brenden Matthews <
brenden.matthews@airbedandbreakfast.com> wrote:

> Hey guys,
>
> I'm having a frequent problem right now in master.  Slaves keep
> deactivating and I'm unsure why.  Here's the master log:
>
> W0530 17:35:32.300029 21798 master.cpp:1199] Removing slave
> 201305300057-1471680778-5050-21299-144 at
> slave(1)@10.148.178.186:5051because it has been deactivated
> I0530 17:35:32.300742 21800 hierarchical_allocator_process.hpp:423] Removed
> slave 201305300057-1471680778-5050-21299-144
> I0530 17:35:32.302295 21798 master.hpp:295] Removing task Task_Tracker_475
> with resources cpus=23.25; mem=51150; disk=126976; ports=[31001-31001,
> 31999-31999] on slave 201305300057-1471680778-5050-21299-144
> I0530 17:35:32.304235 21798 master.hpp:295] Removing task
> ct:join_search_request_yesterday:1369931394916:2 with resources cpus=1;
> mem=1; disk=1 on slave 201305300057-1471680778-5050-21299-144
> I0530 17:35:32.306157 21798 master.hpp:295] Removing task Task_Tracker_451
> with resources cpus=2.25; mem=4950; disk=12288; ports=[31000-31000,
> 32000-32000] on slave 201305300057-1471680778-5050-21299-144
>
>
> And here's the slave log:
>
> I0530 17:29:31.326658 24787 slave.cpp:2498] Current usage 0.66%. Max
> allowed age: 6.253510014347754days
> I0530 17:30:31.328030 24787 slave.cpp:2498] Current usage 0.67%. Max
> allowed age: 6.253270522087396days
> I0530 17:31:31.329982 24798 slave.cpp:2498] Current usage 1.44%. Max
> allowed age: 6.199047245757789days
> I0530 17:32:31.333297 24810 slave.cpp:2498] Current usage 1.55%. Max
> allowed age: 6.191838835555544days
> I0530 17:34:12.236701 24790 slave.cpp:2498] Current usage 1.59%. Max
> allowed age: 6.188834669188438days
> I0530 17:37:10.797281 24790 slave.cpp:492] Slave asked to shut down by
> master@10.17.184.87:5050
> I0530 17:37:10.938212 24790 slave.cpp:1114] Asked to shut down framework
> 201305290115-1471680778-5050-30247-0001 by master@10.17.184.87:5050
> I0530 17:37:10.954860 24790 slave.cpp:1139] Shutting down framework
> 201305290115-1471680778-5050-30247-0001
> I0530 17:37:10.955477 24790 slave.cpp:2315] Shutting down executor
> 'executor_Task_Tracker_475' of framework
> 201305290115-1471680778-5050-30247-0001
> I0530 17:37:10.956305 24790 slave.cpp:2315] Shutting down executor
> 'executor_Task_Tracker_451' of framework
> 201305290115-1471680778-5050-30247-0001
> I0530 17:37:10.956914 24790 slave.cpp:1114] Asked to shut down framework
> chronos by master@10.17.184.87:5050
> I0530 17:37:10.992754 24790 slave.cpp:1139] Shutting down framework chronos
> I0530 17:37:11.028842 24790 slave.cpp:2315] Shutting down executor
> 'ct:join_search_request_yesterday:1369931394916:2' of framework chronos
> I0530 17:37:14.847417 24809 cgroups_isolator.cpp:804] Executor
> ct:join_search_request_yesterday:1369931394916:2 of framework chronos
> terminated with status 0
> I0530 17:37:15.957542 24792 slave.cpp:2384] Killing executor
> 'executor_Task_Tracker_475' of framework
> 201305290115-1471680778-5050-30247-0001
> I0530 17:37:16.048948 24809 cgroups_isolator.cpp:620] Killing executor
> ct:join_search_request_yesterday:1369931394916:2 of framework chronos
> I0530 17:37:16.085058 24792 slave.cpp:2384] Killing executor
> 'executor_Task_Tracker_451' of framework
> 201305290115-1471680778-5050-30247-0001
> I0530 17:37:16.086105 24792 slave.cpp:2384] Killing executor
> 'ct:join_search_request_yesterday:1369931394916:2' of framework chronos
> W0530 17:37:16.089593 24790 monitor.cpp:167] Failed to collect resource
> usage for executor 'ct:join_search_request_yesterday:1369931394916:2' of
> framework 'chronos': Unknown or killed executor
> I0530 17:37:16.089761 24809 cgroups_isolator.cpp:620] Killing executor
> executor_Task_Tracker_475 of framework
> 201305290115-1471680778-5050-30247-0001
> I0530 17:37:16.093415 24810 cgroups.cpp:1175] Trying to freeze cgroup
>
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330
> I0530 17:37:16.161749 24810 cgroups.cpp:1205] Watching cgroup
>
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> frozen
> I0530 17:37:16.163240 24809 cgroups_isolator.cpp:1023] OOM notifier is
> triggered for executor ct:join_search_request_yesterday:1369931394916:2 of
> framework chronos with uuid 199c6567-1296-427f-a74e-13742b95a330
> I0530 17:37:16.163627 24809 cgroups_isolator.cpp:1028] Discarded OOM
> notifier for executor ct:join_search_request_yesterday:1369931394916:2 of
> framework chronos with uuid 199c6567-1296-427f-a74e-13742b95a330
> I0530 17:37:16.164384 24809 cgroups_isolator.cpp:620] Killing executor
> executor_Task_Tracker_451 of framework
> 201305290115-1471680778-5050-30247-0001
> E0530 17:37:16.166890 24809 cgroups_isolator.cpp:616] Asked to kill an
> unknown/killed executor!
> I0530 17:37:16.241206 24809 cgroups_isolator.cpp:1023] OOM notifier is
> triggered for executor executor_Task_Tracker_475 of framework
> 201305290115-1471680778-5050-30247-0001 with uuid
> e90a8ce9-812d-4757-833e-62c55ada5cda
> I0530 17:37:16.264039 24786 cgroups.cpp:1205] Watching cgroup
>
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> frozen
> I0530 17:37:16.290067 24809 cgroups_isolator.cpp:1028] Discarded OOM
> notifier for executor executor_Task_Tracker_475 of framework
> 201305290115-1471680778-5050-30247-0001 with uuid
> e90a8ce9-812d-4757-833e-62c55ada5cda
> I0530 17:37:16.313125 24809 cgroups_isolator.cpp:1023] OOM notifier is
> triggered for executor executor_Task_Tracker_451 of framework
> 201305290115-1471680778-5050-30247-0001 with uuid
> 10c5125d-cca4-42b7-a11a-a7fda594b005
> I0530 17:37:16.321915 24809 cgroups_isolator.cpp:1028] Discarded OOM
> notifier for executor executor_Task_Tracker_451 of framework
> 201305290115-1471680778-5050-30247-0001 with uuid
> 10c5125d-cca4-42b7-a11a-a7fda594b005
> F0530 17:37:16.322549 24809 cgroups_isolator.cpp:1165] Failed to destroy
> cgroup
>
> mesos/framework_201305290115-1471680778-5050-30247-0001_executor_executor_Task_Tracker_475_tag_e90a8ce9-812d-4757-833e-62c55ada5cda:
> Failed to kill tasks in nested cgroups: Collect failed: Failed to send
> Killed to process 721: No such process
> *** Check failure stack trace: ***
> I0530 17:37:16.405956 24802 cgroups.cpp:1205] Watching cgroup
>
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> frozen
> I0530 17:37:18.258474 24787 cgroups.cpp:1205] Watching cgroup
>
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> frozen
> I0530 17:37:18.394127 24781 cgroups.cpp:1205] Watching cgroup
>
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> frozen
> I0530 17:37:18.530743 24785 cgroups.cpp:1205] Watching cgroup
>
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> frozen
> I0530 17:37:18.670518 24788 cgroups.cpp:1205] Watching cgroup
>
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> frozen
> I0530 17:37:18.806540 24803 cgroups.cpp:1205] Watching cgroup
>
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> frozen
> I0530 17:37:18.942659 24796 cgroups.cpp:1205] Watching cgroup
>
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> frozen
> I0530 17:37:19.078434 24807 cgroups.cpp:1205] Watching cgroup
>
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> frozen
> I0530 17:37:19.214282 24789 cgroups.cpp:1205] Watching cgroup
>
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> frozen
> I0530 17:37:19.350401 24793 cgroups.cpp:1205] Watching cgroup
>
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> frozen
> I0530 17:37:19.486559 24800 cgroups.cpp:1205] Watching cgroup
>
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> frozen
> I0530 17:37:19.622144 24799 cgroups.cpp:1205] Watching cgroup
>
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> frozen
> I0530 17:37:19.758436 24806 cgroups.cpp:1205] Watching cgroup
>
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> frozen
> I0530 17:37:19.894443 24792 cgroups.cpp:1205] Watching cgroup
>
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> frozen
> I0530 17:37:19.996562 24797 cgroups.cpp:1205] Watching cgroup
>
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> frozen
> I0530 17:37:20.134127 24783 cgroups.cpp:1205] Watching cgroup
>
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> frozen
> I0530 17:37:20.270037 24804 cgroups.cpp:1205] Watching cgroup
>
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> frozen
> I0530 17:37:20.406051 24805 cgroups.cpp:1205] Watching cgroup
>
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> frozen
> I0530 17:37:20.542073 24808 cgroups.cpp:1205] Watching cgroup
>
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> frozen
> I0530 17:37:20.678064 24794 cgroups.cpp:1205] Watching cgroup
>
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> frozen
> I0530 17:37:20.814045 24798 cgroups.cpp:1205] Watching cgroup
>
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> frozen
> I0530 17:37:20.950011 24810 cgroups.cpp:1205] Watching cgroup
>
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> frozen
> I0530 17:37:21.052018 24780 cgroups.cpp:1205] Watching cgroup
>
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> frozen
> I0530 17:37:21.186023 24786 cgroups.cpp:1205] Watching cgroup
>
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> frozen
> I0530 17:37:21.322306 24781 cgroups.cpp:1205] Watching cgroup
>
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> frozen
>     @     0x7f57595b5c1d  google::LogMessage::Fail()
> I0530 17:37:21.424214 24787 cgroups.cpp:1205] Watching cgroup
>
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> frozen
>     @     0x7f57595b83af  google::LogMessage::SendToLog()
> I0530 17:37:21.526203 24790 cgroups.cpp:1205] Watching cgroup
>
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330until
> frozen
> I0530 17:37:21.662224 24785 cgroups.cpp:1205] Watching cgroup
>
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
> frozen
> I0530 17:37:21.764526 24788 cgroups.cpp:1205] Watching cgroup
>
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
> frozen
> I0530 17:37:21.866744 24802 cgroups.cpp:1205] Watching cgroup
>
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
> frozen
> I0530 17:37:21.969965 24780 cgroups.cpp:1205] Watching cgroup
>
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
> frozen
> I0530 17:37:22.072257 24791 cgroups.cpp:1205] Watching cgroup
>
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
> frozen
> I0530 17:37:22.210059 24811 cgroups.cpp:1205] Watching cgroup
>
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
> frozen
> I0530 17:37:22.346051 24782 cgroups.cpp:1205] Watching cgroup
>
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
> frozen
>     @     0x7f57595b581b  google::LogMessage::Flush()
> I0530 17:37:22.482100 24795 cgroups.cpp:1205] Watching cgroup
>
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
> frozen
> I0530 17:37:22.618103 24784 cgroups.cpp:1205] Watching cgroup
>
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
> frozen
> I0530 17:37:22.754070 24786 cgroups.cpp:1205] Watching cgroup
>
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
> frozen
> I0530 17:37:22.856056 24781 cgroups.cpp:1205] Watching cgroup
>
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
> frozen
>     @     0x7f57595b8c3d  google::LogMessageFatal::~LogMessageFatal()
> I0530 17:37:22.958155 24790 cgroups.cpp:1205] Watching cgroup
>
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
> frozen
> I0530 17:37:23.094172 24785 cgroups.cpp:1205] Watching cgroup
>
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
> frozen
>     @     0x7f575933a42b  mesos::internal::slave::CgroupsIsolator::_killExecutor()
> I0530 17:37:23.196523 24801 cgroups.cpp:1205] Watching cgroup
>
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
> frozen
> I0530 17:37:23.334296 24787 cgroups.cpp:1205] Watching cgroup
>
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
> frozen
> I0530 17:37:23.436355 24788 cgroups.cpp:1205] Watching cgroup
>
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
> frozen
> I0530 17:37:23.538429 24796 cgroups.cpp:1205] Watching cgroup
>
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
> frozen
> I0530 17:37:23.640461 24789 cgroups.cpp:1205] Watching cgroup
>
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
> frozen
> I0530 17:37:23.742709 24800 cgroups.cpp:1205] Watching cgroup
>
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
> frozen
> I0530 17:37:23.844823 24803 cgroups.cpp:1205] Watching cgroup
>
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
> frozen
> I0530 17:37:23.946893 24799 cgroups.cpp:1205] Watching cgroup
>
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
> frozen
> I0530 17:37:24.050462 24807 cgroups.cpp:1205] Watching cgroup
>
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
> frozen
>     @     0x7f5759355ff9  std::tr1::_Mem_fn<>::operator()()
> I0530 17:37:24.152722 24806 cgroups.cpp:1205] Watching cgroup
>
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330 until
> frozen
> W0530 17:37:24.154515 24806 cgroups.cpp:1263] Unable to freeze
>
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330
> within 51 attempts
> I0530 17:37:24.159581 24794 cgroups.cpp:1190] Trying to thaw cgroup
>
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330
> I0530 17:37:24.160049 24794 cgroups.cpp:1300] Successfully thawed
>
> /cgroup/mesos/framework_chronos_executor_ct:join_search_request_yesterday:1369931394916:2_tag_199c6567-1296-427f-a74e-13742b95a330
>     @     0x7f5759354a60
>
>  _ZNSt3tr15_BindIFNS_7_Mem_fnIMN5mesos8internal5slave15CgroupsIsolatorEFvPNS5_10CgroupInfoERKN7process6FutureIbEEEEENS_12_PlaceholderILi1EEES7_SA_EE6__callIIRPS5_EILi0ELi1ELi2EEEENS_9result_ofIFSF_NSN_IFNS_3_MuISH_Lb0ELb1EEESH_NS_5tupleIIDpT_EEEEE4typeENSN_IFNSO_IS7_Lb0ELb0EEES7_ST_EE4typeENSN_IFNSO_ISA_Lb0ELb0EEESA_ST_EE4typeEEE4typeERKST_NS_12_Index_tupleIIXspT0_EEEE
>     @     0x7f57593514d0
>
>  _ZNSt3tr15_BindIFNS_7_Mem_fnIMN5mesos8internal5slave15CgroupsIsolatorEFvPNS5_10CgroupInfoERKN7process6FutureIbEEEEENS_12_PlaceholderILi1EEES7_SA_EEclIIPS5_EEENS_9result_ofIFSF_NSM_IFNS_3_MuISH_Lb0ELb1EEESH_NS_5tupleIIDpT_EEEEE4typeENSM_IFNSN_IS7_Lb0ELb0EEES7_SS_EE4typeENSM_IFNSN_ISA_Lb0ELb0EEESA_SS_EE4typeEEE4typeEDpRSQ_
>     @     0x7f575934cd38  std::tr1::_Function_handler<>::_M_invoke()
>     @     0x7f575934ced9  std::tr1::function<>::operator()()
>     @     0x7f5759347f31  process::internal::vdispatcher<>()
>     @     0x7f5759354b7d
>
>  _ZNSt3tr15_BindIFPFvPN7process11ProcessBaseENS_10shared_ptrINS_8functionIFvPN5mesos8internal5slave15CgroupsIsolatorEEEEEEENS_12_PlaceholderILi1EEESD_EE6__callIIRS3_EILi0ELi1EEEENS_9result_ofIFSF_NSM_IFNS_3_MuISH_Lb0ELb1EEESH_NS_5tupleIIDpT_EEEEE4typeENSM_IFNSN_ISD_Lb0ELb0EEESD_SS_EE4typeEEE4typeERKSS_NS_12_Index_tupleIIXspT0_EEEE
>     @     0x7f5759351770
>
>  _ZNSt3tr15_BindIFPFvPN7process11ProcessBaseENS_10shared_ptrINS_8functionIFvPN5mesos8internal5slave15CgroupsIsolatorEEEEEEENS_12_PlaceholderILi1EEESD_EEclIIS3_EEENS_9result_ofIFSF_NSL_IFNS_3_MuISH_Lb0ELb1EEESH_NS_5tupleIIDpT_EEEEE4typeENSL_IFNSM_ISD_Lb0ELb0EEESD_SR_EE4typeEEE4typeEDpRSP_
>     @     0x7f575934cfc4  std::tr1::_Function_handler<>::_M_invoke()
>     @     0x7f57594a1a7b  std::tr1::function<>::operator()()
>     @     0x7f575948a55f  process::ProcessBase::visit()
>     @     0x7f5759490992  process::DispatchEvent::visit()
>     @     0x7f57590761d4  process::ProcessBase::serve()
>     @     0x7f5759487cd1  process::ProcessManager::resume()
>     @     0x7f575947ee2e  process::schedule()
>     @     0x7f5757bc6e9a  start_thread
>     @     0x7f57578f3ccd  (unknown)
>
>
> Can you provide me some hints as to what's happening here?  This is
> currently a major blocker for me!
>
> Thanks,
>
> Brenden
>