Posted to issues@mesos.apache.org by "Alexander Rukletsov (JIRA)" <ji...@apache.org> on 2017/04/26 12:49:04 UTC
[jira] [Updated] (MESOS-7152) The agent may be flapping after the
machine reboots due to provisioner recover.
[ https://issues.apache.org/jira/browse/MESOS-7152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alexander Rukletsov updated MESOS-7152:
---------------------------------------
Fix Version/s: 1.1.2
> The agent may be flapping after the machine reboots due to provisioner recover.
> -------------------------------------------------------------------------------
>
> Key: MESOS-7152
> URL: https://issues.apache.org/jira/browse/MESOS-7152
> Project: Mesos
> Issue Type: Bug
> Reporter: Gilbert Song
> Assignee: Gilbert Song
> Priority: Blocker
> Labels: nested, provisioner
> Fix For: 1.1.2, 1.2.0
>
>
> After the agent machine reboots, if the agent work dir (e.g., /var/lib/mesos) survives but the container runtime directory is gone (so the recovered SlaveState is empty as well), the provisioner recover() crashes on a fatal check failure, because this case breaks the invariant that a child container is always cleaned up before its parent container.
> This is a special case that only happens if the machine reboots and the provisioner directory survives.
> {noformat}
> F0217 01:10:18.423238 30099 provisioner.cpp:504] Check failed: entry.parent() != containerId Failed to destroy container 1 since its nested container 1.2 has not been destroyed yet
> *** Check failure stack trace: ***
> @ 0x7fceb444121d google::LogMessage::Fail()
> @ 0x7fceb44405ee google::LogMessage::SendToLog()
> @ 0x7fceb4440eed google::LogMessage::Flush()
> @ 0x7fceb4444368 google::LogMessageFatal::~LogMessageFatal()
> @ 0x7fceb36137f9 mesos::internal::slave::ProvisionerProcess::destroy()
> @ 0x7fceb36126f0 mesos::internal::slave::ProvisionerProcess::recover()
> @ 0x7fceb3637fc6 _ZZN7process8dispatchI7NothingN5mesos8internal5slave18ProvisionerProcessERK7hashsetINS2_11ContainerIDESt4hashIS7_ESt8equal_toIS7_EESC_EENS_6FutureIT_EERKNS_3PIDIT0_EEMSJ_FSH_T1_ET2_ENKUlPNS_11ProcessBaseEE_clESS_
> @ 0x7fceb3637bc2 _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchI7NothingN5mesos8internal5slave18ProvisionerProcessERK7hashsetINS6_11ContainerIDESt4hashISB_ESt8equal_toISB_EESG_EENS0_6FutureIT_EERKNS0_3PIDIT0_EEMSN_FSL_T1_ET2_EUlS2_E_E9_M_invokeERKSt9_Any_dataOS2_
> @ 0x7fceb43848e4 std::function<>::operator()()
> @ 0x7fceb436baf4 process::ProcessBase::visit()
> @ 0x7fceb43e5fde process::DispatchEvent::visit()
> @ 0x9e4101 process::ProcessBase::serve()
> @ 0x7fceb4369007 process::ProcessManager::resume()
> @ 0x7fceb4377a8c process::ProcessManager::init_threads()::$_2::operator()()
> @ 0x7fceb4377995 _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvE3$_2vEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE
> @ 0x7fceb4377965 std::_Bind_simple<>::operator()()
> @ 0x7fceb437793c std::thread::_Impl<>::_M_run()
> @ 0x7fceadefa030 (unknown)
> @ 0x7fcead70b6aa start_thread
> @ 0x7fcead440e9d (unknown)
> {noformat}
> The provisioner directory is supposed to be under the container runtime directory. However, moving it there is not backward compatible, so we can only change it after a deprecation cycle.
> For now, we have three options:
> 1. make provisioner::destroy() recursive.
> 2. sort the containers during recovery to guarantee the `child before parent` semantic.
> 3. remove the check failure, since the whole provisioner dir will eventually be removed anyway (not recommended).
> Recommendation: (1).
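
Options (1) and (2) above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual Mesos patch: `Id`, `parentOf`, `depthOf`, `ToyProvisioner` and `childrenFirst` are hypothetical names, and container IDs are modeled as dotted strings such as "1" and "1.2" (as printed in the log above) instead of the real mesos::ContainerID.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for a container ID: "1" is a top-level container,
// "1.2" is nested under "1". This is NOT the real mesos::ContainerID.
using Id = std::string;

// Returns the parent ID, or an empty string for a top-level container.
Id parentOf(const Id& id) {
  const std::size_t dot = id.rfind('.');
  return dot == std::string::npos ? Id() : id.substr(0, dot);
}

// Nesting depth: 0 for "1", 1 for "1.2", 2 for "1.2.3".
std::size_t depthOf(const Id& id) {
  return static_cast<std::size_t>(std::count(id.begin(), id.end(), '.'));
}

// Toy model of the provisioner's bookkeeping (assumption, for illustration).
struct ToyProvisioner {
  std::set<Id> known;         // containers found in the provisioner dir
  std::vector<Id> destroyed;  // order in which containers were destroyed

  // Option 1: destroy() first destroys all nested containers, so the
  // "no child left behind" invariant holds in whatever order it is called.
  void destroy(const Id& id) {
    if (known.count(id) == 0) {
      return;  // already destroyed, e.g. as a child of an earlier call
    }

    // Collect direct children first; `known` is modified by the recursion.
    std::vector<Id> children;
    for (const Id& entry : known) {
      if (parentOf(entry) == id) {
        children.push_back(entry);
      }
    }
    for (const Id& child : children) {
      destroy(child);
    }

    // Mirrors the check that failed in the log: no nested container may
    // still exist when its parent is destroyed.
    for (const Id& entry : known) {
      assert(parentOf(entry) != id);
    }

    known.erase(id);
    destroyed.push_back(id);
  }
};

// Option 2: order the containers so that deeper (child) containers come
// before shallower (parent) ones, then destroy them in that order.
std::vector<Id> childrenFirst(std::vector<Id> ids) {
  std::stable_sort(ids.begin(), ids.end(), [](const Id& a, const Id& b) {
    return depthOf(a) > depthOf(b);
  });
  return ids;
}
```

With option (1), a recover() that walks the known containers in arbitrary order stays safe, because a later destroy() call on an already-removed child becomes a no-op. Option (2) leaves destroy() unchanged and relies on every caller sorting first, which is probably why the recursive variant is the recommended one.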
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)