Posted to issues@mesos.apache.org by "Greg Mann (JIRA)" <ji...@apache.org> on 2017/12/12 23:51:00 UTC

[jira] [Assigned] (MESOS-8325) Mesos containerizer does not properly handle old running containers

     [ https://issues.apache.org/jira/browse/MESOS-8325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Mann reassigned MESOS-8325:
--------------------------------

    Assignee: Zhitao Li

> Mesos containerizer does not properly handle old running containers
> -------------------------------------------------------------------
>
>                 Key: MESOS-8325
>                 URL: https://issues.apache.org/jira/browse/MESOS-8325
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent, containerization
>            Reporter: Greg Mann
>            Assignee: Zhitao Li
>            Priority: Blocker
>              Labels: containerizer, mesosphere, recovery, upgrade
>
> We were testing an upgrade scenario recently and encountered the following assertion failure:
> {code}
> Dec 12 16:45:42 agent.hostname mesos-agent[20788]: I1212 16:45:42.693977 20810 http.cpp:3116] Processing LAUNCH_NESTED_CONTAINER_SESSION call for container 'a89b211a-4549-462d-9cc7-0ea2bac2f729.1c262420-7525-4fee-99c1-aff4f66996bd.check-a41362ae-13c6-4750-990e-a1a0b2792b5f'
> Dec 12 16:45:42 agent.hostname mesos-agent[20788]: I1212 16:45:42.695179 20807 containerizer.cpp:1169] Trying to chown '/var/lib/mesos/slave/slaves/aaf0a62f-a6eb-4c1d-80db-5fdd26fe8008-S12/frameworks/dcf5f8b5-86a8-44df-ac03-b39404239ad8-0377/executors/kafka__68baefd4-aa8c-4b97-a23e-eb6a73fa91f6/runs/a89b211a-4549-462d-9cc7-0ea2bac2f729/containers/1c262420-7525-4fee-99c1-aff4f66996bd/containers/check-a41362ae-13c6-4750-990e-a1a0b2792b5f' to user 'nobody'
> Dec 12 16:45:42 agent.hostname mesos-agent[20788]: W1212 16:45:42.695309 20807 containerizer.cpp:1198] Cannot determine executor_info for root container 'a89b211a-4549-462d-9cc7-0ea2bac2f729' which has no config recovered.
> Dec 12 16:45:42 agent.hostname mesos-agent[20788]: I1212 16:45:42.695327 20807 containerizer.cpp:1203] Starting container a89b211a-4549-462d-9cc7-0ea2bac2f729.1c262420-7525-4fee-99c1-aff4f66996bd.check-a41362ae-13c6-4750-990e-a1a0b2792b5f
> Dec 12 16:45:42 agent.hostname mesos-agent[20788]: I1212 16:45:42.695829 20807 containerizer.cpp:2932] Transitioning the state of container a89b211a-4549-462d-9cc7-0ea2bac2f729.1c262420-7525-4fee-99c1-aff4f66996bd.check-a41362ae-13c6-4750-990e-a1a0b2792b5f from PROVISIONING to PREPARING
> Dec 12 16:45:42 agent.hostname mesos-agent[20788]: I1212 16:45:42.700569 20811 systemd.cpp:98] Assigned child process '20941' to 'mesos_executors.slice'
> Dec 12 16:45:42 agent.hostname mesos-agent[20788]: I1212 16:45:42.702945 20811 systemd.cpp:98] Assigned child process '20942' to 'mesos_executors.slice'
> Dec 12 16:45:42 agent.hostname mesos-agent[20788]: I1212 16:45:42.706069 20806 switchboard.cpp:575] Created I/O switchboard server (pid: 20943) listening on socket file '/tmp/mesos-io-switchboard-74af71bb-2385-4dde-9762-94d0196124d3' for container a89b211a-4549-462d-9cc7-0ea2bac2f729.1c262420-7525-4fee-99c1-aff4f66996bd.check-a41362ae-13c6-4750-990e-a1a0b2792b5f
> Dec 12 16:45:42 agent.hostname mesos-agent[20788]: mesos-agent: /pkg/src/mesos/3rdparty/stout/include/stout/option.hpp:115: T& Option<T>::get() & [with T = mesos::slave::ContainerConfig]: Assertion `isSome()' failed.
> Dec 12 16:45:42 agent.hostname mesos-agent[20788]: *** Aborted at 1513097142 (unix time) try "date -d @1513097142" if you are using GNU date ***
> Dec 12 16:45:42 agent.hostname mesos-agent[20788]: PC: @     0x7f472f2851f7 __GI_raise
> Dec 12 16:45:42 agent.hostname mesos-agent[20788]: *** SIGABRT (@0x5134) received by PID 20788 (TID 0x7f472a2bf700) from PID 20788; stack trace: ***
> Dec 12 16:45:42 agent.hostname mesos-agent[20788]: @     0x7f472f6225e0 (unknown)
> Dec 12 16:45:42 agent.hostname mesos-agent[20788]: @     0x7f472f2851f7 __GI_raise
> Dec 12 16:45:42 agent.hostname mesos-agent[20788]: @     0x7f472f2868e8 __GI_abort
> Dec 12 16:45:42 agent.hostname mesos-agent[20788]: @     0x7f472f27e266 __assert_fail_base
> Dec 12 16:45:42 agent.hostname mesos-agent[20788]: @     0x7f472f27e312 __GI___assert_fail
> Dec 12 16:45:42 agent.hostname mesos-agent[20788]: @     0x7f4731c481e3 _ZNR6OptionIN5mesos5slave15ContainerConfigEE3getEv.part.170
> Dec 12 16:45:42 agent.hostname mesos-agent[20788]: @     0x7f4731c61c2d mesos::internal::slave::MesosContainerizerProcess::_launch()
> Dec 12 16:45:42 agent.hostname mesos-agent[20788]: @     0x7f4731c7f403 _ZN5cpp176invokeIZN7process8dispatchIN5mesos8internal5slave13Containerizer12LaunchResultENS5_25MesosContainerizerProcessERKNS3_11ContainerIDERK6OptionINS3_5slave11ContainerIOEERKSt3mapISsSsSt4lessISsESaISt4pairIKSsSsEEERKSC_ISsESB_SH_SR_SU_EENS1_6FutureIT_EERKNS1_3PIDIT0_EEMSZ_FSX_T1_T2_T3_T4_EOT5_OT6_OT7_OT8_EUlSt10unique_ptrINS1_7PromiseIS7_EESt14default_deleteIS1J_EEOS9_OSF_OSP_OSS_PNS1_11ProcessBaseEE_IS1M_S9_SF_SP_SS_S1S_EEEDTclcl7forwardISW_Efp_Espcl7forwardIT0_Efp0_EEEOSW_DpOS1U_
> Dec 12 16:45:42 agent.hostname mesos-agent[20788]: @     0x7f4731c7f4f1 _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchIN5mesos8internal5slave13Containerizer12LaunchResultENSC_25MesosContainerizerProcessERKNSA_11ContainerIDERK6OptionINSA_5slave11ContainerIOEERKSt3mapISsSsSt4lessISsESaISt4pairIKSsSsEEERKSJ_ISsESI_SO_SY_S11_EENS1_6FutureIT_EERKNS1_3PIDIT0_EEMS16_FS14_T1_T2_T3_T4_EOT5_OT6_OT7_OT8_EUlSt10unique_ptrINS1_7PromiseISE_EESt14default_deleteIS1Q_EEOSG_OSM_OSW_OSZ_S3_E_IS1T_SG_SM_SW_SZ_St12_PlaceholderILi1EEEEEEclEOS3_
> Dec 12 16:45:42 agent.hostname mesos-agent[20788]: @     0x7f47325dbb31 process::ProcessBase::consume()
> Dec 12 16:45:42 agent.hostname mesos-agent[20788]: @     0x7f47325ea882 process::ProcessManager::resume()
> Dec 12 16:45:42 agent.hostname mesos-agent[20788]: @     0x7f47325efcf6 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> Dec 12 16:45:42 agent.hostname mesos-agent[20788]: @     0x7f472fafa230 (unknown)
> Dec 12 16:45:42 agent.hostname mesos-agent[20788]: @     0x7f472f61ae25 start_thread
> Dec 12 16:45:42 agent.hostname mesos-agent[20788]: @     0x7f472f34834d __clone
> Dec 12 16:45:42 agent.hostname systemd[1]: dcos-mesos-slave.service: main process exited, code=killed, status=6/ABRT
> Dec 12 16:45:42 agent.hostname systemd[1]: Unit dcos-mesos-slave.service entered failed state.
> Dec 12 16:45:42 agent.hostname systemd[1]: dcos-mesos-slave.service failed.
> {code}
> Looking into {{MesosContainerizerProcess::_launch}}, the frame directly above the failed {{Option<T>::get()}} in the stack trace, we do find an unguarded access to the parent container's {{ContainerConfig}} [here|https://github.com/apache/mesos/blob/c320ab3b2dc4a16de7e060b9e15e9865a73389b0/src/slave/containerizer/mesos/containerizer.cpp#L1716].
> We recently [added checkpointing|https://github.com/apache/mesos/commit/03a2a4dfa47b1d47c5eb23e81f5ef8213e46d545] of {{ContainerConfig}} to the Mesos containerizer. It seems that we do not handle upgrades correctly: containers launched before the upgrade may still be running afterwards, and for those we should not expect to recover a {{ContainerConfig}}.
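
For illustration only, here is a minimal sketch of what a guarded access could look like. It is not the actual Mesos code or the eventual fix: {{std::optional}} stands in for stout's {{Option}}, and the {{ContainerConfig}} and {{Container}} structs, the {{LaunchStatus}} enum, and {{launchNested}} are hypothetical names invented for this example. The idea is to check whether the parent's config was recovered before dereferencing it, so that an old container yields an error result instead of an assertion failure.

```cpp
#include <map>
#include <optional>
#include <string>

// Hypothetical stand-in for mesos::slave::ContainerConfig.
struct ContainerConfig {
  std::string user;
};

// Hypothetical stand-in for the containerizer's per-container record.
struct Container {
  // Empty for containers launched before ContainerConfig was checkpointed,
  // since nothing can be recovered for them after an upgrade.
  std::optional<ContainerConfig> config;
};

// Hypothetical result type for this sketch.
enum class LaunchStatus { Launched, ParentUnknown, ParentConfigMissing };

// Launches a nested container under `parentId` only if the parent's
// config was recovered. Assumed helper, not a Mesos API.
LaunchStatus launchNested(
    const std::map<std::string, Container>& containers,
    const std::string& parentId)
{
  auto it = containers.find(parentId);
  if (it == containers.end()) {
    return LaunchStatus::ParentUnknown;
  }

  const std::optional<ContainerConfig>& config = it->second.config;

  // The guard: without this check, dereferencing an empty optional is the
  // analogue of calling Option<T>::get() when isSome() is false.
  if (!config.has_value()) {
    return LaunchStatus::ParentConfigMissing;
  }

  // Safe to read the parent's config from here on.
  const std::string& user = config->user;
  (void)user;  // A real launch would use this; the sketch only reads it.

  return LaunchStatus::Launched;
}
```

With a map holding one container whose config is empty and one whose config is set, {{launchNested}} returns {{ParentConfigMissing}} for the first and {{Launched}} for the second. Whether the agent should fail such a launch or fall back to some default is a design decision this sketch does not settle.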



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)