Posted to issues@mesos.apache.org by "Gilbert Song (JIRA)" <ji...@apache.org> on 2019/01/09 19:20:00 UTC

[jira] [Assigned] (MESOS-9502) IOswitchboard cleanup could get stuck.

     [ https://issues.apache.org/jira/browse/MESOS-9502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gilbert Song reassigned MESOS-9502:
-----------------------------------

        Shepherd: Gilbert Song
        Assignee: Andrei Budnik
          Sprint: Containerization R9 Sprint 37
    Story Points: 8
          Labels: containerizer  (was: )

> IOswitchboard cleanup could get stuck.
> --------------------------------------
>
>                 Key: MESOS-9502
>                 URL: https://issues.apache.org/jira/browse/MESOS-9502
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization
>    Affects Versions: 1.7.0
>            Reporter: Meng Zhu
>            Assignee: Andrei Budnik
>            Priority: Critical
>              Labels: containerizer
>
> Our check container got stuck during destroy, which in turn got the parent container stuck. It is blocked by the I/O switchboard cleanup:
> 1223 18:04:41.000000 16269 switchboard.cpp:814] Sending SIGTERM to I/O switchboard server (pid: 62854) since container 4d4074fa-bc87-471b-8659-08e519b68e13.16d02532-675a-4acb-964d-57459ecf6b67.check-e91521a3-bf72-4ac4-8ead-3950e31cf09e is being destroyed
> ....
> 1227 04:45:38.000000  5189 switchboard.cpp:916] I/O switchboard server process for container 4d4074fa-bc87-471b-8659-08e519b68e13.16d02532-675a-4acb-964d-57459ecf6b67.check-e91521a3-bf72-4ac4-8ead-3950e31cf09e has terminated (status=N/A)
> Note the timestamps: the switchboard server only terminated days after the SIGTERM was sent.
> *Root Cause:*
> Fundamentally, this is caused by a race between the *.discard()* triggered by the check container's timeout and the IOSB extracting the ContainerIO object. This race can be exposed by an overloaded or slow agent process. Here is how the race gets triggered:
> # Right after the IOSB server process is running, the check container times out and the checker process returns a failure, which closes the HTTP connection with the agent.
> # On the agent side, when the connection breaks, the handler triggers a discard on the returned future, which results in containerizer->launch()'s future transitioning to the DISCARDED state.
> # In the containerizer, the DISCARDED state is propagated back to the IOSB prepare(), which stops its continuation at *extracting the containerIO* (extracting it is what eventually leads to the object being cleaned up and the FDs, i.e. one end of the pipes created in the IOSB, being closed in its destructor).
> # The agent starts to destroy the container due to the discarded launch result, and asks the IOSB to clean up the container.
> # The IOSB server is still running, so the agent sends it a SIGTERM.
> # The SIGTERM handler unblocks the IOSB so that it can redirect stdout/stderr from the container to the logger before exiting.
> # io::redirect() calls io::splice(), which keeps reading from the other end of those pipes forever because the FDs held by the never-extracted containerIO are never closed (see the sketch just below).
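> To see why that read never returns, here is a minimal, self-contained sketch (not Mesos code; names such as unextractedContainerIO are hypothetical): a pipe's read end only reports EOF once every copy of its write end is closed, and the unextracted containerIO is exactly the thing keeping one copy open.
> {code:cpp}
> // Minimal stand-in for the stuck io::splice(): a reader blocks on a pipe
> // whose other end is still held open by an object that is never destroyed.
> // Everything here is illustrative only; it is not the actual switchboard code.
> #include <unistd.h>
>
> #include <chrono>
> #include <iostream>
> #include <thread>
> #include <vector>
>
> int main()
> {
>   int fds[2];
>   if (pipe(fds) != 0) {
>     return 1;
>   }
>
>   // The write end stays open for the lifetime of this object, the way the
>   // pipe FDs stay open inside the ContainerIO that was never extracted.
>   std::vector<int> unextractedContainerIO = {fds[1]};
>
>   std::thread server([readFd = fds[0]]() {
>     // Plays the role of io::splice() inside the I/O switchboard server:
>     // read() only returns 0 (EOF) once all write ends are closed.
>     char buffer[4096];
>     while (read(readFd, buffer, sizeof(buffer)) > 0) {
>       // Forward the data to the container logger (omitted).
>     }
>     std::cout << "EOF reached, the server can now exit" << std::endl;
>   });
>
>   // In the bug nothing ever closes unextractedContainerIO[0], so the reader
>   // never sees EOF and cleanup waits on the server forever. Close it here
>   // only so that this demo terminates:
>   std::this_thread::sleep_for(std::chrono::seconds(2));
>   close(unextractedContainerIO[0]);
>
>   server.join();
>   close(fds[0]);
>   return 0;
> }
> {code}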
> This issue is *not easy to reproduce* except on a busy agent, because the timeout has to happen exactly *AFTER* the IOSB server is running and *BEFORE* the IOSB's containerIO is extracted; a simplified sketch of that window follows.
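> None of the types below (LaunchFuture, ContainerIO::closeFds) are real Mesos or libprocess APIs; they only stand in for the fact that the step that eventually destroys the containerIO and closes its FDs runs as a continuation of the launch future, so a discard that lands between "server started" and "extraction" skips it entirely.
> {code:cpp}
> // Illustrative only: a toy "future" that drops its continuation when it is
> // discarded, mirroring (very loosely) how a continuation is skipped once the
> // launch future ends up DISCARDED.
> #include <functional>
> #include <iostream>
>
> enum class State { PENDING, READY, DISCARDED };
>
> struct LaunchFuture
> {
>   State state = State::PENDING;
>
>   // Runs the continuation only if the future completed successfully.
>   void then(const std::function<void()>& continuation)
>   {
>     if (state == State::READY) {
>       continuation();
>     }
>   }
> };
>
> struct ContainerIO
> {
>   // Hypothetically holds one end of the pipes created by the switchboard.
>   bool fdsClosed = false;
>   void closeFds() { fdsClosed = true; }
> };
>
> int main()
> {
>   ContainerIO containerIO;  // Created while the IOSB is preparing.
>   LaunchFuture launch;      // Stands in for containerizer->launch()'s future.
>
>   // Steps 1-2: the IOSB server is already running when the check times out,
>   // the HTTP connection breaks and the handler discards the launch future.
>   launch.state = State::DISCARDED;
>
>   // Step 3: the continuation that would extract the containerIO (and, via
>   // its destructor, close the pipe FDs) never runs.
>   launch.then([&]() { containerIO.closeFds(); });
>
>   std::cout << std::boolalpha
>             << "pipe FDs closed: " << containerIO.fdsClosed << std::endl;
>   // Prints "pipe FDs closed: false", so the switchboard server's
>   // io::splice() never sees EOF and the cleanup gets stuck.
>   return 0;
> }
> {code}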



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)