You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@mesos.apache.org by "Andrei Budnik (JIRA)" <ji...@apache.org> on 2018/08/29 12:21:00 UTC

[jira] [Comment Edited] (MESOS-8545) AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky.

    [ https://issues.apache.org/jira/browse/MESOS-8545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596258#comment-16596258 ] 

Andrei Budnik edited comment on MESOS-8545 at 8/29/18 12:20 PM:
----------------------------------------------------------------

When the agent handles `ATTACH_CONTAINER_INPUT` call, it creates an HTTP [streaming connection|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/http.cpp#L3104] to IOSwitchboard.
 After the agent [sends|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/http.cpp#L3141] a request to IOSwitchboard, a new instance of `ConnectionProcess` is created, which calls [`ConnectionProcess::read()`|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/http.cpp#L1220] to read an HTTP response from IOSwitchboard.
 If the socket is closed before a `\r\n\r\n` response is received, the `ConnectionProcess` calls `[disconnect()|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/http.cpp#L1326]`, which in turn [flushes `pipeline`|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/http.cpp#L1197-L1201] containing a `Response` promise. This leads to responding back (to the `AttachInputToNestedContainerSession` [test|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/tests/api_tests.cpp#L7942-L7943]) an `HTTP 500` error with body "Disconnected".

When io redirect finishes, IOSwitchboardServerProcess calls `terminate(self(), false)` (here [\[1\]|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/containerizer/mesos/io/switchboard.cpp#L1262] or there [\[2\]|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/containerizer/mesos/io/switchboard.cpp#L1713]). Then, `IOSwitchboardServerProcess::finalize()` sets a value to the [`promise`|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/containerizer/mesos/io/switchboard.cpp#L1304-L1308], which [unblocks `main()`|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/containerizer/mesos/io/switchboard_main.cpp#L149-L150] function. As a result, IOSwitchboard process terminates immediately.

When IOSwitchboard terminates, there could be not yet [written|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/http.cpp#L1699] response messages to the socket. So, if any delay occurs before [sending|https://github.com/apache/mesos/blob/95bbe784da51b3a7eaeb9127e2541ea0b2af07b5/3rdparty/libprocess/src/http.cpp#L1742-L1748] the response back to the agent, the socket will be closed due to IOSwitchboard process termination. That leads to the aforementioned premature socket close in the agent.

See my previous comment including steps to reproduce.


was (Author: abudnik):
When the agent handles `ATTACH_CONTAINER_INPUT` call, it creates an HTTP [streaming connection|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/http.cpp#L3104] to IOSwitchboard.
 After the agent [sends|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/http.cpp#L3141] a request to IOSwitchboard, a new instance of `ConnectionProcess` is created, which calls [`ConnectionProcess::read()`|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/http.cpp#L1220] to read an HTTP response from IOSwitchboard.
 If the socket is closed before a `\r\n\r\n` response is received, the `ConnectionProcess` calls `[disconnect()|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/http.cpp#L1326]`, which in turn [flushes `pipeline`|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/http.cpp#L1197-L1201] containing a `Response` promise. This leads to responding back (to the `AttachInputToNestedContainerSession` [test|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/tests/api_tests.cpp#L7942-L7943]) an `HTTP 500` error with body "Disconnected".

When io redirect finishes, IOSwitchboardServerProcess calls `terminate(self(), false)` (here [[1]|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/containerizer/mesos/io/switchboard.cpp#L1262] or there [[2]|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/containerizer/mesos/io/switchboard.cpp#L1713]). Then, `IOSwitchboardServerProcess::finalize()` sets a value to the [`promise`|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/containerizer/mesos/io/switchboard.cpp#L1304-L1308], which [unblocks `main()`|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/src/slave/containerizer/mesos/io/switchboard_main.cpp#L149-L150] function. As a result, IOSwitchboard process terminates immediately.

When IOSwitchboard terminates, there could be not yet [written|https://github.com/apache/mesos/blob/12636838f78ad06b66466b3d2fa9c9db94ac70b2/3rdparty/libprocess/src/http.cpp#L1699] response messages to the socket. So, if any delay occurs before [sending|https://github.com/apache/mesos/blob/95bbe784da51b3a7eaeb9127e2541ea0b2af07b5/3rdparty/libprocess/src/http.cpp#L1742-L1748] the response back to the agent, the socket will be closed due to IOSwitchboard process termination. That leads to the aforementioned premature socket close in the agent.

See my previous comment including steps to reproduce.

> AgentAPIStreamingTest.AttachInputToNestedContainerSession is flaky.
> -------------------------------------------------------------------
>
>                 Key: MESOS-8545
>                 URL: https://issues.apache.org/jira/browse/MESOS-8545
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent
>    Affects Versions: 1.5.0, 1.6.1, 1.7.0
>            Reporter: Andrei Budnik
>            Assignee: Andrei Budnik
>            Priority: Major
>              Labels: Mesosphere, flaky-test
>         Attachments: AgentAPIStreamingTest.AttachInputToNestedContainerSession-badrun.txt, AgentAPIStreamingTest.AttachInputToNestedContainerSession-badrun2.txt
>
>
> {code:java}
> I0205 17:11:01.091872 4898 http_proxy.cpp:132] Returning '500 Internal Server Error' for '/slave(974)/api/v1' (Disconnected)
> /home/centos/workspace/mesos/Mesos_CI-build/FLAG/CMake/label/mesos-ec2-centos-7/mesos/src/tests/api_tests.cpp:6596: Failure
> Value of: (response).get().status
> Actual: "500 Internal Server Error"
> Expected: http::OK().status
> Which is: "200 OK"
> Body: "Disconnected"
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)