You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@mesos.apache.org by Gastón Kleiman <ga...@mesosphere.com> on 2017/02/01 11:30:35 UTC
Re: Review Request 55901: Added support for command health checks to
the default executor.
> On Jan. 28, 2017, 1:17 a.m., Vinod Kone wrote:
> > src/checks/health_checker.cpp, line 480
> > <https://reviews.apache.org/r/55901/diff/1/?file=1613998#file1613998line480>
> >
> > launch nested container session returns a streaming response, how come you are calling `post()` helper here which expects a non-streaming response?
> >
> > probably one of the reasons why your test is hanging.
>
> Gast�n Kleiman wrote:
> The `post()` helper delegates to `process::http::request()`, which takes boolean flag (`streamedResponse`). If this flag is set to `false`, libprocess will convert a PIPE (streaming) response into a BODY (non-streaming) response. This means that the `Future` returned by the helper will not be completed until the server closes the connection.
>
> Relevant links:
> - https://github.com/apache/mesos/blob/2d0195eed54feac41485fb1503dc4004e5500c81/3rdparty/libprocess/include/process/http.hpp#L953-L955
> - https://github.com/apache/mesos/blob/2d0195eed54feac41485fb1503dc4004e5500c81/3rdparty/libprocess/src/http.cpp#L1191-L1197
> - https://github.com/apache/mesos/blob/2d0195eed54feac41485fb1503dc4004e5500c81/3rdparty/libprocess/src/http.cpp#L1005-L1022
>
> In these particular cases, it means that the `Future` will not be completed until the container exits, which is exactly what we need.
>
>
> Regarding the test hanging in Linux, some further debugging seem to indicate that test hangs on `process::Clock::settle()`, because there's a race + deadlock in `RateLimiter` that leaves a process stuck in `RUNNING`. I'll dig deeper on Monday, but here's some evidence:
>
> # Snippet from a successful test run that "leaks" a running process
>
> ```
> [==========] Running 1 test from 1 test case.
> [----------] Global test environment set-up.
> [----------] 1 test from HealthCheckTest
> [ RUN ] HealthCheckTest.DefaultExecutorCmdHealthCheck
>
> [...]
>
> E0127 11:18:06.607446 8665 limiter.hpp:123] !!!! LIMITER _acquire
> E0127 11:18:06.607471 8665 limiter.hpp:129] !!!! LIMITER There are 1 promises
> E0127 11:18:06.608064 8665 limiter.hpp:186] !!!! LIMITER destroy
>
> **** DEADLOCK DETECTED! ****
> You are waiting on process __limiter__(1419)@192.99.40.208:41947 that it is currently executing.
>
> [...]
>
> [ OK ] HealthCheckTest.DefaultExecutorCmdHealthCheck (592 ms)
> [----------] 1 test from HealthCheckTest (593 ms total)
>
> [----------] Global test environment tear-down
> [==========] 1 test from 1 test case ran. (603 ms total)
> [ PASSED ] 1 test.
>
> Repeating all tests (iteration 474) . . .
> ```
>
> # Snippet from a hung run
>
> ```
> [...]
> E0127 11:18:06.878636 8640 cluster.cpp:358] !!! Settling the clock
> E0127 11:18:06.878646 8640 process.cpp:3491] !!! Attempting to acquire mutex
> E0127 11:18:06.878654 8640 process.cpp:3494] !!! !runq.empty()?
> E0127 11:18:06.878659 8640 process.cpp:3502] !!! running.load() > 0?
> E0127 11:18:06.878667 8640 process.cpp:3504] !!! 1 processes still running
> E0127 11:18:06.878676 8640 process.cpp:3509] !!!! Process: __authentication_router__(1) state: 3
> E0127 11:18:06.878684 8640 process.cpp:3509] !!!! Process: __basic_authenticator__(1893) state: 3
> E0127 11:18:06.878691 8640 process.cpp:3509] !!!! Process: __basic_authenticator__(1894) state: 3
> E0127 11:18:06.878697 8640 process.cpp:3509] !!!! Process: __basic_authenticator__(1895) state: 3
> E0127 11:18:06.878705 8640 process.cpp:3509] !!!! Process: __gc__ state: 3
> E0127 11:18:06.878711 8640 process.cpp:3509] !!!! Process: __limiter__(1419) state: 2
> E0127 11:18:06.878718 8640 process.cpp:3509] !!!! Process: __processes__ state: 3
> E0127 11:18:06.878725 8640 process.cpp:3509] !!!! Process: __reaper__(1) state: 3
> E0127 11:18:06.878731 8640 process.cpp:3509] !!!! Process: crammd5-authenticator(474) state: 3
> E0127 11:18:06.878738 8640 process.cpp:3509] !!!! Process: files state: 3
> E0127 11:18:06.878744 8640 process.cpp:3509] !!!! Process: help state: 3
> E0127 11:18:06.878751 8640 process.cpp:3509] !!!! Process: hierarchical-allocator(474) state: 3
> E0127 11:18:06.878757 8640 process.cpp:3509] !!!! Process: in-memory-storage(474) state: 3
> E0127 11:18:06.878764 8640 process.cpp:3509] !!!! Process: local-authorizer(947) state: 3
> E0127 11:18:06.878772 8640 process.cpp:3509] !!!! Process: logging state: 3
> E0127 11:18:06.878777 8640 process.cpp:3509] !!!! Process: master state: 3
> E0127 11:18:06.878792 8640 process.cpp:3509] !!!! Process: metrics state: 3
> E0127 11:18:06.878800 8640 process.cpp:3509] !!!! Process: profiler state: 3
> E0127 11:18:06.878806 8640 process.cpp:3509] !!!! Process: registrar(474) state: 3
> E0127 11:18:06.878813 8640 process.cpp:3509] !!!! Process: standalone-master-detector(1420) state: 3
> E0127 11:18:06.878820 8640 process.cpp:3509] !!!! Process: system state: 3
> E0127 11:18:06.878828 8640 process.cpp:3509] !!!! Process: version state: 3
> E0127 11:18:06.878834 8640 process.cpp:3509] !!!! Process: whitelist(474) state: 3
> E0127 11:18:06.878840 8640 process.cpp:3491] !!! Attempting to acquire mutex
> E0127 11:18:06.878846 8640 process.cpp:3494] !!! !runq.empty()?
> E0127 11:18:06.878852 8640 process.cpp:3502] !!! running.load() > 0?
> E0127 11:18:06.878859 8640 process.cpp:3504] !!! 1 processes still running
> E0127 11:18:06.878867 8640 process.cpp:3509] !!!! Process: __authentication_router__(1) state: 3
> E0127 11:18:06.878873 8640 process.cpp:3509] !!!! Process: __basic_authenticator__(1893) state: 3
> E0127 11:18:06.878880 8640 process.cpp:3509] !!!! Process: __basic_authenticator__(1894) state: 3
> E0127 11:18:06.878887 8640 process.cpp:3509] !!!! Process: __basic_authenticator__(1895) state: 3
> E0127 11:18:06.878895 8640 process.cpp:3509] !!!! Process: __gc__ state: 3
> E0127 11:18:06.878901 8640 process.cpp:3509] !!!! Process: __limiter__(1419) state: 2
> E0127 11:18:06.878907 8640 process.cpp:3509] !!!! Process: __processes__ state: 3
> E0127 11:18:06.878913 8640 process.cpp:3509] !!!! Process: __reaper__(1) state: 3
> E0127 11:18:06.878921 8640 process.cpp:3509] !!!! Process: crammd5-authenticator(474) state: 3
> E0127 11:18:06.878927 8640 process.cpp:3509] !!!! Process: files state: 3
> E0127 11:18:06.878933 8640 process.cpp:3509] !!!! Process: help state: 3
> E0127 11:18:06.878940 8640 process.cpp:3509] !!!! Process: hierarchical-allocator(474) state: 3
> E0127 11:18:06.878947 8640 process.cpp:3509] !!!! Process: in-memory-storage(474) state: 3
> E0127 11:18:06.878953 8640 process.cpp:3509] !!!! Process: local-authorizer(947) state: 3
> E0127 11:18:06.878960 8640 process.cpp:3509] !!!! Process: logging state: 3
> E0127 11:18:06.878967 8640 process.cpp:3509] !!!! Process: master state: 3
> E0127 11:18:06.878973 8640 process.cpp:3509] !!!! Process: metrics state: 3
> E0127 11:18:06.878979 8640 process.cpp:3509] !!!! Process: profiler state: 3
> E0127 11:18:06.878986 8640 process.cpp:3509] !!!! Process: registrar(474) state: 3
> E0127 11:18:06.878993 8640 process.cpp:3509] !!!! Process: standalone-master-detector(1420) state: 3
> E0127 11:18:06.878999 8640 process.cpp:3509] !!!! Process: system state: 3
> E0127 11:18:06.879006 8640 process.cpp:3509] !!!! Process: version state: 3
> E0127 11:18:06.879014 8640 process.cpp:3509] !!!! Process: whitelist(474) state: 3
> [...] (repeats ad nauseam)
> ```
See https://issues.apache.org/jira/browse/MESOS-7036 for an in-depth analysis of the deadlock.
- Gast�n
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55901/#review163363
-----------------------------------------------------------
On Jan. 28, 2017, 12:39 p.m., Gast�n Kleiman wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/55901/
> -----------------------------------------------------------
>
> (Updated Jan. 28, 2017, 12:39 p.m.)
>
>
> Review request for mesos, Alexander Rukletsov, Anand Mazumdar, haosdent huang, and Vinod Kone.
>
>
> Bugs: MESOS-6280
> https://issues.apache.org/jira/browse/MESOS-6280
>
>
> Repository: mesos
>
>
> Description
> -------
>
> Added support for command health checks to the default executor.
>
>
> Diffs
> -----
>
> src/checks/health_checker.hpp 95da1ff7dd6b222a93076633eb3757ec9aa43cf6
> src/checks/health_checker.cpp e70bd7936752613a4f92c70c4c61cd7cdf7c4ee5
> src/launcher/default_executor.cpp 97eee05cac8cb1f62d43e2aecc08a8e54e49eac3
> src/tests/health_check_tests.cpp 710cb66eff6c4447caa22772f0cdc97cfa582c50
>
> Diff: https://reviews.apache.org/r/55901/diff/
>
>
> Testing
> -------
>
> Introduced a new test: `HealthCheckTest.DefaultExecutorCmdHealthCheck`. It passes on Linux, but not on macOS.
>
>
> Thanks,
>
> Gast�n Kleiman
>
>