Posted to reviews@mesos.apache.org by Gastón Kleiman <ga...@mesosphere.com> on 2017/02/01 11:30:35 UTC

Re: Review Request 55901: Added support for command health checks to the default executor.


> On Jan. 28, 2017, 1:17 a.m., Vinod Kone wrote:
> > src/checks/health_checker.cpp, line 480
> > <https://reviews.apache.org/r/55901/diff/1/?file=1613998#file1613998line480>
> >
> >     launch nested container session returns a streaming response, how come you are calling `post()` helper here which expects a non-streaming response?
> >     
> >     probably one of the reasons why your test is hanging.
> 
> Gastón Kleiman wrote:
>     The `post()` helper delegates to `process::http::request()`, which takes a boolean flag (`streamedResponse`). If this flag is set to `false`, libprocess converts a PIPE (streaming) response into a BODY (non-streaming) response, so the `Future` returned by the helper is not completed until the server closes the connection.
>     
>     Relevant links:
>      - https://github.com/apache/mesos/blob/2d0195eed54feac41485fb1503dc4004e5500c81/3rdparty/libprocess/include/process/http.hpp#L953-L955
>      - https://github.com/apache/mesos/blob/2d0195eed54feac41485fb1503dc4004e5500c81/3rdparty/libprocess/src/http.cpp#L1191-L1197
>      - https://github.com/apache/mesos/blob/2d0195eed54feac41485fb1503dc4004e5500c81/3rdparty/libprocess/src/http.cpp#L1005-L1022
>     
>     In these particular cases, it means that the `Future` will not be completed until the container exits, which is exactly what we need.
>     
>     
>     Regarding the test hanging on Linux, further debugging seems to indicate that the test hangs in `process::Clock::settle()`, because there is a race and deadlock in `RateLimiter` that leaves a process stuck in `RUNNING`. I'll dig deeper on Monday, but here's some evidence:
>     
>     # Snippet from a successful test run that "leaks" a running process
>     
>     ```
>     [==========] Running 1 test from 1 test case.
>     [----------] Global test environment set-up.
>     [----------] 1 test from HealthCheckTest
>     [ RUN      ] HealthCheckTest.DefaultExecutorCmdHealthCheck
>     
>     [...]
>     
>     E0127 11:18:06.607446  8665 limiter.hpp:123] !!!! LIMITER _acquire
>     E0127 11:18:06.607471  8665 limiter.hpp:129] !!!! LIMITER There are 1 promises
>     E0127 11:18:06.608064  8665 limiter.hpp:186] !!!! LIMITER destroy
>     
>     **** DEADLOCK DETECTED! ****
>     You are waiting on process __limiter__(1419)@192.99.40.208:41947 that it is currently executing.
>     
>     [...]
>     
>     [       OK ] HealthCheckTest.DefaultExecutorCmdHealthCheck (592 ms)
>     [----------] 1 test from HealthCheckTest (593 ms total)
>     
>     [----------] Global test environment tear-down
>     [==========] 1 test from 1 test case ran. (603 ms total)
>     [  PASSED  ] 1 test.
>     
>     Repeating all tests (iteration 474) . . .
>     ```
>     
>     # Snippet from a hung run
>     
>     ```
>     [...]
>     E0127 11:18:06.878636  8640 cluster.cpp:358] !!! Settling the clock
>     E0127 11:18:06.878646  8640 process.cpp:3491] !!! Attempting to acquire mutex
>     E0127 11:18:06.878654  8640 process.cpp:3494] !!! !runq.empty()?
>     E0127 11:18:06.878659  8640 process.cpp:3502] !!! running.load() > 0?
>     E0127 11:18:06.878667  8640 process.cpp:3504] !!! 1 processes still running
>     E0127 11:18:06.878676  8640 process.cpp:3509] !!!! Process: __authentication_router__(1) state: 3
>     E0127 11:18:06.878684  8640 process.cpp:3509] !!!! Process: __basic_authenticator__(1893) state: 3
>     E0127 11:18:06.878691  8640 process.cpp:3509] !!!! Process: __basic_authenticator__(1894) state: 3
>     E0127 11:18:06.878697  8640 process.cpp:3509] !!!! Process: __basic_authenticator__(1895) state: 3
>     E0127 11:18:06.878705  8640 process.cpp:3509] !!!! Process: __gc__ state: 3
>     E0127 11:18:06.878711  8640 process.cpp:3509] !!!! Process: __limiter__(1419) state: 2
>     E0127 11:18:06.878718  8640 process.cpp:3509] !!!! Process: __processes__ state: 3
>     E0127 11:18:06.878725  8640 process.cpp:3509] !!!! Process: __reaper__(1) state: 3
>     E0127 11:18:06.878731  8640 process.cpp:3509] !!!! Process: crammd5-authenticator(474) state: 3
>     E0127 11:18:06.878738  8640 process.cpp:3509] !!!! Process: files state: 3
>     E0127 11:18:06.878744  8640 process.cpp:3509] !!!! Process: help state: 3
>     E0127 11:18:06.878751  8640 process.cpp:3509] !!!! Process: hierarchical-allocator(474) state: 3
>     E0127 11:18:06.878757  8640 process.cpp:3509] !!!! Process: in-memory-storage(474) state: 3
>     E0127 11:18:06.878764  8640 process.cpp:3509] !!!! Process: local-authorizer(947) state: 3
>     E0127 11:18:06.878772  8640 process.cpp:3509] !!!! Process: logging state: 3
>     E0127 11:18:06.878777  8640 process.cpp:3509] !!!! Process: master state: 3
>     E0127 11:18:06.878792  8640 process.cpp:3509] !!!! Process: metrics state: 3
>     E0127 11:18:06.878800  8640 process.cpp:3509] !!!! Process: profiler state: 3
>     E0127 11:18:06.878806  8640 process.cpp:3509] !!!! Process: registrar(474) state: 3
>     E0127 11:18:06.878813  8640 process.cpp:3509] !!!! Process: standalone-master-detector(1420) state: 3
>     E0127 11:18:06.878820  8640 process.cpp:3509] !!!! Process: system state: 3
>     E0127 11:18:06.878828  8640 process.cpp:3509] !!!! Process: version state: 3
>     E0127 11:18:06.878834  8640 process.cpp:3509] !!!! Process: whitelist(474) state: 3
>     E0127 11:18:06.878840  8640 process.cpp:3491] !!! Attempting to acquire mutex
>     E0127 11:18:06.878846  8640 process.cpp:3494] !!! !runq.empty()?
>     E0127 11:18:06.878852  8640 process.cpp:3502] !!! running.load() > 0?
>     E0127 11:18:06.878859  8640 process.cpp:3504] !!! 1 processes still running
>     E0127 11:18:06.878867  8640 process.cpp:3509] !!!! Process: __authentication_router__(1) state: 3
>     E0127 11:18:06.878873  8640 process.cpp:3509] !!!! Process: __basic_authenticator__(1893) state: 3
>     E0127 11:18:06.878880  8640 process.cpp:3509] !!!! Process: __basic_authenticator__(1894) state: 3
>     E0127 11:18:06.878887  8640 process.cpp:3509] !!!! Process: __basic_authenticator__(1895) state: 3
>     E0127 11:18:06.878895  8640 process.cpp:3509] !!!! Process: __gc__ state: 3
>     E0127 11:18:06.878901  8640 process.cpp:3509] !!!! Process: __limiter__(1419) state: 2
>     E0127 11:18:06.878907  8640 process.cpp:3509] !!!! Process: __processes__ state: 3
>     E0127 11:18:06.878913  8640 process.cpp:3509] !!!! Process: __reaper__(1) state: 3
>     E0127 11:18:06.878921  8640 process.cpp:3509] !!!! Process: crammd5-authenticator(474) state: 3
>     E0127 11:18:06.878927  8640 process.cpp:3509] !!!! Process: files state: 3
>     E0127 11:18:06.878933  8640 process.cpp:3509] !!!! Process: help state: 3
>     E0127 11:18:06.878940  8640 process.cpp:3509] !!!! Process: hierarchical-allocator(474) state: 3
>     E0127 11:18:06.878947  8640 process.cpp:3509] !!!! Process: in-memory-storage(474) state: 3
>     E0127 11:18:06.878953  8640 process.cpp:3509] !!!! Process: local-authorizer(947) state: 3
>     E0127 11:18:06.878960  8640 process.cpp:3509] !!!! Process: logging state: 3
>     E0127 11:18:06.878967  8640 process.cpp:3509] !!!! Process: master state: 3
>     E0127 11:18:06.878973  8640 process.cpp:3509] !!!! Process: metrics state: 3
>     E0127 11:18:06.878979  8640 process.cpp:3509] !!!! Process: profiler state: 3
>     E0127 11:18:06.878986  8640 process.cpp:3509] !!!! Process: registrar(474) state: 3
>     E0127 11:18:06.878993  8640 process.cpp:3509] !!!! Process: standalone-master-detector(1420) state: 3
>     E0127 11:18:06.878999  8640 process.cpp:3509] !!!! Process: system state: 3
>     E0127 11:18:06.879006  8640 process.cpp:3509] !!!! Process: version state: 3
>     E0127 11:18:06.879014  8640 process.cpp:3509] !!!! Process: whitelist(474) state: 3
>     [...] (repeats ad nauseam)
>     ```
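
To illustrate the `streamedResponse = false` behaviour described in the quoted reply above, here is a minimal, self-contained sketch. It does not use libprocess: `ToyPipe` and `BodyCollector` are hypothetical names invented for this example, and the code only models the idea that a buffered (BODY) result cannot be complete before the writer closes the stream.

```cpp
#include <cassert>
#include <queue>
#include <string>

// Hypothetical stand-in for a streaming (PIPE) response: the writer pushes
// chunks over time and eventually closes the pipe.
class ToyPipe {
public:
  void write(const std::string& chunk) {
    assert(!closed_ && "cannot write to a closed pipe");
    chunks_.push(chunk);
  }

  void close() { closed_ = true; }

  // Copies the next available chunk into `chunk`; returns false if nothing
  // is available right now.
  bool read(std::string* chunk) {
    if (chunks_.empty()) {
      return false;
    }
    *chunk = chunks_.front();
    chunks_.pop();
    return true;
  }

  // End-of-stream: the writer has closed the pipe and every chunk was read.
  bool eof() const { return closed_ && chunks_.empty(); }

private:
  std::queue<std::string> chunks_;
  bool closed_ = false;
};

// Hypothetical stand-in for a PIPE-to-BODY conversion: it accumulates chunks
// and reports completion only once the pipe has reached end-of-stream.
class BodyCollector {
public:
  explicit BodyCollector(ToyPipe& pipe) : pipe_(pipe) {}

  // Drains whatever is available; returns true only when the body is complete.
  bool poll() {
    std::string chunk;
    while (pipe_.read(&chunk)) {
      body_ += chunk;
    }
    return pipe_.eof();
  }

  const std::string& body() const { return body_; }

private:
  ToyPipe& pipe_;
  std::string body_;
};
```

In this model, `poll()` keeps returning `false` while the writer is still alive, no matter how much output has arrived, and returns `true` only after `close()`. That mirrors the point made above: the buffered result becomes available when the server closes the connection, which in the health check case is when the nested container exits.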

See https://issues.apache.org/jira/browse/MESOS-7036 for an in-depth analysis of the deadlock.
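
For readers unfamiliar with the "DEADLOCK DETECTED" message in the first log snippet, the following is a rough sketch of the kind of situation it reports: code running inside a process that waits on that same process. This is not libprocess code and makes no claim about how libprocess implements the check or about the actual cause analysed in MESOS-7036; `ToyProcess`, `run()` and `waitWouldDeadlock()` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <string>
#include <utility>

class ToyProcess;

// Hypothetical bookkeeping: the process whose code the current thread is
// executing, or nullptr if the thread is not inside any process.
static thread_local const ToyProcess* executing = nullptr;

class ToyProcess {
public:
  explicit ToyProcess(std::string name) : name_(std::move(name)) {}

  const std::string& name() const { return name_; }

  // Runs `f` "inside" this process, marking it as the executing process for
  // the duration of the call and restoring the previous value afterwards.
  template <typename F>
  void run(F&& f) {
    struct Scope {
      const ToyProcess* self;
      const ToyProcess* previous;
      explicit Scope(const ToyProcess* p) : self(p), previous(executing) {
        executing = self;
      }
      ~Scope() {
        assert(executing == self);
        executing = previous;
      }
    } scope(this);

    std::forward<F>(f)();
  }

  // True if a blocking wait on this process, issued from the current thread,
  // could never return because the thread is busy executing this process.
  bool waitWouldDeadlock() const { return executing == this; }

private:
  std::string name_;
};
```

A caller would check `waitWouldDeadlock()` before blocking on a process. Inside `limiter.run(...)`, the check on `limiter` is true while the check on any other process is false, which is the distinction the log message draws for `__limiter__(1419)`.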


- Gastón


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55901/#review163363
-----------------------------------------------------------


On Jan. 28, 2017, 12:39 p.m., Gastón Kleiman wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/55901/
> -----------------------------------------------------------
> 
> (Updated Jan. 28, 2017, 12:39 p.m.)
> 
> 
> Review request for mesos, Alexander Rukletsov, Anand Mazumdar, haosdent huang, and Vinod Kone.
> 
> 
> Bugs: MESOS-6280
>     https://issues.apache.org/jira/browse/MESOS-6280
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Added support for command health checks to the default executor.
> 
> 
> Diffs
> -----
> 
>   src/checks/health_checker.hpp 95da1ff7dd6b222a93076633eb3757ec9aa43cf6 
>   src/checks/health_checker.cpp e70bd7936752613a4f92c70c4c61cd7cdf7c4ee5 
>   src/launcher/default_executor.cpp 97eee05cac8cb1f62d43e2aecc08a8e54e49eac3 
>   src/tests/health_check_tests.cpp 710cb66eff6c4447caa22772f0cdc97cfa582c50 
> 
> Diff: https://reviews.apache.org/r/55901/diff/
> 
> 
> Testing
> -------
> 
> Introduced a new test: `HealthCheckTest.DefaultExecutorCmdHealthCheck`. It passes on Linux, but not on macOS.
> 
> 
> Thanks,
> 
> Gastón Kleiman
> 
>