You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@mesos.apache.org by Greg Mann <gr...@mesosphere.io> on 2017/09/27 22:31:36 UTC
Re: Review Request 59746: Stopped accounting aborted container launches as failures.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59746/#review186459
-----------------------------------------------------------




src/slave/slave.cpp
Lines 5230-5232 (patched)
<https://reviews.apache.org/r/59746/#comment263073>

    It looks to me like the significant functional difference in this patch (in addition to the metrics change) is that we will no longer call `containerizer->destroy(containerId)` when the executor state is TERMINATING.
    
    After looking at the code, it seems to me that in all cases where the executor state is TERMINATING, we have either already called `containerizer->destroy`, or we have registered a callback which will do so. So, this seems safe to me, but I wanted to write this comment to see if our understandings of the change and its impact are correct.



src/slave/slave.cpp
Lines 5225-5229 (original), 5233-5237 (patched)
<https://reviews.apache.org/r/59746/#comment263011>

    What do you think about moving this log line up into the `if (!future.isReady())` block? In the case where `executor == nullptr`, the logs would not tell us if the future wasn't ready, which seems potentially useful when debugging.



src/slave/slave.cpp
Lines 5250-5253 (patched)
<https://reviews.apache.org/r/59746/#comment263072>

    This comment is a bit confusing to me. I'm not sure if this captures the precise meaning you want, but here's a suggestion:
    
    "Executor's `pendingTermination` may have already been set by a callback further down the chain. However, we are likely to have more context for this failure here, thus we overwrite the termination to ensure that the state, reason, and message are set correctly."


- Greg Mann


On Sept. 27, 2017, 4:47 p.m., Alexander Rukletsov wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/59746/
> -----------------------------------------------------------
> 
> (Updated Sept. 27, 2017, 4:47 p.m.)
> 
> 
> Review request for mesos, Greg Mann, Ian Downes, Jie Yu, Joseph Wu, Jan Schlicht, and Vinod Kone.
> 
> 
> Bugs: MESOS-7601
>     https://issues.apache.org/jira/browse/MESOS-7601
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> The container launch future might be failed or discarded (depending
> on the containerizer implementation) if the launch has been aborted,
> for example, a framework might have stopped while its task are being
> started. Such failures should not be accounted as launch errors.
> 
> 
> Diffs
> -----
> 
>   src/slave/slave.cpp 50d2a10cd68f6611efd4e691e5325e6e0c06f33a 
> 
> 
> Diff: https://reviews.apache.org/r/59746/diff/3/
> 
> 
> Testing
> -------
> 
> `make check` on several Linux distros.
> 
> Additional manual tests for (1) mesos and (1) docker containerizers. The framework is asked to exit right after it submits the task to mesos.
> 
> (1) With mesos c-zer
> m: `./bin/mesos-master.sh --work_dir=./m`
> a: `GLOG_v=1 sudo ./bin/mesos-agent.sh --master=<ip>:5050 --work_dir=./a --containerizers=mesos --image_providers="DOCKER" --isolation=filesystem/linux,docker/runtime`
> f: `./src/mesos-execute --master=<ip>:5050 --containerizer=mesos --docker_image=fedora:25 --name=pull-test --command="sleep 1000"`
> 
> (2) With docker c-zer
> m: `./bin/mesos-master.sh --work_dir=./m`
> a: `GLOG_v=1 sudo ./bin/mesos-agent.sh --master=<ip>:5050 --work_dir=./a --containerizers=docker`
> f: `./src/mesos-execute --master=<ip>:5050 --containerizer=docker --docker_image=fedora:25 --name=pull-test --command="sleep 1000"`
> 
> 
> Thanks,
> 
> Alexander Rukletsov
> 
>