You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@mesos.apache.org by Alexander Rukletsov <ru...@gmail.com> on 2018/09/03 12:07:11 UTC

Re: Review Request 68495: Made command check always waits before removing the nested container.


> On Aug. 27, 2018, 5:16 p.m., Alexander Rukletsov wrote:
> > src/checks/checker_process.cpp
> > Lines 878-889 (original), 890-901 (patched)
> > <https://reviews.apache.org/r/68495/diff/1/?file=2077041#file2077041line890>
> >
> >     It looks like we should always call `waitNestedContainer()` after we said `previousCheckContainerId = checkContainerId;`. For example here.
> >     
> >     Maybe it makes sense to call `waitNestedContainer()` right in the beginning? We can end up calling it twice, but I think it's fine?
> 
> Qian Zhang wrote:
>     > Maybe it makes sense to call waitNestedContainer() right in the beginning? We can end up calling it twice, but I think it's fine?
>     
>     That means we will call agent API `WAIT_NESTED_CONTAINER` twice for each successful launch of check container, I think that might be a burden for agent in a large scale env. So I'd still prefer to call it only in the places where we have to do it.
> 
> Qian Zhang wrote:
>     And for the case (L879:L888, i.e., the connection to agent failed) that you pointed out above, I think when the connection to agent is back (e.g., agent starts up again), the check container will be treated as orphan container and destroyed by agent, and then we will remove it here: https://github.com/apache/mesos/blob/1.6.1/src/checks/checker_process.cpp#L616:L638. However I am going to post another patch to change these codes (https://github.com/apache/mesos/blob/1.6.1/src/checks/checker_process.cpp#L660:L664) to something like:
>     ```
>               promise->discard();
>             } else {
>               previousCheckContainerId = None();
>               _nestedCommandCheck(promise, cmd, nested);
>             }
>     ```
>     In this way, if we fail to remove the check container (e.g., due to agent has not finished recovery, or the check container is still in `DESTROYING` state), we will try to remove it again.
> 
> Qian Zhang wrote:
>     I posted another patch https://reviews.apache.org/r/68555/ as I mentioned above.

So the assumption you're making is that if `Connection::send()` fails, the container will not start and hence we don't need to call `wait()` before we destroy? It is not obvious to a reader and requires some non-local knowledge about how the UCR works. Can you please leave a comment documenting this assumption so that when it changes and someone will be trying to figure out why checks sometimes started fail, they have a hint : )


- Alexander


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68495/#review207984
-----------------------------------------------------------


On Aug. 24, 2018, 9:54 a.m., Qian Zhang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/68495/
> -----------------------------------------------------------
> 
> (Updated Aug. 24, 2018, 9:54 a.m.)
> 
> 
> Review request for mesos, Andrei Budnik, Alexander Rukletsov, Gastón Kleiman, and Gilbert Song.
> 
> 
> Bugs: MESOS-8568
>     https://issues.apache.org/jira/browse/MESOS-8568
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Made command check always waits before removing the nested container.
> 
> 
> Diffs
> -----
> 
>   src/checks/checker_process.cpp 77a76f465fe57eab89f027b5acb74c2339551678 
> 
> 
> Diff: https://reviews.apache.org/r/68495/diff/1/
> 
> 
> Testing
> -------
> 
> sudo make check
> 
> 
> Thanks,
> 
> Qian Zhang
> 
>