You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@mesos.apache.org by Qian Zhang <zh...@gmail.com> on 2019/01/10 14:52:33 UTC
Review Request 69705: Made agent not read the forked pid and
libprocess pid after reboot.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69705/
-----------------------------------------------------------
Review request for mesos, Andrei Budnik and Gilbert Song.
Bugs: MESOS-9501
https://issues.apache.org/jira/browse/MESOS-9501
Repository: mesos
Description
-------
After agent host is rebooted, the forked pid and libprocess pid in
agent's meta directory are obsolete, so we should not read them during
agent recovery, otherwise containerizer may wait for an irrelevant
process if the forked pid is reused by another process after reboot.
Diffs
-----
src/slave/state.hpp 4f3d4cefb3fdef29cce3a6abe4cf5db04d45301f
src/slave/state.cpp e7cf84993c74cf6da7fe22d5112e86e039780287
Diff: https://reviews.apache.org/r/69705/diff/1/
Testing
-------
Thanks,
Qian Zhang
Re: Review Request 69705: Made agent not read the forked pid and
libprocess pid after reboot.
Posted by Qian Zhang <zh...@gmail.com>.
> On Jan. 11, 2019, 7:30 a.m., Gilbert Song wrote:
> > Could we vertify the CI failure above `SlaveRecoveryTest/0.Reboot` is not caused by our change?
Actually it is caused by this patch :-( And I have fixed it in this patch: https://reviews.apache.org/r/69716/ .
- Qian
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69705/#review211845
-----------------------------------------------------------
On Jan. 10, 2019, 10:52 p.m., Qian Zhang wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69705/
> -----------------------------------------------------------
>
> (Updated Jan. 10, 2019, 10:52 p.m.)
>
>
> Review request for mesos, Andrei Budnik and Gilbert Song.
>
>
> Bugs: MESOS-9501
> https://issues.apache.org/jira/browse/MESOS-9501
>
>
> Repository: mesos
>
>
> Description
> -------
>
> After agent host is rebooted, the forked pid and libprocess pid in
> agent's meta directory are obsolete, so we should not read them during
> agent recovery, otherwise containerizer may wait for an irrelevant
> process if the forked pid is reused by another process after reboot.
>
>
> Diffs
> -----
>
> src/slave/state.hpp 4f3d4cefb3fdef29cce3a6abe4cf5db04d45301f
> src/slave/state.cpp e7cf84993c74cf6da7fe22d5112e86e039780287
>
>
> Diff: https://reviews.apache.org/r/69705/diff/1/
>
>
> Testing
> -------
>
>
> Thanks,
>
> Qian Zhang
>
>
Re: Review Request 69705: Made agent not read the forked pid and
libprocess pid after reboot.
Posted by Gilbert Song <so...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69705/#review211845
-----------------------------------------------------------
Could we vertify the CI failure above `SlaveRecoveryTest/0.Reboot` is not caused by our change?
- Gilbert Song
On Jan. 10, 2019, 6:52 a.m., Qian Zhang wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69705/
> -----------------------------------------------------------
>
> (Updated Jan. 10, 2019, 6:52 a.m.)
>
>
> Review request for mesos, Andrei Budnik and Gilbert Song.
>
>
> Bugs: MESOS-9501
> https://issues.apache.org/jira/browse/MESOS-9501
>
>
> Repository: mesos
>
>
> Description
> -------
>
> After agent host is rebooted, the forked pid and libprocess pid in
> agent's meta directory are obsolete, so we should not read them during
> agent recovery, otherwise containerizer may wait for an irrelevant
> process if the forked pid is reused by another process after reboot.
>
>
> Diffs
> -----
>
> src/slave/state.hpp 4f3d4cefb3fdef29cce3a6abe4cf5db04d45301f
> src/slave/state.cpp e7cf84993c74cf6da7fe22d5112e86e039780287
>
>
> Diff: https://reviews.apache.org/r/69705/diff/1/
>
>
> Testing
> -------
>
>
> Thanks,
>
> Qian Zhang
>
>
Re: Review Request 69705: Made agent not read the forked pid and
libprocess pid after reboot.
Posted by Gilbert Song <so...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69705/#review211902
-----------------------------------------------------------
Ship it!
Ship It!
- Gilbert Song
On Jan. 10, 2019, 6:52 a.m., Qian Zhang wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69705/
> -----------------------------------------------------------
>
> (Updated Jan. 10, 2019, 6:52 a.m.)
>
>
> Review request for mesos, Andrei Budnik and Gilbert Song.
>
>
> Bugs: MESOS-9501
> https://issues.apache.org/jira/browse/MESOS-9501
>
>
> Repository: mesos
>
>
> Description
> -------
>
> After agent host is rebooted, the forked pid and libprocess pid in
> agent's meta directory are obsolete, so we should not read them during
> agent recovery, otherwise containerizer may wait for an irrelevant
> process if the forked pid is reused by another process after reboot.
>
>
> Diffs
> -----
>
> src/slave/state.hpp 4f3d4cefb3fdef29cce3a6abe4cf5db04d45301f
> src/slave/state.cpp e7cf84993c74cf6da7fe22d5112e86e039780287
>
>
> Diff: https://reviews.apache.org/r/69705/diff/1/
>
>
> Testing
> -------
>
>
> Thanks,
>
> Qian Zhang
>
>
Re: Review Request 69705: Made agent not read the forked pid and
libprocess pid after reboot.
Posted by Qian Zhang <zh...@gmail.com>.
> On Jan. 10, 2019, 11:41 p.m., Vinod Kone wrote:
> > Can you write a unit test for this by spoofing the reboot?
Here it is: https://reviews.apache.org/r/69717/
- Qian
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69705/#review211829
-----------------------------------------------------------
On Jan. 10, 2019, 10:52 p.m., Qian Zhang wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69705/
> -----------------------------------------------------------
>
> (Updated Jan. 10, 2019, 10:52 p.m.)
>
>
> Review request for mesos, Andrei Budnik and Gilbert Song.
>
>
> Bugs: MESOS-9501
> https://issues.apache.org/jira/browse/MESOS-9501
>
>
> Repository: mesos
>
>
> Description
> -------
>
> After agent host is rebooted, the forked pid and libprocess pid in
> agent's meta directory are obsolete, so we should not read them during
> agent recovery, otherwise containerizer may wait for an irrelevant
> process if the forked pid is reused by another process after reboot.
>
>
> Diffs
> -----
>
> src/slave/state.hpp 4f3d4cefb3fdef29cce3a6abe4cf5db04d45301f
> src/slave/state.cpp e7cf84993c74cf6da7fe22d5112e86e039780287
>
>
> Diff: https://reviews.apache.org/r/69705/diff/1/
>
>
> Testing
> -------
>
>
> Thanks,
>
> Qian Zhang
>
>
Re: Review Request 69705: Made agent not read the forked pid and
libprocess pid after reboot.
Posted by Vinod Kone <vi...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69705/#review211829
-----------------------------------------------------------
Can you write a unit test for this by spoofing the reboot?
- Vinod Kone
On Jan. 10, 2019, 2:52 p.m., Qian Zhang wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69705/
> -----------------------------------------------------------
>
> (Updated Jan. 10, 2019, 2:52 p.m.)
>
>
> Review request for mesos, Andrei Budnik and Gilbert Song.
>
>
> Bugs: MESOS-9501
> https://issues.apache.org/jira/browse/MESOS-9501
>
>
> Repository: mesos
>
>
> Description
> -------
>
> After agent host is rebooted, the forked pid and libprocess pid in
> agent's meta directory are obsolete, so we should not read them during
> agent recovery, otherwise containerizer may wait for an irrelevant
> process if the forked pid is reused by another process after reboot.
>
>
> Diffs
> -----
>
> src/slave/state.hpp 4f3d4cefb3fdef29cce3a6abe4cf5db04d45301f
> src/slave/state.cpp e7cf84993c74cf6da7fe22d5112e86e039780287
>
>
> Diff: https://reviews.apache.org/r/69705/diff/1/
>
>
> Testing
> -------
>
>
> Thanks,
>
> Qian Zhang
>
>
Re: Review Request 69705: Made agent not read the forked pid and
libprocess pid after reboot.
Posted by Andrei Budnik <ab...@mesosphere.com>.
> On Jan. 11, 2019, 3:54 p.m., Andrei Budnik wrote:
> > src/slave/state.cpp
> > Lines 493-500 (patched)
> > <https://reviews.apache.org/r/69705/diff/1/?file=2119017#file2119017line494>
> >
> > All tests (including a new one) are passed after removing this code. Maybe we don't need to remove a pid file? Just return the `state`?
Dropping my comment, because agent may crash after recovery, but before `TASK_FAILED` status updates for all previous tasks have been checkpointed (created sentinel files).
- Andrei
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69705/#review211888
-----------------------------------------------------------
On Jan. 10, 2019, 2:52 p.m., Qian Zhang wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69705/
> -----------------------------------------------------------
>
> (Updated Jan. 10, 2019, 2:52 p.m.)
>
>
> Review request for mesos, Andrei Budnik and Gilbert Song.
>
>
> Bugs: MESOS-9501
> https://issues.apache.org/jira/browse/MESOS-9501
>
>
> Repository: mesos
>
>
> Description
> -------
>
> After agent host is rebooted, the forked pid and libprocess pid in
> agent's meta directory are obsolete, so we should not read them during
> agent recovery, otherwise containerizer may wait for an irrelevant
> process if the forked pid is reused by another process after reboot.
>
>
> Diffs
> -----
>
> src/slave/state.hpp 4f3d4cefb3fdef29cce3a6abe4cf5db04d45301f
> src/slave/state.cpp e7cf84993c74cf6da7fe22d5112e86e039780287
>
>
> Diff: https://reviews.apache.org/r/69705/diff/1/
>
>
> Testing
> -------
>
>
> Thanks,
>
> Qian Zhang
>
>
Re: Review Request 69705: Made agent not read the forked pid and
libprocess pid after reboot.
Posted by Andrei Budnik <ab...@mesosphere.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69705/#review211888
-----------------------------------------------------------
src/slave/state.cpp
Lines 493-500 (patched)
<https://reviews.apache.org/r/69705/#comment297465>
All tests (including a new one) are passed after removing this code. Maybe we don't need to remove a pid file? Just return the `state`?
- Andrei Budnik
On Jan. 10, 2019, 2:52 p.m., Qian Zhang wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69705/
> -----------------------------------------------------------
>
> (Updated Jan. 10, 2019, 2:52 p.m.)
>
>
> Review request for mesos, Andrei Budnik and Gilbert Song.
>
>
> Bugs: MESOS-9501
> https://issues.apache.org/jira/browse/MESOS-9501
>
>
> Repository: mesos
>
>
> Description
> -------
>
> After agent host is rebooted, the forked pid and libprocess pid in
> agent's meta directory are obsolete, so we should not read them during
> agent recovery, otherwise containerizer may wait for an irrelevant
> process if the forked pid is reused by another process after reboot.
>
>
> Diffs
> -----
>
> src/slave/state.hpp 4f3d4cefb3fdef29cce3a6abe4cf5db04d45301f
> src/slave/state.cpp e7cf84993c74cf6da7fe22d5112e86e039780287
>
>
> Diff: https://reviews.apache.org/r/69705/diff/1/
>
>
> Testing
> -------
>
>
> Thanks,
>
> Qian Zhang
>
>
Re: Review Request 69705: Made agent not read the forked pid and
libprocess pid after reboot.
Posted by Qian Zhang <zh...@gmail.com>.
> On Jan. 11, 2019, 7:29 a.m., Gilbert Song wrote:
> > src/slave/state.cpp
> > Line 561 (original), 592 (patched)
> > <https://reviews.apache.org/r/69705/diff/1/?file=2119017#file2119017line594>
> >
> > Could we confirm we do not care about this marker case after reboot?
After reboot, the field `state.http` will be `None()` and it will be only used in `Framework::recoverExecutor`, and in that method if we find `state.http` is `None()`, we will set executor's pid to `UPID()` to signify that the connection type for this executor is unknown. I think it is OK since agent should not try to connect any executors after a reboot.
- Qian
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69705/#review211844
-----------------------------------------------------------
On Jan. 10, 2019, 10:52 p.m., Qian Zhang wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69705/
> -----------------------------------------------------------
>
> (Updated Jan. 10, 2019, 10:52 p.m.)
>
>
> Review request for mesos, Andrei Budnik and Gilbert Song.
>
>
> Bugs: MESOS-9501
> https://issues.apache.org/jira/browse/MESOS-9501
>
>
> Repository: mesos
>
>
> Description
> -------
>
> After agent host is rebooted, the forked pid and libprocess pid in
> agent's meta directory are obsolete, so we should not read them during
> agent recovery, otherwise containerizer may wait for an irrelevant
> process if the forked pid is reused by another process after reboot.
>
>
> Diffs
> -----
>
> src/slave/state.hpp 4f3d4cefb3fdef29cce3a6abe4cf5db04d45301f
> src/slave/state.cpp e7cf84993c74cf6da7fe22d5112e86e039780287
>
>
> Diff: https://reviews.apache.org/r/69705/diff/1/
>
>
> Testing
> -------
>
>
> Thanks,
>
> Qian Zhang
>
>
Re: Review Request 69705: Made agent not read the forked pid and
libprocess pid after reboot.
Posted by Gilbert Song <so...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69705/#review211844
-----------------------------------------------------------
Ship it!
src/slave/state.cpp
Line 561 (original), 592 (patched)
<https://reviews.apache.org/r/69705/#comment297433>
Could we confirm we do not care about this marker case after reboot?
- Gilbert Song
On Jan. 10, 2019, 6:52 a.m., Qian Zhang wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69705/
> -----------------------------------------------------------
>
> (Updated Jan. 10, 2019, 6:52 a.m.)
>
>
> Review request for mesos, Andrei Budnik and Gilbert Song.
>
>
> Bugs: MESOS-9501
> https://issues.apache.org/jira/browse/MESOS-9501
>
>
> Repository: mesos
>
>
> Description
> -------
>
> After agent host is rebooted, the forked pid and libprocess pid in
> agent's meta directory are obsolete, so we should not read them during
> agent recovery, otherwise containerizer may wait for an irrelevant
> process if the forked pid is reused by another process after reboot.
>
>
> Diffs
> -----
>
> src/slave/state.hpp 4f3d4cefb3fdef29cce3a6abe4cf5db04d45301f
> src/slave/state.cpp e7cf84993c74cf6da7fe22d5112e86e039780287
>
>
> Diff: https://reviews.apache.org/r/69705/diff/1/
>
>
> Testing
> -------
>
>
> Thanks,
>
> Qian Zhang
>
>
Re: Review Request 69705: Made agent not read the forked pid and
libprocess pid after reboot.
Posted by Mesos Reviewbot Windows <re...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69705/#review211830
-----------------------------------------------------------
FAIL: Some of the unit tests failed. Please check the relevant logs.
Reviews applied: `['69705']`
Failed command: `Start-MesosCITesting`
All the build artifacts available at: http://dcos-win.westus2.cloudapp.azure.com/artifacts/mesos-reviewbot-testing/2750/mesos-review-69705
Relevant logs:
- [mesos-tests.log](http://dcos-win.westus2.cloudapp.azure.com/artifacts/mesos-reviewbot-testing/2750/mesos-review-69705/logs/mesos-tests.log):
```
W0110 16:01:56.494325 12084 slave.cpp:3933] Ignoring shutdown framework 68d97b13-a7c5-4a37-933b-31279343ea71-0000 because it is terminating
I0110 16:01:56.496320 864 master.cpp:1271] Agent 68d97b13-a7c5-4a37-933b-31279343ea71-S0 at slave(464)@192.10.1.6:51097 (windows-02.chtsmhjxogyevckjfayqqcnjda.xx.internal.cloudapp.net) disconnected
I0110 16:01:56.496320 864 master.cpp:3274] Disconnecting agent 68d97b13-a7c5-4a37-933b-31279343ea71-S0 at slave(464)@192.10.1.6:51097 (windows-02.chtsmhjxogyevckjfayqqcnjda.xx.internal.cloudapp.net)
I0110 16:01:56.496320 864 master.cpp:3293] Deactivating agent 68d97b13-a7c5-4a37-933b-31279343ea71-S0 at slave(464)@192.10.1.6:51097 (windows-02.chtsmhjxogyevckjfayqqcnjda.xx.internal.cloudapp.net)
I0110 16:01:56.497376 2248 hierarchical.cpp:358] Removed framework 68d97b13-a7c5-4a37-933b-31279343ea71-0000
I0110 16:01:56.497376 2248 hierarchical.cpp:802] Agent 68d97b13-a7c5-4a37-933b-31279343ea71-S0 deactivated
I0110 16:01:56.498402 864 containerizer.cpp:2469] Destroying container ad70a447-bd3b-49ad-918f-54a871fe6be9 in RUNNING state
I0110 16:01:56.498402 864 containerizer.cpp:3136] Transitioning the state of container ad70a447-bd3b-49ad-918f-54a871fe6be9 from RUNNING to DESTROYING
I0110 16:01:56.499404 864 launcher.cpp:161] Asked to destroy container ad70a447-bd3b-49ad-918f-54a871fe6be9
W0110 16:01:56.500319 6488 process.cpp:1423] Failed to recv on socket WindowsFD::Type::SOCKET=2588 to peer '192.10.1.6:52930': IO failed with er[ OK ] IsolationFlag/MemoryIsolatorTest.ROOT_MemUsage/0 (685 ms)
[----------] 1 test from IsolationFlag/MemoryIsolatorTest (704 ms total)
[----------] Global test environment tear-down
[==========] 1082 tests from 104 test cases ran. (491907 ms total)
[ PASSED ] 1081 tests.
[ FAILED ] 1 test, listed below:
[ FAILED ] SlaveRecoveryTest/0.Reboot, where TypeParam = class mesos::internal::slave::MesosContainerizer
1 FAILED TEST
YOU HAVE 231 DISABLED TESTS
ror code: The specified network name is no longer available.
W0110 16:01:56.500319 6488 process.cpp:838] Failed to recv on socket WindowsFD::Type::SOCKET=2500 to peer '192.10.1.6:52931': IO failed with error code: The specified network name is no longer available.
I0110 16:01:56.597012 2248 containerizer.cpp:2975] Container ad70a447-bd3b-49ad-918f-54a871fe6be9 has exited
I0110 16:01:56.625031 8360 master.cpp:1111] Master terminating
I0110 16:01:56.627017 6148 hierarchical.cpp:644] Removed agent 68d97b13-a7c5-4a37-933b-31279343ea71-S0
I0110 16:01:56.899015 6488 process.cpp:927] Stopped the socket accept loop
```
- Mesos Reviewbot Windows
On Jan. 10, 2019, 2:52 p.m., Qian Zhang wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69705/
> -----------------------------------------------------------
>
> (Updated Jan. 10, 2019, 2:52 p.m.)
>
>
> Review request for mesos, Andrei Budnik and Gilbert Song.
>
>
> Bugs: MESOS-9501
> https://issues.apache.org/jira/browse/MESOS-9501
>
>
> Repository: mesos
>
>
> Description
> -------
>
> After agent host is rebooted, the forked pid and libprocess pid in
> agent's meta directory are obsolete, so we should not read them during
> agent recovery, otherwise containerizer may wait for an irrelevant
> process if the forked pid is reused by another process after reboot.
>
>
> Diffs
> -----
>
> src/slave/state.hpp 4f3d4cefb3fdef29cce3a6abe4cf5db04d45301f
> src/slave/state.cpp e7cf84993c74cf6da7fe22d5112e86e039780287
>
>
> Diff: https://reviews.apache.org/r/69705/diff/1/
>
>
> Testing
> -------
>
>
> Thanks,
>
> Qian Zhang
>
>