You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@mesos.apache.org by Qian Zhang <zh...@gmail.com> on 2019/01/10 14:52:33 UTC

Review Request 69705: Made agent not read the forked pid and libprocess pid after reboot.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69705/
-----------------------------------------------------------

Review request for mesos, Andrei Budnik and Gilbert Song.


Bugs: MESOS-9501
    https://issues.apache.org/jira/browse/MESOS-9501


Repository: mesos


Description
-------

After agent host is rebooted, the forked pid and libprocess pid in
agent's meta directory are obsolete, so we should not read them during
agent recovery, otherwise containerizer may wait for an irrelevant
process if the forked pid is reused by another process after reboot.


Diffs
-----

  src/slave/state.hpp 4f3d4cefb3fdef29cce3a6abe4cf5db04d45301f 
  src/slave/state.cpp e7cf84993c74cf6da7fe22d5112e86e039780287 


Diff: https://reviews.apache.org/r/69705/diff/1/


Testing
-------


Thanks,

Qian Zhang


Re: Review Request 69705: Made agent not read the forked pid and libprocess pid after reboot.

Posted by Qian Zhang <zh...@gmail.com>.

> On Jan. 11, 2019, 7:30 a.m., Gilbert Song wrote:
> > Could we vertify the CI failure above `SlaveRecoveryTest/0.Reboot` is not caused by our change?

Actually it is caused by this patch :-( And I have fixed it in this patch: https://reviews.apache.org/r/69716/ .


- Qian


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69705/#review211845
-----------------------------------------------------------


On Jan. 10, 2019, 10:52 p.m., Qian Zhang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69705/
> -----------------------------------------------------------
> 
> (Updated Jan. 10, 2019, 10:52 p.m.)
> 
> 
> Review request for mesos, Andrei Budnik and Gilbert Song.
> 
> 
> Bugs: MESOS-9501
>     https://issues.apache.org/jira/browse/MESOS-9501
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> After agent host is rebooted, the forked pid and libprocess pid in
> agent's meta directory are obsolete, so we should not read them during
> agent recovery, otherwise containerizer may wait for an irrelevant
> process if the forked pid is reused by another process after reboot.
> 
> 
> Diffs
> -----
> 
>   src/slave/state.hpp 4f3d4cefb3fdef29cce3a6abe4cf5db04d45301f 
>   src/slave/state.cpp e7cf84993c74cf6da7fe22d5112e86e039780287 
> 
> 
> Diff: https://reviews.apache.org/r/69705/diff/1/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Qian Zhang
> 
>


Re: Review Request 69705: Made agent not read the forked pid and libprocess pid after reboot.

Posted by Gilbert Song <so...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69705/#review211845
-----------------------------------------------------------



Could we vertify the CI failure above `SlaveRecoveryTest/0.Reboot` is not caused by our change?

- Gilbert Song


On Jan. 10, 2019, 6:52 a.m., Qian Zhang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69705/
> -----------------------------------------------------------
> 
> (Updated Jan. 10, 2019, 6:52 a.m.)
> 
> 
> Review request for mesos, Andrei Budnik and Gilbert Song.
> 
> 
> Bugs: MESOS-9501
>     https://issues.apache.org/jira/browse/MESOS-9501
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> After agent host is rebooted, the forked pid and libprocess pid in
> agent's meta directory are obsolete, so we should not read them during
> agent recovery, otherwise containerizer may wait for an irrelevant
> process if the forked pid is reused by another process after reboot.
> 
> 
> Diffs
> -----
> 
>   src/slave/state.hpp 4f3d4cefb3fdef29cce3a6abe4cf5db04d45301f 
>   src/slave/state.cpp e7cf84993c74cf6da7fe22d5112e86e039780287 
> 
> 
> Diff: https://reviews.apache.org/r/69705/diff/1/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Qian Zhang
> 
>


Re: Review Request 69705: Made agent not read the forked pid and libprocess pid after reboot.

Posted by Gilbert Song <so...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69705/#review211902
-----------------------------------------------------------


Ship it!




Ship It!

- Gilbert Song


On Jan. 10, 2019, 6:52 a.m., Qian Zhang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69705/
> -----------------------------------------------------------
> 
> (Updated Jan. 10, 2019, 6:52 a.m.)
> 
> 
> Review request for mesos, Andrei Budnik and Gilbert Song.
> 
> 
> Bugs: MESOS-9501
>     https://issues.apache.org/jira/browse/MESOS-9501
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> After agent host is rebooted, the forked pid and libprocess pid in
> agent's meta directory are obsolete, so we should not read them during
> agent recovery, otherwise containerizer may wait for an irrelevant
> process if the forked pid is reused by another process after reboot.
> 
> 
> Diffs
> -----
> 
>   src/slave/state.hpp 4f3d4cefb3fdef29cce3a6abe4cf5db04d45301f 
>   src/slave/state.cpp e7cf84993c74cf6da7fe22d5112e86e039780287 
> 
> 
> Diff: https://reviews.apache.org/r/69705/diff/1/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Qian Zhang
> 
>


Re: Review Request 69705: Made agent not read the forked pid and libprocess pid after reboot.

Posted by Qian Zhang <zh...@gmail.com>.

> On Jan. 10, 2019, 11:41 p.m., Vinod Kone wrote:
> > Can you write a unit test for this by spoofing the reboot?

Here it is: https://reviews.apache.org/r/69717/


- Qian


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69705/#review211829
-----------------------------------------------------------


On Jan. 10, 2019, 10:52 p.m., Qian Zhang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69705/
> -----------------------------------------------------------
> 
> (Updated Jan. 10, 2019, 10:52 p.m.)
> 
> 
> Review request for mesos, Andrei Budnik and Gilbert Song.
> 
> 
> Bugs: MESOS-9501
>     https://issues.apache.org/jira/browse/MESOS-9501
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> After agent host is rebooted, the forked pid and libprocess pid in
> agent's meta directory are obsolete, so we should not read them during
> agent recovery, otherwise containerizer may wait for an irrelevant
> process if the forked pid is reused by another process after reboot.
> 
> 
> Diffs
> -----
> 
>   src/slave/state.hpp 4f3d4cefb3fdef29cce3a6abe4cf5db04d45301f 
>   src/slave/state.cpp e7cf84993c74cf6da7fe22d5112e86e039780287 
> 
> 
> Diff: https://reviews.apache.org/r/69705/diff/1/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Qian Zhang
> 
>


Re: Review Request 69705: Made agent not read the forked pid and libprocess pid after reboot.

Posted by Vinod Kone <vi...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69705/#review211829
-----------------------------------------------------------



Can you write a unit test for this by spoofing the reboot?

- Vinod Kone


On Jan. 10, 2019, 2:52 p.m., Qian Zhang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69705/
> -----------------------------------------------------------
> 
> (Updated Jan. 10, 2019, 2:52 p.m.)
> 
> 
> Review request for mesos, Andrei Budnik and Gilbert Song.
> 
> 
> Bugs: MESOS-9501
>     https://issues.apache.org/jira/browse/MESOS-9501
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> After agent host is rebooted, the forked pid and libprocess pid in
> agent's meta directory are obsolete, so we should not read them during
> agent recovery, otherwise containerizer may wait for an irrelevant
> process if the forked pid is reused by another process after reboot.
> 
> 
> Diffs
> -----
> 
>   src/slave/state.hpp 4f3d4cefb3fdef29cce3a6abe4cf5db04d45301f 
>   src/slave/state.cpp e7cf84993c74cf6da7fe22d5112e86e039780287 
> 
> 
> Diff: https://reviews.apache.org/r/69705/diff/1/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Qian Zhang
> 
>


Re: Review Request 69705: Made agent not read the forked pid and libprocess pid after reboot.

Posted by Andrei Budnik <ab...@mesosphere.com>.

> On Jan. 11, 2019, 3:54 p.m., Andrei Budnik wrote:
> > src/slave/state.cpp
> > Lines 493-500 (patched)
> > <https://reviews.apache.org/r/69705/diff/1/?file=2119017#file2119017line494>
> >
> >     All tests (including a new one) are passed after removing this code. Maybe we don't need to remove a pid file? Just return the `state`?

Dropping my comment, because agent may crash after recovery, but before `TASK_FAILED` status updates for all previous tasks have been checkpointed (created sentinel files).


- Andrei


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69705/#review211888
-----------------------------------------------------------


On Jan. 10, 2019, 2:52 p.m., Qian Zhang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69705/
> -----------------------------------------------------------
> 
> (Updated Jan. 10, 2019, 2:52 p.m.)
> 
> 
> Review request for mesos, Andrei Budnik and Gilbert Song.
> 
> 
> Bugs: MESOS-9501
>     https://issues.apache.org/jira/browse/MESOS-9501
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> After agent host is rebooted, the forked pid and libprocess pid in
> agent's meta directory are obsolete, so we should not read them during
> agent recovery, otherwise containerizer may wait for an irrelevant
> process if the forked pid is reused by another process after reboot.
> 
> 
> Diffs
> -----
> 
>   src/slave/state.hpp 4f3d4cefb3fdef29cce3a6abe4cf5db04d45301f 
>   src/slave/state.cpp e7cf84993c74cf6da7fe22d5112e86e039780287 
> 
> 
> Diff: https://reviews.apache.org/r/69705/diff/1/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Qian Zhang
> 
>


Re: Review Request 69705: Made agent not read the forked pid and libprocess pid after reboot.

Posted by Andrei Budnik <ab...@mesosphere.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69705/#review211888
-----------------------------------------------------------




src/slave/state.cpp
Lines 493-500 (patched)
<https://reviews.apache.org/r/69705/#comment297465>

    All tests (including a new one) are passed after removing this code. Maybe we don't need to remove a pid file? Just return the `state`?


- Andrei Budnik


On Jan. 10, 2019, 2:52 p.m., Qian Zhang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69705/
> -----------------------------------------------------------
> 
> (Updated Jan. 10, 2019, 2:52 p.m.)
> 
> 
> Review request for mesos, Andrei Budnik and Gilbert Song.
> 
> 
> Bugs: MESOS-9501
>     https://issues.apache.org/jira/browse/MESOS-9501
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> After agent host is rebooted, the forked pid and libprocess pid in
> agent's meta directory are obsolete, so we should not read them during
> agent recovery, otherwise containerizer may wait for an irrelevant
> process if the forked pid is reused by another process after reboot.
> 
> 
> Diffs
> -----
> 
>   src/slave/state.hpp 4f3d4cefb3fdef29cce3a6abe4cf5db04d45301f 
>   src/slave/state.cpp e7cf84993c74cf6da7fe22d5112e86e039780287 
> 
> 
> Diff: https://reviews.apache.org/r/69705/diff/1/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Qian Zhang
> 
>


Re: Review Request 69705: Made agent not read the forked pid and libprocess pid after reboot.

Posted by Qian Zhang <zh...@gmail.com>.

> On Jan. 11, 2019, 7:29 a.m., Gilbert Song wrote:
> > src/slave/state.cpp
> > Line 561 (original), 592 (patched)
> > <https://reviews.apache.org/r/69705/diff/1/?file=2119017#file2119017line594>
> >
> >     Could we confirm we do not care about this marker case after reboot?

After reboot, the field `state.http` will be `None()` and it will be only used in `Framework::recoverExecutor`, and in that method if we find `state.http` is `None()`, we will set executor's pid to `UPID()` to signify that the connection type for this executor is unknown. I think it is OK since agent should not try to connect any executors after a reboot.


- Qian


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69705/#review211844
-----------------------------------------------------------


On Jan. 10, 2019, 10:52 p.m., Qian Zhang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69705/
> -----------------------------------------------------------
> 
> (Updated Jan. 10, 2019, 10:52 p.m.)
> 
> 
> Review request for mesos, Andrei Budnik and Gilbert Song.
> 
> 
> Bugs: MESOS-9501
>     https://issues.apache.org/jira/browse/MESOS-9501
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> After agent host is rebooted, the forked pid and libprocess pid in
> agent's meta directory are obsolete, so we should not read them during
> agent recovery, otherwise containerizer may wait for an irrelevant
> process if the forked pid is reused by another process after reboot.
> 
> 
> Diffs
> -----
> 
>   src/slave/state.hpp 4f3d4cefb3fdef29cce3a6abe4cf5db04d45301f 
>   src/slave/state.cpp e7cf84993c74cf6da7fe22d5112e86e039780287 
> 
> 
> Diff: https://reviews.apache.org/r/69705/diff/1/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Qian Zhang
> 
>


Re: Review Request 69705: Made agent not read the forked pid and libprocess pid after reboot.

Posted by Gilbert Song <so...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69705/#review211844
-----------------------------------------------------------


Ship it!





src/slave/state.cpp
Line 561 (original), 592 (patched)
<https://reviews.apache.org/r/69705/#comment297433>

    Could we confirm we do not care about this marker case after reboot?


- Gilbert Song


On Jan. 10, 2019, 6:52 a.m., Qian Zhang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69705/
> -----------------------------------------------------------
> 
> (Updated Jan. 10, 2019, 6:52 a.m.)
> 
> 
> Review request for mesos, Andrei Budnik and Gilbert Song.
> 
> 
> Bugs: MESOS-9501
>     https://issues.apache.org/jira/browse/MESOS-9501
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> After agent host is rebooted, the forked pid and libprocess pid in
> agent's meta directory are obsolete, so we should not read them during
> agent recovery, otherwise containerizer may wait for an irrelevant
> process if the forked pid is reused by another process after reboot.
> 
> 
> Diffs
> -----
> 
>   src/slave/state.hpp 4f3d4cefb3fdef29cce3a6abe4cf5db04d45301f 
>   src/slave/state.cpp e7cf84993c74cf6da7fe22d5112e86e039780287 
> 
> 
> Diff: https://reviews.apache.org/r/69705/diff/1/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Qian Zhang
> 
>


Re: Review Request 69705: Made agent not read the forked pid and libprocess pid after reboot.

Posted by Mesos Reviewbot Windows <re...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69705/#review211830
-----------------------------------------------------------



FAIL: Some of the unit tests failed. Please check the relevant logs.

Reviews applied: `['69705']`

Failed command: `Start-MesosCITesting`

All the build artifacts available at: http://dcos-win.westus2.cloudapp.azure.com/artifacts/mesos-reviewbot-testing/2750/mesos-review-69705

Relevant logs:

- [mesos-tests.log](http://dcos-win.westus2.cloudapp.azure.com/artifacts/mesos-reviewbot-testing/2750/mesos-review-69705/logs/mesos-tests.log):

```
W0110 16:01:56.494325 12084 slave.cpp:3933] Ignoring shutdown framework 68d97b13-a7c5-4a37-933b-31279343ea71-0000 because it is terminating
I0110 16:01:56.496320   864 master.cpp:1271] Agent 68d97b13-a7c5-4a37-933b-31279343ea71-S0 at slave(464)@192.10.1.6:51097 (windows-02.chtsmhjxogyevckjfayqqcnjda.xx.internal.cloudapp.net) disconnected
I0110 16:01:56.496320   864 master.cpp:3274] Disconnecting agent 68d97b13-a7c5-4a37-933b-31279343ea71-S0 at slave(464)@192.10.1.6:51097 (windows-02.chtsmhjxogyevckjfayqqcnjda.xx.internal.cloudapp.net)
I0110 16:01:56.496320   864 master.cpp:3293] Deactivating agent 68d97b13-a7c5-4a37-933b-31279343ea71-S0 at slave(464)@192.10.1.6:51097 (windows-02.chtsmhjxogyevckjfayqqcnjda.xx.internal.cloudapp.net)
I0110 16:01:56.497376  2248 hierarchical.cpp:358] Removed framework 68d97b13-a7c5-4a37-933b-31279343ea71-0000
I0110 16:01:56.497376  2248 hierarchical.cpp:802] Agent 68d97b13-a7c5-4a37-933b-31279343ea71-S0 deactivated
I0110 16:01:56.498402   864 containerizer.cpp:2469] Destroying container ad70a447-bd3b-49ad-918f-54a871fe6be9 in RUNNING state
I0110 16:01:56.498402   864 containerizer.cpp:3136] Transitioning the state of container ad70a447-bd3b-49ad-918f-54a871fe6be9 from RUNNING to DESTROYING
I0110 16:01:56.499404   864 launcher.cpp:161] Asked to destroy container ad70a447-bd3b-49ad-918f-54a871fe6be9
W0110 16:01:56.500319  6488 process.cpp:1423] Failed to recv on socket WindowsFD::Type::SOCKET=2588 to peer '192.10.1.6:52930': IO failed with er[       OK ] IsolationFlag/MemoryIsolatorTest.ROOT_MemUsage/0 (685 ms)
[----------] 1 test from IsolationFlag/MemoryIsolatorTest (704 ms total)

[----------] Global test environment tear-down
[==========] 1082 tests from 104 test cases ran. (491907 ms total)
[  PASSED  ] 1081 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] SlaveRecoveryTest/0.Reboot, where TypeParam = class mesos::internal::slave::MesosContainerizer

 1 FAILED TEST
  YOU HAVE 231 DISABLED TESTS

ror code: The specified network name is no longer available.

W0110 16:01:56.500319  6488 process.cpp:838] Failed to recv on socket WindowsFD::Type::SOCKET=2500 to peer '192.10.1.6:52931': IO failed with error code: The specified network name is no longer available.

I0110 16:01:56.597012  2248 containerizer.cpp:2975] Container ad70a447-bd3b-49ad-918f-54a871fe6be9 has exited
I0110 16:01:56.625031  8360 master.cpp:1111] Master terminating
I0110 16:01:56.627017  6148 hierarchical.cpp:644] Removed agent 68d97b13-a7c5-4a37-933b-31279343ea71-S0
I0110 16:01:56.899015  6488 process.cpp:927] Stopped the socket accept loop
```

- Mesos Reviewbot Windows


On Jan. 10, 2019, 2:52 p.m., Qian Zhang wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69705/
> -----------------------------------------------------------
> 
> (Updated Jan. 10, 2019, 2:52 p.m.)
> 
> 
> Review request for mesos, Andrei Budnik and Gilbert Song.
> 
> 
> Bugs: MESOS-9501
>     https://issues.apache.org/jira/browse/MESOS-9501
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> After agent host is rebooted, the forked pid and libprocess pid in
> agent's meta directory are obsolete, so we should not read them during
> agent recovery, otherwise containerizer may wait for an irrelevant
> process if the forked pid is reused by another process after reboot.
> 
> 
> Diffs
> -----
> 
>   src/slave/state.hpp 4f3d4cefb3fdef29cce3a6abe4cf5db04d45301f 
>   src/slave/state.cpp e7cf84993c74cf6da7fe22d5112e86e039780287 
> 
> 
> Diff: https://reviews.apache.org/r/69705/diff/1/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Qian Zhang
> 
>