You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@mesos.apache.org by Gaston Kleiman <ga...@mesosphere.io> on 2018/03/23 02:10:31 UTC

Review Request 66232: Removed unnecessary/invalid checks from the default executor.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66232/
-----------------------------------------------------------

Review request for mesos, Alexander Rukletsov, Joseph Wu, Qian Zhang, and Vinod Kone.


Bugs: MESOS-8530
    https://issues.apache.org/jira/browse/MESOS-8530


Repository: mesos


Description
-------

The default executor uses checks to ensure that it is subscribed to the
agent before killing a task/container; if it is not, it will skip kill
calls or crash.

When killing a task/container, the default executor performs the
following actions:

1) Calls `KILL_NESTED_CONTAINER`.
2) Enqueues a task status update (for tasks only).

None of these actions require an active subscription.

When the checks were added:

1) The default executor supported launching only one task group, so it
was correct to assume that killing a task would eventually trigger the
destruction of all the child containers.
2) It was assumed that the executor will only be unsubscribed for brief
periods of time, mostly during agent recovery, when kill calls are
likely to fail.
3) There was no kill/escalation retry logic.

The checks were added as a way of working around the lack of retry logic
for kill requests, relying on the fact that crashing the executor leads
to the destruction of all the child containers.

This chain adds retry logic for failed kill call escalation, making the
workaround unnecessary.

Now that the default executor supports running multiple task groups, the
checks are not just unnecessary, but also invalid and dangerous. If a
check fails, all the containers started by the executor will be killed,
regardless of which task group they belong to. This is bad and could
lead to data loss.

This patch removes these unnecessary and sometimes invalid checks from
the default executor.


Diffs
-----

  src/launcher/default_executor.cpp 906836f3b8e0af79d7c61f90fd8a95f193b26e84 


Diff: https://reviews.apache.org/r/66232/diff/1/


Testing
-------

`sudo bin/mesos-tests.sh` on GNU/Linux


Thanks,

Gaston Kleiman


Re: Review Request 66232: Removed unnecessary/invalid checks from the default executor.

Posted by Alexander Rukletsov <ru...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66232/#review199975
-----------------------------------------------------------


Ship it!




Ship It!

- Alexander Rukletsov


On March 23, 2018, 5:49 p.m., Gaston Kleiman wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66232/
> -----------------------------------------------------------
> 
> (Updated March 23, 2018, 5:49 p.m.)
> 
> 
> Review request for mesos, Alexander Rukletsov, Joseph Wu, Qian Zhang, and Vinod Kone.
> 
> 
> Bugs: MESOS-8530
>     https://issues.apache.org/jira/browse/MESOS-8530
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> The default executor uses checks to ensure that it is subscribed to the
> agent before killing a task/container; if it is not, it will skip kill
> calls or crash.
> 
> When killing a task/container, the default executor performs the
> following actions:
> 
> 1) Calls `KILL_NESTED_CONTAINER`.
> 2) Enqueues a task status update (for tasks only).
> 
> None of these actions require an active subscription.
> 
> When the checks were added:
> 
> 1) The default executor supported launching only one task group, so it
> was correct to assume that killing a task would eventually trigger the
> destruction of all the child containers.
> 2) It was assumed that the executor will only be unsubscribed for brief
> periods of time, mostly during agent recovery, when kill calls are
> likely to fail.
> 3) There was no kill/escalation retry logic.
> 
> The checks were added as a way of working around the lack of retry logic
> for kill requests, relying on the fact that crashing the executor leads
> to the destruction of all the child containers.
> 
> This chain adds retry logic for failed kill call escalation, making the
> workaround unnecessary.
> 
> Now that the default executor supports running multiple task groups, the
> checks are not just unnecessary, but also invalid and dangerous. If a
> check fails, all the containers started by the executor will be killed,
> regardless of which task group they belong to. This is bad and could
> lead to data loss.
> 
> This patch removes these unnecessary and sometimes invalid checks from
> the default executor.
> 
> 
> Diffs
> -----
> 
>   src/launcher/default_executor.cpp 906836f3b8e0af79d7c61f90fd8a95f193b26e84 
> 
> 
> Diff: https://reviews.apache.org/r/66232/diff/2/
> 
> 
> Testing
> -------
> 
> `sudo bin/mesos-tests.sh` on GNU/Linux
> 
> 
> Thanks,
> 
> Gaston Kleiman
> 
>


Re: Review Request 66232: Removed unnecessary/invalid checks from the default executor.

Posted by Gaston Kleiman <ga...@mesosphere.io>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66232/
-----------------------------------------------------------

(Updated March 23, 2018, 10:49 a.m.)


Review request for mesos, Alexander Rukletsov, Joseph Wu, Qian Zhang, and Vinod Kone.


Bugs: MESOS-8530
    https://issues.apache.org/jira/browse/MESOS-8530


Repository: mesos


Description
-------

The default executor uses checks to ensure that it is subscribed to the
agent before killing a task/container; if it is not, it will skip kill
calls or crash.

When killing a task/container, the default executor performs the
following actions:

1) Calls `KILL_NESTED_CONTAINER`.
2) Enqueues a task status update (for tasks only).

None of these actions require an active subscription.

When the checks were added:

1) The default executor supported launching only one task group, so it
was correct to assume that killing a task would eventually trigger the
destruction of all the child containers.
2) It was assumed that the executor will only be unsubscribed for brief
periods of time, mostly during agent recovery, when kill calls are
likely to fail.
3) There was no kill/escalation retry logic.

The checks were added as a way of working around the lack of retry logic
for kill requests, relying on the fact that crashing the executor leads
to the destruction of all the child containers.

This chain adds retry logic for failed kill call escalation, making the
workaround unnecessary.

Now that the default executor supports running multiple task groups, the
checks are not just unnecessary, but also invalid and dangerous. If a
check fails, all the containers started by the executor will be killed,
regardless of which task group they belong to. This is bad and could
lead to data loss.

This patch removes these unnecessary and sometimes invalid checks from
the default executor.


Diffs (updated)
-----

  src/launcher/default_executor.cpp 906836f3b8e0af79d7c61f90fd8a95f193b26e84 


Diff: https://reviews.apache.org/r/66232/diff/2/

Changes: https://reviews.apache.org/r/66232/diff/1-2/


Testing
-------

`sudo bin/mesos-tests.sh` on GNU/Linux


Thanks,

Gaston Kleiman


Re: Review Request 66232: Removed unnecessary/invalid checks from the default executor.

Posted by Mesos Reviewbot Windows <re...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66232/#review199842
-----------------------------------------------------------



FAIL: Some of the unit tests failed. Please check the relevant logs.

Reviews applied: `['65692', '65693', '66232']`

Failed command: `Start-MesosCITesting`

All the build artifacts available at: http://dcos-win.westus.cloudapp.azure.com/mesos-build/review/66232

Relevant logs:

- [mesos-tests-stdout.log](http://dcos-win.westus.cloudapp.azure.com/mesos-build/review/66232/logs/mesos-tests-stdout.log):

```
[       OK ] Endpoint/SlaveEndpointTest.NoAuthorizer/2 (120 ms)
[----------] 9 tests from Endpoint/SlaveEndpointTest (1105 ms total)

[----------] 2 tests from ContainerizerType/DefaultContainerDNSFlagTest
[ RUN      ] ContainerizerType/DefaultContainerDNSFlagTest.ValidateFlag/0
[       OK ] ContainerizerType/DefaultContainerDNSFlagTest.ValidateFlag/0 (32 ms)
[ RUN      ] ContainerizerType/DefaultContainerDNSFlagTest.ValidateFlag/1
[       OK ] ContainerizerType/DefaultContainerDNSFlagTest.ValidateFlag/1 (39 ms)
[----------] 2 tests from ContainerizerType/DefaultContainerDNSFlagTest (73 ms total)

[----------] 1 test from IsolationFlag/CpuIsolatorTest
[ RUN      ] IsolationFlag/CpuIsolatorTest.ROOT_UserCpuUsage/0
[       OK ] IsolationFlag/CpuIsolatorTest.ROOT_UserCpuUsage/0 (906 ms)
[----------] 1 test from IsolationFlag/CpuIsolatorTest (933 ms total)

[----------] 1 test from IsolationFlag/MemoryIsolatorTest
[ RUN      ] IsolationFlag/MemoryIsolatorTest.ROOT_MemUsage/0
[       OK ] IsolationFlag/MemoryIsolatorTest.ROOT_MemUsage/0 (839 ms)
[----------] 1 test from IsolationFlag/MemoryIsolatorTest (870 ms total)

[----------] Global test environment tear-down
[==========] 943 tests from 94 test cases ran. (447506 ms total)
[  PASSED  ] 942 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] CommandExecutorCheckTest.CommandCheckTimeout

 1 FAILED TEST
  YOU HAVE 215 DISABLED TESTS

```

- [mesos-tests-stderr.log](http://dcos-win.westus.cloudapp.azure.com/mesos-build/review/66232/logs/mesos-tests-stderr.log):

```
I0323 03:24:22.550164  9020 master.cpp:10255] Updating the state of task bfa57eef-68c4-480c-ab9e-8713821be243 of framework c5a83c33-0bff-4e06-ba68-0b94e8c26134-0000 (latest state: TASK_KILLED, status update state: TASK_KILLED)
I0323 03:24:22.550164  5672 slave.cpp:3873] Shutting down framework c5a83c33-0bff-4e06-ba68-0b94e8c26134-00I0323 03:24:22.362180 10428 exec.cpp:162] Version: 1.6.0
I0323 03:24:22.390198  7244 exec.cpp:236] Executor registered on agent c5a83c33-0bff-4e06-ba68-0b94e8c26134-S0
I0323 03:24:22.395151 11256 executor.cpp:176] Received SUBSCRIBED event
I0323 03:24:22.399148 11256 executor.cpp:180] Subscribed executor on winbldsrv-01.zq4gs31qjdiunm1ryi1452nvnh.dx.internal.cloudapp.net
I0323 03:24:22.399148 11256 executor.cpp:176] Received LAUNCH event
I0323 03:24:22.405328 11256 executor.cpp:648] Starting task bfa57eef-68c4-480c-ab9e-8713821be243
I0323 03:24:22.488149 11256 executor.cpp:483] Running 'D:\DCOS\mesos\src\mesos-containerizer.exe launch <POSSIBLY-SENSITIVE-DATA>'
I0323 03:24:22.521162 11256 executor.cpp:661] Forked command at 8924
I0323 03:24:22.552166  9360 exec.cpp:445] Executor asked to shutdown
I0323 03:24:22.553153  8356 executor.cpp:176] Received SHUTDOWN event
I0323 03:24:22.553153  8356 executor.cpp:758] Shutting down
I0323 03:24:22.553153  8356 executor.cpp:868] Sending SIGTERM to process tree at pid 800
I0323 03:24:22.551165  5672 slave.cpp:6566] Shutting down executor 'bfa57eef-68c4-480c-ab9e-8713821be243' of framework c5a83c33-0bff-4e06-ba68-0b94e8c26134-0000 at executor(1)@10.3.1.8:61938
I0323 03:24:22.551165  5672 slave.cpp:919] Agent terminating
W0323 03:24:22.552166  5672 slave.cpp:3869] Ignoring shutdown framework c5a83c33-0bff-4e06-ba68-0b94e8c26134-0000 because it is terminating
I0323 03:24:22.553153  9020 master.cpp:10354] Removing task bfa57eef-68c4-480c-ab9e-8713821be243 with resources cpus(allocated: *):4; mem(allocated: *):2048; disk(allocated: *):1024; ports(allocated: *):[31000-32000] of framework c5a83c33-0bff-4e06-ba68-0b94e8c26134-0000 on agent c5a83c33-0bff-4e06-ba68-0b94e8c26134-S0 at slave(412)@10.3.1.8:61917 (winbldsrv-01.zq4gs31qjdiunm1ryi1452nvnh.dx.internal.cloudapp.net)
I0323 03:24:22.555380  5672 containerizer.cpp:2338] Destroying container 938fb382-e91e-4721-976a-261b48c06e63 in RUNNING state
I0323 03:24:22.556154  5672 containerizer.cpp:2952] Transitioning the state of container 938fb382-e91e-4721-976a-261b48c06e63 from RUNNING to DESTROYING
I0323 03:24:22.557157  9020 master.cpp:1295] Agent c5a83c33-0bff-4e06-ba68-0b94e8c26134-S0 at slave(412)@10.3.1.8:61917 (winbldsrv-01.zq4gs31qjdiunm1ryi1452nvnh.dx.internal.cloudapp.net) disconnected
I0323 03:24:22.557157  9020 master.cpp:3283] Disconnecting agent c5a83c33-0bff-4e06-ba68-0b94e8c26134-S0 at slave(412)@10.3.1.8:61917 (winbldsrv-01.zq4gs31qjdiunm1ryi1452nvnh.dx.internal.cloudapp.net)
I0323 03:24:22.557157  9020 master.cpp:3302] Deactivating agent c5a83c33-0bff-4e06-ba68-0b94e8c26134-S0 at slave(412)@10.3.1.8:61917 (winbldsrv-01.zq4gs31qjdiunm1ryi1452nvnh.dx.internal.cloudapp.net)
I0323 03:24:22.558156  5672 launcher.cpp:156] Asked to destroy container 938fb382-e91e-4721-976a-261b48c06e63
I0323 03:24:22.558156 10564 hierarchical.cpp:344] Removed framework c5a83c33-0bff-4e06-ba68-0b94e8c26134-0000
I0323 03:24:22.558156 10564 hierarchical.cpp:766] Agent c5a83c33-0bff-4e06-ba68-0b94e8c26134-S0 deactivated
I0323 03:24:22.579192  8352 containerizer.cpp:2791] Container 938fb382-e91e-4721-976a-261b48c06e63 has exited
I0323 03:24:22.615169 10160 master.cpp:1137] Master terminating
I0323 03:24:22.617171  6252 hierarchical.cpp:609] Removed agent c5a83c33-0bff-4e06-ba68-0b94e8c26134-S0
I0323 03:24:23.087208  8196 process.cpp:929] Stopped the socket accept loop
```

- Mesos Reviewbot Windows


On March 23, 2018, 2:10 a.m., Gaston Kleiman wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66232/
> -----------------------------------------------------------
> 
> (Updated March 23, 2018, 2:10 a.m.)
> 
> 
> Review request for mesos, Alexander Rukletsov, Joseph Wu, Qian Zhang, and Vinod Kone.
> 
> 
> Bugs: MESOS-8530
>     https://issues.apache.org/jira/browse/MESOS-8530
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> The default executor uses checks to ensure that it is subscribed to the
> agent before killing a task/container; if it is not, it will skip kill
> calls or crash.
> 
> When killing a task/container, the default executor performs the
> following actions:
> 
> 1) Calls `KILL_NESTED_CONTAINER`.
> 2) Enqueues a task status update (for tasks only).
> 
> None of these actions require an active subscription.
> 
> When the checks were added:
> 
> 1) The default executor supported launching only one task group, so it
> was correct to assume that killing a task would eventually trigger the
> destruction of all the child containers.
> 2) It was assumed that the executor will only be unsubscribed for brief
> periods of time, mostly during agent recovery, when kill calls are
> likely to fail.
> 3) There was no kill/escalation retry logic.
> 
> The checks were added as a way of working around the lack of retry logic
> for kill requests, relying on the fact that crashing the executor leads
> to the destruction of all the child containers.
> 
> This chain adds retry logic for failed kill call escalation, making the
> workaround unnecessary.
> 
> Now that the default executor supports running multiple task groups, the
> checks are not just unnecessary, but also invalid and dangerous. If a
> check fails, all the containers started by the executor will be killed,
> regardless of which task group they belong to. This is bad and could
> lead to data loss.
> 
> This patch removes these unnecessary and sometimes invalid checks from
> the default executor.
> 
> 
> Diffs
> -----
> 
>   src/launcher/default_executor.cpp 906836f3b8e0af79d7c61f90fd8a95f193b26e84 
> 
> 
> Diff: https://reviews.apache.org/r/66232/diff/1/
> 
> 
> Testing
> -------
> 
> `sudo bin/mesos-tests.sh` on GNU/Linux
> 
> 
> Thanks,
> 
> Gaston Kleiman
> 
>