Posted to reviews@mesos.apache.org by Gaston Kleiman <ga...@mesosphere.io> on 2018/02/17 00:27:14 UTC
Review Request 65695: Made the default executor allow schedulers to
retry task kills.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65695/
-----------------------------------------------------------
Review request for mesos, Joseph Wu, Qian Zhang, and Vinod Kone.
Bugs: MESOS-8530
https://issues.apache.org/jira/browse/MESOS-8530
Repository: mesos
Description
-------
The default executor transitions a task to `TASK_KILLING` and marks its
child container as being killed before posting a `KILL` call to the
agent.
The executor ignores kill requests for containers that are marked as
being killed, and it doesn't remove this mark if the `KILL` call fails.
This means that it's possible for tasks to get stuck in a `TASK_KILLING`
state.
This patch makes the default executor remove the killing mark if a
`KILL` call fails. That way a scheduler can retry a kill.
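The behavior described above can be illustrated with a small, self-contained C++ model. This is only a sketch and not the code in `src/launcher/default_executor.cpp`: `ExecutorModel`, `Container`, `KillResult`, and the synchronous `postKill` callback are all hypothetical names made up for illustration, and the real `KILL` call is presumably asynchronous.

```cpp
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical, simplified model of the kill handling described above.
// None of these names are taken from the Mesos source.
enum class KillResult { Posted, Ignored, Failed };

struct Container {
  bool killing = false;  // The "being killed" mark.
};

class ExecutorModel {
public:
  // `postKill` stands in for posting the `KILL` call to the agent.
  // It returns true if the call succeeded (an assumption of this sketch).
  explicit ExecutorModel(std::function<bool(const std::string&)> postKill)
    : postKill_(std::move(postKill)) {}

  void launch(const std::string& taskId) {
    containers_[taskId] = Container{};
  }

  KillResult kill(const std::string& taskId) {
    auto it = containers_.find(taskId);

    // Kill requests for unknown or already-marked containers are ignored.
    if (it == containers_.end() || it->second.killing) {
      return KillResult::Ignored;
    }

    // Mark the container before posting the call; this is where the
    // task would transition to `TASK_KILLING`.
    it->second.killing = true;

    if (!postKill_(taskId)) {
      // The fix: clear the mark on failure so that a later kill
      // request is not ignored and the scheduler can retry.
      it->second.killing = false;
      return KillResult::Failed;
    }

    return KillResult::Posted;
  }

  bool isKilling(const std::string& taskId) const {
    auto it = containers_.find(taskId);
    return it != containers_.end() && it->second.killing;
  }

private:
  std::function<bool(const std::string&)> postKill_;
  std::unordered_map<std::string, Container> containers_;
};
```

Without the line that resets `killing` on failure, a second `kill()` for the same task would return `Ignored` forever, which corresponds to the stuck `TASK_KILLING` state described above.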
Diffs
-----
src/launcher/default_executor.cpp 8720dada8bc6ca66f9e0fec6dc265eda3dcc7407
Diff: https://reviews.apache.org/r/65695/diff/1/
Testing
-------
`sudo bin/mesos-tests.sh` on GNU/Linux
Thanks,
Gaston Kleiman
Re: Review Request 65695: Made the default executor allow schedulers
to retry task kills.
Posted by Joseph Wu <jo...@mesosphere.io>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65695/#review198841
-----------------------------------------------------------
Ship it!
LGTM.
I wonder if we could repurpose or extend any of our existing tests to cover this case. The DefaultExecutorTests already cover a couple of kill cases, but none where the `KILL` call itself fails.
- Joseph Wu
Re: Review Request 65695: Made the default executor allow schedulers
to retry task kills.
Posted by Mesos Reviewbot Windows <re...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65695/#review197707
-----------------------------------------------------------
FAIL: Some of the unit tests failed. Please check the relevant logs.
Reviews applied: `['65692', '65693', '65694', '65695']`
Failed command: `Start-MesosCITesting`
All the build artifacts are available at: http://dcos-win.westus.cloudapp.azure.com/mesos-build/review/65695
Relevant logs:
- [mesos-tests-stdout.log](http://dcos-win.westus.cloudapp.azure.com/mesos-build/review/65695/logs/mesos-tests-stdout.log):
```
[----------] 2 tests from ContainerizerType/DefaultContainerDNSFlagTest
[ RUN ] ContainerizerType/DefaultContainerDNSFlagTest.ValidateFlag/0
[ OK ] ContainerizerType/DefaultContainerDNSFlagTest.ValidateFlag/0 (33 ms)
[ RUN ] ContainerizerType/DefaultContainerDNSFlagTest.ValidateFlag/1
[ OK ] ContainerizerType/DefaultContainerDNSFlagTest.ValidateFlag/1 (40 ms)
[----------] 2 tests from ContainerizerType/DefaultContainerDNSFlagTest (74 ms total)
[----------] 1 test from IsolationFlag/CpuIsolatorTest
[ RUN ] IsolationFlag/CpuIsolatorTest.ROOT_UserCpuUsage/0
[ OK ] IsolationFlag/CpuIsolatorTest.ROOT_UserCpuUsage/0 (2556 ms)
[----------] 1 test from IsolationFlag/CpuIsolatorTest (2580 ms total)
[----------] 1 test from IsolationFlag/MemoryIsolatorTest
[ RUN ] IsolationFlag/MemoryIsolatorTest.ROOT_MemUsage/0
[ OK ] IsolationFlag/MemoryIsolatorTest.ROOT_MemUsage/0 (2460 ms)
[----------] 1 test from IsolationFlag/MemoryIsolatorTest (2483 ms total)
[----------] Global test environment tear-down
[==========] 906 tests from 90 test cases ran. (464637 ms total)
[ PASSED ] 903 tests.
[ FAILED ] 3 tests, listed below:
[ FAILED ] MesosContainerizer/DefaultExecutorTest.KillTask/0, where GetParam() = "mesos"
[ FAILED ] MesosContainerizer/DefaultExecutorTest.KillMultipleTasks/0, where GetParam() = "mesos"
[ FAILED ] MesosContainerizer/DefaultExecutorTest.CommitSuicideOnKillTask/0, where GetParam() = "mesos"
3 FAILED TESTS
YOU HAVE 211 DISABLED TESTS
```
- [mesos-tests-stderr.log](http://dcos-win.westus.cloudapp.azure.com/mesos-build/review/65695/logs/mesos-tests-stderr.log):
```
I0217 02:51:24.176358 6436 slave.cpp:3879] Shutting down framework 60e5ecc8-9d45-4dca-9829-0308832a3de6-0000
I0217 02:51:24.176358 4732 master.cpp:10249] Updating the state of task 6988de1f-8545-4e3a-b811-49eabacd14c8 of framework 60e5ecc8-9d45-4dca-9829-0308832a3de6-0000 (latest state: TASK_KILLED, status update state: TASK_KILLED)
I0217 02:51:23.502336 9936 exec.cpp:162] Version: 1.6.0
I0217 02:51:23.526335 4932 exec.cpp:236] Executor registered on agent 60e5ecc8-9d45-4dca-9829-0308832a3de6-S0
I0217 02:51:23.530336 9936 executor.cpp:174] Received SUBSCRIBED event
I0217 02:51:23.534332 9936 executor.cpp:178] Subscribed executor on build-srv-03.zq4gs31qjdiunm1ryi1452nvnh.dx.internal.cloudapp.net
I0217 02:51:23.534332 9936 executor.cpp:174] Received LAUNCH event
I0217 02:51:23.537350 9936 executor.cpp:646] Starting task 6988de1f-8545-4e3a-b811-49eabacd14c8
I0217 02:51:23.611337 9936 executor.cpp:481] Running 'D:\DCOS\mesos\src\mesos-containerizer.exe launch <POSSIBLY-SENSITIVE-DATA>'
I0217 02:51:24.148334 9936 executor.cpp:659] Forked command at 6464
I0217 02:51:24.178335 7248 exec.cpp:445] Executor asked to shutdown
I0217 02:51:24.179335 9936 executor.cpp:174] Received SHUTDOWN event
I0217 02:51:24.179335 9936 executor.cpp:756] Shutting down
I0217 02:51:24.179335 9936 executor.cpp:866] Sending SIGTERM to process tree at pid 6
I0217 02:51:24.176358 6436 slave.cpp:6586] Shutting down executor '6988de1f-8545-4e3a-b811-49eabacd14c8' of framework 60e5ecc8-9d45-4dca-9829-0308832a3de6-0000 at executor(1)@10.3.1.11:62887
I0217 02:51:24.177357 6436 slave.cpp:922] Agent terminating
W0217 02:51:24.178335 6436 slave.cpp:3875] Ignoring shutdown framework 60e5ecc8-9d45-4dca-9829-0308832a3de6-0000 because it is terminating
I0217 02:51:24.179335 4732 master.cpp:10348] Removing task 6988de1f-8545-4e3a-b811-49eabacd14c8 with resources cpus(allocated: *):4; mem(allocated: *):2048; disk(allocated: *):1024; ports(allocated: *):[31000-32000] of framework 60e5ecc8-9d45-4dca-9829-0308832a3de6-0000 on agent 60e5ecc8-9d45-4dca-9829-0308832a3de6-S0 at slave(392)@10.3.1.11:62866 (build-srv-03.zq4gs31qjdiunm1ryi1452nvnh.dx.internal.cloudapp.net)
I0217 02:51:24.181335 6436 containerizer.cpp:2338] Destroying container 87af1d2d-501c-485b-ad57-50ff6bf830f2 in RUNNING state
I0217 02:51:24.181335 6436 containerizer.cpp:2952] Transitioning the state of container 87af1d2d-501c-485b-ad57-50ff6bf830f2 from RUNNING to DESTROYING
I0217 02:51:24.182337 4732 master.cpp:1307] Agent 60e5ecc8-9d45-4dca-9829-0308832a3de6-S0 at slave(392)@10.3.1.11:62866 (build-srv-03.zq4gs31qjdiunm1ryi1452nvnh.dx.internal.cloudapp.net) disconnected
I0217 02:51:24.182337 4732 master.cpp:3277] Disconnecting agent 60e5ecc8-9d45-4dca-9829-0308832a3de6-S0 at slave(392)@10.3.1.11:62866 (build-srv-03.zq4gs31qjdiunm1ryi1452nvnh.dx.internal.cloudapp.net)
I0217 02:51:24.182337 4732 master.cpp:3296] Deactivating agent 60e5ecc8-9d45-4dca-9829-0308832a3de6-S0 at slave(392)@10.3.1.11:62866 (build-srv-03.zq4gs31qjdiunm1ryi1452nvnh.dx.internal.cloudapp.net)
I0217 02:51:24.182337 1920 hierarchical.cpp:344] Removed framework 60e5ecc8-9d45-4dca-9829-0308832a3de6-0000
I0217 02:51:24.182337 6436 launcher.cpp:156] Asked to destroy container 87af1d2d-501c-485b-ad57-50ff6bf830f2
I0217 02:51:24.183357 1920 hierarchical.cpp:766] Agent 60e5ecc8-9d45-4dca-9829-0308832a3de6-S0 deactivated
I0217 02:51:24.290457 6436 containerizer.cpp:2791] Container 87af1d2d-501c-485b-ad57-50ff6bf830f2 has exited
I0217 02:51:24.320493 2160 master.cpp:1149] Master terminating
I0217 02:51:24.322463 11060 hierarchical.cpp:609] Removed agent 60e5ecc8-9d45-4dca-9829-0308832a3de6-S0
I0217 02:51:24.775504 1228 process.cpp:929] Stopped the socket accept loop
```
- Mesos Reviewbot Windows
Re: Review Request 65695: Made the default executor allow schedulers
to retry task kills.
Posted by Gaston Kleiman <ga...@mesosphere.io>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65695/
-----------------------------------------------------------
(Updated March 16, 2018, 12:53 p.m.)
Review request for mesos, Joseph Wu, Qian Zhang, and Vinod Kone.
Changes
-------
Rebase.
Bugs: MESOS-8530
https://issues.apache.org/jira/browse/MESOS-8530
Repository: mesos
Description
-------
The default executor transitions a task to `TASK_KILLING` and marks its
child container as being killed before posting a `KILL` call to the
agent.
The executor ignores kill requests for containers that are marked as
being killed, and it doesn't remove this mark if the `KILL` call fails.
This means that it's possible for tasks to get stuck in a `TASK_KILLING`
state.
This patch makes the default executor remove the killing mark if a
`KILL` call fails. That way a scheduler can retry a kill.
Diffs (updated)
-----
src/launcher/default_executor.cpp 906836f3b8e0af79d7c61f90fd8a95f193b26e84
Diff: https://reviews.apache.org/r/65695/diff/3/
Changes: https://reviews.apache.org/r/65695/diff/2-3/
Testing
-------
`sudo bin/mesos-tests.sh` on GNU/Linux
Thanks,
Gaston Kleiman
Re: Review Request 65695: Made the default executor allow schedulers
to retry task kills.
Posted by Mesos Reviewbot <re...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65695/#review197714
-----------------------------------------------------------
Patch looks great!
Reviews applied: [65692, 65693, 65694, 65695]
Passed command: export OS='ubuntu:14.04' BUILDTOOL='autotools' COMPILER='gcc' CONFIGURATION='--verbose --disable-libtool-wrappers' ENVIRONMENT='GLOG_v=1 MESOS_VERBOSE=1'; ./support/docker-build.sh
- Mesos Reviewbot