Posted to reviews@mesos.apache.org by Gaston Kleiman <ga...@mesosphere.io> on 2018/02/17 00:27:14 UTC

Review Request 65695: Made the default executor allow schedulers to retry task kills.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65695/
-----------------------------------------------------------

Review request for mesos, Joseph Wu, Qian Zhang, and Vinod Kone.


Bugs: MESOS-8530
    https://issues.apache.org/jira/browse/MESOS-8530


Repository: mesos


Description
-------

The default executor transitions a task to `TASK_KILLING` and marks its
child container as being killed before posting a `KILL` call to the
agent.

The executor ignores kill requests for containers that are marked as
being killed, and it doesn't remove this mark if the `KILL` call fails.
This means that it's possible for tasks to get stuck in a `TASK_KILLING`
state.

This patch makes the default executor remove the killing mark if a
`KILL` call fails. That way a scheduler can retry a kill.
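
The bookkeeping described above can be sketched as a small, self-contained model. This is only an illustration, not the code in `src/launcher/default_executor.cpp`: the `KillTracker` class, its `kill` and `isKilling` methods, and the `postKill` callback are hypothetical names, and the call to the agent is modeled as a synchronous function returning success or failure, whereas the real executor handles the response asynchronously.

```cpp
#include <functional>
#include <set>
#include <string>
#include <utility>

// Hypothetical, simplified model of the executor's kill bookkeeping.
// None of these names come from the Mesos source tree.
class KillTracker
{
public:
  // `postKill` stands in for posting a `KILL` call to the agent; it
  // returns true if the call succeeded and false if it failed.
  explicit KillTracker(std::function<bool(const std::string&)> postKill)
    : postKill_(std::move(postKill)) {}

  // Returns true if a `KILL` call was posted for this request, and
  // false if the request was ignored because the container is already
  // marked as being killed.
  bool kill(const std::string& containerId)
  {
    if (killing_.count(containerId) > 0) {
      return false; // Kill already in progress: ignore the request.
    }

    killing_.insert(containerId); // Mark before posting the call.

    if (!postKill_(containerId)) {
      // The behavior this patch describes: clear the mark on failure
      // so that a later kill request is not ignored.
      killing_.erase(containerId);
    }

    return true;
  }

  bool isKilling(const std::string& containerId) const
  {
    return killing_.count(containerId) > 0;
  }

private:
  std::function<bool(const std::string&)> postKill_;
  std::set<std::string> killing_;
};
```

In this model, without the `erase` on failure a second `kill()` for the same container would return early and never reach the agent, which corresponds to the stuck `TASK_KILLING` state described above.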


Diffs
-----

  src/launcher/default_executor.cpp 8720dada8bc6ca66f9e0fec6dc265eda3dcc7407 


Diff: https://reviews.apache.org/r/65695/diff/1/


Testing
-------

`sudo bin/mesos-tests.sh` on GNU/Linux


Thanks,

Gaston Kleiman


Re: Review Request 65695: Made the default executor allow schedulers to retry task kills.

Posted by Joseph Wu <jo...@mesosphere.io>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65695/#review198841
-----------------------------------------------------------


Ship it!




LGTM.

I wonder if we could repurpose/extend any of our existing tests to cover this case...  The DefaultExecutorTests already cover a couple of kill-cases, but none where the `KILL` call itself fails.

- Joseph Wu




Re: Review Request 65695: Made the default executor allow schedulers to retry task kills.

Posted by Mesos Reviewbot Windows <re...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65695/#review197707
-----------------------------------------------------------



FAIL: Some of the unit tests failed. Please check the relevant logs.

Reviews applied: `['65692', '65693', '65694', '65695']`

Failed command: `Start-MesosCITesting`

All the build artifacts are available at: http://dcos-win.westus.cloudapp.azure.com/mesos-build/review/65695

Relevant logs:

- [mesos-tests-stdout.log](http://dcos-win.westus.cloudapp.azure.com/mesos-build/review/65695/logs/mesos-tests-stdout.log):

```

[----------] 2 tests from ContainerizerType/DefaultContainerDNSFlagTest
[ RUN      ] ContainerizerType/DefaultContainerDNSFlagTest.ValidateFlag/0
[       OK ] ContainerizerType/DefaultContainerDNSFlagTest.ValidateFlag/0 (33 ms)
[ RUN      ] ContainerizerType/DefaultContainerDNSFlagTest.ValidateFlag/1
[       OK ] ContainerizerType/DefaultContainerDNSFlagTest.ValidateFlag/1 (40 ms)
[----------] 2 tests from ContainerizerType/DefaultContainerDNSFlagTest (74 ms total)

[----------] 1 test from IsolationFlag/CpuIsolatorTest
[ RUN      ] IsolationFlag/CpuIsolatorTest.ROOT_UserCpuUsage/0
[       OK ] IsolationFlag/CpuIsolatorTest.ROOT_UserCpuUsage/0 (2556 ms)
[----------] 1 test from IsolationFlag/CpuIsolatorTest (2580 ms total)

[----------] 1 test from IsolationFlag/MemoryIsolatorTest
[ RUN      ] IsolationFlag/MemoryIsolatorTest.ROOT_MemUsage/0
[       OK ] IsolationFlag/MemoryIsolatorTest.ROOT_MemUsage/0 (2460 ms)
[----------] 1 test from IsolationFlag/MemoryIsolatorTest (2483 ms total)

[----------] Global test environment tear-down
[==========] 906 tests from 90 test cases ran. (464637 ms total)
[  PASSED  ] 903 tests.
[  FAILED  ] 3 tests, listed below:
[  FAILED  ] MesosContainerizer/DefaultExecutorTest.KillTask/0, where GetParam() = "mesos"
[  FAILED  ] MesosContainerizer/DefaultExecutorTest.KillMultipleTasks/0, where GetParam() = "mesos"
[  FAILED  ] MesosContainerizer/DefaultExecutorTest.CommitSuicideOnKillTask/0, where GetParam() = "mesos"

 3 FAILED TESTS
  YOU HAVE 211 DISABLED TESTS

```

- [mesos-tests-stderr.log](http://dcos-win.westus.cloudapp.azure.com/mesos-build/review/65695/logs/mesos-tests-stderr.log):

```
I0217 02:51:24.176358  6436 slave.cpp:3879] Shutting down framework 60e5ecc8-9d45-4dca-9829-0308832a3de6-0000
I0217 02:51:24.176358  4732 master.cpp:10249] Updating the state of task 6988de1f-8545-4e3a-b811-49eabacd14c8 of framework 60e5ecc8-9d45-4dca-9829-0308832a3de6-0000 (latest state: TASK_KILLED, status update state: TASK_KILLED)
I0217 02:51:23.502336  9936 exec.cpp:162] Version: 1.6.0
I0217 02:51:23.526335  4932 exec.cpp:236] Executor registered on agent 60e5ecc8-9d45-4dca-9829-0308832a3de6-S0
I0217 02:51:23.530336  9936 executor.cpp:174] Received SUBSCRIBED event
I0217 02:51:23.534332  9936 executor.cpp:178] Subscribed executor on build-srv-03.zq4gs31qjdiunm1ryi1452nvnh.dx.internal.cloudapp.net
I0217 02:51:23.534332  9936 executor.cpp:174] Received LAUNCH event
I0217 02:51:23.537350  9936 executor.cpp:646] Starting task 6988de1f-8545-4e3a-b811-49eabacd14c8
I0217 02:51:23.611337  9936 executor.cpp:481] Running 'D:\DCOS\mesos\src\mesos-containerizer.exe launch <POSSIBLY-SENSITIVE-DATA>'
I0217 02:51:24.148334  9936 executor.cpp:659] Forked command at 6464
I0217 02:51:24.178335  7248 exec.cpp:445] Executor asked to shutdown
I0217 02:51:24.179335  9936 executor.cpp:174] Received SHUTDOWN event
I0217 02:51:24.179335  9936 executor.cpp:756] Shutting down
I0217 02:51:24.179335  9936 executor.cpp:866] Sending SIGTERM to process tree at pid 6
I0217 02:51:24.176358  6436 slave.cpp:6586] Shutting down executor '6988de1f-8545-4e3a-b811-49eabacd14c8' of framework 60e5ecc8-9d45-4dca-9829-0308832a3de6-0000 at executor(1)@10.3.1.11:62887
I0217 02:51:24.177357  6436 slave.cpp:922] Agent terminating
W0217 02:51:24.178335  6436 slave.cpp:3875] Ignoring shutdown framework 60e5ecc8-9d45-4dca-9829-0308832a3de6-0000 because it is terminating
I0217 02:51:24.179335  4732 master.cpp:10348] Removing task 6988de1f-8545-4e3a-b811-49eabacd14c8 with resources cpus(allocated: *):4; mem(allocated: *):2048; disk(allocated: *):1024; ports(allocated: *):[31000-32000] of framework 60e5ecc8-9d45-4dca-9829-0308832a3de6-0000 on agent 60e5ecc8-9d45-4dca-9829-0308832a3de6-S0 at slave(392)@10.3.1.11:62866 (build-srv-03.zq4gs31qjdiunm1ryi1452nvnh.dx.internal.cloudapp.net)
I0217 02:51:24.181335  6436 containerizer.cpp:2338] Destroying container 87af1d2d-501c-485b-ad57-50ff6bf830f2 in RUNNING state
I0217 02:51:24.181335  6436 containerizer.cpp:2952] Transitioning the state of container 87af1d2d-501c-485b-ad57-50ff6bf830f2 from RUNNING to DESTROYING
I0217 02:51:24.182337  4732 master.cpp:1307] Agent 60e5ecc8-9d45-4dca-9829-0308832a3de6-S0 at slave(392)@10.3.1.11:62866 (build-srv-03.zq4gs31qjdiunm1ryi1452nvnh.dx.internal.cloudapp.net) disconnected
I0217 02:51:24.182337  4732 master.cpp:3277] Disconnecting agent 60e5ecc8-9d45-4dca-9829-0308832a3de6-S0 at slave(392)@10.3.1.11:62866 (build-srv-03.zq4gs31qjdiunm1ryi1452nvnh.dx.internal.cloudapp.net)
I0217 02:51:24.182337  4732 master.cpp:3296] Deactivating agent 60e5ecc8-9d45-4dca-9829-0308832a3de6-S0 at slave(392)@10.3.1.11:62866 (build-srv-03.zq4gs31qjdiunm1ryi1452nvnh.dx.internal.cloudapp.net)
I0217 02:51:24.182337  1920 hierarchical.cpp:344] Removed framework 60e5ecc8-9d45-4dca-9829-0308832a3de6-0000
I0217 02:51:24.182337  6436 launcher.cpp:156] Asked to destroy container 87af1d2d-501c-485b-ad57-50ff6bf830f2
I0217 02:51:24.183357  1920 hierarchical.cpp:766] Agent 60e5ecc8-9d45-4dca-9829-0308832a3de6-S0 deactivated
I0217 02:51:24.290457  6436 containerizer.cpp:2791] Container 87af1d2d-501c-485b-ad57-50ff6bf830f2 has exited
I0217 02:51:24.320493  2160 master.cpp:1149] Master terminating
I0217 02:51:24.322463 11060 hierarchical.cpp:609] Removed agent 60e5ecc8-9d45-4dca-9829-0308832a3de6-S0
I0217 02:51:24.775504  1228 process.cpp:929] Stopped the socket accept loop
```

- Mesos Reviewbot Windows




Re: Review Request 65695: Made the default executor allow schedulers to retry task kills.

Posted by Gaston Kleiman <ga...@mesosphere.io>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65695/
-----------------------------------------------------------

(Updated March 16, 2018, 12:53 p.m.)


Review request for mesos, Joseph Wu, Qian Zhang, and Vinod Kone.


Changes
-------

Rebase.


Bugs: MESOS-8530
    https://issues.apache.org/jira/browse/MESOS-8530


Repository: mesos


Description
-------

The default executor transitions a task to `TASK_KILLING` and marks its
child container as being killed before posting a `KILL` call to the
agent.

The executor ignores kill requests for containers that are marked as
being killed, and it doesn't remove this mark if the `KILL` call fails.
This means that it's possible for tasks to get stuck in a `TASK_KILLING`
state.

This patch makes the default executor remove the killing mark if a
`KILL` call fails. That way a scheduler can retry a kill.


Diffs (updated)
-----

  src/launcher/default_executor.cpp 906836f3b8e0af79d7c61f90fd8a95f193b26e84 


Diff: https://reviews.apache.org/r/65695/diff/3/

Changes: https://reviews.apache.org/r/65695/diff/2-3/


Testing
-------

`sudo bin/mesos-tests.sh` on GNU/Linux


Thanks,

Gaston Kleiman


Re: Review Request 65695: Made the default executor allow schedulers to retry task kills.

Posted by Mesos Reviewbot <re...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65695/#review197714
-----------------------------------------------------------



Patch looks great!

Reviews applied: [65692, 65693, 65694, 65695]

Passed command: export OS='ubuntu:14.04' BUILDTOOL='autotools' COMPILER='gcc' CONFIGURATION='--verbose --disable-libtool-wrappers' ENVIRONMENT='GLOG_v=1 MESOS_VERBOSE=1'; ./support/docker-build.sh

- Mesos Reviewbot

