You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@mesos.apache.org by Alexander Rukletsov <ru...@gmail.com> on 2016/04/18 15:05:18 UTC

Review Request 46325: Updated HTTP command executor to support kill policy in Kill event.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/46325/
-----------------------------------------------------------

Review request for mesos, Ben Mahler and Qian Zhang.


Bugs: MESOS-4908
    https://issues.apache.org/jira/browse/MESOS-4908


Repository: mesos


Description
-------

Kill policy can be provided in a kill event. In this case it should
take precedence over kill policy specified when the task was launched.
When kill event is issued multiple times during the task termination,
the signal escalation timeout (the time a task has between SIGTERM
and SIGKILL) may be reduced.

Since updating the delay timer (we use it for signal escalation delay)
is currently not possible, we cancel the existing signal timer and set
up a new one. `Clock::cancel()` guarantees that, if existed, the timer
is removed before the function returns; hence we do not set up more
than 1 timer for signal escalation delay.


Diffs
-----

  src/launcher/http_command_executor.cpp ad484e0e6f5067b6c166111c91b1ff1e8c05d9ac 

Diff: https://reviews.apache.org/r/46325/diff/


Testing
-------

`make check`

Tested manually using modified `mesos-execute` in a way that **two** extra kill task requests are sent, 2s and 3s after receiving `TASK_RUNNING`. Each kill task request specifies `KillPolicy` with 1s grace period. Together with a kill *without* a kill policy scheduled 1s after the task is being launched, the task receives **three** kill requests in total.

Master: `./bin/mesos-master.sh --work_dir=/tmp/w/m --ip=127.0.0.1`
Agent: `./bin/mesos-slave.sh --work_dir=/tmp/w/s --master=127.0.0.1:5050 --http_command_executor`
Mesos-execute: `./src/mesos-execute --master=127.0.0.1:5050 --name=test --command="/Users/alex/bin/unresponsive_process" --env='{"GLOG_v": "2"}' --kill_after=1secs`

HTTP command executor log:
```
Received SUBSCRIBED event
Subscribed executor on alexr.fritz.box
Received LAUNCH event
Starting task test
sh -c '/Users/alex/bin/unresponsive_process'
Forked command at 17475
14455919081943275466
Received ACKNOWLEDGED event
17172602460659762152
Received KILL event
Received kill for task test
Sending SIGTERM to process tree at pid 17475
Sent SIGTERM to the following process trees:
[ 
--- 17475 /Users/alex/bin/unresponsive_process
]
4381544758593790168
Scheduling escalation to SIGKILL in 3secs from now
Received ACKNOWLEDGED event
Received KILL event
Received kill for task test
Rescheduling escalation to SIGKILL in 1secs from now
10370891801885978953
Process 17475 did not terminate after 1secs, sending SIGKILL to process tree at 17475
Killed the following process trees:
[ 
--- 17475 /Users/alex/bin/unresponsive_process
]
Received KILL event
Received kill for task test
Command terminated with signal Killed: 9 (pid: 17475)
```

Excerpt from the agent log that shows all 3 kill task requests and that the segnal escalation timeout was reduced from 3s to 1s:
```
I0418 14:27:17.825070 244285440 slave.cpp:3605] Forwarding the update TASK_RUNNING (UUID: 925e2d89-f6eb-464d-9a50-a74a8e07bc88) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000 to master@127.0.0.1:5050
I0418 14:27:17.831233 242139136 status_update_manager.cpp:392] Received status update acknowledgement (UUID: 925e2d89-f6eb-464d-9a50-a74a8e07bc88) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
I0418 14:27:18.834309 244285440 slave.cpp:2046] Asked to kill task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
I0418 14:27:18.842150 244285440 http.cpp:178] HTTP POST for /slave(1)/api/v1/executor from 192.168.178.24:54206
I0418 14:27:18.842331 244285440 slave.cpp:3207] Handling status update TASK_KILLING (UUID: a2f6eca7-b3e5-4e45-adcb-356f75355563) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
I0418 14:27:18.843214 242139136 status_update_manager.cpp:320] Received status update TASK_KILLING (UUID: a2f6eca7-b3e5-4e45-adcb-356f75355563) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
I0418 14:27:18.843387 243748864 slave.cpp:3605] Forwarding the update TASK_KILLING (UUID: a2f6eca7-b3e5-4e45-adcb-356f75355563) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000 to master@127.0.0.1:5050
I0418 14:27:18.846459 242675712 status_update_manager.cpp:392] Received status update acknowledgement (UUID: a2f6eca7-b3e5-4e45-adcb-356f75355563) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
I0418 14:27:19.836699 240529408 slave.cpp:2046] Asked to kill task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
I0418 14:27:20.850658 240529408 slave.cpp:2046] Asked to kill task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
I0418 14:27:20.927338 241602560 http.cpp:178] HTTP POST for /slave(1)/api/v1/executor from 192.168.178.24:54206
I0418 14:27:20.927465 241602560 slave.cpp:3207] Handling status update TASK_KILLED (UUID: fdcb43ec-271b-4976-aaf3-59a031cdc924) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
I0418 14:27:20.928771 240529408 status_update_manager.cpp:320] Received status update TASK_KILLED (UUID: fdcb43ec-271b-4976-aaf3-59a031cdc924) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
I0418 14:27:20.928933 243748864 slave.cpp:3605] Forwarding the update TASK_KILLED (UUID: fdcb43ec-271b-4976-aaf3-59a031cdc924) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000 to master@127.0.0.1:5050

```


Thanks,

Alexander Rukletsov


Re: Review Request 46325: Updated HTTP command executor to support kill policy in Kill event.

Posted by Mesos ReviewBot <re...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/46325/#review129344
-----------------------------------------------------------



Patch looks great!

Reviews applied: [46321, 46322, 46323, 46324, 46325]

Passed command: export OS='ubuntu:14.04' CONFIGURATION='--verbose' COMPILER='gcc' ENVIRONMENT='GLOG_v=1 MESOS_VERBOSE=1'; ./support/docker_build.sh

- Mesos ReviewBot


On April 18, 2016, 1:05 p.m., Alexander Rukletsov wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/46325/
> -----------------------------------------------------------
> 
> (Updated April 18, 2016, 1:05 p.m.)
> 
> 
> Review request for mesos, Ben Mahler and Qian Zhang.
> 
> 
> Bugs: MESOS-4908
>     https://issues.apache.org/jira/browse/MESOS-4908
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Kill policy can be provided in a kill event. In this case it should
> take precedence over kill policy specified when the task was launched.
> When kill event is issued multiple times during the task termination,
> the signal escalation timeout (the time a task has between SIGTERM
> and SIGKILL) may be reduced.
> 
> Since updating the delay timer (we use it for signal escalation delay)
> is currently not possible, we cancel the existing signal timer and set
> up a new one. `Clock::cancel()` guarantees that, if existed, the timer
> is removed before the function returns; hence we do not set up more
> than 1 timer for signal escalation delay.
> 
> 
> Diffs
> -----
> 
>   src/launcher/http_command_executor.cpp ad484e0e6f5067b6c166111c91b1ff1e8c05d9ac 
> 
> Diff: https://reviews.apache.org/r/46325/diff/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> Tested manually using modified `mesos-execute` in a way that **two** extra kill task requests are sent, 2s and 3s after receiving `TASK_RUNNING`. Each kill task request specifies `KillPolicy` with 1s grace period. Together with a kill *without* a kill policy scheduled 1s after the task is being launched, the task receives **three** kill requests in total.
> 
> Master: `./bin/mesos-master.sh --work_dir=/tmp/w/m --ip=127.0.0.1`
> Agent: `./bin/mesos-slave.sh --work_dir=/tmp/w/s --master=127.0.0.1:5050 --http_command_executor`
> Mesos-execute: `./src/mesos-execute --master=127.0.0.1:5050 --name=test --command="/Users/alex/bin/unresponsive_process" --env='{"GLOG_v": "2"}' --kill_after=1secs`
> 
> HTTP command executor log:
> ```
> Received SUBSCRIBED event
> Subscribed executor on alexr.fritz.box
> Received LAUNCH event
> Starting task test
> sh -c '/Users/alex/bin/unresponsive_process'
> Forked command at 17475
> 14455919081943275466
> Received ACKNOWLEDGED event
> 17172602460659762152
> Received KILL event
> Received kill for task test
> Sending SIGTERM to process tree at pid 17475
> Sent SIGTERM to the following process trees:
> [ 
> --- 17475 /Users/alex/bin/unresponsive_process
> ]
> 4381544758593790168
> Scheduling escalation to SIGKILL in 3secs from now
> Received ACKNOWLEDGED event
> Received KILL event
> Received kill for task test
> Rescheduling escalation to SIGKILL in 1secs from now
> 10370891801885978953
> Process 17475 did not terminate after 1secs, sending SIGKILL to process tree at 17475
> Killed the following process trees:
> [ 
> --- 17475 /Users/alex/bin/unresponsive_process
> ]
> Received KILL event
> Received kill for task test
> Command terminated with signal Killed: 9 (pid: 17475)
> ```
> 
> Excerpt from the agent log that shows all 3 kill task requests and that the segnal escalation timeout was reduced from 3s to 1s:
> ```
> I0418 14:27:17.825070 244285440 slave.cpp:3605] Forwarding the update TASK_RUNNING (UUID: 925e2d89-f6eb-464d-9a50-a74a8e07bc88) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000 to master@127.0.0.1:5050
> I0418 14:27:17.831233 242139136 status_update_manager.cpp:392] Received status update acknowledgement (UUID: 925e2d89-f6eb-464d-9a50-a74a8e07bc88) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:18.834309 244285440 slave.cpp:2046] Asked to kill task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:18.842150 244285440 http.cpp:178] HTTP POST for /slave(1)/api/v1/executor from 192.168.178.24:54206
> I0418 14:27:18.842331 244285440 slave.cpp:3207] Handling status update TASK_KILLING (UUID: a2f6eca7-b3e5-4e45-adcb-356f75355563) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:18.843214 242139136 status_update_manager.cpp:320] Received status update TASK_KILLING (UUID: a2f6eca7-b3e5-4e45-adcb-356f75355563) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:18.843387 243748864 slave.cpp:3605] Forwarding the update TASK_KILLING (UUID: a2f6eca7-b3e5-4e45-adcb-356f75355563) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000 to master@127.0.0.1:5050
> I0418 14:27:18.846459 242675712 status_update_manager.cpp:392] Received status update acknowledgement (UUID: a2f6eca7-b3e5-4e45-adcb-356f75355563) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:19.836699 240529408 slave.cpp:2046] Asked to kill task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:20.850658 240529408 slave.cpp:2046] Asked to kill task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:20.927338 241602560 http.cpp:178] HTTP POST for /slave(1)/api/v1/executor from 192.168.178.24:54206
> I0418 14:27:20.927465 241602560 slave.cpp:3207] Handling status update TASK_KILLED (UUID: fdcb43ec-271b-4976-aaf3-59a031cdc924) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:20.928771 240529408 status_update_manager.cpp:320] Received status update TASK_KILLED (UUID: fdcb43ec-271b-4976-aaf3-59a031cdc924) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:20.928933 243748864 slave.cpp:3605] Forwarding the update TASK_KILLED (UUID: fdcb43ec-271b-4976-aaf3-59a031cdc924) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000 to master@127.0.0.1:5050
> 
> ```
> 
> 
> Thanks,
> 
> Alexander Rukletsov
> 
>


Re: Review Request 46325: Updated HTTP command executor to support kill policy in Kill event.

Posted by Alexander Rukletsov <ru...@gmail.com>.

> On April 21, 2016, 9:56 p.m., Ben Mahler wrote:
> > Could you pull out the 'killed' bug fix? Any reason it's bundled in this patch?

Will do.


- Alexander


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/46325/#review129937
-----------------------------------------------------------


On April 22, 2016, 2:46 p.m., Alexander Rukletsov wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/46325/
> -----------------------------------------------------------
> 
> (Updated April 22, 2016, 2:46 p.m.)
> 
> 
> Review request for mesos, Ben Mahler and Qian Zhang.
> 
> 
> Bugs: MESOS-4908
>     https://issues.apache.org/jira/browse/MESOS-4908
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Kill policy can be provided in a kill event. In this case it should
> take precedence over kill policy specified when the task was launched.
> When kill event is issued multiple times during the task termination,
> the signal escalation timeout (the time a task has between SIGTERM
> and SIGKILL) may be reduced.
> 
> Since updating the delay timer (we use it for signal escalation delay)
> is currently not possible, we cancel the existing signal timer and set
> up a new one. `Clock::cancel()` guarantees that, if existed, the timer
> is removed before the function returns; hence we do not set up more
> than 1 timer for signal escalation delay.
> 
> 
> Diffs
> -----
> 
>   src/launcher/http_command_executor.cpp 2ce296740abac096aa85ad11dd4d191e42c1aa06 
> 
> Diff: https://reviews.apache.org/r/46325/diff/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> Tested manually using modified `mesos-execute` in a way that **two** extra kill task requests are sent, 2s and 3s after receiving `TASK_RUNNING`. Each kill task request specifies `KillPolicy` with 1s grace period. Together with a kill *without* a kill policy scheduled 1s after the task is being launched, the task receives **three** kill requests in total.
> 
> Master: `./bin/mesos-master.sh --work_dir=/tmp/w/m --ip=127.0.0.1`
> Agent: `./bin/mesos-slave.sh --work_dir=/tmp/w/s --master=127.0.0.1:5050 --http_command_executor`
> Mesos-execute: `./src/mesos-execute --master=127.0.0.1:5050 --name=test --command="/Users/alex/bin/unresponsive_process" --env='{"GLOG_v": "2"}' --kill_after=1secs`
> 
> HTTP command executor log:
> ```
> Received SUBSCRIBED event
> Subscribed executor on alexr.fritz.box
> Received LAUNCH event
> Starting task test
> sh -c '/Users/alex/bin/unresponsive_process'
> Forked command at 17475
> 14455919081943275466
> Received ACKNOWLEDGED event
> 17172602460659762152
> Received KILL event
> Received kill for task test
> Sending SIGTERM to process tree at pid 17475
> Sent SIGTERM to the following process trees:
> [ 
> --- 17475 /Users/alex/bin/unresponsive_process
> ]
> 4381544758593790168
> Scheduling escalation to SIGKILL in 3secs from now
> Received ACKNOWLEDGED event
> Received KILL event
> Received kill for task test
> Rescheduling escalation to SIGKILL in 1secs from now
> 10370891801885978953
> Process 17475 did not terminate after 1secs, sending SIGKILL to process tree at 17475
> Killed the following process trees:
> [ 
> --- 17475 /Users/alex/bin/unresponsive_process
> ]
> Received KILL event
> Received kill for task test
> Command terminated with signal Killed: 9 (pid: 17475)
> ```
> 
> Excerpt from the agent log that shows all 3 kill task requests and that the segnal escalation timeout was reduced from 3s to 1s:
> ```
> I0418 14:27:17.825070 244285440 slave.cpp:3605] Forwarding the update TASK_RUNNING (UUID: 925e2d89-f6eb-464d-9a50-a74a8e07bc88) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000 to master@127.0.0.1:5050
> I0418 14:27:17.831233 242139136 status_update_manager.cpp:392] Received status update acknowledgement (UUID: 925e2d89-f6eb-464d-9a50-a74a8e07bc88) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:18.834309 244285440 slave.cpp:2046] Asked to kill task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:18.842150 244285440 http.cpp:178] HTTP POST for /slave(1)/api/v1/executor from 192.168.178.24:54206
> I0418 14:27:18.842331 244285440 slave.cpp:3207] Handling status update TASK_KILLING (UUID: a2f6eca7-b3e5-4e45-adcb-356f75355563) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:18.843214 242139136 status_update_manager.cpp:320] Received status update TASK_KILLING (UUID: a2f6eca7-b3e5-4e45-adcb-356f75355563) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:18.843387 243748864 slave.cpp:3605] Forwarding the update TASK_KILLING (UUID: a2f6eca7-b3e5-4e45-adcb-356f75355563) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000 to master@127.0.0.1:5050
> I0418 14:27:18.846459 242675712 status_update_manager.cpp:392] Received status update acknowledgement (UUID: a2f6eca7-b3e5-4e45-adcb-356f75355563) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:19.836699 240529408 slave.cpp:2046] Asked to kill task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:20.850658 240529408 slave.cpp:2046] Asked to kill task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:20.927338 241602560 http.cpp:178] HTTP POST for /slave(1)/api/v1/executor from 192.168.178.24:54206
> I0418 14:27:20.927465 241602560 slave.cpp:3207] Handling status update TASK_KILLED (UUID: fdcb43ec-271b-4976-aaf3-59a031cdc924) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:20.928771 240529408 status_update_manager.cpp:320] Received status update TASK_KILLED (UUID: fdcb43ec-271b-4976-aaf3-59a031cdc924) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:20.928933 243748864 slave.cpp:3605] Forwarding the update TASK_KILLED (UUID: fdcb43ec-271b-4976-aaf3-59a031cdc924) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000 to master@127.0.0.1:5050
> 
> ```
> 
> 
> Thanks,
> 
> Alexander Rukletsov
> 
>


Re: Review Request 46325: Updated HTTP command executor to support kill policy in Kill event.

Posted by Alexander Rukletsov <ru...@gmail.com>.

> On April 21, 2016, 9:56 p.m., Ben Mahler wrote:
> > src/launcher/http_command_executor.cpp, lines 172-176
> > <https://reviews.apache.org/r/46325/diff/3/?file=1355102#file1355102line172>
> >
> >     Wrap at the paren:
> >     
> >     ```
> >               kill(event.kill().task_id(),
> >                    event.kill().has_kill_policy()
> >                      ? Option<KillPolicy>(event.kill().kill_policy())
> >                      : None());
> >     ```
> >     
> >     Any way to avoid the explicit Option construction? Does this compile?
> >     
> >     ```
> >               kill(event.kill().task_id(),
> >                    event.kill().has_kill_policy()
> >                     ? Some(event.kill().kill_policy())
> >                     : None());
> >     ```
> >     
> >     If not, consider pulling out the kill policy as an option variable.

Unfortunately, `Some(event.kill().kill_policy())` does not compile. Passing `Event::Kill` seems unfortunate as well: if we want to call `kill()` from another place in the code, we have to prepare the proto message. I'll go with a local variable.


> On April 21, 2016, 9:56 p.m., Ben Mahler wrote:
> > src/launcher/http_command_executor.cpp, line 665
> > <https://reviews.apache.org/r/46325/diff/3/?file=1355102#file1355102line665>
> >
> >     Leaving this unimplemented as a TODO seems to have implications on the documentation in your previous patch:
> >     
> >     https://reviews.apache.org/r/46322/
> >     
> >     It seems like all you would need to do to implement the TODO is to remove this if condition and make the logging more clear? Otherwise, if you leave this out we really need to document it within the Kill Call. My hunch is that in writing that documentation we'll realize we should just do an override always.

Yeah, we can remove the check.


- Alexander


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/46325/#review129937
-----------------------------------------------------------


On April 21, 2016, 2:28 p.m., Alexander Rukletsov wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/46325/
> -----------------------------------------------------------
> 
> (Updated April 21, 2016, 2:28 p.m.)
> 
> 
> Review request for mesos, Ben Mahler and Qian Zhang.
> 
> 
> Bugs: MESOS-4908
>     https://issues.apache.org/jira/browse/MESOS-4908
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Kill policy can be provided in a kill event. In this case it should
> take precedence over kill policy specified when the task was launched.
> When kill event is issued multiple times during the task termination,
> the signal escalation timeout (the time a task has between SIGTERM
> and SIGKILL) may be reduced.
> 
> Since updating the delay timer (we use it for signal escalation delay)
> is currently not possible, we cancel the existing signal timer and set
> up a new one. `Clock::cancel()` guarantees that, if existed, the timer
> is removed before the function returns; hence we do not set up more
> than 1 timer for signal escalation delay.
> 
> 
> Diffs
> -----
> 
>   src/launcher/http_command_executor.cpp 9dfe48108cababeb2d4a6af74434880d79245c21 
> 
> Diff: https://reviews.apache.org/r/46325/diff/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> Tested manually using modified `mesos-execute` in a way that **two** extra kill task requests are sent, 2s and 3s after receiving `TASK_RUNNING`. Each kill task request specifies `KillPolicy` with 1s grace period. Together with a kill *without* a kill policy scheduled 1s after the task is being launched, the task receives **three** kill requests in total.
> 
> Master: `./bin/mesos-master.sh --work_dir=/tmp/w/m --ip=127.0.0.1`
> Agent: `./bin/mesos-slave.sh --work_dir=/tmp/w/s --master=127.0.0.1:5050 --http_command_executor`
> Mesos-execute: `./src/mesos-execute --master=127.0.0.1:5050 --name=test --command="/Users/alex/bin/unresponsive_process" --env='{"GLOG_v": "2"}' --kill_after=1secs`
> 
> HTTP command executor log:
> ```
> Received SUBSCRIBED event
> Subscribed executor on alexr.fritz.box
> Received LAUNCH event
> Starting task test
> sh -c '/Users/alex/bin/unresponsive_process'
> Forked command at 17475
> 14455919081943275466
> Received ACKNOWLEDGED event
> 17172602460659762152
> Received KILL event
> Received kill for task test
> Sending SIGTERM to process tree at pid 17475
> Sent SIGTERM to the following process trees:
> [ 
> --- 17475 /Users/alex/bin/unresponsive_process
> ]
> 4381544758593790168
> Scheduling escalation to SIGKILL in 3secs from now
> Received ACKNOWLEDGED event
> Received KILL event
> Received kill for task test
> Rescheduling escalation to SIGKILL in 1secs from now
> 10370891801885978953
> Process 17475 did not terminate after 1secs, sending SIGKILL to process tree at 17475
> Killed the following process trees:
> [ 
> --- 17475 /Users/alex/bin/unresponsive_process
> ]
> Received KILL event
> Received kill for task test
> Command terminated with signal Killed: 9 (pid: 17475)
> ```
> 
> Excerpt from the agent log that shows all 3 kill task requests and that the segnal escalation timeout was reduced from 3s to 1s:
> ```
> I0418 14:27:17.825070 244285440 slave.cpp:3605] Forwarding the update TASK_RUNNING (UUID: 925e2d89-f6eb-464d-9a50-a74a8e07bc88) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000 to master@127.0.0.1:5050
> I0418 14:27:17.831233 242139136 status_update_manager.cpp:392] Received status update acknowledgement (UUID: 925e2d89-f6eb-464d-9a50-a74a8e07bc88) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:18.834309 244285440 slave.cpp:2046] Asked to kill task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:18.842150 244285440 http.cpp:178] HTTP POST for /slave(1)/api/v1/executor from 192.168.178.24:54206
> I0418 14:27:18.842331 244285440 slave.cpp:3207] Handling status update TASK_KILLING (UUID: a2f6eca7-b3e5-4e45-adcb-356f75355563) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:18.843214 242139136 status_update_manager.cpp:320] Received status update TASK_KILLING (UUID: a2f6eca7-b3e5-4e45-adcb-356f75355563) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:18.843387 243748864 slave.cpp:3605] Forwarding the update TASK_KILLING (UUID: a2f6eca7-b3e5-4e45-adcb-356f75355563) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000 to master@127.0.0.1:5050
> I0418 14:27:18.846459 242675712 status_update_manager.cpp:392] Received status update acknowledgement (UUID: a2f6eca7-b3e5-4e45-adcb-356f75355563) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:19.836699 240529408 slave.cpp:2046] Asked to kill task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:20.850658 240529408 slave.cpp:2046] Asked to kill task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:20.927338 241602560 http.cpp:178] HTTP POST for /slave(1)/api/v1/executor from 192.168.178.24:54206
> I0418 14:27:20.927465 241602560 slave.cpp:3207] Handling status update TASK_KILLED (UUID: fdcb43ec-271b-4976-aaf3-59a031cdc924) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:20.928771 240529408 status_update_manager.cpp:320] Received status update TASK_KILLED (UUID: fdcb43ec-271b-4976-aaf3-59a031cdc924) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:20.928933 243748864 slave.cpp:3605] Forwarding the update TASK_KILLED (UUID: fdcb43ec-271b-4976-aaf3-59a031cdc924) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000 to master@127.0.0.1:5050
> 
> ```
> 
> 
> Thanks,
> 
> Alexander Rukletsov
> 
>


Re: Review Request 46325: Updated HTTP command executor to support kill policy in Kill event.

Posted by Ben Mahler <be...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/46325/#review129937
-----------------------------------------------------------



Could you pull out the 'killed' bug fix? Any reason it's bundled in this patch?


src/launcher/http_command_executor.cpp (lines 172 - 176)
<https://reviews.apache.org/r/46325/#comment193489>

    Wrap at the paren:
    
    ```
              kill(event.kill().task_id(),
                   event.kill().has_kill_policy()
                     ? Option<KillPolicy>(event.kill().kill_policy())
                     : None());
    ```
    
    Any way to avoid the explicit Option construction? Does this compile?
    
    ```
              kill(event.kill().task_id(),
                   event.kill().has_kill_policy()
                    ? Some(event.kill().kill_policy())
                    : None());
    ```
    
    If not, consider pulling out the kill policy as an option variable.



src/launcher/http_command_executor.cpp (lines 603 - 605)
<https://reviews.apache.org/r/46325/#comment193490>

    How about s/overridingKillPolicy/override/ ?
    
    Alternatively, if we took the `Event::Kill kill` then we could distinguish easily:
    
    ```
    kill.kill_policy(); // vs
    killPolicy; // the member variable
    ```



src/launcher/http_command_executor.cpp (lines 656 - 661)
<https://reviews.apache.org/r/46325/#comment193492>

    The addition of 'killed' and this check looks like a bug fix? Any reason it is being bundled with this patch?



src/launcher/http_command_executor.cpp (line 665)
<https://reviews.apache.org/r/46325/#comment193496>

    Leaving this unimplemented as a TODO seems to have implications on the documentation in your previous patch:
    
    https://reviews.apache.org/r/46322/
    
    It seems like all you would need to do to implement the TODO is to remove this if condition and make the logging more clear? Otherwise, if you leave this out we really need to document it within the Kill Call. My hunch is that in writing that documentation we'll realize we should just do an override always.



src/launcher/http_command_executor.cpp (lines 783 - 787)
<https://reviews.apache.org/r/46325/#comment193497>

    The addition of 'killed' and this check looks like a bug fix? Any reason it is being bundled with this patch?


- Ben Mahler


On April 21, 2016, 2:28 p.m., Alexander Rukletsov wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/46325/
> -----------------------------------------------------------
> 
> (Updated April 21, 2016, 2:28 p.m.)
> 
> 
> Review request for mesos, Ben Mahler and Qian Zhang.
> 
> 
> Bugs: MESOS-4908
>     https://issues.apache.org/jira/browse/MESOS-4908
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Kill policy can be provided in a kill event. In this case it should
> take precedence over kill policy specified when the task was launched.
> When kill event is issued multiple times during the task termination,
> the signal escalation timeout (the time a task has between SIGTERM
> and SIGKILL) may be reduced.
> 
> Since updating the delay timer (we use it for signal escalation delay)
> is currently not possible, we cancel the existing signal timer and set
> up a new one. `Clock::cancel()` guarantees that, if existed, the timer
> is removed before the function returns; hence we do not set up more
> than 1 timer for signal escalation delay.
> 
> 
> Diffs
> -----
> 
>   src/launcher/http_command_executor.cpp 9dfe48108cababeb2d4a6af74434880d79245c21 
> 
> Diff: https://reviews.apache.org/r/46325/diff/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> Tested manually using modified `mesos-execute` in a way that **two** extra kill task requests are sent, 2s and 3s after receiving `TASK_RUNNING`. Each kill task request specifies `KillPolicy` with 1s grace period. Together with a kill *without* a kill policy scheduled 1s after the task is being launched, the task receives **three** kill requests in total.
> 
> Master: `./bin/mesos-master.sh --work_dir=/tmp/w/m --ip=127.0.0.1`
> Agent: `./bin/mesos-slave.sh --work_dir=/tmp/w/s --master=127.0.0.1:5050 --http_command_executor`
> Mesos-execute: `./src/mesos-execute --master=127.0.0.1:5050 --name=test --command="/Users/alex/bin/unresponsive_process" --env='{"GLOG_v": "2"}' --kill_after=1secs`
> 
> HTTP command executor log:
> ```
> Received SUBSCRIBED event
> Subscribed executor on alexr.fritz.box
> Received LAUNCH event
> Starting task test
> sh -c '/Users/alex/bin/unresponsive_process'
> Forked command at 17475
> 14455919081943275466
> Received ACKNOWLEDGED event
> 17172602460659762152
> Received KILL event
> Received kill for task test
> Sending SIGTERM to process tree at pid 17475
> Sent SIGTERM to the following process trees:
> [ 
> --- 17475 /Users/alex/bin/unresponsive_process
> ]
> 4381544758593790168
> Scheduling escalation to SIGKILL in 3secs from now
> Received ACKNOWLEDGED event
> Received KILL event
> Received kill for task test
> Rescheduling escalation to SIGKILL in 1secs from now
> 10370891801885978953
> Process 17475 did not terminate after 1secs, sending SIGKILL to process tree at 17475
> Killed the following process trees:
> [ 
> --- 17475 /Users/alex/bin/unresponsive_process
> ]
> Received KILL event
> Received kill for task test
> Command terminated with signal Killed: 9 (pid: 17475)
> ```
> 
> Excerpt from the agent log that shows all 3 kill task requests and that the segnal escalation timeout was reduced from 3s to 1s:
> ```
> I0418 14:27:17.825070 244285440 slave.cpp:3605] Forwarding the update TASK_RUNNING (UUID: 925e2d89-f6eb-464d-9a50-a74a8e07bc88) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000 to master@127.0.0.1:5050
> I0418 14:27:17.831233 242139136 status_update_manager.cpp:392] Received status update acknowledgement (UUID: 925e2d89-f6eb-464d-9a50-a74a8e07bc88) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:18.834309 244285440 slave.cpp:2046] Asked to kill task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:18.842150 244285440 http.cpp:178] HTTP POST for /slave(1)/api/v1/executor from 192.168.178.24:54206
> I0418 14:27:18.842331 244285440 slave.cpp:3207] Handling status update TASK_KILLING (UUID: a2f6eca7-b3e5-4e45-adcb-356f75355563) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:18.843214 242139136 status_update_manager.cpp:320] Received status update TASK_KILLING (UUID: a2f6eca7-b3e5-4e45-adcb-356f75355563) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:18.843387 243748864 slave.cpp:3605] Forwarding the update TASK_KILLING (UUID: a2f6eca7-b3e5-4e45-adcb-356f75355563) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000 to master@127.0.0.1:5050
> I0418 14:27:18.846459 242675712 status_update_manager.cpp:392] Received status update acknowledgement (UUID: a2f6eca7-b3e5-4e45-adcb-356f75355563) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:19.836699 240529408 slave.cpp:2046] Asked to kill task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:20.850658 240529408 slave.cpp:2046] Asked to kill task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:20.927338 241602560 http.cpp:178] HTTP POST for /slave(1)/api/v1/executor from 192.168.178.24:54206
> I0418 14:27:20.927465 241602560 slave.cpp:3207] Handling status update TASK_KILLED (UUID: fdcb43ec-271b-4976-aaf3-59a031cdc924) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:20.928771 240529408 status_update_manager.cpp:320] Received status update TASK_KILLED (UUID: fdcb43ec-271b-4976-aaf3-59a031cdc924) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:20.928933 243748864 slave.cpp:3605] Forwarding the update TASK_KILLED (UUID: fdcb43ec-271b-4976-aaf3-59a031cdc924) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000 to master@127.0.0.1:5050
> 
> ```
> 
> 
> Thanks,
> 
> Alexander Rukletsov
> 
>


Re: Review Request 46325: Updated HTTP command executor to support kill policy in Kill event.

Posted by Mesos ReviewBot <re...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/46325/#review130116
-----------------------------------------------------------



Patch looks great!

Reviews applied: [46321, 46491, 46322, 46323, 46325]

Passed command: export OS='ubuntu:14.04' CONFIGURATION='--verbose' COMPILER='gcc' ENVIRONMENT='GLOG_v=1 MESOS_VERBOSE=1'; ./support/docker_build.sh

- Mesos ReviewBot


On April 22, 2016, 2:46 p.m., Alexander Rukletsov wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/46325/
> -----------------------------------------------------------
> 
> (Updated April 22, 2016, 2:46 p.m.)
> 
> 
> Review request for mesos, Ben Mahler and Qian Zhang.
> 
> 
> Bugs: MESOS-4908
>     https://issues.apache.org/jira/browse/MESOS-4908
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Kill policy can be provided in a kill event. In this case it should
> take precedence over kill policy specified when the task was launched.
> When kill event is issued multiple times during the task termination,
> the signal escalation timeout (the time a task has between SIGTERM
> and SIGKILL) may be reduced.
> 
> Since updating the delay timer (we use it for signal escalation delay)
> is currently not possible, we cancel the existing signal timer and set
> up a new one. `Clock::cancel()` guarantees that, if existed, the timer
> is removed before the function returns; hence we do not set up more
> than 1 timer for signal escalation delay.
> 
> 
> Diffs
> -----
> 
>   src/launcher/http_command_executor.cpp 2ce296740abac096aa85ad11dd4d191e42c1aa06 
> 
> Diff: https://reviews.apache.org/r/46325/diff/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> Tested manually using modified `mesos-execute` in a way that **two** extra kill task requests are sent, 2s and 3s after receiving `TASK_RUNNING`. Each kill task request specifies `KillPolicy` with 1s grace period. Together with a kill *without* a kill policy scheduled 1s after the task is being launched, the task receives **three** kill requests in total.
> 
> Master: `./bin/mesos-master.sh --work_dir=/tmp/w/m --ip=127.0.0.1`
> Agent: `./bin/mesos-slave.sh --work_dir=/tmp/w/s --master=127.0.0.1:5050 --http_command_executor`
> Mesos-execute: `./src/mesos-execute --master=127.0.0.1:5050 --name=test --command="/Users/alex/bin/unresponsive_process" --env='{"GLOG_v": "2"}' --kill_after=1secs`
> 
> HTTP command executor log:
> ```
> Received SUBSCRIBED event
> Subscribed executor on alexr.fritz.box
> Received LAUNCH event
> Starting task test
> sh -c '/Users/alex/bin/unresponsive_process'
> Forked command at 17475
> 14455919081943275466
> Received ACKNOWLEDGED event
> 17172602460659762152
> Received KILL event
> Received kill for task test
> Sending SIGTERM to process tree at pid 17475
> Sent SIGTERM to the following process trees:
> [ 
> --- 17475 /Users/alex/bin/unresponsive_process
> ]
> 4381544758593790168
> Scheduling escalation to SIGKILL in 3secs from now
> Received ACKNOWLEDGED event
> Received KILL event
> Received kill for task test
> Rescheduling escalation to SIGKILL in 1secs from now
> 10370891801885978953
> Process 17475 did not terminate after 1secs, sending SIGKILL to process tree at 17475
> Killed the following process trees:
> [ 
> --- 17475 /Users/alex/bin/unresponsive_process
> ]
> Received KILL event
> Received kill for task test
> Command terminated with signal Killed: 9 (pid: 17475)
> ```
> 
> Excerpt from the agent log that shows all 3 kill task requests and that the segnal escalation timeout was reduced from 3s to 1s:
> ```
> I0418 14:27:17.825070 244285440 slave.cpp:3605] Forwarding the update TASK_RUNNING (UUID: 925e2d89-f6eb-464d-9a50-a74a8e07bc88) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000 to master@127.0.0.1:5050
> I0418 14:27:17.831233 242139136 status_update_manager.cpp:392] Received status update acknowledgement (UUID: 925e2d89-f6eb-464d-9a50-a74a8e07bc88) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:18.834309 244285440 slave.cpp:2046] Asked to kill task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:18.842150 244285440 http.cpp:178] HTTP POST for /slave(1)/api/v1/executor from 192.168.178.24:54206
> I0418 14:27:18.842331 244285440 slave.cpp:3207] Handling status update TASK_KILLING (UUID: a2f6eca7-b3e5-4e45-adcb-356f75355563) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:18.843214 242139136 status_update_manager.cpp:320] Received status update TASK_KILLING (UUID: a2f6eca7-b3e5-4e45-adcb-356f75355563) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:18.843387 243748864 slave.cpp:3605] Forwarding the update TASK_KILLING (UUID: a2f6eca7-b3e5-4e45-adcb-356f75355563) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000 to master@127.0.0.1:5050
> I0418 14:27:18.846459 242675712 status_update_manager.cpp:392] Received status update acknowledgement (UUID: a2f6eca7-b3e5-4e45-adcb-356f75355563) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:19.836699 240529408 slave.cpp:2046] Asked to kill task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:20.850658 240529408 slave.cpp:2046] Asked to kill task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:20.927338 241602560 http.cpp:178] HTTP POST for /slave(1)/api/v1/executor from 192.168.178.24:54206
> I0418 14:27:20.927465 241602560 slave.cpp:3207] Handling status update TASK_KILLED (UUID: fdcb43ec-271b-4976-aaf3-59a031cdc924) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:20.928771 240529408 status_update_manager.cpp:320] Received status update TASK_KILLED (UUID: fdcb43ec-271b-4976-aaf3-59a031cdc924) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
> I0418 14:27:20.928933 243748864 slave.cpp:3605] Forwarding the update TASK_KILLED (UUID: fdcb43ec-271b-4976-aaf3-59a031cdc924) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000 to master@127.0.0.1:5050
> 
> ```
> 
> 
> Thanks,
> 
> Alexander Rukletsov
> 
>


Re: Review Request 46325: Updated HTTP command executor to support kill policy in Kill event.

Posted by Alexander Rukletsov <ru...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/46325/
-----------------------------------------------------------

(Updated April 22, 2016, 2:46 p.m.)


Review request for mesos, Ben Mahler and Qian Zhang.


Bugs: MESOS-4908
    https://issues.apache.org/jira/browse/MESOS-4908


Repository: mesos


Description
-------

Kill policy can be provided in a kill event. In this case it should
take precedence over kill policy specified when the task was launched.
When kill event is issued multiple times during the task termination,
the signal escalation timeout (the time a task has between SIGTERM
and SIGKILL) may be reduced.

Since updating the delay timer (we use it for signal escalation delay)
is currently not possible, we cancel the existing signal timer and set
up a new one. `Clock::cancel()` guarantees that, if existed, the timer
is removed before the function returns; hence we do not set up more
than 1 timer for signal escalation delay.


Diffs (updated)
-----

  src/launcher/http_command_executor.cpp 2ce296740abac096aa85ad11dd4d191e42c1aa06 

Diff: https://reviews.apache.org/r/46325/diff/


Testing
-------

`make check`

Tested manually using modified `mesos-execute` in a way that **two** extra kill task requests are sent, 2s and 3s after receiving `TASK_RUNNING`. Each kill task request specifies `KillPolicy` with 1s grace period. Together with a kill *without* a kill policy scheduled 1s after the task is being launched, the task receives **three** kill requests in total.

Master: `./bin/mesos-master.sh --work_dir=/tmp/w/m --ip=127.0.0.1`
Agent: `./bin/mesos-slave.sh --work_dir=/tmp/w/s --master=127.0.0.1:5050 --http_command_executor`
Mesos-execute: `./src/mesos-execute --master=127.0.0.1:5050 --name=test --command="/Users/alex/bin/unresponsive_process" --env='{"GLOG_v": "2"}' --kill_after=1secs`

HTTP command executor log:
```
Received SUBSCRIBED event
Subscribed executor on alexr.fritz.box
Received LAUNCH event
Starting task test
sh -c '/Users/alex/bin/unresponsive_process'
Forked command at 17475
14455919081943275466
Received ACKNOWLEDGED event
17172602460659762152
Received KILL event
Received kill for task test
Sending SIGTERM to process tree at pid 17475
Sent SIGTERM to the following process trees:
[ 
--- 17475 /Users/alex/bin/unresponsive_process
]
4381544758593790168
Scheduling escalation to SIGKILL in 3secs from now
Received ACKNOWLEDGED event
Received KILL event
Received kill for task test
Rescheduling escalation to SIGKILL in 1secs from now
10370891801885978953
Process 17475 did not terminate after 1secs, sending SIGKILL to process tree at 17475
Killed the following process trees:
[ 
--- 17475 /Users/alex/bin/unresponsive_process
]
Received KILL event
Received kill for task test
Command terminated with signal Killed: 9 (pid: 17475)
```

Excerpt from the agent log that shows all 3 kill task requests and that the segnal escalation timeout was reduced from 3s to 1s:
```
I0418 14:27:17.825070 244285440 slave.cpp:3605] Forwarding the update TASK_RUNNING (UUID: 925e2d89-f6eb-464d-9a50-a74a8e07bc88) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000 to master@127.0.0.1:5050
I0418 14:27:17.831233 242139136 status_update_manager.cpp:392] Received status update acknowledgement (UUID: 925e2d89-f6eb-464d-9a50-a74a8e07bc88) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
I0418 14:27:18.834309 244285440 slave.cpp:2046] Asked to kill task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
I0418 14:27:18.842150 244285440 http.cpp:178] HTTP POST for /slave(1)/api/v1/executor from 192.168.178.24:54206
I0418 14:27:18.842331 244285440 slave.cpp:3207] Handling status update TASK_KILLING (UUID: a2f6eca7-b3e5-4e45-adcb-356f75355563) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
I0418 14:27:18.843214 242139136 status_update_manager.cpp:320] Received status update TASK_KILLING (UUID: a2f6eca7-b3e5-4e45-adcb-356f75355563) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
I0418 14:27:18.843387 243748864 slave.cpp:3605] Forwarding the update TASK_KILLING (UUID: a2f6eca7-b3e5-4e45-adcb-356f75355563) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000 to master@127.0.0.1:5050
I0418 14:27:18.846459 242675712 status_update_manager.cpp:392] Received status update acknowledgement (UUID: a2f6eca7-b3e5-4e45-adcb-356f75355563) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
I0418 14:27:19.836699 240529408 slave.cpp:2046] Asked to kill task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
I0418 14:27:20.850658 240529408 slave.cpp:2046] Asked to kill task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
I0418 14:27:20.927338 241602560 http.cpp:178] HTTP POST for /slave(1)/api/v1/executor from 192.168.178.24:54206
I0418 14:27:20.927465 241602560 slave.cpp:3207] Handling status update TASK_KILLED (UUID: fdcb43ec-271b-4976-aaf3-59a031cdc924) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
I0418 14:27:20.928771 240529408 status_update_manager.cpp:320] Received status update TASK_KILLED (UUID: fdcb43ec-271b-4976-aaf3-59a031cdc924) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
I0418 14:27:20.928933 243748864 slave.cpp:3605] Forwarding the update TASK_KILLED (UUID: fdcb43ec-271b-4976-aaf3-59a031cdc924) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000 to master@127.0.0.1:5050

```


Thanks,

Alexander Rukletsov


Re: Review Request 46325: Updated HTTP command executor to support kill policy in Kill event.

Posted by Alexander Rukletsov <ru...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/46325/
-----------------------------------------------------------

(Updated April 21, 2016, 2:28 p.m.)


Review request for mesos, Ben Mahler and Qian Zhang.


Bugs: MESOS-4908
    https://issues.apache.org/jira/browse/MESOS-4908


Repository: mesos


Description
-------

Kill policy can be provided in a kill event. In this case it should
take precedence over kill policy specified when the task was launched.
When kill event is issued multiple times during the task termination,
the signal escalation timeout (the time a task has between SIGTERM
and SIGKILL) may be reduced.

Since updating the delay timer (we use it for signal escalation delay)
is currently not possible, we cancel the existing signal timer and set
up a new one. `Clock::cancel()` guarantees that, if existed, the timer
is removed before the function returns; hence we do not set up more
than 1 timer for signal escalation delay.


Diffs (updated)
-----

  src/launcher/http_command_executor.cpp 9dfe48108cababeb2d4a6af74434880d79245c21 

Diff: https://reviews.apache.org/r/46325/diff/


Testing
-------

`make check`

Tested manually using modified `mesos-execute` in a way that **two** extra kill task requests are sent, 2s and 3s after receiving `TASK_RUNNING`. Each kill task request specifies `KillPolicy` with 1s grace period. Together with a kill *without* a kill policy scheduled 1s after the task is being launched, the task receives **three** kill requests in total.

Master: `./bin/mesos-master.sh --work_dir=/tmp/w/m --ip=127.0.0.1`
Agent: `./bin/mesos-slave.sh --work_dir=/tmp/w/s --master=127.0.0.1:5050 --http_command_executor`
Mesos-execute: `./src/mesos-execute --master=127.0.0.1:5050 --name=test --command="/Users/alex/bin/unresponsive_process" --env='{"GLOG_v": "2"}' --kill_after=1secs`

HTTP command executor log:
```
Received SUBSCRIBED event
Subscribed executor on alexr.fritz.box
Received LAUNCH event
Starting task test
sh -c '/Users/alex/bin/unresponsive_process'
Forked command at 17475
14455919081943275466
Received ACKNOWLEDGED event
17172602460659762152
Received KILL event
Received kill for task test
Sending SIGTERM to process tree at pid 17475
Sent SIGTERM to the following process trees:
[ 
--- 17475 /Users/alex/bin/unresponsive_process
]
4381544758593790168
Scheduling escalation to SIGKILL in 3secs from now
Received ACKNOWLEDGED event
Received KILL event
Received kill for task test
Rescheduling escalation to SIGKILL in 1secs from now
10370891801885978953
Process 17475 did not terminate after 1secs, sending SIGKILL to process tree at 17475
Killed the following process trees:
[ 
--- 17475 /Users/alex/bin/unresponsive_process
]
Received KILL event
Received kill for task test
Command terminated with signal Killed: 9 (pid: 17475)
```

Excerpt from the agent log that shows all 3 kill task requests and that the segnal escalation timeout was reduced from 3s to 1s:
```
I0418 14:27:17.825070 244285440 slave.cpp:3605] Forwarding the update TASK_RUNNING (UUID: 925e2d89-f6eb-464d-9a50-a74a8e07bc88) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000 to master@127.0.0.1:5050
I0418 14:27:17.831233 242139136 status_update_manager.cpp:392] Received status update acknowledgement (UUID: 925e2d89-f6eb-464d-9a50-a74a8e07bc88) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
I0418 14:27:18.834309 244285440 slave.cpp:2046] Asked to kill task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
I0418 14:27:18.842150 244285440 http.cpp:178] HTTP POST for /slave(1)/api/v1/executor from 192.168.178.24:54206
I0418 14:27:18.842331 244285440 slave.cpp:3207] Handling status update TASK_KILLING (UUID: a2f6eca7-b3e5-4e45-adcb-356f75355563) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
I0418 14:27:18.843214 242139136 status_update_manager.cpp:320] Received status update TASK_KILLING (UUID: a2f6eca7-b3e5-4e45-adcb-356f75355563) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
I0418 14:27:18.843387 243748864 slave.cpp:3605] Forwarding the update TASK_KILLING (UUID: a2f6eca7-b3e5-4e45-adcb-356f75355563) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000 to master@127.0.0.1:5050
I0418 14:27:18.846459 242675712 status_update_manager.cpp:392] Received status update acknowledgement (UUID: a2f6eca7-b3e5-4e45-adcb-356f75355563) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
I0418 14:27:19.836699 240529408 slave.cpp:2046] Asked to kill task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
I0418 14:27:20.850658 240529408 slave.cpp:2046] Asked to kill task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
I0418 14:27:20.927338 241602560 http.cpp:178] HTTP POST for /slave(1)/api/v1/executor from 192.168.178.24:54206
I0418 14:27:20.927465 241602560 slave.cpp:3207] Handling status update TASK_KILLED (UUID: fdcb43ec-271b-4976-aaf3-59a031cdc924) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
I0418 14:27:20.928771 240529408 status_update_manager.cpp:320] Received status update TASK_KILLED (UUID: fdcb43ec-271b-4976-aaf3-59a031cdc924) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
I0418 14:27:20.928933 243748864 slave.cpp:3605] Forwarding the update TASK_KILLED (UUID: fdcb43ec-271b-4976-aaf3-59a031cdc924) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000 to master@127.0.0.1:5050

```


Thanks,

Alexander Rukletsov


Re: Review Request 46325: Updated HTTP command executor to support kill policy in Kill event.

Posted by Alexander Rukletsov <ru...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/46325/
-----------------------------------------------------------

(Updated April 21, 2016, 2:11 p.m.)


Review request for mesos, Ben Mahler and Qian Zhang.


Bugs: MESOS-4908
    https://issues.apache.org/jira/browse/MESOS-4908


Repository: mesos


Description
-------

Kill policy can be provided in a kill event. In this case it should
take precedence over kill policy specified when the task was launched.
When kill event is issued multiple times during the task termination,
the signal escalation timeout (the time a task has between SIGTERM
and SIGKILL) may be reduced.

Since updating the delay timer (we use it for signal escalation delay)
is currently not possible, we cancel the existing signal timer and set
up a new one. `Clock::cancel()` guarantees that, if existed, the timer
is removed before the function returns; hence we do not set up more
than 1 timer for signal escalation delay.


Diffs
-----

  src/launcher/http_command_executor.cpp ad484e0e6f5067b6c166111c91b1ff1e8c05d9ac 

Diff: https://reviews.apache.org/r/46325/diff/


Testing
-------

`make check`

Tested manually using modified `mesos-execute` in a way that **two** extra kill task requests are sent, 2s and 3s after receiving `TASK_RUNNING`. Each kill task request specifies `KillPolicy` with 1s grace period. Together with a kill *without* a kill policy scheduled 1s after the task is being launched, the task receives **three** kill requests in total.

Master: `./bin/mesos-master.sh --work_dir=/tmp/w/m --ip=127.0.0.1`
Agent: `./bin/mesos-slave.sh --work_dir=/tmp/w/s --master=127.0.0.1:5050 --http_command_executor`
Mesos-execute: `./src/mesos-execute --master=127.0.0.1:5050 --name=test --command="/Users/alex/bin/unresponsive_process" --env='{"GLOG_v": "2"}' --kill_after=1secs`

HTTP command executor log:
```
Received SUBSCRIBED event
Subscribed executor on alexr.fritz.box
Received LAUNCH event
Starting task test
sh -c '/Users/alex/bin/unresponsive_process'
Forked command at 17475
14455919081943275466
Received ACKNOWLEDGED event
17172602460659762152
Received KILL event
Received kill for task test
Sending SIGTERM to process tree at pid 17475
Sent SIGTERM to the following process trees:
[ 
--- 17475 /Users/alex/bin/unresponsive_process
]
4381544758593790168
Scheduling escalation to SIGKILL in 3secs from now
Received ACKNOWLEDGED event
Received KILL event
Received kill for task test
Rescheduling escalation to SIGKILL in 1secs from now
10370891801885978953
Process 17475 did not terminate after 1secs, sending SIGKILL to process tree at 17475
Killed the following process trees:
[ 
--- 17475 /Users/alex/bin/unresponsive_process
]
Received KILL event
Received kill for task test
Command terminated with signal Killed: 9 (pid: 17475)
```

Excerpt from the agent log that shows all 3 kill task requests and that the segnal escalation timeout was reduced from 3s to 1s:
```
I0418 14:27:17.825070 244285440 slave.cpp:3605] Forwarding the update TASK_RUNNING (UUID: 925e2d89-f6eb-464d-9a50-a74a8e07bc88) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000 to master@127.0.0.1:5050
I0418 14:27:17.831233 242139136 status_update_manager.cpp:392] Received status update acknowledgement (UUID: 925e2d89-f6eb-464d-9a50-a74a8e07bc88) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
I0418 14:27:18.834309 244285440 slave.cpp:2046] Asked to kill task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
I0418 14:27:18.842150 244285440 http.cpp:178] HTTP POST for /slave(1)/api/v1/executor from 192.168.178.24:54206
I0418 14:27:18.842331 244285440 slave.cpp:3207] Handling status update TASK_KILLING (UUID: a2f6eca7-b3e5-4e45-adcb-356f75355563) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
I0418 14:27:18.843214 242139136 status_update_manager.cpp:320] Received status update TASK_KILLING (UUID: a2f6eca7-b3e5-4e45-adcb-356f75355563) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
I0418 14:27:18.843387 243748864 slave.cpp:3605] Forwarding the update TASK_KILLING (UUID: a2f6eca7-b3e5-4e45-adcb-356f75355563) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000 to master@127.0.0.1:5050
I0418 14:27:18.846459 242675712 status_update_manager.cpp:392] Received status update acknowledgement (UUID: a2f6eca7-b3e5-4e45-adcb-356f75355563) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
I0418 14:27:19.836699 240529408 slave.cpp:2046] Asked to kill task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
I0418 14:27:20.850658 240529408 slave.cpp:2046] Asked to kill task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
I0418 14:27:20.927338 241602560 http.cpp:178] HTTP POST for /slave(1)/api/v1/executor from 192.168.178.24:54206
I0418 14:27:20.927465 241602560 slave.cpp:3207] Handling status update TASK_KILLED (UUID: fdcb43ec-271b-4976-aaf3-59a031cdc924) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
I0418 14:27:20.928771 240529408 status_update_manager.cpp:320] Received status update TASK_KILLED (UUID: fdcb43ec-271b-4976-aaf3-59a031cdc924) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
I0418 14:27:20.928933 243748864 slave.cpp:3605] Forwarding the update TASK_KILLED (UUID: fdcb43ec-271b-4976-aaf3-59a031cdc924) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000 to master@127.0.0.1:5050

```


Thanks,

Alexander Rukletsov


Re: Review Request 46325: Updated HTTP command executor to support kill policy in Kill event.

Posted by Alexander Rukletsov <ru...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/46325/
-----------------------------------------------------------

(Updated April 21, 2016, 2:06 p.m.)


Review request for mesos, Ben Mahler and Qian Zhang.


Bugs: MESOS-4908
    https://issues.apache.org/jira/browse/MESOS-4908


Repository: mesos


Description
-------

Kill policy can be provided in a kill event. In this case it should
take precedence over kill policy specified when the task was launched.
When kill event is issued multiple times during the task termination,
the signal escalation timeout (the time a task has between SIGTERM
and SIGKILL) may be reduced.

Since updating the delay timer (we use it for signal escalation delay)
is currently not possible, we cancel the existing signal timer and set
up a new one. `Clock::cancel()` guarantees that, if existed, the timer
is removed before the function returns; hence we do not set up more
than 1 timer for signal escalation delay.


Diffs (updated)
-----

  src/launcher/http_command_executor.cpp ad484e0e6f5067b6c166111c91b1ff1e8c05d9ac 

Diff: https://reviews.apache.org/r/46325/diff/


Testing
-------

`make check`

Tested manually using modified `mesos-execute` in a way that **two** extra kill task requests are sent, 2s and 3s after receiving `TASK_RUNNING`. Each kill task request specifies `KillPolicy` with 1s grace period. Together with a kill *without* a kill policy scheduled 1s after the task is being launched, the task receives **three** kill requests in total.

Master: `./bin/mesos-master.sh --work_dir=/tmp/w/m --ip=127.0.0.1`
Agent: `./bin/mesos-slave.sh --work_dir=/tmp/w/s --master=127.0.0.1:5050 --http_command_executor`
Mesos-execute: `./src/mesos-execute --master=127.0.0.1:5050 --name=test --command="/Users/alex/bin/unresponsive_process" --env='{"GLOG_v": "2"}' --kill_after=1secs`

HTTP command executor log:
```
Received SUBSCRIBED event
Subscribed executor on alexr.fritz.box
Received LAUNCH event
Starting task test
sh -c '/Users/alex/bin/unresponsive_process'
Forked command at 17475
14455919081943275466
Received ACKNOWLEDGED event
17172602460659762152
Received KILL event
Received kill for task test
Sending SIGTERM to process tree at pid 17475
Sent SIGTERM to the following process trees:
[ 
--- 17475 /Users/alex/bin/unresponsive_process
]
4381544758593790168
Scheduling escalation to SIGKILL in 3secs from now
Received ACKNOWLEDGED event
Received KILL event
Received kill for task test
Rescheduling escalation to SIGKILL in 1secs from now
10370891801885978953
Process 17475 did not terminate after 1secs, sending SIGKILL to process tree at 17475
Killed the following process trees:
[ 
--- 17475 /Users/alex/bin/unresponsive_process
]
Received KILL event
Received kill for task test
Command terminated with signal Killed: 9 (pid: 17475)
```

Excerpt from the agent log that shows all 3 kill task requests and that the segnal escalation timeout was reduced from 3s to 1s:
```
I0418 14:27:17.825070 244285440 slave.cpp:3605] Forwarding the update TASK_RUNNING (UUID: 925e2d89-f6eb-464d-9a50-a74a8e07bc88) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000 to master@127.0.0.1:5050
I0418 14:27:17.831233 242139136 status_update_manager.cpp:392] Received status update acknowledgement (UUID: 925e2d89-f6eb-464d-9a50-a74a8e07bc88) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
I0418 14:27:18.834309 244285440 slave.cpp:2046] Asked to kill task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
I0418 14:27:18.842150 244285440 http.cpp:178] HTTP POST for /slave(1)/api/v1/executor from 192.168.178.24:54206
I0418 14:27:18.842331 244285440 slave.cpp:3207] Handling status update TASK_KILLING (UUID: a2f6eca7-b3e5-4e45-adcb-356f75355563) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
I0418 14:27:18.843214 242139136 status_update_manager.cpp:320] Received status update TASK_KILLING (UUID: a2f6eca7-b3e5-4e45-adcb-356f75355563) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
I0418 14:27:18.843387 243748864 slave.cpp:3605] Forwarding the update TASK_KILLING (UUID: a2f6eca7-b3e5-4e45-adcb-356f75355563) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000 to master@127.0.0.1:5050
I0418 14:27:18.846459 242675712 status_update_manager.cpp:392] Received status update acknowledgement (UUID: a2f6eca7-b3e5-4e45-adcb-356f75355563) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
I0418 14:27:19.836699 240529408 slave.cpp:2046] Asked to kill task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
I0418 14:27:20.850658 240529408 slave.cpp:2046] Asked to kill task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
I0418 14:27:20.927338 241602560 http.cpp:178] HTTP POST for /slave(1)/api/v1/executor from 192.168.178.24:54206
I0418 14:27:20.927465 241602560 slave.cpp:3207] Handling status update TASK_KILLED (UUID: fdcb43ec-271b-4976-aaf3-59a031cdc924) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
I0418 14:27:20.928771 240529408 status_update_manager.cpp:320] Received status update TASK_KILLED (UUID: fdcb43ec-271b-4976-aaf3-59a031cdc924) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000
I0418 14:27:20.928933 243748864 slave.cpp:3605] Forwarding the update TASK_KILLED (UUID: fdcb43ec-271b-4976-aaf3-59a031cdc924) for task test of framework ab374773-a018-4531-923b-899cf1e4f573-0000 to master@127.0.0.1:5050

```


Thanks,

Alexander Rukletsov