Posted to issues@aurora.apache.org by "Reza Motamedi (JIRA)" <ji...@apache.org> on 2017/07/06 23:39:00 UTC
[jira] [Created] (AURORA-1941) Cause container restart when a process is killed with a signal.
Reza Motamedi created AURORA-1941:
-------------------------------------
Summary: Cause container restart when a process is killed with a signal.
Key: AURORA-1941
URL: https://issues.apache.org/jira/browse/AURORA-1941
Project: Aurora
Issue Type: Task
Reporter: Reza Motamedi
Priority: Minor
Say you have the following task config. Note that all processes have `max_failures` set to 1.
{code}
{
  "processes": [
    {
      "daemon": false,
      "name": "hello-0",
      "max_failures": 1,
      "ephemeral": false,
      "min_duration": 5,
      "cmdline": "while true; do echo `date`; sleep 60; done",
      "final": false
    },
    {
      "daemon": false,
      "name": "hello-1",
      "max_failures": 1,
      "ephemeral": false,
      "min_duration": 5,
      "cmdline": "while true; do echo `date`; sleep 60; done",
      "final": false
    },
    {
      "daemon": false,
      "name": "hello-2",
      "max_failures": 1,
      "ephemeral": false,
      "min_duration": 5,
      "cmdline": "while true; do echo `date`; sleep 60; done",
      "final": false
    }
  ],
  "name": "hello-0",
  "finalization_wait": 30,
  "max_failures": 1,
  "max_concurrency": 0,
  "resources": {
    "gpu": 0,
    "disk": 16777216,
    "ram": 1048576,
    "cpu": 0.1
  },
  "constraints": []
}
{code}
Say we kill one of these Thermos processes. In this case, the process gets restarted, since it technically did not crash or fail. Even if you kill it with `kill -SIGSEGV <pid>`, it still comes back up and the number of failures stays at 0. The kill is instead registered as the process being lost, and that count correctly increases.
I think it makes sense to check the exit code when a process is killed and count it as a failure if the exit code is not `0`.
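To illustrate why the exit code is a usable signal here, the following is a minimal sketch (not Thermos code) showing that a child killed by a signal reports a nonzero return code. It assumes a POSIX system and uses only the standard `subprocess` module, which reports death by signal N as return code -N; the helper name `return_code_after_signal` is hypothetical.

```python
import signal
import subprocess
import sys


def return_code_after_signal(sig):
    """Start a long-running child, send it `sig`, and return its return code."""
    # The sleeping child stands in for the `while true` loop in the task config.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    child.send_signal(sig)
    # On POSIX, subprocess reports a child killed by signal N as -N.
    return child.wait()


if __name__ == "__main__":
    code = return_code_after_signal(signal.SIGKILL)
    print("killed by signal" if code < 0 else "exited on its own", code)
```

A `kill -9` therefore shows up as `-9`, never as `0`, so a check against `0` would be enough to tell a killed process from a clean exit.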
Note that if one of the processes fails or crashes, it is handled differently:
- on_killed
{noformat}
D0706 18:38:32.944282 12808 runner.py:156] Process on_killed ProcessStatus(seq=3, process='hello-2', start_time=None, coordinator_pid=None, pid=None, return_code=-9, state=4, stop_time=1499366312.421471, fork_time=None)
{noformat}
- on_failed
{noformat}
D0706 22:37:14.829272 23216 runner.py:138] Process on_failed ProcessStatus(seq=3, process='hello-bad', start_time=None, coordinator_pid=None, pid=None, return_code=139, state=5, stop_time=1499380634.768661, fork_time=None)
{noformat}
We can just check the `ProcessStatus.return_code` and act accordingly.
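A rough sketch of what such a check could look like is given below. It is not the Thermos runner's actual code: the `ProcessStatus` tuple is a hypothetical stand-in carrying only a few of the fields visible in the logs, and `counts_as_failure` and `describe` are made-up helper names. Treating a missing return code as "not a failure" is an assumption, as is reading codes above 128 as the shell convention of 128 plus the signal number (139 would then correspond to signal 11, SIGSEGV on Linux).

```python
from collections import namedtuple

# Hypothetical stand-in with a subset of the fields shown in the logs.
ProcessStatus = namedtuple("ProcessStatus", ["process", "return_code", "state"])


def counts_as_failure(status):
    """Return True when the recorded return code is known and not 0."""
    # Assumption: an unknown (None) return code is not counted as a failure.
    return status.return_code is not None and status.return_code != 0


def describe(status):
    """Give a human-readable reading of the return code."""
    code = status.return_code
    if code is None:
        return "unknown"
    if code == 0:
        return "success"
    if code < 0:
        # Negative codes follow the "killed by signal N is reported as -N" convention.
        return "killed by signal %d" % -code
    if code > 128:
        # Assumption: a shell reported 128 + signal number.
        return "exit code %d (possibly signal %d)" % (code, code - 128)
    return "exit code %d" % code


if __name__ == "__main__":
    for status in (ProcessStatus("hello-2", -9, 4), ProcessStatus("hello-bad", 139, 5)):
        print(status.process, counts_as_failure(status), describe(status))
```

With a check of this kind, both the `on_killed` case (`return_code=-9`) and the `on_failed` case (`return_code=139`) would count toward `max_failures`.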