Posted to issues@aurora.apache.org by "Zameer Manji (JIRA)" <ji...@apache.org> on 2016/10/06 23:36:20 UTC

[jira] [Commented] (AURORA-1789) namespaces/pid isolator causes lost process

    [ https://issues.apache.org/jira/browse/AURORA-1789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15553561#comment-15553561 ] 

Zameer Manji commented on AURORA-1789:
--------------------------------------

Here is my gut feeling from a first pass over these logs and the code. I'm focusing on {{apache/thermos/core/process.py}}.

The coordinator appears to be forked and running. Once forked, it is supposed to execute the target process and then write a checkpoint marking the process as RUNNING and recording its pid. That never happens here: every ProcessStatus in the log still has {{pid=None}} and {{start_time=None}}.
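
To make the expected handshake concrete, here is a minimal sketch of a coordinator that forks, launches a target, and checkpoints the target's pid. It is not the Thermos implementation: {{run_coordinator}}, {{read_checkpoints}}, and the JSON-lines checkpoint format are hypothetical names invented for illustration, and the real code in {{process.py}} is more involved.

```python
import json
import os
import subprocess
import sys
import tempfile


def run_coordinator(argv, checkpoint_path):
    """Fork a coordinator that launches argv and checkpoints the target pid.

    Hypothetical helper for illustration only; returns the coordinator pid
    after the coordinator has exited.
    """
    coordinator_pid = os.fork()
    if coordinator_pid == 0:
        # Child branch: this process plays the role of the coordinator.
        status = 1
        try:
            target = subprocess.Popen(argv)
            # The step that appears to be missing in the reported logs:
            # record that the target is RUNNING, together with its pid.
            with open(checkpoint_path, "a") as fp:
                fp.write(json.dumps({"state": "RUNNING", "pid": target.pid}) + "\n")
            status = 0 if target.wait() == 0 else 1
        finally:
            # Always leave via _exit so the child never returns into the
            # parent's code path, even if launching the target raised.
            os._exit(status)
    os.waitpid(coordinator_pid, 0)
    return coordinator_pid


def read_checkpoints(path):
    """Return every checkpoint record written to path, in order."""
    with open(path) as fp:
        return [json.loads(line) for line in fp if line.strip()]


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    os.close(fd)
    run_coordinator([sys.executable, "-c", "pass"], path)
    print(read_checkpoints(path))
```

If launching the target fails inside the coordinator, the coordinator exits without writing the RUNNING record. From the parent's point of view that looks like the situation in this ticket: a known coordinator pid, but no pid for the process itself.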

The other factor is that this task runs under the unified containerizer and has a task filesystem. I suspect a bug inside the {{execute()}} method, perhaps in how it tries to execute the target process via the mesos-containerizer.
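
As a quick check of this reading against the logs quoted below, the following sketch parses the {{ProcessStatus(...)}} records out of two of the "Detected a LOST task" lines. The log text is copied from the ticket; the regexes and the {{parse_lost}} helper are my own and only assume the {{key=value, key=value}} layout seen in those lines.

```python
import re

# Two consecutive LOST records, copied verbatim from the ticket's log excerpt.
LOG = """\
I1006 21:37:22.929864 75 runner.py:825] Detected a LOST task: ProcessStatus(seq=205, process=u'BigBrother start', start_time=None, coordinator_pid=1144, pid=None, return_code=None, state=1, stop_time=None, fork_time=1475789782.842882)
I1006 21:38:23.023725 75 runner.py:825] Detected a LOST task: ProcessStatus(seq=208, process=u'BigBrother start', start_time=None, coordinator_pid=1157, pid=None, return_code=None, state=1, stop_time=None, fork_time=1475789842.935872)
"""

STATUS_RE = re.compile(r"Detected a LOST task: ProcessStatus\((.*)\)\s*$")
FIELD_RE = re.compile(r"(\w+)=([^,]+)")


def parse_lost(log_text):
    """Return one dict of raw string fields per LOST ProcessStatus line."""
    records = []
    for line in log_text.splitlines():
        match = STATUS_RE.search(line)
        if match:
            records.append(dict(FIELD_RE.findall(match.group(1))))
    return records


records = parse_lost(LOG)

# A coordinator pid was recorded each time, but the target pid and start time never were.
coordinator_seen = all(r["coordinator_pid"] != "None" for r in records)
target_never_started = all(
    r["pid"] == "None" and r["start_time"] == "None" for r in records
)

# Seconds between successive fork attempts.
fork_times = [float(r["fork_time"]) for r in records]
retry_gaps = [later - earlier for earlier, later in zip(fork_times, fork_times[1:])]

print(coordinator_seen, target_never_started, retry_gaps)
```

Both records carry a coordinator pid while {{pid}} and {{start_time}} stay unset, and the fork attempts are roughly a minute apart, which matches the repeating fork / LOST pattern in the runner log.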



> namespaces/pid isolator causes lost process
> -------------------------------------------
>
>                 Key: AURORA-1789
>                 URL: https://issues.apache.org/jira/browse/AURORA-1789
>             Project: Aurora
>          Issue Type: Bug
>          Components: Executor
>    Affects Versions: 0.16.0
>            Reporter: Justin Pinkul
>            Assignee: Zameer Manji
>
> When using the Mesos containerizer with the namespaces/pid isolator and a Docker image, the Thermos executor is unable to launch processes. The executor forks the process but is then unable to locate it after the fork.
> {code:title=thermos_runner.INFO}
> I1006 21:36:22.842595 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:37:22.929864 75 runner.py:825] Detected a LOST task: ProcessStatus(seq=205, process=u'BigBrother start', start_time=None, coordinator_pid=1144, pid=None, return_code=None, state=1, stop_time=None, fork_time=1475789782.842882)
> I1006 21:37:22.931456 75 helper.py:153]   Coordinator BigBrother start [pid: 1144] completed.
> I1006 21:37:22.931732 75 runner.py:133] Process BigBrother start had an abnormal termination
> I1006 21:37:22.935580 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:38:23.023725 75 runner.py:825] Detected a LOST task: ProcessStatus(seq=208, process=u'BigBrother start', start_time=None, coordinator_pid=1157, pid=None, return_code=None, state=1, stop_time=None, fork_time=1475789842.935872)
> I1006 21:38:23.025332 75 helper.py:153]   Coordinator BigBrother start [pid: 1157] completed.
> I1006 21:38:23.025629 75 runner.py:133] Process BigBrother start had an abnormal termination
> I1006 21:38:23.029414 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:39:23.117208 75 runner.py:825] Detected a LOST task: ProcessStatus(seq=211, process=u'BigBrother start', start_time=None, coordinator_pid=1170, pid=None, return_code=None, state=1, stop_time=None, fork_time=1475789903.029694)
> I1006 21:39:23.118841 75 helper.py:153]   Coordinator BigBrother start [pid: 1170] completed.
> I1006 21:39:23.119134 75 runner.py:133] Process BigBrother start had an abnormal termination
> I1006 21:39:23.122920 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:40:23.211095 75 runner.py:825] Detected a LOST task: ProcessStatus(seq=214, process=u'BigBrother start', start_time=None, coordinator_pid=1183, pid=None, return_code=None, state=1, stop_time=None, fork_time=1475789963.123206)
> I1006 21:40:23.212711 75 helper.py:153]   Coordinator BigBrother start [pid: 1183] completed.
> I1006 21:40:23.213006 75 runner.py:133] Process BigBrother start had an abnormal termination
> I1006 21:40:23.216810 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:41:23.305505 75 runner.py:825] Detected a LOST task: ProcessStatus(seq=217, process=u'BigBrother start', start_time=None, coordinator_pid=1196, pid=None, return_code=None, state=1, stop_time=None, fork_time=1475790023.21709)
> I1006 21:41:23.307157 75 helper.py:153]   Coordinator BigBrother start [pid: 1196] completed.
> I1006 21:41:23.307450 75 runner.py:133] Process BigBrother start had an abnormal termination
> I1006 21:41:23.311230 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:42:23.398277 75 runner.py:825] Detected a LOST task: ProcessStatus(seq=220, process=u'BigBrother start', start_time=None, coordinator_pid=1209, pid=None, return_code=None, state=1, stop_time=None, fork_time=1475790083.311512)
> I1006 21:42:23.399893 75 helper.py:153]   Coordinator BigBrother start [pid: 1209] completed.
> I1006 21:42:23.400185 75 runner.py:133] Process BigBrother start had an abnormal termination
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)