Posted to dev@oozie.apache.org by "Diana Carroll (JIRA)" <ji...@apache.org> on 2015/08/07 15:46:45 UTC

[jira] [Updated] (OOZIE-2326) oozie/yarn/spark: active container remains after failed job

     [ https://issues.apache.org/jira/browse/OOZIE-2326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Diana Carroll updated OOZIE-2326:
---------------------------------
    Attachment: yarnbug1.png
                ooziejob-logs.txt
                yarnbug2.png
                container-logs.txt

> oozie/yarn/spark: active container remains after failed job
> -----------------------------------------------------------
>
>                 Key: OOZIE-2326
>                 URL: https://issues.apache.org/jira/browse/OOZIE-2326
>             Project: Oozie
>          Issue Type: Bug
>          Components: workflow
>    Affects Versions: 4.1.0
>         Environment: pseudo-distributed (single VM), CentOS 6.6, CDH 5.4.3
>            Reporter: Diana Carroll
>         Attachments: container-logs.txt, ooziejob-logs.txt, yarnbug1.png, yarnbug2.png
>
>
> The issue occurs when I launch a Spark job (local mode) that fails.  (My example failed because I tried to read a non-existent file.)  When this occurs, the job fails and YARN ends up in an inconsistent state: the RM shows the launcher job has completed, but a container for the job is still live on the slave node.  Because I'm running in pseudo-distributed mode, this hangs my whole cluster: no other jobs can run because there are only resources for a single container, and that container is still held by the dead Oozie launcher.
> If I wait long enough, YARN will eventually time out and release the container and start accepting new jobs.  But until then I'm dead in the water.
> Attaching screenshots that show the state right after running the failed job:
> the RM shows no jobs running
> the node shows one container running
> Also attaching a log file for the oozie job and the container.
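
The mismatch described above (the RM reports no running applications while the node still reports a live container) could also be checked programmatically instead of from screenshots. The following is a minimal sketch, not part of the original report. It assumes the JSON shapes returned by the YARN ResourceManager REST endpoint /ws/v1/cluster/apps?states=RUNNING and the NodeManager REST endpoint /ws/v1/node/containers, already fetched and parsed into dicts. The function names app_id_of and orphaned_containers and the sample IDs are hypothetical.

```python
import re

# Container IDs look like container_<clusterTs>_<appSeq>_<attempt>_<num>,
# optionally with an epoch segment (container_e17_...). Assumed format.
_CONTAINER_RE = re.compile(r"^container_(?:e\d+_)?(\d+)_(\d+)_\d+_\d+$")


def app_id_of(container_id):
    """Derive the owning application ID from a container ID (hypothetical helper)."""
    match = _CONTAINER_RE.match(container_id)
    if match is None:
        raise ValueError("unrecognized container id: %r" % container_id)
    return "application_%s_%s" % match.groups()


def orphaned_containers(rm_apps_json, nm_containers_json):
    """Return IDs of containers on the node whose application the RM
    does not list as running.

    rm_apps_json:       parsed body of the RM running-apps response
    nm_containers_json: parsed body of the NM containers response
    Both services return null for the outer key when the list is empty,
    hence the `or` fallbacks below.
    """
    apps = (rm_apps_json.get("apps") or {}).get("app") or []
    running_apps = {app["id"] for app in apps}

    containers = (nm_containers_json.get("containers") or {}).get("container") or []
    return [
        c["id"] for c in containers
        if app_id_of(c["id"]) not in running_apps
    ]


if __name__ == "__main__":
    # Made-up sample data mirroring the reported state:
    # the RM lists nothing running, the node still has one container.
    rm_response = {"apps": None}
    nm_response = {
        "containers": {
            "container": [
                {"id": "container_1438954000000_0001_01_000001", "state": "RUNNING"}
            ]
        }
    }
    print(orphaned_containers(rm_response, nm_response))
```

A non-empty result corresponds to the state shown in the attached screenshots. The sketch only detects the leftover container; it does not release it.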



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)