Posted to issues@spark.apache.org by "Ash Pran (JIRA)" <ji...@apache.org> on 2016/07/27 16:09:20 UTC

[jira] [Updated] (SPARK-16752) Spark Job Server not releasing jobs from "running list" even after yarn completes the job

     [ https://issues.apache.org/jira/browse/SPARK-16752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ash Pran updated SPARK-16752:
-----------------------------
    Attachment: SJS_JOBS_RUNNING
                SJS_JOB_LOG_CONSOLE
                SJS_JOB_COMP_YARN
                SJS_Limited_Log.txt

Please see the attached files for further reference.

> Spark Job Server not releasing jobs from "running list" even after yarn completes the job
> -----------------------------------------------------------------------------------------
>
>                 Key: SPARK-16752
>                 URL: https://issues.apache.org/jira/browse/SPARK-16752
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 0.6.0, 1.5.0
>            Environment: SJS version 0.6.1 and Spark 1.5.0 running in yarn-client mode
>            Reporter: Ash Pran
>              Labels: patch
>         Attachments: SJS_JOBS_RUNNING, SJS_JOB_COMP_YARN, SJS_JOB_LOG_CONSOLE, SJS_Limited_Log.txt
>
>
> We are seeing a strange issue with Spark Job Server (SJS).
> We are running SJS 0.6.1 and Spark 1.5.0 in "yarn-client" mode. The contents of settings.sh for SJS are as below:
> ********************************************************************
> INSTALL_DIR=$(cd `dirname $0`; pwd -P)   # install dir of the job server (this script's directory)
> LOG_DIR=$INSTALL_DIR/logs                # job server log directory
> PIDFILE=spark-jobserver.pid              # pid file for the server process
> JOBSERVER_MEMORY=16G                     # memory for the job server process
> SPARK_VERSION=1.5.0
> SPARK_HOME=/opt/cloudera/parcels/CDH-5.5.2-1.cdh5.5.2.p0.4/lib/spark   # CDH 5.5.2 parcel
> SPARK_CONF_DIR=$SPARK_HOME/conf
> SCALA_VERSION=2.10.4
> ********************************************************************
> We are using fair scheduling with 2 pools and 50 executors of 1 GB each.
> We also have max-jobs-per-context set to the number of cores, which is 48 (a config sketch follows below).
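> To make the setup concrete, below is a minimal sketch of what this configuration might look like on disk. The pool names (poolA, poolB) and file paths here are illustrative placeholders, not our exact files; only the numbers (2 pools, 50 executors of 1g, 48 concurrent jobs) come from our setup.
> ********************************************************************
> #!/bin/sh
> # Hypothetical two-pool fair scheduler file:
> cat > /etc/spark/fairscheduler.xml <<'EOF'
> <?xml version="1.0"?>
> <allocations>
>   <pool name="poolA">
>     <schedulingMode>FAIR</schedulingMode>
>     <weight>1</weight>
>     <minShare>2</minShare>
>   </pool>
>   <pool name="poolB">
>     <schedulingMode>FAIR</schedulingMode>
>     <weight>1</weight>
>     <minShare>2</minShare>
>   </pool>
> </allocations>
> EOF
> # Matching Spark properties (standard Spark conf keys, placeholder path):
> cat >> "$SPARK_CONF_DIR/spark-defaults.conf" <<'EOF'
> spark.scheduler.mode            FAIR
> spark.scheduler.allocation.file /etc/spark/fairscheduler.xml
> spark.executor.instances        50
> spark.executor.memory           1g
> EOF
> # In the job server's own HOCON config, the concurrency cap would be roughly:
> #   spark.jobserver.max-jobs-per-context = 48
> ********************************************************************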
> For the first 5 minutes or so everything is fine and jobs get processed normally.
> After that, the following 2 issues start happening at random (a sketch of how we observe them follows this list):
> 1) The cluster is completely idle with no jobs running, yet SJS accepts a request but does not submit it to the cluster for almost 3 - 4 minutes, and the job sits in the "running job" list for that long.
> 2) SJS accepts a request, submits it to the cluster, and the cluster finishes the job, but SJS still keeps it in the "running job" list for 3 - 4 minutes before moving it to the completed job list; during this time our application keeps waiting for the response.
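> For reference, this is roughly how we compare the two views of a job. A minimal sketch, assuming the default SJS HTTP port 8090 and the YARN ResourceManager REST API on port 8088; the hostnames are placeholders, not our real ones.
> ********************************************************************
> #!/bin/sh
> # Jobs that SJS still reports in its "running job" list:
> curl -s 'http://sjs-host:8090/jobs' | grep '"status"'
> # Applications that YARN itself reports as running at the same moment:
> curl -s 'http://rm-host:8088/ws/v1/cluster/apps?states=RUNNING'
> ********************************************************************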
> More issue details are documented at the external issue URL given below.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org