You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@oozie.apache.org by "Andras Piros (JIRA)" <ji...@apache.org> on 2018/12/03 11:15:00 UTC

[jira] [Commented] (OOZIE-3286) [spark-action] Launch Spark application from Oozie server

    [ https://issues.apache.org/jira/browse/OOZIE-3286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16707011#comment-16707011 ] 

Andras Piros commented on OOZIE-3286:
-------------------------------------

An interesting idea to support also [Spark on Kubernetes|https://stackoverflow.com/questions/50311831/ozzie-with-spark-2-3-on-kubernetes] using the very same mechanism as Spark on YARN.

> [spark-action] Launch Spark application from Oozie server
> ---------------------------------------------------------
>
>                 Key: OOZIE-3286
>                 URL: https://issues.apache.org/jira/browse/OOZIE-3286
>             Project: Oozie
>          Issue Type: New Feature
>          Components: action, core
>    Affects Versions: 5.0.0
>            Reporter: Andras Piros
>            Assignee: Andras Piros
>            Priority: Major
>
> This is a major refactor / rework of the Spark action.
> Today Oozie Spark actions run as follows:
> # {{SparkActionExecutor extends JavaActionExecutor}} (running as part of Oozie server code) launches a {{SparkMain}} (running as YARN application) the usual way on YARN using {{LauncherAM}}
> # {{SparkMain}} runs on a YARN NodeManager container and calls from the same JVM [*{{SparkSubmit}}*|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala]
> # {{SparkSubmit}} fires up a Spark driver:
> ** in {{local}} or {{yarn-client}} mode in the same YARN NodeManager JVM where {{SparkMain}} runs. In {{yarn-client}} mode Spark's [*{{ApplicationMaster}}*|https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala] runs inside the same JVM where the driver runs
> ** [*in {{yarn-cluster}} mode*|https://spark.apache.org/docs/latest/running-on-yarn.html#launching-spark-on-yarn] Spark driver runs in a different JVM than Spark's ApplicationMaster, maybe on a different host. It can go away after the YARN application has been submitted
> Problems with this approach are:
> * too many levels of indirection cause lots of latency, and a whole new world of communication errors can happen
> * since {{SparkSubmit}} is launched from the same JVM where {{SparkMain}} runs, the Spark application will share all the environment variables, classpath, sharelib dependencies etc. with Oozie's Spark launcher code, causing hard-to-nail-down environment and classpath issues
> The future of Oozie Spark action launching looks like this:
> # {{SparkActionExecutor}} (running as part of Oozie server code) launches an [*{{InProcessLauncher}}*|https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/InProcessLauncher.java] (available since Spark 2.3) always in {{yarn-cluster}} mode
> # {{InProcessLauncher}} calls either {{InProcessSparkSubmit}} or {{SparkSubmit}}. We translate any Spark application modes to {{yarn-cluster}}, so that Spark driver will run in a JVM different than the Spark driver
> # this allows for much better resource usage, since we have one YARN ApplicationMaster (Oozie's {{LauncherAM}}) less
> # since the Spark driver and executor are always launched in different JVMs, we don't have any interference of environment variables, driver, or executor classpath



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)