Posted to dev@gobblin.apache.org by "Vinoth Chandar (JIRA)" <ji...@apache.org> on 2018/03/29 15:34:00 UTC
[jira] [Comment Edited] (GOBBLIN-385) Add Spark execution mode for Gobblin
[ https://issues.apache.org/jira/browse/GOBBLIN-385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419203#comment-16419203 ]
Vinoth Chandar edited comment on GOBBLIN-385 at 3/29/18 3:33 PM:
-----------------------------------------------------------------
At a high level, it seems like we need two new classes:
- *CliSparkJobLauncher* : just wraps ServiceBasedApplicationLauncher and SparkJobLauncher.
- *SparkJobLauncher* : I think we should be able to extend MRJobLauncher and override `runWorkUnits` alone (which I believe is where the parallel work happens). This class can create the SparkContext (there can be only one per JVM) and simply reuse the existing input/output formats via JavaSparkContext.newAPIHadoopRDD and JavaPairRDD.saveAsNewAPIHadoopFile:
[http://spark.apache.org/docs/latest/api/java/org/apache/spark/api/java/JavaPairRDD.html#saveAsNewAPIHadoopFile-java.lang.String-java.lang.Class-java.lang.Class-java.lang.Class-]
[https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/api/java/JavaSparkContext.html#newAPIHadoopRDD(org.apache.hadoop.conf.Configuration,%20java.lang.Class,%20java.lang.Class,%20java.lang.Class)]
The above seems simple, but I'm sure that once we actually try it, we will hit standard Spark issues such as NotSerializableException or MR-specific code paths.
Do you have the Samza patch sitting in a diff somewhere? It could be useful to check out before I embark on trying this for real.
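A minimal sketch of the SparkJobLauncher idea above, using the two Spark API bridge methods linked earlier to reuse existing Hadoop new-API input/output formats. This is illustrative only: the class name and the use of plain Text formats are assumptions (a real implementation would plug in Gobblin's own formats inside the overridden `runWorkUnits`), and it needs the Spark and Hadoop dependencies on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkJobLauncherSketch {

  public static void main(String[] args) {
    // Only one SparkContext may exist per JVM, so it is created once here,
    // not per work unit.
    SparkConf conf = new SparkConf().setAppName("gobblin-spark").setMaster("local[*]");
    try (JavaSparkContext jsc = new JavaSparkContext(conf)) {
      Configuration hadoopConf = new Configuration();
      hadoopConf.set(FileInputFormat.INPUT_DIR, args[0]);

      // Read through an existing new-API Hadoop InputFormat.
      JavaPairRDD<LongWritable, Text> input = jsc.newAPIHadoopRDD(
          hadoopConf, TextInputFormat.class, LongWritable.class, Text.class);

      // Stand-in for the per-work-unit processing that runWorkUnits would do.
      JavaPairRDD<NullWritable, Text> output =
          input.mapToPair(kv -> new Tuple2<>(NullWritable.get(), kv._2));

      // Write through an existing new-API Hadoop OutputFormat.
      output.saveAsNewAPIHadoopFile(args[1], NullWritable.class, Text.class,
          TextOutputFormat.class);
    }
  }
}
```

Any NotSerializableException mentioned above would surface in the `mapToPair` closure here, since Spark serializes it to ship to executors.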
> Add Spark execution mode for Gobblin
> ------------------------------------
>
> Key: GOBBLIN-385
> URL: https://issues.apache.org/jira/browse/GOBBLIN-385
> Project: Apache Gobblin
> Issue Type: New Feature
> Components: gobblin-cluster
> Reporter: Vinoth Chandar
> Assignee: Hung Tran
> Priority: Major
>
> If there is interest, happy to contribute spark execution mode and eventually add support for ingesting data into [https://github.com/uber/hudi] format..
> Please provide some guidance
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)