Posted to dev@gobblin.apache.org by "Vinoth Chandar (JIRA)" <ji...@apache.org> on 2018/03/29 15:34:00 UTC

[jira] [Comment Edited] (GOBBLIN-385) Add Spark execution mode for Gobblin

    [ https://issues.apache.org/jira/browse/GOBBLIN-385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16419203#comment-16419203 ] 

Vinoth Chandar edited comment on GOBBLIN-385 at 3/29/18 3:33 PM:
-----------------------------------------------------------------

At a high level, it seems we need two new classes:

 - *CliSparkJobLauncher* : a thin wrapper around ServiceBasedApplicationLauncher and SparkJobLauncher

 - *SparkJobLauncher* : I think we should be able to extend MrJobLauncher and override `runWorkUnits` alone (which I believe is where the parallel work happens). This class can create the SparkContext (there can be only one per JVM) and simply reuse the existing input/output formats via SparkContext.newAPIHadoopRDD and PairRDD.saveAsNewAPIHadoopFile:

 

[http://spark.apache.org/docs/latest/api/java/org/apache/spark/api/java/JavaPairRDD.html#saveAsNewAPIHadoopFile-java.lang.String-java.lang.Class-java.lang.Class-java.lang.Class-]

[https://spark.apache.org/docs/2.1.0/api/java/org/apache/spark/api/java/JavaSparkContext.html#newAPIHadoopRDD(org.apache.hadoop.conf.Configuration,%20java.lang.Class,%20java.lang.Class,%20java.lang.Class)]
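
To make the idea concrete, here is a rough, untested sketch of what the override might look like. All Gobblin-side names below (MrJobLauncher, WorkUnit, runWorkUnits, runWorkUnit, the constructor shape) are assumptions about the existing API rather than verified signatures; only the Spark calls (JavaSparkContext.parallelize/foreach) are real API.

```java
// Hypothetical sketch only -- not runnable as-is. The Gobblin class/method
// names (MrJobLauncher, WorkUnit, runWorkUnits, runWorkUnit) are assumed,
// and the real runWorkUnits signature in Gobblin may differ.
public class SparkJobLauncher extends MrJobLauncher {

  // Only one SparkContext is allowed per JVM, so create it once and reuse it.
  private final JavaSparkContext sc;

  public SparkJobLauncher(Properties jobProps, JavaSparkContext sc) throws Exception {
    super(jobProps);
    this.sc = sc;
  }

  @Override
  protected void runWorkUnits(List<WorkUnit> workUnits) {
    // Replace the MR mapper fan-out: ship each work unit to a Spark executor
    // and run it there. WorkUnit (or a wrapper around it) must be
    // java.io.Serializable, or this fails with "Task not serializable".
    sc.parallelize(workUnits, workUnits.size())
      .foreach(workUnit -> runWorkUnit(workUnit));
  }
}
```

Alternatively, per the links above, the existing Hadoop input/output formats could be reused directly via sc.newAPIHadoopRDD(conf, inputFormatClass, keyClass, valueClass) and pairRdd.saveAsNewAPIHadoopFile(path, keyClass, valueClass, outputFormatClass).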

 

The above seems very simple, but I am sure that once we actually try it, we will hit the standard Spark issues, such as NotSerializable exceptions or MR-specific code paths.
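
To illustrate the serialization pitfall: Spark's "Task not serializable" failures wrap the same java.io.NotSerializableException that plain Java serialization throws when an object graph contains a non-Serializable field. The class names in this self-contained demo are made up for illustration and are not Gobblin types.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Demo classes are hypothetical stand-ins, not Gobblin or Spark types.
public class SerializationPitfall {

    // Stands in for a launcher/task object Spark would try to serialize.
    static class FakeTask implements Serializable {
        // A plain Object (think: a Hadoop FileSystem or JobContext handle)
        // is not Serializable, so serializing the enclosing task fails.
        final Object nonSerializableHandle = new Object();
    }

    public static void main(String[] args) {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(new FakeTask());
            System.out.println("serialized ok");
        } catch (NotSerializableException e) {
            // Spark surfaces this same exception inside "Task not serializable".
            System.out.println("NotSerializableException: " + e.getMessage());
        } catch (IOException e) {
            System.out.println("unexpected io error");
        }
    }
}
```

The usual fixes are marking such fields transient and recreating them on the executor, or narrowing the closure so it does not capture the enclosing object.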

 

Do you have the Samza patch sitting in a diff somewhere? It could be useful to check out before I embark on trying this for real.

 

 

 

 



> Add Spark execution mode for Gobblin
> ------------------------------------
>
>                 Key: GOBBLIN-385
>                 URL: https://issues.apache.org/jira/browse/GOBBLIN-385
>             Project: Apache Gobblin
>          Issue Type: New Feature
>          Components: gobblin-cluster
>            Reporter: Vinoth Chandar
>            Assignee: Hung Tran
>            Priority: Major
>
> If there is interest, I am happy to contribute a Spark execution mode and eventually add support for ingesting data into the [https://github.com/uber/hudi] format.
> Please provide some guidance.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)