Posted to user@beam.apache.org by Antony Mayi <an...@yahoo.com> on 2017/05/10 08:16:54 UTC

appending beam pipeline to spark job

I've got a (dirty) use case: I have an existing Spark batch job that produces output I would like to feed into my Beam pipeline (assuming it runs on the SparkRunner). I was trying to run it as one job (the output is reduced, so it's not big data and it's OK to do something like Create.of(rdd.collect())), but that fails because of the two separate Spark contexts.
Is it possible to build the Beam pipeline on an existing Spark context?
thx, Antony.

Re: appending beam pipeline to spark job

Posted by Antony Mayi <an...@yahoo.com>.
Makes sense, thanks for everything, a.

    On Wednesday, 10 May 2017, 11:12, Jean-Baptiste Onofré <jb...@nanthrax.net> wrote:
 

Hi Antony,

Not directly from the Beam SDK: you have to be "in" the Spark runner to do so,
adding your own PTransform and the corresponding translator.

Otherwise, it would mean we lose the "portability" of the pipeline across different runners.

Regards
JB

On 05/10/2017 11:08 AM, Antony Mayi wrote:
> Very useful, thanks!
>
> BTW, to avoid calling Create.of(rdd.collect()) - is there by any chance a way
> to get a PCollection directly from an RDD?
>
> thx,
> antony.
>
>
> On Wednesday, 10 May 2017, 10:37, Jean-Baptiste Onofré <jb...@nanthrax.net> wrote:
>
>
> Hi Antony,
>
> Yes, it's possible to "inject"/reuse an existing Spark context via the pipeline
> options. From SparkPipelineOptions:
>
>  @Description("If the spark runner will be initialized with a provided Spark Context. "
>      + "The Spark Context should be provided with SparkContextOptions.")
>  @Default.Boolean(false)
>  boolean getUsesProvidedSparkContext();
>  void setUsesProvidedSparkContext(boolean value);
>
> Regards
> JB
>
> On 05/10/2017 10:16 AM, Antony Mayi wrote:
>> I've got a (dirty) use case: I have an existing Spark batch job that produces
>> output I would like to feed into my Beam pipeline (assuming it runs on the
>> SparkRunner). I was trying to run it as one job (the output is reduced, so it's
>> not big data and it's OK to do something like Create.of(rdd.collect())), but
>> that fails because of the two separate Spark contexts.
>>
>> Is it possible to build the Beam pipeline on an existing Spark context?
>>
>> thx,
>> Antony.
>
>
> --
> Jean-Baptiste Onofré
> jbonofre@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
>
>

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


   

Re: appending beam pipeline to spark job

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Hi Antony,

Not directly from the Beam SDK: you have to be "in" the Spark runner to do so,
adding your own PTransform and the corresponding translator.

Otherwise, it would mean we lose the "portability" of the pipeline across different runners.

Regards
JB

On 05/10/2017 11:08 AM, Antony Mayi wrote:
> Very useful, thanks!
>
> BTW, to avoid calling Create.of(rdd.collect()) - is there by any chance a way
> to get a PCollection directly from an RDD?
>
> thx,
> antony.
>
>
> On Wednesday, 10 May 2017, 10:37, Jean-Baptiste Onofré <jb...@nanthrax.net> wrote:
>
>
> Hi Antony,
>
> Yes, it's possible to "inject"/reuse an existing Spark context via the pipeline
> options. From SparkPipelineOptions:
>
>   @Description("If the spark runner will be initialized with a provided Spark Context. "
>       + "The Spark Context should be provided with SparkContextOptions.")
>   @Default.Boolean(false)
>   boolean getUsesProvidedSparkContext();
>   void setUsesProvidedSparkContext(boolean value);
>
> Regards
> JB
>
> On 05/10/2017 10:16 AM, Antony Mayi wrote:
>> I've got a (dirty) use case: I have an existing Spark batch job that produces
>> output I would like to feed into my Beam pipeline (assuming it runs on the
>> SparkRunner). I was trying to run it as one job (the output is reduced, so it's
>> not big data and it's OK to do something like Create.of(rdd.collect())), but
>> that fails because of the two separate Spark contexts.
>>
>> Is it possible to build the Beam pipeline on an existing Spark context?
>>
>> thx,
>> Antony.
>
>
> --
> Jean-Baptiste Onofré
> jbonofre@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
>
>

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: appending beam pipeline to spark job

Posted by Antony Mayi <an...@yahoo.com>.
Very useful, thanks!
BTW, to avoid calling Create.of(rdd.collect()) - is there by any chance a way to get a PCollection directly from an RDD?
thx, antony.

    On Wednesday, 10 May 2017, 10:37, Jean-Baptiste Onofré <jb...@nanthrax.net> wrote:
 

Hi Antony,

Yes, it's possible to "inject"/reuse an existing Spark context via the pipeline
options. From SparkPipelineOptions:

  @Description("If the spark runner will be initialized with a provided Spark Context. "
      + "The Spark Context should be provided with SparkContextOptions.")
  @Default.Boolean(false)
  boolean getUsesProvidedSparkContext();
  void setUsesProvidedSparkContext(boolean value);

Regards
JB

On 05/10/2017 10:16 AM, Antony Mayi wrote:
> I've got a (dirty) use case: I have an existing Spark batch job that produces
> output I would like to feed into my Beam pipeline (assuming it runs on the
> SparkRunner). I was trying to run it as one job (the output is reduced, so it's
> not big data and it's OK to do something like Create.of(rdd.collect())), but
> that fails because of the two separate Spark contexts.
>
> Is it possible to build the Beam pipeline on an existing Spark context?
>
> thx,
> Antony.

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


   

Re: appending beam pipeline to spark job

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Hi Antony,

Yes, it's possible to "inject"/reuse an existing Spark context via the pipeline
options. From SparkPipelineOptions:

   @Description("If the spark runner will be initialized with a provided Spark Context. "
       + "The Spark Context should be provided with SparkContextOptions.")
   @Default.Boolean(false)
   boolean getUsesProvidedSparkContext();
   void setUsesProvidedSparkContext(boolean value);
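
For illustration, here is a minimal sketch of how this could be wired up, assuming
Beam's Spark runner with SparkContextOptions and an already-created JavaSparkContext
from the existing batch job (the class and variable names below are illustrative only):

   import java.util.Arrays;
   import java.util.List;

   import org.apache.beam.runners.spark.SparkContextOptions;
   import org.apache.beam.runners.spark.SparkRunner;
   import org.apache.beam.sdk.Pipeline;
   import org.apache.beam.sdk.options.PipelineOptionsFactory;
   import org.apache.beam.sdk.transforms.Create;
   import org.apache.beam.sdk.values.PCollection;
   import org.apache.spark.api.java.JavaSparkContext;

   public class ProvidedContextExample {
     public static void main(String[] args) {
       // Context of the existing Spark batch job (illustrative local setup).
       JavaSparkContext jsc = new JavaSparkContext("local[2]", "existing-spark-job");

       // Stand-in for the already-reduced (small) output of that job.
       List<String> reduced = jsc.parallelize(Arrays.asList("a", "b", "c")).collect();

       // Ask the Spark runner to reuse the existing context instead of creating its own.
       SparkContextOptions options = PipelineOptionsFactory.as(SparkContextOptions.class);
       options.setRunner(SparkRunner.class);
       options.setUsesProvidedSparkContext(true);
       options.setProvidedSparkContext(jsc);

       // Build the Beam pipeline on top of the collected output.
       Pipeline pipeline = Pipeline.create(options);
       PCollection<String> input = pipeline.apply(Create.of(reduced));
       // ... apply the rest of the Beam transforms to `input` ...

       pipeline.run().waitUntilFinish();
     }
   }

With the provided context, the collect()/Create.of() hand-off and the Beam pipeline
run in the same Spark application, so there is no second Spark context to clash with.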

Regards
JB

On 05/10/2017 10:16 AM, Antony Mayi wrote:
> I've got a (dirty) use case: I have an existing Spark batch job that produces
> output I would like to feed into my Beam pipeline (assuming it runs on the
> SparkRunner). I was trying to run it as one job (the output is reduced, so it's
> not big data and it's OK to do something like Create.of(rdd.collect())), but
> that fails because of the two separate Spark contexts.
>
> Is it possible to build the Beam pipeline on an existing Spark context?
>
> thx,
> Antony.

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com