Posted to user@beam.apache.org by Jon Molle via user <us...@beam.apache.org> on 2023/06/21 15:31:18 UTC

Hi,

I've been looking at the Spark Portable Runner docs, specifically Java when
possible, and I'm a little confused about the organization. The docs seem
to say that the JobService both submits the code to the linked Spark
cluster (described by the master URL) and requires you to afterwards run a
spark-submit command on whatever artifacts it builds.
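
If I'm reading the runner page right, those are maybe two alternative routes
rather than one sequence. My best guess at the job-server route, in case it
helps frame the question (the image tag, port, and hostnames are my guesses
from the docs, so treat this as a sketch, not gospel):

    # Start the job server container against an existing Spark master:
    docker run --net=host apache/beam_spark3_job_server:latest \
        --spark-master-url=spark://spark-master:7077

    // Then point the Java pipeline at the job server instead of spark-submit.
    // (imports: org.apache.beam.sdk.Pipeline,
    //  org.apache.beam.sdk.options.PipelineOptionsFactory,
    //  org.apache.beam.sdk.options.PortablePipelineOptions,
    //  org.apache.beam.runners.portability.PortableRunner)
    PortablePipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(PortablePipelineOptions.class);
    options.setRunner(PortableRunner.class);
    options.setJobEndpoint("localhost:8099");
    options.setDefaultEnvironmentType("DOCKER"); // SDK workers run as Docker containers
    Pipeline p = Pipeline.create(options);
    // ... apply the actual transforms here ...
    p.run().waitUntilFinish();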

Unfortunately I'm not that familiar with Spark generally, so I'm probably
misunderstanding more here, but the job server images either totally lack
documentation or just repeat the spark runner page in the main docs.

For context, I'm trying to port some code that we're currently running on a
Dataflow runner (on GCP) to also run on AWS. A Spark cluster on EKS (either
self-managed or potentially through EMR, though likely not, based on what I'm
reading in the docs and some brief testing) seems the closest analog.

The new Tour does the same thing, and on top of that it only really has
examples for Python, plus a few more typos. I haven't found any existing
questions like this elsewhere, so I assume that I'm just missing something
that should be obvious.

Thanks for your time.

Re: Re:

Posted by Jon Molle via user <us...@beam.apache.org>.
Hi Moritz,

Yes, yes I am.

I've gotten to the point where I've got a job service runner set up and am
trying to get the Spark and artifact storage volume set up properly. I can
only find vague references, but it seems like there needs to be an
accessible shared drive between the Spark cluster workers and the job
service; I've been unsuccessful at configuring that so far, and I'm not sure
exactly where that config lives. In my case that basically means a k8s
ReadWriteMany volume, but it seems like it should support generic NFS or
cloud bucket storage and doesn't have to be managed by k8s.
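
Concretely, the thing I've been attempting looks roughly like the following
(the --artifacts-dir flag is what I found poking around the job server's
options, and the paths are made up, so this is a sketch of the intent rather
than a working config):

    # Same path mounted on the job server and on every Spark worker,
    # e.g. via an NFS-backed ReadWriteMany PersistentVolume:
    docker run --net=host \
        -v /mnt/beam-artifacts:/mnt/beam-artifacts \
        apache/beam_spark3_job_server:latest \
        --spark-master-url=spark://spark-master:7077 \
        --artifacts-dir=/mnt/beam-artifacts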

Best,
Jon

On Thu, Jul 20, 2023 at 1:14 AM Moritz Mack <mm...@talend.com> wrote:

> Hi Jon,
>
> I just want to check in here briefly: are you still looking for support on
> this?
>
> Sadly yes, this totally lacks documentation and isn't straightforward to
> set up.
>
> /Moritz

Re: Re:

Posted by Moritz Mack <mm...@talend.com>.
Hi Jon,

I just want to check in here briefly: are you still looking for support on this?
Sadly yes, this totally lacks documentation and isn't straightforward to set up.

/Moritz

On 21.06.23, 23:47, "Jon Molle via user" <us...@beam.apache.org> wrote:

Hi Pavel,

Thanks for your response! I took a look at running Beam on Kinesis (analytics), as it is the AWS-recommended way to run Beam jobs. It seems like it doesn't work with the portable runner model. Our project is a daemon running in a Kubernetes cluster that has Beam code running as part of certain tasks, so I'm not exactly sure how that would work with Kinesis, as I don't see a way to grab the master URL (and I'm not entirely sure if the Flink image being run by Kinesis would work for Beam). I'd really like to avoid using any of the non-portable runners if possible.

That's part of why I am looking at Spark (although Flink looks fairly similar): EKS supports autoscaling and other features that Dataflow does. I don't want to introduce a big divergence between the GCP and AWS behaviour if possible. It seems possible, but the docs for the other runners are a bit ambiguous on exactly how much of submitting jobs is handled by the runner.


Re:

Posted by Jon Molle via user <us...@beam.apache.org>.
Hi Pavel,

Thanks for your response! I took a look at running Beam on Kinesis
(analytics), as it is the AWS-recommended way to run Beam jobs. It seems
like it doesn't work with the portable runner model. Our project is a
daemon running in a Kubernetes cluster that has Beam code running as part
of certain tasks, so I'm not exactly sure how that would work with Kinesis,
as I don't see a way to grab the master URL (and I'm not entirely sure if
the Flink image being run by Kinesis would work for Beam). I'd really like
to avoid using any of the non-portable runners if possible.

That's part of why I am looking at Spark (although Flink looks fairly
similar): EKS supports autoscaling and other features that Dataflow does. I
don't want to introduce a big divergence between the GCP and AWS behaviour
if possible. It seems possible, but the docs for the other runners are a
bit ambiguous on exactly how much of submitting jobs is handled by the
runner.
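
To make the shape concrete, what I'd want the daemon to be able to do is
roughly this (the service name is a placeholder, assuming a job server
deployed behind a k8s Service; a sketch of the goal, not working code):

    // Sketch: submit a portable pipeline from inside the daemon process.
    // (imports: org.apache.beam.sdk.Pipeline,
    //  org.apache.beam.sdk.options.PipelineOptionsFactory,
    //  org.apache.beam.sdk.options.PortablePipelineOptions,
    //  org.apache.beam.runners.portability.PortableRunner)
    PortablePipelineOptions opts =
        PipelineOptionsFactory.create().as(PortablePipelineOptions.class);
    opts.setRunner(PortableRunner.class);
    // Hypothetical in-cluster address of the Beam job server:
    opts.setJobEndpoint("beam-job-server.default.svc.cluster.local:8099");
    Pipeline p = Pipeline.create(opts);
    // ... per-task Beam transforms ...
    p.run();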

On Wed, Jun 21, 2023 at 12:28 PM Pavel Solomin <p....@gmail.com>
wrote:

> Hello!
>
> > to also run on AWS
>
> > A Spark cluster on EKS seems the closest analog
>
> There's another way of running Beam apps in AWS -
> https://aws.amazon.com/kinesis/data-analytics/ - which is basically
> "serverless" Flink. It says Kinesis, but you can run any Flink / Beam job
> there, and you don't have to use Kinesis streams. I've used KDA in multiple
> projects so far, and it works OK. FlinkRunner also seems to have more docs,
> as far as I can see.
>
> Here's a pom.xml example:
> https://github.com/aws-samples/amazon-kinesis-data-analytics-examples/blob/master/Beam/pom.xml
>
> Best Regards,
> Pavel Solomin
>
> Tel: +351 962 950 692 | Skype: pavel_solomin | Linkedin
> <https://www.linkedin.com/in/pavelsolomin>

Re:

Posted by Pavel Solomin <p....@gmail.com>.
Hello!

> to also run on AWS

> A Spark cluster on EKS seems the closest analog

There's another way of running Beam apps in AWS -
https://aws.amazon.com/kinesis/data-analytics/ - which is basically
"serverless" Flink. It says Kinesis, but you can run any Flink / Beam job
there, and you don't have to use Kinesis streams. I've used KDA in multiple
projects so far, and it works OK. FlinkRunner also seems to have more docs,
as far as I can see.

Here's a pom.xml example:
https://github.com/aws-samples/amazon-kinesis-data-analytics-examples/blob/master/Beam/pom.xml
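
If I remember right, the main() in that sample is essentially an ordinary
FlinkRunner pipeline, roughly this shape (from memory, so check the repo for
the exact code):

    // KDA runs your packaged jar as a regular Flink job; on the Beam side it
    // is the classic (non-portable) FlinkRunner embedded in the application.
    // (imports: org.apache.beam.sdk.Pipeline,
    //  org.apache.beam.sdk.options.PipelineOptionsFactory,
    //  org.apache.beam.runners.flink.FlinkPipelineOptions,
    //  org.apache.beam.runners.flink.FlinkRunner)
    FlinkPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(FlinkPipelineOptions.class);
    options.setRunner(FlinkRunner.class);
    Pipeline p = Pipeline.create(options);
    // ... sources, transforms, sinks ...
    p.run().waitUntilFinish();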

Best Regards,
Pavel Solomin

Tel: +351 962 950 692 | Skype: pavel_solomin | Linkedin
<https://www.linkedin.com/in/pavelsolomin>