Posted to dev@beam.apache.org by Kamil Wasilewski <ka...@polidea.com> on 2020/06/23 16:16:04 UTC

Running Beam pipeline using Spark on YARN

Hi all,

I'm trying to run a Beam pipeline using Spark on YARN. My pipeline is
written in Python, so I need to use a portable runner. Does anybody know
how I should configure the job server parameters, especially
--spark-master-url? Is there anything else I need to be aware of while
using such a setup?

If it makes a difference, I use Google Dataproc.

Best,
Kamil

Re: Running Beam pipeline using Spark on YARN

Posted by Kamil Wasilewski <ka...@polidea.com>.
Thanks for the information. So it looks like we can't easily run portable
pipelines on a Dataproc cluster at the moment.

> you can set --output_executable_path to create a jar that you can then
submit to yarn via spark-submit.

I tried to create a jar, but I ran into a problem. I left an error message
in a comment on https://issues.apache.org/jira/browse/BEAM-8970.


Re: Running Beam pipeline using Spark on YARN

Posted by Kyle Weaver <kc...@google.com>.
> So hopefully setting --spark-master-url to yarn will work too.

This is not supported.

Re: Running Beam pipeline using Spark on YARN

Posted by Xinyu Liu <xi...@gmail.com>.
I am doing some prototyping on this too. I used the spark-submit script
instead of the REST API. In my simple setup, I ran SparkJobServerDriver.main()
directly in the AM as a Spark job, which will submit the Python job to the
default Spark master URL pointing to "local". I also used --files in the
spark-submit script to upload the Python packages and the boot script. On the
Python side, I used the following pipeline options for submission (thanks to
Thomas):

    from apache_beam.options.pipeline_options import PipelineOptions

    pipeline_options = PipelineOptions([
        "--runner=PortableRunner",
        "--job_endpoint=your-job-server:8099",
        "--environment_type=PROCESS",
        "--environment_config={\"command\": \"./boot\"}",
    ])

I used my own boot script for customized Python packaging. With this setup
I was able to get a simple hello-world program running. I haven't tried to
run the job server separately from the AM yet. So hopefully setting
--spark-master-url to yarn will work too.
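
For reference, the spark-submit invocation for a setup like this would look
roughly as follows; the jar and file names below are placeholders, not the
exact ones from the setup described above:

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --class org.apache.beam.runners.spark.SparkJobServerDriver \
      --files boot,python_packages.tar.gz \
      beam-runners-spark-job-server.jar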

Thanks,
Xinyu

Re: Running Beam pipeline using Spark on YARN

Posted by Kyle Weaver <kc...@google.com>.
Hi Kamil, there is a JIRA for this:
https://issues.apache.org/jira/browse/BEAM-8970. It's theoretically possible
but remains untested as far as I know :)

As I indicated in a comment, you can set --output_executable_path to create
a jar that you can then submit to yarn via spark-submit.
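
As a rough sketch (the flag values here are illustrative, and this hasn't
been verified end to end):

    from apache_beam.options.pipeline_options import PipelineOptions

    pipeline_options = PipelineOptions([
        "--runner=PortableRunner",
        "--job_endpoint=localhost:8099",
        "--output_executable_path=/tmp/beam-pipeline.jar",
        "--environment_type=PROCESS",
        "--environment_config={\"command\": \"./boot\"}",
    ])

Running the pipeline with these options should write the jar to the given
path instead of executing the job; the jar can then be submitted with
something like: spark-submit --master yarn /tmp/beam-pipeline.jar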

If you can get this working, I'd additionally like to script the jar
submission in Python to save users the extra step.
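
A very rough sketch of what that scripting could look like (purely
illustrative; it assumes spark-submit is on the PATH and that YARN is
reachable via the usual cluster configuration):

    import subprocess

    def submit_jar_to_yarn(jar_path):
        # Hypothetical helper: shell out to spark-submit and let YARN run
        # the Beam-generated jar in cluster mode.
        subprocess.run(
            ["spark-submit", "--master", "yarn",
             "--deploy-mode", "cluster", jar_path],
            check=True)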

Thanks,
Kyle
