Posted to dev@spark.apache.org by Patrick Wendell <pw...@gmail.com> on 2013/08/03 06:50:21 UTC

How do you run Spark jobs?

Hey All,

I'm working on SPARK-800 [1]. The goal is to document a best practice or
recommended way of bundling and running Spark jobs. We have a quickstart
guide for writing a standalone job, but it doesn't cover how to deal with
packaging up your dependencies and setting the correct environment
variables required to submit a full job to a cluster. This can be a
confusing process for beginners - it would be good to extend the guide to
cover this.
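
For concreteness, the kind of quickstart-style standalone job in question
looks roughly like the sketch below. This is only an illustration against the
0.7.x-era API; the master URL, Spark home, jar path, and input path are
placeholders, not real values:

// Rough sketch of a standalone job against the Spark 0.7.x-era API.
// Master URL, Spark home, jar path, and input path are placeholders.
import spark.SparkContext
import spark.SparkContext._

object SimpleJob {
  def main(args: Array[String]) {
    val sc = new SparkContext(
      "spark://master:7077",                  // cluster master URL
      "Simple Job",                           // job name
      "/path/to/spark",                       // Spark home on the cluster nodes
      Seq("target/scala-2.9.3/my-job.jar"))   // job jar(s) shipped to workers
    val lines = sc.textFile("hdfs:///path/to/input.txt")
    println("Line count: " + lines.count())
    sc.stop()
  }
}

Getting from that snippet to a jar running on a real cluster is exactly where
the packaging and environment-variable questions below come in.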

First though I wanted to sample this list and see how people tend to run
Spark jobs inside their orgs. Knowing any of the following would be
helpful:

- Do you create an uber jar with all of your job's (and Spark's) recursive
dependencies?
- Do you try to use sbt run or maven exec with some way to pass the correct
environment variables?
- Do people use a modified version of Spark's own `run` script?
- Do you have some other way of submitting jobs?

Any notes would be helpful in compiling this!

https://spark-project.atlassian.net/browse/SPARK-800

Re: How do you run Spark jobs?

Posted by Evan Chan <ev...@ooyala.com>.
Here it is:

https://groups.google.com/forum/?fromgroups=#!searchin/spark-users/SBT/spark-users/pHaF01sPwBo/faHr-fEAFbYJ


On Tue, Aug 13, 2013 at 12:55 AM, Grega Kešpret <gr...@celtra.com> wrote:

> Hey Evan,
> any chance you might find the link to the above mentioned SBT recipe?
> Would greatly appreciate it.
>
> Thanks,
> Grega
>
> On Fri, Aug 9, 2013 at 10:00 AM, Evan Chan <ev...@ooyala.com> wrote:
>
> > Hey Patrick,
> >
> > A while back I posted an SBT recipe that lets users build Scala job
> > assemblies excluding Spark and its dependencies, which I believe is what
> > most people want.  This lets you include your own libraries while leaving
> > out Spark's, yielding the smallest possible assembly jar.
> >
> > We don't use Spark's run script; instead we have SBT configured so that
> > you can simply type "run" to run jobs.  I believe this gives maximum
> > developer velocity.  We also have "sbt console" hooked up so that you can
> > run the Spark shell from it (no need for the ./spark-shell script).
> >
> > And, as you know, we are going to contribute back a job server.  We
> > believe that for most organizations this will provide the easiest way to
> > submit and manage jobs -- IT/OPS sets up Spark as an HTTP service (using
> > the job server), and users/developers can submit jobs to a managed
> > service.  We even have a giter8 template to make creating jobs for the
> > job server super simple.  The template has support for local run, Spark
> > shell, assembly, and testing.
> >
> > So anyway, I believe we'll have a lot to contribute to your guide -- both
> > now and especially once the job server is contributed.  Feel free to
> > touch base offline.
> >
> > -Evan
> >
> >
> >
> >
> >
> > On Fri, Aug 2, 2013 at 9:50 PM, Patrick Wendell <pw...@gmail.com>
> > wrote:
> >
> > > Hey All,
> > >
> > > I'm working on SPARK-800 [1]. The goal is to document a best practice or
> > > recommended way of bundling and running Spark jobs. We have a quickstart
> > > guide for writing a standalone job, but it doesn't cover how to deal with
> > > packaging up your dependencies and setting the correct environment
> > > variables required to submit a full job to a cluster. This can be a
> > > confusing process for beginners - it would be good to extend the guide to
> > > cover this.
> > >
> > > First though I wanted to sample this list and see how people tend to run
> > > Spark jobs inside their orgs. Knowing any of the following would be
> > > helpful:
> > >
> > > - Do you create an uber jar with all of your job's (and Spark's) recursive
> > > dependencies?
> > > - Do you try to use sbt run or maven exec with some way to pass the
> > > correct environment variables?
> > > - Do people use a modified version of Spark's own `run` script?
> > > - Do you have some other way of submitting jobs?
> > >
> > > Any notes would be helpful in compiling this!
> > >
> > > https://spark-project.atlassian.net/browse/SPARK-800
> > >
> >
> >
> >
> > --
> > --
> > Evan Chan
> > Staff Engineer
> > ev@ooyala.com  |
> >
> > <http://www.ooyala.com/>
> > <http://www.facebook.com/ooyala> <http://www.linkedin.com/company/ooyala>
> > <http://www.twitter.com/ooyala>
> >
>



-- 
--
Evan Chan
Staff Engineer
ev@ooyala.com  |

<http://www.ooyala.com/>
<http://www.facebook.com/ooyala> <http://www.linkedin.com/company/ooyala> <http://www.twitter.com/ooyala>

Re: How do you run Spark jobs?

Posted by Grega Kešpret <gr...@celtra.com>.
Hey Evan,
any chance you might find the link to the above mentioned SBT recipe?
Would greatly appreciate it.

Thanks,
Grega

On Fri, Aug 9, 2013 at 10:00 AM, Evan Chan <ev...@ooyala.com> wrote:

> Hey Patrick,
>
> A while back I posted an SBT recipe that lets users build Scala job
> assemblies excluding Spark and its dependencies, which I believe is what
> most people want.  This lets you include your own libraries while leaving
> out Spark's, yielding the smallest possible assembly jar.
>
> We don't use Spark's run script; instead we have SBT configured so that you
> can simply type "run" to run jobs.  I believe this gives maximum developer
> velocity.  We also have "sbt console" hooked up so that you can run the
> Spark shell from it (no need for the ./spark-shell script).
>
> And, as you know, we are going to contribute back a job server.  We believe
> that for most organizations this will provide the easiest way to submit and
> manage jobs -- IT/OPS sets up Spark as an HTTP service (using the job
> server), and users/developers can submit jobs to a managed service.  We
> even have a giter8 template to make creating jobs for the job server super
> simple.  The template has support for local run, Spark shell, assembly, and
> testing.
>
> So anyway, I believe we'll have a lot to contribute to your guide -- both
> now and especially once the job server is contributed.  Feel free to touch
> base offline.
>
> -Evan
>
>
>
>
>
> On Fri, Aug 2, 2013 at 9:50 PM, Patrick Wendell <pw...@gmail.com>
> wrote:
>
> > Hey All,
> >
> > I'm working on SPARK-800 [1]. The goal is to document a best practice or
> > recommended way of bundling and running Spark jobs. We have a quickstart
> > guide for writing a standalone job, but it doesn't cover how to deal with
> > packaging up your dependencies and setting the correct environment
> > variables required to submit a full job to a cluster. This can be a
> > confusing process for beginners - it would be good to extend the guide to
> > cover this.
> >
> > First though I wanted to sample this list and see how people tend to run
> > Spark jobs inside their orgs. Knowing any of the following would be
> > helpful:
> >
> > - Do you create an uber jar with all of your job's (and Spark's) recursive
> > dependencies?
> > - Do you try to use sbt run or maven exec with some way to pass the
> > correct environment variables?
> > - Do people use a modified version of Spark's own `run` script?
> > - Do you have some other way of submitting jobs?
> >
> > Any notes would be helpful in compiling this!
> >
> > https://spark-project.atlassian.net/browse/SPARK-800
> >
>
>
>
> --
> --
> Evan Chan
> Staff Engineer
> ev@ooyala.com  |
>
> <http://www.ooyala.com/>
> <http://www.facebook.com/ooyala> <http://www.linkedin.com/company/ooyala>
> <http://www.twitter.com/ooyala>
>

Re: How do you run Spark jobs?

Posted by Evan Chan <ev...@ooyala.com>.
Hey Patrick,

A while back I posted an SBT recipe that lets users build Scala job
assemblies excluding Spark and its dependencies, which I believe is what
most people want.  This lets you include your own libraries while leaving
out Spark's, yielding the smallest possible assembly jar.
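
To sketch the general idea: this is a generic illustration of the "provided"
approach with sbt-assembly, not necessarily the exact recipe referred to
above, and the group IDs, artifact names, and version numbers are
placeholders:

// build.sbt -- illustrative only; adjust names and versions for your setup
import AssemblyKeys._   // from the sbt-assembly plugin

assemblySettings

name := "my-spark-job"

scalaVersion := "2.9.3"

libraryDependencies ++= Seq(
  // "provided" keeps Spark and its transitive deps out of the assembly jar,
  // since the cluster already has them on its classpath.
  "org.spark-project" %% "spark-core" % "0.7.3" % "provided",
  // your own libraries are included in the assembly as usual
  "joda-time" % "joda-time" % "2.2"
)

(One wrinkle to be aware of: sbt leaves "provided" dependencies off the
classpath it uses for the run task, so builds that also rely on "sbt run"
usually add them back for that task.)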

We don't use Spark's run script; instead we have SBT configured so that you
can simply type "run" to run jobs.  I believe this gives maximum developer
velocity.  We also have "sbt console" hooked up so that you can run the
Spark shell from it (no need for the ./spark-shell script).
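
The kind of settings that make such a workflow possible look roughly like the
following. This is again an illustrative sketch, not the actual configuration
being described; the heap size and master URL are placeholders:

// Illustrative sbt settings for a comfortable "sbt run" / "sbt console" loop.
fork in run := true                       // run jobs in a separate JVM

javaOptions in run ++= Seq("-Xmx2g")      // e.g. give the driver more heap

// Pre-wire a SparkContext so "sbt console" behaves like a Spark shell
// (0.7.x-era package names; "local[4]" is a placeholder master).
initialCommands in console := """
  import spark.SparkContext
  import spark.SparkContext._
  val sc = new SparkContext("local[4]", "sbt-console")
"""

cleanupCommands in console := "sc.stop()" // shut the context down on exit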

And, as you know, we are going to contribute back a job server.  We believe
that for most organizations this will provide the easiest way to submit and
manage jobs -- IT/OPS sets up Spark as an HTTP service (using the job
server), and users/developers can submit jobs to a managed service.  We
even have a giter8 template to make creating jobs for the job server super
simple.  The template has support for local run, Spark shell, assembly, and
testing.

So anyway, I believe we'll have a lot to contribute to your guide -- both
now and especially once the job server is contributed.  Feel free to touch
base offline.

-Evan





On Fri, Aug 2, 2013 at 9:50 PM, Patrick Wendell <pw...@gmail.com> wrote:

> Hey All,
>
> I'm working on SPARK-800 [1]. The goal is to document a best practice or
> recommended way of bundling and running Spark jobs. We have a quickstart
> guide for writing a standalone job, but it doesn't cover how to deal with
> packaging up your dependencies and setting the correct environment
> variables required to submit a full job to a cluster. This can be a
> confusing process for beginners - it would be good to extend the guide to
> cover this.
>
> First though I wanted to sample this list and see how people tend to run
> Spark jobs inside their orgs. Knowing any of the following would be
> helpful:
>
> - Do you create an uber jar with all of your job's (and Spark's) recursive
> dependencies?
> - Do you try to use sbt run or maven exec with some way to pass the correct
> environment variables?
> - Do people use a modified version of Spark's own `run` script?
> - Do you have some other way of submitting jobs?
>
> Any notes would be helpful in compiling this!
>
> https://spark-project.atlassian.net/browse/SPARK-800
>



-- 
--
Evan Chan
Staff Engineer
ev@ooyala.com  |

<http://www.ooyala.com/>
<http://www.facebook.com/ooyala> <http://www.linkedin.com/company/ooyala> <http://www.twitter.com/ooyala>

Re: How do you run Spark jobs?

Posted by Roman Shaposhnik <rv...@apache.org>.
On Fri, Aug 2, 2013 at 9:50 PM, Patrick Wendell <pw...@gmail.com> wrote:
> Hey All,
>
> I'm working on SPARK-800 [1]. The goal is to document a best practice or
> recommended way of bundling and running Spark jobs. We have a quickstart
> guide for writing a standalone job, but it doesn't cover how to deal with
> packaging up your dependencies and setting the correct environment
> variables required to submit a full job to a cluster. This can be a
> confusing process for beginners - it would be good to extend the guide to
> cover this.
>
> First though I wanted to sample this list and see how people tend to run
> Spark jobs inside their orgs. Knowing any of the following would be
> helpful:
>
> - Do you create an uber jar with all of your job's (and Spark's) recursive
> dependencies?
> - Do you try to use sbt run or maven exec with some way to pass the correct
> environment variables?
> - Do people use a modified version of Spark's own `run` script?
> - Do you have some other way of submitting jobs?
>
> Any notes would be helpful in compiling this!

Now that Spark has been integrated into Bigtop:
    https://issues.apache.org/jira/browse/BIGTOP-715
it may make sense to tackle some of those issues from a
distribution perspective. Bigtop has the luxury of defining an
entire distribution (you always know what versions of Hadoop
and its ecosystem projects you're dealing with). It also
provides helper functionality for a lot of common things (like
finding JAVA_HOME, plugging into the underlying
OS capabilities, etc.).

I guess all I'm saying is that you guys should consider Bigtop
as an integration platform for making Spark easier to use.

Feel free to fork off this thread to dev@bigtop (CCed) if you
think this is an idea worth exploring.

Thanks,
Roman.
