Posted to dev@bigtop.apache.org by Roman Shaposhnik <rv...@apache.org> on 2013/08/06 06:24:19 UTC

Re: How do you run Spark jobs?

On Fri, Aug 2, 2013 at 9:50 PM, Patrick Wendell <pw...@gmail.com> wrote:
> Hey All,
>
> I'm working on SPARK-800 [1]. The goal is to document a best practice or
> recommended way of bundling and running Spark jobs. We have a quickstart
> guide for writing a standalone job, but it doesn't cover how to deal with
> packaging up your dependencies and setting the correct environment
> variables required to submit a full job to a cluster. This can be a
> confusing process for beginners - it would be good to extend the guide to
> cover this.
>
> First though I wanted to sample this list and see how people tend to run
> Spark jobs inside their orgs. Knowing any of the following would be
> helpful:
>
> - Do you create an uber jar with all of your job's (and Spark's) recursive
> dependencies?
> - Do you try to use sbt run or maven exec with some way to pass the correct
> environment variables?
> - Do people use a modified version of spark's own `run` script?
> - Do you have some other way of submitting jobs?
>
> Any notes would be helpful in compiling this!
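
For what it's worth, here is a rough sketch of the uber-jar approach from
the first bullet above, using sbt-assembly. Treat the plugin version,
artifact coordinates, and paths as placeholders rather than a
recommendation:

    // project/plugins.sbt -- pulls in the sbt-assembly plugin (version is illustrative)
    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.9.1")

    // build.sbt -- rolls the job and its transitive dependencies into one jar;
    // the spark-core coordinates/version stand in for whatever release you run
    import AssemblyKeys._

    assemblySettings

    name := "my-spark-job"

    scalaVersion := "2.9.3"

    // mark this "provided" instead if the cluster already ships Spark
    libraryDependencies += "org.spark-project" %% "spark-core" % "0.7.3"

Running `sbt assembly` then leaves a single jar under target/, which the
job can hand to SparkContext so the bundled dependencies get shipped to
the workers (master URL, SPARK_HOME, and jar path below are placeholders):

    // MyJob.scala -- minimal standalone job submitting the assembly jar
    import spark.SparkContext

    object MyJob {
      def main(args: Array[String]) {
        val sc = new SparkContext("spark://master:7077", "MyJob",
          System.getenv("SPARK_HOME"),
          Seq("target/my-spark-job-assembly-0.1.jar"))
        println(sc.parallelize(1 to 100).reduce(_ + _))
      }
    }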

Now that Spark has been integrated into Bigtop:
    https://issues.apache.org/jira/browse/BIGTOP-715
it may make sense to tackle some of those issues from a
distribution perspective. Bigtop has the luxury of defining an
entire distribution (you always know which versions of Hadoop
and its ecosystem projects you're dealing with). It also
provides helper functionality for a lot of common things (like
finding JAVA_HOME, plugging into the underlying
OS capabilities, etc.).

I guess all I'm saying is that you guys should consider Bigtop
as an integration platform for making Spark easier to use.

Feel free to fork off this thread to dev@bigtop (CCed) if you
think this is an idea worth exploring.

Thanks,
Roman.