Posted to dev@beam.apache.org by Amit Sela <am...@gmail.com> on 2016/07/07 17:49:07 UTC

[DISCUSS] Spark runner packaging

Hi everyone,

Lately I've encountered a number of issues stemming from the fact that the
Spark runner does not package Spark along with it, forcing people to do
this on their own.
In addition, this seems to get in the way of having the beam-examples
executed against the Spark runner, again because they would have to add
Spark dependencies.
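
For illustration, "doing this on their own" means adding something like the
following to the user's pom.xml (the coordinates and versions below are only
an example and would have to match whatever Spark the runner is built
against):

    <!-- Spark dependencies users currently have to declare by hand
         (coordinates and versions are illustrative) -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.6.2</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_2.10</artifactId>
      <version>1.6.2</version>
    </dependency>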

When running on a cluster (which I guess was the original goal here), it is
recommended to have Spark provided by the cluster - this makes sense for
Spark clusters and even more so for Spark + YARN clusters, where you might
have your Spark built against a specific Hadoop version or be using a
vendor distribution.

In order to make the runner more accessible to new adopters, I suggest we
consider releasing a "spark-included" artifact as well.

Thoughts?

Thanks,
Amit

Re: [DISCUSS] Spark runner packaging

Posted by Lukasz Cwik <lc...@google.com.INVALID>.
What I meant by saying that this could be part of Apache Beam is that the
build scripts that generate the binary artifact could be part of Apache
Beam, not the binary artifact itself.
So the question I was asking is whether the build scripts that generate the
artifact should be part of Apache Beam or kept separate, and how.

On Thu, Jul 7, 2016 at 2:59 PM, Robert Bradshaw <robertwb@google.com.invalid
> wrote:

> I don't think the proposal is to put this into the source release, rather
> to have a separate binary artifact that's Beam+Spark.
>
> On Thu, Jul 7, 2016 at 11:54 AM, Vlad Rozov <v....@datatorrent.com>
> wrote:
>
> > I am not sure if I read the proposal correctly, but note that it will be
> > against Apache policy to include compiled binaries into the source
> release.
> > On the other side, each runner may include necessary run-time binaries as
> > test only dependencies into the runner's maven pom.xml
> >
> >
> > On 7/7/16 11:01, Lukasz Cwik wrote:
> >
> >> That makes a lot of sense. I can see other runners following suit where
> >> there is a packaged up version for different scenarios / backend cluster
> >> runtimes.
> >>
> >> Should this be part of Apache Beam as a separate maven module or another
> >> sub-module inside of Apache Beam, or something else?
> >>
> >> On Thu, Jul 7, 2016 at 1:49 PM, Amit Sela <am...@gmail.com> wrote:
> >>
> >> Hi everyone,
> >>>
> >>> Lately I've encountered a number of issues concerning the fact that the
> >>> Spark runner does not package Spark along with it and forcing people to
> >>> do
> >>> this on their own.
> >>> In addition, this seems to get in the way of having beam-examples
> >>> executed
> >>> against the Spark runner, again because it would have to add Spark
> >>> dependencies.
> >>>
> >>> When running on a cluster (which I guess was the original goal here),
> it
> >>> is
> >>> recommended to have Spark provided by the cluster - this makes sense
> for
> >>> Spark clusters and more so for Spark + YARN clusters where you might
> have
> >>> your Spark built against a specific Hadoop version or using a vendor
> >>> distribution.
> >>>
> >>> In order to make the runner more accessible to new adopters, I suggest
> to
> >>> consider releasing a "spark-included" artifact as well.
> >>>
> >>> Thoughts ?
> >>>
> >>> Thanks,
> >>> Amit
> >>>
> >>>
> >
>

Re: [DISCUSS] Spark runner packaging

Posted by Robert Bradshaw <ro...@google.com.INVALID>.
I don't think the proposal is to put this into the source release, rather
to have a separate binary artifact that's Beam+Spark.

On Thu, Jul 7, 2016 at 11:54 AM, Vlad Rozov <v....@datatorrent.com> wrote:

> I am not sure if I read the proposal correctly, but note that it will be
> against Apache policy to include compiled binaries into the source release.
> On the other side, each runner may include necessary run-time binaries as
> test only dependencies into the runner's maven pom.xml
>
>
> On 7/7/16 11:01, Lukasz Cwik wrote:
>
>> That makes a lot of sense. I can see other runners following suit where
>> there is a packaged up version for different scenarios / backend cluster
>> runtimes.
>>
>> Should this be part of Apache Beam as a separate maven module or another
>> sub-module inside of Apache Beam, or something else?
>>
>> On Thu, Jul 7, 2016 at 1:49 PM, Amit Sela <am...@gmail.com> wrote:
>>
>> Hi everyone,
>>>
>>> Lately I've encountered a number of issues concerning the fact that the
>>> Spark runner does not package Spark along with it and forcing people to
>>> do
>>> this on their own.
>>> In addition, this seems to get in the way of having beam-examples
>>> executed
>>> against the Spark runner, again because it would have to add Spark
>>> dependencies.
>>>
>>> When running on a cluster (which I guess was the original goal here), it
>>> is
>>> recommended to have Spark provided by the cluster - this makes sense for
>>> Spark clusters and more so for Spark + YARN clusters where you might have
>>> your Spark built against a specific Hadoop version or using a vendor
>>> distribution.
>>>
>>> In order to make the runner more accessible to new adopters, I suggest to
>>> consider releasing a "spark-included" artifact as well.
>>>
>>> Thoughts ?
>>>
>>> Thanks,
>>> Amit
>>>
>>>
>

Re: [DISCUSS] Spark runner packaging

Posted by Vlad Rozov <v....@datatorrent.com>.
I am not sure if I read the proposal correctly, but note that it would be 
against Apache policy to include compiled binaries in the source release. 
On the other hand, each runner may include the necessary run-time binaries 
as test-only dependencies in the runner's Maven pom.xml.
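
For example, a runner could pull Spark in only for its own tests with
something like this (the coordinates and version are just an illustration):

    <!-- Spark on the test classpath only; users of the runner
         do not inherit it (illustrative coordinates) -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>1.6.2</version>
      <scope>test</scope>
    </dependency>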

On 7/7/16 11:01, Lukasz Cwik wrote:
> That makes a lot of sense. I can see other runners following suit where
> there is a packaged up version for different scenarios / backend cluster
> runtimes.
>
> Should this be part of Apache Beam as a separate maven module or another
> sub-module inside of Apache Beam, or something else?
>
> On Thu, Jul 7, 2016 at 1:49 PM, Amit Sela <am...@gmail.com> wrote:
>
>> Hi everyone,
>>
>> Lately I've encountered a number of issues concerning the fact that the
>> Spark runner does not package Spark along with it and forcing people to do
>> this on their own.
>> In addition, this seems to get in the way of having beam-examples executed
>> against the Spark runner, again because it would have to add Spark
>> dependencies.
>>
>> When running on a cluster (which I guess was the original goal here), it is
>> recommended to have Spark provided by the cluster - this makes sense for
>> Spark clusters and more so for Spark + YARN clusters where you might have
>> your Spark built against a specific Hadoop version or using a vendor
>> distribution.
>>
>> In order to make the runner more accessible to new adopters, I suggest to
>> consider releasing a "spark-included" artifact as well.
>>
>> Thoughts ?
>>
>> Thanks,
>> Amit
>>


Re: [DISCUSS] Spark runner packaging

Posted by Lukasz Cwik <lc...@google.com.INVALID>.
That makes a lot of sense. I can see other runners following suit, with a
packaged-up version for different scenarios / backend cluster runtimes.

Should this be part of Apache Beam as a separate Maven module or another
sub-module inside Apache Beam, or something else?

On Thu, Jul 7, 2016 at 1:49 PM, Amit Sela <am...@gmail.com> wrote:

> Hi everyone,
>
> Lately I've encountered a number of issues concerning the fact that the
> Spark runner does not package Spark along with it and forcing people to do
> this on their own.
> In addition, this seems to get in the way of having beam-examples executed
> against the Spark runner, again because it would have to add Spark
> dependencies.
>
> When running on a cluster (which I guess was the original goal here), it is
> recommended to have Spark provided by the cluster - this makes sense for
> Spark clusters and more so for Spark + YARN clusters where you might have
> your Spark built against a specific Hadoop version or using a vendor
> distribution.
>
> In order to make the runner more accessible to new adopters, I suggest to
> consider releasing a "spark-included" artifact as well.
>
> Thoughts ?
>
> Thanks,
> Amit
>

Re: [DISCUSS] Spark runner packaging

Posted by Amit Sela <am...@gmail.com>.
I like the profile idea, Dan, mostly because, while I believe we should do
our best to make adoption easier, we should still default to the actual use
case, where such pipelines will run on clusters.

On Fri, Jul 8, 2016 at 1:53 AM Dan Halperin <dh...@google.com.invalid>
wrote:

> Thanks Amit, that does clear things up!
>
> On Thu, Jul 7, 2016 at 3:30 PM, Amit Sela <am...@gmail.com> wrote:
>
> > I don't think that the Spark runner is special, it's just the way it was
> > until now and that's why I brought up the subject here.
> >
> > The main issue is that currently, if a user wants to write a beam app
> using
> > the Spark runner, he'll have to provide the Spark dependencies, or he'll
> > get a ClassNotFoundException (which is exactly the case for
> beam-examples).
> > This of course happens because the Spark runner has provided dependency
> on
> > Spark (not transitive).
> >
>
> Having provided dependencies and making the user include them in their pom
> is
> pretty normal, I think. We already require users to provide a slf4j logger
> and
> Hamcrest+Junit (if they use PAssert).
>     (We including all these in the examples pom.xml
> <
> https://github.com/apache/incubator-beam/blob/master/examples/java/pom.xml#L286
> >
> .)
>
> I don't see any problem for a user who wants to use the Spark runner to add
> these
> provided deps to their pom (aka, putting them as runtime deps in examples
> pom.xml).
>
>
> > The Flink runner avoids this issue by having a compile dependency on
> flink,
> > thus being transitive.
> >
> > By having the cluster provide them I mean that the Spark installation is
> > aware of the binaries pre-deployed on the cluster and adds them to the
> > classpath of the app submitted for execution on the cluster - this is
> > common (AFAIK) for Spark and Spark on YARN, and vendors provide similar
> > binaries, for example: spark-1.6_hadoop-2.4.0_hdp.xxx.jar (Hortonworks).
> >
>
> Makes sense. So a user submitting to a cluster would submit a jar and
> command-line
> options, and the cluster itself would add the provided deps.
>
>
> >  Putting aside our (Beam) issues, the current artifact
> "beam-runners-spark"
> > is more suitable to run on clusters with pre-deployed binaries rather
> than
> > a
> > quick standalone execution with a single dependency that takes care of
> > everything (Spark related),
>
>
> great!
>
>
> > but is more cumbersome for users trying to get
> > going for the first time, which is not good!.
> >
>
> We should decide which experience we're trying to optimize for (I'd lean
> cluster), but
> I think that we should update examples pom.xml with the support.
>
> * For cluster mode default, we would add a profile for 'local' mode
>   (-PsparkIncluded or something) that overrides the provided deps to be
> runtime
>   deps instead.
>
> * We can include switching the profile for local mode in the "getting
> started" instructions.
>
> Dan
>
> I guess Flink uses a compile dependency for the same reason Spark uses
> > provided - because it fits them - what about other runners ?
> >
> > Hope this clarifies some of the questions here.
> >
> > Thanks,
> > Amit
> >
> > On Fri, Jul 8, 2016 at 12:52 AM Dan Halperin <dhalperi@google.com.invalid
> >
> > wrote:
> >
> > > hey folks,
> > >
> > > In general, we should optimize for running on clusters rather than
> > running
> > > locally. Examples is a runner-independent module, with non-compile-time
> > > deps on runners. Most runners are currently listed as being runtime
> deps
> > --
> > > it sounds like that works, for most cases, but might not be the best
> fit
> > > for Spark.
> > >
> > > Q: What does dependencies being provided by the cluster mean? I'm a
> > little
> > > naive here, but how would a user submit a pipeline to a Spark cluster
> > > without actually depending on Spark in mvn? Is it not by running the
> main
> > > method in an example like in all other runners?
> > >
> > > I'd like to understand the above better, but suppose that to optimize
> for
> > > Spark-on-a-cluster, we should default to provided deps in the examples.
> > > That would be fine -- but couldn't we just make a profile for local
> Spark
> > > that overrides the deps from provided to runtime?
> > >
> > > To summarize, I think we do not need new artifacts, but we could use a
> > > profile for local testing if absolutely necessary.
> > >
> > > Thanks,
> > > Dan
> > >
> > > On Thu, Jul 7, 2016 at 2:27 PM, Ismaël Mejía <ie...@gmail.com>
> wrote:
> > >
> > > > Good discussion subject Amit,
> > > >
> > > > I let the whole beam distribution subjects continue in BEAM-320,
> > however
> > > > there
> > > > is a not yet discussed aspect of the spark runner, the maven
> behavior:
> > > >
> > > > When you import the beam spark runner as a dependency you are obliged
> > to
> > > > provide
> > > > your spark dependencies by hand too, in the other runners once you
> > import
> > > > the
> > > > runner everything just works e.g.  google-cloud-dataflow-runner and
> > > > flink-runner.  I understand the arguments for the current setup (the
> > ones
> > > > you
> > > > mention), but I think it is more user friendly to be consistent with
> > the
> > > > other
> > > > runners and have something that just works as the default (and solve
> > the
> > > > examples issue as a consequence).  Anyway I think in the spark case
> we
> > > need
> > > > both, an 'spark-included' flavor and the current one that it is
> really
> > > > useful to
> > > > include the runner as a spark library dependency (like Jesse did in
> his
> > > > video) or
> > > > as a spark-package.
> > > >
> > > > Actually both the all-included and the runner only make sense for
> flink
> > > too
> > > > but this is a different discussion ;)
> > > >
> > > > What do you think about this ? What do the others think ?
> > > >
> > > > Ismaël
> > > >
> > > >
> > > > On Thu, Jul 7, 2016 at 10:19 PM, Jean-Baptiste Onofré <
> jb@nanthrax.net
> > >
> > > > wrote:
> > > >
> > > > > No problem and good idea to discuss in the Jira.
> > > > >
> > > > > Actually, I started to experiment a bit beam distributions on a
> > branch
> > > > > (that I can share with people interested).
> > > > >
> > > > > Regards
> > > > > JB
> > > > >
> > > > >
> > > > > On 07/07/2016 10:12 PM, Amit Sela wrote:
> > > > >
> > > > >> Thanks JB, I've missed that one.
> > > > >>
> > > > >> I suggest we continue this in the ticket comments.
> > > > >>
> > > > >> Thanks,
> > > > >> Amit
> > > > >>
> > > > >> On Thu, Jul 7, 2016 at 11:05 PM Jean-Baptiste Onofré <
> > jb@nanthrax.net
> > > >
> > > > >> wrote:
> > > > >>
> > > > >> Hi Amit,
> > > > >>>
> > > > >>> I think your proposal is related to:
> > > > >>>
> > > > >>> https://issues.apache.org/jira/browse/BEAM-320
> > > > >>>
> > > > >>> As described in the Jira, I'm planning to provide (in dedicated
> > Maven
> > > > >>> modules) is a Beam distribution including:
> > > > >>> - an uber jar to wrap the dependencies
> > > > >>> - the underlying runtime backends
> > > > >>> - etc
> > > > >>>
> > > > >>> Regards
> > > > >>> JB
> > > > >>>
> > > > >>> On 07/07/2016 07:49 PM, Amit Sela wrote:
> > > > >>>
> > > > >>>> Hi everyone,
> > > > >>>>
> > > > >>>> Lately I've encountered a number of issues concerning the fact
> > that
> > > > the
> > > > >>>> Spark runner does not package Spark along with it and forcing
> > people
> > > > to
> > > > >>>>
> > > > >>> do
> > > > >>>
> > > > >>>> this on their own.
> > > > >>>> In addition, this seems to get in the way of having
> beam-examples
> > > > >>>>
> > > > >>> executed
> > > > >>>
> > > > >>>> against the Spark runner, again because it would have to add
> Spark
> > > > >>>> dependencies.
> > > > >>>>
> > > > >>>> When running on a cluster (which I guess was the original goal
> > > here),
> > > > it
> > > > >>>>
> > > > >>> is
> > > > >>>
> > > > >>>> recommended to have Spark provided by the cluster - this makes
> > sense
> > > > for
> > > > >>>> Spark clusters and more so for Spark + YARN clusters where you
> > might
> > > > >>>> have
> > > > >>>> your Spark built against a specific Hadoop version or using a
> > vendor
> > > > >>>> distribution.
> > > > >>>>
> > > > >>>> In order to make the runner more accessible to new adopters, I
> > > suggest
> > > > >>>> to
> > > > >>>> consider releasing a "spark-included" artifact as well.
> > > > >>>>
> > > > >>>> Thoughts ?
> > > > >>>>
> > > > >>>> Thanks,
> > > > >>>> Amit
> > > > >>>>
> > > > >>>>
> > > > >>> --
> > > > >>> Jean-Baptiste Onofré
> > > > >>> jbonofre@apache.org
> > > > >>> http://blog.nanthrax.net
> > > > >>> Talend - http://www.talend.com
> > > > >>>
> > > > >>>
> > > > >>
> > > > > --
> > > > > Jean-Baptiste Onofré
> > > > > jbonofre@apache.org
> > > > > http://blog.nanthrax.net
> > > > > Talend - http://www.talend.com
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] Spark runner packaging

Posted by Dan Halperin <dh...@google.com.INVALID>.
Thanks Amit, that does clear things up!

On Thu, Jul 7, 2016 at 3:30 PM, Amit Sela <am...@gmail.com> wrote:

> I don't think that the Spark runner is special, it's just the way it was
> until now and that's why I brought up the subject here.
>
> The main issue is that currently, if a user wants to write a beam app using
> the Spark runner, he'll have to provide the Spark dependencies, or he'll
> get a ClassNotFoundException (which is exactly the case for beam-examples).
> This of course happens because the Spark runner has provided dependency on
> Spark (not transitive).
>

Having provided dependencies and making the user include them in their pom
is pretty normal, I think. We already require users to provide an slf4j
logger and Hamcrest+JUnit (if they use PAssert).
    (We include all of these in the examples pom.xml:
<https://github.com/apache/incubator-beam/blob/master/examples/java/pom.xml#L286>.)

I don't see any problem for a user who wants to use the Spark runner to add
these provided deps to their pom (i.e., putting them as runtime deps in the
examples pom.xml).


> The Flink runner avoids this issue by having a compile dependency on flink,
> thus being transitive.
>
> By having the cluster provide them I mean that the Spark installation is
> aware of the binaries pre-deployed on the cluster and adds them to the
> classpath of the app submitted for execution on the cluster - this is
> common (AFAIK) for Spark and Spark on YARN, and vendors provide similar
> binaries, for example: spark-1.6_hadoop-2.4.0_hdp.xxx.jar (Hortonworks).
>

Makes sense. So a user submitting to a cluster would submit a jar and
command-line options, and the cluster itself would add the provided deps.


>  Putting aside our (Beam) issues, the current artifact "beam-runners-spark"
> is more suitable to run on clusters with pre-deployed binaries rather than
> a
> quick standalone execution with a single dependency that takes care of
> everything (Spark related),


great!


> but is more cumbersome for users trying to get
> going for the first time, which is not good!.
>

We should decide which experience we're trying to optimize for (I'd lean
toward cluster), but I think we should update the examples pom.xml to
support both.

* For the cluster-mode default, we would add a profile for 'local' mode
  (-PsparkIncluded or something) that overrides the provided deps to be
  runtime deps instead (rough sketch after this list).

* We can include switching the profile for local mode in the "getting
started" instructions.

Dan

I guess Flink uses a compile dependency for the same reason Spark uses
> provided - because it fits them - what about other runners ?
>
> Hope this clarifies some of the questions here.
>
> Thanks,
> Amit
>
> On Fri, Jul 8, 2016 at 12:52 AM Dan Halperin <dh...@google.com.invalid>
> wrote:
>
> > hey folks,
> >
> > In general, we should optimize for running on clusters rather than
> running
> > locally. Examples is a runner-independent module, with non-compile-time
> > deps on runners. Most runners are currently listed as being runtime deps
> --
> > it sounds like that works, for most cases, but might not be the best fit
> > for Spark.
> >
> > Q: What does dependencies being provided by the cluster mean? I'm a
> little
> > naive here, but how would a user submit a pipeline to a Spark cluster
> > without actually depending on Spark in mvn? Is it not by running the main
> > method in an example like in all other runners?
> >
> > I'd like to understand the above better, but suppose that to optimize for
> > Spark-on-a-cluster, we should default to provided deps in the examples.
> > That would be fine -- but couldn't we just make a profile for local Spark
> > that overrides the deps from provided to runtime?
> >
> > To summarize, I think we do not need new artifacts, but we could use a
> > profile for local testing if absolutely necessary.
> >
> > Thanks,
> > Dan
> >
> > On Thu, Jul 7, 2016 at 2:27 PM, Ismaël Mejía <ie...@gmail.com> wrote:
> >
> > > Good discussion subject Amit,
> > >
> > > I let the whole beam distribution subjects continue in BEAM-320,
> however
> > > there
> > > is a not yet discussed aspect of the spark runner, the maven behavior:
> > >
> > > When you import the beam spark runner as a dependency you are obliged
> to
> > > provide
> > > your spark dependencies by hand too, in the other runners once you
> import
> > > the
> > > runner everything just works e.g.  google-cloud-dataflow-runner and
> > > flink-runner.  I understand the arguments for the current setup (the
> ones
> > > you
> > > mention), but I think it is more user friendly to be consistent with
> the
> > > other
> > > runners and have something that just works as the default (and solve
> the
> > > examples issue as a consequence).  Anyway I think in the spark case we
> > need
> > > both, an 'spark-included' flavor and the current one that it is really
> > > useful to
> > > include the runner as a spark library dependency (like Jesse did in his
> > > video) or
> > > as a spark-package.
> > >
> > > Actually both the all-included and the runner only make sense for flink
> > too
> > > but this is a different discussion ;)
> > >
> > > What do you think about this ? What do the others think ?
> > >
> > > Ismaël
> > >
> > >
> > > On Thu, Jul 7, 2016 at 10:19 PM, Jean-Baptiste Onofré <jb@nanthrax.net
> >
> > > wrote:
> > >
> > > > No problem and good idea to discuss in the Jira.
> > > >
> > > > Actually, I started to experiment a bit beam distributions on a
> branch
> > > > (that I can share with people interested).
> > > >
> > > > Regards
> > > > JB
> > > >
> > > >
> > > > On 07/07/2016 10:12 PM, Amit Sela wrote:
> > > >
> > > >> Thanks JB, I've missed that one.
> > > >>
> > > >> I suggest we continue this in the ticket comments.
> > > >>
> > > >> Thanks,
> > > >> Amit
> > > >>
> > > >> On Thu, Jul 7, 2016 at 11:05 PM Jean-Baptiste Onofré <
> jb@nanthrax.net
> > >
> > > >> wrote:
> > > >>
> > > >> Hi Amit,
> > > >>>
> > > >>> I think your proposal is related to:
> > > >>>
> > > >>> https://issues.apache.org/jira/browse/BEAM-320
> > > >>>
> > > >>> As described in the Jira, I'm planning to provide (in dedicated
> Maven
> > > >>> modules) is a Beam distribution including:
> > > >>> - an uber jar to wrap the dependencies
> > > >>> - the underlying runtime backends
> > > >>> - etc
> > > >>>
> > > >>> Regards
> > > >>> JB
> > > >>>
> > > >>> On 07/07/2016 07:49 PM, Amit Sela wrote:
> > > >>>
> > > >>>> Hi everyone,
> > > >>>>
> > > >>>> Lately I've encountered a number of issues concerning the fact
> that
> > > the
> > > >>>> Spark runner does not package Spark along with it and forcing
> people
> > > to
> > > >>>>
> > > >>> do
> > > >>>
> > > >>>> this on their own.
> > > >>>> In addition, this seems to get in the way of having beam-examples
> > > >>>>
> > > >>> executed
> > > >>>
> > > >>>> against the Spark runner, again because it would have to add Spark
> > > >>>> dependencies.
> > > >>>>
> > > >>>> When running on a cluster (which I guess was the original goal
> > here),
> > > it
> > > >>>>
> > > >>> is
> > > >>>
> > > >>>> recommended to have Spark provided by the cluster - this makes
> sense
> > > for
> > > >>>> Spark clusters and more so for Spark + YARN clusters where you
> might
> > > >>>> have
> > > >>>> your Spark built against a specific Hadoop version or using a
> vendor
> > > >>>> distribution.
> > > >>>>
> > > >>>> In order to make the runner more accessible to new adopters, I
> > suggest
> > > >>>> to
> > > >>>> consider releasing a "spark-included" artifact as well.
> > > >>>>
> > > >>>> Thoughts ?
> > > >>>>
> > > >>>> Thanks,
> > > >>>> Amit
> > > >>>>
> > > >>>>
> > > >>> --
> > > >>> Jean-Baptiste Onofré
> > > >>> jbonofre@apache.org
> > > >>> http://blog.nanthrax.net
> > > >>> Talend - http://www.talend.com
> > > >>>
> > > >>>
> > > >>
> > > > --
> > > > Jean-Baptiste Onofré
> > > > jbonofre@apache.org
> > > > http://blog.nanthrax.net
> > > > Talend - http://www.talend.com
> > > >
> > >
> >
>

Re: [DISCUSS] Spark runner packaging

Posted by Amit Sela <am...@gmail.com>.
I don't think that the Spark runner is special; it's just the way it has
been until now, and that's why I brought up the subject here.

The main issue is that currently, if a user wants to write a Beam app using
the Spark runner, they'll have to provide the Spark dependencies themselves
or they'll get a ClassNotFoundException (which is exactly the case for
beam-examples).
This, of course, happens because the Spark runner has a provided dependency
on Spark, which is not passed on transitively.
The Flink runner avoids this issue by having a compile dependency on Flink,
which is transitive.
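
To make the difference concrete, this is roughly what the two runner poms
look like (the artifact names and version properties below are only
illustrative):

    <!-- Spark runner: provided scope, so Spark is NOT passed on to users
         (illustrative coordinates) -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.10</artifactId>
      <version>${spark.version}</version>
      <scope>provided</scope>
    </dependency>

    <!-- Flink runner: default (compile) scope, so Flink IS passed on
         (illustrative coordinates) -->
    <dependency>
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-clients_2.10</artifactId>
      <version>${flink.version}</version>
    </dependency>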

By having the cluster provide them I mean that the Spark installation is
aware of the binaries pre-deployed on the cluster and adds them to the
classpath of the app submitted for execution on the cluster - this is
common (AFAIK) for Spark and Spark on YARN, and vendors provide similar
binaries, for example: spark-1.6_hadoop-2.4.0_hdp.xxx.jar (Hortonworks).

Putting aside our (Beam) issues, the current artifact "beam-runners-spark"
is better suited to running on clusters with pre-deployed binaries than to a
quick standalone execution with a single dependency that takes care of
everything Spark-related, but it is more cumbersome for users trying to get
going for the first time, which is not good.

I guess Flink uses a compile dependency for the same reason Spark uses
provided (because it fits them); what about other runners?

Hope this clarifies some of the questions here.

Thanks,
Amit

On Fri, Jul 8, 2016 at 12:52 AM Dan Halperin <dh...@google.com.invalid>
wrote:

> hey folks,
>
> In general, we should optimize for running on clusters rather than running
> locally. Examples is a runner-independent module, with non-compile-time
> deps on runners. Most runners are currently listed as being runtime deps --
> it sounds like that works, for most cases, but might not be the best fit
> for Spark.
>
> Q: What does dependencies being provided by the cluster mean? I'm a little
> naive here, but how would a user submit a pipeline to a Spark cluster
> without actually depending on Spark in mvn? Is it not by running the main
> method in an example like in all other runners?
>
> I'd like to understand the above better, but suppose that to optimize for
> Spark-on-a-cluster, we should default to provided deps in the examples.
> That would be fine -- but couldn't we just make a profile for local Spark
> that overrides the deps from provided to runtime?
>
> To summarize, I think we do not need new artifacts, but we could use a
> profile for local testing if absolutely necessary.
>
> Thanks,
> Dan
>
> On Thu, Jul 7, 2016 at 2:27 PM, Ismaël Mejía <ie...@gmail.com> wrote:
>
> > Good discussion subject Amit,
> >
> > I let the whole beam distribution subjects continue in BEAM-320, however
> > there
> > is a not yet discussed aspect of the spark runner, the maven behavior:
> >
> > When you import the beam spark runner as a dependency you are obliged to
> > provide
> > your spark dependencies by hand too, in the other runners once you import
> > the
> > runner everything just works e.g.  google-cloud-dataflow-runner and
> > flink-runner.  I understand the arguments for the current setup (the ones
> > you
> > mention), but I think it is more user friendly to be consistent with the
> > other
> > runners and have something that just works as the default (and solve the
> > examples issue as a consequence).  Anyway I think in the spark case we
> need
> > both, an 'spark-included' flavor and the current one that it is really
> > useful to
> > include the runner as a spark library dependency (like Jesse did in his
> > video) or
> > as a spark-package.
> >
> > Actually both the all-included and the runner only make sense for flink
> too
> > but this is a different discussion ;)
> >
> > What do you think about this ? What do the others think ?
> >
> > Ismaël
> >
> >
> > On Thu, Jul 7, 2016 at 10:19 PM, Jean-Baptiste Onofré <jb...@nanthrax.net>
> > wrote:
> >
> > > No problem and good idea to discuss in the Jira.
> > >
> > > Actually, I started to experiment a bit beam distributions on a branch
> > > (that I can share with people interested).
> > >
> > > Regards
> > > JB
> > >
> > >
> > > On 07/07/2016 10:12 PM, Amit Sela wrote:
> > >
> > >> Thanks JB, I've missed that one.
> > >>
> > >> I suggest we continue this in the ticket comments.
> > >>
> > >> Thanks,
> > >> Amit
> > >>
> > >> On Thu, Jul 7, 2016 at 11:05 PM Jean-Baptiste Onofré <jb@nanthrax.net
> >
> > >> wrote:
> > >>
> > >> Hi Amit,
> > >>>
> > >>> I think your proposal is related to:
> > >>>
> > >>> https://issues.apache.org/jira/browse/BEAM-320
> > >>>
> > >>> As described in the Jira, I'm planning to provide (in dedicated Maven
> > >>> modules) is a Beam distribution including:
> > >>> - an uber jar to wrap the dependencies
> > >>> - the underlying runtime backends
> > >>> - etc
> > >>>
> > >>> Regards
> > >>> JB
> > >>>
> > >>> On 07/07/2016 07:49 PM, Amit Sela wrote:
> > >>>
> > >>>> Hi everyone,
> > >>>>
> > >>>> Lately I've encountered a number of issues concerning the fact that
> > the
> > >>>> Spark runner does not package Spark along with it and forcing people
> > to
> > >>>>
> > >>> do
> > >>>
> > >>>> this on their own.
> > >>>> In addition, this seems to get in the way of having beam-examples
> > >>>>
> > >>> executed
> > >>>
> > >>>> against the Spark runner, again because it would have to add Spark
> > >>>> dependencies.
> > >>>>
> > >>>> When running on a cluster (which I guess was the original goal
> here),
> > it
> > >>>>
> > >>> is
> > >>>
> > >>>> recommended to have Spark provided by the cluster - this makes sense
> > for
> > >>>> Spark clusters and more so for Spark + YARN clusters where you might
> > >>>> have
> > >>>> your Spark built against a specific Hadoop version or using a vendor
> > >>>> distribution.
> > >>>>
> > >>>> In order to make the runner more accessible to new adopters, I
> suggest
> > >>>> to
> > >>>> consider releasing a "spark-included" artifact as well.
> > >>>>
> > >>>> Thoughts ?
> > >>>>
> > >>>> Thanks,
> > >>>> Amit
> > >>>>
> > >>>>
> > >>> --
> > >>> Jean-Baptiste Onofré
> > >>> jbonofre@apache.org
> > >>> http://blog.nanthrax.net
> > >>> Talend - http://www.talend.com
> > >>>
> > >>>
> > >>
> > > --
> > > Jean-Baptiste Onofré
> > > jbonofre@apache.org
> > > http://blog.nanthrax.net
> > > Talend - http://www.talend.com
> > >
> >
>

Re: [DISCUSS] Spark runner packaging

Posted by Dan Halperin <dh...@google.com.INVALID>.
hey folks,

In general, we should optimize for running on clusters rather than running
locally. Examples is a runner-independent module, with non-compile-time
deps on runners. Most runners are currently listed as being runtime deps --
it sounds like that works, for most cases, but might not be the best fit
for Spark.
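
For example, here is roughly what a runner looks like as a runtime-only
dependency in the examples pom.xml (the artifactId below is just
illustrative):

    <!-- runner is needed to run the examples, not to compile them
         (illustrative artifactId) -->
    <dependency>
      <groupId>org.apache.beam</groupId>
      <artifactId>beam-runners-flink_2.10</artifactId>
      <version>${project.version}</version>
      <scope>runtime</scope>
    </dependency>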

Q: What does "dependencies being provided by the cluster" mean? I'm a little
naive here, but how would a user submit a pipeline to a Spark cluster
without actually depending on Spark in mvn? Is it not by running the main
method in an example like in all other runners?

I'd like to understand the above better, but suppose that to optimize for
Spark-on-a-cluster, we should default to provided deps in the examples.
That would be fine -- but couldn't we just make a profile for local Spark
that overrides the deps from provided to runtime?

To summarize, I think we do not need new artifacts, but we could use a
profile for local testing if absolutely necessary.

Thanks,
Dan

On Thu, Jul 7, 2016 at 2:27 PM, Ismaël Mejía <ie...@gmail.com> wrote:

> Good discussion subject Amit,
>
> I let the whole beam distribution subjects continue in BEAM-320, however
> there
> is a not yet discussed aspect of the spark runner, the maven behavior:
>
> When you import the beam spark runner as a dependency you are obliged to
> provide
> your spark dependencies by hand too, in the other runners once you import
> the
> runner everything just works e.g.  google-cloud-dataflow-runner and
> flink-runner.  I understand the arguments for the current setup (the ones
> you
> mention), but I think it is more user friendly to be consistent with the
> other
> runners and have something that just works as the default (and solve the
> examples issue as a consequence).  Anyway I think in the spark case we need
> both, an 'spark-included' flavor and the current one that it is really
> useful to
> include the runner as a spark library dependency (like Jesse did in his
> video) or
> as a spark-package.
>
> Actually both the all-included and the runner only make sense for flink too
> but this is a different discussion ;)
>
> What do you think about this ? What do the others think ?
>
> Ismaël
>
>
> On Thu, Jul 7, 2016 at 10:19 PM, Jean-Baptiste Onofré <jb...@nanthrax.net>
> wrote:
>
> > No problem and good idea to discuss in the Jira.
> >
> > Actually, I started to experiment a bit beam distributions on a branch
> > (that I can share with people interested).
> >
> > Regards
> > JB
> >
> >
> > On 07/07/2016 10:12 PM, Amit Sela wrote:
> >
> >> Thanks JB, I've missed that one.
> >>
> >> I suggest we continue this in the ticket comments.
> >>
> >> Thanks,
> >> Amit
> >>
> >> On Thu, Jul 7, 2016 at 11:05 PM Jean-Baptiste Onofré <jb...@nanthrax.net>
> >> wrote:
> >>
> >> Hi Amit,
> >>>
> >>> I think your proposal is related to:
> >>>
> >>> https://issues.apache.org/jira/browse/BEAM-320
> >>>
> >>> As described in the Jira, I'm planning to provide (in dedicated Maven
> >>> modules) is a Beam distribution including:
> >>> - an uber jar to wrap the dependencies
> >>> - the underlying runtime backends
> >>> - etc
> >>>
> >>> Regards
> >>> JB
> >>>
> >>> On 07/07/2016 07:49 PM, Amit Sela wrote:
> >>>
> >>>> Hi everyone,
> >>>>
> >>>> Lately I've encountered a number of issues concerning the fact that
> the
> >>>> Spark runner does not package Spark along with it and forcing people
> to
> >>>>
> >>> do
> >>>
> >>>> this on their own.
> >>>> In addition, this seems to get in the way of having beam-examples
> >>>>
> >>> executed
> >>>
> >>>> against the Spark runner, again because it would have to add Spark
> >>>> dependencies.
> >>>>
> >>>> When running on a cluster (which I guess was the original goal here),
> it
> >>>>
> >>> is
> >>>
> >>>> recommended to have Spark provided by the cluster - this makes sense
> for
> >>>> Spark clusters and more so for Spark + YARN clusters where you might
> >>>> have
> >>>> your Spark built against a specific Hadoop version or using a vendor
> >>>> distribution.
> >>>>
> >>>> In order to make the runner more accessible to new adopters, I suggest
> >>>> to
> >>>> consider releasing a "spark-included" artifact as well.
> >>>>
> >>>> Thoughts ?
> >>>>
> >>>> Thanks,
> >>>> Amit
> >>>>
> >>>>
> >>> --
> >>> Jean-Baptiste Onofré
> >>> jbonofre@apache.org
> >>> http://blog.nanthrax.net
> >>> Talend - http://www.talend.com
> >>>
> >>>
> >>
> > --
> > Jean-Baptiste Onofré
> > jbonofre@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>

Re: [DISCUSS] Spark runner packaging

Posted by Ismaël Mejía <ie...@gmail.com>.
Good discussion subject Amit,

I'll let the broader Beam distribution subject continue in BEAM-320;
however, there is a not-yet-discussed aspect of the Spark runner: the Maven
behavior.

When you import the Beam Spark runner as a dependency, you are obliged to
provide your Spark dependencies by hand too; with the other runners, once
you import the runner everything just works, e.g.
google-cloud-dataflow-runner and flink-runner. I understand the arguments
for the current setup (the ones you mention), but I think it is more
user-friendly to be consistent with the other runners and have something
that just works as the default (and solve the examples issue as a
consequence). Anyway, I think in the Spark case we need both: a
'spark-included' flavor and the current one, which is really useful for
including the runner as a Spark library dependency (like Jesse did in his
video) or as a spark-package.
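
For the spark-package style of use, I am thinking of something along these
lines (the class name, coordinates, and version are only placeholders):

    # class, coordinates and version below are placeholders
    spark-submit \
      --class com.example.MyBeamPipeline \
      --packages org.apache.beam:beam-runners-spark:0.1.0-incubating \
      my-pipeline.jar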

Actually, both the all-included and the runner-only flavors make sense for
Flink too, but this is a different discussion ;)

What do you think about this? What do the others think?

Ismaël


On Thu, Jul 7, 2016 at 10:19 PM, Jean-Baptiste Onofré <jb...@nanthrax.net>
wrote:

> No problem and good idea to discuss in the Jira.
>
> Actually, I started to experiment a bit beam distributions on a branch
> (that I can share with people interested).
>
> Regards
> JB
>
>
> On 07/07/2016 10:12 PM, Amit Sela wrote:
>
>> Thanks JB, I've missed that one.
>>
>> I suggest we continue this in the ticket comments.
>>
>> Thanks,
>> Amit
>>
>> On Thu, Jul 7, 2016 at 11:05 PM Jean-Baptiste Onofré <jb...@nanthrax.net>
>> wrote:
>>
>> Hi Amit,
>>>
>>> I think your proposal is related to:
>>>
>>> https://issues.apache.org/jira/browse/BEAM-320
>>>
>>> As described in the Jira, I'm planning to provide (in dedicated Maven
>>> modules) is a Beam distribution including:
>>> - an uber jar to wrap the dependencies
>>> - the underlying runtime backends
>>> - etc
>>>
>>> Regards
>>> JB
>>>
>>> On 07/07/2016 07:49 PM, Amit Sela wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> Lately I've encountered a number of issues concerning the fact that the
>>>> Spark runner does not package Spark along with it and forcing people to
>>>>
>>> do
>>>
>>>> this on their own.
>>>> In addition, this seems to get in the way of having beam-examples
>>>>
>>> executed
>>>
>>>> against the Spark runner, again because it would have to add Spark
>>>> dependencies.
>>>>
>>>> When running on a cluster (which I guess was the original goal here), it
>>>>
>>> is
>>>
>>>> recommended to have Spark provided by the cluster - this makes sense for
>>>> Spark clusters and more so for Spark + YARN clusters where you might
>>>> have
>>>> your Spark built against a specific Hadoop version or using a vendor
>>>> distribution.
>>>>
>>>> In order to make the runner more accessible to new adopters, I suggest
>>>> to
>>>> consider releasing a "spark-included" artifact as well.
>>>>
>>>> Thoughts ?
>>>>
>>>> Thanks,
>>>> Amit
>>>>
>>>>
>>> --
>>> Jean-Baptiste Onofré
>>> jbonofre@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>>
>>>
>>
> --
> Jean-Baptiste Onofré
> jbonofre@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>

Re: [DISCUSS] Spark runner packaging

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
No problem and good idea to discuss in the Jira.

Actually, I started experimenting a bit with Beam distributions on a branch 
(which I can share with anyone interested).

Regards
JB

On 07/07/2016 10:12 PM, Amit Sela wrote:
> Thanks JB, I've missed that one.
>
> I suggest we continue this in the ticket comments.
>
> Thanks,
> Amit
>
> On Thu, Jul 7, 2016 at 11:05 PM Jean-Baptiste Onofré <jb...@nanthrax.net>
> wrote:
>
>> Hi Amit,
>>
>> I think your proposal is related to:
>>
>> https://issues.apache.org/jira/browse/BEAM-320
>>
>> As described in the Jira, I'm planning to provide (in dedicated Maven
>> modules) is a Beam distribution including:
>> - an uber jar to wrap the dependencies
>> - the underlying runtime backends
>> - etc
>>
>> Regards
>> JB
>>
>> On 07/07/2016 07:49 PM, Amit Sela wrote:
>>> Hi everyone,
>>>
>>> Lately I've encountered a number of issues concerning the fact that the
>>> Spark runner does not package Spark along with it and forcing people to
>> do
>>> this on their own.
>>> In addition, this seems to get in the way of having beam-examples
>> executed
>>> against the Spark runner, again because it would have to add Spark
>>> dependencies.
>>>
>>> When running on a cluster (which I guess was the original goal here), it
>> is
>>> recommended to have Spark provided by the cluster - this makes sense for
>>> Spark clusters and more so for Spark + YARN clusters where you might have
>>> your Spark built against a specific Hadoop version or using a vendor
>>> distribution.
>>>
>>> In order to make the runner more accessible to new adopters, I suggest to
>>> consider releasing a "spark-included" artifact as well.
>>>
>>> Thoughts ?
>>>
>>> Thanks,
>>> Amit
>>>
>>
>> --
>> Jean-Baptiste Onofré
>> jbonofre@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

Re: [DISCUSS] Spark runner packaging

Posted by Amit Sela <am...@gmail.com>.
Thanks JB, I've missed that one.

I suggest we continue this in the ticket comments.

Thanks,
Amit

On Thu, Jul 7, 2016 at 11:05 PM Jean-Baptiste Onofré <jb...@nanthrax.net>
wrote:

> Hi Amit,
>
> I think your proposal is related to:
>
> https://issues.apache.org/jira/browse/BEAM-320
>
> As described in the Jira, I'm planning to provide (in dedicated Maven
> modules) is a Beam distribution including:
> - an uber jar to wrap the dependencies
> - the underlying runtime backends
> - etc
>
> Regards
> JB
>
> On 07/07/2016 07:49 PM, Amit Sela wrote:
> > Hi everyone,
> >
> > Lately I've encountered a number of issues concerning the fact that the
> > Spark runner does not package Spark along with it and forcing people to
> do
> > this on their own.
> > In addition, this seems to get in the way of having beam-examples
> executed
> > against the Spark runner, again because it would have to add Spark
> > dependencies.
> >
> > When running on a cluster (which I guess was the original goal here), it
> is
> > recommended to have Spark provided by the cluster - this makes sense for
> > Spark clusters and more so for Spark + YARN clusters where you might have
> > your Spark built against a specific Hadoop version or using a vendor
> > distribution.
> >
> > In order to make the runner more accessible to new adopters, I suggest to
> > consider releasing a "spark-included" artifact as well.
> >
> > Thoughts ?
> >
> > Thanks,
> > Amit
> >
>
> --
> Jean-Baptiste Onofré
> jbonofre@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>

Re: [DISCUSS] Spark runner packaging

Posted by Jean-Baptiste Onofré <jb...@nanthrax.net>.
Hi Amit,

I think your proposal is related to:

https://issues.apache.org/jira/browse/BEAM-320

As described in the Jira, what I'm planning to provide (in dedicated Maven 
modules) is a Beam distribution including:
- an uber jar to wrap the dependencies
- the underlying runtime backends
- etc
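
For the uber jar part, a minimal sketch with the shade plugin (the exact
configuration is only indicative):

    <!-- bundles the runner together with its backend into one jar -->
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
          <configuration>
            <transformers>
              <!-- merges META-INF/services entries from all bundled jars -->
              <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
            </transformers>
          </configuration>
        </execution>
      </executions>
    </plugin>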

Regards
JB

On 07/07/2016 07:49 PM, Amit Sela wrote:
> Hi everyone,
>
> Lately I've encountered a number of issues concerning the fact that the
> Spark runner does not package Spark along with it and forcing people to do
> this on their own.
> In addition, this seems to get in the way of having beam-examples executed
> against the Spark runner, again because it would have to add Spark
> dependencies.
>
> When running on a cluster (which I guess was the original goal here), it is
> recommended to have Spark provided by the cluster - this makes sense for
> Spark clusters and more so for Spark + YARN clusters where you might have
> your Spark built against a specific Hadoop version or using a vendor
> distribution.
>
> In order to make the runner more accessible to new adopters, I suggest to
> consider releasing a "spark-included" artifact as well.
>
> Thoughts ?
>
> Thanks,
> Amit
>

-- 
Jean-Baptiste Onofré
jbonofre@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com